Book Notes: Designing Data-Intensive Applications (Part 1)

Book Notes on Part 1 of Designing Data-Intensive Applications - the topic is Foundations of Data Systems

Part 1

Chapter 1. Reliable, Scalable, and Maintainable Applications

  • Applications are data-intensive if their limiting factor is not compute resources but rather the amount of data, complexity of data or the speed at which it is changing
  • Data systems can take the form of databases, caches, queues, search indexes, etc.

Reliability

  • System should continue to work correctly even in the face of adversity
    • Can tolerate user mistakes
    • Performance is good enough under expected load and data volume
    • System prevents unauthorized use and abuse
  • A fault is one component deviating from its spec; a failure is when the system as a whole stops providing the required service
  • Many critical bugs are due to poor error handling
  • Faults include
    • hardware (disk failure, power outage)
    • software (bugs, incorrect assumptions)
    • human (configuration error, incorrect commands)
  • Aim to minimize opportunities for error and to limit the damage when faults occur
    • elegant API
    • smart defaults
    • rolling deploys
    • multiple hardware instances

Scalability

  • A system’s ability to cope with increased load
  • Load is described with load parameters that depend on the system - e.g. requests per second to a web server, hit rate on a cache, number of simultaneously active users
  • Performance also depends on the system type - examples are throughput, response time
    • If you increase a load parameter and keep system resources unchanged, how is performance affected?
    • If you increase a load parameter, how much do you need to increase resources to keep performance unchanged?
  • Often performance is not a single number but a distribution; response times are best reported as percentiles, e.g. the 95th percentile is the response time that 95% of requests come in under (a small sketch follows this list)
  • Coping with load
    • Scaling up - moving to a more powerful machine
    • Scaling out - adding instances and distributing load
  • Distributed systems can add more complexity but may be more flexible
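
A minimal sketch of the percentile idea above, using the nearest-rank method on made-up response times (all numbers are illustrative):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that at
    least p% of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank
    return ordered[max(rank, 1) - 1]

# Made-up response times in milliseconds for ten requests.
response_times_ms = [12, 14, 15, 15, 17, 20, 22, 25, 40, 320]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
# The median (p50) is 17 ms, but the tail percentiles are dominated by the
# single slow request -- which is why averages hide what users experience.
```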

Maintainability

  • Operability, Simplicity, Evolvability
  • Making it easy for many different people to work on the system productively over time
Operability
  • Operation teams are vital to keep a system running smoothly
    • Monitoring health of system
    • Keeping software and platforms up to date
    • Tracking down the cause of issues such as degraded performance or system failures
    • Deployment and configuration management
    • Preserving organizational knowledge
  • Important to make these tasks easy
    • Avoid dependency on individual machines
    • Provide good documentation
    • Visibility into runtime behavior
    • Good default behavior
Simplicity
  • Simple systems are easy to reason about
  • Projects mired in complexity (big balls of mud) are hard to improve upon or fix
  • Good abstractions allow for increasing complexity
  • Key goal for systems we build
Evolvability
  • System must be extensible for future improvements
  • Good testing
  • Simple systems are easier to iterate on

Chapter 2. Data Models and Query Languages

  • Data models not only affect how the software is written, but how we think about the problems we are solving
  • Each layer provides a clean data model that hides the complexity of the layers below it

Relational Model vs Document Model

  • Relational model (SQL in modern days) proposed by Edgar Codd in 1970
  • Prior databases at the time forced application developers to think about the internal representation of the data in the database
  • NoSQL (“Not Only SQL”) spawned from a need for
    • Greater scalability than relational databases, i.e. large datasets or high write throughput
    • FOSS preference
    • Frustration with schema restrictiveness and desire for more dynamic and expressive model
  • The object-relational mismatch often requires ORMs to translate between the relational model and object-oriented application code
  • Network model (lost to relational model in 1970s)
    • CODASYL (Conference on Data System Languages) is an example
    • A generalization of the hierarchical model (in which each record had exactly one parent); the network model allowed a record to have multiple parents
    • Must know the access path to a record
    • Need to keep track of the path visited, since cycles are possible
    • Like traversal of linked list, but connections made at write time
  • Document model
    • Data locality (related data is stored together in the document)
    • Denormalization (duplication of data rather than reference)
    • Good when all the information in a document is needed at once, since the entire document is loaded (compare the sketch after this list)
    • Hard to model many-to-one or many-to-many relationships
    • Most don’t support native joins
    • Updating a document may not be performant when the update changes its size, since the whole document typically has to be rewritten
    • Schema on read
    • Allows schema to be flexible, don’t need to update all documents to add field
  • Relational model
    • Sets of relations (tables) with collection of tuples (rows)
    • The query optimizer is the abstraction that frees the application developer from thinking about how the data is represented and accessed
    • Schema on write
    • “Statically typed”
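
A small illustration of the contrast above, with made-up field names: the document version keeps everything together (locality, denormalization), while the relational version splits the data into tables of rows that reference each other by ID.

```python
# One self-contained document: good data locality, but organization names are
# duplicated rather than referenced, and many-to-many links are awkward.
user_document = {
    "user_id": 251,
    "name": "Ada",
    "positions": [
        {"title": "Engineer", "organization": "Initech"},
        {"title": "CTO", "organization": "Initrode"},
    ],
}

# The same data normalized into relations: each fact lives in one place and
# is referenced by ID, which makes many-to-one / many-to-many joins natural.
users = [{"user_id": 251, "name": "Ada"}]
organizations = [{"org_id": 1, "name": "Initech"},
                 {"org_id": 2, "name": "Initrode"}]
positions = [{"user_id": 251, "org_id": 1, "title": "Engineer"},
             {"user_id": 251, "org_id": 2, "title": "CTO"}]

# A "join" by hand: resolve org_id references back to organization names.
org_by_id = {o["org_id"]: o["name"] for o in organizations}
for p in positions:
    print(p["title"], "at", org_by_id[p["org_id"]])
```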

Query Languages for Data

  • Imperative vs. Declarative
    • Imperative requires executions to be applied in a certain order - think of a for loop with variables and mutable data structures
    • Declarative defines the pattern of the result but doesn’t specify how to get it, allowing optimizations to be performed behind the scenes (see the sketch after this list)
  • Examples of declarative languages: CSS, SQL, Cypher, MongoDB’s Aggregation Pipeline; MapReduce sits somewhere in between imperative and declarative
  • Graph-like Data Models
    • Vertices and edges
    • Property Graphs
      • Vertices & edges have unique ids and properties
      • Data schema / types flexible
      • Cypher is declarative query language for Neo4j graph database
    • Triple-store
      • Equivalent to property graph model using different terminology
      • All information stored in (subject, predicate, object) such as (Jim, likes, bananas)
      • Examples of implementation: Semantic web/RDF (Resource Description Framework)
      • SPARQL (SPARQL Protocol and RDF Query Language) is a declarative query language for triple-stores that use the RDF data model
    • Datalog
      • Similar to triple-store model, much older than Cypher or SPARQL
      • Queries are built up from small rules that define new (virtual) predicates
    • Differences from the network model (CODASYL)
      • Data types are flexible
      • Queries can be declarative, while CODASYL is always imperative
      • Vertices and edges are not ordered
      • Vertices and edges can be referred to by their unique id
  • Alternatives to the relational model
    • Graph-like for targeting use cases where everything is potentially related to everything
    • Document model for use cases where data comes in self-contained documents and cross-document relationships are rare
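
To make the imperative/declarative contrast concrete, here is the chapter’s sharks example sketched both ways; the SQL string is illustrative and assumes a hypothetical animals table with a family column.

```python
# Imperative: spell out *how* to compute the result, step by step.
def sharks_imperative(animals):
    sharks = []
    for animal in animals:
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative: state *what* result you want; the engine decides how to get it
# and is free to optimize behind the scenes.
SHARKS_DECLARATIVE = "SELECT * FROM animals WHERE family = 'Sharks';"

animals = [{"name": "Great white", "family": "Sharks"},
           {"name": "Blue whale", "family": "Whales"}]
print(sharks_imperative(animals))   # [{'name': 'Great white', 'family': 'Sharks'}]
```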

Chapter 3. Storage and Retrieval

Data Structures that power Databases

  • A log is an append-only data file used by many databases in various forms
  • An index is a data structure that can take various forms to improve lookup performance derived from primary data, i.e. it does not affect the data content of a database
  • Well-chosen indexes speed up read queries, but every index slows down writes
  • Compaction removes old, duplicated values within segments of the log data files, keeping only the most recent value for each key (see the sketch below)
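
A minimal sketch of the log + compaction idea, in the spirit of the chapter’s toy key-value store; the file name and CSV-style record format are arbitrary choices here.

```python
import os

LOG_PATH = "kvlog.txt"   # illustrative file name

def db_set(key, value):
    # Appending is a fast sequential write; old values are never overwritten.
    with open(LOG_PATH, "a") as log:
        log.write(f"{key},{value}\n")

def db_get(key):
    # Scan the whole log; the *last* entry for the key is its current value.
    result = None
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    return result

def compact():
    # Keep only the most recent value per key, discarding duplicates.
    latest = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                k, _, v = line.rstrip("\n").partition(",")
                latest[k] = v
    with open(LOG_PATH, "w") as log:
        for k, v in latest.items():
            log.write(f"{k},{v}\n")

db_set("42", "San Francisco")
db_set("42", "SF")
print(db_get("42"))   # SF -- the last entry for the key wins
compact()             # the log now holds a single entry for key 42
```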

Indexes

Hash Indexes

  • Example: keep an in-memory hash map where each key is mapped to a byte offset in the data file on disk (see the sketch after this list)
    • Well-suited for situations where the value for each key is updated frequently
    • Well-suited for situations where the number of keys is low
    • Bitcask, the default storage engine in Riak, uses this model as long as all the keys fit in the available RAM
  • Not good for range queries since keys are unsorted
  • Must fit into memory, or else key / value lookup will incur disk reads and performance costs
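
A toy Bitcask-style variant of the earlier log sketch: the in-memory hash map remembers, for each key, the byte offset of its latest record in the append-only file, so a read is one seek instead of a full scan (class and file names are made up).

```python
import os

class HashIndexedLog:
    """Toy Bitcask-style store: an append-only data file plus an in-memory
    hash map from key -> byte offset of the latest record for that key."""

    def __init__(self, path="segment.log"):      # illustrative file name
        self.path = path
        self.index = {}                           # key -> byte offset

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)       # record starts at end of file
            f.write(record)
        self.index[key] = offset                  # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None                           # unknown key: no disk access at all
        with open(self.path, "rb") as f:
            f.seek(offset)                        # one seek, no scan
            _, _, value = f.readline().decode().rstrip("\n").partition(",")
            return value

store = HashIndexedLog()
store.set("42", "San Francisco")
store.set("42", "SF")             # overwrite: only the in-memory offset changes
print(store.get("42"))            # SF
```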

Advantages of logs

  • Crash recovery - since all old data is kept until it’s no longer needed, no need to worry about crashing in the middle of a value being overwritten
  • Concurrency - since values are not updated in place, concurrency operations can occur
  • Compacting old data avoids the problem of data files becoming fragmented over time
  • Appending and segment merging are sequential write operations which tend to be much faster than random writes

LSM-Trees & SSTables

  • Rather than keeping an unsorted log of values, keep sequences of key-values sorted by key
    • Format is called Sorted String Table, or SSTable. Each key should only appear once per merged segment file
  • Merging segments becomes simpler and more efficient, and can use a merge-sort-like algorithm
  • Can keep a sparse index rather than all keys in memory; since keys are sorted, the index can be used to narrow down a key’s on-disk location
  • Blocks of records can be compressed before being written to disk, since range reads scan over neighboring key-value pairs anyway, saving disk space and I/O bandwidth
  • LSM-trees are Log-Structured Merge-Trees
    • Writes are added to in-memory balanced tree called a memtable
    • When memtable gets too big, write to an SSTable file, which can be done efficiently because the keys are already sorted. New writes occur to a new memtable instance
    • Fetching a key checks the memtable first, then the SSTable segments from newest to oldest (a toy version appears after this list)
    • Compaction merges SSTable segments in the background, keeping only the newest value for each key
  • Bloom filters can tell you when a key definitely does not exist in a set, so a lookup for a missing key doesn’t have to scan every SSTable
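
A toy in-memory version of the LSM write/read path described above. Sizes and names are arbitrary; a real engine writes SSTables to disk and streams the compaction merge rather than holding everything in memory.

```python
class TinyLSM:
    """Toy LSM-tree: writes go to a memtable; when it grows too large it is
    flushed as a sorted, immutable run (an in-memory stand-in for an SSTable);
    reads check the memtable, then runs from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}     # stand-in for a balanced tree (e.g. red-black tree)
        self.runs = []         # sorted [(key, value), ...] runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))  # flush in key order
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):        # newest segment first
            for k, v in run:                   # a real SSTable would binary-search
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in self.runs:                  # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())]   # one merged, de-duplicated run

db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
print(db.get("a"), db.get("c"))   # 3 4
db.compact()
```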

B-Trees

  • Ubiquitous in databases, mostly in relational databases
  • A tree of pages of a specific size, that reference other pages until a leaf page with values is reached
  • Each layer of pages has a more specific range of references
  • Inserting and deleting occur just like a tree - if page becomes overloaded with values, page is broken into 2 separate pages and the parent page is updated
  • Updating a single value rewrites the entire page in most storage engines
  • A B-tree with n keys always has a depth of O(log n)
  • The number of references to child pages in one page is called the branching factor, which depends on the space required to store a page and range boundaries - often it is several hundred
  • A four-level tree of 4KB pages with a branching factor of 500 can store up to 256TB (worked out below)
  • Often uses a write-ahead log (WAL) to record each change before it is applied to the pages themselves - this ensures that in the event of a crash the B-tree can be restored to a consistent state
  • Reads may see pages in an inconsistent state based on other writes to the same page - latches are used to prevent concurrency issues
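
A quick back-of-the-envelope check of the capacity figure above, treating all four levels as 500-way branching with 4KB data pages hanging below them; this is one way to reproduce the ≈256TB number, using decimal terabytes.

```python
branching_factor = 500
page_size_bytes = 4 * 1024
levels = 4

leaf_pages = branching_factor ** levels          # 500**4 = 62,500,000,000 pages
capacity_bytes = leaf_pages * page_size_bytes    # ~2.56e14 bytes
print(capacity_bytes / 1e12, "TB")               # ~256 TB
```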

B-Trees vs LSM-Trees

  • B-Trees can become fragmented over time, as data is not written sequentially and splitting or deleting pages can leave unused space scattered across the file
  • B-trees can have higher write amplification since data is always written at least twice, both to the WAL and the page
  • LSM-Trees can be compressed better
  • LSM-Trees need to deal with the large performance drains of compaction, especially if the rate of incoming writes is faster than the rate of compaction
  • B-Trees make it easier to offer strong transactional semantics, since each key exists in exactly one place and locks on key ranges can be attached directly to the tree

Other Indexing Structures

  • Clustered indexes store the row data within the index itself; covering indexes store only some columns, trading extra storage and write overhead for faster reads
  • Multi-column indexes
    • A concatenated index combines several fields into one key in a fixed order, so it cannot be used to query on the later columns alone
    • Multi-dimensional queries cannot be answered efficiently with standard B-trees or LSM-trees, and can use other data structures such as R-trees
  • Full-text search / fuzzy index
    • Allows searching for similar keys, such as misspelled words
    • Lucene uses trie-like data structures and Levenshtein automata
  • In-memory databases are growing in popularity
    • Cost / GB is going down for RAM storage, allowing cheaper in-memory databases
    • Solutions to durability such as periodically writing to disk, special hardware (battery-powered RAM) or replicating state in other machines
    • The in-memory performance advantage comes less from avoiding disk reads (the OS caches recently used pages anyway) and more from avoiding the overhead of encoding in-memory data structures into a form that can be written to disk

Data Warehouses

  • OLTP - online transaction processing
    • Small number of records / query, end-user, latest state of data
  • OLAP - online analytics processing
    • Aggregate over large number of records, history of events, used internally for business intelligence
  • Column-oriented storage
    • More efficient when queries only need a few columns - rather than loading entire wide rows, just load the columns needed
    • Each column is stored in its own “file” with n entries, where n is the number of rows, all in the same row order
    • Columns with few distinct values can be compressed very efficiently using bitmap encoding (see the sketch after this list)
    • Writing is more complicated, and uses LSM-trees often
  • Materialized Views
    • Denormalized, precomputed copies of query results written to disk - quicker to query since the values are already computed, at the cost of more work on writes
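
A sketch of the bitmap-encoding idea for a low-cardinality column. The column values are made up; real column stores typically run-length encode these bitmaps as well.

```python
# Bitmap-encode a low-cardinality column: one bitmap per distinct value,
# with one bit per row.
column = ["DE", "US", "US", "FR", "DE", "US"]     # made-up country column

bitmaps = {}
for row_number, value in enumerate(column):
    bitmaps.setdefault(value, [0] * len(column))[row_number] = 1

print(bitmaps)
# {'DE': [1, 0, 0, 0, 1, 0], 'US': [0, 1, 1, 0, 0, 1], 'FR': [0, 0, 0, 1, 0, 0]}

# A query like "country IN ('DE', 'FR')" becomes a bitwise OR of two bitmaps.
de_or_fr = [a | b for a, b in zip(bitmaps["DE"], bitmaps["FR"])]
print(de_or_fr)   # [1, 0, 0, 1, 1, 0]
```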

Chapter 4. Encoding and Evolution

  • Evolvability is a major focus when designing systems - change is constant
  • We often cannot make major changes in an atomic transaction, so we need to be backward / forward compatible
  • Backward compatibility: Newer code can read data written by older code
  • Forward compatibility: Older code can read data written by newer code

Formats for Encoding Data

  • Programs usually work with data in two ways
    1. In memory, optimized for efficient access using pointers: objects, structs, lists, arrays, hash tables, trees, etc.
    2. Self-contained sequence of bytes to be transmitted over a network or written to a file
  • Encoding / Marshalling / Serializing and Decoding / Unmarshalling / Deserialization translate between the two forms
  • Many languages have built-in language-specific encoding / decoding, but it is usually interoperable only within that language and raises security concerns, since decoding may need to instantiate arbitrary classes and can lead to arbitrary code execution

JSON, XML, Binary

  • XML - criticized as being too verbose & complicated
  • CSV - no schema, quite vague and escaping is hard (what to do with commas, newlines, etc?)
  • JSON - simple in comparison with XML, built-in browser support due to being a subset of JavaScript
  • Issues with JSON / XML / CSV
    • Ambiguity with number encoding - XML and CSV can’t distinguish between digit strings and numbers, and JSON does not distinguish integers from floating-point numbers, so large integers lose precision (demonstrated after this list)
    • Do not support binary strings, which often need to be encoded as Base64
    • Optional schema support, which can cause subtle problems
  • However, XML & JSON are good for many purposes as they are quick to set-up, flexible, and common
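
A quick demonstration of the large-number issue: IEEE 754 doubles (what JavaScript and many JSON parsers use for every number) cannot represent integers above 2^53 exactly, so the float cast below stands in for such a parser.

```python
big = 2**53 + 1                 # 9007199254740993
as_double = float(big)          # what a double-based JSON parser would hold
print(int(as_double))           # 9007199254740992 -- silently off by one
print(big == int(as_double))    # False
```
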
Binary Encoding
  • Binary encodings exist for JSON (MessagePack, BSON, BISON, etc.) and XML (WBXML, etc.)
  • Usually not as efficient as purpose-built binary encodings, since they still have to include the field names in the data

Thrift and Protocol Buffers

  • Binary encoding libraries made open-source in 2007-8
  • Require a schema for any encoded data
  • Include code generation tool to produce classes that conform to schema
  • Smaller data size since field names are not needed (defined in schema)
  • Schema evolution
    • Allows adding fields, which will be ignored by older readers; new fields cannot be required, however, or new readers would fail on data from old writers (see the sketch after this list)
    • Allows changing field names, since the encoded data refers only to field tags
    • Removing fields requires the field to be optional and the field tag to be reserved (never used again)
    • In some cases can change data type to be bigger, or to be a list rather than single element
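
A toy illustration of why tag-based formats evolve the way they do. This is not the real Thrift or Protocol Buffers wire format, just (tag, value) pairs plus a schema mapping tags to names; the field names loosely follow the book’s example record.

```python
# Toy tag-based encoding -- NOT the real Thrift/Protocol Buffers wire format.
OLD_SCHEMA = {1: "user_name", 2: "favorite_number"}
NEW_SCHEMA = {1: "user_name", 2: "favorite_number", 3: "interests"}  # tag 3 added

def encode(record, schema):
    name_to_tag = {name: tag for tag, name in schema.items()}
    return [(name_to_tag[name], value) for name, value in record.items()]

def decode(pairs, schema):
    record = {}
    for tag, value in pairs:
        if tag in schema:
            record[schema[tag]] = value
        # else: unknown tag from a newer writer -- skipped (forward compatibility)
    return record

new_data = encode({"user_name": "Martin", "favorite_number": 1337,
                   "interests": ["hacking"]}, NEW_SCHEMA)
print(decode(new_data, OLD_SCHEMA))   # old reader ignores the unknown tag 3

old_data = encode({"user_name": "Martin", "favorite_number": 1337}, OLD_SCHEMA)
print(decode(old_data, NEW_SCHEMA))   # new reader: "interests" is simply absent,
                                      # which is why new fields must be optional
```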

Avro

  • Started in 2009 to specifically fit with Hadoop
  • Can be more compact than Thrift / Protobufs
  • No type annotations in encoded format
  • Separate schema for writer and reader, but only need to be compatible
    • Values appear in the order the writer’s schema defines the fields, with no tags or names in the encoded data
  • The reader must be able to determine the writer’s schema
    • Object container files (large files with many records) embed the writer’s schema once at the start
    • Records in a database carry a schema version number
    • Processes sending records over a network can negotiate the schema version when the connection is set up
  • Schema evolution
    • Only add or remove fields that have a default value (see the sketch after this list)
    • Allows aliases, but changing names is a little trickier - backward compatible but not forward compatible
  • Advantage is dynamically generated schemas
    • Fields are not tagged by numbers, but by names
    • Can generate Avro records and object container file when reading a CSV or Relational DB
    • Thrift or Protobufs would require the schema to be assigned by hand and manually set the mapping from columns to field tags

The Merits of Schemas

  • Schemas support detailed validation
  • More compact than JSON, XML, and other schemaless encoding formats
  • Schema is a valuable form of documentation
  • Keeping a database of schemas allows forward and backward compatibility to be checked
  • Code generation is useful and provides compile-time type checking and run-time schema enforcement

Modes of Dataflow

  • Whenever you send data to another process with which you don’t share memory, the data needs to be encoded
  • Via databases, service calls, and asynchronous message passing

Databases

  • Storing data for a process in the future to access
  • Data lasts longer than code
  • Schema evolution allows the database to appear as if it were encoded with a single schema, even though the underlying storage may contain records encoded with many historical schema versions

REST and RPC

  • To communicate over the network, you often have clients and servers
  • HTTP is often the transport protocol for information over the public internet
    • If a service is communicated with by HTTP, it is referred to as a web service
  • Many APIs are built upon HTTP transport
  • Services are APIs exposed by a server that allows for query of data or actions to be performed
  • Organization-internal services may communicate through Remote Procedure Calls, or RPC
  • Many services may exist that serve small-scoped purposes - this is microservice architecture or service oriented architecture
  • REST and SOAP are two approaches to web services
    • REST - Representational State Transfer
      • Uses URLs to identify resources and HTTP features for authentication, cache control, content type negotiation, etc.
      • A simpler approach than SOAP
    • SOAP - Simple Object Access Protocol
      • Avoids most HTTP features
      • XML-based
      • API is described using WSDL (Web Services Description Language) that enables code generation
      • Interoperability between different vendors often causes problems
  • RPC aims to make a network request look like a local function call, but the two are inherently different and the network adds complexity
    • Unpredictable, because networks are unpredictable
    • Need to deal with timeouts if a call takes a long time; retrying may result in the action being performed twice
    • Network responses can get lost, and include much more latency than local function calls
    • Local function calls can efficiently pass references to local memory around - RPC calls must have easily-serializable data
  • New generation of RPC frameworks are explicit that they occur over a network, and aim to simplify service discovery and abstract over network issues
  • Can achieve better performance but are more suited for internal organization message passing

Message-Passing

  • Queue-like; unlike RPC, the sender usually doesn’t expect or wait for a response
  • Advantages over RPC
    • Act as a buffer if recipient is unavailable or overloaded
    • Can redeliver messages to a process that has crashed and restarted, preventing message loss
    • Avoids the sender needing to know the specific location of the recipient (useful in virtual deployments where machines come and go)
    • One Message -> Many Receivers
    • Decouples sender from receiver
  • Message brokers include RabbitMQ, ActiveMQ, Kafka
  • Producers publish messages to a “topic” or “channel”, and consumers subscribe to it
  • Actor model is a programming model for concurrency in a single process
  • Rather than deal with threads directly, each actor represents one client or entity and communicates with other actors by sending and receiving asynchronous messages
  • Message delivery is not guaranteed
  • Can abstract over whether the receiver is on the same node or another node