Book Notes: Designing Data-Intensive Applications (Part 1)

Book Notes on Part 1 of Designing Data-Intensive Applications - the topic is Foundations of Data Systems

Part 1

Chapter 1. Reliable, Scalable, and Maintainable Applications

  • Applications are data-intensive if their limiting factor is not compute resources but rather the amount of data, complexity of data or the speed at which it is changing
  • Data systems can take the form of databases, caches, queues, search indexes, etc.

Reliability

  • System should continue to work correctly even in the face of adversity
    • Can tolerate user mistakes
    • Performance is good enough under expected load and data volume
    • System prevents unauthorized use and abuse
  • A fault is one component deviating from its spec; a failure is when the system as a whole stops providing the required service
  • Many critical bugs are due to poor error handling
  • Faults include
    • hardware (disk failure, power outage)
    • software (bugs, incorrect assumptions)
    • human (configuration error, incorrect commands)
  • Aim to minimize opportunities for error and to limit the damage when faults occur
    • elegant API
    • smart defaults
    • rolling deploys
    • multiple hardware instances

Scalability

  • A system’s ability to cope with increased load
  • Load is described with load parameters that depend on the system - e.g. requests per second to a web server, hit rate on a cache, number of simultaneously active users
  • Performance also depends on the system type - examples are throughput, response time
    • If you increase a load parameter and keep system resources unchanged, how is performance affected?
    • If you increase a load parameter, how much do you need to increase resources to keep performance unchanged?
  • Often performance is not a single number but a distribution; response times are best reported as percentiles, e.g. the 95th percentile is the response time that 95% of requests come in under (a small sketch follows this list)
  • Coping with load
    • Scaling up - moving to a more powerful machine
    • Scaling out - adding instances and distributing load
  • Distributed systems can add more complexity but may be more flexible
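
A minimal sketch of the percentile idea above, using the nearest-rank method on made-up response times (all numbers are illustrative):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that at
    least p% of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank
    return ordered[max(rank, 1) - 1]

# Made-up response times in milliseconds for ten requests.
response_times_ms = [12, 14, 15, 15, 17, 20, 22, 25, 40, 320]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
# The median (p50) is 17 ms, but the tail percentiles are dominated by the
# single slow request -- which is why averages hide what users experience.
```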

Maintainability

  • Operability, Simplicity, Evolvability
  • Making it easy for many different people to work on the system productively over time
Operability
  • Operation teams are vital to keep a system running smoothly
    • Monitoring health of system
    • Keeping software and platforms up to date
    • Tracking down the cause of issues such as degraded performance or system failures
    • Deployment and configuration management
    • Preserving organizational knowledge
  • Important to make these tasks easy
    • Avoid dependency on individual machines
    • Provide good documentation
    • Visibility into runtime behavior
    • Good default behavior
Simplicity
  • Simple systems are easy to reason about
  • Projects mired in complexity (big balls of mud) are hard to improve upon or fix
  • Good abstractions allow for increasing complexity
  • Key goal for systems we build
Evolvability
  • System must be extensible for future improvements
  • Good testing
  • Simple systems are easier to iterate on

Chapter 2. Data Models and Query Languages

  • Data models not only affect how the software is written, but how we think about the problems we are solving
  • Each layer provides a clean data model that hides the complexity of the layers below it

Relational Model vs Document Model

  • Relational model (SQL in modern days) proposed by Edgar Codd in 1970
  • Prior databases at the time forced application developers to think about the internal representation of the data in the database
  • NoSQL (“Not Only SQL”) spawned from a need for
    • Greater scalability than relational databases, i.e. large datasets or high write throughput
    • FOSS preference
    • Frustration with schema restrictiveness and desire for more dynamic and expressive model
  • The object-relational mismatch often requires ORMs to translate between the relational model and object-oriented application code
  • Network model (lost to relational model in 1970s)
    • CODASYL (Conference on Data System Languages) is an example
    • A generalization of the hierarchical model (in which each record had exactly one parent); the network model allowed a record to have multiple parents
    • Must know the access path to a record
    • Need to keep track of the path visited, since cycles are possible
    • Like traversal of linked list, but connections made at write time
  • Document model
    • Data locality (related data is stored together in the document)
    • Denormalization (duplication of data rather than reference)
    • Good when all the information in a document is needed at once, since the entire document is loaded (compare the sketch after this list)
    • Hard to model many-to-one or many-to-many relationships
    • Most don’t support native joins
    • Updating a document may not be performant when the update changes its size, since the whole document typically has to be rewritten
    • Schema on read
    • Allows schema to be flexible, don’t need to update all documents to add field
  • Relational model
    • Sets of relations (tables) with collection of tuples (rows)
    • The query optimizer is the abstraction that frees the application developer from thinking about how the data is represented and accessed
    • Schema on write
    • “Statically typed”
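
A small illustration of the contrast above, with made-up field names: the document version keeps everything together (locality, denormalization), while the relational version splits the data into tables of rows that reference each other by ID.

```python
# One self-contained document: good data locality, but organization names are
# duplicated rather than referenced, and many-to-many links are awkward.
user_document = {
    "user_id": 251,
    "name": "Ada",
    "positions": [
        {"title": "Engineer", "organization": "Initech"},
        {"title": "CTO", "organization": "Initrode"},
    ],
}

# The same data normalized into relations: each fact lives in one place and
# is referenced by ID, which makes many-to-one / many-to-many joins natural.
users = [{"user_id": 251, "name": "Ada"}]
organizations = [{"org_id": 1, "name": "Initech"},
                 {"org_id": 2, "name": "Initrode"}]
positions = [{"user_id": 251, "org_id": 1, "title": "Engineer"},
             {"user_id": 251, "org_id": 2, "title": "CTO"}]

# A "join" by hand: resolve org_id references back to organization names.
org_by_id = {o["org_id"]: o["name"] for o in organizations}
for p in positions:
    print(p["title"], "at", org_by_id[p["org_id"]])
```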

Query Languages for Data

  • Imperative vs. Declarative
    • Imperative requires executions to be applied in a certain order - think of a for loop with variables and mutable data structures
    • Declarative defines the pattern of the result but doesn’t specify how to get it, allowing optimizations to be performed behind the scenes (see the sketch after this list)
  • Examples of declarative languages: CSS, SQL, Cypher, MongoDB’s Aggregation Pipeline; MapReduce sits somewhere in between imperative and declarative
  • Graph-like Data Models
    • Vertices and edges
    • Property Graphs
      • Vertices & edges have unique ids and properties
      • Data schema / types flexible
      • Cypher is declarative query language for Neo4j graph database
    • Triple-store
      • Equivalent to property graph model using different terminology
      • All information stored in (subject, predicate, object) such as (Jim, likes, bananas)
      • Examples of implementation: Semantic web/RDF (Resource Description Framework)
      • SPARQL (SPARQL Protocol and RDF Query Language) is a declarative query language for triple-stores that use the RDF data model
    • Datalog
      • Similar to triple-store model, much older than Cypher or SPARQL
      • Queries are built up from small rules that define new (virtual) predicates
    • Differences from the network model (CODASYL)
      • Data types are flexible
      • Queries can be declarative, while CODASYL is always imperative
      • Vertices and edges are not ordered
      • Vertices and edges can be referred to by their unique id
  • Alternatives to the relational model
    • Graph-like for targeting use cases where everything is potentially related to everything
    • Document model for use cases where data comes in self-contained documents and cross-document relationships are rare
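
To make the imperative/declarative contrast concrete, here is the chapter’s sharks example sketched both ways; the SQL string is illustrative and assumes a hypothetical animals table with a family column.

```python
# Imperative: spell out *how* to compute the result, step by step.
def sharks_imperative(animals):
    sharks = []
    for animal in animals:
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative: state *what* result you want; the engine decides how to get it
# and is free to optimize behind the scenes.
SHARKS_DECLARATIVE = "SELECT * FROM animals WHERE family = 'Sharks';"

animals = [{"name": "Great white", "family": "Sharks"},
           {"name": "Blue whale", "family": "Whales"}]
print(sharks_imperative(animals))   # [{'name': 'Great white', 'family': 'Sharks'}]
```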

Chapter 3. Storage and Retrieval

Data Structures that power Databases

  • A log is an append-only data file used by many databases in various forms
  • An index is a data structure that can take various forms to improve lookup performance derived from primary data, i.e. it does not affect the data content of a database
  • Well-chosen indexes speed up read queries, but every index slows down writes
  • Compaction removes old, duplicated values within segments of the log data files, keeping only the most recent value for each key (see the sketch below)
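
A minimal sketch of the log + compaction idea, in the spirit of the chapter’s toy key-value store; the file name and CSV-style record format are arbitrary choices here.

```python
import os

LOG_PATH = "kvlog.txt"   # illustrative file name

def db_set(key, value):
    # Appending is a fast sequential write; old values are never overwritten.
    with open(LOG_PATH, "a") as log:
        log.write(f"{key},{value}\n")

def db_get(key):
    # Scan the whole log; the *last* entry for the key is its current value.
    result = None
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    return result

def compact():
    # Keep only the most recent value per key, discarding duplicates.
    latest = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                k, _, v = line.rstrip("\n").partition(",")
                latest[k] = v
    with open(LOG_PATH, "w") as log:
        for k, v in latest.items():
            log.write(f"{k},{v}\n")

db_set("42", "San Francisco")
db_set("42", "SF")
print(db_get("42"))   # SF -- the last entry for the key wins
compact()             # the log now holds a single entry for key 42
```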

Indexes

Hash Indexes

  • Example: keep an in-memory hash map where each key is mapped to a byte offset in the data file on disk (see the sketch after this list)
    • Well-suited for situations where the value for each key is updated frequently
    • Well-suited for situations where the number of keys is low
    • Bitcask, the default storage engine in Riak, uses this model as long as all the keys fit in the available RAM
  • Not good for range queries since keys are unsorted
  • Must fit into memory, or else key / value lookup will incur disk reads and performance costs
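
A toy Bitcask-style variant of the earlier log sketch: the in-memory hash map remembers, for each key, the byte offset of its latest record in the append-only file, so a read is one seek instead of a full scan (class and file names are made up).

```python
import os

class HashIndexedLog:
    """Toy Bitcask-style store: an append-only data file plus an in-memory
    hash map from key -> byte offset of the latest record for that key."""

    def __init__(self, path="segment.log"):      # illustrative file name
        self.path = path
        self.index = {}                           # key -> byte offset

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)       # record starts at end of file
            f.write(record)
        self.index[key] = offset                  # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None                           # unknown key: no disk access at all
        with open(self.path, "rb") as f:
            f.seek(offset)                        # one seek, no scan
            _, _, value = f.readline().decode().rstrip("\n").partition(",")
            return value

store = HashIndexedLog()
store.set("42", "San Francisco")
store.set("42", "SF")             # overwrite: only the in-memory offset changes
print(store.get("42"))            # SF
```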

Advantages of logs

  • Crash recovery - since all old data is kept until it’s no longer needed, no need to worry about crashing in the middle of a value being overwritten
  • Concurrency - since values are not updated in place, concurrency operations can occur
  • Compacting old data avoids the problem of data files becoming fragmented over time
  • Appending and segment merging are sequential write operations which tend to be much faster than random writes

LSM-Trees & SSTables

  • Rather than keeping an unsorted log of values, keep sequences of key-values sorted by key
    • Format is called Sorted String Table, or SSTable. Each key should only appear once per merged segment file
  • Merging segments becomes simpler and more efficient, and can use a merge-sort-like algorithm
  • Can keep a sparse index rather than all keys in memory; since keys are sorted, the index can be used to narrow down a key’s on-disk location
  • Blocks of records can be compressed before being written to disk, since range reads scan over neighboring key-value pairs anyway, saving disk space and I/O bandwidth
  • LSM-trees are Log-Structured Merge-Trees
    • Writes are added to in-memory balanced tree called a memtable
    • When memtable gets too big, write to an SSTable file, which can be done efficiently because the keys are already sorted. New writes occur to a new memtable instance
    • Fetching a key checks the memtable first, then the SSTable segments from newest to oldest (a toy version appears after this list)
    • Compaction merges SSTable segments in the background, keeping only the newest value for each key
  • Bloom filters can tell you when a key definitely does not exist in a set, so a lookup for a missing key doesn’t have to scan every SSTable
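
A toy in-memory version of the LSM write/read path described above. Sizes and names are arbitrary; a real engine writes SSTables to disk and streams the compaction merge rather than holding everything in memory.

```python
class TinyLSM:
    """Toy LSM-tree: writes go to a memtable; when it grows too large it is
    flushed as a sorted, immutable run (an in-memory stand-in for an SSTable);
    reads check the memtable, then runs from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}     # stand-in for a balanced tree (e.g. red-black tree)
        self.runs = []         # sorted [(key, value), ...] runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))  # flush in key order
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):        # newest segment first
            for k, v in run:                   # a real SSTable would binary-search
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in self.runs:                  # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())]   # one merged, de-duplicated run

db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
print(db.get("a"), db.get("c"))   # 3 4
db.compact()
```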

B-Trees

  • Ubiquitous in databases, mostly in relational databases
  • A tree of pages of a specific size, that reference other pages until a leaf page with values is reached
  • Each layer of pages has a more specific range of references
  • Inserting and deleting occur just like a tree - if page becomes overloaded with values, page is broken into 2 separate pages and the parent page is updated
  • Updating a single value rewrites the entire page in most storage engines
  • A B-tree with n keys always has a depth of O(log n)
  • The number of references to child pages in one page is called the branching factor, which depends on the space required to store a page and range boundaries - often it is several hundred
  • A four-level tree of 4KB pages with a branching factor of 500 can store up to 256TB (worked out below)
  • Often uses a write-ahead log (WAL) to record each change before it is applied to the pages themselves - this ensures that in the event of a crash the B-tree can be restored to a consistent state
  • Reads may see pages in an inconsistent state based on other writes to the same page - latches are used to prevent concurrency issues
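
A quick back-of-the-envelope check of the capacity figure above, treating all four levels as 500-way branching with 4KB data pages hanging below them; this is one way to reproduce the ≈256TB number, using decimal terabytes.

```python
branching_factor = 500
page_size_bytes = 4 * 1024
levels = 4

leaf_pages = branching_factor ** levels          # 500**4 = 62,500,000,000 pages
capacity_bytes = leaf_pages * page_size_bytes    # ~2.56e14 bytes
print(capacity_bytes / 1e12, "TB")               # ~256 TB
```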

B-Trees vs LSM-Trees

  • B-Trees can become fragmented over time, as data is not written sequentially and splitting or deleting pages can leave unused space scattered across the file
  • B-trees can have higher write amplification since data is always written at least twice, both to the WAL and the page
  • LSM-Trees can be compressed better
  • LSM-Trees need to deal with the large performance drains of compaction, especially if the rate of incoming writes is faster than the rate of compaction
  • B-Trees make it easier to offer strong transactional semantics, since each key exists in exactly one place and locks on key ranges can be attached directly to the tree

Other Indexing Structures

  • Clustered indexes store the row data within the index itself; covering indexes store only some columns, trading extra storage and write overhead for faster reads
  • Multi-column indexes
    • A concatenated index combines several fields into one key in a fixed order, so it cannot be used to query on the later columns alone
    • Multi-dimensional queries cannot be answered efficiently with standard B-trees or LSM-trees, and can use other data structures such as R-trees
  • Full-text search / fuzzy index
    • Allows searching for similar keys, such as misspelled words
    • Lucene uses trie-like data structures and Levenshtein automata
  • In-memory databases are growing in popularity
    • Cost / GB is going down for RAM storage, allowing cheaper in-memory databases
    • Solutions to durability such as periodically writing to disk, special hardware (battery-powered RAM) or replicating state in other machines
    • The in-memory performance advantage comes less from avoiding disk reads (the OS caches recently used pages anyway) and more from avoiding the overhead of encoding in-memory data structures into a form that can be written to disk

Data Warehouses

  • OLTP - online transaction processing
    • Small number of records / query, end-user, latest state of data
  • OLAP - online analytics processing
    • Aggregate over large number of records, history of events, used internally for business intelligence
  • Column-oriented storage
    • More efficient when queries only need a few columns - rather than loading entire wide rows, just load the columns needed
    • Each column is stored in its own “file” with n entries, where n is the number of rows, all in the same row order
    • Columns with few distinct values can be compressed very efficiently using bitmap encoding (see the sketch after this list)
    • Writing is more complicated, and uses LSM-trees often
  • Materialized Views
    • Denormalized, precomputed copies of query results written to disk - quicker to query since the values are already computed, at the cost of more work on writes
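
A sketch of the bitmap-encoding idea for a low-cardinality column. The column values are made up; real column stores typically run-length encode these bitmaps as well.

```python
# Bitmap-encode a low-cardinality column: one bitmap per distinct value,
# with one bit per row.
column = ["DE", "US", "US", "FR", "DE", "US"]     # made-up country column

bitmaps = {}
for row_number, value in enumerate(column):
    bitmaps.setdefault(value, [0] * len(column))[row_number] = 1

print(bitmaps)
# {'DE': [1, 0, 0, 0, 1, 0], 'US': [0, 1, 1, 0, 0, 1], 'FR': [0, 0, 0, 1, 0, 0]}

# A query like "country IN ('DE', 'FR')" becomes a bitwise OR of two bitmaps.
de_or_fr = [a | b for a, b in zip(bitmaps["DE"], bitmaps["FR"])]
print(de_or_fr)   # [1, 0, 0, 1, 1, 0]
```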

Chapter 4. Encoding and Evolution

  • Evolvability is a major focus when designing systems - change is constant
  • We often cannot make major changes in an atomic transaction, so we need to be backward / forward compatible
  • Backward compatibility: Newer code can read data written by older code
  • Forward compatibility: Older code can read data written by newer code

Formats for Encoding Data

  • Programs usually work with data in two ways
    1. In memory, optimized for efficient access using pointers: objects, structs, lists, arrays, hash tables, trees, etc.
    2. Self-contained sequence of bytes to be transmitted over a network or written to a file
  • Encoding / Marshalling / Serializing and Decoding / Unmarshalling / Deserialization translate between the two forms
  • Many languages have built-in language-specific encoding / decoding, but it is usually interoperable only within that language and raises security concerns, since decoding may need to instantiate arbitrary classes and can lead to arbitrary code execution

JSON, XML, Binary

  • XML - criticized as being too verbose & complicated
  • CSV - no schema, quite vague and escaping is hard (what to do with commas, newlines, etc?)
  • JSON - simple in comparison with XML, built-in browser support due to being a subset of JavaScript
  • Issues with JSON / XML / CSV
    • Ambiguity with number encoding - XML and CSV can’t distinguish between digit strings and numbers, and JSON does not distinguish integers from floating-point numbers, so large integers lose precision (demonstrated after this list)
    • Do not support binary strings, which often need to be encoded as Base64
    • Optional schema support, which can cause subtle problems
  • However, XML & JSON are good for many purposes as they are quick to set-up, flexible, and common
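
A quick demonstration of the large-number issue: IEEE 754 doubles (what JavaScript and many JSON parsers use for every number) cannot represent integers above 2^53 exactly, so the float cast below stands in for such a parser.

```python
big = 2**53 + 1                 # 9007199254740993
as_double = float(big)          # what a double-based JSON parser would hold
print(int(as_double))           # 9007199254740992 -- silently off by one
print(big == int(as_double))    # False
```
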
Binary Encoding
  • Binary encodings exist for JSON (MessagePack, BSON, BISON, etc.) and XML (WBXML, etc.)
  • Usually not as efficient as purpose-built binary encodings, since they still have to include the field names in the data

Thrift and Protocol Buffers

  • Binary encoding libraries made open-source in 2007-8
  • Require a schema for any encoded data
  • Include code generation tool to produce classes that conform to schema
  • Smaller data size since field names are not needed (defined in schema)
  • Schema evolution
    • Allows adding fields, which will be ignored by older readers; new fields cannot be required, however, or new readers would fail on data from old writers (see the sketch after this list)
    • Allows changing field names, since the encoded data refers only to field tags
    • Removing fields requires the field to be optional and the field tag to be reserved (never used again)
    • In some cases can change data type to be bigger, or to be a list rather than single element
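
A toy illustration of why tag-based formats evolve the way they do. This is not the real Thrift or Protocol Buffers wire format, just (tag, value) pairs plus a schema mapping tags to names; the field names loosely follow the book’s example record.

```python
# Toy tag-based encoding -- NOT the real Thrift/Protocol Buffers wire format.
OLD_SCHEMA = {1: "user_name", 2: "favorite_number"}
NEW_SCHEMA = {1: "user_name", 2: "favorite_number", 3: "interests"}  # tag 3 added

def encode(record, schema):
    name_to_tag = {name: tag for tag, name in schema.items()}
    return [(name_to_tag[name], value) for name, value in record.items()]

def decode(pairs, schema):
    record = {}
    for tag, value in pairs:
        if tag in schema:
            record[schema[tag]] = value
        # else: unknown tag from a newer writer -- skipped (forward compatibility)
    return record

new_data = encode({"user_name": "Martin", "favorite_number": 1337,
                   "interests": ["hacking"]}, NEW_SCHEMA)
print(decode(new_data, OLD_SCHEMA))   # old reader ignores the unknown tag 3

old_data = encode({"user_name": "Martin", "favorite_number": 1337}, OLD_SCHEMA)
print(decode(old_data, NEW_SCHEMA))   # new reader: "interests" is simply absent,
                                      # which is why new fields must be optional
```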

Avro

  • Started in 2009 to specifically fit with Hadoop
  • Can be more compact than Thrift / Protobufs
  • No type annotations in encoded format
  • Separate schema for writer and reader, but only need to be compatible
    • Values appear in the order the writer’s schema defines the fields, with no tags or names in the encoded data
  • The reader must be able to determine the writer’s schema
    • Object container files (large files with many records) embed the writer’s schema once at the start
    • Records in a database carry a schema version number
    • Processes sending records over a network can negotiate the schema version when the connection is set up
  • Schema evolution
    • Only add or remove fields that have a default value (see the sketch after this list)
    • Allows aliases, but changing names is a little trickier - backward compatible but not forward compatible
  • Advantage is dynamically generated schemas
    • Fields are not tagged by numbers, but by names
    • Can generate Avro records and object container file when reading a CSV or Relational DB
    • Thrift or Protobufs would require the schema to be assigned by hand and manually set the mapping from columns to field tags

The Merits of Schemas

  • Schemas support detailed validation
  • More compact than JSON, XML, and other schemaless encoding formats
  • Schema is a valuable form of documentation
  • Keeping a database of schemas allows forward and backward compatibility to be checked
  • Code generation is useful and provides compile-time type checking and run-time schema enforcement

Modes of Dataflow

  • Whenever you send data to another process with which you don’t share memory, the data needs to be encoded
  • Via databases, service calls, and asynchronous message passing

Databases

  • Storing data for a process in the future to access
  • Data lasts longer than code
  • Schema evolution allows the database to appear as if it were encoded with a single schema, even though the underlying storage may contain records encoded with many historical schema versions

REST and RPC

  • To communicate over the network, you often have clients and servers
  • HTTP is often the transport protocol for information over the public internet
    • If a service is communicated with by HTTP, it is referred to as a web service
  • Many APIs are built upon HTTP transport
  • Services are APIs exposed by a server that allows for query of data or actions to be performed
  • Organization-internal services may communicate through Remote Procedure Calls, or RPC
  • Many services may exist that serve small-scoped purposes - this is microservice architecture or service oriented architecture
  • REST and SOAP are two approaches to web services
    • REST - Representational State Transfer
      • Uses URLs to identify resources and HTTP features for authentication, cache control, content type negotiation, etc.
      • A simpler approach than SOAP
    • SOAP - Simple Object Access Protocol
      • Avoids most HTTP features
      • XML-based
      • API is described using WSDL (Web Services Description Language) that enables code generation
      • Interoperability between different vendors often causes problems
  • RPC aims to make a network request look like a local function call, but the two are inherently different and the network adds complexity
    • Unpredictable, because networks are unpredictable
    • Need to deal with timeouts if a call takes a long time; retrying may result in the action being performed twice
    • Network responses can get lost, and include much more latency than local function calls
    • Local function calls can efficiently pass references to local memory around - RPC calls must have easily-serializable data
  • New generation of RPC frameworks are explicit that they occur over a network, and aim to simplify service discovery and abstract over network issues
  • Can achieve better performance but are more suited for internal organization message passing

Message-Passing

  • Queue-like; unlike RPC, the sender usually doesn’t expect or wait for a response
  • Advantages over RPC
    • Act as a buffer if recipient is unavailable or overloaded
    • Can redeliver messages to a process that has crashed and restarted, preventing message loss
    • Avoids the sender needing to know the specific location of the recipient (useful in virtual deployments where machines come and go)
    • One Message -> Many Receivers
    • Decouples sender from receiver
  • Message brokers include RabbitMQ, ActiveMQ, Kafka
  • Producers publish messages to a “topic” or “channel”, and consumers subscribe to it
  • Actor model is a programming model for concurrency in a single process
  • Rather than deal with threads directly, each actor represents one client or entity and communicates with other actors by sending and receiving asynchronous messages
  • Message delivery is not guaranteed
  • Can abstract over whether the receiver is on the same node or another node