References, articles, and information on various database systems and database internals. It serves as a reference and bookmarks for systems, architectures, and implementation details.
An outline of database systems and architecture. Work in progress. The goal is for me to structure my notes better and to have a reference for when I need to look up something.
Neon is a system that is both protocol and application-compatible with Postgres. Its architecture, which draws inspiration from Aurora's design, separates compute and storage and is managed by a control plane1.
At its core, Neon operates on the Page (8KB) abstraction level. It functions as a replacement for the Postgres storage engine without concern of the specific page content. This approach allows for a minimal integration surface with Postgres, enabling the use of only slightly modified forks of the database.
The storage layer is composed of Pageservers and Safekeepers. Pageservers have the responsibility of storing Postgres pages, whereas Safekeepers are tasked with persisting the write-ahead logs (WALs). When Postgres needs to read data, it retrieves pages directly from the Pageserver.
WAL records are streamed to the Safekeeper cluster. For reliability, a cluster of three Safekeeper nodes is maintained using Paxos. Following this, Pageservers ingest the WALs from Safekeepers and store the data both locally and on S3 as "layers". The layers enable point in time backups and efficient branching utilizing copy-on-write technique 2. To efficiently handle read requests, Pageservers compact these layers and create materialized pages from the WAL data 3.
YugabyteDB is a high-performance, distributed SQL database. It is built on a globally-distributed, strongly-consistent document store and is fully compatible with PostgreSQL. It offers horizontal scalability and resilience 6.
It uses a modified Raft algorithm for data replication and sharding for data distribution. It is optimized for cloud-native environments and supports various deployment options.
The storage engine, DocDB, is based on RocksDB. Unlike RocksDB it's a "key to object/document" instead of a "key to value" store 7.
Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. It uses a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB. It runs in a single-master configuration where only a single node can process write requests and all other nodes are read replicas. If the writer node becomes unavailable, a failover mechanism promotes one of the read-only nodes to be the new writer 4.
A recent addition to Aurora allows for multi-master configurations, albeit with limitations and new failure modes 9.
As a single-node system, DuckDB shares high-level architecture with Velox, HyPer, and Umbra. It uses a "morsel-driven" query execution framework where query plans are represented as trees of operators 11 12.
The architecture follows a process-per-connection model, offering a trade-off between fault isolation and resource consumption. Its MVCC implementation provides granular control over transactions, offering advanced isolation levels without lock contention.
Its write-ahead log serves as the cornerstone for data durability and integrity, and facilitates features like point-in-time recovery and various replication configurations.
The system's extensibility is evident through a decade of extensions for Postgres, systems that fork to build new systems on, or use the query engine as a library.
- 📘 PostgreSQL 14 Internals - A deep dive into the server mechanics (E Rogov 2022)
- 🏷️ PostgreSQL Lock Conflicts
- 🏷️ PostgreSQL Internals Through Pictures (B Momjian 2015) 📺
- 🏷️ The internals of PostgreSQL (H Suzuki 2021)
- 🏷️ Postgres architecture
- 🏷️ Solving PostgreSQL wicked problems (2021)
- 🏷️ Unofficial documentation for PostgreSQL hooks
- 🏷️ The four types of Postgres extensions
- 🏷️ PostgreSQL execution plan visualizer
- 🏷️ pgstats: a vmstat-like tool for PostgreSQL
This is a good post on isolation levels for database transactions. The post looks at the isolation levels and their anomalies. Specifically how Postgres implements isolation levels, but it's a good overview nevertheless.
Hermitage by Martin Kleppmanns, is an attempt to nail down precisely what different database systems actually mean with their isolation levels. It's a suite of tests that simulates various concurrency issues — some common, some more obscure — and documents how different databases handle those situations.
The journey to understand disk storage could well start with Information Retrieval (CJ van Rijsbergen 1979), File Structures. This focuses on file structures for document retrieval but also reveals that many challenges we encounter today were already addressed five decades ago.
Despite this, storing data on disk securely and durably remains a complex task. The main challenges can be grouped into three key layers:
File API: Complex and error-prone, making it difficult to enforce data ordering without expensive cache flushes.
File Systems: Poor error-handling, leading to data corruption and loss. Even well-known file systems frequently ignore or mishandle errors.
Disks: Unreliable flushing mechanisms and inconsistent error rates further compromise data integrity.
Overall, systemic weaknesses exist from the application layer down to the hardware, making data loss and corruption a lasting concern 14.
📄 Ubiquitous B-tree. Comer. ACM Computing Surveys (CSUR) 11 (1979)
"The Ubiquitous B-Tree" is considered a seminal work that played a significant role in promoting the understanding and adoption of B-trees. The paper helped to solidify the B-tree's position as a go-to data structure for database management systems and file systems. It made the concept accessible and laid out the advantages, contributing to B-trees becoming a standard choice for many storage systems.
The B-tree version described in the paper includes data in the internal nodes. In modern implementations, it's common to only store data in the leaf nodes. This makes B-trees more efficient for range queries and better suited for systems where data is stored on disk. The internal nodes in a B+-tree contain only keys and act as a guide to locate the leaf node containing the desired data.
📄 Modern B-tree techniques. Graefe, and others. Foundations and Trends in Databases 3 (2011)
This paper provides an updated look at B-tree optimizations and techniques that have emerged over the years. It's a must-read for understanding contemporary B-tree practices.
📄 Two is Better Than One: The Case for 2-Tree for Skewed Data Sets. Zhou, Yu, Graefe, and Stonebraker. memory 11 (2023) 📺
This is a discussion on a 2-Tree architecture to improve memory utilization in databases with skewed data sets. B-trees don't handle skewed data well, leading to poor memory utilization. In a 2-Tree, frequently accessed records are stored in one tree, and cold records in another. A migration protocol moves records between the trees as their access frequency changes.
It's designed to be an on-disk index that handles larger-than-memory indexes effectively.
The record migration protocol adaptable to any B-tree.
Next step would be to expand the approach for an N-Tree system with multiple storage levels. It's like an LSM of B-tree's except that the levels are mutable and records can move between them.
The approach aims to significantly improve memory utilization and throughput without compromising range scan performance.
📄 Adaptive Hybrid Indexes. Anneser, Kipf, Zhang, Neumann, and Kemper (2022)
The problem this paper tries to address is the large memory usage of indices in database systems. Traditional compact indexes, while reducing space overhead, compromise on performance due to their static nature. Unable to to adjust to varying workload patterns and to trade-off between memory efficiency and performance is a challenge.
The objective they try to achieve is is to optimize both space and performance in an index structures by dynamically adjusting encodings at run-time based on workload patterns.
🏷️ Taming B-Trees (ScyllaDB)
Is an article on how ScyllaDB transitioned from Red-Black to B-trees to enhance its in-memory cache, addressing issues like suboptimal memory consumption and CPU usage during tree searches. While B-trees are often thought of as an on-disk data structure the article goes into how B-trees offer better cache locality and mitigate branch misprediction problems associated with binary trees, optimizing key searches and overall performance.
📒 The Log-Structured Merge-Tree (LSM-Tree) (P O'Neil 1996).
It's the 1996 paper by Patrick O'Neil that introduces the LSM Tree. However, it didn't gain widespread adoption until a decade later. This delay was largely due to hardware constraints and the lack of immediate need for the write-optimization benefits LSM Trees offered. With the advent of large-scale, write-intensive applications and improved hardware, LSM Trees became popular and are now fundamental in most databases.
Google introduced LevelDB specifically to optimize for high write throughput on commodity hardware. Before this big databases primarily relied on B-trees and variations thereof. These databases were often optimized for read-heavy workloads and deployed on specialized, high-end hardware.
📒 An Empirical Guide to the Behavior and Use of Scalable Persistent Memory (FAST '20, Yang 2020) 📺
📒 FPGA-Accelerated Compactions for LSM-based Key-Value Store (FAST '20, Zhang 2020) 📺
- 🏷️ Pebble LSM tree in Go
- 🏷️ Alibaba X-engine (alibabacloud.com)
- 🏷️ X-Engine Storage Engine: Putting LSM-Tree into NVM Part 2
- 📘 Database System Concepts 7th ed. (A Silberschatz 2019).
- 📘 Database Internals (A Petrov 2019).
- 📘 Readings in Database Systems, 5th Edition (P Bailis, J Hellerstein, M Stonebraker 2015).
- 📘 PostgreSQL 14 Internals - A deep dive into the server mechanics (E Rogov 2022).
- 📘 Indexing Beyond The Basics (ebook) (T Petry 2023). An applied approach good for application development.
- pgBadger - A fast PostgreSQL log analysis tool that generates detailed reports. It offers various graphs, charts, and counters that help in identifying slow queries and performance bottlenecks.
A copy-on-write clone of a Neon project's primary branch or previously created child branch. A branch can be created from the current or past state of the parent branch. Neon uses the copy-on-write technique to copy data when creating a branch. Neon Branching ↩︎
Heikki Linnakangas's presentation for the "¡Databases! Seminar" series. He goes into the storage architecture for Neon and mentions how they don't try to solve the multi-writer problem. Neon is a conventional Postgres with a network storage interface that allow them to separate compute from storing files on disk. Video ↩︎
Persistent Storage of Adaptive Radix Trees in DuckDB. DuckDB uses Adaptive Radix Tree (ART) Indexes to enforce constraints and to speed up query filters. It persist ART Indexes to disk. ↩︎
Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age (SIGMOD’14, V Leis 2014). ↩︎
The INGRES relational database management system (DBMS) was implemented during 1975-1977 at the Univerisity of California. Since 1978 various prototype extensions have been made to support distributed databases.
- The design of Postgres. Stonebraker, and Rowe. ACM Sigmod Record 15 (1986)
- A Brief History of PostgreSQL (postgresql.org)
Writing to a file safely—without data corruption—is more complex than commonly thought. File APIs are difficult to use correctly, file systems can drop critical errors, and disks can corrupt data at alarming rates. To write a file safely, one would typically use a system call like
pwrite, which allows interaction with the file system. Despite high-level languages offering simpler interfaces, they ultimately rely on such system calls, and the complexity and risks remain. FILES. ↩︎