Glossary
Note
Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.
- asynchronous
Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See “Synchronous Versus Asynchronous Replication”, “Synchronous Versus Asynchronous Networks”, and “System Model and Reality”.
- atomic
  - In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also isolation.
  - In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See “Atomicity” and “Atomic Commit and Two-Phase Commit (2PC)”.
- backpressure
Forcing the sender of some data to slow down because the recipient cannot keep up with it. Also known as flow control. See “Messaging Systems”.
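As a minimal sketch of the idea, the following uses a bounded in-memory queue so that a sender is refused once the recipient falls behind. The buffer size and the `try_send` helper are hypothetical names chosen for illustration, not part of any particular messaging system.

```python
import queue

# Hypothetical bounded buffer between a producer and a consumer.
buffer = queue.Queue(maxsize=2)

def try_send(item):
    """Return True if the item was accepted, False if the sender must back off."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False

accepted = [try_send(i) for i in range(3)]  # the third send is refused: buffer is full
buffer.get_nowait()                          # the consumer catches up by one item
accepted.append(try_send(3))                 # now there is room again
```

A blocking `put` would instead make the sender wait until space is available, which slows it down without dropping anything.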
- batch process
A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See Chapter 10.
- bounded
Having some known upper limit or size. Used for example in the context of network delay (see “Timeouts and Unbounded Delays”) and datasets (see the introduction to Chapter 11).
- Byzantine fault
A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “Byzantine Faults”.
- cache
A component that remembers recently used data in order to speed up future reads of the same data. It is generally not complete: thus, if some data is missing from the cache, it has to be fetched from some underlying, slower data storage system that has a complete copy of the data.
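The read path described above could be sketched as follows. The `backing_store` dictionary stands in for the slower, complete storage system, and all names here are hypothetical.

```python
# Hypothetical slow backing store that holds the complete data.
backing_store = {"user:1": "Alice", "user:2": "Bob"}

cache = {}        # incomplete: only holds keys that have been read before
store_reads = 0   # counts how often the slow store had to be consulted

def read(key):
    """Serve from the cache if possible, otherwise fetch and remember."""
    global store_reads
    if key in cache:
        return cache[key]
    store_reads += 1
    value = backing_store[key]
    cache[key] = value
    return value

first = read("user:1")   # cache miss: goes to the backing store
second = read("user:1")  # cache hit: the backing store is not touched
```

This sketch ignores eviction and invalidation, which a real cache needs once the underlying data can change.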
- CAP theorem
A widely misunderstood theoretical result that is not useful in practice. See “The CAP theorem”.
- causality
The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See “The “happens-before” relationship and concurrency” and “Ordering and Causality”.
- consensus
A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See “Fault-Tolerant Consensus”.
- data warehouse
A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See “Data Warehousing”.
- declarative
Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of queries, a query optimizer takes a declarative query and decides how it should best be executed. See “Query Languages for Data”.
- denormalize
To introduce some amount of redundancy or duplication in a normalized dataset, typically in the form of a cache or index, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See “Single-Object and Multi-Object Operations” and “Deriving several views from the same event log”.
- derived data
A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See the introduction to Part III.
- deterministic
Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things.
- distributed
Running on several nodes connected by a network. Characterized by partial failures: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See “Faults and Partial Failures”.
- durable
Storing data in a way such that you believe it will not be lost, even if various faults occur. See “Durability”.
- ETL
Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See “Data Warehousing”.
- failover
In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See “Handling Node Outages”.
- fault-tolerant
Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See “Reliability”.
- flow control
See backpressure.
- follower
A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a secondary, slave, read replica, or hot standby. See “Leaders and Followers”.
- full-text search
Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of secondary index that supports such queries. See “Full-text search and fuzzy indexes”.
- graph
A data structure consisting of vertices (things that you can refer to, also known as nodes or entities) and edges (connections from one vertex to another, also known as relationships or arcs). See “Graph-Like Data Models”.
- hash
A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a collision). See “Partitioning by Hash of Key”.
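To illustrate these properties, here is a small sketch that derives a number from a key using MD5 from Python's standard library. The helper names `stable_hash` and `partition_for` are hypothetical, and taking the hash modulo the number of partitions is only the simplest possible assignment, used here for illustration.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Turn a string into a random-looking 64-bit number, the same on every run."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def partition_for(key: str, num_partitions: int) -> int:
    """Assign a key to a partition (simplistic: hash modulo partition count)."""
    return stable_hash(key) % num_partitions

same = stable_hash("user:42") == stable_hash("user:42")  # same input, same output
partition = partition_for("user:42", 8)                  # always in the range 0..7
```

Python's built-in `hash()` is avoided here because its result for strings can differ between processes, which would make it unsuitable for choosing a partition.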
- idempotent
Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it were executed only once. See “Idempotence”.
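One common way to make an otherwise non-idempotent operation safe to retry is to tag each request with an identifier and ignore repeats. The sketch below assumes a hypothetical `Account` class and caller-supplied request IDs; it is an illustration, not a prescribed implementation.

```python
class Account:
    def __init__(self):
        self.balance = 0
        self.applied = set()  # IDs of requests that have already taken effect

    def deposit(self, request_id, amount):
        """Apply the deposit at most once per request_id, however often it is retried."""
        if request_id in self.applied:
            return self.balance
        self.applied.add(request_id)
        self.balance += amount
        return self.balance

account = Account()
account.deposit("req-1", 100)
account.deposit("req-1", 100)  # a retry of the same request has no further effect
```

Without the `applied` set, the retry would add the amount a second time, so a plain increment is not idempotent.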
- index
A data structure that lets you efficiently search for all records that have a particular value in a particular field. See “Data Structures That Power Your Database”.
- isolation
In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. Serializable isolation provides the strongest guarantees, but weaker isolation levels are also used. See “Isolation”.
- join
To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See “Many-to-One and Many-to-Many Relationships” and “Reduce-Side Joins and Grouping”.
- leader
When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the primary or master. See “Leaders and Followers”.
- linearizable
Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability”.
- locality
A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries”.
- lock
A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” and “The leader and the lock”.
- log
An append-only file for storing data. A write-ahead log is used to make a storage engine resilient against crashes (see “Making B-trees reliable”), a log-structured storage engine uses logs as its primary storage format (see “SSTables and LSM-Trees”), a replication log is used to copy writes from a leader to followers (see “Leaders and Followers”), and an event log can represent a data stream (see “Partitioned Logs”).
- materialize
To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See “Aggregation: Data Cubes and Materialized Views” and “Materialization of Intermediate State”.
- node
An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
- normalized
Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to-Many Relationships”.
- OLAP
Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?”.
- OLTP
Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?”.
- partitioning
Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding. See Chapter 6.
- percentile
A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period complete in less than t, and 5% take longer than t. See “Describing Performance”.
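As a worked example, a percentile can be computed by sorting the measurements and picking the value at the corresponding rank. The sketch below uses the nearest-rank convention, which is one of several definitions in use and is an assumption made here; the function name and sample data are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of values at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]

response_times_ms = list(range(1, 101))  # hypothetical sample: 1 ms, 2 ms, ..., 100 ms
p95 = percentile(response_times_ms, 95)
median = percentile(response_times_ms, 50)
```

With this sample, 95 of the 100 requests completed in `p95` milliseconds or less.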
- primary key
A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index.
- quorum
The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing”.
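As a rough illustration, the arithmetic behind two common quorum choices can be written down directly. The sketch assumes n replicas, with w nodes confirming each write and r nodes answering each read; the function names are hypothetical.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read set of size r must share a node with every write set of size w."""
    return w + r > n

three_node_majority = majority(3)
strict = quorums_overlap(n=3, w=2, r=2)  # any read overlaps the latest write
loose = quorums_overlap(n=3, w=1, r=1)   # a read may miss the latest write
```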
- rebalance
To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions”.
- replication
Keeping a copy of the same data on several nodes (replicas) so that it remains accessible if a node becomes unreachable. See Chapter 5.
- schema
A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see “Schema flexibility in the document model”), and a schema can change over time (see Chapter 4).
- secondary index
An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Structures” and “Partitioning and Secondary Indexes”.
- serializable
A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability”.
- shared-nothing
An architecture in which independent nodes (each with their own CPUs, memory, and disks) are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See the introduction to Part II.
- skew
  - Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots. See “Skewed Workloads and Relieving Hot Spots” and “Handling skew”.
  - A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read”, write skew in “Write Skew and Phantoms”, and clock skew in “Timestamps for ordering events”.
- split brain
A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Outages” and “The Truth Is Defined by the Majority”.
- stored procedure
A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See “Actual Serial Execution”.
- stream process
A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.
- synchronous
The opposite of asynchronous.
- system of record
A system that holds the primary, authoritative version of some data, also known as the source of truth. Changes are first written here, and other datasets may be derived from the system of record. See the introduction to Part III.
- timeout
One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays”.
- total order
A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a partial order. See “The causal order is not a total order”.
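The difference between the two kinds of ordering can be seen with ordinary Python values. In this sketch, hypothetical (counter, node ID) pairs are compared as tuples, which gives a total order, while sets compared by the subset relation give only a partial order.

```python
# Total order: tuples compare lexicographically, so any two distinct pairs are ordered.
a = (3, "node-a")
b = (3, "node-b")
totally_comparable = (a < b) or (b < a) or (a == b)

# Partial order: under the subset relation, some sets cannot be compared at all.
x = {"w1", "w2"}
y = {"w2", "w3"}
incomparable = not (x <= y) and not (y <= x)
```

Here neither `x` nor `y` contains the other, so there is no meaningful answer to which one is greater.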
- transaction
Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See Chapter 7.
- two-phase commit (2PC)
An algorithm to ensure that several database nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)”.
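The all-or-nothing outcome can be sketched with a coordinator that first asks every participant to prepare and only then tells them to commit or abort. This is a deliberately simplified, in-memory illustration with hypothetical class and function names; it leaves out crashes, timeouts, and the durable logging that a real implementation depends on.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Phase 1: vote yes (and promise to commit if asked) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    if all(votes):
        for p in participants:                   # phase 2: everyone commits
            p.commit()
        return "committed"
    for p in participants:                       # phase 2: everyone aborts
        p.abort()
    return "aborted"

ok_nodes = [Participant("db1"), Participant("db2")]
ok_result = two_phase_commit(ok_nodes)

mixed_nodes = [Participant("db1"), Participant("db2", can_commit=False)]
mixed_result = two_phase_commit(mixed_nodes)
```

A single "no" vote is enough to abort the transaction on every node, including those that voted yes.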
- two-phase locking (2PL)
An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Locking (2PL)”.
- unbounded
Not having any known upper limit or size. The opposite of bounded.
Index
A
- aborts (transactions) , Transactions , Atomicity
- in two-phase commit , Introduction to two-phase commit
- performance of optimistic concurrency control , Performance of serializable snapshot isolation
- retrying aborted transactions , Handling errors and aborts
- abstraction , Simplicity: Managing Complexity , Data Models and Query Languages , Transactions , Summary , Consistency and Consensus
- access path (in network model) , The network model , The SPARQL query language
- accidental complexity, removing , Simplicity: Managing Complexity
- accountability , Responsibility and accountability
- ACID properties (transactions) , Transaction Processing or Analytics? , The Meaning of ACID
- atomicity , Atomicity , Single-Object and Multi-Object Operations
- consistency , Consistency , Maintaining integrity in the face of software bugs
- durability , Durability
- isolation , Isolation , Single-Object and Multi-Object Operations
- acknowledgements (messaging) , Acknowledgments and redelivery
- active/active replication ( see multi-leader replication)
- active/passive replication ( see leader-based replication)
- ActiveMQ (messaging) , Message brokers , Message brokers compared to databases
- distributed transaction support , XA transactions
- ActiveRecord (object-relational mapper) , The Object-Relational Mismatch , Handling errors and aborts
- actor model , Distributed actor frameworks
- ( see also message-passing)
- comparison to Pregel model , The Pregel processing model
- comparison to stream processing , Message passing and RPC
- Advanced Message Queuing Protocol ( see AMQP)
- aerospace systems , Reliability , Human Errors , Byzantine Faults , Membership services
- aggregation
- data cubes and materialized views , Aggregation: Data Cubes and Materialized Views
- in batch processes , GROUP BY
- in stream processes , Stream analytics
- aggregation pipeline query language , MapReduce Querying
- Agile , Evolvability: Making Change Easy
- minimizing irreversibility , Philosophy of batch process outputs , Reprocessing data for application evolution
- moving faster with confidence , The end-to-end argument again
- Unix philosophy , The Unix Philosophy
- agreement , Fault-Tolerant Consensus
- ( see also consensus)
- Airflow (workflow scheduler) , MapReduce workflows
- Ajax , Dataflow Through Services: REST and RPC
- Akka (actor framework) , Distributed actor frameworks
- algorithms
- algorithm correctness , Correctness of an algorithm
- B-trees , B-Trees - B-tree optimizations
- for distributed systems , System Model and Reality
- hash indexes , Hash Indexes - Hash Indexes
- mergesort , SSTables and LSM-Trees , Distributed execution of MapReduce , Sort-merge joins
- red-black trees , Constructing and maintaining SSTables
- SSTables and LSM-trees , SSTables and LSM-Trees - Performance optimizations
- all-to-all replication topologies , Multi-Leader Replication Topologies
- AllegroGraph (database) , Graph-Like Data Models
- ALTER TABLE statement (SQL) , Schema flexibility in the document model , Encoding and Evolution
- Amazon
- Dynamo (database) , Leaderless Replication
- Amazon Web Services (AWS) , Hardware Faults
- Kinesis Streams (messaging) , Using logs for message storage
- network reliability , Network Faults in Practice
- postmortems , Software Errors
- RedShift (database) , The divergence between OLTP databases and data warehouses
- S3 (object storage) , MapReduce and Distributed Filesystems
- checking data integrity , Don’t just blindly trust what they promise
- amplification
- of bias , Bias and discrimination
- of failures , Limitations of distributed transactions , Maintaining derived state
- of tail latency , Describing Performance , Partitioning Secondary Indexes by Document
- write amplification , Advantages of LSM-trees
- AMQP (Advanced Message Queuing Protocol) , Message brokers compared to databases
- ( see also messaging systems)
- comparison to log-based messaging , Logs compared to traditional messaging , Replaying old messages
- message ordering , Acknowledgments and redelivery
- analytics , Transaction Processing or Analytics?
- comparison to transaction processing , Transaction Processing or Analytics?
- data warehousing ( see data warehousing)
- parallel query execution in MPP databases , Comparing Hadoop to Distributed Databases
- predictive ( see predictive analytics)
- relation to batch processing , The Output of Batch Workflows
- schemas for , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- snapshot isolation for queries , Snapshot Isolation and Repeatable Read
- stream analytics , Stream analytics
- using MapReduce, analysis of user activity events (example) , Example: analysis of user activity events
- anti-caching (in-memory databases) , Keeping everything in memory
- anti-entropy , Read repair and anti-entropy
- Apache ActiveMQ ( see ActiveMQ)
- Apache Avro ( see Avro)
- Apache Beam ( see Beam)
- Apache BookKeeper ( see BookKeeper)
- Apache Cassandra ( see Cassandra)
- Apache CouchDB ( see CouchDB)
- Apache Curator ( see Curator)
- Apache Drill ( see Drill)
- Apache Flink ( see Flink)
- Apache Giraph ( see Giraph)
- Apache Hadoop ( see Hadoop)
- Apache HAWQ ( see HAWQ)
- Apache HBase ( see HBase)
- Apache Helix ( see Helix)
- Apache Hive ( see Hive)
- Apache Impala ( see Impala)
- Apache Jena ( see Jena)
- Apache Kafka ( see Kafka)
- Apache Lucene ( see Lucene)
- Apache MADlib ( see MADlib)
- Apache Mahout ( see Mahout)
- Apache Oozie ( see Oozie)
- Apache Parquet ( see Parquet)
- Apache Qpid ( see Qpid)
- Apache Samza ( see Samza)
- Apache Solr ( see Solr)
- Apache Spark ( see Spark)
- Apache Storm ( see Storm)
- Apache Tajo ( see Tajo)
- Apache Tez ( see Tez)
- Apache Thrift ( see Thrift)
- Apache ZooKeeper ( see ZooKeeper)
- Apama (stream analytics) , Complex event processing
- append-only B-trees , B-tree optimizations , Indexes and snapshot isolation
- append-only files ( see logs)
- Application Programming Interfaces (APIs) , Thinking About Data Systems , Data Models and Query Languages
- for batch processing , MapReduce workflows
- for change streams , API support for change streams
- for distributed transactions , XA transactions
- for graph processing , The Pregel processing model
- for services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- ( see also services)
- evolvability , Data encoding and evolution for RPC
- RESTful , Web services
- SOAP , Web services
- application state ( see state)
- approximate search ( see similarity search)
- archival storage, data from databases , Archival storage
- arcs ( see edges)
- arithmetic mean , Describing Performance
- ASCII text , Thrift and Protocol Buffers , A uniform interface
- ASN.1 (schema language) , The Merits of Schemas
- asynchronous networks , Unreliable Networks , Glossary
- comparison to synchronous networks , Synchronous Versus Asynchronous Networks
- formal model , System Model and Reality
- asynchronous replication , Synchronous Versus Asynchronous Replication , Glossary
- conflict detection , Synchronous versus asynchronous conflict detection
- data loss on failover , Leader failure: Failover
- reads from asynchronous follower , Problems with Replication Lag
- Asynchronous Transfer Mode (ATM) , Can we not simply make network delays predictable?
- atomic broadcast ( see total order broadcast)
- atomic clocks (caesium clocks) , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- ( see also clocks)
- atomicity (concurrency) , Glossary
- atomic increment-and-get , Implementing total order broadcast using linearizable storage
- compare-and-set , Compare-and-set , What Makes a System Linearizable?
- ( see also compare-and-set operations)
- replicated operations , Conflict resolution and replication
- write operations , Atomic write operations
- atomicity (transactions) , Atomicity , Single-Object and Multi-Object Operations , Glossary
- atomic commit , Distributed Transactions and Consensus
- avoiding , Multi-partition request processing , Coordination-avoiding data systems
- blocking and nonblocking , Three-phase commit
- in stream processing , Exactly-once message processing , Atomic commit revisited
- maintaining derived data , Keeping Systems in Sync
- for multi-object transactions , Single-Object and Multi-Object Operations
- for single-object writes , Single-object writes
- auditability , Trust, but Verify - Tools for auditable data systems
- designing for , Designing for auditability
- self-auditing systems , A culture of verification
- through immutability , Advantages of immutable events
- tools for auditable data systems , Tools for auditable data systems
- availability , Hardware Faults
- ( see also fault tolerance)
- in CAP theorem , The CAP theorem
- in service level agreements (SLAs) , Describing Performance
- Avro (data format) , Avro - Code generation and dynamically typed languages
- code generation , Code generation and dynamically typed languages
- dynamically generated schemas , Dynamically generated schemas
- object container files , But what is the writer’s schema? , Archival storage , Philosophy of batch process outputs
- reader determining writer’s schema , But what is the writer’s schema?
- schema evolution , The writer’s schema and the reader’s schema
- use in Hadoop , Philosophy of batch process outputs
- awk (Unix tool) , Simple Log Analysis
- AWS ( see Amazon Web Services)
- Azure ( see Microsoft)
B
- B-trees (indexes) , B-Trees - B-tree optimizations
- append-only/copy-on-write variants , B-tree optimizations , Indexes and snapshot isolation
- branching factor , B-Trees
- comparison to LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- crash recovery , Making B-trees reliable
- growing by splitting a page , B-Trees
- optimizations , B-tree optimizations
- similarity to dynamic partitioning , Dynamic partitioning
- backpressure , Messaging Systems , Glossary
- in TCP , Network congestion and queueing
- backups
- database snapshot for replication , Setting Up New Followers
- integrity of , Don’t just blindly trust what they promise
- snapshot isolation for , Snapshot Isolation and Repeatable Read
- use for ETL processes , Example: analysis of user activity events
- backward compatibility , Encoding and Evolution
- BASE, contrast to ACID , The Meaning of ACID
- bash shell (Unix) , Data Structures That Power Your Database , The Unix Philosophy , What’s missing?
- batch processing , Relational Model Versus Document Model , Batch Processing - Summary , Glossary
- combining with stream processing
- lambda architecture , The lambda architecture
- unifying technologies , Unifying batch and stream processing
- comparison to MPP databases , Comparing Hadoop to Distributed Databases - Designing for frequent faults
- comparison to stream processing , Processing Streams
- comparison to Unix , Philosophy of batch process outputs - Philosophy of batch process outputs
- dataflow engines , Dataflow engines - Discussion of materialization
- fault tolerance , Bringing related data together in the same place , Philosophy of batch process outputs , Fault tolerance , Messaging Systems
- for data integration , Batch and Stream Processing - Unifying batch and stream processing
- graphs and iterative processing , Graphs and Iterative Processing - Parallel execution
- high-level APIs and languages , MapReduce workflows , High-Level APIs and Languages - Specialization for different domains
- log-based messaging and , Replaying old messages
- maintaining derived state , Maintaining derived state
- MapReduce and distributed filesystems , MapReduce and Distributed Filesystems - Key-value stores as batch process output
- ( see also MapReduce)
- measuring performance , Describing Performance , Batch Processing
- outputs , The Output of Batch Workflows - Key-value stores as batch process output
- key-value stores , Key-value stores as batch process output
- search indexes , Building search indexes
- using Unix tools (example) , Batch Processing with Unix Tools - Sorting versus in-memory aggregation
- Bayou (database) , Uniqueness in log-based messaging
- Beam (dataflow library) , Unifying batch and stream processing
- bias , Bias and discrimination
- big ball of mud , Simplicity: Managing Complexity
- Bigtable data model , Data locality for queries , Column Compression
- binary data encodings , Binary encoding - The Merits of Schemas
- Avro , Avro - Code generation and dynamically typed languages
- MessagePack , Binary encoding - Binary encoding
- Thrift and Protocol Buffers , Thrift and Protocol Buffers - Datatypes and schema evolution
- binary encoding
- based on schemas , The Merits of Schemas
- by network drivers , The Merits of Schemas
- binary strings, lack of support in JSON and XML , JSON, XML, and Binary Variants
- BinaryProtocol encoding (Thrift) , Thrift and Protocol Buffers
- Bitcask (storage engine) , Hash Indexes
- crash recovery , Hash Indexes
- Bitcoin (cryptocurrency) , Tools for auditable data systems
- Byzantine fault tolerance , Byzantine Faults
- concurrency bugs in exchanges , Weak Isolation Levels
- bitmap indexes , Column Compression
- blockchains , Tools for auditable data systems
- Byzantine fault tolerance , Byzantine Faults
- blocking atomic commit , Three-phase commit
- Bloom (programming language) , Designing Applications Around Dataflow
- Bloom filter (algorithm) , Performance optimizations , Stream analytics
- BookKeeper (replicated log) , Allocating work to nodes
- Bottled Water (change data capture) , Implementing change data capture
- bounded datasets , Summary , Stream Processing , Glossary
- ( see also batch processing)
- bounded delays , Glossary
- in networks , Synchronous Versus Asynchronous Networks
- process pauses , Response time guarantees
- broadcast hash joins , Broadcast hash joins
- brokerless messaging , Direct messaging from producers to consumers
- Brubeck (metrics aggregator) , Direct messaging from producers to consumers
- BTM (transaction coordinator) , Introduction to two-phase commit
- bulk synchronous parallel (BSP) model , The Pregel processing model
- bursty network traffic patterns , Can we not simply make network delays predictable?
- business data processing , Relational Model Versus Document Model , Transaction Processing or Analytics? , Batch Processing
- byte sequence, encoding data in , Formats for Encoding Data
- Byzantine faults , Byzantine Faults - Weak forms of lying , System Model and Reality , Glossary
- Byzantine fault-tolerant systems , Byzantine Faults , Tools for auditable data systems
- Byzantine Generals Problem , Byzantine Faults
- consensus algorithms and , Fault-Tolerant Consensus
C
- caches , Keeping everything in memory , Glossary
- and materialized views , Aggregation: Data Cubes and Materialized Views
- as derived data , Derived Data , Composing Data Storage Technologies - What’s missing?
- database as cache of transaction log , State, Streams, and Immutability
- in CPUs , Memory bandwidth and vectorized processing , Linearizability and network delays , The move toward declarative query languages
- invalidation and maintenance , Keeping Systems in Sync , Maintaining materialized views
- linearizability , Linearizability
- CAP theorem , The CAP theorem - The CAP theorem , Glossary
- Cascading (batch processing) , Beyond MapReduce , High-Level APIs and Languages
- hash joins , Broadcast hash joins
- workflows , MapReduce workflows
- cascading failures , Software Errors , Operations: Automatic or Manual Rebalancing , Timeouts and Unbounded Delays
- Cascalog (batch processing) , The Foundation: Datalog
- Cassandra (database)
- column-family data model , Data locality for queries , Column Compression
- compaction strategy , Performance optimizations
- compound primary key , Partitioning by Hash of Key
- gossip protocol , Request Routing
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key
- last-write-wins conflict resolution , Last write wins (discarding concurrent writes) , Timestamps for ordering events
- leaderless replication , Leaderless Replication
- linearizability, lack of , Linearizability and quorums
- log-structured storage , Making an LSM-tree out of SSTables
- multi-datacenter support , Multi-datacenter operation
- partitioning scheme , Partitioning proportionally to nodes
- secondary indexes , Partitioning Secondary Indexes by Document
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- cat (Unix tool) , Simple Log Analysis
- causal context , Version vectors
- ( see also causal dependencies)
- causal dependencies , The “happens-before” relationship and concurrency - Version vectors
- capturing , Version vectors , Capturing causal dependencies , Ordering events to capture causality , Reads are events too
- by total ordering , The limits of total ordering
- causal ordering , Ordering and Causality
- in transactions , Decisions based on an outdated premise
- sending message to friends (example) , Ordering events to capture causality
- causality , Glossary
- causal ordering , Ordering and Causality - Capturing causal dependencies
- linearizability and , Linearizability is stronger than causal consistency
- total order consistent with , Sequence Number Ordering , Lamport timestamps
- consistency with , Sequence Number Ordering - Lamport timestamps
- consistent snapshots , Ordering and Causality
- happens-before relationship , The “happens-before” relationship and concurrency
- in serializable transactions , Decisions based on an outdated premise - Detecting writes that affect prior reads
- mismatch with clocks , Timestamps for ordering events
- ordering events to capture , Ordering events to capture causality
- violations of , Consistent Prefix Reads , Multi-Leader Replication Topologies , Timestamps for ordering events , Ordering and Causality
- with synchronized clocks , Synchronized clocks for global snapshots
- CEP ( see complex event processing)
- certificate transparency , Tools for auditable data systems
- chain replication , Synchronous Versus Asynchronous Replication
- linearizable reads , Implementing linearizable storage using total order broadcast
- change data capture , Logical (row-based) log replication , Change Data Capture
- API support for change streams , API support for change streams
- comparison to event sourcing , Event Sourcing
- implementing , Implementing change data capture
- initial snapshot , Initial snapshot
- log compaction , Log compaction
- changelogs , State, Streams, and Immutability
- change data capture , Change Data Capture
- for operator state , Rebuilding state after a failure
- generating with triggers , Implementing change data capture
- in stream joins , Stream-table join (stream enrichment)
- log compaction , Log compaction
- maintaining derived state , Databases and Streams
- Chaos Monkey , Reliability , Network Faults in Practice
- checkpointing
- in batch processors , Fault tolerance , Fault tolerance
- in high-performance computing , Cloud Computing and Supercomputing
- in stream processors , Microbatching and checkpointing , Multi-partition request processing
- chronicle data model , Event Sourcing
- circuit-switched networks , Synchronous Versus Asynchronous Networks
- circular buffers , Disk space usage
- circular replication topologies , Multi-Leader Replication Topologies
- clickstream data, analysis of , Example: analysis of user activity events
- clients
- calling services , Dataflow Through Services: REST and RPC
- pushing state changes to , Pushing state changes to clients
- request routing , Request Routing
- stateful and offline-capable , Clients with offline operation , Stateful, offline-capable clients
- clocks , Unreliable Clocks - Limiting the impact of garbage collection
- atomic (caesium) clocks , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- confidence interval , Clock readings have a confidence interval - Synchronized clocks for global snapshots
- for global snapshots , Synchronized clocks for global snapshots
- logical ( see logical clocks)
- skew , Relying on Synchronized Clocks - Clock readings have a confidence interval , Implementing Linearizable Systems
- slewing , Monotonic clocks
- synchronization and accuracy , Clock Synchronization and Accuracy - Clock Synchronization and Accuracy
- synchronization using GPS , Unreliable Clocks , Clock Synchronization and Accuracy , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- time-of-day versus monotonic clocks , Monotonic Versus Time-of-Day Clocks
- timestamping events , Whose clock are you using, anyway?
- cloud computing , Distributed Data , Cloud Computing and Supercomputing
- need for service discovery , Service discovery
- network glitches , Network Faults in Practice
- shared resources , Network congestion and queueing
- single-machine reliability , Hardware Faults
- Cloudera Impala ( see Impala)
- clustered indexes , Storing values within the index
- CODASYL model , The network model
- ( see also network model)
- code generation
- with Avro , Code generation and dynamically typed languages
- with Thrift and Protocol Buffers , Thrift and Protocol Buffers
- with WSDL , Web services
- collaborative editing
- multi-leader replication and , Collaborative editing
- column families (Bigtable) , Data locality for queries , Column Compression
- column-oriented storage , Column-Oriented Storage - Writing to Column-Oriented Storage
- column compression , Column Compression
- distinction between column families and , Column Compression
- in batch processors , The move toward declarative query languages
- Parquet , Column-Oriented Storage , Archival storage , Philosophy of batch process outputs
- sort order in , Sort Order in Column Storage - Several different sort orders
- vectorized processing , Memory bandwidth and vectorized processing , The move toward declarative query languages
- writing to , Writing to Column-Oriented Storage
- comma-separated values ( see CSV)
- command query responsibility segregation (CQRS) , Deriving several views from the same event log
- commands (event sourcing) , Commands and events
- commits (transactions) , Transactions
- atomic commit , Atomic Commit and Two-Phase Commit (2PC) - From single-node to distributed atomic commit
- ( see also atomicity; transactions)
- read committed isolation , Read Committed
- three-phase commit (3PC) , Three-phase commit
- two-phase commit (2PC) , Introduction to two-phase commit - Coordinator failure
- commutative operations , Conflict resolution and replication
- compaction
- of changelogs , Log compaction
- ( see also log compaction)
- for stream operator state , Rebuilding state after a failure
- of log-structured storage , Hash Indexes
- issues with , Downsides of LSM-trees
- size-tiered and leveled approaches , Performance optimizations
- CompactProtocol encoding (Thrift) , Thrift and Protocol Buffers
- compare-and-set operations , Compare-and-set , What Makes a System Linearizable?
- implementing locks , Membership and Coordination Services
- implementing uniqueness constraints , Constraints and uniqueness guarantees
- implementing with total order broadcast , Implementing linearizable storage using total order broadcast
- relation to consensus , Linearizability and quorums , Implementing linearizable storage using total order broadcast , Implementing total order broadcast using linearizable storage , Summary
- relation to transactions , Single-object writes
- compatibility , Encoding and Evolution , Modes of Dataflow
- calling services , Data encoding and evolution for RPC
- properties of encoding formats , Summary
- using databases , Dataflow Through Databases - Archival storage
- using message-passing , Distributed actor frameworks
- compensating transactions , From single-node to distributed atomic commit , Advantages of immutable events , Loosely interpreted constraints
- complex event processing (CEP) , Complex event processing
- complexity
- distilling in theoretical models , Mapping system models to the real world
- hiding using abstraction , Data Models and Query Languages
- of software systems, managing , Simplicity: Managing Complexity
- composing data systems ( see unbundling databases)
- compute-intensive applications , Reliable, Scalable, and Maintainable Applications , Cloud Computing and Supercomputing
- concatenated indexes , Multi-column indexes
- in Cassandra , Partitioning by Hash of Key
- Concord (stream processor) , Stream analytics
- concurrency
- actor programming model , Distributed actor frameworks , Message passing and RPC
- ( see also message-passing)
- bugs from weak transaction isolation , Weak Isolation Levels
- conflict resolution , Handling Write Conflicts , Custom conflict resolution logic
- detecting concurrent writes , Detecting Concurrent Writes - Version vectors
- dual writes, problems with , Keeping Systems in Sync
- happens-before relationship , The “happens-before” relationship and concurrency
- in replicated systems , Problems with Replication Lag - Version vectors , Linearizability - Linearizability and network delays
- lost updates , Preventing Lost Updates
- multi-version concurrency control (MVCC) , Implementing snapshot isolation
- optimistic concurrency control , Pessimistic versus optimistic concurrency control
- ordering of operations , What Makes a System Linearizable? , The causal order is not a total order
- reducing, through event logs , Implementing linearizable storage using total order broadcast , Concurrency control , Dataflow: Interplay between state changes and application code
- time and relativity , The “happens-before” relationship and concurrency
- transaction isolation , Isolation
- write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- conflict-free replicated datatypes (CRDTs) , Custom conflict resolution logic
- conflicts
- conflict detection , Synchronous versus asynchronous conflict detection
- causal dependencies , The “happens-before” relationship and concurrency , Capturing causal dependencies
- in consensus algorithms , Epoch numbering and quorums
- in leaderless replication , Detecting Concurrent Writes
- in log-based systems , Implementing linearizable storage using total order broadcast , Uniqueness constraints require consensus
- in nonlinearizable systems , Capturing causal dependencies
- in serializable snapshot isolation (SSI) , Detecting writes that affect prior reads
- in two-phase commit , A system of promises , Limitations of distributed transactions
- conflict resolution
- automatic conflict resolution , Custom conflict resolution logic
- by aborting transactions , Pessimistic versus optimistic concurrency control
- by apologizing , Loosely interpreted constraints
- convergence , Converging toward a consistent state - Custom conflict resolution logic
- in leaderless systems , Merging concurrently written values
- last write wins (LWW) , Last write wins (discarding concurrent writes) , Timestamps for ordering events
- using atomic operations , Conflict resolution and replication
- using custom logic , Custom conflict resolution logic
- determining what is a conflict , What is a conflict? , Uniqueness in log-based messaging
- in multi-leader replication , Handling Write Conflicts - What is a conflict?
- avoiding conflicts , Conflict avoidance
- lost updates , Preventing Lost Updates - Conflict resolution and replication
- materializing , Materializing conflicts
- relation to operation ordering , Ordering Guarantees
- write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- congestion (networks)
- avoidance , Network congestion and queueing
- limiting accuracy of clocks , Clock readings have a confidence interval
- queueing delays , Network congestion and queueing
- consensus , Consistency and Consensus , Fault-Tolerant Consensus - Summary , Glossary
- algorithms , Consensus algorithms and total order broadcast - Epoch numbering and quorums
- preventing split brain , Single-leader replication and consensus
- safety and liveness properties , Fault-Tolerant Consensus
- using linearizable operations , Implementing total order broadcast using linearizable storage
- cost of , Limitations of consensus
- distributed transactions , Distributed Transactions and Consensus - Summary
- in practice , Distributed Transactions in Practice - Limitations of distributed transactions
- two-phase commit , Atomic Commit and Two-Phase Commit (2PC) - Three-phase commit
- XA transactions , XA transactions - Limitations of distributed transactions
- impossibility of , Distributed Transactions and Consensus
- membership and coordination services , Membership and Coordination Services - Membership services
- relation to compare-and-set , Linearizability and quorums , Implementing linearizable storage using total order broadcast , Implementing total order broadcast using linearizable storage , Summary
- relation to replication , Synchronous Versus Asynchronous Replication , Using total order broadcast
- relation to uniqueness constraints , Uniqueness constraints require consensus
- consistency , Consistency , Timeliness and Integrity
- across different databases , Leader failure: Failover , Keeping Systems in Sync , Deriving several views from the same event log , Derived data versus distributed transactions
- causal , Ordering and Causality - Timestamp ordering is not sufficient , Ordering events to capture causality
- consistent prefix reads , Consistent Prefix Reads - Consistent Prefix Reads
- consistent snapshots , Setting Up New Followers , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion , Synchronized clocks for global snapshots , Initial snapshot , Creating an index
- ( see also snapshots)
- crash recovery , Making B-trees reliable
- enforcing constraints ( see constraints)
- eventual , Problems with Replication Lag , Consistency Guarantees
- ( see also eventual consistency)
- in ACID transactions , Consistency , Maintaining integrity in the face of software bugs
- in CAP theorem , The CAP theorem
- linearizability , Linearizability - Linearizability and network delays
- meanings of , Consistency
- monotonic reads , Monotonic Reads - Monotonic Reads
- of secondary indexes , The need for multi-object transactions , Indexes and snapshot isolation , Atomic Commit and Two-Phase Commit (2PC) , Reasoning about dataflows , Creating an index
- ordering guarantees , Ordering Guarantees - Implementing total order broadcast using linearizable storage
- read-after-write , Reading Your Own Writes - Reading Your Own Writes
- sequential , Implementing linearizable storage using total order broadcast
- strong ( see linearizability)
- timeliness and integrity , Timeliness and Integrity
- using quorums , Limitations of Quorum Consistency , Linearizability and quorums
- consistent hashing , Partitioning by Hash of Key
- consistent prefix reads , Consistent Prefix Reads
- constraints (databases) , Consistency , Characterizing write skew
- asynchronously checked , Loosely interpreted constraints
- coordination avoidance , Coordination-avoiding data systems
- ensuring idempotence , Operation identifiers
- in log-based systems , Enforcing Constraints - Multi-partition request processing
- across multiple partitions , Multi-partition request processing
- in two-phase commit , From single-node to distributed atomic commit , A system of promises
- relation to consensus , Summary , Uniqueness constraints require consensus
- relation to event ordering , Timestamp ordering is not sufficient
- requiring linearizability , Constraints and uniqueness guarantees
- Consul (service discovery) , Service discovery
- consumers (message streams) , Message brokers , Transmitting Event Streams
- backpressure , Messaging Systems
- consumer offsets in logs , Consumer offsets
- failures , Acknowledgments and redelivery , Consumer offsets
- fan-out , Describing Load , Multiple consumers , Logs compared to traditional messaging
- load balancing , Multiple consumers , Logs compared to traditional messaging
- not keeping up with producers , Messaging Systems , Disk space usage , Making unbundling work
- context switches , Describing Performance , Process Pauses
- convergence (conflict resolution) , Converging toward a consistent state - Custom conflict resolution logic , Consistency Guarantees
- coordination
- avoidance , Coordination-avoiding data systems
- cross-datacenter , Multi-datacenter operation , The limits of total ordering
- cross-partition ordering , Partitioning , Synchronized clocks for global snapshots , Total Order Broadcast , Multi-partition request processing
- services , Locking and leader election , Membership and Coordination Services - Membership services
- coordinator (in 2PC) , Introduction to two-phase commit
- failure , Coordinator failure
- in XA transactions , XA transactions - Limitations of distributed transactions
- recovery , Recovering from coordinator failure
- copy-on-write (B-trees) , B-tree optimizations , Indexes and snapshot isolation
- CORBA (Common Object Request Broker Architecture) , The problems with remote procedure calls (RPCs)
- correctness , Thinking About Data Systems
- auditability , Trust, but Verify - Tools for auditable data systems
- Byzantine fault tolerance , Byzantine Faults , Tools for auditable data systems
- dealing with partial failures , Faults and Partial Failures
- in log-based systems , Enforcing Constraints - Multi-partition request processing
- of algorithm within system model , Correctness of an algorithm
- of compensating transactions , From single-node to distributed atomic commit
- of consensus , Epoch numbering and quorums
- of derived data , The lambda architecture , Designing for auditability
- of immutable data , Advantages of immutable events
- of personal data , Responsibility and accountability , Privacy and use of data
- of time , Multi-Leader Replication Topologies , Clock Synchronization and Accuracy - Synchronized clocks for global snapshots
- of transactions , Consistency , Aiming for Correctness , Maintaining integrity in the face of software bugs
- timeliness and integrity , Timeliness and Integrity - Coordination-avoiding data systems
- corruption of data
- detecting , The end-to-end argument , Don’t just blindly trust what they promise - Tools for auditable data systems
- due to pathological memory access , Trust, but Verify
- due to radiation , Byzantine Faults
- due to split brain , Leader failure: Failover , The leader and the lock
- due to weak transaction isolation , Weak Isolation Levels
- formalization in consensus , Consensus algorithms and total order broadcast
- integrity as absence of , Timeliness and Integrity
- network packets , Weak forms of lying
- on disks , Durability
- preventing using write-ahead logs , Making B-trees reliable
- recovering from , Philosophy of batch process outputs , Advantages of immutable events
- Couchbase (database)
- durability , Keeping everything in memory
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Fixed number of partitions
- rebalancing , Operations: Automatic or Manual Rebalancing
- request routing , Request Routing
- CouchDB (database)
- B-tree storage , Indexes and snapshot isolation
- change feed , API support for change streams
- document data model , The Object-Relational Mismatch
- join support , Many-to-One and Many-to-Many Relationships
- MapReduce support , MapReduce Querying , Distributed execution of MapReduce
- replication , Clients with offline operation , Custom conflict resolution logic
- covering indexes , Storing values within the index
- CPUs
- cache coherence and memory barriers , Linearizability and network delays
- caching and pipelining , Memory bandwidth and vectorized processing , The move toward declarative query languages
- increasing parallelism , Query Languages for Data
- CRDTs ( see conflict-free replicated datatypes)
- CREATE INDEX statement (SQL) , Other Indexing Structures , Creating an index
- credit rating agencies , Responsibility and accountability
- Crunch (batch processing) , Beyond MapReduce , High-Level APIs and Languages
- hash joins , Broadcast hash joins
- sharded joins , Handling skew
- workflows , MapReduce workflows
- cryptography
- defense against attackers , Byzantine Faults
- end-to-end encryption and authentication , The end-to-end argument , Legislation and self-regulation
- proving integrity of data , Tools for auditable data systems
- CSS (Cascading Style Sheets) , Declarative Queries on the Web
- CSV (comma-separated values) , Data Structures That Power Your Database , JSON, XML, and Binary Variants , A uniform interface
- Curator (ZooKeeper recipes) , Locking and leader election , Allocating work to nodes
- curl (Unix tool) , Current directions for RPC , Separation of logic and wiring
- cursor stability , Atomic write operations
- Cypher (query language) , The Cypher Query Language
- comparison to SPARQL , The SPARQL query language
D
- data corruption ( see corruption of data)
- data cubes , Aggregation: Data Cubes and Materialized Views
- data formats ( see encoding)
- data integration , Data Integration - Unifying batch and stream processing , Summary
- batch and stream processing , Batch and Stream Processing - Unifying batch and stream processing
- lambda architecture , The lambda architecture
- maintaining derived state , Maintaining derived state
- reprocessing data , Reprocessing data for application evolution
- unifying , Unifying batch and stream processing
- by unbundling databases , Unbundling Databases - Multi-partition data processing
- comparison to federated databases , The meta-database of everything
- combining tools by deriving data , Combining Specialized Tools by Deriving Data - Ordering events to capture causality
- derived data versus distributed transactions , Derived data versus distributed transactions
- limits of total ordering , The limits of total ordering
- ordering events to capture causality , Ordering events to capture causality
- reasoning about dataflows , Reasoning about dataflows
- need for , Derived Data
- data lakes , Diversity of storage
- data locality ( see locality)
- data models , Data Models and Query Languages - Summary
- graph-like models , Graph-Like Data Models - The Foundation: Datalog
- Datalog language , The Foundation: Datalog - The Foundation: Datalog
- property graphs , Property Graphs
- RDF and triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- query languages , Query Languages for Data - MapReduce Querying
- relational model versus document model , Relational Model Versus Document Model - Convergence of document and relational databases
- data protection regulations , Legislation and self-regulation
- data systems , Reliable, Scalable, and Maintainable Applications
- about , Thinking About Data Systems
- concerns when designing , Thinking About Data Systems
- future of , The Future of Data Systems - Summary
- correctness, constraints, and integrity , Aiming for Correctness - Tools for auditable data systems
- data integration , Data Integration - Unifying batch and stream processing
- unbundling databases , Unbundling Databases - Multi-partition data processing
- heterogeneous, keeping in sync , Keeping Systems in Sync
- maintainability , Maintainability - Evolvability: Making Change Easy
- possible faults in , Transactions
- reliability , Reliability - How Important Is Reliability?
- hardware faults , Hardware Faults
- human errors , Human Errors
- importance of , How Important Is Reliability?
- software errors , Software Errors
- scalability , Scalability - Approaches for Coping with Load
- unreliable clocks , Unreliable Clocks - Limiting the impact of garbage collection
- data warehousing , Data Warehousing - Stars and Snowflakes: Schemas for Analytics , Glossary
- comparison to data lakes , Diversity of storage
- ETL (extract-transform-load) , Data Warehousing , Diversity of storage , Keeping Systems in Sync
- keeping data systems in sync , Keeping Systems in Sync
- schema design , Stars and Snowflakes: Schemas for Analytics
- slowly changing dimension (SCD) , Time-dependence of joins
- data-intensive applications , Reliable, Scalable, and Maintainable Applications
- database triggers ( see triggers)
- database-internal distributed transactions , Distributed Transactions in Practice , Limitations of distributed transactions , Atomic commit revisited
- databases
- archival storage , Archival storage
- comparison of message brokers to , Message brokers compared to databases
- dataflow through , Dataflow Through Databases
- end-to-end argument for , The end-to-end argument - Applying end-to-end thinking in data systems
- checking integrity , The end-to-end argument again
- inside-out , Designing Applications Around Dataflow
- ( see also unbundling databases)
- output from batch workflows , Key-value stores as batch process output
- relation to event streams , Databases and Streams - Limitations of immutability
- ( see also changelogs)
- API support for change streams , API support for change streams , Separation of application code and state
- change data capture , Change Data Capture - API support for change streams
- event sourcing , Event Sourcing - Commands and events
- keeping systems in sync , Keeping Systems in Sync - Keeping Systems in Sync
- philosophy of immutable events , State, Streams, and Immutability - Limitations of immutability
- unbundling , Unbundling Databases - Multi-partition data processing
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- designing applications around dataflow , Designing Applications Around Dataflow - Stream processors and services
- observing derived state , Observing Derived State - Multi-partition data processing
- datacenters
- geographically distributed , Distributed Data , Reading Your Own Writes , Unreliable Networks , The limits of total ordering
- multi-tenancy and shared resources , Network congestion and queueing
- network architecture , Cloud Computing and Supercomputing
- network faults , Network Faults in Practice
- replication across multiple , Multi-datacenter operation
- leaderless replication , Multi-datacenter operation
- multi-leader replication , Multi-datacenter operation , The Cost of Linearizability
- dataflow , Modes of Dataflow - Distributed actor frameworks , Designing Applications Around Dataflow - Stream processors and services
- correctness of dataflow systems , Correctness of dataflow systems
- differential , What’s missing?
- message-passing , Message-Passing Dataflow - Distributed actor frameworks
- reasoning about , Reasoning about dataflows
- through databases , Dataflow Through Databases
- through services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- dataflow engines , Dataflow engines - Discussion of materialization
- comparison to stream processing , Processing Streams
- directed acyclic graphs (DAG) , Graphs and Iterative Processing
- partitioning, approach to , Summary
- support for declarative queries , The move toward declarative query languages
- Datalog (query language) , The Foundation: Datalog - The Foundation: Datalog
- datatypes
- binary strings in XML and JSON , JSON, XML, and Binary Variants
- conflict-free , Custom conflict resolution logic
- in Avro encodings , Avro
- in Thrift and Protocol Buffers , Datatypes and schema evolution
- numbers in XML and JSON , JSON, XML, and Binary Variants
- Datomic (database)
- B-tree storage , Indexes and snapshot isolation
- data model , Graph-Like Data Models , The semantic web
- Datalog query language , The Foundation: Datalog
- excision (deleting data) , Limitations of immutability
- languages for transactions , Pros and cons of stored procedures
- serial execution of transactions , Actual Serial Execution
- deadlocks
- detection, in two-phase commit (2PC) , Limitations of distributed transactions
- in two-phase locking (2PL) , Implementation of two-phase locking
- Debezium (change data capture) , Implementing change data capture
- declarative languages , Query Languages for Data , Glossary
- Bloom , Designing Applications Around Dataflow
- CSS and XSL , Declarative Queries on the Web
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog
- for batch processing , The move toward declarative query languages
- recursive SQL queries , Graph Queries in SQL
- relational algebra and SQL , Query Languages for Data
- SPARQL , The SPARQL query language
- delays
- bounded network delays , Synchronous Versus Asynchronous Networks
- bounded process pauses , Response time guarantees
- unbounded network delays , Timeouts and Unbounded Delays
- unbounded process pauses , Process Pauses
- deleting data , Limitations of immutability
-
denormalization (data representation) , Many-to-One and Many-to-Many Relationships , Glossary
- costs , Which data model leads to simpler application code?
- in derived data systems , Derived Data
- materialized views , Aggregation: Data Cubes and Materialized Views
- updating derived data , Single-Object and Multi-Object Operations , The need for multi-object transactions , Combining Specialized Tools by Deriving Data
- versus normalization , Deriving several views from the same event log
-
derived data , Derived Data , Stream Processing , Glossary
- from change data capture , Implementing change data capture
- in event sourcing , Deriving current state from the event log - Deriving current state from the event log
- maintaining derived state through logs , Databases and Streams - API support for change streams , State, Streams, and Immutability - Concurrency control
- observing, by subscribing to streams , End-to-end event streams
- outputs of batch and stream processing , Batch and Stream Processing
- through application code , Application code as a derivation function
- versus distributed transactions , Derived data versus distributed transactions
-
deterministic operations , Pros and cons of stored procedures , Faults and Partial Failures , Glossary
- accidental nondeterminism , Fault tolerance
- and fault tolerance , Fault tolerance , Fault tolerance
- and idempotence , Idempotence , Reasoning about dataflows
- computing derived data , Maintaining derived state , Correctness of dataflow systems , Designing for auditability
- in state machine replication , Using total order broadcast , Databases and Streams , Deriving current state from the event log
- joins , Time-dependence of joins
- DevOps , The Unix Philosophy
- differential dataflow , What’s missing?
- dimension tables , Stars and Snowflakes: Schemas for Analytics
- dimensional modeling ( see star schemas)
- directed acyclic graphs (DAGs) , Graphs and Iterative Processing
- dirty reads (transaction isolation) , No dirty reads
- dirty writes (transaction isolation) , No dirty writes
- discrimination , Bias and discrimination
- disks ( see hard disks)
- distributed actor frameworks , Distributed actor frameworks
-
distributed filesystems , MapReduce and Distributed Filesystems - MapReduce and Distributed Filesystems
- decoupling from query engines , Diversity of processing models
- indiscriminately dumping data into , Diversity of storage
- use by MapReduce , MapReduce workflows
-
distributed systems , The Trouble with Distributed Systems - Summary , Glossary
- Byzantine faults , Byzantine Faults - Weak forms of lying
- cloud versus supercomputing , Cloud Computing and Supercomputing
- detecting network faults , Detecting Faults
- faults and partial failures , Faults and Partial Failures - Cloud Computing and Supercomputing
- formalization of consensus , Fault-Tolerant Consensus
- impossibility results , The CAP theorem , Distributed Transactions and Consensus
- issues with failover , Leader failure: Failover
- limitations of distributed transactions , Limitations of distributed transactions
- multi-datacenter , Multi-datacenter operation , The Cost of Linearizability
- network problems , Unreliable Networks - Can we not simply make network delays predictable?
- quorums, relying on , The Truth Is Defined by the Majority
- reasons for using , Distributed Data , Replication
- synchronized clocks, relying on , Relying on Synchronized Clocks - Synchronized clocks for global snapshots
- system models , System Model and Reality - Mapping system models to the real world
- use of clocks and time , Unreliable Clocks
- distributed transactions ( see transactions)
- Django (web framework) , Handling errors and aborts
- DNS (Domain Name System) , Request Routing , Service discovery
- Docker (container manager) , Separation of application code and state
-
document data model , The Object-Relational Mismatch - Convergence of document and relational databases
- comparison to relational model , Relational Versus Document Databases Today - Convergence of document and relational databases
- document references , Comparison to document databases , Reduce-Side Joins and Grouping
- document-oriented databases , The Object-Relational Mismatch
- many-to-many relationships and joins , Are Document Databases Repeating History?
- multi-object transactions, need for , The need for multi-object transactions
-
versus relational model
- convergence of models , Convergence of document and relational databases
- data locality , Data locality for queries
- document-partitioned indexes , Partitioning Secondary Indexes by Document , Summary , Building search indexes
- domain-driven design (DDD) , Event Sourcing
- DRBD (Distributed Replicated Block Device) , Leaders and Followers
- drift (clocks) , Clock Synchronization and Accuracy
- Drill (query engine) , The divergence between OLTP databases and data warehouses
- Druid (database) , Deriving several views from the same event log
- Dryad (dataflow engine) , Dataflow engines
- dual writes, problems with , Keeping Systems in Sync , Dataflow: Interplay between state changes and application code
-
duplicates, suppression of , Duplicate suppression
- ( see also idempotence)
- using a unique ID , Operation identifiers , Multi-partition request processing
- durability (transactions) , Durability , Glossary
-
duration (time) , Unreliable Clocks
- measurement with monotonic clocks , Monotonic clocks
- dynamic partitioning , Dynamic partitioning
-
dynamically typed languages
- analogy to schema-on-read , Schema flexibility in the document model
- code generation and , Code generation and dynamically typed languages
- Dynamo-style databases ( see leaderless replication)
E
-
edges (in graphs) , Graph-Like Data Models , Reduce-Side Joins and Grouping
- property graph model , Property Graphs
- edit distance (full-text search) , Full-text search and fuzzy indexes
-
effectively-once semantics , Fault Tolerance , Exactly-once execution of an operation
- ( see also exactly-once semantics)
- preservation of integrity , Correctness of dataflow systems
- elastic systems , Approaches for Coping with Load
-
Elasticsearch (search server)
- document-partitioned indexes , Partitioning Secondary Indexes by Document
- partition rebalancing , Fixed number of partitions
- percolator (stream search) , Search on streams
- usage example , Thinking About Data Systems
- use of Lucene , Making an LSM-tree out of SSTables
- ElephantDB (database) , Key-value stores as batch process output
- Elm (programming language) , Designing Applications Around Dataflow , End-to-end event streams
-
encodings (data formats) , Encoding and Evolution - The Merits of Schemas
- Avro , Avro - Code generation and dynamically typed languages
- binary variants of JSON and XML , Binary encoding
-
compatibility , Encoding and Evolution
- calling services , Data encoding and evolution for RPC
- using databases , Dataflow Through Databases - Archival storage
- using message-passing , Distributed actor frameworks
- defined , Formats for Encoding Data
- JSON, XML, and CSV , JSON, XML, and Binary Variants
- language-specific formats , Language-Specific Formats
- merits of schemas , The Merits of Schemas
- representations of data , Formats for Encoding Data
- Thrift and Protocol Buffers , Thrift and Protocol Buffers - Datatypes and schema evolution
-
end-to-end argument , Cloud Computing and Supercomputing , The end-to-end argument - Applying end-to-end thinking in data systems
- checking integrity , The end-to-end argument again
- publish/subscribe streams , End-to-end event streams
- enrichment (stream) , Stream-table join (stream enrichment)
- Enterprise JavaBeans (EJB) , The problems with remote procedure calls (RPCs)
- entities ( see vertices)
- epoch (consensus algorithms) , Epoch numbering and quorums
- epoch (Unix timestamps) , Time-of-day clocks
- equi-joins , Reduce-Side Joins and Grouping
- erasure coding (error correction) , MapReduce and Distributed Filesystems
- Erlang OTP (actor framework) , Distributed actor frameworks
-
error handling
- for network faults , Network Faults in Practice
- in transactions , Handling errors and aborts
- error-correcting codes , Cloud Computing and Supercomputing , MapReduce and Distributed Filesystems
- Esper (CEP engine) , Complex event processing
-
etcd (coordination service) , Membership and Coordination Services - Membership services
- linearizable operations , Implementing Linearizable Systems
- locks and leader election , Locking and leader election
- quorum reads , Implementing linearizable storage using total order broadcast
- service discovery , Service discovery
- use of Raft algorithm , Using total order broadcast , Distributed Transactions and Consensus
- Ethereum (blockchain) , Tools for auditable data systems
-
Ethernet (networks) , Cloud Computing and Supercomputing , Unreliable Networks , Can we not simply make network delays predictable?
- packet checksums , Weak forms of lying , The end-to-end argument
- Etherpad (collaborative editor) , Collaborative editing
-
ethics , Doing the Right Thing - Legislation and self-regulation
- code of ethics and professional practice , Doing the Right Thing
- legislation and self-regulation , Legislation and self-regulation
-
predictive analytics , Predictive Analytics - Feedback loops
- amplifying bias , Bias and discrimination
- feedback loops , Feedback loops
-
privacy and tracking , Privacy and Tracking - Legislation and self-regulation
- consent and freedom of choice , Consent and freedom of choice
- data as assets and power , Data as assets and power
- meaning of privacy , Privacy and use of data
- surveillance , Surveillance
- respect, dignity, and agency , Legislation and self-regulation , Summary
- unintended consequences , Doing the Right Thing , Feedback loops
-
ETL (extract-transform-load) , Data Warehousing , Example: analysis of user activity events , Keeping Systems in Sync , Glossary
- use of Hadoop for , Diversity of storage
-
event sourcing , Event Sourcing - Commands and events
- commands and events , Commands and events
- comparison to change data capture , Event Sourcing
- comparison to lambda architecture , The lambda architecture
- deriving current state from event log , Deriving current state from the event log
- immutability and auditability , State, Streams, and Immutability , Designing for auditability
- large, reliable data systems , Operation identifiers , Correctness of dataflow systems
- Event Store (database) , Event Sourcing
- event streams ( see streams)
-
events , Transmitting Event Streams
- deciding on total order of , The limits of total ordering
- deriving views from event log , Deriving several views from the same event log
- difference to commands , Commands and events
- event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
- immutable, advantages of , Advantages of immutable events , Designing for auditability
- ordering to capture causality , Ordering events to capture causality
- reads as , Reads are events too
- stragglers , Knowing when you’re ready , The lambda architecture
- timestamp of, in stream processing , Whose clock are you using, anyway?
- EventSource (browser API) , Pushing state changes to clients
-
eventual consistency , Replication , Problems with Replication Lag , Safety and liveness , Consistency Guarantees
- ( see also conflicts)
- and perpetual inconsistency , Timeliness and Integrity
-
evolvability , Evolvability: Making Change Easy , Encoding and Evolution
- calling services , Data encoding and evolution for RPC
- graph-structured data , Property Graphs
- of databases , Schema flexibility in the document model , Dataflow Through Databases - Archival storage , Deriving several views from the same event log , Reprocessing data for application evolution
- of message-passing , Distributed actor frameworks
- reprocessing data , Reprocessing data for application evolution , Unifying batch and stream processing
- schema evolution in Avro , The writer’s schema and the reader’s schema
- schema evolution in Thrift and Protocol Buffers , Field tags and schema evolution
- schema-on-read , Schema flexibility in the document model , Encoding and Evolution , The Merits of Schemas
-
exactly-once semantics , Exactly-once message processing , Fault Tolerance , Exactly-once execution of an operation
- parity with batch processors , Unifying batch and stream processing
- preservation of integrity , Correctness of dataflow systems
- exclusive mode (locks) , Implementation of two-phase locking
- eXtended Architecture transactions ( see XA transactions)
- extract-transform-load ( see ETL)
F
-
Facebook
- Presto (query engine) , The divergence between OLTP databases and data warehouses
- React, Flux, and Redux (user interface libraries) , End-to-end event streams
- social graphs , Graph-Like Data Models
- Wormhole (change data capture) , Implementing change data capture
- fact tables , Stars and Snowflakes: Schemas for Analytics
-
failover , Leader failure: Failover , Glossary
- ( see also leader-based replication)
- in leaderless replication, absence of , Writing to the Database When a Node Is Down
- leader election , The leader and the lock , Total Order Broadcast , Distributed Transactions and Consensus
- potential problems , Leader failure: Failover
-
failures
- amplification by distributed transactions , Limitations of distributed transactions , Maintaining derived state
-
failure detection , Detecting Faults
- automatic rebalancing causing cascading failures , Operations: Automatic or Manual Rebalancing
- perfect failure detectors , Three-phase commit
- timeouts and unbounded delays , Timeouts and Unbounded Delays , Network congestion and queueing
- using ZooKeeper , Membership and Coordination Services
- faults versus , Reliability
- partial failures in distributed systems , Faults and Partial Failures - Cloud Computing and Supercomputing , Summary
- fan-out (messaging systems) , Describing Load , Multiple consumers
-
fault tolerance , Reliability - How Important Is Reliability? , Glossary
- abstractions for , Consistency and Consensus
-
formalization in consensus , Fault-Tolerant Consensus - Limitations of consensus
- use of replication , Single-leader replication and consensus
- human fault tolerance , Philosophy of batch process outputs
- in batch processing , Bringing related data together in the same place , Philosophy of batch process outputs , Fault tolerance , Fault tolerance
- in log-based systems , Applying end-to-end thinking in data systems , Timeliness and Integrity - Correctness of dataflow systems
-
in stream processing , Fault Tolerance - Rebuilding state after a failure
- atomic commit , Atomic commit revisited
- idempotence , Idempotence
- maintaining derived state , Maintaining derived state
- microbatching and checkpointing , Microbatching and checkpointing
- rebuilding state after a failure , Rebuilding state after a failure
- of distributed transactions , XA transactions - Limitations of distributed transactions
- transaction atomicity , Atomicity , Atomic Commit and Two-Phase Commit (2PC) - Exactly-once message processing
-
faults , Reliability
- Byzantine faults , Byzantine Faults - Weak forms of lying
- failures versus , Reliability
- handled by transactions , Transactions
- handling in supercomputers and cloud computing , Cloud Computing and Supercomputing
- hardware , Hardware Faults
- in batch processing versus distributed databases , Designing for frequent faults
- in distributed systems , Faults and Partial Failures - Cloud Computing and Supercomputing
- introducing deliberately , Reliability , Network Faults in Practice
-
network faults , Network Faults in Practice - Detecting Faults
- asymmetric faults , The Truth Is Defined by the Majority
- detecting , Detecting Faults
- tolerance of, in multi-leader replication , Multi-datacenter operation
- software errors , Software Errors
- tolerating ( see fault tolerance)
- federated databases , The meta-database of everything
- fence (CPU instruction) , Linearizability and network delays
-
fencing (preventing split brain) , Leader failure: Failover , The leader and the lock - Fencing tokens
- generating fencing tokens , Using total order broadcast , Membership and Coordination Services
- properties of fencing tokens , Correctness of an algorithm
- stream processors writing to databases , Idempotence , Exactly-once execution of an operation
- Fibre Channel (networks) , MapReduce and Distributed Filesystems
- field tags (Thrift and Protocol Buffers) , Thrift and Protocol Buffers - Field tags and schema evolution
- file descriptors (Unix) , A uniform interface
- financial data , Advantages of immutable events
- Firebase (database) , API support for change streams
-
Flink (processing framework) , Dataflow engines - Discussion of materialization
- dataflow APIs , High-Level APIs and Languages
- fault tolerance , Fault tolerance , Microbatching and checkpointing , Rebuilding state after a failure
- Gelly API (graph processing) , The Pregel processing model
- integration of batch and stream processing , Batch and Stream Processing , Unifying batch and stream processing
- machine learning , Specialization for different domains
- query optimizer , The move toward declarative query languages
- stream processing , Stream analytics
- flow control , Network congestion and queueing , Messaging Systems , Glossary
- FLP result (on consensus) , Distributed Transactions and Consensus
- FlumeJava (dataflow library) , MapReduce workflows , High-Level APIs and Languages
-
followers , Leaders and Followers , Glossary
- ( see also leader-based replication)
- foreign keys , Comparison to document databases , Reduce-Side Joins and Grouping
- forward compatibility , Encoding and Evolution
- forward decay (algorithm) , Describing Performance
-
Fossil (version control system) , Limitations of immutability
- shunning (deleting data) , Limitations of immutability
-
FoundationDB (database)
- serializable transactions , Serializable Snapshot Isolation (SSI) , Performance of serializable snapshot isolation , Limitations of distributed transactions
- fractal trees , B-tree optimizations
- full table scans , Reduce-Side Joins and Grouping
-
full-text search , Glossary
- and fuzzy indexes , Full-text search and fuzzy indexes
- building search indexes , Building search indexes
- Lucene storage engine , Making an LSM-tree out of SSTables
- functional reactive programming (FRP) , Designing Applications Around Dataflow
- functional requirements , Summary
- futures (asynchronous operations) , Current directions for RPC
- fuzzy search ( see similarity search)
G
-
garbage collection
- immutability and , Limitations of immutability
-
process pauses for , Describing Performance , Process Pauses - Limiting the impact of garbage collection , The Truth Is Defined by the Majority
- ( see also process pauses)
- genome analysis , Summary , Specialization for different domains
- geographically distributed datacenters , Distributed Data , Reading Your Own Writes , Unreliable Networks , The limits of total ordering
- geospatial indexes , Multi-column indexes
- Giraph (graph processing) , The Pregel processing model
- Git (version control system) , Custom conflict resolution logic , The causal order is not a total order , Limitations of immutability
- GitHub, postmortems , Leader failure: Failover , Leader failure: Failover , Mapping system models to the real world
- global indexes ( see term-partitioned indexes)
- GlusterFS (distributed filesystem) , MapReduce and Distributed Filesystems
- GNU Coreutils (Linux) , Sorting versus in-memory aggregation
-
GoldenGate (change data capture) , Trigger-based replication , Multi-datacenter operation , Implementing change data capture
- ( see also Oracle)
-
Google
-
Bigtable (database)
- data model ( see Bigtable data model)
- partitioning scheme , Partitioning , Partitioning by Key Range
- storage layout , Making an LSM-tree out of SSTables
- Chubby (lock service) , Membership and Coordination Services
-
Cloud Dataflow (stream processor) , Stream analytics , Atomic commit revisited , Unifying batch and stream processing
- ( see also Beam)
- Cloud Pub/Sub (messaging) , Message brokers compared to databases , Using logs for message storage
- Docs (collaborative editor) , Collaborative editing
- Dremel (query engine) , The divergence between OLTP databases and data warehouses , Column-Oriented Storage
- FlumeJava (dataflow library) , MapReduce workflows , High-Level APIs and Languages
- GFS (distributed file system) , MapReduce and Distributed Filesystems
- gRPC (RPC framework) , Current directions for RPC
-
MapReduce (batch processing) , Batch Processing
- ( see also MapReduce)
- building search indexes , Building search indexes
- task preemption , Designing for frequent faults
- Pregel (graph processing) , The Pregel processing model
- Spanner ( see Spanner)
- TrueTime (clock API) , Clock readings have a confidence interval
- gossip protocol , Request Routing
- government use of data , Data as assets and power
-
GPS (Global Positioning System)
- use for clock synchronization , Unreliable Clocks , Clock Synchronization and Accuracy , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- GraphChi (graph processing) , Parallel execution
-
graphs , Glossary
-
as data models , Graph-Like Data Models - The Foundation: Datalog
- example of graph-structured data , Graph-Like Data Models
- property graphs , Property Graphs
- RDF and triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- versus the network model , The SPARQL query language
-
processing and analysis , Graphs and Iterative Processing - Parallel execution
- fault tolerance , Fault tolerance
- Pregel processing model , The Pregel processing model
-
query languages
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog - The Foundation: Datalog
- recursive SQL queries , Graph Queries in SQL
- SPARQL , The SPARQL query language - The SPARQL query language
- Gremlin (graph query language) , Graph-Like Data Models
- grep (Unix tool) , Simple Log Analysis
- GROUP BY clause (SQL) , GROUP BY
-
grouping records in MapReduce , GROUP BY
- handling skew , Handling skew
H
-
Hadoop (data infrastructure)
- comparison to distributed databases , Batch Processing
- comparison to MPP databases , Comparing Hadoop to Distributed Databases - Designing for frequent faults
- comparison to Unix , Philosophy of batch process outputs - Philosophy of batch process outputs , Unbundling Databases
- diverse processing models in ecosystem , Diversity of processing models
- HDFS distributed filesystem ( see HDFS)
- higher-level tools , MapReduce workflows
-
join algorithms , Reduce-Side Joins and Grouping - MapReduce workflows with map-side joins
- ( see also MapReduce)
- MapReduce ( see MapReduce)
- YARN ( see YARN)
-
happens-before relationship , Ordering and Causality
- capturing , Capturing the happens-before relationship
- concurrency and , The “happens-before” relationship and concurrency
-
hard disks
- access patterns , Advantages of LSM-trees
- detecting corruption , The end-to-end argument , Don’t just blindly trust what they promise
- faults in , Hardware Faults , Durability
- sequential write throughput , Hash Indexes , Disk space usage
- hardware faults , Hardware Faults
-
hash indexes , Hash Indexes - Hash Indexes
- broadcast hash joins , Broadcast hash joins
- partitioned hash joins , Partitioned hash joins
-
hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Summary
- consistent hashing , Partitioning by Hash of Key
- problems with hash mod N , How not to do it: hash mod N
- range queries , Partitioning by Hash of Key
- suitable hash functions , Partitioning by Hash of Key
- with fixed number of partitions , Fixed number of partitions
- HAWQ (database) , Specialization for different domains
-
HBase (database)
- bug due to lack of fencing , The leader and the lock
- bulk loading , Key-value stores as batch process output
- column-family data model , Data locality for queries , Column Compression
- dynamic partitioning , Dynamic partitioning
- key-range partitioning , Partitioning by Key Range
- log-structured storage , Making an LSM-tree out of SSTables
- request routing , Request Routing
- size-tiered compaction , Performance optimizations
- use of HDFS , Diversity of processing models
- use of ZooKeeper , Membership and Coordination Services
-
HDFS (Hadoop Distributed File System) , MapReduce and Distributed Filesystems - MapReduce and Distributed Filesystems
- ( see also distributed filesystems)
- checking data integrity , Don’t just blindly trust what they promise
- decoupling from query engines , Diversity of processing models
- indiscriminately dumping data into , Diversity of storage
- metadata about datasets , MapReduce workflows with map-side joins
- NameNode , MapReduce and Distributed Filesystems
- use by Flink , Rebuilding state after a failure
- use by HBase , Dynamic partitioning
- use by MapReduce , MapReduce workflows
- HdrHistogram (numerical library) , Describing Performance
- head (Unix tool) , Simple Log Analysis
- head vertex (property graphs) , Property Graphs
- head-of-line blocking , Describing Performance
- heap files (databases) , Storing values within the index
- Helix (cluster manager) , Request Routing
- heterogeneous distributed transactions , Distributed Transactions in Practice , Limitations of distributed transactions
- heuristic decisions (in 2PC) , Recovering from coordinator failure
- Hibernate (object-relational mapper) , The Object-Relational Mismatch
- hierarchical model , Are Document Databases Repeating History?
- high availability ( see fault tolerance)
- high-frequency trading , Clock Synchronization and Accuracy , Limiting the impact of garbage collection
- high-performance computing (HPC) , Cloud Computing and Supercomputing
- hinted handoff , Sloppy Quorums and Hinted Handoff
- histograms , Describing Performance
-
Hive (query engine) , Beyond MapReduce , High-Level APIs and Languages
- for data warehouses , The divergence between OLTP databases and data warehouses
- HCatalog and metastore , MapReduce workflows with map-side joins
- map-side joins , Broadcast hash joins
- query optimizer , The move toward declarative query languages
- skewed joins , Handling skew
- workflows , MapReduce workflows
- Hollerith machines , Batch Processing
-
hopping windows (stream processing) , Types of windows
- ( see also windows)
- horizontal scaling ( see scaling out)
-
HornetQ (messaging) , Message brokers , Message brokers compared to databases
- distributed transaction support , XA transactions
-
hot spots , Partitioning of Key-Value Data
- due to celebrities , Skewed Workloads and Relieving Hot Spots
- for time-series data , Partitioning by Key Range
- in batch processing , Handling skew
- relieving , Skewed Workloads and Relieving Hot Spots
- hot standbys ( see leader-based replication)
- HTTP, use in APIs ( see services)
- human errors , Human Errors , Network Faults in Practice , Philosophy of batch process outputs
- HyperDex (database) , Multi-column indexes
- HyperLogLog (algorithm) , Stream analytics
I
- I/O operations, waiting for , Process Pauses
-
IBM
-
DB2 (database)
- distributed transaction support , XA transactions
- recursive query support , Graph Queries in SQL
- serializable isolation , Repeatable read and naming confusion , Implementation of two-phase locking
- XML and JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
- electromechanical card-sorting machines , Batch Processing
-
IMS (database) , Are Document Databases Repeating History?
- imperative query APIs , Declarative Queries on the Web
- InfoSphere Streams (CEP engine) , Complex event processing
-
MQ (messaging) , Message brokers compared to databases
- distributed transaction support , XA transactions
- System R (database) , The Slippery Concept of a Transaction
- WebSphere (messaging) , Message brokers
-
idempotence , The problems with remote procedure calls (RPCs) , Idempotence , Glossary
- by giving operations unique IDs , Operation identifiers , Multi-partition request processing
- idempotent operations , Exactly-once execution of an operation
-
immutability
- advantages of , Advantages of immutable events , Designing for auditability
- deriving state from event log , State, Streams, and Immutability - Limitations of immutability
- for crash recovery , Hash Indexes
- in B-trees , B-tree optimizations , Indexes and snapshot isolation
- in event sourcing , Event Sourcing
- inputs to Unix commands , Transparency and experimentation
- limitations of , Limitations of immutability
-
Impala (query engine)
- for data warehouses , The divergence between OLTP databases and data warehouses
- hash joins , Broadcast hash joins
- native code generation , The move toward declarative query languages
- use of HDFS , Diversity of processing models
- impedance mismatch , The Object-Relational Mismatch
-
imperative languages , Query Languages for Data
- setting element styles (example) , Declarative Queries on the Web
-
in doubt (transaction status) , Coordinator failure
- holding locks , Holding locks while in doubt
- orphaned transactions , Recovering from coordinator failure
-
in-memory databases , Keeping everything in memory
- durability , Durability
- serial transaction execution , Actual Serial Execution
-
incidents
- cascading failures , Software Errors
- crashes due to leap seconds , Clock Synchronization and Accuracy
- data corruption and financial losses due to concurrency bugs , Weak Isolation Levels
- data corruption on hard disks , Durability
- data loss due to last-write-wins , Converging toward a consistent state , Timestamps for ordering events
- data on disks unreadable , Mapping system models to the real world
- deleted items reappearing , Custom conflict resolution logic
- disclosure of sensitive data due to primary key reuse , Leader failure: Failover
- errors in transaction serializability , Maintaining integrity in the face of software bugs
- gigabit network interface with 1 Kb/s throughput , Summary
- network faults , Network Faults in Practice
- network interface dropping only inbound packets , Network Faults in Practice
- network partitions and whole-datacenter failures , Faults and Partial Failures
- poor handling of network faults , Network Faults in Practice
- sending message to ex-partner , Ordering events to capture causality
- sharks biting undersea cables , Network Faults in Practice
- split brain due to 1-minute packet delay , Leader failure: Failover , Network Faults in Practice
- vibrations in server rack , Describing Performance
- violation of uniqueness constraint , Maintaining integrity in the face of software bugs
-
indexes , Data Structures That Power Your Database , Glossary
- and snapshot isolation , Indexes and snapshot isolation
- as derived data , Derived Data , Composing Data Storage Technologies - What’s missing?
- B-trees , B-Trees - B-tree optimizations
- building in batch processes , Building search indexes
- clustered , Storing values within the index
- comparison of B-trees and LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- concatenated , Multi-column indexes
- covering (with included columns) , Storing values within the index
- creating , Creating an index
- full-text search , Full-text search and fuzzy indexes
- geospatial , Multi-column indexes
- hash , Hash Indexes - Hash Indexes
- index-range locking , Index-range locks
- multi-column , Multi-column indexes
- partitioning and secondary indexes , Partitioning and Secondary Indexes - Partitioning Secondary Indexes by Term , Summary
- secondary , Other Indexing Structures
- ( see also secondary indexes)
- problems with dual writes , Keeping Systems in Sync , Reasoning about dataflows
- SSTables and LSM-trees , SSTables and LSM-Trees - Performance optimizations
- updating when data changes , Keeping Systems in Sync , Maintaining materialized views
- Industrial Revolution , Remembering the Industrial Revolution
- InfiniBand (networks) , Can we not simply make network delays predictable?
- InfiniteGraph (database) , Graph-Like Data Models
- InnoDB (storage engine)
- clustered index on primary key , Storing values within the index
- not preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Implementation of two-phase locking
- serializable isolation , Implementation of two-phase locking
- snapshot isolation support , Snapshot Isolation and Repeatable Read
- inside-out databases , Designing Applications Around Dataflow
- ( see also unbundling databases)
- integrating different data systems ( see data integration)
- integrity , Timeliness and Integrity
- coordination-avoiding data systems , Coordination-avoiding data systems
- correctness of dataflow systems , Correctness of dataflow systems
- in consensus formalization , Fault-Tolerant Consensus
- integrity checks , Don’t just blindly trust what they promise
- ( see also auditing)
- end-to-end , The end-to-end argument , The end-to-end argument again
- use of snapshot isolation , Snapshot Isolation and Repeatable Read
- maintaining despite software bugs , Maintaining integrity in the face of software bugs
- Interface Definition Language (IDL) , Thrift and Protocol Buffers , Avro
- intermediate state, materialization of , Materialization of Intermediate State - Discussion of materialization
- internet services, systems for implementing , Cloud Computing and Supercomputing
- invariants , Consistency
- ( see also constraints)
- inversion of control , Separation of logic and wiring
- IP (Internet Protocol)
- unreliability of , Cloud Computing and Supercomputing
- ISDN (Integrated Services Digital Network) , Synchronous Versus Asynchronous Networks
- isolation (in transactions) , Isolation , Single-Object and Multi-Object Operations , Glossary
- correctness and , Aiming for Correctness
- for single-object writes , Single-object writes
- serializability , Serializability - Performance of serializable snapshot isolation
- actual serial execution , Actual Serial Execution - Summary of serial execution
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- violating , Single-Object and Multi-Object Operations
- weak isolation levels , Weak Isolation Levels - Materializing conflicts
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
- read committed , Read Committed - Implementing read committed
- snapshot isolation , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion
- iterative processing , Graphs and Iterative Processing - Parallel execution
J
- Java Database Connectivity (JDBC)
- distributed transaction support , XA transactions
- network drivers , The Merits of Schemas
- Java Enterprise Edition (EE) , The problems with remote procedure calls (RPCs) , Introduction to two-phase commit , XA transactions
- Java Message Service (JMS) , Message brokers compared to databases
- ( see also messaging systems)
- comparison to log-based messaging , Logs compared to traditional messaging , Replaying old messages
- distributed transaction support , XA transactions
- message ordering , Acknowledgments and redelivery
- Java Transaction API (JTA) , Introduction to two-phase commit , XA transactions
- Java Virtual Machine (JVM)
- bytecode generation , The move toward declarative query languages
- garbage collection pauses , Process Pauses
- process reuse in batch processors , Dataflow engines
- JavaScript
- in MapReduce querying , MapReduce Querying
- setting element styles (example) , Declarative Queries on the Web
- use in advanced queries , MapReduce Querying
- Jena (RDF framework) , The RDF data model
- Jepsen (fault tolerance testing) , Aiming for Correctness
- jitter (network delay) , Network congestion and queueing
- joins , Glossary
- by index lookup , Reduce-Side Joins and Grouping
- expressing as relational operators , The move toward declarative query languages
- in relational and document databases , Many-to-One and Many-to-Many Relationships
- MapReduce map-side joins , Map-Side Joins - MapReduce workflows with map-side joins
- broadcast hash joins , Broadcast hash joins
- merge joins , Map-side merge joins
- partitioned hash joins , Partitioned hash joins
- MapReduce reduce-side joins , Reduce-Side Joins and Grouping - Handling skew
- handling skew , Handling skew
- sort-merge joins , Sort-merge joins
- parallel execution of , Comparing Hadoop to Distributed Databases
- secondary indexes and , Other Indexing Structures
- stream joins , Stream Joins - Time-dependence of joins
- stream-stream join , Stream-stream join (window join)
- stream-table join , Stream-table join (stream enrichment)
- table-table join , Table-table join (materialized view maintenance)
- time-dependence of , Time-dependence of joins
- support in document databases , Convergence of document and relational databases
- JOTM (transaction coordinator) , Introduction to two-phase commit
- JSON
- Avro schema representation , Avro
- binary variants , Binary encoding
- for application data, issues with , JSON, XML, and Binary Variants
- in relational databases , The Object-Relational Mismatch , Convergence of document and relational databases
- representing a résumé (example) , The Object-Relational Mismatch
- Juttle (query language) , Designing Applications Around Dataflow
K
- k-nearest neighbors , Specialization for different domains
- Kafka (messaging) , Message brokers , Using logs for message storage
- Kafka Connect (database integration) , API support for change streams , Deriving several views from the same event log
- Kafka Streams (stream processor) , Stream analytics , Maintaining materialized views
- fault tolerance , Rebuilding state after a failure
- leader-based replication , Leaders and Followers
- log compaction , Log compaction , Maintaining materialized views
- message offsets , Using logs for message storage , Idempotence
- request routing , Request Routing
- transaction support , Atomic commit revisited
- usage example , Thinking About Data Systems
- Ketama (partitioning library) , Partitioning proportionally to nodes
- key-value stores , Data Structures That Power Your Database
- as batch process output , Key-value stores as batch process output
- hash indexes , Hash Indexes - Hash Indexes
- in-memory , Keeping everything in memory
- partitioning , Partitioning of Key-Value Data - Skewed Workloads and Relieving Hot Spots
- by hash of key , Partitioning by Hash of Key , Summary
- by key range , Partitioning by Key Range , Summary
- dynamic partitioning , Dynamic partitioning
- skew and hot spots , Skewed Workloads and Relieving Hot Spots
- Kryo (Java) , Language-Specific Formats
- Kubernetes (cluster manager) , Designing for frequent faults , Separation of application code and state
L
- lambda architecture , The lambda architecture
- Lamport timestamps , Lamport timestamps
- Large Hadron Collider (LHC) , Summary
- last write wins (LWW) , Converging toward a consistent state , Implementing Linearizable Systems
- discarding concurrent writes , Last write wins (discarding concurrent writes)
- problems with , Timestamps for ordering events
- prone to lost updates , Conflict resolution and replication
- late binding , Separation of logic and wiring
- latency
- instability under two-phase locking , Performance of two-phase locking
- network latency and resource utilization , Can we not simply make network delays predictable?
- response time versus , Describing Performance
- tail latency , Describing Performance , Partitioning Secondary Indexes by Document
- leader-based replication , Leaders and Followers - Trigger-based replication
- ( see also replication)
- failover , Leader failure: Failover , The leader and the lock
- handling node outages , Handling Node Outages
- implementation of replication logs
- change data capture , Change Data Capture - API support for change streams
- ( see also changelogs)
- statement-based , Statement-based replication
- trigger-based replication , Trigger-based replication
- write-ahead log (WAL) shipping , Write-ahead log (WAL) shipping
- linearizability of operations , Implementing Linearizable Systems
- locking and leader election , Locking and leader election
- log sequence number , Setting Up New Followers , Consumer offsets
- read-scaling architecture , Problems with Replication Lag
- relation to consensus , Single-leader replication and consensus
- setting up new followers , Setting Up New Followers
- synchronous versus asynchronous , Synchronous Versus Asynchronous Replication - Synchronous Versus Asynchronous Replication
- leaderless replication , Leaderless Replication - Version vectors
- ( see also replication)
- detecting concurrent writes , Detecting Concurrent Writes - Version vectors
- capturing happens-before relationship , Capturing the happens-before relationship
- happens-before relationship and concurrency , The “happens-before” relationship and concurrency
- last write wins , Last write wins (discarding concurrent writes)
- merging concurrently written values , Merging concurrently written values
- version vectors , Version vectors
- multi-datacenter , Multi-datacenter operation
- quorums , Quorums for reading and writing - Limitations of Quorum Consistency
- consistency limitations , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
- read repair and anti-entropy , Read repair and anti-entropy
- leap seconds , Software Errors , Clock Synchronization and Accuracy
- in time-of-day clocks , Time-of-day clocks
- leases , Process Pauses
- implementation with ZooKeeper , Membership and Coordination Services
- need for fencing , The leader and the lock
- ledgers , Advantages of immutable events
- distributed ledger technologies , Tools for auditable data systems
- legacy systems, maintenance of , Maintainability
- less (Unix tool) , Transparency and experimentation
- LevelDB (storage engine) , Making an LSM-tree out of SSTables
- leveled compaction , Performance optimizations
- Levenshtein automata , Full-text search and fuzzy indexes
- limping (partial failure) , Summary
- linearizability , Linearizability - Linearizability and network delays , Glossary
- cost of , The Cost of Linearizability - Linearizability and network delays
- CAP theorem , The CAP theorem
- memory on multi-core CPUs , Linearizability and network delays
- definition , What Makes a System Linearizable? - What Makes a System Linearizable?
- implementing with total order broadcast , Implementing linearizable storage using total order broadcast
- in ZooKeeper , Membership and Coordination Services
- of derived data systems , Derived data versus distributed transactions , Timeliness and Integrity
- avoiding coordination , Coordination-avoiding data systems
- of different replication methods , Implementing Linearizable Systems - Linearizability and quorums
- using quorums , Linearizability and quorums
- relying on , Relying on Linearizability - Cross-channel timing dependencies
- constraints and uniqueness , Constraints and uniqueness guarantees
- cross-channel timing dependencies , Cross-channel timing dependencies
- locking and leader election , Locking and leader election
- stronger than causal consistency , Linearizability is stronger than causal consistency
- using to implement total order broadcast , Implementing total order broadcast using linearizable storage
- versus serializability , What Makes a System Linearizable?
- LinkedIn
- Azkaban (workflow scheduler) , MapReduce workflows
- Databus (change data capture) , Trigger-based replication , Implementing change data capture
- Espresso (database) , The Object-Relational Mismatch , But what is the writer’s schema? , Different values written at different times , Leaders and Followers , Request Routing
- Helix (cluster manager) ( see Helix)
- profile (example) , The Object-Relational Mismatch
- reference to company entity (example) , Many-to-One and Many-to-Many Relationships
- Rest.li (RPC framework) , Current directions for RPC
- Voldemort (database) ( see Voldemort)
- Linux, leap second bug , Software Errors , Clock Synchronization and Accuracy
- liveness properties , Safety and liveness
- LMDB (storage engine) , B-tree optimizations , Indexes and snapshot isolation
- load
- approaches to coping with , Approaches for Coping with Load
- describing , Describing Load
- load testing , Describing Performance
- load balancing (messaging) , Multiple consumers
- local indexes ( see document-partitioned indexes)
- locality (data access) , The Object-Relational Mismatch , Data locality for queries , Glossary
- in batch processing , Distributed execution of MapReduce , Example: analysis of user activity events , Dataflow engines
- in stateful clients , Clients with offline operation , Stateful, offline-capable clients
- in stream processing , Stream-table join (stream enrichment) , Rebuilding state after a failure , Stream processors and services , Uniqueness in log-based messaging
- location transparency , The problems with remote procedure calls (RPCs)
- in the actor model , Distributed actor frameworks
- locks , Glossary
- deadlock , Implementation of two-phase locking
- distributed locking , The leader and the lock - Fencing tokens , Locking and leader election
- fencing tokens , Fencing tokens
- implementation with ZooKeeper , Membership and Coordination Services
- relation to consensus , Summary
- for transaction isolation
- in snapshot isolation , Implementing snapshot isolation
- in two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- making operations atomic , Atomic write operations
- performance , Performance of two-phase locking
- preventing dirty writes , Implementing read committed
- preventing phantoms with index-range locks , Index-range locks , Detecting writes that affect prior reads
- read locks (shared mode) , Implementing read committed , Implementation of two-phase locking
- shared mode and exclusive mode , Implementation of two-phase locking
- in two-phase commit (2PC)
- deadlock detection , Limitations of distributed transactions
- in-doubt transactions holding locks , Holding locks while in doubt
- materializing conflicts with , Materializing conflicts
- preventing lost updates by explicit locking , Explicit locking
- log sequence number , Setting Up New Followers , Consumer offsets
- logic programming languages , Designing Applications Around Dataflow
- logical clocks , Timestamps for ordering events , Sequence Number Ordering , Ordering events to capture causality
- for read-after-write consistency , Reading Your Own Writes
- logical logs , Logical (row-based) log replication
- logs (data structure) , Data Structures That Power Your Database , Glossary
- advantages of immutability , Advantages of immutable events
- compaction , Hash Indexes , Performance optimizations , Log compaction , State, Streams, and Immutability
- for stream operator state , Rebuilding state after a failure
- creating using total order broadcast , Using total order broadcast
- implementing uniqueness constraints , Uniqueness in log-based messaging
- log-based messaging , Partitioned Logs - Replaying old messages
- comparison to traditional messaging , Logs compared to traditional messaging , Replaying old messages
- consumer offsets , Consumer offsets
- disk space usage , Disk space usage
- replaying old messages , Replaying old messages , Reprocessing data for application evolution , Unifying batch and stream processing
- slow consumers , When consumers cannot keep up with producers
- using logs for message storage , Using logs for message storage
- log-structured storage , Data Structures That Power Your Database - Performance optimizations
- log-structured merge tree ( see LSM-trees)
- replication , Leaders and Followers , Implementation of Replication Logs - Trigger-based replication
- change data capture , Change Data Capture - API support for change streams
- ( see also changelogs)
- coordination with snapshot , Setting Up New Followers
- logical (row-based) replication , Logical (row-based) log replication
- statement-based replication , Statement-based replication
- trigger-based replication , Trigger-based replication
- write-ahead log (WAL) shipping , Write-ahead log (WAL) shipping
- scalability limits , The limits of total ordering
- loose coupling , Separation of logic and wiring , Materialization of Intermediate State , Making unbundling work
- lost updates ( see updates)
- LSM-trees (indexes) , Making an LSM-tree out of SSTables - Performance optimizations
- comparison to B-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- Lucene (storage engine) , Making an LSM-tree out of SSTables
- building indexes in batch processes , Building search indexes
- similarity search , Full-text search and fuzzy indexes
- Luigi (workflow scheduler) , MapReduce workflows
- LWW ( see last write wins)
M
- machine learning
- ethical considerations , Bias and discrimination
- ( see also ethics)
- iterative processing , Graphs and Iterative Processing
- models derived from training data , Application code as a derivation function
- statistical and numerical algorithms , Specialization for different domains
- MADlib (machine learning toolkit) , Specialization for different domains
- magic scaling sauce , Approaches for Coping with Load
- Mahout (machine learning toolkit) , Specialization for different domains
- maintainability , Maintainability - Evolvability: Making Change Easy , The Future of Data Systems
- defined , Summary
- design principles for software systems , Maintainability
- evolvability ( see evolvability)
- operability , Operability: Making Life Easy for Operations
- simplicity and managing complexity , Simplicity: Managing Complexity
- many-to-many relationships
- in document model versus relational model , Which data model leads to simpler application code?
- modeling as graphs , Graph-Like Data Models
- many-to-one and many-to-many relationships , Many-to-One and Many-to-Many Relationships - Many-to-One and Many-to-Many Relationships
- many-to-one relationships , Many-to-One and Many-to-Many Relationships
- MapReduce (batch processing) , Batch Processing , MapReduce Job Execution - MapReduce Job Execution
- accessing external services within job , Example: analysis of user activity events , Key-value stores as batch process output
- comparison to distributed databases
- designing for frequent faults , Designing for frequent faults
- diversity of processing models , Diversity of processing models
- diversity of storage , Diversity of storage
- comparison to stream processing , Processing Streams
- comparison to Unix , Philosophy of batch process outputs - Philosophy of batch process outputs
- disadvantages and limitations of , Beyond MapReduce
- fault tolerance , Bringing related data together in the same place , Philosophy of batch process outputs , Fault tolerance
- higher-level tools , MapReduce workflows , High-Level APIs and Languages
- implementation in Hadoop , Distributed execution of MapReduce - MapReduce workflows
- the shuffle , Distributed execution of MapReduce
- implementation in MongoDB , MapReduce Querying - MapReduce Querying
- machine learning , Specialization for different domains
- map-side processing , Map-Side Joins - MapReduce workflows with map-side joins
- broadcast hash joins , Broadcast hash joins
- merge joins , Map-side merge joins
- partitioned hash joins , Partitioned hash joins
- mapper and reducer functions , MapReduce Job Execution
- materialization of intermediate state , Materialization of Intermediate State - Discussion of materialization
- output of batch workflows , The Output of Batch Workflows - Key-value stores as batch process output
- building search indexes , Building search indexes
- key-value stores , Key-value stores as batch process output
- reduce-side processing , Reduce-Side Joins and Grouping - Handling skew
- analysis of user activity events (example) , Example: analysis of user activity events
- grouping records by same key , GROUP BY
- handling skew , Handling skew
- sort-merge joins , Sort-merge joins
- workflows , MapReduce workflows
- marshalling ( see encoding)
- massively parallel processing (MPP) , Parallel Query Execution
- comparison to composing storage technologies , Unbundled versus integrated systems
- comparison to Hadoop , Comparing Hadoop to Distributed Databases - Designing for frequent faults , The move toward declarative query languages
- master-master replication ( see multi-leader replication)
- master-slave replication ( see leader-based replication)
- materialization , Glossary
- aggregate values , Aggregation: Data Cubes and Materialized Views
- conflicts , Materializing conflicts
- intermediate state (batch processing) , Materialization of Intermediate State - Discussion of materialization
- materialized views , Aggregation: Data Cubes and Materialized Views
- as derived data , Derived Data , Composing Data Storage Technologies - What’s missing?
- maintaining, using stream processing , Maintaining materialized views , Table-table join (materialized view maintenance)
- Maven (Java build tool) , The move toward declarative query languages
- Maxwell (change data capture) , Implementing change data capture
- mean , Describing Performance
- media monitoring , Search on streams
- median , Describing Performance
- meeting room booking (example) , More examples of write skew , Predicate locks , Enforcing Constraints
- membership services , Membership services
- Memcached (caching server) , Thinking About Data Systems , Keeping everything in memory
- memory
- in-memory databases , Keeping everything in memory
- durability , Durability
- serial transaction execution , Actual Serial Execution
- in-memory representation of data , Formats for Encoding Data
- random bit-flips in , Trust, but Verify
- use by indexes , Hash Indexes , SSTables and LSM-Trees
- memory barrier (CPU instruction) , Linearizability and network delays
- MemSQL (database)
- in-memory storage , Keeping everything in memory
- read committed isolation , Implementing read committed
- memtable (in LSM-trees) , Constructing and maintaining SSTables
- Mercurial (version control system) , Limitations of immutability
- merge joins, MapReduce map-side , Map-side merge joins
- mergeable persistent data structures , Custom conflict resolution logic
- merging sorted files , SSTables and LSM-Trees , Distributed execution of MapReduce , Sort-merge joins
- Merkle trees , Tools for auditable data systems
- Mesos (cluster manager) , Designing for frequent faults , Separation of application code and state
- message brokers ( see messaging systems)
- message-passing , Message-Passing Dataflow - Distributed actor frameworks
- advantages over direct RPC , Message-Passing Dataflow
- distributed actor frameworks , Distributed actor frameworks
- evolvability , Distributed actor frameworks
- MessagePack (encoding format) , Binary encoding
- messages
- exactly-once semantics , Exactly-once message processing , Fault Tolerance
- loss of , Messaging Systems
- using total order broadcast , Total Order Broadcast
- messaging systems , Stream Processing - Replaying old messages
- ( see also streams)
- backpressure, buffering, or dropping messages , Messaging Systems
- brokerless messaging , Direct messaging from producers to consumers
- event logs , Partitioned Logs - Replaying old messages
- comparison to traditional messaging , Logs compared to traditional messaging , Replaying old messages
- consumer offsets , Consumer offsets
- replaying old messages , Replaying old messages , Reprocessing data for application evolution , Unifying batch and stream processing
- slow consumers , When consumers cannot keep up with producers
- message brokers , Message brokers - Acknowledgments and redelivery
- acknowledgments and redelivery , Acknowledgments and redelivery
- comparison to event logs , Logs compared to traditional messaging , Replaying old messages
- multiple consumers of same topic , Multiple consumers
- reliability , Messaging Systems
- uniqueness in log-based messaging , Uniqueness in log-based messaging
- Meteor (web framework) , API support for change streams
- microbatching , Microbatching and checkpointing , Batch and Stream Processing
- microservices , Dataflow Through Services: REST and RPC
- ( see also services)
- causal dependencies across services , The limits of total ordering
- loose coupling , Making unbundling work
- relation to batch/stream processors , Batch Processing , Stream processors and services
- Microsoft
- Azure Service Bus (messaging) , Message brokers compared to databases
- Azure Storage , Synchronous Versus Asynchronous Replication , MapReduce and Distributed Filesystems
- Azure Stream Analytics , Stream analytics
- DCOM (Distributed Component Object Model) , The problems with remote procedure calls (RPCs)
- MSDTC (transaction coordinator) , Introduction to two-phase commit
- Orleans ( see Orleans)
- SQL Server ( see SQL Server)
- migrating (rewriting) data , Schema flexibility in the document model , Different values written at different times , Deriving several views from the same event log , Reprocessing data for application evolution
- modulus operator (%) , How not to do it: hash mod N
- MongoDB (database)
- aggregation pipeline , MapReduce Querying
- atomic operations , Atomic write operations
- BSON , Data locality for queries
- document data model , The Object-Relational Mismatch
- hash partitioning (sharding) , Partitioning by Hash of Key - Partitioning by Hash of Key
- key-range partitioning , Partitioning by Key Range
- lack of join support , Many-to-One and Many-to-Many Relationships , Convergence of document and relational databases
- leader-based replication , Leaders and Followers
- MapReduce support , MapReduce Querying , Distributed execution of MapReduce
- oplog parsing , Implementing change data capture , API support for change streams
- partition splitting , Dynamic partitioning
- request routing , Request Routing
- secondary indexes , Partitioning Secondary Indexes by Document
- Mongoriver (change data capture) , Implementing change data capture
- monitoring , Human Errors , Operability: Making Life Easy for Operations
- monotonic clocks , Monotonic clocks
- monotonic reads , Monotonic Reads
- MPP ( see massively parallel processing)
- MSMQ (messaging) , XA transactions
- multi-column indexes , Multi-column indexes
- multi-leader replication , Multi-Leader Replication - Multi-Leader Replication Topologies
- ( see also replication)
- handling write conflicts , Handling Write Conflicts
- conflict avoidance , Conflict avoidance
- converging toward a consistent state , Converging toward a consistent state
- custom conflict resolution logic , Custom conflict resolution logic
- determining what is a conflict , What is a conflict?
- linearizability, lack of , Implementing Linearizable Systems
- replication topologies , Multi-Leader Replication Topologies - Multi-Leader Replication Topologies
- use cases , Use Cases for Multi-Leader Replication
- clients with offline operation , Clients with offline operation
- collaborative editing , Collaborative editing
- multi-datacenter replication , Multi-datacenter operation , The Cost of Linearizability
- multi-object transactions , Single-Object and Multi-Object Operations
- need for , The need for multi-object transactions
- Multi-Paxos (total order broadcast) , Consensus algorithms and total order broadcast
- multi-table index cluster tables (Oracle) , Data locality for queries
- multi-tenancy , Network congestion and queueing
- multi-version concurrency control (MVCC) , Implementing snapshot isolation , Summary
- detecting stale MVCC reads , Detecting stale MVCC reads
- indexes and snapshot isolation , Indexes and snapshot isolation
- mutual exclusion , Pessimistic versus optimistic concurrency control
- ( see also locks)
- MySQL (database)
- binlog coordinates , Setting Up New Followers
- binlog parsing for change data capture , Implementing change data capture
- circular replication topology , Multi-Leader Replication Topologies
- consistent snapshots , Setting Up New Followers
- distributed transaction support , XA transactions
- InnoDB storage engine ( see InnoDB)
- JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
- leader-based replication , Leaders and Followers
- performance of XA transactions , Distributed Transactions in Practice
- row-based replication , Logical (row-based) log replication
- schema changes in , Schema flexibility in the document model
- snapshot isolation support , Repeatable read and naming confusion
- ( see also InnoDB)
- statement-based replication , Statement-based replication
- Tungsten Replicator (multi-leader replication) , Multi-datacenter operation
- conflict detection , Multi-Leader Replication Topologies
N
- nanomsg (messaging library) , Direct messaging from producers to consumers
- Narayana (transaction coordinator) , Introduction to two-phase commit
- NATS (messaging) , Message brokers
- near-real-time (nearline) processing , Batch Processing
- ( see also stream processing)
- Neo4j (database)
- Cypher query language , The Cypher Query Language
- graph data model , Graph-Like Data Models
- Nephele (dataflow engine) , Dataflow engines
- netcat (Unix tool) , Separation of logic and wiring
- Netflix Chaos Monkey , Reliability , Network Faults in Practice
- Network Attached Storage (NAS) , Distributed Data , MapReduce and Distributed Filesystems
- network model , The network model
- graph databases versus , The SPARQL query language
- imperative query APIs , Declarative Queries on the Web
- Network Time Protocol ( see NTP)
- networks
- congestion and queueing , Network congestion and queueing
- datacenter network topologies , Cloud Computing and Supercomputing
- faults ( see faults)
- linearizability and network delays , Linearizability and network delays
- network partitions , Network Faults in Practice , The CAP theorem
- timeouts and unbounded delays , Timeouts and Unbounded Delays
- next-key locking , Index-range locks
- nodes (in graphs) ( see vertices)
- nodes (processes) , Glossary
- handling outages in leader-based replication , Handling Node Outages
- system models for failure , System Model and Reality
- noisy neighbors , Network congestion and queueing
- nonblocking atomic commit , Three-phase commit
- nondeterministic operations
- accidental nondeterminism , Fault tolerance
- partial failures in distributed systems , Faults and Partial Failures
- nonfunctional requirements , Summary
- nonrepeatable reads , Snapshot Isolation and Repeatable Read
- ( see also read skew)
-
normalization (data representation)
,
Many-to-One and Many-to-Many Relationships
,
Glossary
- executing joins , Which data model leads to simpler application code? , Convergence of document and relational databases , Reduce-Side Joins and Grouping
- foreign key references , The need for multi-object transactions
- in systems of record , Derived Data
- versus denormalization , Deriving several views from the same event log
-
NoSQL
,
The Birth of NoSQL
,
Unbundling Databases
- transactions and , The Slippery Concept of a Transaction
- Notation3 (N3) , Triple-Stores and SPARQL
- npm (package manager) , The move toward declarative query languages
-
NTP (Network Time Protocol)
,
Unreliable Clocks
- accuracy , Clock Synchronization and Accuracy , Timestamps for ordering events
- adjustments to monotonic clocks , Monotonic clocks
- multiple server addresses , Weak forms of lying
- numbers, in XML and JSON encodings , JSON, XML, and Binary Variants
O
-
object-relational mapping (ORM) frameworks
,
The Object-Relational Mismatch
- error handling and aborted transactions , Handling errors and aborts
- unsafe read-modify-write cycle code , Atomic write operations
- object-relational mismatch , The Object-Relational Mismatch
- observer pattern , Separation of application code and state
-
offline systems
,
Batch Processing
- ( see also batch processing)
- stateful, offline-capable clients , Clients with offline operation , Stateful, offline-capable clients
- offline-first applications , Stateful, offline-capable clients
-
offsets
- consumer offsets in partitioned logs , Consumer offsets
- messages in partitioned logs , Using logs for message storage
-
OLAP (online analytic processing)
,
Transaction Processing or Analytics?
,
Glossary
- data cubes , Aggregation: Data Cubes and Materialized Views
-
OLTP (online transaction processing)
,
Transaction Processing or Analytics?
,
Glossary
- analytics queries versus , The Output of Batch Workflows
- workload characteristics , Actual Serial Execution
-
one-to-many relationships
,
The Object-Relational Mismatch
- JSON representation , The Object-Relational Mismatch
-
online systems
,
Batch Processing
- ( see also services)
- Oozie (workflow scheduler) , MapReduce workflows
- OpenAPI (service definition format) , Web services
-
OpenStack
-
Nova (cloud infrastructure)
- use of ZooKeeper , Membership and Coordination Services
- Swift (object storage) , MapReduce and Distributed Filesystems
- operability , Operability: Making Life Easy for Operations
- operating systems versus databases , Unbundling Databases
- operation identifiers , Operation identifiers , Multi-partition request processing
- operational transformation , Custom conflict resolution logic
-
operators
,
Dataflow engines
- flow of data between , Graphs and Iterative Processing
- in stream processing , Processing Streams
- optimistic concurrency control , Pessimistic versus optimistic concurrency control
-
Oracle (database)
- distributed transaction support , XA transactions
- GoldenGate (change data capture) , Trigger-based replication , Multi-datacenter operation , Implementing change data capture
- lack of serializability , Isolation
- leader-based replication , Leaders and Followers
- multi-table index cluster tables , Data locality for queries
- not preventing write skew , Characterizing write skew
- partitioned indexes , Partitioning Secondary Indexes by Term
- PL/SQL language , Pros and cons of stored procedures
- preventing lost updates , Automatically detecting lost updates
- read committed isolation , Implementing read committed
- Real Application Clusters (RAC) , Locking and leader election
- recursive query support , Graph Queries in SQL
- snapshot isolation support , Snapshot Isolation and Repeatable Read , Repeatable read and naming confusion
- TimesTen (in-memory database) , Keeping everything in memory
- WAL-based replication , Write-ahead log (WAL) shipping
- XML support , The Object-Relational Mismatch
-
ordering
,
Ordering Guarantees
-
Implementing total order broadcast using linearizable storage
- by sequence numbers , Sequence Number Ordering - Timestamp ordering is not sufficient
-
causal ordering
,
Ordering and Causality
-
Capturing causal dependencies
- partial order , The causal order is not a total order
- limits of total ordering , The limits of total ordering
- total order broadcast , Total Order Broadcast - Implementing total order broadcast using linearizable storage
- Orleans (actor framework) , Distributed actor frameworks
- outliers (response time) , Describing Performance
- Oz (programming language) , Designing Applications Around Dataflow
P
- package managers , The move toward declarative query languages , Separation of application code and state
- packet switching , Can we not simply make network delays predictable?
-
packets
- corruption of , Weak forms of lying
- sending via UDP , Direct messaging from producers to consumers
- PageRank (algorithm) , Graph-Like Data Models , Graphs and Iterative Processing
- paging ( see virtual memory)
- ParAccel (database) , The divergence between OLTP databases and data warehouses
- parallel databases ( see massively parallel processing)
-
parallel execution
- of graph analysis algorithms , Parallel execution
- queries in MPP databases , Parallel Query Execution
-
Parquet (data format)
,
Column-Oriented Storage
,
Archival storage
- ( see also column-oriented storage)
- use in Hadoop , Philosophy of batch process outputs
-
partial failures
,
Faults and Partial Failures
,
Summary
- limping , Summary
- partial order , The causal order is not a total order
-
partitioning
,
Partitioning
-
Summary
,
Glossary
- and replication , Partitioning and Replication
- in batch processing , Summary
-
multi-partition operations
,
Multi-partition data processing
- enforcing constraints , Multi-partition request processing
- secondary index maintenance , Maintaining derived state
-
of key-value data
,
Partitioning of Key-Value Data
-
Skewed Workloads and Relieving Hot Spots
- by key range , Partitioning by Key Range
- skew and hot spots , Skewed Workloads and Relieving Hot Spots
-
rebalancing partitions
,
Rebalancing Partitions
-
Operations: Automatic or Manual Rebalancing
- automatic or manual rebalancing , Operations: Automatic or Manual Rebalancing
- problems with hash mod N , How not to do it: hash mod N
- using dynamic partitioning , Dynamic partitioning
- using fixed number of partitions , Fixed number of partitions
- using N partitions per node , Partitioning proportionally to nodes
- replication and , Distributed Data
- request routing , Request Routing - Parallel Query Execution
-
secondary indexes
,
Partitioning and Secondary Indexes
-
Partitioning Secondary Indexes by Term
- document-based partitioning , Partitioning Secondary Indexes by Document
- term-based partitioning , Partitioning Secondary Indexes by Term
- serial execution of transactions and , Partitioning
-
Paxos (consensus algorithm)
,
Consensus algorithms and total order broadcast
- ballot number , Epoch numbering and quorums
- Multi-Paxos (total order broadcast) , Consensus algorithms and total order broadcast
-
percentiles
,
Describing Performance
,
Glossary
- calculating efficiently , Describing Performance
- importance of high percentiles , Describing Performance
- use in service level agreements (SLAs) , Describing Performance
- Percona XtraBackup (MySQL tool) , Setting Up New Followers
-
performance
- describing , Describing Performance
- of distributed transactions , Distributed Transactions in Practice
- of in-memory databases , Keeping everything in memory
- of linearizability , Linearizability and network delays
- of multi-leader replication , Multi-datacenter operation
- perpetual inconsistency , Timeliness and Integrity
- pessimistic concurrency control , Pessimistic versus optimistic concurrency control
-
phantoms (transaction isolation)
,
Phantoms causing write skew
- materializing conflicts , Materializing conflicts
- preventing, in serializability , Predicate locks
- physical clocks ( see clocks)
- pickle (Python) , Language-Specific Formats
-
Pig (dataflow language)
,
Beyond MapReduce
,
High-Level APIs and Languages
- replicated joins , Broadcast hash joins
- skewed joins , Handling skew
- workflows , MapReduce workflows
- Pinball (workflow scheduler) , MapReduce workflows
-
pipelined execution
,
Discussion of materialization
- in Unix , The Unix Philosophy
- point in time , Unreliable Clocks
- polyglot persistence , The Birth of NoSQL
- polystores , The meta-database of everything
-
PostgreSQL (database)
-
BDR (multi-leader replication)
,
Multi-datacenter operation
- causal ordering of writes , Multi-Leader Replication Topologies
- Bottled Water (change data capture) , Implementing change data capture
- Bucardo (trigger-based replication) , Trigger-based replication , Custom conflict resolution logic
- distributed transaction support , XA transactions
- foreign data wrappers , The meta-database of everything
- full text search support , Combining Specialized Tools by Deriving Data
- leader-based replication , Leaders and Followers
- log sequence number , Setting Up New Followers
- MVCC implementation , Implementing snapshot isolation , Indexes and snapshot isolation
- PL/pgSQL language , Pros and cons of stored procedures
- PostGIS geospatial indexes , Multi-column indexes
- preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Serializable Snapshot Isolation (SSI)
- read committed isolation , Implementing read committed
- recursive query support , Graph Queries in SQL
- representing graphs , Property Graphs
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI)
- snapshot isolation support , Snapshot Isolation and Repeatable Read , Repeatable read and naming confusion
- WAL-based replication , Write-ahead log (WAL) shipping
- XML and JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
- pre-splitting , Dynamic partitioning
- Precision Time Protocol (PTP) , Clock Synchronization and Accuracy
- predicate locks , Predicate locks
-
predictive analytics
,
Predictive Analytics
-
Feedback loops
- amplifying bias , Bias and discrimination
- ethics of ( see ethics)
- feedback loops , Feedback loops
-
preemption
- of datacenter resources , Designing for frequent faults
- of threads , Process Pauses
- Pregel processing model , The Pregel processing model
-
primary keys
,
Other Indexing Structures
,
Glossary
- compound primary key (Cassandra) , Partitioning by Hash of Key
- primary-secondary replication ( see leader-based replication)
-
privacy
,
Privacy and Tracking
-
Legislation and self-regulation
- consent and freedom of choice , Consent and freedom of choice
- data as assets and power , Data as assets and power
- deleting data , Limitations of immutability
- ethical considerations ( see ethics)
- legislation and self-regulation , Legislation and self-regulation
- meaning of , Privacy and use of data
- surveillance , Surveillance
- tracking behavioral data , Privacy and Tracking
- probabilistic algorithms , Describing Performance , Stream analytics
- process pauses , Process Pauses - Limiting the impact of garbage collection
- processing time (of events) , Reasoning About Time
- producers (message streams) , Transmitting Event Streams
-
programming languages
- dataflow languages , Designing Applications Around Dataflow
- for stored procedures , Pros and cons of stored procedures
- functional reactive programming (FRP) , Designing Applications Around Dataflow
- logic programming , Designing Applications Around Dataflow
-
Prolog (language)
,
The Foundation: Datalog
- ( see also Datalog)
- promises (asynchronous operations) , Current directions for RPC
-
property graphs
,
Property Graphs
- Cypher query language , The Cypher Query Language
-
Protocol Buffers (data format)
,
Thrift and Protocol Buffers
-
Datatypes and schema evolution
- field tags and schema evolution , Field tags and schema evolution
- provenance of data , Designing for auditability
- publish/subscribe model , Messaging Systems
- publishers (message streams) , Transmitting Event Streams
- punch card tabulating machines , Batch Processing
- pure functions , MapReduce Querying
- putting computation near data , Distributed execution of MapReduce
Q
- Qpid (messaging) , Message brokers compared to databases
- quality of service (QoS) , Can we not simply make network delays predictable?
- Quantcast File System (distributed filesystem) , MapReduce and Distributed Filesystems
-
query languages
,
Query Languages for Data
-
MapReduce Querying
- aggregation pipeline , MapReduce Querying
- CSS and XSL , Declarative Queries on the Web
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog
- Juttle , Designing Applications Around Dataflow
- MapReduce querying , MapReduce Querying
- recursive SQL queries , Graph Queries in SQL
- relational algebra and SQL , Query Languages for Data
- SPARQL , The SPARQL query language
- query optimizers , The relational model , The move toward declarative query languages
-
queueing delays (networks)
,
Network congestion and queueing
- head-of-line blocking , Describing Performance
- latency and response time , Describing Performance
- queues (messaging) , Message brokers
-
quorums
,
Quorums for reading and writing
-
Limitations of Quorum Consistency
,
Glossary
- for leaderless replication , Quorums for reading and writing
- in consensus algorithms , Epoch numbering and quorums
- limitations of consistency , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- making decisions in distributed systems , The Truth Is Defined by the Majority
- monitoring staleness , Monitoring staleness
- multi-datacenter replication , Multi-datacenter operation
- relying on durability , Mapping system models to the real world
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
R
- R-trees (indexes) , Multi-column indexes
-
RabbitMQ (messaging)
,
Message brokers
,
Message brokers compared to databases
- leader-based replication , Leaders and Followers
-
race conditions
,
Isolation
- ( see also concurrency)
- avoiding with linearizability , Cross-channel timing dependencies
- caused by dual writes , Keeping Systems in Sync
- dirty writes , No dirty writes
- in counter increments , No dirty writes
- lost updates , Preventing Lost Updates - Conflict resolution and replication
- preventing with event logs , Concurrency control , Dataflow: Interplay between state changes and application code
- preventing with serializable isolation , Serializability
- write skew , Write Skew and Phantoms - Materializing conflicts
-
Raft (consensus algorithm)
,
Consensus algorithms and total order broadcast
- sensitivity to network problems , Limitations of consensus
- term number , Epoch numbering and quorums
- use in etcd , Distributed Transactions and Consensus
- RAID (Redundant Array of Independent Disks) , Hardware Faults , MapReduce and Distributed Filesystems
- railways, schema migration on , Reprocessing data for application evolution
- RAMCloud (in-memory storage) , Keeping everything in memory
- ranking algorithms , Graphs and Iterative Processing
-
RDF (Resource Description Framework)
,
The semantic web
- querying with SPARQL , The SPARQL query language
- RDMA (Remote Direct Memory Access) , Cloud Computing and Supercomputing
-
read committed isolation level
,
Read Committed
-
Implementing read committed
- implementing , Implementing read committed
- multi-version concurrency control (MVCC) , Implementing snapshot isolation
- no dirty reads , No dirty reads
- no dirty writes , No dirty writes
- read path (derived data) , Observing Derived State
-
read repair (leaderless replication)
,
Read repair and anti-entropy
- for linearizability , Linearizability and quorums
- read replicas ( see leader-based replication)
-
read skew (transaction isolation)
,
Snapshot Isolation and Repeatable Read
,
Summary
- as violation of causality , Ordering and Causality
-
read-after-write consistency
,
Reading Your Own Writes
,
Timeliness and Integrity
- cross-device , Reading Your Own Writes
- read-modify-write cycle , Preventing Lost Updates
- read-scaling architecture , Problems with Replication Lag
- reads as events , Reads are events too
-
real-time
- collaborative editing , Collaborative editing
-
near-real-time processing
,
Batch Processing
- ( see also stream processing)
- publish/subscribe dataflow , End-to-end event streams
- response time guarantees , Response time guarantees
- time-of-day clocks , Time-of-day clocks
-
rebalancing partitions
,
Rebalancing Partitions
-
Operations: Automatic or Manual Rebalancing
,
Glossary
- ( see also partitioning)
- automatic or manual rebalancing , Operations: Automatic or Manual Rebalancing
- dynamic partitioning , Dynamic partitioning
- fixed number of partitions , Fixed number of partitions
- fixed number of partitions per node , Partitioning proportionally to nodes
- problems with hash mod N , How not to do it: hash mod N
- recency guarantee , Linearizability
-
recommendation engines
- batch process outputs , Key-value stores as batch process output
- batch workflows , MapReduce workflows , Materialization of Intermediate State
- iterative processing , Graphs and Iterative Processing
- statistical and numerical algorithms , Specialization for different domains
-
records
,
MapReduce Job Execution
- events in stream processing , Transmitting Event Streams
- recursive common table expressions (SQL) , Graph Queries in SQL
- redelivery (messaging) , Acknowledgments and redelivery
-
Redis (database)
- atomic operations , Atomic write operations
- durability , Keeping everything in memory
- Lua scripting , Pros and cons of stored procedures
- single-threaded execution , Actual Serial Execution
- usage example , Thinking About Data Systems
-
redundancy
- hardware components , Hardware Faults
-
of derived data
,
Derived Data
- ( see also derived data)
- Reed–Solomon codes (error correction) , MapReduce and Distributed Filesystems
-
refactoring
,
Evolvability: Making Change Easy
- ( see also evolvability)
- regions (partitioning) , Partitioning
- register (data structure) , What Makes a System Linearizable?
-
relational data model
,
Relational Model Versus Document Model
-
Convergence of document and relational databases
- comparison to document model , Relational Versus Document Databases Today - Convergence of document and relational databases
- graph queries in SQL , Graph Queries in SQL
- in-memory databases with , Keeping everything in memory
- many-to-one and many-to-many relationships , Many-to-One and Many-to-Many Relationships
- multi-object transactions, need for , The need for multi-object transactions
- NoSQL as alternative to , The Birth of NoSQL
- object-relational mismatch , The Object-Relational Mismatch
- relational algebra and SQL , Query Languages for Data
-
versus document model
- convergence of models , Convergence of document and relational databases
- data locality , Data locality for queries
-
relational databases
- eventual consistency , Problems with Replication Lag
- history , Relational Model Versus Document Model
- leader-based replication , Leaders and Followers
- logical logs , Logical (row-based) log replication
- philosophy compared to Unix , Unbundling Databases , The meta-database of everything
- schema changes , Schema flexibility in the document model , Encoding and Evolution , Different values written at different times
- statement-based replication , Statement-based replication
- use of B-tree indexes , B-Trees
- relationships ( see edges)
-
reliability
,
Reliability
-
How Important Is Reliability?
,
The Future of Data Systems
- building a reliable system from unreliable components , Cloud Computing and Supercomputing
- defined , Thinking About Data Systems , Summary
- hardware faults , Hardware Faults
- human errors , Human Errors
- importance of , How Important Is Reliability?
- of messaging systems , Messaging Systems
- software errors , Software Errors
- Remote Method Invocation (Java RMI) , The problems with remote procedure calls (RPCs)
-
remote procedure calls (RPCs)
,
The problems with remote procedure calls (RPCs)
-
Data encoding and evolution for RPC
- ( see also services)
- based on futures , Current directions for RPC
- data encoding and evolution , Data encoding and evolution for RPC
- issues with , The problems with remote procedure calls (RPCs)
- using Avro , But what is the writer’s schema? , Current directions for RPC
- using Thrift , Current directions for RPC
- versus message brokers , Message-Passing Dataflow
- repeatable reads (transaction isolation) , Repeatable read and naming confusion
- replicas , Leaders and Followers
-
replication
,
Replication
-
Summary
,
Glossary
- and durability , Durability
- chain replication , Synchronous Versus Asynchronous Replication
- conflict resolution and , Conflict resolution and replication
-
consistency properties
,
Problems with Replication Lag
-
Solutions for Replication Lag
- consistent prefix reads , Consistent Prefix Reads
- monotonic reads , Monotonic Reads
- reading your own writes , Reading Your Own Writes
- in distributed filesystems , MapReduce and Distributed Filesystems
-
leaderless
,
Leaderless Replication
-
Version vectors
- detecting concurrent writes , Detecting Concurrent Writes - Version vectors
- limitations of quorum consistency , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
- monitoring staleness , Monitoring staleness
-
multi-leader
,
Multi-Leader Replication
-
Multi-Leader Replication Topologies
- across multiple datacenters , Multi-datacenter operation , The Cost of Linearizability
- handling write conflicts , Handling Write Conflicts - What is a conflict?
- replication topologies , Multi-Leader Replication Topologies
- partitioning and , Distributed Data , Partitioning and Replication
- reasons for using , Distributed Data , Replication
-
single-leader
,
Leaders and Followers
-
Trigger-based replication
- failover , Leader failure: Failover
- implementation of replication logs , Implementation of Replication Logs - Trigger-based replication
- relation to consensus , Single-leader replication and consensus
- setting up new followers , Setting Up New Followers
- synchronous versus asynchronous , Synchronous Versus Asynchronous Replication
- state machine replication , Using total order broadcast , Databases and Streams
- using erasure coding , MapReduce and Distributed Filesystems
- with heterogeneous data systems , Keeping Systems in Sync
- replication logs ( see logs)
-
reprocessing data
,
Reprocessing data for application evolution
,
Unifying batch and stream processing
- ( see also evolvability)
- from log-based messaging , Replaying old messages
-
request routing
,
Request Routing
-
Parallel Query Execution
- approaches to , Request Routing
- parallel query execution , Parallel Query Execution
-
resilient systems
,
Reliability
- ( see also fault tolerance)
-
response time
- as performance metric for services , Describing Performance , Batch Processing
- guarantees on , Response time guarantees
- latency versus , Describing Performance
- mean and percentiles , Describing Performance
- user experience , Describing Performance
- responsibility and accountability , Responsibility and accountability
-
REST (Representational State Transfer)
,
Web services
- ( see also services)
-
RethinkDB (database)
- document data model , The Object-Relational Mismatch
- dynamic partitioning , Dynamic partitioning
- join support , Many-to-One and Many-to-Many Relationships , Convergence of document and relational databases
- key-range partitioning , Partitioning by Key Range
- leader-based replication , Leaders and Followers
- subscribing to changes , API support for change streams
-
Riak (database)
- Bitcask storage engine , Hash Indexes
- CRDTs , Custom conflict resolution logic , Merging concurrently written values
- dotted version vectors , Version vectors
- gossip protocol , Request Routing
- hash partitioning , Partitioning by Hash of Key , Fixed number of partitions
- last-write-wins conflict resolution , Last write wins (discarding concurrent writes)
- leaderless replication , Leaderless Replication
- LevelDB storage engine , Making an LSM-tree out of SSTables
- linearizability, lack of , Linearizability and quorums
- multi-datacenter support , Multi-datacenter operation
- preventing lost updates across replicas , Conflict resolution and replication
- rebalancing , Operations: Automatic or Manual Rebalancing
- search feature , Partitioning Secondary Indexes by Term
- secondary indexes , Partitioning Secondary Indexes by Document
- siblings (concurrently written values) , Merging concurrently written values
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- ring buffers , Disk space usage
- Ripple (cryptocurrency) , Tools for auditable data systems
- rockets , Human Errors , Are Document Databases Repeating History? , Byzantine Faults
-
RocksDB (storage engine)
,
Making an LSM-tree out of SSTables
- leveled compaction , Performance optimizations
- rollbacks (transactions) , Transactions
- rolling upgrades , Hardware Faults , Encoding and Evolution
- routing ( see request routing)
-
row-oriented storage
,
Column-Oriented Storage
- row-based replication , Logical (row-based) log replication
- rowhammer (memory corruption) , Trust, but Verify
- RPCs ( see remote procedure calls)
- Rubygems (package manager) , The move toward declarative query languages
- rules (Datalog) , The Foundation: Datalog
S
-
safety and liveness properties
,
Safety and liveness
- in consensus algorithms , Fault-Tolerant Consensus
- in transactions , Transactions
- sagas ( see compensating transactions)
-
Samza (stream processor)
,
Stream analytics
,
Maintaining materialized views
- fault tolerance , Rebuilding state after a failure
- streaming SQL support , Complex event processing
- sandboxes , Human Errors
- SAP HANA (database) , The divergence between OLTP databases and data warehouses
-
scalability
,
Scalability
-
Approaches for Coping with Load
,
The Future of Data Systems
- approaches for coping with load , Approaches for Coping with Load
- defined , Summary
- describing load , Describing Load
- describing performance , Describing Performance
- partitioning and , Partitioning
- replication and , Problems with Replication Lag
- scaling up versus scaling out , Distributed Data
-
scaling out
,
Approaches for Coping with Load
,
Distributed Data
- ( see also shared-nothing architecture)
- scaling up , Approaches for Coping with Load , Distributed Data
- scatter/gather approach, querying partitioned databases , Partitioning Secondary Indexes by Document
- SCD (slowly changing dimension) , Time-dependence of joins
-
schema-on-read
,
Schema flexibility in the document model
- comparison to evolvable schema , The Merits of Schemas
- in distributed filesystems , Diversity of storage
- schema-on-write , Schema flexibility in the document model
- schemaless databases ( see schema-on-read)
-
schemas
,
Glossary
-
Avro
,
Avro
-
Code generation and dynamically typed languages
- reader determining writer’s schema , But what is the writer’s schema?
- schema evolution , The writer’s schema and the reader’s schema
- dynamically generated , Dynamically generated schemas
-
evolution of
,
Reprocessing data for application evolution
- affecting application code , Encoding and Evolution
- compatibility checking , But what is the writer’s schema?
- in databases , Dataflow Through Databases - Archival storage
- in message-passing , Distributed actor frameworks
- in service calls , Data encoding and evolution for RPC
- flexibility in document model , Schema flexibility in the document model
- for analytics , Stars and Snowflakes: Schemas for Analytics
- for JSON and XML , JSON, XML, and Binary Variants
- merits of , The Merits of Schemas
- schema migration on railways , Reprocessing data for application evolution
-
Thrift and Protocol Buffers
,
Thrift and Protocol Buffers
-
Datatypes and schema evolution
- schema evolution , Field tags and schema evolution
- traditional approach to design, fallacy in , Deriving several views from the same event log
-
searches
- building search indexes in batch processes , Building search indexes
- k-nearest neighbors , Specialization for different domains
- on streams , Search on streams
- partitioned secondary indexes , Partitioning and Secondary Indexes
- secondaries ( see leader-based replication)
-
secondary indexes
,
Other Indexing Structures
,
Glossary
-
partitioning
,
Partitioning and Secondary Indexes
-
Partitioning Secondary Indexes by Term
,
Summary
- document-partitioned , Partitioning Secondary Indexes by Document
- index maintenance , Maintaining derived state
- term-partitioned , Partitioning Secondary Indexes by Term
- problems with dual writes , Keeping Systems in Sync , Reasoning about dataflows
- updating, transaction isolation and , The need for multi-object transactions
- secondary sorts , Sort-merge joins
- sed (Unix tool) , Simple Log Analysis
- self-describing files , Code generation and dynamically typed languages
- self-joins , Summary
- self-validating systems , A culture of verification
- semantic web , The semantic web
- semi-synchronous replication , Synchronous Versus Asynchronous Replication
-
sequence number ordering
,
Sequence Number Ordering
-
Timestamp ordering is not sufficient
- generators , Synchronized clocks for global snapshots , Noncausal sequence number generators
- insufficiency for enforcing constraints , Timestamp ordering is not sufficient
- Lamport timestamps , Lamport timestamps
- use of timestamps , Timestamps for ordering events , Synchronized clocks for global snapshots , Noncausal sequence number generators
- sequential consistency , Implementing linearizable storage using total order broadcast
-
serializability
,
Isolation
,
Weak Isolation Levels
,
Serializability
-
Performance of serializable snapshot isolation
,
Glossary
- linearizability versus , What Makes a System Linearizable?
- pessimistic versus optimistic concurrency control , Pessimistic versus optimistic concurrency control
-
serial execution
,
Actual Serial Execution
-
Summary of serial execution
- partitioning , Partitioning
- using stored procedures , Encapsulating transactions in stored procedures , Using total order broadcast
-
serializable snapshot isolation (SSI)
,
Serializable Snapshot Isolation (SSI)
-
Performance of serializable snapshot isolation
- detecting stale MVCC reads , Detecting stale MVCC reads
- detecting writes that affect prior reads , Detecting writes that affect prior reads
- distributed execution , Performance of serializable snapshot isolation , Limitations of distributed transactions
- performance of SSI , Performance of serializable snapshot isolation
- preventing write skew , Decisions based on an outdated premise - Detecting writes that affect prior reads
-
two-phase locking (2PL)
,
Two-Phase Locking (2PL)
-
Index-range locks
- index-range locks , Index-range locks
- performance , Performance of two-phase locking
- Serializable (Java) , Language-Specific Formats
-
serialization
,
Formats for Encoding Data
- ( see also encoding)
-
service discovery
,
Current directions for RPC
,
Request Routing
,
Service discovery
- using DNS , Request Routing , Service discovery
- service level agreements (SLAs) , Describing Performance
- service-oriented architecture (SOA) , Dataflow Through Services: REST and RPC
- ( see also services)
- services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- microservices , Dataflow Through Services: REST and RPC
- causal dependencies across services , The limits of total ordering
- loose coupling , Making unbundling work
- relation to batch/stream processors , Batch Processing , Stream processors and services
- remote procedure calls (RPCs) , The problems with remote procedure calls (RPCs) - Data encoding and evolution for RPC
- issues with , The problems with remote procedure calls (RPCs)
- similarity to databases , Dataflow Through Services: REST and RPC
- web services , Web services , Current directions for RPC
- session windows (stream processing) , Types of windows
- ( see also windows)
- sessionization , GROUP BY
- sharding ( see partitioning)
- shared mode (locks) , Implementation of two-phase locking
- shared-disk architecture , Distributed Data , MapReduce and Distributed Filesystems
- shared-memory architecture , Distributed Data
- shared-nothing architecture , Approaches for Coping with Load , Distributed Data - Distributed Data , Glossary
- ( see also replication)
- distributed filesystems , MapReduce and Distributed Filesystems
- ( see also distributed filesystems)
- partitioning , Partitioning
- use of network , Unreliable Networks
- sharks
- biting undersea cables , Network Faults in Practice
- counting (example) , MapReduce Querying - MapReduce Querying
- finding (example) , Query Languages for Data
- website about (example) , Declarative Queries on the Web
- shredding (in relational model) , Which data model leads to simpler application code?
- siblings (concurrent values) , Merging concurrently written values , Conflict resolution and replication
- ( see also conflicts)
- similarity search
- edit distance , Full-text search and fuzzy indexes
- genome data , Summary
- k-nearest neighbors , Specialization for different domains
- single-leader replication ( see leader-based replication)
- single-threaded execution , Atomic write operations , Actual Serial Execution
- in batch processing , Bringing related data together in the same place , Dataflow engines , Parallel execution
- in stream processing , Logs compared to traditional messaging , Concurrency control , Uniqueness in log-based messaging
- size-tiered compaction , Performance optimizations
- skew , Glossary
- clock skew , Relying on Synchronized Clocks - Clock readings have a confidence interval , Implementing Linearizable Systems
- in transaction isolation
- read skew , Snapshot Isolation and Repeatable Read , Summary
- write skew , Write Skew and Phantoms - Materializing conflicts , Decisions based on an outdated premise - Detecting writes that affect prior reads
- ( see also write skew)
- meanings of , Snapshot Isolation and Repeatable Read
- unbalanced workload , Partitioning of Key-Value Data
- compensating for , Skewed Workloads and Relieving Hot Spots
- due to celebrities , Skewed Workloads and Relieving Hot Spots
- for time-series data , Partitioning by Key Range
- in batch processing , Handling skew
- slaves ( see leader-based replication)
- sliding windows (stream processing) , Types of windows
- ( see also windows)
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- ( see also quorums)
- lack of linearizability , Implementing Linearizable Systems
- slowly changing dimension (data warehouses) , Time-dependence of joins
- smearing (leap seconds adjustments) , Clock Synchronization and Accuracy
- snapshots (databases)
- causal consistency , Ordering and Causality
- computing derived data , Creating an index
- in change data capture , Initial snapshot
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation , What Makes a System Linearizable?
- setting up a new replica , Setting Up New Followers
- snapshot isolation and repeatable read , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion
- implementing with MVCC , Implementing snapshot isolation
- indexes and MVCC , Indexes and snapshot isolation
- visibility rules , Visibility rules for observing a consistent snapshot
- synchronized clocks for global snapshots , Synchronized clocks for global snapshots
- snowflake schemas , Stars and Snowflakes: Schemas for Analytics
- SOAP , Web services
- ( see also services)
- evolvability , Data encoding and evolution for RPC
- software bugs , Software Errors
- maintaining integrity , Maintaining integrity in the face of software bugs
- solid state drives (SSDs)
- access patterns , Advantages of LSM-trees
- detecting corruption , The end-to-end argument , Don’t just blindly trust what they promise
- faults in , Durability
- sequential write throughput , Hash Indexes
- Solr (search server)
- building indexes in batch processes , Building search indexes
- document-partitioned indexes , Partitioning Secondary Indexes by Document
- request routing , Request Routing
- usage example , Thinking About Data Systems
- use of Lucene , Making an LSM-tree out of SSTables
- sort (Unix tool) , Simple Log Analysis , Sorting versus in-memory aggregation , The Unix Philosophy
- sort-merge joins (MapReduce) , Sort-merge joins
- Sorted String Tables ( see SSTables)
- sorting
- sort order in column storage , Sort Order in Column Storage
- source of truth ( see systems of record)
- Spanner (database)
- data locality , Data locality for queries
- snapshot isolation using clocks , Synchronized clocks for global snapshots
- TrueTime API , Clock readings have a confidence interval
- Spark (processing framework) , Dataflow engines - Discussion of materialization
- bytecode generation , The move toward declarative query languages
- dataflow APIs , High-Level APIs and Languages
- fault tolerance , Fault tolerance
- for data warehouses , The divergence between OLTP databases and data warehouses
- GraphX API (graph processing) , The Pregel processing model
- machine learning , Specialization for different domains
- query optimizer , The move toward declarative query languages
- Spark Streaming , Stream analytics
- microbatching , Microbatching and checkpointing
- stream processing on top of batch processing , Batch and Stream Processing
- SPARQL (query language) , The SPARQL query language
- spatial algorithms , Specialization for different domains
- split brain , Leader failure: Failover , Glossary
- in consensus algorithms , Distributed Transactions and Consensus , Single-leader replication and consensus
- preventing , Consistency and Consensus , Implementing Linearizable Systems
- using fencing tokens to avoid , The leader and the lock - Fencing tokens
- spreadsheets, dataflow programming capabilities , Designing Applications Around Dataflow
- SQL (Structured Query Language) , Simplicity: Managing Complexity , Relational Model Versus Document Model , Query Languages for Data
- advantages and limitations of , Diversity of processing models
- distributed query execution , MapReduce Querying
- graph queries in , Graph Queries in SQL
- isolation levels standard, issues with , Repeatable read and naming confusion
- query execution on Hadoop , Diversity of processing models
- résumé (example) , The Object-Relational Mismatch
- SQL injection vulnerability , Byzantine Faults
- SQL on Hadoop , The divergence between OLTP databases and data warehouses
- statement-based replication , Statement-based replication
- stored procedures , Pros and cons of stored procedures
- SQL Server (database)
- data warehousing support , The divergence between OLTP databases and data warehouses
- distributed transaction support , XA transactions
- leader-based replication , Leaders and Followers
- preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Implementation of two-phase locking
- read committed isolation , Implementing read committed
- recursive query support , Graph Queries in SQL
- serializable isolation , Implementation of two-phase locking
- snapshot isolation support , Snapshot Isolation and Repeatable Read
- T-SQL language , Pros and cons of stored procedures
- XML support , The Object-Relational Mismatch
- SQLstream (stream analytics) , Complex event processing
- SSDs ( see solid state drives)
- SSTables (storage format) , SSTables and LSM-Trees - Performance optimizations
- advantages over hash indexes , SSTables and LSM-Trees
- concatenated index , Partitioning by Hash of Key
- constructing and maintaining , Constructing and maintaining SSTables
- making LSM-Tree from , Making an LSM-tree out of SSTables
- staleness (old data) , Reading Your Own Writes
- cross-channel timing dependencies , Cross-channel timing dependencies
- in leaderless databases , Writing to the Database When a Node Is Down
- in multi-version concurrency control , Detecting stale MVCC reads
- monitoring for , Monitoring staleness
- of client state , Pushing state changes to clients
- versus linearizability , Linearizability
- versus timeliness , Timeliness and Integrity
- standbys ( see leader-based replication)
- star replication topologies , Multi-Leader Replication Topologies
- star schemas , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- similarity to event sourcing , Event Sourcing
- Star Wars analogy (event time versus processing time) , Event time versus processing time
- state
- derived from log of immutable events , State, Streams, and Immutability
- deriving current state from the event log , Deriving current state from the event log
- interplay between state changes and application code , Dataflow: Interplay between state changes and application code
- maintaining derived state , Maintaining derived state
- maintenance by stream processor in stream-stream joins , Stream-stream join (window join)
- observing derived state , Observing Derived State - Multi-partition data processing
- rebuilding after stream processor failure , Rebuilding state after a failure
- separation of application code and , Separation of application code and state
- state machine replication , Using total order broadcast , Databases and Streams
- statement-based replication , Statement-based replication
- statically typed languages
- analogy to schema-on-write , Schema flexibility in the document model
- code generation and , Code generation and dynamically typed languages
- statistical and numerical algorithms , Specialization for different domains
- StatsD (metrics aggregator) , Direct messaging from producers to consumers
- stdin, stdout , A uniform interface , Separation of logic and wiring
- Stellar (cryptocurrency) , Tools for auditable data systems
- stock market feeds , Direct messaging from producers to consumers
- STONITH (Shoot The Other Node In The Head) , Leader failure: Failover
- stop-the-world ( see garbage collection)
- storage
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- diversity of, in MapReduce , Diversity of storage
- Storage Area Network (SAN) , Distributed Data , MapReduce and Distributed Filesystems
- storage engines , Storage and Retrieval - Summary
- column-oriented , Column-Oriented Storage - Writing to Column-Oriented Storage
- column compression , Column Compression - Memory bandwidth and vectorized processing
- defined , Column-Oriented Storage
- distinction between column families and , Column Compression
- Parquet , Column-Oriented Storage , Archival storage
- sort order in , Sort Order in Column Storage - Several different sort orders
- writing to , Writing to Column-Oriented Storage
- comparing requirements for transaction processing and analytics , Transaction Processing or Analytics? - Column-Oriented Storage
- in-memory storage , Keeping everything in memory
- durability , Durability
- row-oriented , Data Structures That Power Your Database - Keeping everything in memory
- B-trees , B-Trees - B-tree optimizations
- comparing B-trees and LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
- defined , Column-Oriented Storage
- log-structured , Hash Indexes - Performance optimizations
- stored procedures , Trigger-based replication , Encapsulating transactions in stored procedures - Pros and cons of stored procedures , Glossary
- and total order broadcast , Using total order broadcast
- pros and cons of , Pros and cons of stored procedures
- similarity to stream processors , Application code as a derivation function
- Storm (stream processor) , Stream analytics
- distributed RPC , Message passing and RPC , Multi-partition data processing
- Trident state handling , Idempotence
- straggler events , Knowing when you’re ready , The lambda architecture
- stream processing , Processing Streams - Summary , Glossary
- accessing external services within job , Stream-table join (stream enrichment) , Microbatching and checkpointing , Idempotence , Exactly-once execution of an operation
- combining with batch processing
- lambda architecture , The lambda architecture
- unifying technologies , Unifying batch and stream processing
- comparison to batch processing , Processing Streams
- complex event processing (CEP) , Complex event processing
- fault tolerance , Fault Tolerance - Rebuilding state after a failure
- atomic commit , Atomic commit revisited
- idempotence , Idempotence
- microbatching and checkpointing , Microbatching and checkpointing
- rebuilding state after a failure , Rebuilding state after a failure
- for data integration , Batch and Stream Processing - Unifying batch and stream processing
- maintaining derived state , Maintaining derived state
- maintenance of materialized views , Maintaining materialized views
- messaging systems ( see messaging systems)
- reasoning about time , Reasoning About Time - Types of windows
- event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
- knowing when window is ready , Knowing when you’re ready
- types of windows , Types of windows
- relation to databases ( see streams)
- relation to services , Stream processors and services
- search on streams , Search on streams
- single-threaded execution , Logs compared to traditional messaging , Concurrency control
- stream analytics , Stream analytics
- stream joins , Stream Joins - Time-dependence of joins
- stream-stream join , Stream-stream join (window join)
- stream-table join , Stream-table join (stream enrichment)
- table-table join , Table-table join (materialized view maintenance)
- time-dependence of , Time-dependence of joins
- streams , Stream Processing - Replaying old messages
- end-to-end, pushing events to clients , End-to-end event streams
- messaging systems ( see messaging systems)
- processing ( see stream processing)
- relation to databases , Databases and Streams - Limitations of immutability
- ( see also changelogs)
- API support for change streams , API support for change streams
- change data capture , Change Data Capture - API support for change streams
- derivative of state by time , State, Streams, and Immutability
- event sourcing , Event Sourcing - Commands and events
- keeping systems in sync , Keeping Systems in Sync - Keeping Systems in Sync
- philosophy of immutable events , State, Streams, and Immutability - Limitations of immutability
- topics , Transmitting Event Streams
- strict serializability , What Makes a System Linearizable?
- strong consistency ( see linearizability)
- strong one-copy serializability , What Makes a System Linearizable?
- subjects, predicates, and objects (in triple-stores) , Triple-Stores and SPARQL
- subscribers (message streams) , Transmitting Event Streams
- ( see also consumers)
- supercomputers , Cloud Computing and Supercomputing
- surveillance , Surveillance
- ( see also privacy)
- Swagger (service definition format) , Web services
- swapping to disk ( see virtual memory)
- synchronous networks , Synchronous Versus Asynchronous Networks , Glossary
- comparison to asynchronous networks , Synchronous Versus Asynchronous Networks
- formal model , System Model and Reality
- synchronous replication , Synchronous Versus Asynchronous Replication , Glossary
- chain replication , Synchronous Versus Asynchronous Replication
- conflict detection , Synchronous versus asynchronous conflict detection
- system models , Knowledge, Truth, and Lies , System Model and Reality - Mapping system models to the real world
- assumptions in , Trust, but Verify
- correctness of algorithms , Correctness of an algorithm
- mapping to the real world , Mapping system models to the real world
- safety and liveness , Safety and liveness
- systems of record , Derived Data , Glossary
- change data capture , Implementing change data capture , Reasoning about dataflows
- treating event log as , State, Streams, and Immutability
- systems thinking , Feedback loops
T
- t-digest (algorithm) , Describing Performance
- table-table joins , Table-table join (materialized view maintenance)
- Tableau (data visualization software) , Diversity of processing models
- tail (Unix tool) , Using logs for message storage
- tail vertex (property graphs) , Property Graphs
- Tajo (query engine) , The divergence between OLTP databases and data warehouses
- Tandem NonStop SQL (database) , Partitioning
- TCP (Transmission Control Protocol) , Cloud Computing and Supercomputing
- comparison to circuit switching , Can we not simply make network delays predictable?
- comparison to UDP , Network congestion and queueing
- connection failures , Detecting Faults
- flow control , Network congestion and queueing , Messaging Systems
- packet checksums , Weak forms of lying , The end-to-end argument , Trust, but Verify
- reliability and duplicate suppression , Duplicate suppression
- retransmission timeouts , Network congestion and queueing
- use for transaction sessions , Single-Object and Multi-Object Operations
- telemetry ( see monitoring)
- Teradata (database) , The divergence between OLTP databases and data warehouses , Partitioning
- term-partitioned indexes , Partitioning Secondary Indexes by Term , Summary
- termination (consensus) , Fault-Tolerant Consensus
- Terrapin (database) , Key-value stores as batch process output
- Tez (dataflow engine) , Dataflow engines - Discussion of materialization
- fault tolerance , Fault tolerance
- support by higher-level tools , High-Level APIs and Languages
- thrashing (out of memory) , Process Pauses
- threads (concurrency)
- actor model , Distributed actor frameworks , Message passing and RPC
- ( see also message-passing)
- atomic operations , Atomicity
- background threads , Hash Indexes , Downsides of LSM-trees
- execution pauses , Can we not simply make network delays predictable? , Process Pauses - Process Pauses
- memory barriers , Linearizability and network delays
- preemption , Process Pauses
- single ( see single-threaded execution)
- three-phase commit , Three-phase commit
- Thrift (data format) , Thrift and Protocol Buffers - Datatypes and schema evolution
- BinaryProtocol , Thrift and Protocol Buffers
- CompactProtocol , Thrift and Protocol Buffers
- field tags and schema evolution , Field tags and schema evolution
- throughput , Describing Performance , Batch Processing
- TIBCO , Message brokers
- Enterprise Message Service , Message brokers compared to databases
- StreamBase (stream analytics) , Complex event processing
- time
- concurrency and , The “happens-before” relationship and concurrency
- cross-channel timing dependencies , Cross-channel timing dependencies
- in distributed systems , Unreliable Clocks - Limiting the impact of garbage collection
- ( see also clocks)
- clock synchronization and accuracy , Clock Synchronization and Accuracy
- relying on synchronized clocks , Relying on Synchronized Clocks - Synchronized clocks for global snapshots
- process pauses , Process Pauses - Limiting the impact of garbage collection
- reasoning about, in stream processors , Reasoning About Time - Types of windows
- event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
- knowing when window is ready , Knowing when you’re ready
- timestamp of events , Whose clock are you using, anyway?
- types of windows , Types of windows
- system models for distributed systems , System Model and Reality
- time-dependence in stream joins , Time-dependence of joins
- time-of-day clocks , Time-of-day clocks
- timeliness , Timeliness and Integrity
- coordination-avoiding data systems , Coordination-avoiding data systems
- correctness of dataflow systems , Correctness of dataflow systems
- timeouts , Unreliable Networks , Glossary
- dynamic configuration of , Network congestion and queueing
- for failover , Leader failure: Failover
- length of , Timeouts and Unbounded Delays
- timestamps , Sequence Number Ordering
- assigning to events in stream processing , Whose clock are you using, anyway?
- for read-after-write consistency , Reading Your Own Writes
- for transaction ordering , Synchronized clocks for global snapshots
- insufficiency for enforcing constraints , Timestamp ordering is not sufficient
- key range partitioning by , Partitioning by Key Range
- Lamport , Lamport timestamps
- logical , Ordering events to capture causality
- ordering events , Timestamps for ordering events , Noncausal sequence number generators
- Titan (database) , Graph-Like Data Models
- tombstones , Hash Indexes , Merging concurrently written values , Log compaction
- topics (messaging) , Message brokers , Transmitting Event Streams
- total order , The causal order is not a total order , Glossary
- limits of , The limits of total ordering
- sequence numbers or timestamps , Sequence Number Ordering
- total order broadcast , Total Order Broadcast - Implementing total order broadcast using linearizable storage , The limits of total ordering , Uniqueness in log-based messaging
- consensus algorithms and , Consensus algorithms and total order broadcast - Epoch numbering and quorums
- implementation in ZooKeeper and etcd , Membership and Coordination Services
- implementing with linearizable storage , Implementing total order broadcast using linearizable storage
- using , Using total order broadcast
- using to implement linearizable storage , Implementing linearizable storage using total order broadcast
- tracking behavioral data , Privacy and Tracking
- ( see also privacy)
- transaction coordinator ( see coordinator)
- transaction manager ( see coordinator)
- transaction processing , Relational Model Versus Document Model , Transaction Processing or Analytics? - Stars and Snowflakes: Schemas for Analytics
- comparison to analytics , Transaction Processing or Analytics?
- comparison to data warehousing , The divergence between OLTP databases and data warehouses
- transactions , Transactions - Summary , Glossary
- ACID properties of , The Meaning of ACID
- atomicity , Atomicity
- consistency , Consistency
- durability , Durability
- isolation , Isolation
- compensating ( see compensating transactions)
- concept of , The Slippery Concept of a Transaction
- distributed transactions , Distributed Transactions and Consensus - Limitations of distributed transactions
- avoiding , Derived data versus distributed transactions , Making unbundling work , Enforcing Constraints - Coordination-avoiding data systems
- failure amplification , Limitations of distributed transactions , Maintaining derived state
- in doubt/uncertain status , Coordinator failure , Holding locks while in doubt
- two-phase commit , Atomic Commit and Two-Phase Commit (2PC) - Three-phase commit
- use of , Distributed Transactions in Practice - Exactly-once message processing
- XA transactions , XA transactions - Limitations of distributed transactions
- OLTP versus analytics queries , The Output of Batch Workflows
- purpose of , Transactions
- serializability , Serializability - Performance of serializable snapshot isolation
- actual serial execution , Actual Serial Execution - Summary of serial execution
- pessimistic versus optimistic concurrency control , Pessimistic versus optimistic concurrency control
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- single-object and multi-object , Single-Object and Multi-Object Operations - Handling errors and aborts
- handling errors and aborts , Handling errors and aborts
- need for multi-object transactions , The need for multi-object transactions
- single-object writes , Single-object writes
- snapshot isolation ( see snapshots)
- weak isolation levels , Weak Isolation Levels - Materializing conflicts
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
- read committed , Read Committed - Snapshot Isolation and Repeatable Read
- transitive closure (graph algorithm) , Graphs and Iterative Processing
- trie (data structure) , Full-text search and fuzzy indexes
- triggers (databases) , Trigger-based replication , Transmitting Event Streams
- implementing change data capture , Implementing change data capture
- implementing replication , Trigger-based replication
- triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- SPARQL query language , The SPARQL query language
- tumbling windows (stream processing) , Types of windows
- ( see also windows)
- in microbatching , Microbatching and checkpointing
- tuple spaces (programming model) , Dataflow: Interplay between state changes and application code
- Turtle (RDF data format) , Triple-Stores and SPARQL
- Twitter
- constructing home timelines (example) , Describing Load , Deriving several views from the same event log , Table-table join (materialized view maintenance) , Materialized views and caching
- DistributedLog (event log) , Using logs for message storage
- Finagle (RPC framework) , Current directions for RPC
- Snowflake (sequence number generator) , Synchronized clocks for global snapshots
- Summingbird (processing library) , The lambda architecture
- two-phase commit (2PC) , Distributed Transactions and Consensus , Introduction to two-phase commit - Coordinator failure , Glossary
- confusion with two-phase locking , Introduction to two-phase commit
- coordinator failure , Coordinator failure
- coordinator recovery , Recovering from coordinator failure
- how it works , A system of promises
- issues in practice , Limitations of distributed transactions
- performance cost , Distributed Transactions in Practice
- transactions holding locks , Holding locks while in doubt
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks , What Makes a System Linearizable? , Glossary
- confusion with two-phase commit , Introduction to two-phase commit
- index-range locks , Index-range locks
- performance of , Performance of two-phase locking
- type checking, dynamic versus static , Schema flexibility in the document model
U
- UDP (User Datagram Protocol)
- comparison to TCP , Network congestion and queueing
- multicast , Direct messaging from producers to consumers
- unbounded datasets , Stream Processing , Glossary
- ( see also streams)
- unbounded delays , Glossary
- in networks , Timeouts and Unbounded Delays
- process pauses , Process Pauses
- unbundling databases , Unbundling Databases - Multi-partition data processing
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- federation versus unbundling , The meta-database of everything
- need for high-level language , What’s missing?
- designing applications around dataflow , Designing Applications Around Dataflow - Stream processors and services
- observing derived state , Observing Derived State - Multi-partition data processing
- materialized views and caching , Materialized views and caching
- multi-partition data processing , Multi-partition data processing
- pushing state changes to clients , Pushing state changes to clients
- uncertain (transaction status) ( see in doubt)
- uniform consensus , Fault-Tolerant Consensus
- ( see also consensus)
- uniform interfaces , A uniform interface
- union type (in Avro) , Schema evolution rules
- uniq (Unix tool) , Simple Log Analysis
- uniqueness constraints
- asynchronously checked , Loosely interpreted constraints
- requiring consensus , Uniqueness constraints require consensus
- requiring linearizability , Constraints and uniqueness guarantees
- uniqueness in log-based messaging , Uniqueness in log-based messaging
- Unix philosophy , The Unix Philosophy - Transparency and experimentation
- command-line batch processing , Batch Processing with Unix Tools - Sorting versus in-memory aggregation
- Unix pipes versus dataflow engines , Discussion of materialization
- comparison to Hadoop , Philosophy of batch process outputs - Philosophy of batch process outputs
- comparison to relational databases , Unbundling Databases , The meta-database of everything
- comparison to stream processing , Processing Streams
- composability and uniform interfaces , The Unix Philosophy
- loose coupling , Separation of logic and wiring
- pipes , The Unix Philosophy
- relation to Hadoop , Unbundling Databases
- UPDATE statement (SQL) , Schema flexibility in the document model
- updates
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
- atomic write operations , Atomic write operations
- automatically detecting lost updates , Automatically detecting lost updates
- compare-and-set operations , Compare-and-set
- conflict resolution and replication , Conflict resolution and replication
- using explicit locking , Explicit locking
- preventing write skew , Write Skew and Phantoms - Materializing conflicts
V
- validity (consensus) , Fault-Tolerant Consensus
- vBuckets (partitioning) , Partitioning
- vector clocks , Version vectors
- ( see also version vectors)
- vectorized processing , Memory bandwidth and vectorized processing , The move toward declarative query languages
- verification , Trust, but Verify - Tools for auditable data systems
- avoiding blind trust , Don’t just blindly trust what they promise
- culture of , A culture of verification
- designing for auditability , Designing for auditability
- end-to-end integrity checks , The end-to-end argument again
- tools for auditable data systems , Tools for auditable data systems
- version control systems, reliance on immutable data , Limitations of immutability
- version vectors , Multi-Leader Replication Topologies , Version vectors
- capturing causal dependencies , Capturing causal dependencies
- versus vector clocks , Version vectors
- Vertica (database) , The divergence between OLTP databases and data warehouses
- handling writes , Writing to Column-Oriented Storage
- replicas using different sort orders , Several different sort orders
- vertical scaling ( see scaling up)
- vertices (in graphs) , Graph-Like Data Models
- property graph model , Property Graphs
- Viewstamped Replication (consensus algorithm) , Consensus algorithms and total order broadcast
- view number , Epoch numbering and quorums
- virtual machines , Distributed Data
- ( see also cloud computing)
- context switches , Process Pauses
- network performance , Network congestion and queueing
- noisy neighbors , Network congestion and queueing
- reliability in cloud services , Hardware Faults
- virtualized clocks in , Clock Synchronization and Accuracy
- virtual memory
- process pauses due to page faults , Describing Performance , Process Pauses
- versus memory management by databases , Keeping everything in memory
- VisiCalc (spreadsheets) , Designing Applications Around Dataflow
- vnodes (partitioning) , Partitioning
- Voice over IP (VoIP) , Network congestion and queueing
- Voldemort (database)
- building read-only stores in batch processes , Key-value stores as batch process output
- hash partitioning , Partitioning by Hash of Key , Fixed number of partitions
- leaderless replication , Leaderless Replication
- multi-datacenter support , Multi-datacenter operation
- rebalancing , Operations: Automatic or Manual Rebalancing
- reliance on read repair , Read repair and anti-entropy
- sloppy quorums , Sloppy Quorums and Hinted Handoff
- VoltDB (database)
- cross-partition serializability , Partitioning
- deterministic stored procedures , Pros and cons of stored procedures
- in-memory storage , Keeping everything in memory
- output streams , API support for change streams
- secondary indexes , Partitioning Secondary Indexes by Document
- serial execution of transactions , Actual Serial Execution
- statement-based replication , Statement-based replication , Rebuilding state after a failure
- transactions in stream processing , Atomic commit revisited
W
- WAL (write-ahead log) , Making B-trees reliable
- web services ( see services)
- Web Services Description Language (WSDL) , Web services
- webhooks , Direct messaging from producers to consumers
- webMethods (messaging) , Message brokers
- WebSocket (protocol) , Pushing state changes to clients
- windows (stream processing) , Stream analytics , Reasoning About Time - Types of windows
- infinite windows for changelogs , Maintaining materialized views , Stream-table join (stream enrichment)
- knowing when all events have arrived , Knowing when you’re ready
- stream joins within a window , Stream-stream join (window join)
- types of windows , Types of windows
- winners (conflict resolution) , Converging toward a consistent state
- WITH RECURSIVE syntax (SQL) , Graph Queries in SQL
- workflows (MapReduce) , MapReduce workflows
- outputs , The Output of Batch Workflows - Philosophy of batch process outputs
- key-value stores , Key-value stores as batch process output
- search indexes , Building search indexes
- with map-side joins , MapReduce workflows with map-side joins
- working set , Sorting versus in-memory aggregation
- write amplification , Advantages of LSM-trees
- write path (derived data) , Observing Derived State
- write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- characterizing , Write Skew and Phantoms - Phantoms causing write skew , Decisions based on an outdated premise
- examples of , Write Skew and Phantoms , More examples of write skew
- materializing conflicts , Materializing conflicts
- occurrence in practice , Maintaining integrity in the face of software bugs
- phantoms , Phantoms causing write skew
- preventing
- in snapshot isolation , Decisions based on an outdated premise - Detecting writes that affect prior reads
- in two-phase locking , Predicate locks - Index-range locks
- options for , Characterizing write skew
- write-ahead log (WAL) , Making B-trees reliable , Write-ahead log (WAL) shipping
- writes (database)
- atomic write operations , Atomic write operations
- detecting writes affecting prior reads , Detecting writes that affect prior reads
- preventing dirty writes with read committed , No dirty writes
- WS-* framework , Web services
- ( see also services)
- WS-AtomicTransaction (2PC) , Introduction to two-phase commit
X
- XA transactions , Introduction to two-phase commit , XA transactions - Limitations of distributed transactions
- heuristic decisions , Recovering from coordinator failure
- limitations of , Limitations of distributed transactions
- xargs (Unix tool) , Simple Log Analysis , A uniform interface
- XML
- binary variants , Binary encoding
- encoding RDF data , The RDF data model
- for application data, issues with , JSON, XML, and Binary Variants
- in relational databases , The Object-Relational Mismatch , Convergence of document and relational databases
- XSL/XPath , Declarative Queries on the Web
Y
- Yahoo!
- Pistachio (database) , Deriving several views from the same event log
- Sherpa (database) , Implementing change data capture
- YARN (job scheduler) , Diversity of processing models , Separation of application code and state
- preemption of jobs , Designing for frequent faults
- use of ZooKeeper , Membership and Coordination Services
Z
- Zab (consensus algorithm) , Consensus algorithms and total order broadcast
- use in ZooKeeper , Distributed Transactions and Consensus
- ZeroMQ (messaging library) , Direct messaging from producers to consumers
- ZooKeeper (coordination service) , Membership and Coordination Services - Membership services
- generating fencing tokens , Fencing tokens , Using total order broadcast , Membership and Coordination Services
- linearizable operations , Implementing Linearizable Systems , Implementing linearizable storage using total order broadcast
- locks and leader election , Locking and leader election
- service discovery , Service discovery
- use for partition assignment , Request Routing , Allocating work to nodes
- use of Zab algorithm , Using total order broadcast , Distributed Transactions and Consensus , Consensus algorithms and total order broadcast