Mar 15, 2013

Designing Data-Intensive Applications

Glossary

Note

Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.

asynchronous

Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See “Synchronous Versus Asynchronous Replication”, “Synchronous Versus Asynchronous Networks”, and “System Model and Reality”.

atomic

In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also isolation.

In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See “Atomicity” and “Atomic Commit and Two-Phase Commit (2PC)”.

backpressure

Forcing the sender of some data to slow down because the recipient cannot keep up with it. Also known as flow control. See “Messaging Systems”.

batch process

A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See Chapter 10.

bounded

Having some known upper limit or size. Used for example in the context of network delay (see “Timeouts and Unbounded Delays”) and datasets (see the introduction to Chapter 11).

Byzantine fault

A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “Byzantine Faults”.

cache

A component that remembers recently used data in order to speed up future reads of the same data. It is generally not complete: thus, if some data is missing from the cache, it has to be fetched from some underlying, slower data storage system that has a complete copy of the data.
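
The fallback to the underlying store can be illustrated with a minimal sketch. The class name `ReadThroughCache`, the `misses` counter, and the dictionary standing in for the slower storage system are all assumptions made for illustration; eviction and invalidation are deliberately left out.

```python
class ReadThroughCache:
    """Hypothetical read-through cache in front of a slower backing store."""

    def __init__(self, backing_store):
        self._store = backing_store  # complete copy of the data (assumed slow)
        self._cache = {}             # incomplete copy of recently read data
        self.misses = 0              # how often we had to go to the store

    def get(self, key):
        if key not in self._cache:
            # Cache miss: fetch from the underlying store and remember it.
            self.misses += 1
            self._cache[key] = self._store[key]
        return self._cache[key]


database = {"alice": 1, "bob": 2}  # stand-in for the slower storage system
cache = ReadThroughCache(database)
first = cache.get("alice")   # miss: read from the backing store
second = cache.get("alice")  # hit: served from the cache
```

Only the first read of a key touches the backing store; later reads of the same key are answered from memory.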

CAP theorem

A widely misunderstood theoretical result that is not useful in practice. See “The CAP theorem”.

causality

The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See “The “happens-before” relationship and concurrency” and “Ordering and Causality”.

consensus

A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See “Fault-Tolerant Consensus”.

data warehouse

A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See “Data Warehousing”.

declarative

Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of queries, a query optimizer takes a declarative query and decides how it should best be executed. See “Query Languages for Data”.

denormalize

To introduce some amount of redundancy or duplication in a normalized dataset, typically in the form of a cache or index, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See “Single-Object and Multi-Object Operations” and “Deriving several views from the same event log”.

derived data

A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See the introduction to Part III.

deterministic

Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things.

distributed

Running on several nodes connected by a network. Characterized by partial failures: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See “Faults and Partial Failures”.

durable

Storing data in a way such that you believe it will not be lost, even if various faults occur. See “Durability”.

ETL

Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See “Data Warehousing”.

failover

In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See “Handling Node Outages”.

fault-tolerant

Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See “Reliability”.

flow control

See backpressure.

follower

A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a secondary, slave, read replica, or hot standby. See “Leaders and Followers”.

full-text search

Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of secondary index that supports such queries. See “Full-text search and fuzzy indexes”.

graph

A data structure consisting of vertices (things that you can refer to, also known as nodes or entities) and edges (connections from one vertex to another, also known as relationships or arcs). See “Graph-Like Data Models”.

hash

A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a collision). See “Partitioning by Hash of Key”.
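
Both properties can be seen in a small sketch. It assumes SHA-256 from Python's standard `hashlib` as the hash function, and the helper `bucket_for` and the `user:` keys are hypothetical names chosen for illustration only.

```python
import hashlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Hypothetical helper: hash a key and reduce it to a bucket number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets


# The same input always produces the same output.
same_input_same_output = bucket_for("user:42", 8) == bucket_for("user:42", 8)

# Nine distinct keys cannot all land in distinct buckets when there are
# only eight buckets, so at least one collision must occur.
buckets = [bucket_for(f"user:{i}", 8) for i in range(9)]
has_collision = len(set(buckets)) < len(buckets)
```

Reducing the hash to a small number of buckets makes collisions easy to observe; with the full digest they are possible in principle but extremely unlikely.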

idempotent

Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See “Idempotence”.
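
As a rough illustration, assuming a hypothetical account stored as a dictionary, setting a value is idempotent while incrementing it is not. The function names below are made up for this sketch.

```python
def set_balance(account: dict, value: int) -> None:
    """Idempotent: repeating the call leaves the same final state."""
    account["balance"] = value


def add_to_balance(account: dict, amount: int) -> None:
    """Not idempotent: every repeated call changes the state again."""
    account["balance"] += amount


account = {"balance": 100}

set_balance(account, 150)
set_balance(account, 150)  # a retry has no further effect
after_idempotent_retry = account["balance"]

add_to_balance(account, 50)
add_to_balance(account, 50)  # a retry applies the increment twice
after_unsafe_retry = account["balance"]
```

Retrying `set_balance` is harmless, whereas retrying `add_to_balance` after an ambiguous failure would double-count the increment.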

index

A data structure that lets you efficiently search for all records that have a particular value in a particular field. See “Data Structures That Power Your Database”.

isolation

In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. Serializable isolation provides the strongest guarantees, but weaker isolation levels are also used. See “Isolation”.

join

To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See “Many-to-One and Many-to-Many Relationships” and “Reduce-Side Joins and Grouping”.

leader

When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the primary or master. See “Leaders and Followers”.

linearizable

Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability”.

locality

A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries”.

lock

A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” and “The leader and the lock”.

log

An append-only file for storing data. A write-ahead log is used to make a storage engine resilient against crashes (see “Making B-trees reliable”), a log-structured storage engine uses logs as its primary storage format (see “SSTables and LSM-Trees”), a replication log is used to copy writes from a leader to followers (see “Leaders and Followers”), and an event log can represent a data stream (see “Partitioned Logs”).

materialize

To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See “Aggregation: Data Cubes and Materialized Views” and “Materialization of Intermediate State”.

node

An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.

normalized

Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to-Many Relationships”.

OLAP

Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?”.

OLTP

Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?”.

partitioning

Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding. See Chapter 6.

percentile

A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period complete in less than t, and 5% take longer than t. See “Describing Performance”.
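
One common way to compute this is the nearest-rank method, sketched below. The function name `percentile` and the sample response times are assumptions for illustration; other conventions (such as interpolating between neighbouring values) give slightly different answers on small samples.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value that at least p percent
    of the sorted values are less than or equal to (0 < p <= 100)."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: 1, 2, ..., 100.
response_times_ms = list(range(1, 101))
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
```

With these 100 samples, `p95` is 95: 95 of the requests took 95 ms or less, and the remaining 5 took longer.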

primary key

A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index.

quorum

The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing”.
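
As a small sketch of the arithmetic, assuming n replicas with a write quorum of w nodes and a read quorum of r nodes: a majority quorum needs more than half the nodes, and read and write quorums are guaranteed to share at least one node when w + r > n. The function names are hypothetical.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every write quorum of size w shares at least one node
    with every read quorum of size r, out of n nodes in total."""
    return w + r > n


three_node_majority = majority(3)
five_node_majority = majority(5)
overlapping = quorums_overlap(n=3, w=2, r=2)
non_overlapping = quorums_overlap(n=3, w=1, r=1)
```

With three nodes, writing to two and reading from two always overlaps in at least one node; writing to one and reading from one may miss the latest write.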

rebalance

To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions”.

replication

Keeping a copy of the same data on several nodes (replicas) so that it remains accessible if a node becomes unreachable. See Chapter 5.

schema

A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see “Schema flexibility in the document model”), and a schema can change over time (see Chapter 4).

secondary index

An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Structures” and “Partitioning and Secondary Indexes”.

serializable

A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability”.

shared-nothing

An architecture in which independent nodes (each with their own CPUs, memory, and disks) are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See the introduction to Part II.

skew

Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots. See “Skewed Workloads and Relieving Hot Spots” and “Handling skew”.

A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read”, write skew in “Write Skew and Phantoms”, and clock skew in “Timestamps for ordering events”.

split brain

A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Outages” and “The Truth Is Defined by the Majority”.

stored procedure

A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See “Actual Serial Execution”.

stream process

A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.

synchronous

The opposite of asynchronous.

system of record

A system that holds the primary, authoritative version of some data, also known as the source of truth. Changes are first written here, and other datasets may be derived from the system of record. See the introduction to Part III.

timeout

One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays”.

total order

A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a partial order. See “The causal order is not a total order”.
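
The difference can be sketched with Python's built-in types: integers (standing in for timestamps) are totally ordered, whereas sets ordered by the subset relation are only partially ordered. The specific values are arbitrary examples.

```python
# Total order: any two integers can be compared.
t1, t2 = 1700000000, 1700000005
always_comparable = (t1 < t2) or (t2 < t1) or (t1 == t2)

# Partial order: for sets, <= means "is a subset of".
a = frozenset({"x", "y"})
b = frozenset({"y", "z"})
c = frozenset({"x"})

# c is a subset of a, so this pair is comparable.
comparable_pair = c <= a

# Neither a nor b is a subset of the other, so they are incomparable.
incomparable_pair = not (a <= b) and not (b <= a)
```

Here `a` and `b` play the role of two things for which you cannot say which is greater or smaller.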

transaction

Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See Chapter 7.

two-phase commit (2PC)

An algorithm to ensure that several database nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)”.

two-phase locking (2PL)

An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Locking (2PL)”.

unbounded

Not having any known upper limit or size. The opposite of bounded.

Index

A

aborts (transactions) , Transactions , Atomicity
- in two-phase commit , Introduction to two-phase commit
- performance of optimistic concurrency control , Performance of serializable snapshot isolation
- retrying aborted transactions , Handling errors and aborts
abstraction , Simplicity: Managing Complexity , Data Models and Query Languages , Transactions , Summary , Consistency and Consensus
access path (in network model) , The network model , The SPARQL query language
accidental complexity, removing , Simplicity: Managing Complexity
accountability , Responsibility and accountability
ACID properties (transactions) , Transaction Processing or Analytics? , The Meaning of ACID
- atomicity , Atomicity , Single-Object and Multi-Object Operations
- consistency , Consistency , Maintaining integrity in the face of software bugs
- durability , Durability
- isolation , Isolation , Single-Object and Multi-Object Operations
acknowledgements (messaging) , Acknowledgments and redelivery
active/active replication ( see multi-leader replication)
active/passive replication ( see leader-based replication)
ActiveMQ (messaging) , Message brokers , Message brokers compared to databases
- distributed transaction support , XA transactions
ActiveRecord (object-relational mapper) , The Object-Relational Mismatch , Handling errors and aborts
actor model , Distributed actor frameworks
- ( see also message-passing)
- comparison to Pregel model , The Pregel processing model
- comparison to stream processing , Message passing and RPC
Advanced Message Queuing Protocol ( see AMQP)
aerospace systems , Reliability , Human Errors , Byzantine Faults , Membership services
aggregation
- data cubes and materialized views , Aggregation: Data Cubes and Materialized Views
- in batch processes , GROUP BY
- in stream processes , Stream analytics
aggregation pipeline query language , MapReduce Querying
Agile , Evolvability: Making Change Easy
- minimizing irreversibility , Philosophy of batch process outputs , Reprocessing data for application evolution
- moving faster with confidence , The end-to-end argument again
- Unix philosophy , The Unix Philosophy
agreement , Fault-Tolerant Consensus
- ( see also consensus)
Airflow (workflow scheduler) , MapReduce workflows
Ajax , Dataflow Through Services: REST and RPC
Akka (actor framework) , Distributed actor frameworks
algorithms
- algorithm correctness , Correctness of an algorithm
- B-trees , B-Trees - B-tree optimizations
- for distributed systems , System Model and Reality
- hash indexes , Hash Indexes - Hash Indexes
- mergesort , SSTables and LSM-Trees , Distributed execution of MapReduce , Sort-merge joins
- red-black trees , Constructing and maintaining SSTables
- SSTables and LSM-trees , SSTables and LSM-Trees - Performance optimizations
all-to-all replication topologies , Multi-Leader Replication Topologies
AllegroGraph (database) , Graph-Like Data Models
ALTER TABLE statement (SQL) , Schema flexibility in the document model , Encoding and Evolution
Amazon
- Dynamo (database) , Leaderless Replication
Amazon Web Services (AWS) , Hardware Faults
- Kinesis Streams (messaging) , Using logs for message storage
- network reliability , Network Faults in Practice
- postmortems , Software Errors
- RedShift (database) , The divergence between OLTP databases and data warehouses
- S3 (object storage) , MapReduce and Distributed Filesystems
  - checking data integrity , Don’t just blindly trust what they promise
amplification
- of bias , Bias and discrimination
- of failures , Limitations of distributed transactions , Maintaining derived state
- of tail latency , Describing Performance , Partitioning Secondary Indexes by Document
- write amplification , Advantages of LSM-trees
AMQP (Advanced Message Queuing Protocol) , Message brokers compared to databases
- ( see also messaging systems)
- comparison to log-based messaging , Logs compared to traditional messaging , Replaying old messages
- message ordering , Acknowledgments and redelivery
analytics , Transaction Processing or Analytics?
- comparison to transaction processing , Transaction Processing or Analytics?
- data warehousing ( see data warehousing)
- parallel query execution in MPP databases , Comparing Hadoop to Distributed Databases
- predictive ( see predictive analytics)
- relation to batch processing , The Output of Batch Workflows
- schemas for , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- snapshot isolation for queries , Snapshot Isolation and Repeatable Read
- stream analytics , Stream analytics
- using MapReduce, analysis of user activity events (example) , Example: analysis of user activity events
anti-caching (in-memory databases) , Keeping everything in memory
anti-entropy , Read repair and anti-entropy
Apache ActiveMQ ( see ActiveMQ)
Apache Avro ( see Avro)
Apache Beam ( see Beam)
Apache BookKeeper ( see BookKeeper)
Apache Cassandra ( see Cassandra)
Apache CouchDB ( see CouchDB)
Apache Curator ( see Curator)
Apache Drill ( see Drill)
Apache Flink ( see Flink)
Apache Giraph ( see Giraph)
Apache Hadoop ( see Hadoop)
Apache HAWQ ( see HAWQ)
Apache HBase ( see HBase)
Apache Helix ( see Helix)
Apache Hive ( see Hive)
Apache Impala ( see Impala)
Apache Jena ( see Jena)
Apache Kafka ( see Kafka)
Apache Lucene ( see Lucene)
Apache MADlib ( see MADlib)
Apache Mahout ( see Mahout)
Apache Oozie ( see Oozie)
Apache Parquet ( see Parquet)
Apache Qpid ( see Qpid)
Apache Samza ( see Samza)
Apache Solr ( see Solr)
Apache Spark ( see Spark)
Apache Storm ( see Storm)
Apache Tajo ( see Tajo)
Apache Tez ( see Tez)
Apache Thrift ( see Thrift)
Apache ZooKeeper ( see ZooKeeper)
Apama (stream analytics) , Complex event processing
append-only B-trees , B-tree optimizations , Indexes and snapshot isolation
append-only files ( see logs)
Application Programming Interfaces (APIs) , Thinking About Data Systems , Data Models and Query Languages
- for batch processing , MapReduce workflows
- for change streams , API support for change streams
- for distributed transactions , XA transactions
- for graph processing , The Pregel processing model
- for services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
  - ( see also services)
  - evolvability , Data encoding and evolution for RPC
  - RESTful , Web services
  - SOAP , Web services
application state ( see state)
approximate search ( see similarity search)
archival storage, data from databases , Archival storage
arcs ( see edges)
arithmetic mean , Describing Performance
ASCII text , Thrift and Protocol Buffers , A uniform interface
ASN.1 (schema language) , The Merits of Schemas
asynchronous networks , Unreliable Networks , Glossary
- comparison to synchronous networks , Synchronous Versus Asynchronous Networks
- formal model , System Model and Reality
asynchronous replication , Synchronous Versus Asynchronous Replication , Glossary
- conflict detection , Synchronous versus asynchronous conflict detection
- data loss on failover , Leader failure: Failover
- reads from asynchronous follower , Problems with Replication Lag
Asynchronous Transfer Mode (ATM) , Can we not simply make network delays predictable?
atomic broadcast ( see total order broadcast)
atomic clocks (caesium clocks) , Clock readings have a confidence interval , Synchronized clocks for global snapshots
- ( see also clocks)
atomicity (concurrency) , Glossary
- atomic increment-and-get , Implementing total order broadcast using linearizable storage
- compare-and-set , Compare-and-set , What Makes a System Linearizable?
  - ( see also compare-and-set operations)
- replicated operations , Conflict resolution and replication
- write operations , Atomic write operations
atomicity (transactions) , Atomicity , Single-Object and Multi-Object Operations , Glossary
- atomic commit , Distributed Transactions and Consensus
  - avoiding , Multi-partition request processing , Coordination-avoiding data systems
  - blocking and nonblocking , Three-phase commit
  - in stream processing , Exactly-once message processing , Atomic commit revisited
  - maintaining derived data , Keeping Systems in Sync
- for multi-object transactions , Single-Object and Multi-Object Operations
- for single-object writes , Single-object writes
auditability , Trust, but Verify - Tools for auditable data systems
- designing for , Designing for auditability
- self-auditing systems , A culture of verification
- through immutability , Advantages of immutable events
- tools for auditable data systems , Tools for auditable data systems
availability , Hardware Faults
- ( see also fault tolerance)
- in CAP theorem , The CAP theorem
- in service level agreements (SLAs) , Describing Performance
Avro (data format) , Avro - Code generation and dynamically typed languages
- code generation , Code generation and dynamically typed languages
- dynamically generated schemas , Dynamically generated schemas
- object container files , But what is the writer’s schema? , Archival storage , Philosophy of batch process outputs
- reader determining writer’s schema , But what is the writer’s schema?
- schema evolution , The writer’s schema and the reader’s schema
- use in Hadoop , Philosophy of batch process outputs
awk (Unix tool) , Simple Log Analysis
AWS ( see Amazon Web Services)
Azure ( see Microsoft)

N

nanomsg (messaging library) , Direct messaging from producers to consumers
Narayana (transaction coordinator) , Introduction to two-phase commit
NATS (messaging) , Message brokers
near-real-time (nearline) processing , Batch Processing
- ( see also stream processing)
Neo4j (database)
- Cypher query language , The Cypher Query Language
- graph data model , Graph-Like Data Models
Nephele (dataflow engine) , Dataflow engines
netcat (Unix tool) , Separation of logic and wiring
Netflix Chaos Monkey , Reliability , Network Faults in Practice
Network Attached Storage (NAS) , Distributed Data , MapReduce and Distributed Filesystems
network model , The network model
- graph databases versus , The SPARQL query language
- imperative query APIs , Declarative Queries on the Web
Network Time Protocol ( see NTP)
networks
- congestion and queueing , Network congestion and queueing
- datacenter network topologies , Cloud Computing and Supercomputing
- faults ( see faults)
- linearizability and network delays , Linearizability and network delays
- network partitions , Network Faults in Practice , The CAP theorem
- timeouts and unbounded delays , Timeouts and Unbounded Delays
next-key locking , Index-range locks
nodes (in graphs) ( see vertices)
nodes (processes) , Glossary
- handling outages in leader-based replication , Handling Node Outages
- system models for failure , System Model and Reality
noisy neighbors , Network congestion and queueing
nonblocking atomic commit , Three-phase commit
nondeterministic operations
- accidental nondeterminism , Fault tolerance
- partial failures in distributed systems , Faults and Partial Failures
nonfunctional requirements , Summary
nonrepeatable reads , Snapshot Isolation and Repeatable Read
- ( see also read skew)
normalization (data representation) , Many-to-One and Many-to-Many Relationships , Glossary
- executing joins , Which data model leads to simpler application code? , Convergence of document and relational databases , Reduce-Side Joins and Grouping
- foreign key references , The need for multi-object transactions
- in systems of record , Derived Data
- versus denormalization , Deriving several views from the same event log
NoSQL , The Birth of NoSQL , Unbundling Databases
- transactions and , The Slippery Concept of a Transaction
Notation3 (N3) , Triple-Stores and SPARQL
npm (package manager) , The move toward declarative query languages
NTP (Network Time Protocol) , Unreliable Clocks
- accuracy , Clock Synchronization and Accuracy , Timestamps for ordering events
- adjustments to monotonic clocks , Monotonic clocks
- multiple server addresses , Weak forms of lying
numbers, in XML and JSON encodings , JSON, XML, and Binary Variants

O

object-relational mapping (ORM) frameworks , The Object-Relational Mismatch
- error handling and aborted transactions , Handling errors and aborts
- unsafe read-modify-write cycle code , Atomic write operations
object-relational mismatch , The Object-Relational Mismatch
observer pattern , Separation of application code and state
offline systems , Batch Processing
- ( see also batch processing)
- stateful, offline-capable clients , Clients with offline operation , Stateful, offline-capable clients
offline-first applications , Stateful, offline-capable clients
offsets
- consumer offsets in partitioned logs , Consumer offsets
- messages in partitioned logs , Using logs for message storage
OLAP (online analytic processing) , Transaction Processing or Analytics? , Glossary
- data cubes , Aggregation: Data Cubes and Materialized Views
OLTP (online transaction processing) , Transaction Processing or Analytics? , Glossary
- analytics queries versus , The Output of Batch Workflows
- workload characteristics , Actual Serial Execution
one-to-many relationships , The Object-Relational Mismatch
- JSON representation , The Object-Relational Mismatch
online systems , Batch Processing
- ( see also services)
Oozie (workflow scheduler) , MapReduce workflows
OpenAPI (service definition format) , Web services
OpenStack
- Nova (cloud infrastructure)
  - use of ZooKeeper , Membership and Coordination Services
- Swift (object storage) , MapReduce and Distributed Filesystems
operability , Operability: Making Life Easy for Operations
operating systems versus databases , Unbundling Databases
operation identifiers , Operation identifiers , Multi-partition request processing
operational transformation , Custom conflict resolution logic
operators , Dataflow engines
- flow of data between , Graphs and Iterative Processing
- in stream processing , Processing Streams
optimistic concurrency control , Pessimistic versus optimistic concurrency control
Oracle (database)
- distributed transaction support , XA transactions
- GoldenGate (change data capture) , Trigger-based replication , Multi-datacenter operation , Implementing change data capture
- lack of serializability , Isolation
- leader-based replication , Leaders and Followers
- multi-table index cluster tables , Data locality for queries
- not preventing write skew , Characterizing write skew
- partitioned indexes , Partitioning Secondary Indexes by Term
- PL/SQL language , Pros and cons of stored procedures
- preventing lost updates , Automatically detecting lost updates
- read committed isolation , Implementing read committed
- Real Application Clusters (RAC) , Locking and leader election
- recursive query support , Graph Queries in SQL
- snapshot isolation support , Snapshot Isolation and Repeatable Read , Repeatable read and naming confusion
- TimesTen (in-memory database) , Keeping everything in memory
- WAL-based replication , Write-ahead log (WAL) shipping
- XML support , The Object-Relational Mismatch
ordering , Ordering Guarantees - Implementing total order broadcast using linearizable storage
- by sequence numbers , Sequence Number Ordering - Timestamp ordering is not sufficient
- causal ordering , Ordering and Causality - Capturing causal dependencies
  - partial order , The causal order is not a total order
- limits of total ordering , The limits of total ordering
- total order broadcast , Total Order Broadcast - Implementing total order broadcast using linearizable storage
Orleans (actor framework) , Distributed actor frameworks
outliers (response time) , Describing Performance
Oz (programming language) , Designing Applications Around Dataflow

P

package managers , The move toward declarative query languages , Separation of application code and state
packet switching , Can we not simply make network delays predictable?
packets
- corruption of , Weak forms of lying
- sending via UDP , Direct messaging from producers to consumers
PageRank (algorithm) , Graph-Like Data Models , Graphs and Iterative Processing
paging ( see virtual memory)
ParAccel (database) , The divergence between OLTP databases and data warehouses
parallel databases ( see massively parallel processing)
parallel execution
- of graph analysis algorithms , Parallel execution
- queries in MPP databases , Parallel Query Execution
Parquet (data format) , Column-Oriented Storage , Archival storage
- ( see also column-oriented storage)
- use in Hadoop , Philosophy of batch process outputs
partial failures , Faults and Partial Failures , Summary
- limping , Summary
partial order , The causal order is not a total order
partitioning , Partitioning - Summary , Glossary
- and replication , Partitioning and Replication
- in batch processing , Summary
- multi-partition operations , Multi-partition data processing
  - enforcing constraints , Multi-partition request processing
  - secondary index maintenance , Maintaining derived state
- of key-value data , Partitioning of Key-Value Data - Skewed Workloads and Relieving Hot Spots
  - by key range , Partitioning by Key Range
  - skew and hot spots , Skewed Workloads and Relieving Hot Spots
- rebalancing partitions , Rebalancing Partitions - Operations: Automatic or Manual Rebalancing
  - automatic or manual rebalancing , Operations: Automatic or Manual Rebalancing
  - problems with hash mod N , How not to do it: hash mod N
  - using dynamic partitioning , Dynamic partitioning
  - using fixed number of partitions , Fixed number of partitions
  - using N partitions per node , Partitioning proportionally to nodes
- replication and , Distributed Data
- request routing , Request Routing - Parallel Query Execution
- secondary indexes , Partitioning and Secondary Indexes - Partitioning Secondary Indexes by Term
  - document-based partitioning , Partitioning Secondary Indexes by Document
  - term-based partitioning , Partitioning Secondary Indexes by Term
- serial execution of transactions and , Partitioning
Paxos (consensus algorithm) , Consensus algorithms and total order broadcast
- ballot number , Epoch numbering and quorums
- Multi-Paxos (total order broadcast) , Consensus algorithms and total order broadcast
percentiles , Describing Performance , Glossary
- calculating efficiently , Describing Performance
- importance of high percentiles , Describing Performance
- use in service level agreements (SLAs) , Describing Performance
Percona XtraBackup (MySQL tool) , Setting Up New Followers
performance
- describing , Describing Performance
- of distributed transactions , Distributed Transactions in Practice
- of in-memory databases , Keeping everything in memory
- of linearizability , Linearizability and network delays
- of multi-leader replication , Multi-datacenter operation
perpetual inconsistency , Timeliness and Integrity
pessimistic concurrency control , Pessimistic versus optimistic concurrency control
phantoms (transaction isolation) , Phantoms causing write skew
- materializing conflicts , Materializing conflicts
- preventing, in serializability , Predicate locks
physical clocks ( see clocks)
pickle (Python) , Language-Specific Formats
Pig (dataflow language) , Beyond MapReduce , High-Level APIs and Languages
- replicated joins , Broadcast hash joins
- skewed joins , Handling skew
- workflows , MapReduce workflows
Pinball (workflow scheduler) , MapReduce workflows
pipelined execution , Discussion of materialization
- in Unix , The Unix Philosophy
point in time , Unreliable Clocks
polyglot persistence , The Birth of NoSQL
polystores , The meta-database of everything
PostgreSQL (database)
- BDR (multi-leader replication) , Multi-datacenter operation
  - causal ordering of writes , Multi-Leader Replication Topologies
- Bottled Water (change data capture) , Implementing change data capture
- Bucardo (trigger-based replication) , Trigger-based replication , Custom conflict resolution logic
- distributed transaction support , XA transactions
- foreign data wrappers , The meta-database of everything
- full text search support , Combining Specialized Tools by Deriving Data
- leader-based replication , Leaders and Followers
- log sequence number , Setting Up New Followers
- MVCC implementation , Implementing snapshot isolation , Indexes and snapshot isolation
- PL/pgSQL language , Pros and cons of stored procedures
- PostGIS geospatial indexes , Multi-column indexes
- preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Serializable Snapshot Isolation (SSI)
- read committed isolation , Implementing read committed
- recursive query support , Graph Queries in SQL
- representing graphs , Property Graphs
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI)
- snapshot isolation support , Snapshot Isolation and Repeatable Read , Repeatable read and naming confusion
- WAL-based replication , Write-ahead log (WAL) shipping
- XML and JSON support , The Object-Relational Mismatch , Convergence of document and relational databases
pre-splitting , Dynamic partitioning
Precision Time Protocol (PTP) , Clock Synchronization and Accuracy
predicate locks , Predicate locks
predictive analytics , Predictive Analytics - Feedback loops
- amplifying bias , Bias and discrimination
- ethics of ( see ethics)
- feedback loops , Feedback loops
preemption
- of datacenter resources , Designing for frequent faults
- of threads , Process Pauses
Pregel processing model , The Pregel processing model
primary keys , Other Indexing Structures , Glossary
- compound primary key (Cassandra) , Partitioning by Hash of Key
primary-secondary replication ( see leader-based replication)
privacy , Privacy and Tracking - Legislation and self-regulation
- consent and freedom of choice , Consent and freedom of choice
- data as assets and power , Data as assets and power
- deleting data , Limitations of immutability
- ethical considerations ( see ethics)
- legislation and self-regulation , Legislation and self-regulation
- meaning of , Privacy and use of data
- surveillance , Surveillance
- tracking behavioral data , Privacy and Tracking
probabilistic algorithms , Describing Performance , Stream analytics
process pauses , Process Pauses - Limiting the impact of garbage collection
processing time (of events) , Reasoning About Time
producers (message streams) , Transmitting Event Streams
programming languages
- dataflow languages , Designing Applications Around Dataflow
- for stored procedures , Pros and cons of stored procedures
- functional reactive programming (FRP) , Designing Applications Around Dataflow
- logic programming , Designing Applications Around Dataflow
Prolog (language) , The Foundation: Datalog
- ( see also Datalog)
promises (asynchronous operations) , Current directions for RPC
property graphs , Property Graphs
- Cypher query language , The Cypher Query Language
Protocol Buffers (data format) , Thrift and Protocol Buffers - Datatypes and schema evolution
- field tags and schema evolution , Field tags and schema evolution
provenance of data , Designing for auditability
publish/subscribe model , Messaging Systems
publishers (message streams) , Transmitting Event Streams
punch card tabulating machines , Batch Processing
pure functions , MapReduce Querying
putting computation near data , Distributed execution of MapReduce

Q

Qpid (messaging) , Message brokers compared to databases
quality of service (QoS) , Can we not simply make network delays predictable?
Quantcast File System (distributed filesystem) , MapReduce and Distributed Filesystems
query languages , Query Languages for Data - MapReduce Querying
- aggregation pipeline , MapReduce Querying
- CSS and XSL , Declarative Queries on the Web
- Cypher , The Cypher Query Language
- Datalog , The Foundation: Datalog
- Juttle , Designing Applications Around Dataflow
- MapReduce querying , MapReduce Querying
- recursive SQL queries , Graph Queries in SQL
- relational algebra and SQL , Query Languages for Data
- SPARQL , The SPARQL query language
query optimizers , The relational model , The move toward declarative query languages
queueing delays (networks) , Network congestion and queueing
- head-of-line blocking , Describing Performance
- latency and response time , Describing Performance
queues (messaging) , Message brokers
quorums , Quorums for reading and writing - Limitations of Quorum Consistency , Glossary
- for leaderless replication , Quorums for reading and writing
- in consensus algorithms , Epoch numbering and quorums
- limitations of consistency , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
- making decisions in distributed systems , The Truth Is Defined by the Majority
- monitoring staleness , Monitoring staleness
- multi-datacenter replication , Multi-datacenter operation
- relying on durability , Mapping system models to the real world
- sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff

R

R-trees (indexes) , Multi-column indexes
RabbitMQ (messaging) , Message brokers , Message brokers compared to databases
- leader-based replication , Leaders and Followers
race conditions , Isolation
- ( see also concurrency)
- avoiding with linearizability , Cross-channel timing dependencies
- caused by dual writes , Keeping Systems in Sync
- dirty writes , No dirty writes
- in counter increments , No dirty writes
- lost updates , Preventing Lost Updates - Conflict resolution and replication
- preventing with event logs , Concurrency control , Dataflow: Interplay between state changes and application code
- preventing with serializable isolation , Serializability
- write skew , Write Skew and Phantoms - Materializing conflicts
Raft (consensus algorithm) , Consensus algorithms and total order broadcast
- sensitivity to network problems , Limitations of consensus
- term number , Epoch numbering and quorums
- use in etcd , Distributed Transactions and Consensus
RAID (Redundant Array of Independent Disks) , Hardware Faults , MapReduce and Distributed Filesystems
railways, schema migration on , Reprocessing data for application evolution
RAMCloud (in-memory storage) , Keeping everything in memory
ranking algorithms , Graphs and Iterative Processing
RDF (Resource Description Framework) , The semantic web
- querying with SPARQL , The SPARQL query language
RDMA (Remote Direct Memory Access) , Cloud Computing and Supercomputing
read committed isolation level , Read Committed - Implementing read committed
- implementing , Implementing read committed
- multi-version concurrency control (MVCC) , Implementing snapshot isolation
- no dirty reads , No dirty reads
- no dirty writes , No dirty writes
read path (derived data) , Observing Derived State
read repair (leaderless replication) , Read repair and anti-entropy
- for linearizability , Linearizability and quorums
read replicas ( see leader-based replication)
read skew (transaction isolation) , Snapshot Isolation and Repeatable Read , Summary
- as violation of causality , Ordering and Causality
read-after-write consistency , Reading Your Own Writes , Timeliness and Integrity
- cross-device , Reading Your Own Writes
read-modify-write cycle , Preventing Lost Updates
read-scaling architecture , Problems with Replication Lag
reads as events , Reads are events too
real-time
- collaborative editing , Collaborative editing
- near-real-time processing , Batch Processing
  - ( see also stream processing)
- publish/subscribe dataflow , End-to-end event streams
- response time guarantees , Response time guarantees
- time-of-day clocks , Time-of-day clocks
rebalancing partitions , Rebalancing Partitions - Operations: Automatic or Manual Rebalancing , Glossary
- ( see also partitioning)
- automatic or manual rebalancing , Operations: Automatic or Manual Rebalancing
- dynamic partitioning , Dynamic partitioning
- fixed number of partitions , Fixed number of partitions
- fixed number of partitions per node , Partitioning proportionally to nodes
- problems with hash mod N , How not to do it: hash mod N
recency guarantee , Linearizability
recommendation engines
- batch process outputs , Key-value stores as batch process output
- batch workflows , MapReduce workflows , Materialization of Intermediate State
- iterative processing , Graphs and Iterative Processing
- statistical and numerical algorithms , Specialization for different domains
records , MapReduce Job Execution
- events in stream processing , Transmitting Event Streams
recursive common table expressions (SQL) , Graph Queries in SQL
redelivery (messaging) , Acknowledgments and redelivery
Redis (database)
- atomic operations , Atomic write operations
- durability , Keeping everything in memory
- Lua scripting , Pros and cons of stored procedures
- single-threaded execution , Actual Serial Execution
- usage example , Thinking About Data Systems
redundancy
- hardware components , Hardware Faults
- of derived data , Derived Data
  - ( see also derived data)
Reed–Solomon codes (error correction) , MapReduce and Distributed Filesystems
refactoring , Evolvability: Making Change Easy
- ( see also evolvability)
regions (partitioning) , Partitioning
register (data structure) , What Makes a System Linearizable?
relational data model , Relational Model Versus Document Model - Convergence of document and relational databases
- comparison to document model , Relational Versus Document Databases Today - Convergence of document and relational databases
- graph queries in SQL , Graph Queries in SQL
- in-memory databases with , Keeping everything in memory
- many-to-one and many-to-many relationships , Many-to-One and Many-to-Many Relationships
- multi-object transactions, need for , The need for multi-object transactions
- NoSQL as alternative to , The Birth of NoSQL
- object-relational mismatch , The Object-Relational Mismatch
- relational algebra and SQL , Query Languages for Data
- versus document model
  - convergence of models , Convergence of document and relational databases
  - data locality , Data locality for queries
relational databases
- eventual consistency , Problems with Replication Lag
- history , Relational Model Versus Document Model
- leader-based replication , Leaders and Followers
- logical logs , Logical (row-based) log replication
- philosophy compared to Unix , Unbundling Databases , The meta-database of everything
- schema changes , Schema flexibility in the document model , Encoding and Evolution , Different values written at different times
- statement-based replication , Statement-based replication
- use of B-tree indexes , B-Trees
relationships ( see edges)
reliability , Reliability - How Important Is Reliability? , The Future of Data Systems
- building a reliable system from unreliable components , Cloud Computing and Supercomputing
- defined , Thinking About Data Systems , Summary
- hardware faults , Hardware Faults
- human errors , Human Errors
- importance of , How Important Is Reliability?
- of messaging systems , Messaging Systems
- software errors , Software Errors
Remote Method Invocation (Java RMI) , The problems with remote procedure calls (RPCs)
remote procedure calls (RPCs) , The problems with remote procedure calls (RPCs) - Data encoding and evolution for RPC
- ( see also services)
- based on futures , Current directions for RPC
- data encoding and evolution , Data encoding and evolution for RPC
- issues with , The problems with remote procedure calls (RPCs)
- using Avro , But what is the writer’s schema? , Current directions for RPC
- using Thrift , Current directions for RPC
- versus message brokers , Message-Passing Dataflow
repeatable reads (transaction isolation) , Repeatable read and naming confusion
replicas , Leaders and Followers
replication , Replication - Summary , Glossary
- and durability , Durability
- chain replication , Synchronous Versus Asynchronous Replication
- conflict resolution and , Conflict resolution and replication
- consistency properties , Problems with Replication Lag - Solutions for Replication Lag
  - consistent prefix reads , Consistent Prefix Reads
  - monotonic reads , Monotonic Reads
  - reading your own writes , Reading Your Own Writes
- in distributed filesystems , MapReduce and Distributed Filesystems
- leaderless , Leaderless Replication - Version vectors
  - detecting concurrent writes , Detecting Concurrent Writes - Version vectors
  - limitations of quorum consistency , Limitations of Quorum Consistency - Monitoring staleness , Linearizability and quorums
  - sloppy quorums and hinted handoff , Sloppy Quorums and Hinted Handoff
- monitoring staleness , Monitoring staleness
- multi-leader , Multi-Leader Replication - Multi-Leader Replication Topologies
  - across multiple datacenters , Multi-datacenter operation , The Cost of Linearizability
  - handling write conflicts , Handling Write Conflicts - What is a conflict?
  - replication topologies , Multi-Leader Replication Topologies
- partitioning and , Distributed Data , Partitioning and Replication
- reasons for using , Distributed Data , Replication
- single-leader , Leaders and Followers - Trigger-based replication
  - failover , Leader failure: Failover
  - implementation of replication logs , Implementation of Replication Logs - Trigger-based replication
  - relation to consensus , Single-leader replication and consensus
  - setting up new followers , Setting Up New Followers
  - synchronous versus asynchronous , Synchronous Versus Asynchronous Replication
- state machine replication , Using total order broadcast , Databases and Streams
- using erasure coding , MapReduce and Distributed Filesystems
- with heterogeneous data systems , Keeping Systems in Sync
replication logs ( see logs)
reprocessing data , Reprocessing data for application evolution , Unifying batch and stream processing
- ( see also evolvability)
- from log-based messaging , Replaying old messages
request routing , Request Routing - Parallel Query Execution
- approaches to , Request Routing
- parallel query execution , Parallel Query Execution
resilient systems , Reliability
- ( see also fault tolerance)
response time
- as performance metric for services , Describing Performance , Batch Processing
- guarantees on , Response time guarantees
- latency versus , Describing Performance
- mean and percentiles , Describing Performance
- user experience , Describing Performance
responsibility and accountability , Responsibility and accountability
REST (Representational State Transfer) , Web services
- ( see also services)
RethinkDB (database)
- document data model , The Object-Relational Mismatch
- dynamic partitioning , Dynamic partitioning
- join support , Many-to-One and Many-to-Many Relationships , Convergence of document and relational databases
- key-range partitioning , Partitioning by Key Range
- leader-based replication , Leaders and Followers
- subscribing to changes , API support for change streams
Riak (database)
- Bitcask storage engine , Hash Indexes
- CRDTs , Custom conflict resolution logic , Merging concurrently written values
- dotted version vectors , Version vectors
- gossip protocol , Request Routing
- hash partitioning , Partitioning by Hash of Key , Fixed number of partitions
- last-write-wins conflict resolution , Last write wins (discarding concurrent writes)
- leaderless replication , Leaderless Replication
- LevelDB storage engine , Making an LSM-tree out of SSTables
- linearizability, lack of , Linearizability and quorums
- multi-datacenter support , Multi-datacenter operation
- preventing lost updates across replicas , Conflict resolution and replication
- rebalancing , Operations: Automatic or Manual Rebalancing
- search feature , Partitioning Secondary Indexes by Term
- secondary indexes , Partitioning Secondary Indexes by Document
- siblings (concurrently written values) , Merging concurrently written values
- sloppy quorums , Sloppy Quorums and Hinted Handoff
ring buffers , Disk space usage
Ripple (cryptocurrency) , Tools for auditable data systems
rockets , Human Errors , Are Document Databases Repeating History? , Byzantine Faults
RocksDB (storage engine) , Making an LSM-tree out of SSTables
- leveled compaction , Performance optimizations
rollbacks (transactions) , Transactions
rolling upgrades , Hardware Faults , Encoding and Evolution
routing ( see request routing)
row-oriented storage , Column-Oriented Storage
- row-based replication , Logical (row-based) log replication
rowhammer (memory corruption) , Trust, but Verify
RPCs ( see remote procedure calls)
Rubygems (package manager) , The move toward declarative query languages
rules (Datalog) , The Foundation: Datalog

S

safety and liveness properties , Safety and liveness
- in consensus algorithms , Fault-Tolerant Consensus
- in transactions , Transactions
sagas ( see compensating transactions)
Samza (stream processor) , Stream analytics , Maintaining materialized views
- fault tolerance , Rebuilding state after a failure
- streaming SQL support , Complex event processing
sandboxes , Human Errors
SAP HANA (database) , The divergence between OLTP databases and data warehouses
scalability , Scalability - Approaches for Coping with Load , The Future of Data Systems
- approaches for coping with load , Approaches for Coping with Load
- defined , Summary
- describing load , Describing Load
- describing performance , Describing Performance
- partitioning and , Partitioning
- replication and , Problems with Replication Lag
- scaling up versus scaling out , Distributed Data
scaling out , Approaches for Coping with Load , Distributed Data
- ( see also shared-nothing architecture)
scaling up , Approaches for Coping with Load , Distributed Data
scatter/gather approach, querying partitioned databases , Partitioning Secondary Indexes by Document
SCD (slowly changing dimension) , Time-dependence of joins
schema-on-read , Schema flexibility in the document model
- comparison to evolvable schema , The Merits of Schemas
- in distributed filesystems , Diversity of storage
schema-on-write , Schema flexibility in the document model
schemaless databases ( see schema-on-read)
schemas , Glossary
- Avro , Avro - Code generation and dynamically typed languages
  - reader determining writer’s schema , But what is the writer’s schema?
  - schema evolution , The writer’s schema and the reader’s schema
- dynamically generated , Dynamically generated schemas
- evolution of , Reprocessing data for application evolution
  - affecting application code , Encoding and Evolution
  - compatibility checking , But what is the writer’s schema?
  - in databases , Dataflow Through Databases - Archival storage
  - in message-passing , Distributed actor frameworks
  - in service calls , Data encoding and evolution for RPC
- flexibility in document model , Schema flexibility in the document model
- for analytics , Stars and Snowflakes: Schemas for Analytics
- for JSON and XML , JSON, XML, and Binary Variants
- merits of , The Merits of Schemas
- schema migration on railways , Reprocessing data for application evolution
- Thrift and Protocol Buffers , Thrift and Protocol Buffers - Datatypes and schema evolution
  - schema evolution , Field tags and schema evolution
- traditional approach to design, fallacy in , Deriving several views from the same event log
searches
- building search indexes in batch processes , Building search indexes
- k-nearest neighbors , Specialization for different domains
- on streams , Search on streams
- partitioned secondary indexes , Partitioning and Secondary Indexes
secondaries ( see leader-based replication)
secondary indexes , Other Indexing Structures , Glossary
- partitioning , Partitioning and Secondary Indexes - Partitioning Secondary Indexes by Term , Summary
  - document-partitioned , Partitioning Secondary Indexes by Document
  - index maintenance , Maintaining derived state
  - term-partitioned , Partitioning Secondary Indexes by Term
- problems with dual writes , Keeping Systems in Sync , Reasoning about dataflows
- updating, transaction isolation and , The need for multi-object transactions
secondary sorts , Sort-merge joins
sed (Unix tool) , Simple Log Analysis
self-describing files , Code generation and dynamically typed languages
self-joins , Summary
self-validating systems , A culture of verification
semantic web , The semantic web
semi-synchronous replication , Synchronous Versus Asynchronous Replication
sequence number ordering , Sequence Number Ordering - Timestamp ordering is not sufficient
- generators , Synchronized clocks for global snapshots , Noncausal sequence number generators
- insufficiency for enforcing constraints , Timestamp ordering is not sufficient
- Lamport timestamps , Lamport timestamps
- use of timestamps , Timestamps for ordering events , Synchronized clocks for global snapshots , Noncausal sequence number generators
sequential consistency , Implementing linearizable storage using total order broadcast
serializability , Isolation , Weak Isolation Levels , Serializability - Performance of serializable snapshot isolation , Glossary
- linearizability versus , What Makes a System Linearizable?
- pessimistic versus optimistic concurrency control , Pessimistic versus optimistic concurrency control
- serial execution , Actual Serial Execution - Summary of serial execution
  - partitioning , Partitioning
  - using stored procedures , Encapsulating transactions in stored procedures , Using total order broadcast
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
  - detecting stale MVCC reads , Detecting stale MVCC reads
  - detecting writes that affect prior reads , Detecting writes that affect prior reads
  - distributed execution , Performance of serializable snapshot isolation , Limitations of distributed transactions
  - performance of SSI , Performance of serializable snapshot isolation
  - preventing write skew , Decisions based on an outdated premise - Detecting writes that affect prior reads
- two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
  - index-range locks , Index-range locks
  - performance , Performance of two-phase locking
Serializable (Java) , Language-Specific Formats
serialization , Formats for Encoding Data
- ( see also encoding)
service discovery , Current directions for RPC , Request Routing , Service discovery
- using DNS , Request Routing , Service discovery
service level agreements (SLAs) , Describing Performance
service-oriented architecture (SOA) , Dataflow Through Services: REST and RPC
- ( see also services)
services , Dataflow Through Services: REST and RPC - Data encoding and evolution for RPC
- microservices , Dataflow Through Services: REST and RPC
  - causal dependencies across services , The limits of total ordering
  - loose coupling , Making unbundling work
- relation to batch/stream processors , Batch Processing , Stream processors and services
- remote procedure calls (RPCs) , The problems with remote procedure calls (RPCs) - Data encoding and evolution for RPC
  - issues with , The problems with remote procedure calls (RPCs)
- similarity to databases , Dataflow Through Services: REST and RPC
- web services , Web services , Current directions for RPC
session windows (stream processing) , Types of windows
- ( see also windows)
sessionization , GROUP BY
sharding ( see partitioning)
shared mode (locks) , Implementation of two-phase locking
shared-disk architecture , Distributed Data , MapReduce and Distributed Filesystems
shared-memory architecture , Distributed Data
shared-nothing architecture , Approaches for Coping with Load , Distributed Data , Glossary
- ( see also replication)
- distributed filesystems , MapReduce and Distributed Filesystems
  - ( see also distributed filesystems)
- partitioning , Partitioning
- use of network , Unreliable Networks
sharks
- biting undersea cables , Network Faults in Practice
- counting (example) , MapReduce Querying
- finding (example) , Query Languages for Data
- website about (example) , Declarative Queries on the Web
shredding (in relational model) , Which data model leads to simpler application code?
siblings (concurrent values) , Merging concurrently written values , Conflict resolution and replication
- ( see also conflicts)
similarity search
- edit distance , Full-text search and fuzzy indexes
- genome data , Summary
- k-nearest neighbors , Specialization for different domains
single-leader replication ( see leader-based replication)
single-threaded execution , Atomic write operations , Actual Serial Execution
- in batch processing , Bringing related data together in the same place , Dataflow engines , Parallel execution
- in stream processing , Logs compared to traditional messaging , Concurrency control , Uniqueness in log-based messaging
size-tiered compaction , Performance optimizations
skew , Glossary
- clock skew , Relying on Synchronized Clocks - Clock readings have a confidence interval , Implementing Linearizable Systems
- in transaction isolation
  - read skew , Snapshot Isolation and Repeatable Read , Summary
  - write skew , Write Skew and Phantoms - Materializing conflicts , Decisions based on an outdated premise - Detecting writes that affect prior reads
    - ( see also write skew)
- meanings of , Snapshot Isolation and Repeatable Read
- unbalanced workload , Partitioning of Key-Value Data
  - compensating for , Skewed Workloads and Relieving Hot Spots
  - due to celebrities , Skewed Workloads and Relieving Hot Spots
  - for time-series data , Partitioning by Key Range
  - in batch processing , Handling skew
slaves ( see leader-based replication)
sliding windows (stream processing) , Types of windows
- ( see also windows)
sloppy quorums , Sloppy Quorums and Hinted Handoff
- ( see also quorums)
- lack of linearizability , Implementing Linearizable Systems
slowly changing dimension (data warehouses) , Time-dependence of joins
smearing (leap seconds adjustments) , Clock Synchronization and Accuracy
snapshots (databases)
- causal consistency , Ordering and Causality
- computing derived data , Creating an index
- in change data capture , Initial snapshot
- serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation , What Makes a System Linearizable?
- setting up a new replica , Setting Up New Followers
- snapshot isolation and repeatable read , Snapshot Isolation and Repeatable Read - Repeatable read and naming confusion
  - implementing with MVCC , Implementing snapshot isolation
  - indexes and MVCC , Indexes and snapshot isolation
  - visibility rules , Visibility rules for observing a consistent snapshot
- synchronized clocks for global snapshots , Synchronized clocks for global snapshots
snowflake schemas , Stars and Snowflakes: Schemas for Analytics
SOAP , Web services
- ( see also services)
- evolvability , Data encoding and evolution for RPC
software bugs , Software Errors
- maintaining integrity , Maintaining integrity in the face of software bugs
solid state drives (SSDs)
- access patterns , Advantages of LSM-trees
- detecting corruption , The end-to-end argument , Don’t just blindly trust what they promise
- faults in , Durability
- sequential write throughput , Hash Indexes
Solr (search server)
- building indexes in batch processes , Building search indexes
- document-partitioned indexes , Partitioning Secondary Indexes by Document
- request routing , Request Routing
- usage example , Thinking About Data Systems
- use of Lucene , Making an LSM-tree out of SSTables
sort (Unix tool) , Simple Log Analysis , Sorting versus in-memory aggregation , The Unix Philosophy
sort-merge joins (MapReduce) , Sort-merge joins
Sorted String Tables ( see SSTables)
sorting
- sort order in column storage , Sort Order in Column Storage
source of truth ( see systems of record)
Spanner (database)
- data locality , Data locality for queries
- snapshot isolation using clocks , Synchronized clocks for global snapshots
- TrueTime API , Clock readings have a confidence interval
Spark (processing framework) , Dataflow engines - Discussion of materialization
- bytecode generation , The move toward declarative query languages
- dataflow APIs , High-Level APIs and Languages
- fault tolerance , Fault tolerance
- for data warehouses , The divergence between OLTP databases and data warehouses
- GraphX API (graph processing) , The Pregel processing model
- machine learning , Specialization for different domains
- query optimizer , The move toward declarative query languages
- Spark Streaming , Stream analytics
  - microbatching , Microbatching and checkpointing
- stream processing on top of batch processing , Batch and Stream Processing
SPARQL (query language) , The SPARQL query language
spatial algorithms , Specialization for different domains
split brain , Leader failure: Failover , Glossary
- in consensus algorithms , Distributed Transactions and Consensus , Single-leader replication and consensus
- preventing , Consistency and Consensus , Implementing Linearizable Systems
- using fencing tokens to avoid , The leader and the lock - Fencing tokens
spreadsheets, dataflow programming capabilities , Designing Applications Around Dataflow
SQL (Structured Query Language) , Simplicity: Managing Complexity , Relational Model Versus Document Model , Query Languages for Data
- advantages and limitations of , Diversity of processing models
- distributed query execution , MapReduce Querying
- graph queries in , Graph Queries in SQL
- isolation levels standard, issues with , Repeatable read and naming confusion
- query execution on Hadoop , Diversity of processing models
- résumé (example) , The Object-Relational Mismatch
- SQL injection vulnerability , Byzantine Faults
- SQL on Hadoop , The divergence between OLTP databases and data warehouses
- statement-based replication , Statement-based replication
- stored procedures , Pros and cons of stored procedures
SQL Server (database)
- data warehousing support , The divergence between OLTP databases and data warehouses
- distributed transaction support , XA transactions
- leader-based replication , Leaders and Followers
- preventing lost updates , Automatically detecting lost updates
- preventing write skew , Characterizing write skew , Implementation of two-phase locking
- read committed isolation , Implementing read committed
- recursive query support , Graph Queries in SQL
- serializable isolation , Implementation of two-phase locking
- snapshot isolation support , Snapshot Isolation and Repeatable Read
- T-SQL language , Pros and cons of stored procedures
- XML support , The Object-Relational Mismatch
SQLstream (stream analytics) , Complex event processing
SSDs ( see solid state drives)
SSTables (storage format) , SSTables and LSM-Trees - Performance optimizations
- advantages over hash indexes , SSTables and LSM-Trees
- concatenated index , Partitioning by Hash of Key
- constructing and maintaining , Constructing and maintaining SSTables
- making LSM-Tree from , Making an LSM-tree out of SSTables
staleness (old data) , Reading Your Own Writes
- cross-channel timing dependencies , Cross-channel timing dependencies
- in leaderless databases , Writing to the Database When a Node Is Down
- in multi-version concurrency control , Detecting stale MVCC reads
- monitoring for , Monitoring staleness
- of client state , Pushing state changes to clients
- versus linearizability , Linearizability
- versus timeliness , Timeliness and Integrity
standbys ( see leader-based replication)
star replication topologies , Multi-Leader Replication Topologies
star schemas , Stars and Snowflakes: Schemas for Analytics - Stars and Snowflakes: Schemas for Analytics
- similarity to event sourcing , Event Sourcing
Star Wars analogy (event time versus processing time) , Event time versus processing time
state
- derived from log of immutable events , State, Streams, and Immutability
- deriving current state from the event log , Deriving current state from the event log
- interplay between state changes and application code , Dataflow: Interplay between state changes and application code
- maintaining derived state , Maintaining derived state
- maintenance by stream processor in stream-stream joins , Stream-stream join (window join)
- observing derived state , Observing Derived State - Multi-partition data processing
- rebuilding after stream processor failure , Rebuilding state after a failure
- separation of application code and , Separation of application code and state
state machine replication , Using total order broadcast , Databases and Streams
statement-based replication , Statement-based replication
statically typed languages
- analogy to schema-on-write , Schema flexibility in the document model
- code generation and , Code generation and dynamically typed languages
statistical and numerical algorithms , Specialization for different domains
StatsD (metrics aggregator) , Direct messaging from producers to consumers
stdin, stdout , A uniform interface , Separation of logic and wiring
Stellar (cryptocurrency) , Tools for auditable data systems
stock market feeds , Direct messaging from producers to consumers
STONITH (Shoot The Other Node In The Head) , Leader failure: Failover
stop-the-world ( see garbage collection)
storage
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
- diversity of, in MapReduce , Diversity of storage
Storage Area Network (SAN) , Distributed Data , MapReduce and Distributed Filesystems
storage engines , Storage and Retrieval - Summary
- column-oriented , Column-Oriented Storage - Writing to Column-Oriented Storage
  - column compression , Column Compression - Memory bandwidth and vectorized processing
  - defined , Column-Oriented Storage
  - distinction between column families and , Column Compression
  - Parquet , Column-Oriented Storage , Archival storage
  - sort order in , Sort Order in Column Storage - Several different sort orders
  - writing to , Writing to Column-Oriented Storage
- comparing requirements for transaction processing and analytics , Transaction Processing or Analytics? - Column-Oriented Storage
- in-memory storage , Keeping everything in memory
  - durability , Durability
- row-oriented , Data Structures That Power Your Database - Keeping everything in memory
  - B-trees , B-Trees - B-tree optimizations
  - comparing B-trees and LSM-trees , Comparing B-Trees and LSM-Trees - Downsides of LSM-trees
  - defined , Column-Oriented Storage
  - log-structured , Hash Indexes - Performance optimizations
stored procedures , Trigger-based replication , Encapsulating transactions in stored procedures - Pros and cons of stored procedures , Glossary
- and total order broadcast , Using total order broadcast
- pros and cons of , Pros and cons of stored procedures
- similarity to stream processors , Application code as a derivation function
Storm (stream processor) , Stream analytics
- distributed RPC , Message passing and RPC , Multi-partition data processing
- Trident state handling , Idempotence
straggler events , Knowing when you’re ready , The lambda architecture
stream processing , Processing Streams - Summary , Glossary
- accessing external services within job , Stream-table join (stream enrichment) , Microbatching and checkpointing , Idempotence , Exactly-once execution of an operation
- combining with batch processing
  - lambda architecture , The lambda architecture
  - unifying technologies , Unifying batch and stream processing
- comparison to batch processing , Processing Streams
- complex event processing (CEP) , Complex event processing
- fault tolerance , Fault Tolerance - Rebuilding state after a failure
  - atomic commit , Atomic commit revisited
  - idempotence , Idempotence
  - microbatching and checkpointing , Microbatching and checkpointing
  - rebuilding state after a failure , Rebuilding state after a failure
- for data integration , Batch and Stream Processing - Unifying batch and stream processing
- maintaining derived state , Maintaining derived state
- maintenance of materialized views , Maintaining materialized views
- messaging systems ( see messaging systems)
- reasoning about time , Reasoning About Time - Types of windows
  - event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
  - knowing when window is ready , Knowing when you’re ready
  - types of windows , Types of windows
- relation to databases ( see streams)
- relation to services , Stream processors and services
- search on streams , Search on streams
- single-threaded execution , Logs compared to traditional messaging , Concurrency control
- stream analytics , Stream analytics
- stream joins , Stream Joins - Time-dependence of joins
  - stream-stream join , Stream-stream join (window join)
  - stream-table join , Stream-table join (stream enrichment)
  - table-table join , Table-table join (materialized view maintenance)
  - time-dependence of , Time-dependence of joins
streams , Stream Processing - Replaying old messages
- end-to-end, pushing events to clients , End-to-end event streams
- messaging systems ( see messaging systems)
- processing ( see stream processing)
- relation to databases , Databases and Streams - Limitations of immutability
  - ( see also changelogs)
  - API support for change streams , API support for change streams
  - change data capture , Change Data Capture - API support for change streams
  - derivative of state by time , State, Streams, and Immutability
  - event sourcing , Event Sourcing - Commands and events
  - keeping systems in sync , Keeping Systems in Sync - Keeping Systems in Sync
  - philosophy of immutable events , State, Streams, and Immutability - Limitations of immutability
- topics , Transmitting Event Streams
strict serializability , What Makes a System Linearizable?
strong consistency ( see linearizability)
strong one-copy serializability , What Makes a System Linearizable?
subjects, predicates, and objects (in triple-stores) , Triple-Stores and SPARQL
subscribers (message streams) , Transmitting Event Streams
- ( see also consumers)
supercomputers , Cloud Computing and Supercomputing
surveillance , Surveillance
- ( see also privacy)
Swagger (service definition format) , Web services
swapping to disk ( see virtual memory)
synchronous networks , Synchronous Versus Asynchronous Networks , Glossary
- comparison to asynchronous networks , Synchronous Versus Asynchronous Networks
- formal model , System Model and Reality
synchronous replication , Synchronous Versus Asynchronous Replication , Glossary
- chain replication , Synchronous Versus Asynchronous Replication
- conflict detection , Synchronous versus asynchronous conflict detection
system models , Knowledge, Truth, and Lies , System Model and Reality - Mapping system models to the real world
- assumptions in , Trust, but Verify
- correctness of algorithms , Correctness of an algorithm
- mapping to the real world , Mapping system models to the real world
- safety and liveness , Safety and liveness
systems of record , Derived Data , Glossary
- change data capture , Implementing change data capture , Reasoning about dataflows
- treating event log as , State, Streams, and Immutability
systems thinking , Feedback loops

T

t-digest (algorithm) , Describing Performance
table-table joins , Table-table join (materialized view maintenance)
Tableau (data visualization software) , Diversity of processing models
tail (Unix tool) , Using logs for message storage
tail vertex (property graphs) , Property Graphs
Tajo (query engine) , The divergence between OLTP databases and data warehouses
Tandem NonStop SQL (database) , Partitioning
TCP (Transmission Control Protocol) , Cloud Computing and Supercomputing
- comparison to circuit switching , Can we not simply make network delays predictable?
- comparison to UDP , Network congestion and queueing
- connection failures , Detecting Faults
- flow control , Network congestion and queueing , Messaging Systems
- packet checksums , Weak forms of lying , The end-to-end argument , Trust, but Verify
- reliability and duplicate suppression , Duplicate suppression
- retransmission timeouts , Network congestion and queueing
- use for transaction sessions , Single-Object and Multi-Object Operations
telemetry ( see monitoring)
Teradata (database) , The divergence between OLTP databases and data warehouses , Partitioning
term-partitioned indexes , Partitioning Secondary Indexes by Term , Summary
termination (consensus) , Fault-Tolerant Consensus
Terrapin (database) , Key-value stores as batch process output
Tez (dataflow engine) , Dataflow engines - Discussion of materialization
- fault tolerance , Fault tolerance
- support by higher-level tools , High-Level APIs and Languages
thrashing (out of memory) , Process Pauses
threads (concurrency)
- actor model , Distributed actor frameworks , Message passing and RPC
  - ( see also message-passing)
- atomic operations , Atomicity
- background threads , Hash Indexes , Downsides of LSM-trees
- execution pauses , Can we not simply make network delays predictable? , Process Pauses - Process Pauses
- memory barriers , Linearizability and network delays
- preemption , Process Pauses
- single ( see single-threaded execution)
three-phase commit , Three-phase commit
Thrift (data format) , Thrift and Protocol Buffers - Datatypes and schema evolution
- BinaryProtocol , Thrift and Protocol Buffers
- CompactProtocol , Thrift and Protocol Buffers
- field tags and schema evolution , Field tags and schema evolution
throughput , Describing Performance , Batch Processing
TIBCO , Message brokers
- Enterprise Message Service , Message brokers compared to databases
- StreamBase (stream analytics) , Complex event processing
time
- concurrency and , The “happens-before” relationship and concurrency
- cross-channel timing dependencies , Cross-channel timing dependencies
- in distributed systems , Unreliable Clocks - Limiting the impact of garbage collection
  - ( see also clocks)
  - clock synchronization and accuracy , Clock Synchronization and Accuracy
  - relying on synchronized clocks , Relying on Synchronized Clocks - Synchronized clocks for global snapshots
- process pauses , Process Pauses - Limiting the impact of garbage collection
- reasoning about, in stream processors , Reasoning About Time - Types of windows
  - event time versus processing time , Event time versus processing time , Microbatching and checkpointing , Unifying batch and stream processing
  - knowing when window is ready , Knowing when you’re ready
  - timestamp of events , Whose clock are you using, anyway?
  - types of windows , Types of windows
- system models for distributed systems , System Model and Reality
- time-dependence in stream joins , Time-dependence of joins
time-of-day clocks , Time-of-day clocks
timeliness , Timeliness and Integrity
- coordination-avoiding data systems , Coordination-avoiding data systems
- correctness of dataflow systems , Correctness of dataflow systems
timeouts , Unreliable Networks , Glossary
- dynamic configuration of , Network congestion and queueing
- for failover , Leader failure: Failover
- length of , Timeouts and Unbounded Delays
timestamps , Sequence Number Ordering
- assigning to events in stream processing , Whose clock are you using, anyway?
- for read-after-write consistency , Reading Your Own Writes
- for transaction ordering , Synchronized clocks for global snapshots
- insufficiency for enforcing constraints , Timestamp ordering is not sufficient
- key range partitioning by , Partitioning by Key Range
- Lamport , Lamport timestamps
- logical , Ordering events to capture causality
- ordering events , Timestamps for ordering events , Noncausal sequence number generators
Titan (database) , Graph-Like Data Models
tombstones , Hash Indexes , Merging concurrently written values , Log compaction
topics (messaging) , Message brokers , Transmitting Event Streams
total order , The causal order is not a total order , Glossary
- limits of , The limits of total ordering
- sequence numbers or timestamps , Sequence Number Ordering
total order broadcast , Total Order Broadcast - Implementing total order broadcast using linearizable storage , The limits of total ordering , Uniqueness in log-based messaging
- consensus algorithms and , Consensus algorithms and total order broadcast - Epoch numbering and quorums
- implementation in ZooKeeper and etcd , Membership and Coordination Services
- implementing with linearizable storage , Implementing total order broadcast using linearizable storage
- using , Using total order broadcast
- using to implement linearizable storage , Implementing linearizable storage using total order broadcast
tracking behavioral data , Privacy and Tracking
- ( see also privacy)
transaction coordinator ( see coordinator)
transaction manager ( see coordinator)
transaction processing , Relational Model Versus Document Model , Transaction Processing or Analytics? - Stars and Snowflakes: Schemas for Analytics
- comparison to analytics , Transaction Processing or Analytics?
- comparison to data warehousing , The divergence between OLTP databases and data warehouses
transactions , Transactions - Summary , Glossary
- ACID properties of , The Meaning of ACID
  - atomicity , Atomicity
  - consistency , Consistency
  - durability , Durability
  - isolation , Isolation
- compensating ( see compensating transactions)
- concept of , The Slippery Concept of a Transaction
- distributed transactions , Distributed Transactions and Consensus - Limitations of distributed transactions
  - avoiding , Derived data versus distributed transactions , Making unbundling work , Enforcing Constraints - Coordination-avoiding data systems
  - failure amplification , Limitations of distributed transactions , Maintaining derived state
  - in doubt/uncertain status , Coordinator failure , Holding locks while in doubt
  - two-phase commit , Atomic Commit and Two-Phase Commit (2PC) - Three-phase commit
  - use of , Distributed Transactions in Practice - Exactly-once message processing
  - XA transactions , XA transactions - Limitations of distributed transactions
- OLTP versus analytics queries , The Output of Batch Workflows
- purpose of , Transactions
- serializability , Serializability - Performance of serializable snapshot isolation
  - actual serial execution , Actual Serial Execution - Summary of serial execution
  - pessimistic versus optimistic concurrency control , Pessimistic versus optimistic concurrency control
  - serializable snapshot isolation (SSI) , Serializable Snapshot Isolation (SSI) - Performance of serializable snapshot isolation
  - two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks
- single-object and multi-object , Single-Object and Multi-Object Operations - Handling errors and aborts
  - handling errors and aborts , Handling errors and aborts
  - need for multi-object transactions , The need for multi-object transactions
  - single-object writes , Single-object writes
- snapshot isolation ( see snapshots)
- weak isolation levels , Weak Isolation Levels - Materializing conflicts
  - preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
  - read committed , Read Committed - Snapshot Isolation and Repeatable Read
transitive closure (graph algorithm) , Graphs and Iterative Processing
trie (data structure) , Full-text search and fuzzy indexes
triggers (databases) , Trigger-based replication , Transmitting Event Streams
- implementing change data capture , Implementing change data capture
- implementing replication , Trigger-based replication
triple-stores , Triple-Stores and SPARQL - The SPARQL query language
- SPARQL query language , The SPARQL query language
tumbling windows (stream processing) , Types of windows
- ( see also windows)
- in microbatching , Microbatching and checkpointing
tuple spaces (programming model) , Dataflow: Interplay between state changes and application code
Turtle (RDF data format) , Triple-Stores and SPARQL
Twitter
- constructing home timelines (example) , Describing Load , Deriving several views from the same event log , Table-table join (materialized view maintenance) , Materialized views and caching
- DistributedLog (event log) , Using logs for message storage
- Finagle (RPC framework) , Current directions for RPC
- Snowflake (sequence number generator) , Synchronized clocks for global snapshots
- Summingbird (processing library) , The lambda architecture
two-phase commit (2PC) , Distributed Transactions and Consensus , Introduction to two-phase commit - Coordinator failure , Glossary
- confusion with two-phase locking , Introduction to two-phase commit
- coordinator failure , Coordinator failure
- coordinator recovery , Recovering from coordinator failure
- how it works , A system of promises
- issues in practice , Limitations of distributed transactions
- performance cost , Distributed Transactions in Practice
- transactions holding locks , Holding locks while in doubt
two-phase locking (2PL) , Two-Phase Locking (2PL) - Index-range locks , What Makes a System Linearizable? , Glossary
- confusion with two-phase commit , Introduction to two-phase commit
- index-range locks , Index-range locks
- performance of , Performance of two-phase locking
type checking, dynamic versus static , Schema flexibility in the document model

U

UDP (User Datagram Protocol)
- comparison to TCP , Network congestion and queueing
- multicast , Direct messaging from producers to consumers
unbounded datasets , Stream Processing , Glossary
- ( see also streams)
unbounded delays , Glossary
- in networks , Timeouts and Unbounded Delays
- process pauses , Process Pauses
unbundling databases , Unbundling Databases - Multi-partition data processing
- composing data storage technologies , Composing Data Storage Technologies - What’s missing?
  - federation versus unbundling , The meta-database of everything
  - need for high-level language , What’s missing?
- designing applications around dataflow , Designing Applications Around Dataflow - Stream processors and services
- observing derived state , Observing Derived State - Multi-partition data processing
  - materialized views and caching , Materialized views and caching
  - multi-partition data processing , Multi-partition data processing
  - pushing state changes to clients , Pushing state changes to clients
uncertain (transaction status) ( see in doubt)
uniform consensus , Fault-Tolerant Consensus
- ( see also consensus)
uniform interfaces , A uniform interface
union type (in Avro) , Schema evolution rules
uniq (Unix tool) , Simple Log Analysis
uniqueness constraints
- asynchronously checked , Loosely interpreted constraints
- requiring consensus , Uniqueness constraints require consensus
- requiring linearizability , Constraints and uniqueness guarantees
- uniqueness in log-based messaging , Uniqueness in log-based messaging
Unix philosophy , The Unix Philosophy - Transparency and experimentation
- command-line batch processing , Batch Processing with Unix Tools - Sorting versus in-memory aggregation
  - Unix pipes versus dataflow engines , Discussion of materialization
- comparison to Hadoop , Philosophy of batch process outputs - Philosophy of batch process outputs
- comparison to relational databases , Unbundling Databases , The meta-database of everything
- comparison to stream processing , Processing Streams
- composability and uniform interfaces , The Unix Philosophy
- loose coupling , Separation of logic and wiring
- pipes , The Unix Philosophy
- relation to Hadoop , Unbundling Databases
UPDATE statement (SQL) , Schema flexibility in the document model
updates
- preventing lost updates , Preventing Lost Updates - Conflict resolution and replication
  - atomic write operations , Atomic write operations
  - automatically detecting lost updates , Automatically detecting lost updates
  - compare-and-set operations , Compare-and-set
  - conflict resolution and replication , Conflict resolution and replication
  - using explicit locking , Explicit locking
- preventing write skew , Write Skew and Phantoms - Materializing conflicts

V

validity (consensus) , Fault-Tolerant Consensus
vBuckets (partitioning) , Partitioning
vector clocks , Version vectors
- ( see also version vectors)
vectorized processing , Memory bandwidth and vectorized processing , The move toward declarative query languages
verification , Trust, but Verify - Tools for auditable data systems
- avoiding blind trust , Don’t just blindly trust what they promise
- culture of , A culture of verification
- designing for auditability , Designing for auditability
- end-to-end integrity checks , The end-to-end argument again
- tools for auditable data systems , Tools for auditable data systems
version control systems, reliance on immutable data , Limitations of immutability
version vectors , Multi-Leader Replication Topologies , Version vectors
- capturing causal dependencies , Capturing causal dependencies
- versus vector clocks , Version vectors
Vertica (database) , The divergence between OLTP databases and data warehouses
- handling writes , Writing to Column-Oriented Storage
- replicas using different sort orders , Several different sort orders
vertical scaling ( see scaling up)
vertices (in graphs) , Graph-Like Data Models
- property graph model , Property Graphs
Viewstamped Replication (consensus algorithm) , Consensus algorithms and total order broadcast
- view number , Epoch numbering and quorums
virtual machines , Distributed Data
- ( see also cloud computing)
- context switches , Process Pauses
- network performance , Network congestion and queueing
- noisy neighbors , Network congestion and queueing
- reliability in cloud services , Hardware Faults
- virtualized clocks in , Clock Synchronization and Accuracy
virtual memory
- process pauses due to page faults , Describing Performance , Process Pauses
- versus memory management by databases , Keeping everything in memory
VisiCalc (spreadsheets) , Designing Applications Around Dataflow
vnodes (partitioning) , Partitioning
Voice over IP (VoIP) , Network congestion and queueing
Voldemort (database)
- building read-only stores in batch processes , Key-value stores as batch process output
- hash partitioning , Partitioning by Hash of Key - Partitioning by Hash of Key , Fixed number of partitions
- leaderless replication , Leaderless Replication
- multi-datacenter support , Multi-datacenter operation
- rebalancing , Operations: Automatic or Manual Rebalancing
- reliance on read repair , Read repair and anti-entropy
- sloppy quorums , Sloppy Quorums and Hinted Handoff
VoltDB (database)
- cross-partition serializability , Partitioning
- deterministic stored procedures , Pros and cons of stored procedures
- in-memory storage , Keeping everything in memory
- output streams , API support for change streams
- secondary indexes , Partitioning Secondary Indexes by Document
- serial execution of transactions , Actual Serial Execution
- statement-based replication , Statement-based replication , Rebuilding state after a failure
- transactions in stream processing , Atomic commit revisited

W

WAL (write-ahead log) , Making B-trees reliable
web services ( see services)
Web Services Description Language (WSDL) , Web services
webhooks , Direct messaging from producers to consumers
webMethods (messaging) , Message brokers
WebSocket (protocol) , Pushing state changes to clients
windows (stream processing) , Stream analytics , Reasoning About Time - Types of windows
- infinite windows for changelogs , Maintaining materialized views , Stream-table join (stream enrichment)
- knowing when all events have arrived , Knowing when you’re ready
- stream joins within a window , Stream-stream join (window join)
- types of windows , Types of windows
winners (conflict resolution) , Converging toward a consistent state
WITH RECURSIVE syntax (SQL) , Graph Queries in SQL
workflows (MapReduce) , MapReduce workflows
- outputs , The Output of Batch Workflows - Philosophy of batch process outputs
  - key-value stores , Key-value stores as batch process output
  - search indexes , Building search indexes
- with map-side joins , MapReduce workflows with map-side joins
working set , Sorting versus in-memory aggregation
write amplification , Advantages of LSM-trees
write path (derived data) , Observing Derived State
write skew (transaction isolation) , Write Skew and Phantoms - Materializing conflicts
- characterizing , Write Skew and Phantoms - Phantoms causing write skew , Decisions based on an outdated premise
- examples of , Write Skew and Phantoms , More examples of write skew
- materializing conflicts , Materializing conflicts
- occurrence in practice , Maintaining integrity in the face of software bugs
- phantoms , Phantoms causing write skew
- preventing
  - in snapshot isolation , Decisions based on an outdated premise - Detecting writes that affect prior reads
  - in two-phase locking , Predicate locks - Index-range locks
  - options for , Characterizing write skew
write-ahead log (WAL) , Making B-trees reliable , Write-ahead log (WAL) shipping
writes (database)
- atomic write operations , Atomic write operations
- detecting writes affecting prior reads , Detecting writes that affect prior reads
- preventing dirty writes with read committed , No dirty writes
WS-* framework , Web services
- (see also services)
WS-AtomicTransaction (2PC) , Introduction to two-phase commit

X

XA transactions , Introduction to two-phase commit , XA transactions - Limitations of distributed transactions
- heuristic decisions , Recovering from coordinator failure
- limitations of , Limitations of distributed transactions
xargs (Unix tool) , Simple Log Analysis , A uniform interface
XML
- binary variants , Binary encoding
- encoding RDF data , The RDF data model
- for application data, issues with , JSON, XML, and Binary Variants
- in relational databases , The Object-Relational Mismatch , Convergence of document and relational databases
XSL/XPath , Declarative Queries on the Web

Y

Yahoo!
- Pistachio (database) , Deriving several views from the same event log
- Sherpa (database) , Implementing change data capture
YARN (job scheduler) , Diversity of processing models , Separation of application code and state
- preemption of jobs , Designing for frequent faults
- use of ZooKeeper , Membership and Coordination Services

Z

Zab (consensus algorithm) , Consensus algorithms and total order broadcast
- use in ZooKeeper , Distributed Transactions and Consensus
ZeroMQ (messaging library) , Direct messaging from producers to consumers
ZooKeeper (coordination service) , Membership and Coordination Services - Membership services
- generating fencing tokens , Fencing tokens , Using total order broadcast , Membership and Coordination Services
- linearizable operations , Implementing Linearizable Systems , Implementing linearizable storage using total order broadcast
- locks and leader election , Locking and leader election
- service discovery , Service discovery
- use for partition assignment , Request Routing , Allocating work to nodes
- use of Zab algorithm , Using total order broadcast , Distributed Transactions and Consensus , Consensus algorithms and total order broadcast