第五章 复制
Chapter 5. Replication
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.
可能出错的事情和不可能出错的事情的主要区别在于:当不可能出错的事情出错时,通常会发现它根本无法接近或修复。
Douglas Adams, Mostly Harmless (1992)
道格拉斯·亚当斯,《几乎无害》(1992年)
Replication means keeping a copy of the same data on multiple machines that are connected via a network. As discussed in the introduction to Part II , there are several reasons why you might want to replicate data:
复制意味着在通过网络连接的多台机器上保存相同数据的副本。正如第二部分引言中所讨论的,你可能出于以下几个原因想要复制数据:
-
To keep data geographically close to your users (and thus reduce latency)
使数据在地理位置上更接近用户(从而减少延迟)。
-
To allow the system to continue working even if some of its parts have failed (and thus increase availability)
允许系统即使一些部件出现故障也能正常运行(从而提高可用性)。
-
To scale out the number of machines that can serve read queries (and thus increase read throughput)
扩展可处理读查询的机器数量(从而增加读取吞吐量)。
In this chapter we will assume that your dataset is so small that each machine can hold a copy of the entire dataset. In Chapter 6 we will relax that assumption and discuss partitioning ( sharding ) of datasets that are too big for a single machine. In later chapters we will discuss various kinds of faults that can occur in a replicated data system, and how to deal with them.
在本章中,我们假设您的数据集很小,以至于每台机器都可以容纳整个数据集的副本。在第6章中,我们将放宽这个假设,并讨论数据集的分区(分片)——这些数据集太大而无法放入一台机器中。在后面的章节中,我们将讨论在复制数据系统中可能发生的各种故障,以及如何处理它们。
If the data that you’re replicating does not change over time, then replication is easy: you just need to copy the data to every node once, and you’re done. All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: single-leader , multi-leader , and leaderless replication. Almost all distributed databases use one of these three approaches. They all have various pros and cons, which we will examine in detail.
如果你要复制的数据不会随时间变化,那么复制就很容易:你只需要将数据复制到每个节点一次,就完成了。所有复制中的困难都在于处理复制数据的变化,这就是本章讨论的内容。我们将讨论三种流行的节点间复制变化的算法:单主节点、多主节点和无主节点复制。几乎所有分布式数据库都使用这三种方法中的一种。它们都有不同的优缺点,我们将详细研究。
There are many trade-offs to consider with replication: for example, whether to use synchronous or asynchronous replication, and how to handle failed replicas. Those are often configuration options in databases, and although the details vary by database, the general principles are similar across many different implementations. We will discuss the consequences of such choices in this chapter.
使用数据复制需要考虑很多折衷方案,例如同步或异步复制和如何处理失败的副本。这些通常是数据库中的配置选项,尽管实现细节因数据库而异,但一般原则在许多不同实现中都是类似的。本章将讨论这些选择的后果。
Replication of databases is an old topic—the principles haven’t changed much since they were studied in the 1970s [ 1 ], because the fundamental constraints of networks have remained the same. However, outside of research, many developers continued to assume for a long time that a database consisted of just one node. Mainstream use of distributed databases is more recent. Since many application developers are new to this area, there has been a lot of misunderstanding around issues such as eventual consistency . In “Problems with Replication Lag” we will get more precise about eventual consistency and discuss things like the read-your-writes and monotonic reads guarantees.
数据库复制是一个古老的话题——自1970年代被研究以来,其原理并没有太大变化,因为网络的基本约束保持不变。然而,在研究之外,许多开发人员长期以来一直假定数据库只由一个节点组成。分布式数据库的主流应用则是比较近期的事。由于许多应用开发人员对这个领域还不熟悉,围绕最终一致性等问题存在许多误解。在“复制滞后的问题”中,我们将更精确地讨论最终一致性,并讨论诸如读己之写和单调读等保证。
Leaders and Followers
Each node that stores a copy of the database is called a replica . With multiple replicas, a question inevitably arises: how do we ensure that all the data ends up on all the replicas?
每个存储数据库副本的节点被称为副本。有了多个副本,不可避免地会出现一个问题:我们如何确保所有的数据都最终存储在所有的副本上?
Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data. The most common solution for this is called leader-based replication (also known as active/passive or master–slave replication ) and is illustrated in Figure 5-1 . It works as follows:
每次写入数据库都需要由每个副本进行处理,否则副本将不再包含相同的数据。最常见的解决方案称为基于主节点的复制(也称为主动/被动或主从复制),如图5-1所示。其工作如下:
-
One of the replicas is designated the leader (also known as master or primary ). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
其中一个副本被指定为领导者(也称为主库,即 master 或 primary)。当客户端要向数据库写入时,必须将请求发送给领导者,领导者首先将新数据写入其本地存储。
-
The other replicas are known as followers ( read replicas , slaves , secondaries , or hot standbys ). i Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream . Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader.
其他副本称为跟随者(也叫只读副本、从库、次要副本或热备)。每当领导者将新数据写入其本地存储时,它还会把数据变更作为复制日志或变更流的一部分发送给所有跟随者。每个跟随者从领导者获取日志,并按照与领导者处理写入相同的顺序应用所有写入,从而相应地更新自己的本地数据库副本。
-
When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client’s point of view).
当客户端想要从数据库中读取数据时,它可以查询领导者或任何一个追随者。但是,只有领导者接受写入(从客户端的角度来看,追随者只读)。
This mode of replication is a built-in feature of many relational databases, such as PostgreSQL (since version 9.0), MySQL, Oracle Data Guard [ 2 ], and SQL Server’s AlwaysOn Availability Groups [ 3 ]. It is also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso [ 4 ]. Finally, leader-based replication is not restricted to only databases: distributed message brokers such as Kafka [ 5 ] and RabbitMQ highly available queues [ 6 ] also use it. Some network filesystems and replicated block devices such as DRBD are similar.
这种复制模式是许多关系型数据库内置的功能,如PostgreSQL(自版本9.0起),MySQL,Oracle Data Guard和SQL Server的AlwaysOn可用性组。它还用于一些非关系型数据库,包括MongoDB、RethinkDB和Espresso。最后,基于leader的复制不仅仅局限于数据库:分布式消息代理,如Kafka和RabbitMQ高可用队列,也使用它。一些网络文件系统和复制块设备,如DRBD,也类似。
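To make the division of responsibilities concrete, here is a minimal sketch of the leader/follower write path in Python. The class and method names (Leader, Follower, apply) are hypothetical illustrations of the idea, not the API of any particular database.

```python
class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, change):
        # Apply writes in the same order they were processed on the leader.
        key, value = change
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        # Writes are only accepted on the leader.
        self.data[key] = value
        # Forward the change to all followers as part of the change stream.
        for follower in self.followers:
            follower.apply((key, value))

    def read(self, key):
        return self.data.get(key)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:42:avatar", "cat.png")
print(followers[0].read("user:42:avatar"))  # reads can be served by any replica
```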
Synchronous Versus Asynchronous Replication
An important detail of a replicated system is whether the replication happens synchronously or asynchronously . (In relational databases, this is often a configurable option; other systems are often hardcoded to be either one or the other.)
复制系统的一个重要细节是复制是同步还是异步发生的。(在关系型数据库中,这常常是一个可配置的选项;其他系统常常是硬编码为其中的一个。)
Think about what happens in Figure 5-1 , where the user of a website updates their profile image. At some point in time, the client sends the update request to the leader; shortly afterward, it is received by the leader. At some point, the leader forwards the data change to the followers. Eventually, the leader notifies the client that the update was successful.
想想在图5-1中发生了什么,其中一个网站用户更新了其个人资料图片。在某个时间点,客户端将更新请求发送给领导者;不久之后,领导者接收到该请求。然后,领导者将数据更改转发给跟随者。最终,领导者会通知客户端更新成功。
Figure 5-2 shows the communication between various components of the system: the user’s client, the leader, and two followers. Time flows from left to right. A request or response message is shown as a thick arrow.
图5-2显示了系统各个组件之间的通信:用户客户端、领导者和两个跟随者。时间从左到右流动。请求或响应消息显示为粗箭头。
In the example of Figure 5-2 , the replication to follower 1 is synchronous : the leader waits until follower 1 has confirmed that it received the write before reporting success to the user, and before making the write visible to other clients. The replication to follower 2 is asynchronous : the leader sends the message, but doesn’t wait for a response from the follower.
在图5-2的例子中,向追随者1的复制是同步的:领导者等待追随者1确认接收到写操作,然后才向用户报告成功并使写操作对其他客户端可见。向追随者2的复制是异步的:领导者发送信息,但不等待追随者的响应。
The diagram shows that there is a substantial delay before follower 2 processes the message. Normally, replication is quite fast: most database systems apply changes to followers in less than a second. However, there is no guarantee of how long it might take. There are circumstances when followers might fall behind the leader by several minutes or more; for example, if a follower is recovering from a failure, if the system is operating near maximum capacity, or if there are network problems between the nodes.
这张图表明在跟随者2处理信息之前会有相当大的延迟。通常情况下,复制是相当快的:大多数数据库系统在不到一秒的时间内将更改应用到跟随者上。然而,无法保证需要多长时间。有些情况下,跟随者可能会落后于领导者几分钟或更长时间;例如,如果跟随者正在从故障中恢复、系统接近最大容量运行或节点之间存在网络问题。
The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower. The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.
同步复制的优点是,跟随者可以保证拥有与领导者一致的最新数据副本。如果领导者突然失效,我们可以确定这些数据在跟随者上仍然可用。缺点是,如果同步的跟随者没有响应(因为它已崩溃、出现网络故障或其他任何原因),写入就无法被处理。领导者必须阻塞所有写入,等待该同步副本再次可用。
For that reason, it is impractical for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt. In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower. This configuration is sometimes also called semi-synchronous [ 7 ].
因此,让所有跟随者都采用同步复制是不切实际的:任何一个节点的中断都会导致整个系统停滞。实际上,如果你在数据库上启用同步复制,通常意味着其中一个跟随者是同步的,其他的则是异步的。如果该同步跟随者变得不可用或缓慢,就将一个异步跟随者改为同步。这保证你至少在两个节点上拥有最新的数据副本:领导者和一个同步跟随者。这种配置有时也被称为半同步(semi-synchronous)[7]。
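As an illustration of the difference, here is a small sketch of a leader that waits for one synchronous follower and replicates to the rest in the background, i.e. the semi-synchronous configuration just described. The Follower class and its apply_and_ack method are made up for this example; they are not a real database API.

```python
import threading
import time

class Follower:
    def __init__(self, delay=0.0):
        self.data = {}
        self.delay = delay  # simulated network/apply delay

    def apply_and_ack(self, change):
        time.sleep(self.delay)
        key, value = change
        self.data[key] = value
        return True  # acknowledgement back to the leader

def replicate(change, sync_follower, async_followers):
    # Semi-synchronous: wait for the one synchronous follower before confirming...
    if not sync_follower.apply_and_ack(change):
        raise RuntimeError("synchronous follower unavailable; write cannot be confirmed")
    # ...but do not wait for the asynchronous followers.
    for f in async_followers:
        threading.Thread(target=f.apply_and_ack, args=(change,), daemon=True).start()
    return "confirmed"   # data is now on the leader and at least one follower

sync_f = Follower()
async_fs = [Follower(delay=2.0)]   # an asynchronous follower that may lag behind
print(replicate(("user:42:avatar", "cat.png"), sync_f, async_fs))
```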
Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the client. However, a fully asynchronous configuration has the advantage that the leader can continue processing writes, even if all of its followers have fallen behind.
通常情况下,基于领导者的复制被配置为完全异步。在这种情况下,如果领导者失败并且无法恢复,则尚未复制到追随者的任何写入都将丢失。这意味着,即使已经确认给客户端,写入也不能保证是持久的。但是,完全异步配置的优点是,即使其所有追随者已经落后,领导者也可以继续处理写入。
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed. We will return to this issue in “Problems with Replication Lag” .
耐久性的减弱可能听起来像是一个不好的权衡,但是异步复制仍然被广泛使用,特别是如果有许多追随者或者它们在地理上分布。我们将在“复制滞后问题”中回到这个问题。
Setting Up New Followers
From time to time, you need to set up new followers—perhaps to increase the number of replicas, or to replace failed nodes. How do you ensure that the new follower has an accurate copy of the leader’s data?
有时你需要设置新的跟随者——也许是为了增加副本数量,或者替换故障节点。如何确保新的跟随者拥有领导者数据的准确副本呢?
Simply copying data files from one node to another is typically not sufficient: clients are constantly writing to the database, and the data is always in flux, so a standard file copy would see different parts of the database at different points in time. The result might not make any sense.
仅仅从一个节点复制数据文件到另一个节点通常是不足够的:客户端不断地向数据库写入数据,数据始终在变化,因此标准的文件复制在不同时间可能看到不同部分的数据库。结果可能毫无意义。
You could make the files on disk consistent by locking the database (making it unavailable for writes), but that would go against our goal of high availability. Fortunately, setting up a follower can usually be done without downtime. Conceptually, the process looks like this:
通过锁定数据库(使其无法写入),您可以使磁盘上的文件一致,但这将违反我们高可用性的目标。幸运的是,通常可以在无需停机的情况下设置跟随者。从概念上讲,该过程看起来像这样:
-
Take a consistent snapshot of the leader’s database at some point in time—if possible, without taking a lock on the entire database. Most databases have this feature, as it is also required for backups. In some cases, third-party tools are needed, such as innobackupex for MySQL [ 12 ].
在某个时间点对领导者的数据库做一个一致性快照——如果可能的话,不要锁定整个数据库。大多数数据库都有这个功能,因为备份也需要它。在某些情况下需要第三方工具,例如 MySQL 的 innobackupex [12]。
-
Copy the snapshot to the new follower node.
将快照复制到新的跟随者节点。
-
The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log. That position has various names: for example, PostgreSQL calls it the log sequence number , and MySQL calls it the binlog coordinates .
跟随者连接到领导者,并请求自快照拍摄以来发生的所有数据更改。这需要将快照与领导者的复制日志中的确切位置相关联。该位置有各种名称:例如,PostgreSQL称其为日志序列号,而MySQL称其为binlog坐标。
-
When the follower has processed the backlog of data changes since the snapshot, we say it has caught up . It can now continue to process data changes from the leader as they happen.
当追随者处理完自快照之后的数据更改积压时,我们称它已经追上了。现在它可以继续处理来自领导者的实时数据更改。
The practical steps of setting up a follower vary significantly by database. In some systems the process is fully automated, whereas in others it can be a somewhat arcane multi-step workflow that needs to be manually performed by an administrator.
建立跟随者的实际步骤因数据库而异。在某些系统中,这个过程是完全自动化的;而在另一些系统中,它可能是需要管理员手动执行的、有些晦涩的多步骤工作流程。
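Conceptually, the four steps above might look something like the following sketch. The FakeLeader object and its consistent_snapshot/changes_since methods are hypothetical stand-ins for database-specific tooling (snapshot commands, log-fetch protocols), used only to show how the snapshot is tied to a log position.

```python
class FakeLeader:
    """Minimal stand-in for a leader database (illustration only)."""
    def __init__(self):
        self.data = {}
        self.log = []          # list of (position, (key, value))

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log), (key, value)))

    def consistent_snapshot(self):
        # The snapshot is associated with an exact position in the replication log
        # (cf. PostgreSQL's log sequence number, MySQL's binlog coordinates).
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


def set_up_new_follower(leader):
    # 1. Take a consistent snapshot of the leader's database.
    snapshot, position = leader.consistent_snapshot()
    # 2. Copy the snapshot to the new follower node.
    follower_data = dict(snapshot)
    # 3. Apply the backlog of changes that happened since the snapshot was taken.
    for pos, (key, value) in leader.changes_since(position):
        follower_data[key] = value
        position = pos + 1
    # 4. The follower has caught up and can keep consuming the change stream.
    return follower_data, position


leader = FakeLeader()
leader.write("x", 1)
data, pos = set_up_new_follower(leader)   # in reality, writes keep arriving during the copy
print(data, pos)                          # {'x': 1} 1
```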
Handling Node Outages
Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to planned maintenance (for example, rebooting a machine to install a kernel security patch). Being able to reboot individual nodes without downtime is a big advantage for operations and maintenance. Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep the impact of a node outage as small as possible.
系统中的任何节点都可能宕机,也许是因为意外的故障,但同样可能是因为计划内的维护(例如,重启机器以安装内核安全补丁)。能够在不停机的情况下重启单个节点,对运维来说是一个很大的优势。因此,我们的目标是即使个别节点发生故障,也能保持系统整体运行,并尽可能减小节点中断的影响。
How do you achieve high availability with leader-based replication?
如何通过基于领导者的复制实现高可用性?
Follower failure: Catch-up recovery
On its local disk, each follower keeps a log of the data changes it has received from the leader. If a follower crashes and is restarted, or if the network between the leader and the follower is temporarily interrupted, the follower can recover quite easily: from its log, it knows the last transaction that was processed before the fault occurred. Thus, the follower can connect to the leader and request all the data changes that occurred during the time when the follower was disconnected. When it has applied these changes, it has caught up to the leader and can continue receiving a stream of data changes as before.
每个跟随者都在本地磁盘上保存着从领导者接收到的数据变更日志。如果跟随者崩溃后重启,或者领导者与跟随者之间的网络暂时中断,跟随者可以很容易地恢复:从日志中,它知道故障发生之前处理的最后一个事务。因此,跟随者可以连接到领导者,请求断开连接期间发生的所有数据变更。应用完这些变更后,它就追上了领导者,可以像以前一样继续接收数据变更流。
Leader failure: Failover
Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover .
处理领导者的故障更为棘手:需要把一个跟随者提升为新的领导者,客户端需要重新配置以将写入发送给新的领导者,其他跟随者需要开始从新的领导者那里消费数据变更。这个过程称为故障切换(failover)。
Failover can happen manually (an administrator is notified that the leader has failed and takes the necessary steps to make a new leader) or automatically. An automatic failover process usually consists of the following steps:
故障切换可以手动进行(管理员收到领导者故障的通知后,采取必要步骤建立新的领导者),也可以自动进行。自动故障切换过程通常包括以下步骤:
-
Determining that the leader has failed. There are many things that could potentially go wrong: crashes, power outages, network issues, and more. There is no foolproof way of detecting what has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and forth between each other, and if a node doesn’t respond for some period of time—say, 30 seconds—it is assumed to be dead. (If the leader is deliberately taken down for planned maintenance, this doesn’t apply.)
确定领导者已失效。有很多事情可能出错:崩溃、停电、网络问题等等。没有万无一失的方法来检测究竟出了什么问题,所以大多数系统只是使用超时:节点之间经常互相发送消息,如果某个节点在一段时间内(比如30秒)没有响应,就认为它已经失效。(如果领导者是因计划维护而被故意关闭,则不适用。)
-
Choosing a new leader. This could be done through an election process (where the leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously elected controller node . The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader is a consensus problem, discussed in detail in Chapter 9 .
选择新的领导者。这可以通过选举过程完成(由剩余副本中的多数选出领导者),也可以由先前选定的控制器节点来任命新的领导者。最佳的领导者人选通常是拥有旧领导者最新数据变更的那个副本(以尽量减少数据丢失)。让所有节点就新领导者达成一致是一个共识问题,将在第9章详细讨论。
-
Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader (we discuss this in “Request Routing” ). If the old leader comes back, it might still believe that it is the leader, not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader.
重新配置系统以使用新的领导者。客户端现在需要把写入请求发送给新的领导者(我们在“请求路由”中讨论这个问题)。如果旧的领导者恢复上线,它可能仍然认为自己是领导者,没有意识到其他副本已经迫使它下台。系统需要确保旧的领导者变成跟随者,并承认新的领导者。
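A rough sketch of the automatic failover steps just described, assuming a timeout-based failure detector and a simple "most up-to-date log position wins" choice of new leader. The replica dictionaries and the LEADER_TIMEOUT constant are hypothetical; real systems differ considerably in how each step is implemented, and step 2 in particular hides a consensus problem.

```python
import time

LEADER_TIMEOUT = 30.0   # seconds without a heartbeat before the leader is presumed dead

def maybe_fail_over(replicas, last_heartbeat, now=None):
    now = now if now is not None else time.time()

    # 1. Determine that the leader has failed (timeout-based detection).
    if now - last_heartbeat < LEADER_TIMEOUT:
        return None   # leader still looks alive

    # 2. Choose a new leader: the replica with the most up-to-date changes,
    #    to minimize data loss (getting all nodes to agree is a consensus problem).
    new_leader = max(replicas, key=lambda r: r["log_position"])

    # 3. Reconfigure the system: writes now go to the new leader, and the old
    #    leader must be demoted to a follower if it ever comes back.
    for r in replicas:
        r["leader"] = new_leader["id"]
    return new_leader

replicas = [
    {"id": "follower-1", "log_position": 1042, "leader": "old-leader"},
    {"id": "follower-2", "log_position": 998,  "leader": "old-leader"},
]
print(maybe_fail_over(replicas, last_heartbeat=time.time() - 60))
```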
Failover is fraught with things that can go wrong:
故障切换充满了可能会出错的事情:
-
If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be discarded, which may violate clients’ durability expectations.
如果使用异步复制,新领导者在旧领导者失败之前可能没有收到所有写入。如果前领导者在新领导者被选出后重新加入集群,那么这些写入应该怎么处理呢?在此期间,新的领导者可能已经接收到了冲突的写入。最常见的解决方案是旧领导者的未复制写入被简单地丢弃,这可能会违反客户端的耐久性期望。
-
Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents. For example, in one incident at GitHub [ 13 ], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users.
如果数据库外部的其他存储系统需要与数据库内容协调,那么丢弃写入操作尤其危险。例如,在GitHub的一次事件中,一个过时的MySQL从属数据库被提升为主库。数据库使用一个自增计数器来分配新行的主键,但由于新主库的计数器落后于旧主库,它重用了一些旧主库已经分配的主键。这些主键也用于Redis存储,因此主键的重用导致MySQL和Redis之间的不一致,从而导致某些私人数据被泄露给错误的用户。
-
In certain fault scenarios (see Chapter 8 ), it could happen that two nodes both believe that they are the leader. This situation is called split brain , and it is dangerous: if both leaders accept writes, and there is no process for resolving conflicts (see “Multi-Leader Replication” ), data is likely to be lost or corrupted. As a safety catch, some systems have a mechanism to shut down one node if two leaders are detected. ii However, if this mechanism is not carefully designed, you can end up with both nodes being shut down [ 14 ].
在某些故障情况(见第8章)下,可能会出现两个节点都认为自己是领导者的情况。此情况称为“脑裂”,它是危险的:如果两个领导者都接受写入,并且没有解决冲突的过程(见“多领导者复制”),数据很可能会丢失或损坏。为了安全起见,有些系统有机制在检测到两个领导者时关闭其中一个节点。然而,如果这个机制设计不当,你可能会导致两个节点都关闭[14]。
-
What is the right timeout before the leader is declared dead? A longer timeout means a longer time to recovery in the case where the leader fails. However, if the timeout is too short, there could be unnecessary failovers. For example, a temporary load spike could cause a node’s response time to increase above the timeout, or a network glitch could cause delayed packets. If the system is already struggling with high load or network problems, an unnecessary failover is likely to make the situation worse, not better.
在宣告领导者失效之前,多长的超时才合适?超时越长,意味着在领导者失效时需要越长的时间才能恢复。然而,如果超时太短,就可能出现不必要的故障切换。例如,临时的负载峰值可能导致节点的响应时间超过超时阈值,或者网络故障可能导致数据包延迟。如果系统已经在高负载或网络问题中苦苦挣扎,不必要的故障切换很可能让情况变得更糟,而不是更好。
There are no easy solutions to these problems. For this reason, some operations teams prefer to perform failovers manually, even if the software supports automatic failover.
这些问题没有简单的解决方案。因此,一些运维团队更喜欢手动执行故障转移,即使软件支持自动故障转移。
These issues—node failures; unreliable networks; and trade-offs around replica consistency, durability, availability, and latency—are in fact fundamental problems in distributed systems. In Chapter 8 and Chapter 9 we will discuss them in greater depth.
这些问题——节点故障;不可靠的网络;以及副本一致性、耐久性、可用性和延迟之间的权衡——实际上是分布式系统中的基本问题。在第8章和第9章中,我们将更深入地讨论它们。
Implementation of Replication Logs
How does leader-based replication work under the hood? Several different replication methods are used in practice, so let’s look at each one briefly.
基于领导者的复制在底层是如何工作的?实践中使用了几种不同的复制方法,让我们逐一简要地看一下。
Statement-based replication
In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT, UPDATE, or DELETE statement is forwarded to followers, and each follower parses and executes that SQL statement as if it had been received from a client.
在最简单的情况下,领导者记录执行的每个写入请求(语句),并将该语句日志发送给其追随者。对于关系型数据库,这意味着每个INSERT,UPDATE或DELETE语句都会转发到追随者,并且每个追随者将解析和执行该SQL语句,就像它是从客户端接收到的一样。
Although this may sound reasonable, there are various ways in which this approach to replication can break down:
虽然这听起来很合理,但这种复制方法可能会出现各种问题:
-
Any statement that calls a nondeterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica.
任何调用非确定性函数的语句,例如用 NOW() 获取当前日期和时间,或用 RAND() 获取随机数,在每个副本上都很可能生成不同的值。
-
If statements use an autoincrementing column, or if they depend on the existing data in the database (e.g., UPDATE … WHERE <some condition>), they must be executed in exactly the same order on each replica, or else they may have a different effect. This can be limiting when there are multiple concurrently executing transactions.
如果语句使用自增列,或者依赖于数据库中已有的数据(例如,UPDATE … WHERE <某些条件>),则它们必须在每个副本上以完全相同的顺序执行,否则可能产生不同的效果。当存在多个并发执行的事务时,这可能成为一种限制。
-
Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.
具有副作用的语句(例如触发器、存储过程、用户定义函数)可能导致每个副本上发生不同的副作用,除非这些副作用是绝对确定性的。
It is possible to work around those issues—for example, the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged so that the followers all get the same value. However, because there are so many edge cases, other replication methods are now generally preferred.
可以采用解决这些问题的方法——例如,领导可以在记录语句时将任何非确定性函数调用替换为固定的返回值,以便跟随者都获得相同的值。但由于存在很多特殊情况,现在通常更喜欢使用其他复制方法。
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it safe by requiring transactions to be deterministic [ 15 ].
MySQL 5.1 之前的版本使用基于语句的复制。由于它相当紧凑,如今有时仍会用到,但默认情况下,只要语句中存在任何非确定性,MySQL 就会切换到基于行的复制(稍后讨论)。VoltDB 使用基于语句的复制,并通过要求事务必须是确定性的来保证其安全[15]。
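The nondeterminism problem, and the workaround of logging a fixed value, can be illustrated with a tiny sketch. This is illustrative Python, not the mechanism of any specific database: if each replica evaluates NOW() or RAND() for itself, the replicas diverge, whereas applying the values the leader logged keeps them identical.

```python
import datetime
import random

def run_statement(db):
    # Each replica evaluates NOW() and RAND() for itself when executing the statement.
    db["created_at"] = datetime.datetime.now()
    db["token"] = random.random()

leader_db, follower_db = {}, {}
run_statement(leader_db)     # leader executes INSERT ... VALUES (NOW(), RAND())
run_statement(follower_db)   # follower re-executes the same statement
print(leader_db == follower_db)   # almost certainly False: the replicas have diverged

# Workaround: the leader evaluates the nondeterministic functions once when the
# statement is logged, and followers apply the fixed values verbatim.
logged_values = {"created_at": datetime.datetime.now(), "token": random.random()}
leader_db, follower_db = dict(logged_values), dict(logged_values)
print(leader_db == follower_db)   # True: both replicas end up identical
```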
Write-ahead log (WAL) shipping
In Chapter 3 we discussed how storage engines represent data on disk, and we found that usually every write is appended to a log:
在第三章中,我们讨论了存储引擎如何在磁盘上表示数据,我们发现通常每次写入都会附加到日志中。
-
In the case of a log-structured storage engine (see “SSTables and LSM-Trees” ), this log is the main place for storage. Log segments are compacted and garbage-collected in the background.
在采用日志结构化存储引擎的情况下(见“SSTables和LSM树”),这个日志是主要的存储地点。日志段在后台进行压缩和垃圾回收。
-
In the case of a B-tree (see “B-Trees” ), which overwrites individual disk blocks, every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.
对于会覆写单个磁盘块的 B 树(参见“B树”),每次修改都会先写入预写日志(write-ahead log),以便在崩溃后将索引恢复到一致的状态。
In either case, the log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers. When the follower processes this log, it builds a copy of the exact same data structures as found on the leader.
无论哪种情况,日志都是一个仅追加的字节序列,包含了对数据库的所有写入。我们可以用完全相同的日志在另一个节点上构建副本:领导者除了把日志写入磁盘,还会通过网络把它发送给跟随者。跟随者处理这个日志时,就会构建出与领导者上完全相同的数据结构的副本。
This method of replication is used in PostgreSQL and Oracle, among others [ 16 ]. The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled to the storage engine. If the database changes its storage format from one version to another, it is typically not possible to run different versions of the database software on the leader and the followers.
这种复制方法被用于 PostgreSQL 和 Oracle 等数据库[16]。主要缺点是日志描述的数据层次非常低:WAL 包含了哪些磁盘块中的哪些字节被更改的细节,这使得复制与存储引擎紧密耦合。如果数据库在不同版本之间更改了存储格式,通常就无法在领导者和跟随者上运行不同版本的数据库软件。
That may seem like a minor implementation detail, but it can have a big operational impact. If the replication protocol allows the follower to use a newer software version than the leader, you can perform a zero-downtime upgrade of the database software by first upgrading the followers and then performing a failover to make one of the upgraded nodes the new leader. If the replication protocol does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require downtime.
这看起来可能只是一个微小的实现细节,但却可能对运维产生重大影响。如果复制协议允许跟随者使用比领导者更新的软件版本,你就可以先升级跟随者,然后执行故障切换,让某个已升级的节点成为新的领导者,从而实现数据库软件的零停机升级。如果复制协议不允许这种版本不匹配(WAL 传送通常如此),则此类升级需要停机。
Logical (row-based) log replication
An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called a logical log , to distinguish it from the storage engine’s ( physical ) data representation.
另一种方法是让复制和存储引擎使用不同的日志格式,从而使复制日志与存储引擎的内部实现解耦。这种复制日志被称为逻辑日志,以区别于存储引擎的(物理)数据表示。
A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:
关系数据库的逻辑日志通常是描述对数据库表进行行级写操作的记录序列:
-
For an inserted row, the log contains the new values of all columns.
对于插入的行,日志包含所有列的新值。
-
For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.
对于已删除的行,日志包含足够的信息来唯一标识被删除的行。通常这将是主键,但如果表没有主键,则需要记录所有列的旧值。
-
For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).
更新的行,在日志中包含足够的信息来唯一识别更新的行,并且包含所有列的新值(或者至少包含所有修改的列的新值)。
A transaction that modifies several rows generates several such log records, followed by a record indicating that the transaction was committed. MySQL’s binlog (when configured to use row-based replication) uses this approach [ 17 ].
一个修改多行的事务会生成多条这样的日志记录,随后是一条表明该事务已提交的记录。MySQL 的 binlog(配置为基于行的复制时)就使用这种方法[17]。
Since a logical log is decoupled from the storage engine internals, it can more easily be kept backward compatible, allowing the leader and the follower to run different versions of the database software, or even different storage engines.
由于逻辑日志与存储引擎内部解耦,因此可以更轻松地保持向后兼容性,允许领导者和追随者运行不同版本的数据库软件,甚至是不同的存储引擎。
A logical log format is also easier for external applications to parse. This aspect is useful if you want to send the contents of a database to an external system, such as a data warehouse for offline analysis, or for building custom indexes and caches [ 18 ]. This technique is called change data capture , and we will return to it in Chapter 11 .
逻辑日志格式也更容易被外部应用程序解析。如果你想把数据库的内容发送到外部系统,例如用于离线分析的数据仓库,或者用于构建自定义索引和缓存[18],这一点非常有用。这种技术称为变更数据捕获,我们将在第11章再次讨论它。
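As a rough illustration of what row-based logical log records might look like, here is a hypothetical change stream for one transaction, together with a consumer that applies it to an in-memory copy of the database. The field names are invented for this example; real formats such as MySQL's row-based binlog are binary and carry considerably more detail.

```python
# One transaction that inserts a row and then updates it, followed by a commit record.
logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 42, "name": "Alice", "plan": "free"}},
    {"op": "update", "table": "users", "key": {"id": 42}, "new_values": {"plan": "paid"}},
    {"op": "commit", "transaction_id": 7001},
]

def apply(record, tables):
    """Apply a logical log record to an in-memory copy of the database."""
    if record["op"] == "insert":
        tables.setdefault(record["table"], {})[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        tables[record["table"]][record["key"]["id"]].update(record["new_values"])
    # commit records carry no data changes; consumers use them as transaction boundaries

tables = {}
for record in logical_log:
    apply(record, tables)
print(tables["users"][42])   # {'id': 42, 'name': 'Alice', 'plan': 'paid'}
```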
Trigger-based replication
The replication approaches described so far are implemented by the database system, without involving any application code. In many cases, that’s what you want—but there are some circumstances where more flexibility is needed. For example, if you want to only replicate a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic (see “Handling Write Conflicts” ), then you may need to move replication up to the application layer.
迄今为止所描述的复制方法是由数据库系统实现的,不涉及任何应用程序代码。在许多情况下,这是您想要的,但还有一些情况需要更多的灵活性。例如,如果您只想复制数据子集,或者想要从一种数据库复制到另一种数据库,或者如果您需要冲突解决逻辑(请参见“处理写入冲突”),那么您可能需要将复制向应用程序层移动。
Some tools, such as Oracle GoldenGate [ 19 ], can make data changes available to an application by reading the database log. An alternative is to use features that are available in many relational databases: triggers and stored procedures .
一些工具,例如Oracle GoldenGate,可以通过读取数据库日志将数据更改提供给应用程序。另一种选择是使用许多关系数据库可用的功能:触发器和存储过程。
A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system. The trigger has the opportunity to log this change into a separate table, from which it can be read by an external process. That external process can then apply any necessary application logic and replicate the data change to another system. Databus for Oracle [ 20 ] and Bucardo for Postgres [ 21 ] work like this, for example.
触发器允许您注册自定义应用程序代码,当数据库系统中发生数据更改(写入事务)时自动执行。触发器可以将此更改记录到单独的表中,外部进程可以读取该表。然后,外部进程可以应用任何必要的应用程序逻辑并将数据更改复制到另一个系统。例如,Databus for Oracle和Postgres的Bucardo工作方式如此。
Trigger-based replication typically has greater overheads than other replication methods, and is more prone to bugs and limitations than the database’s built-in replication. However, it can nevertheless be useful due to its flexibility.
基于触发器的复制通常比其他复制方法有更高的开销,且比数据库内置的复制更容易出现错误和限制。 然而,由于其灵活性,它仍然可以是有用的。
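To make the trigger-based pattern concrete, here is a toy sketch using SQLite: a trigger appends every update to a separate changes table, and an external process polls that table and forwards the changes elsewhere. The table layout and function names are made up for this example; tools like Bucardo and Databus implement the same idea far more robustly (ordering, transactions, failure handling).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE changes (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                          account_id INTEGER, new_balance INTEGER);
    -- The trigger records every update to accounts into a separate table.
    CREATE TRIGGER log_update AFTER UPDATE ON accounts
    BEGIN
        INSERT INTO changes (account_id, new_balance) VALUES (NEW.id, NEW.balance);
    END;
""")
db.execute("INSERT INTO accounts VALUES (1, 100)")
db.execute("UPDATE accounts SET balance = 150 WHERE id = 1")

# An external process reads the changes table and replicates them to another system.
def drain_changes(conn, last_seen_seq):
    rows = conn.execute("SELECT seq, account_id, new_balance FROM changes WHERE seq > ?",
                        (last_seen_seq,)).fetchall()
    for seq, account_id, new_balance in rows:
        print(f"replicating change {seq}: account {account_id} -> balance {new_balance}")
        last_seen_seq = seq
    return last_seen_seq

drain_changes(db, 0)
```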
Problems with Replication Lag
Being able to tolerate node failures is just one reason for wanting replication. As mentioned in the introduction to Part II , other reasons are scalability (processing more requests than a single machine can handle) and latency (placing replicas geographically closer to users).
实现容忍节点故障只是需要复制的原因之一。如Part II 的介绍中提到的,其它原因包括可扩展性(能够处理超出单一机器所能负担的更多请求)和延迟(把副本放置在更接近用户的地理位置)。
Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica. For workloads that consist of mostly reads and only a small percentage of writes (a common pattern on the web), there is an attractive option: create many followers, and distribute the read requests across those followers. This removes load from the leader and allows read requests to be served by nearby replicas.
基于领导者的复制要求所有写入都经过单个节点,但只读查询可以由任何副本处理。对于以读为主、只有一小部分写入的工作负载(网络上的常见模式),有一个有吸引力的选择:创建许多跟随者,并把读请求分散到这些跟随者上。这样可以减轻领导者的负载,并允许由附近的副本来处理读请求。
In this read-scaling architecture, you can increase the capacity for serving read-only requests simply by adding more followers. However, this approach only realistically works with asynchronous replication—if you tried to synchronously replicate to all followers, a single node failure or network outage would make the entire system unavailable for writing. And the more nodes you have, the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable.
在这种读扩展架构中,只需添加更多的跟随者,就可以提高处理只读请求的能力。但是,这种方法实际上只适用于异步复制——如果尝试同步复制到所有跟随者,单个节点故障或网络中断就会使整个系统无法写入。而且节点越多,某个节点宕机的可能性就越大,所以完全同步的配置是非常不可靠的。
Unfortunately, if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind. This leads to apparent inconsistencies in the database: if you run the same query on the leader and a follower at the same time, you may get different results, because not all writes have been reflected in the follower. This inconsistency is just a temporary state—if you stop writing to the database and wait a while, the followers will eventually catch up and become consistent with the leader. For that reason, this effect is known as eventual consistency [ 22 , 23 ]. iii
不幸的是,如果应用程序从异步跟随者读取数据,而该跟随者已经落后,它就可能看到过时的信息。这会导致数据库中出现明显的不一致:如果你同时在领导者和跟随者上运行相同的查询,可能会得到不同的结果,因为并非所有写入都已反映到跟随者中。这种不一致只是暂时的状态——如果停止向数据库写入并等待一段时间,跟随者最终会追上来,与领导者保持一致。因此,这种效应被称为最终一致性[22,23]。
The term “eventually” is deliberately vague: in general, there is no limit to how far a replica can fall behind. In normal operation, the delay between a write happening on the leader and being reflected on a follower—the replication lag —may be only a fraction of a second, and not noticeable in practice. However, if the system is operating near capacity or if there is a problem in the network, the lag can easily increase to several seconds or even minutes.
“最终”一词刻意含糊:一般来说,副本可以落后多远并没有限制。正常运行时,从写入发生在领导者上到反映在跟随者上之间的延迟——即复制滞后——可能只有几分之一秒,在实践中并不明显。然而,如果系统在接近容量上限的情况下运行,或者网络存在问题,滞后很容易增加到几秒甚至几分钟。
When the lag is so large, the inconsistencies it introduces are not just a theoretical issue but a real problem for applications. In this section we will highlight three examples of problems that are likely to occur when there is replication lag and outline some approaches to solving them.
当延迟非常大时,它引入的不一致性不仅是一个理论问题,而且对应用程序来说是一个真正的问题。在本节中,我们将强调三个问题的例子,这些问题在存在复制延迟时很有可能发生,并概述解决它们的方法。
Reading Your Own Writes
Many applications let the user submit some data and then view what they have submitted. This might be a record in a customer database, or a comment on a discussion thread, or something else of that sort. When new data is submitted, it must be sent to the leader, but when the user views the data, it can be read from a follower. This is especially appropriate if data is frequently viewed but only occasionally written.
许多应用程序允许用户提交数据,然后查看他们提交的内容。这可能是客户数据库中的记录,或讨论主题上的评论,或其他类似的东西。当新数据被提交时,它必须发送到领导者,但当用户查看数据时,可以从追随者中读取。如果数据经常被查看但只偶尔被写入,这是特别适当的。
With asynchronous replication, there is a problem, illustrated in Figure 5-3 : if the user views the data shortly after making a write, the new data may not yet have reached the replica. To the user, it looks as though the data they submitted was lost, so they will be understandably unhappy.
使用异步复制,存在一个问题,如图5-3所示:如果用户在写入数据后不久即查看数据,则新数据可能尚未到达副本。对于用户来说,看起来好像他们提交的数据丢失了,所以他们会感到不满意。
In this situation, we need read-after-write consistency , also known as read-your-writes consistency [ 24 ]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users’ updates may not be visible until some later time. However, it reassures the user that their own input has been saved correctly.
在这种情况下,我们需要写后读一致性(read-after-write consistency),也称为读己之写一致性(read-your-writes consistency)[24]。这保证如果用户重新加载页面,他们总能看到自己提交的任何更新。它不对其他用户作出承诺:其他用户的更新可能要过一段时间才可见。但它能让用户确信自己的输入已经被正确保存。
How can we implement read-after-write consistency in a system with leader-based replication? There are various possible techniques. To mention a few:
如何在基于领导者复制的系统中实现写后读一致性?有多种可能的技术,这里举几个例子:
-
When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower. This requires that you have some way of knowing whether something might have been modified, without actually querying it. For example, user profile information on a social network is normally only editable by the owner of the profile, not by anybody else. Thus, a simple rule is: always read the user’s own profile from the leader, and any other users’ profiles from a follower.
如果用户可能修改了某些内容,请从领导者处读取;否则,请从追随者处读取。这要求您以某种方式知道某些内容可能已经被修改,而不必实际查询它。例如,在社交网络上,用户资料信息通常只能由个人资料的所有者进行编辑,而不能由其他任何人编辑。因此,一个简单的规则是:总是从领导者读取用户自己的资料,并从追随者读取其他用户的资料。
-
If most things in the application are potentially editable by the user, that approach won’t be effective, as most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.
如果应用中的大多数内容都可能被用户编辑,这种方法就不太有效,因为大多数内容都必须从领导者读取(抵消了读扩展的好处)。在这种情况下,可以用其他标准来决定是否从领导者读取。例如,可以跟踪最后一次更新的时间,在最后一次更新后的一分钟内都从领导者读取。也可以监控跟随者的复制滞后,阻止对任何落后领导者超过一分钟的跟随者的查询。
-
The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up. The timestamp could be a logical timestamp (something that indicates ordering of writes, such as the log sequence number) or the actual system clock (in which case clock synchronization becomes critical; see “Unreliable Clocks” ).
客户端可以记住它最近写入的时间戳 - 然后系统可以确保为该用户提供的任何读取副本都至少反映该时间戳之前的更新。如果一个副本不够更新,那么读取可以由另一个副本处理,或者查询可以等待直到该副本赶上。时间戳可以是逻辑时间戳(指示写入排序的某些内容,例如日志序列号)或实际的系统时钟(在这种情况下,时钟同步变得至关重要;请参见“不可靠的时钟”)。
-
If your replicas are distributed across multiple datacenters (for geographical proximity to users or for availability), there is additional complexity. Any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
如果您的副本分布在多个数据中心(以方便用户的地理位置或可用性),那么就会有额外的复杂性。任何需要由领导者提供服务的请求都必须路由到包含领导者的数据中心。
Another complication arises when the same user is accessing your service from multiple devices, for example a desktop web browser and a mobile app. In this case you may want to provide cross-device read-after-write consistency: if the user enters some information on one device and then views it on another device, they should see the information they just entered.
当同一用户从多个设备访问你的服务时(例如桌面浏览器和移动应用),会出现另一个复杂情况。在这种情况下,你可能需要提供跨设备的写后读一致性:如果用户在一个设备上输入了一些信息,然后在另一个设备上查看,应该能看到刚刚输入的信息。
In this case, there are some additional issues to consider:
在这种情况下,还有一些额外的问题需要考虑:
-
Approaches that require remembering the timestamp of the user’s last update become more difficult, because the code running on one device doesn’t know what updates have happened on the other device. This metadata will need to be centralized.
需要记住用户上次更新的时间戳的方法变得更加困难,因为在一个设备上运行的代码不知道另一个设备上发生了哪些更新。这些元数据需要集中处理。
-
If your replicas are distributed across different datacenters, there is no guarantee that connections from different devices will be routed to the same datacenter. (For example, if the user’s desktop computer uses the home broadband connection and their mobile device uses the cellular data network, the devices’ network routes may be completely different.) If your approach requires reading from the leader, you may first need to route requests from all of a user’s devices to the same datacenter.
如果您的副本分布在不同的数据中心中,则不能保证来自不同设备的连接将路由到同一数据中心。 (例如,如果用户的桌面计算机使用家庭宽带连接,而其移动设备使用蜂窝数据网络,则设备的网络路由可能完全不同。)如果您的方法需要从领导者读取,您可能需要首先将来自用户所有设备的请求路由到同一个数据中心。
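The timestamp-based technique from the list above (the client remembers the logical timestamp of its last write) might look roughly like this sketch. The replica dictionaries and their applied_up_to log positions are hypothetical illustrations of the idea, not any database's API.

```python
def read_your_writes(key, client_last_write_pos, replicas, leader):
    """Serve the read from any replica that has caught up to the client's last write."""
    for replica in replicas:
        # The logical timestamp here is a log sequence number.
        if replica["applied_up_to"] >= client_last_write_pos:
            return replica["data"].get(key)
    # No follower is sufficiently up to date: fall back to the leader
    # (alternatively, the query could wait until a replica catches up).
    return leader["data"].get(key)

leader = {"data": {"profile:42": "new photo"}, "applied_up_to": 107}
replicas = [
    {"data": {"profile:42": "old photo"}, "applied_up_to": 99},   # lagging
    {"data": {"profile:42": "new photo"}, "applied_up_to": 107},  # caught up
]
# The client remembers position 107 from its own write and passes it with the read.
print(read_your_writes("profile:42", 107, replicas, leader))   # "new photo"
```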
Monotonic Reads
Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s possible for a user to see things moving backward in time .
异步跟随者读取时可能出现的异常情况的第二个例子是用户可能会看到时间倒流。
This can happen if a user makes several reads from different replicas. For example, Figure 5-4 shows user 2345 making the same query twice, first to a follower with little lag, then to a follower with greater lag. (This scenario is quite likely if the user refreshes a web page, and each request is routed to a random server.) The first query returns a comment that was recently added by user 1234, but the second query doesn’t return anything because the lagging follower has not yet picked up that write. In effect, the second query is observing the system at an earlier point in time than the first query. This wouldn’t be so bad if the first query hadn’t returned anything, because user 2345 probably wouldn’t know that user 1234 had recently added a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear, and then see it disappear again.
如果用户从不同的副本进行多次读取,就可能发生这种情况。例如,图5-4显示用户2345进行了两次相同的查询,第一次查询的是一个滞后很小的跟随者,第二次查询的是一个滞后更大的跟随者。(如果用户刷新网页,而每个请求都被路由到随机的服务器,这种情况就很可能发生。)第一个查询返回了用户1234最近添加的评论,但第二个查询什么也没有返回,因为滞后的跟随者还没有获取到那个写入。实际上,第二个查询是在比第一个查询更早的时间点上观察系统。如果第一个查询没有返回任何内容,情况还不算太糟,因为用户2345可能并不知道用户1234最近添加了评论。但如果用户2345先看到用户1234的评论出现,然后又看到它消失,那就非常令人困惑了。
Monotonic reads [ 23 ] is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward—i.e., they will not read older data after having previously read newer data.
单调读[23]是一种保证这类异常不会发生的机制。它是比强一致性弱、但比最终一致性强的保证。读取数据时,你可能会看到旧值;单调读只意味着,如果一个用户按顺序进行多次读取,他们不会看到时间倒退——也就是说,在已经读到较新的数据之后,不会再读到更旧的数据。
One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (different users can read from different replicas). For example, the replica can be chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the user’s queries will need to be rerouted to another replica.
实现单调读的一种方法是确保每个用户始终从相同的副本中读取(不同的用户可以从不同的副本中读取)。例如,可以根据用户ID的哈希值选择副本,而不是随机选择。但是,如果该副本失败,用户的查询将需要重新路由到另一个副本。
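A minimal sketch of this replica-pinning approach: route each user's reads to a replica chosen by a hash of the user ID, falling back to another replica only if the preferred one is down. The replica list and the health-check callback are assumptions made for the illustration.

```python
import hashlib

def replica_for_user(user_id, replicas, is_healthy=lambda r: True):
    """Pin each user to one replica so their successive reads never go back in time."""
    # Use a stable hash (not Python's randomized hash()) so the choice survives restarts.
    digest = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16)
    preferred = digest % len(replicas)
    # Try the preferred replica first, then the others in a fixed order.
    for offset in range(len(replicas)):
        candidate = replicas[(preferred + offset) % len(replicas)]
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("no healthy replica available")

replicas = ["replica-a", "replica-b", "replica-c"]
print(replica_for_user(2345, replicas))   # the same user always lands on the same replica
```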
Consistent Prefix Reads
Our third example of replication lag anomalies concerns violation of causality. Imagine the following short dialog between Mr. Poons and Mrs. Cake:
我们第三个复制滞后异常的例子涉及违反因果关系。想象一下普恩斯先生和蛋糕夫人之间的简短对话。
- Mr. Poons
-
How far into the future can you see, Mrs. Cake?
你能看到未来多远,蛋糕夫人?
- Mrs. Cake
-
About ten seconds usually, Mr. Poons.
通常大约十秒,普恩斯先生。
There is a causal dependency between those two sentences: Mrs. Cake heard Mr. Poons’s question and answered it.
这两个句子之间存在因果依赖关系:Cake 夫人听到了 Poons 先生的问题并回答了它。
Now, imagine a third person is listening to this conversation through followers. The things said by Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer replication lag (see Figure 5-5 ). This observer would hear the following:
现在,想象有第三个人正通过跟随者收听这段对话。蛋糕夫人说的话经过一个滞后很小的跟随者,而普恩斯先生说的话则经过一个复制滞后较大的跟随者(见图5-5)。这个旁观者会听到以下内容:
- Mrs. Cake
-
About ten seconds usually, Mr. Poons.
通常大约十秒,普恩斯先生。
- Mr. Poons
-
How far into the future can you see, Mrs. Cake?
"Mrs. Cake,你能看多远的将来?"
To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked it. Such psychic powers are impressive, but very confusing [ 25 ].
观察者会觉得蛋糕夫人在普恩斯先生还没问出问题之前就已经回答了问题。这种超自然力量令人印象深刻,但也让人困惑。
Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads [ 23 ]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
预防这种异常需要另一种类型的保证:一致性前缀读取[23]。该保证表示,如果一系列写入按照某个顺序发生,则任何读取这些写入的人都会以相同的顺序看到它们出现。
This is a particular problem in partitioned (sharded) databases, which we will discuss in Chapter 6 . If the database always applies writes in the same order, reads always see a consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different partitions operate independently, so there is no global ordering of writes: when a user reads from the database, they may see some parts of the database in an older state and some in a newer state.
这是分区数据库中的特殊问题,我们将在第6章中讨论。如果数据库总是按相同顺序应用写操作,读操作就能始终看到一致的前缀,因此这种异常不会发生。然而,在许多分布式数据库中,不同的分区操作是独立的,因此没有全局的写入顺序:当用户从数据库读取时,他们可能会看到一些旧状态和一些新状态的数据库部分。
One solution is to make sure that any writes that are causally related to each other are written to the same partition—but in some applications that cannot be done efficiently. There are also algorithms that explicitly keep track of causal dependencies, a topic that we will return to in “The “happens-before” relationship and concurrency” .
一种解决方案是确保任何因果相关的写操作都写入同一分区 - 但在某些应用中,这可能无法高效完成。还有一些算法明确跟踪因果依赖关系,这是我们将在““发生在之前”关系和并发性”中重新回到的主题。
Solutions for Replication Lag
When working with an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If the answer is “no problem,” that’s great. However, if the result is a bad experience for users, it’s important to design the system to provide a stronger guarantee, such as read-after-write. Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
在使用最终一致的系统时,值得思考一下:如果复制滞后增加到几分钟甚至几小时,应用程序会如何表现?如果答案是“没有问题”,那很好。然而,如果结果是给用户带来糟糕的体验,那么在设计系统时就应当提供更强的保证,例如写后读。在复制实际上是异步的情况下却假装它是同步的,是日后出问题的根源。
As discussed earlier, there are ways in which an application can provide a stronger guarantee than the underlying database—for example, by performing certain kinds of reads on the leader. However, dealing with these issues in application code is complex and easy to get wrong.
正如之前所讨论的那样,应用程序可以提供比底层数据库更强的保证方式——例如,在领导者上进行某些类型的读取。然而,处理这些问题在应用程序代码中是复杂的,容易出错。
It would be better if application developers didn’t have to worry about subtle replication issues and could just trust their databases to “do the right thing.” This is why transactions exist: they are a way for a database to provide stronger guarantees so that the application can be simpler.
应用程序开发人员不必担心微妙的复制问题,只需信任它们的数据库“做正确的事情”,这样会更好。这就是事务存在的原因:它们是数据库提供更强保证的一种方式,因此应用程序可以更简单。
Single-node transactions have existed for a long time. However, in the move to distributed (replicated and partitioned) databases, many systems have abandoned them, claiming that transactions are too expensive in terms of performance and availability, and asserting that eventual consistency is inevitable in a scalable system. There is some truth in that statement, but it is overly simplistic, and we will develop a more nuanced view over the course of the rest of this book. We will return to the topic of transactions in Chapters 7 and 9 , and we will discuss some alternative mechanisms in Part III .
单节点事务已经存在很长时间了。然而,在转向分布式(复制和分区)数据库的过程中,许多系统放弃了事务,声称事务在性能和可用性方面代价太高,并断言在可扩展的系统中最终一致性是不可避免的。这种说法有一定道理,但过于简单化了,我们将在本书的其余部分逐步形成更细致的观点。我们将在第7章和第9章回到事务这个话题,并在第三部分讨论一些替代机制。
Multi-Leader Replication
So far in this chapter we have only considered replication architectures using a single leader. Although that is a common approach, there are interesting alternatives.
到目前为止,在这一章中,我们只考虑使用单个领导者的复制架构。虽然这是常见的方法,但也有一些有趣的替代方案。
Leader-based replication has one major downside: there is only one leader, and all writes must go through it. iv If you can’t connect to the leader for any reason, for example due to a network interruption between you and the leader, you can’t write to the database.
基于领导者的复制有一个主要缺点:只有一个领导者,所有写入都必须经过它。如果由于任何原因(例如你与领导者之间的网络中断)无法连接到领导者,你就无法向数据库写入。
A natural extension of the leader-based replication model is to allow more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all the other nodes. We call this a multi-leader configuration (also known as master–master or active/active replication ). In this setup, each leader simultaneously acts as a follower to the other leaders.
基于领导者的复制模型的一个自然扩展,是允许多个节点接受写入。复制仍然以相同的方式进行:处理写入的每个节点都必须把数据变更转发给所有其他节点。我们称之为多领导者配置(也称为主-主复制或双活复制)。在这种设置中,每个领导者同时也是其他领导者的跟随者。
Use Cases for Multi-Leader Replication
It rarely makes sense to use a multi-leader setup within a single datacenter, because the benefits rarely outweigh the added complexity. However, there are some situations in which this configuration is reasonable.
在单个数据中心中很少有必要使用多领导者架构, 因为收益很少能够超过额外的复杂度。然而, 在某些情况下这种配置是合理的。
Multi-datacenter operation
Imagine you have a database with replicas in several different datacenters (perhaps so that you can tolerate failure of an entire datacenter, or perhaps in order to be closer to your users). With a normal leader-based replication setup, the leader has to be in one of the datacenters, and all writes must go through that datacenter.
想象一下,你有一个数据库,它在几个不同的数据中心中有副本(可能是为了能够容忍整个数据中心的故障,或者可能是为了更加接近你的用户)。在普通的基于领导者的复制设置中,领导者必须在其中一个数据中心中,而所有写操作必须经过该数据中心。
In a multi-leader configuration, you can have a leader in each datacenter. Figure 5-6 shows what this architecture might look like. Within each datacenter, regular leader–follower replication is used; between datacenters, each datacenter’s leader replicates its changes to the leaders in other datacenters.
在多领导者配置中,您可以在每个数据中心都拥有一位领导者。图5-6显示了这种架构可能的外观。在每个数据中心内,使用常规的领导者-跟随者复制;在数据中心之间,每个数据中心的领导者将其更改复制到其他数据中心的领导者。
Let’s compare how the single-leader and multi-leader configurations fare in a multi-datacenter deployment:
让我们来比较单领导者和多领导者配置在多数据中心部署中的表现:
- Performance
-
In a single-leader configuration, every write must go over the internet to the datacenter with the leader. This can add significant latency to writes and might contravene the purpose of having multiple datacenters in the first place. In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters. Thus, the inter-datacenter network delay is hidden from users, which means the perceived performance may be better.
在单个领导者配置中,每个写操作都必须通过互联网传输到具有领导者的数据中心。这可能会增加写入的延迟,并可能违反首先拥有多个数据中心的目的。在多领导者配置中,每个写操作可以在本地数据中心中处理,并异步复制到其他数据中心。因此,数据中心之间的网络延迟对用户来说是隐藏的,这意味着感知性能可能更好。
- Tolerance of datacenter outages
-
In a single-leader configuration, if the datacenter with the leader fails, failover can promote a follower in another datacenter to be leader. In a multi-leader configuration, each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online.
在单个领导者配置中,如果具有领导者的数据中心失败,故障转移可以将另一个数据中心中的追随者晋升为领导者。在多个领导者配置中,每个数据中心都可以独立运行,而且当失败的数据中心恢复线上时,复制会追赶上来。
- Tolerance of network problems
-
Traffic between datacenters usually goes over the public internet, which may be less reliable than the local network within a datacenter. A single-leader configuration is very sensitive to problems in this inter-datacenter link, because writes are made synchronously over this link. A multi-leader configuration with asynchronous replication can usually tolerate network problems better: a temporary network interruption does not prevent writes being processed.
数据中心之间的流量通常经过公共互联网,其可靠性可能不如数据中心内部的本地网络。单领导者配置对这条数据中心间链路的问题非常敏感,因为写入要通过这条链路同步进行。采用异步复制的多领导者配置通常能更好地容忍网络问题:临时的网络中断不会妨碍写入被处理。
Some databases support multi-leader configurations by default, but it is also often implemented with external tools, such as Tungsten Replicator for MySQL [ 26 ], BDR for PostgreSQL [ 27 ], and GoldenGate for Oracle [ 19 ].
一些数据库默认支持多主配置,但也经常通过外部工具实现,例如 MySQL 的 Tungsten Replicator [26],PostgreSQL 的 BDR [27],以及 Oracle 的 GoldenGate [19]。
Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved (indicated as “conflict resolution” in Figure 5-6 ). We will discuss this issue in “Handling Write Conflicts” .
虽然多领导者复制具有优点,但也有一个很大的缺点:同样的数据可能会在两个不同的数据中心中被同时修改,这些写冲突必须被解决(在图5-6中被表示为“冲突解决”)。我们将在“处理写冲突”中讨论这个问题。
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible [ 28 ].
由于多领导者复制在许多数据库中是后来才加装的功能,常常存在一些微妙的配置陷阱,以及与其他数据库功能之间令人意外的相互作用。例如,自增主键、触发器和完整性约束都可能出问题。因此,多领导者复制常被视为应尽量避开的危险领域[28]。
Clients with offline operation
Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet.
另一个适用于多主复制的情况是,如果您有一个应用程序需要在断开与互联网的连接时仍然继续工作。
For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You need to be able to see your meetings (make read requests) and enter new meetings (make write requests) at any time, regardless of whether your device currently has an internet connection. If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.
例如,考虑一下您手机、笔记本电脑和其他设备上的日历应用程序。无论您的设备当前是否有互联网连接,您都需要能够随时查看您的会议(进行读取请求)并输入新的会议(进行写入请求)。如果您在离线状态下进行任何更改,它们需要在设备下次联机时与服务器和您的其他设备同步。
In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available.
在这种情况下,每个设备都有一个本地数据库,充当领导者(它接受写入请求), 并且在您所有设备上的日历副本之间存在异步的多领导者复制过程(同步)。 复制滞后可能是几个小时甚至几天,这取决于您何时可以访问互联网。
From an architectural point of view, this setup is essentially the same as multi-leader replication between datacenters, taken to the extreme: each device is a “datacenter,” and the network connection between them is extremely unreliable. As the rich history of broken calendar sync implementations demonstrates, multi-leader replication is a tricky thing to get right.
从架构的角度来看,这种设置本质上与数据中心之间的多领导者复制相同,只是被推向了极端:每个设备都是一个“数据中心”,它们之间的网络连接极不可靠。正如众多失败的日历同步实现所表明的那样,多领导者复制是很难做对的。
There are tools that aim to make this kind of multi-leader configuration easier. For example, CouchDB is designed for this mode of operation [ 29 ].
有些工具旨在使这种多领导者配置更加容易。例如,CouchDB是为此模式设计的[29]。
Collaborative editing
Real-time collaborative editing applications allow several people to edit a document simultaneously. For example, Etherpad [ 30 ] and Google Docs [ 31 ] allow multiple people to concurrently edit a text document or spreadsheet (the algorithm is briefly discussed in “Automatic Conflict Resolution” ).
实时协作编辑应用程序可以让多人同时编辑一个文档。例如,Etherpad [30] 和 Google Docs [31] 允许多人同时编辑文本文档或电子表格(算法已在“自动冲突解决”中简要讨论)。
We don’t usually think of collaborative editing as a database replication problem, but it has a lot in common with the previously mentioned offline editing use case. When one user edits a document, the changes are instantly applied to their local replica (the state of the document in their web browser or client application) and asynchronously replicated to the server and any other users who are editing the same document.
我们通常不将协作编辑视为数据库复制问题,但它与先前提到的离线编辑用例有很多共同点。当一个用户编辑文档时,更改会立即应用于他们的本地副本(即在其Web浏览器或客户端应用程序中的文档状态),并异步地复制到服务器和任何其他正在编辑同一文档的用户。
If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. If another user wants to edit the same document, they first have to wait until the first user has committed their changes and released the lock. This collaboration model is equivalent to single-leader replication with transactions on the leader.
如果您想保证没有编辑冲突,应用程序必须在用户进行编辑之前获取文档的锁定。如果另一个用户想要编辑同一份文档,他们必须等待第一个用户提交更改并释放锁定。这种协作模型相当于在领导者上具有事务的单一领导者复制。
However, for faster collaboration, you may want to make the unit of change very small (e.g., a single keystroke) and avoid locking. This approach allows multiple users to edit simultaneously, but it also brings all the challenges of multi-leader replication, including requiring conflict resolution [ 32 ].
然而,为了更快的协作,您可能想要将更改的单元变得非常小(例如,一个单一的按键),并避免锁定。这种方法允许多个用户同时编辑,但也带来了多个领导者复制的所有挑战,包括需要冲突解决。
Handling Write Conflicts
The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
多领导者复制的最大问题是可能会发生写冲突,这意味着需要进行冲突解决。
For example, consider a wiki page that is simultaneously being edited by two users, as shown in Figure 5-7 . User 1 changes the title of the page from A to B, and user 2 changes the title from A to C at the same time. Each user’s change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected [ 33 ]. This problem does not occur in a single-leader database.
比如,考虑一个维基页面,同时由两个用户编辑,如图5-7所示。用户1将页面标题从A更改为B,用户2将标题从A更改为C。每个用户的更改都成功应用于其本地领导者。然而,当更改被异步复制时,会检测到冲突[33]。这个问题在单领导者数据库中不会发生。
Synchronous versus asynchronous conflict detection
In a single-leader database, the second writer will either block and wait for the first write to complete, or abort the second write transaction, forcing the user to retry the write. On the other hand, in a multi-leader setup, both writes are successful, and the conflict is only detected asynchronously at some later point in time. At that time, it may be too late to ask the user to resolve the conflict.
在单个领导者数据库中,第二个写入者将要么阻塞并等待第一个写入完成,要么中止第二个写入事务,强制用户重试写入。另一方面,在多个领导者设置中,两个写入都成功,冲突仅在稍后异步检测到。这时,可能已经太晚要求用户解决冲突。
In principle, you could make the conflict detection synchronous—i.e., wait for the write to be replicated to all replicas before telling the user that the write was successful. However, by doing so, you would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently. If you want synchronous conflict detection, you might as well just use single-leader replication.
原则上,您可以将冲突检测设置为同步 - 即在向用户报告写入成功之前,等待复制到所有副本。然而,这样做会丧失多主复制的主要优势:允许每个副本独立接受写入。如果您想要同步的冲突检测,那么您也可以使用单主复制。
Conflict avoidance
The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur. Since many implementations of multi-leader replication handle conflicts quite poorly, avoiding conflicts is a frequently recommended approach [ 34 ].
避免冲突的最简单策略是避免它们:如果应用程序能够确保特定记录的所有写入都通过同一领导者进行,则冲突就不会发生。由于许多多领导者复制的实现处理冲突非常糟糕,避免冲突是一种经常推荐的方法[34]。
For example, in an application where a user can edit their own data, you can ensure that requests from a particular user are always routed to the same datacenter and use the leader in that datacenter for reading and writing. Different users may have different “home” datacenters (perhaps picked based on geographic proximity to the user), but from any one user’s point of view the configuration is essentially single-leader.
例如,在一个用户可以编辑自己数据的应用程序中,您可以确保来自特定用户的请求始终路由到同一数据中心,并使用该数据中心的领导者进行读写操作。不同的用户可能有不同的“家庭”数据中心(可能是基于地理距离选择的),但从任何一个用户的角度来看,配置基本上是单主导的。
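As a rough illustration of this routing idea, here is a minimal sketch in Python; the datacenter names, hashing scheme, and write API are hypothetical, not taken from any particular system. Each user is deterministically assigned a “home” datacenter, and all of that user’s writes go through the leader there.
作为对这种路由思路的粗略示意,下面是一个最小的Python草图(数据中心名称、哈希方案和写入接口都是假设的,并非来自任何特定系统):每个用户被确定性地分配一个“家庭”数据中心,该用户的所有写入都经过那里的领导者。

```python
import hashlib

DATACENTERS = ["eu-west", "us-east", "ap-south"]  # illustrative names only

def home_datacenter(user_id: str) -> str:
    # Deterministically map each user to one datacenter; a real system might
    # instead choose the datacenter geographically closest to the user.
    digest = hashlib.sha1(user_id.encode()).digest()
    return DATACENTERS[digest[0] % len(DATACENTERS)]

# All writes for a given user are routed through the leader in that user's
# home datacenter, so two leaders never accept concurrent writes for the
# same record and conflicts cannot arise.
print(home_datacenter("user-1234"))  # e.g. "us-east"
```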
However, sometimes you might want to change the designated leader for a record—perhaps because one datacenter has failed and you need to reroute traffic to another datacenter, or perhaps because a user has moved to a different location and is now closer to a different datacenter. In this situation, conflict avoidance breaks down, and you have to deal with the possibility of concurrent writes on different leaders.
然而,有时你可能想要更改某条记录被指定的领导者——可能是因为一个数据中心发生故障,你需要把流量重新路由到另一个数据中心,或者因为用户搬到了别处,现在离另一个数据中心更近。在这种情况下,冲突避免就会失效,你必须处理不同领导者上并发写入的可能性。
Converging toward a consistent state
A single-leader database applies writes in a sequential order: if there are several updates to the same field, the last write determines the final value of the field.
单主数据库按顺序应用写操作:如果对同一字段进行了多个更新,则最后一次写入确定字段的最终值。
In a multi-leader configuration, there is no defined ordering of writes, so it’s not clear what the final value should be. In Figure 5-7 , at leader 1 the title is first updated to B and then to C; at leader 2 it is first updated to C and then to B. Neither order is “more correct” than the other.
在多领导者配置中,写入的顺序没有定义,因此不清楚最终的值应该是什么。在图5-7中,领导者1首先将标题更新为B,然后更新为C;在领导者2中,它首先更新为C,然后更新为B。没有一种顺序比另一种“更正确”。
If each replica simply applied writes in the order that it saw the writes, the database would end up in an inconsistent state: the final value would be C at leader 1 and B at leader 2. That is not acceptable—every replication scheme must ensure that the data is eventually the same in all replicas. Thus, the database must resolve the conflict in a convergent way, which means that all replicas must arrive at the same final value when all changes have been replicated.
如果每个副本只是按照看到的写入顺序应用写入操作,数据库将最终处于不一致的状态:Leader1的最终值将是C,而Leader2的最终值将是B。这是不可接受的-每个复制方案必须确保最终所有副本中的数据相同。因此,数据库必须以收敛方式解决冲突,这意味着所有副本在复制所有更改后必须到达相同的最终值。
There are various ways of achieving convergent conflict resolution:
实现收敛的冲突解决有多种方法:
-
Give each write a unique ID (e.g., a timestamp, a long random number, a UUID, or a hash of the key and value), pick the write with the highest ID as the winner , and throw away the other writes. If a timestamp is used, this technique is known as last write wins (LWW). Although this approach is popular, it is dangerously prone to data loss [ 35 ]. We will discuss LWW in more detail at the end of this chapter ( “Detecting Concurrent Writes” ).
给每个写入一个唯一的ID(例如,时间戳,长随机数,UUID或键值的哈希),选择具有最高ID的写入为赢家,并且丢弃其他写入。如果使用时间戳,这种技术被称为最后写入赢(LWW)。尽管此方法很流行,但极易丢失数据[35]。我们将在本章末尾“检测并发写入”中更详细地讨论LWW。
-
Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.
给每个副本分配唯一的ID,并允许来自较高编号的副本发起的写操作始终优先于来自较低编号的副本的写操作。这种方法也意味着数据丢失。
-
Somehow merge the values together—e.g., order them alphabetically and then concatenate them (in Figure 5-7 , the merged title might be something like “B/C”).
以某种方式将值合并在一起 - 例如,按字母顺序排列它们,然后将它们连接起来(在5-7图中,合并后的标题可能是“B / C”之类的东西)。
-
Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).
将冲突记录在显式数据结构中,以保留所有信息,并编写应用程序代码以在以后的某个时间解决冲突(可能通过提示用户)。
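As a toy illustration of the first strategy in the list above (last write wins), the sketch below keeps only the write with the highest timestamp and silently discards the rest; it is a simplified model of the idea, not any specific database’s implementation.
作为上面列表中第一种策略(最后写入胜利)的玩具示例,下面的草图只保留时间戳最大的写入,并静默丢弃其余写入;这只是对该思想的简化模型,并非任何具体数据库的实现。

```python
from typing import NamedTuple

class Write(NamedTuple):
    timestamp: float   # wall-clock time attached by the originating leader
    value: str

def lww_resolve(writes: list[Write]) -> Write:
    # Keep only the write with the highest timestamp; everything else is
    # silently discarded, which is why LWW is prone to data loss.
    return max(writes, key=lambda w: w.timestamp)

conflict = [Write(1700000001.2, "B"), Write(1700000001.5, "C")]
print(lww_resolve(conflict).value)  # "C" wins; the write of "B" is lost
```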
Custom conflict resolution logic
As the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic using application code. That code may be executed on write or on read:
由于解决冲突的最适当方式可能取决于应用程序,因此大多数多领导者复制工具可以使用应用程序代码编写冲突解决逻辑。该代码可以在写入或读取时执行。
- On write
-
As soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler. For example, Bucardo allows you to write a snippet of Perl for this purpose. This handler typically cannot prompt a user—it runs in a background process and it must execute quickly.
当数据库系统检测到复制更改日志中的冲突时,它会调用冲突处理程序。例如,Bucardo允许您编写一个Perl片段来处理此类冲突。此处理程序通常无法提示用户-它在后台进程中运行,并且必须快速执行。
- On read
-
When a conflict is detected, all the conflicting writes are stored. The next time the data is read, these multiple versions of the data are returned to the application. The application may prompt the user or automatically resolve the conflict, and write the result back to the database. CouchDB works this way, for example.
当检测到冲突时,所有冲突的写操作都被存储。下一次数据被读取时,这些数据的多个版本将返回到应用程序中。应用程序可以提示用户或自动解决冲突,并将结果写回到数据库。例如,CouchDB就是这样工作的。
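As a rough sketch of the on-write style, the following shows the general shape of such a handler: the replication layer hands the application two conflicting versions of a row and stores whatever the handler returns. This is illustrative only and is not Bucardo’s or any other tool’s actual API.
作为对“写时解决”风格的粗略示意,下面展示了这类处理程序的大致形态:复制层把同一行的两个冲突版本交给应用程序,并存储处理程序返回的结果。这仅作示意,并不是Bucardo或其他工具的真实API。

```python
# Hypothetical on-write conflict handler; not Bucardo's real interface,
# just a sketch of the general shape such hooks take.
def resolve_title_conflict(local_row: dict, remote_row: dict) -> dict:
    # Runs in a background replication process, so it must be quick and
    # cannot prompt a user; here we deterministically merge the two titles.
    merged_title = "/".join(sorted({local_row["title"], remote_row["title"]}))
    return {**local_row, "title": merged_title}

print(resolve_title_conflict({"id": 1, "title": "B"}, {"id": 1, "title": "C"}))
# {'id': 1, 'title': 'B/C'}
```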
Note that conflict resolution usually applies at the level of an individual row or document, not for an entire transaction [ 36 ]. Thus, if you have a transaction that atomically makes several different writes (see Chapter 7 ), each write is still considered separately for the purposes of conflict resolution.
需要注意的是,冲突解决通常适用于单个行或文档级别,而不是整个事务[36]。因此,如果您有一个原子地进行多个不同写入的事务(见第7章),则每个写入仍然被单独考虑用于冲突解决的目的。
What is a conflict?
Some kinds of conflict are obvious. In the example in Figure 5-7 , two writes concurrently modified the same field in the same record, setting it to two different values. There is little doubt that this is a conflict.
一些冲突是显而易见的。在图5-7的例子中,两个写操作同时修改了同一条记录中的同一字段,将其设置为两个不同的值。毫无疑问,这是一种冲突。
Other kinds of conflict can be more subtle to detect. For example, consider a meeting room booking system: it tracks which room is booked by which group of people at which time. This application needs to ensure that each room is only booked by one group of people at any one time (i.e., there must not be any overlapping bookings for the same room). In this case, a conflict may arise if two different bookings are created for the same room at the same time. Even if the application checks availability before allowing a user to make a booking, there can be a conflict if the two bookings are made on two different leaders.
其他类型的冲突可能更加难以检测。例如,考虑一个会议室预订系统:它跟踪哪个房间在什么时间被哪个团队预订。这个应用程序需要确保每个房间在任何时候只被一个团队预订(即,不能有同一房间的重叠预订)。在这种情况下,如果同一时间为同一房间创建了两个不同的预订,则可能会发生冲突。即使应用程序在允许用户预订之前检查可用性,如果两个预订是在两个不同的领导者上进行的,仍可能存在冲突。
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a good understanding of this problem. We will see some more examples of conflicts in Chapter 7 , and in Chapter 12 we will discuss scalable approaches for detecting and resolving conflicts in a replicated system.
这个问题没有快速现成的答案,但是在接下来的章节中,我们会追溯一条路径,以便更好地理解这个问题。在第七章中,我们将看到更多的冲突示例,在第十二章中,我们将讨论检测和解决复制系统中冲突的可扩展方法。
Multi-Leader Replication Topologies
A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, like in Figure 5-7 , there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible. Some examples are illustrated in Figure 5-8 .
复制拓扑描述了写入从一个节点到另一个节点传播的通信路径。如果您有两个领导者,如图5-7所示,则只有一种合理的拓扑:领导者1必须将其所有写入发送到领导者2,反之亦然。具有两个以上领导者时,可能存在各种不同的拓扑。图5-8中给出了一些示例。
The most general topology is all-to-all ( Figure 5-8 [c]), in which every leader sends its writes to every other leader. However, more restricted topologies are also used: for example, MySQL by default supports only a circular topology [ 34 ], in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology has the shape of a star: one designated root node forwards writes to all of the other nodes. The star topology can be generalized to a tree.
最通用的拓扑结构是全部到全部(图5-8 [c]),其中每个领导者都将其写入发送给其他每个领导者。但是,也会使用限制更多的拓扑结构:例如,MySQL默认只支持环形拓扑[34],其中每个节点接收来自一个节点的写入,并将这些写入(加上它自己的任何写入)转发给另一个节点。另一种流行的拓扑结构是星形:一个指定的根节点将写入转发给所有其他节点。星形拓扑可以推广为树形结构。
In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log, each write is tagged with the identifiers of all the nodes it has passed through [ 43 ]. When a node receives a data change that is tagged with its own identifier, that data change is ignored, because the node knows that it has already been processed.
在环形和星型拓扑结构中,一个写操作可能需要经过多个节点才能到达所有副本。因此,节点需要转发从其他节点接收到的数据变更。为了防止无限复制循环,每个节点都被赋予了一个唯一的标识符,并且在复制日志中,每个写操作都带有它经过的所有节点的标识符。当一个节点接收到带有自己标识符的数据变更时,它将忽略此变更,因为该节点知道它已经被处理过了。
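A minimal sketch of this loop-prevention rule, assuming a simple in-memory representation of a change; the apply step and the peer delivery callables are stand-ins for a real replication log and network layer.
下面是这条防循环规则的最小草图,假设变更用一个简单的内存结构表示;本地应用步骤和对等节点的投递函数只是真实复制日志和网络层的替身。

```python
def apply_locally(change: dict) -> None:
    # Stand-in for applying the change to this node's local storage.
    print("applied", change["key"], "=", change["value"])

def forward_change(change: dict, node_id: str, peers: list) -> None:
    # Each change is tagged with the identifiers of every node it has passed
    # through; a node that sees its own identifier ignores the change, which
    # breaks infinite replication loops in circular and star topologies.
    if node_id in change["seen_by"]:
        return
    apply_locally(change)
    change["seen_by"].append(node_id)
    for deliver in peers:
        deliver(change)  # "deliver" is just a callable that hands the change to the next node

change = {"key": "title", "value": "B", "seen_by": []}
forward_change(change, "node-1",
               peers=[lambda c: forward_change(c, "node-2", [])])
```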
A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, causing them to be unable to communicate until the node is fixed. The topology could be reconfigured to work around the failed node, but in most deployments such reconfiguration would have to be done manually. The fault tolerance of a more densely connected topology (such as all-to-all) is better because it allows messages to travel along different paths, avoiding a single point of failure.
循环和星型拓扑的问题在于,如果仅有一个节点失败,它就可能会中断其他节点之间的复制消息流,导致它们无法通信,直到该节点得到修复。虽然可以重新配置拓扑来解决故障节点问题,但在大多部署中,这种重新配置通常必须手动完成。更密集连接的拓扑(例如全互联)的容错性更好,因为它允许消息沿不同的路径传播,避免单点故障。
On the other hand, all-to-all topologies can have issues too. In particular, some network links may be faster than others (e.g., due to network congestion), with the result that some replication messages may “overtake” others, as illustrated in Figure 5-9 .
但是,全互连拓扑结构也可能存在问题。特别是,一些网络连接可能比其他连接更快(例如,由于网络拥塞),导致一些复制消息可能会“超越”其他消息,如图5-9所示。
In Figure 5-9 , client A inserts a row into a table on leader 1, and client B updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may first receive the update (which, from its point of view, is an update to a row that does not exist in the database) and only later receive the corresponding insert (which should have preceded the update).
在图5-9中,客户端A在领袖1上向表中插入一行,而客户端B在领袖3上更新该行。但领袖2可能以不同的顺序接收写入操作:它可能首先接收更新操作(从它的角度来看,这是对不存在于数据库中的行进行的更新),然后才接收相应的插入操作(本应该先于更新操作)。
This is a problem of causality, similar to the one we saw in “Consistent Prefix Reads” : the update depends on the prior insert, so we need to make sure that all nodes process the insert first, and then the update. Simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see Chapter 8 ).
这是一个因果关系问题,类似于我们在“一致前缀读”中看到的问题:更新依赖于之前的插入,所以我们需要确保所有节点先处理插入,再处理更新。仅仅给每个写入附加一个时间戳是不够的,因为时钟无法被信任同步得足够好,从而在领导者2上正确地为这些事件排序(见第8章)。
To order these events correctly, a technique called version vectors can be used, which we will discuss later in this chapter (see “Detecting Concurrent Writes” ). However, conflict detection techniques are poorly implemented in many multi-leader replication systems. For example, at the time of writing, PostgreSQL BDR does not provide causal ordering of writes [ 27 ], and Tungsten Replicator for MySQL doesn’t even try to detect conflicts [ 34 ].
为了正确地排序这些事件,可以使用一种称为版本向量的技术,我们将在本章稍后讨论(参见“检测并发写入”)。然而,许多多领导者复制系统的冲突检测技术实现得很差。例如,在撰写本文时,PostgreSQL BDR不提供写入的因果排序[27],而MySQL的Tungsten Replicator甚至不尝试检测冲突[34]。
If you are using a system with multi-leader replication, it is worth being aware of these issues, carefully reading the documentation, and thoroughly testing your database to ensure that it really does provide the guarantees you believe it to have.
如果您正在使用具有多主复制系统的系统,则值得了解这些问题,仔细阅读文档,并彻底测试您的数据库以确保它确实提供您所相信的保证。
Leaderless Replication
The replication approaches we have discussed so far in this chapter—single-leader and multi-leader replication—are based on the idea that a client sends a write request to one node (the leader), and the database system takes care of copying that write to the other replicas. A leader determines the order in which writes should be processed, and followers apply the leader’s writes in the same order.
目前在本章中讨论的复制方法——单领导者和多领导者复制,基于一个客户端向一个节点(领导者)发送写请求的想法,数据库系统会处理将该写入复制到其他副本的工作。领导者确定写入的处理顺序,跟随者按照相同的顺序应用领导者的写入。
Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Some of the earliest replicated data systems were leaderless [ 1 , 44 ], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house Dynamo system [ 37 ]. vi Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style .
一些数据存储系统采取了不同的方法,放弃了领导者的概念,允许任何副本直接接受来自客户端的写入。一些最早的复制数据系统就是无领导者的[1, 44],但在关系数据库占主导地位的时代,这个想法大多被遗忘了。在亚马逊将其用于内部的Dynamo系统[37]之后,它再次成为一种流行的数据库架构。Riak、Cassandra和Voldemort是受Dynamo启发、采用无领导者复制模型的开源数据存储,因此这类数据库也被称为Dynamo风格。
In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client. However, unlike a leader database, that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has profound consequences for the way the database is used.
在一些无领导实现中,客户端直接向多个副本发送其写入内容,而在另一些实现中,一个协调节点代表客户端执行此操作。然而,与领导数据库不同,该协调器不会强制执行特定的写入顺序。正如我们将看到的那样,这种设计上的差异对数据库的使用方式有深远的影响。
Writing to the Database When a Node Is Down
Imagine you have a database with three replicas, and one of the replicas is currently unavailable—perhaps it is being rebooted to install a system update. In a leader-based configuration, if you want to continue processing writes, you may need to perform a failover (see “Handling Node Outages” ).
假设您拥有一个包含三个副本的数据库,并且其中一个副本当前不可用 —— 可能正在重新启动以安装系统更新。在基于领导者的配置中,如果您想要继续处理写入操作,您可能需要执行故障转移(请参阅“处理节点故障”)。
On the other hand, in a leaderless configuration, failover does not exist. Figure 5-10 shows what happens: the client (user 1234) sends the write to all three replicas in parallel, and the two available replicas accept the write but the unavailable replica misses it. Let’s say that it’s sufficient for two out of three replicas to acknowledge the write: after user 1234 has received two ok responses, we consider the write to be successful. The client simply ignores the fact that one of the replicas missed the write.
另一方面,在没有领导的配置中,不存在故障转移。如图5-10所示:客户端(用户1234)并行发送写请求到所有三个副本,两个可用的副本接受写请求,但不可用的副本未接受此请求。假设只需要三个副本中的两个副本确认写操作即可,当用户1234接收到两个确认响应之后,我们认为写操作已成功。客户端简单地忽略了一个副本未接受写请求的情况。
Now imagine that the unavailable node comes back online, and clients start reading from it. Any writes that happened while the node was down are missing from that node. Thus, if you read from that node, you may get stale (outdated) values as responses.
现在想象一下,不可用的节点重新联机,客户端开始从它那里读取。在该节点离线期间发生的任何写操作都将缺失于该节点。因此,如果您从该节点读取,您可能会收到过时的响应值。
To solve that problem, when a client reads from the database, it doesn’t just send its request to one replica: read requests are also sent to several nodes in parallel . The client may get different responses from different nodes; i.e., the up-to-date value from one node and a stale value from another. Version numbers are used to determine which value is newer (see “Detecting Concurrent Writes” ).
为了解决这个问题,当客户端从数据库中读取数据时,它不仅会将请求发送到一个复制节点,读取请求还会同时发送到多个节点。客户端可能会从不同的节点获取不同的响应;即从一个节点获取最新值,而从另一个节点获取旧值。版本号用于确定哪个值是更新的(请参阅“检测并发写入”)。
Read repair and anti-entropy
The replication scheme should ensure that eventually all the data is copied to every replica. After an unavailable node comes back online, how does it catch up on the writes that it missed?
复制方案应确保最终所有数据都被复制到每个副本。当一个不可用的节点重新联机后,它如何追赶它错过的写入呢?
Two mechanisms are often used in Dynamo-style datastores:
通常在 Dynamo 风格的数据存储中使用了两种机制:
- Read repair
-
When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10 , user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read.
当客户端并行地从多个节点读取时,它可以检测到任何陈旧的响应。例如,在图5-10中,用户2345从副本3获取了版本6的值,并从副本1和2获取了版本7的值。客户端发现副本3有一个陈旧的值,于是把较新的值写回该副本。这种方法对频繁读取的值很有效。
- Anti-entropy process
-
In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.
此外,一些数据存储具有后台进程,不断查找副本之间数据的差异,并将任何缺失的数据从一个副本复制到另一个副本。与基于领导者的复制中的复制日志不同,此反熵过程不按任何特定顺序复制写入,并且数据复制可能会有显着延迟。
Not all systems implement both of these; for example, Voldemort currently does not have an anti-entropy process. Note that without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.
并非所有的系统都实现了这两种方法,例如目前 Voldemort 就没有一个反熵过程。请注意,如果没有反熵过程,很少被读取的值可能会在某些副本中丢失,从而降低了其耐用性,因为只有在应用程序读取一个值时才会进行读取修复。
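A toy sketch of read repair, in which each replica is just a dictionary mapping keys to (version, value) records; a real datastore would issue the reads in parallel and over the network, but the repair logic has the same shape.
下面是读修复的玩具草图,其中每个副本只是一个把键映射到(版本,值)记录的字典;真实的数据存储会通过网络并行发出这些读取,但修复逻辑的形态是相同的。

```python
def read_with_repair(key, replicas):
    # Read from all replicas (sequentially here, for simplicity).
    responses = [(r, r.get(key)) for r in replicas]
    # Pick the response with the highest version number as the current value.
    newest = max((resp for _, resp in responses if resp is not None),
                 key=lambda resp: resp["version"])
    # Write the newest value back to any replica holding a stale or missing copy.
    for replica, resp in responses:
        if resp is None or resp["version"] < newest["version"]:
            replica[key] = dict(newest)
    return newest["value"]

r1 = {"x": {"version": 7, "value": "new"}}
r2 = {"x": {"version": 7, "value": "new"}}
r3 = {"x": {"version": 6, "value": "old"}}   # this replica missed a write
print(read_with_repair("x", [r1, r2, r3]))   # "new"; r3 is repaired as a side effect
print(r3["x"]["version"])                    # 7
```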
Quorums for reading and writing
In the example of Figure 5-10 , we considered the write to be successful even though it was only processed on two out of three replicas. What if only one out of three replicas accepted the write? How far can we push this?
在图5-10的示例中,即使仅在三个副本中处理了两个,我们仍考虑写操作是成功的。如果只有三个副本中的一个接受了写入,我们可以推到多远?
If we know that every successful write is guaranteed to be present on at least two out of three replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas, we can be sure that at least one of the two is up to date. If the third replica is down or slow to respond, reads can nevertheless continue returning an up-to-date value.
如果我们知道每个成功的写入都保证在三个副本中至少有两个出现,这意味着最多只能有一个副本过期。因此,如果我们从至少两个副本中读取,我们可以确信其中至少有一个是最新的。如果第三个副本停机或响应缓慢,读取仍然可以返回最新的值。
More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n , we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [ 44 ]. vii You can think of r and w as the minimum number of votes required for the read or write to be valid.
一般来说,如果有n个副本,每次写入必须得到w个节点的确认才能被视为成功,每次读取必须至少查询r个节点。(在我们的例子中,n=3,w=2,r=2。)只要w+r>n,我们预计读取时能得到最新的值,因为所读取的r个节点中至少有一个一定是最新的。遵守这些r和w值的读写被称为法定人数读写(quorum reads and writes)[44]。你可以把r和w看作一次读或写操作有效所需的最少“投票数”。
In Dynamo-style databases, the parameters n , w , and r are typically configurable. A common choice is to make n an odd number (typically 3 or 5) and to set w = r = ( n + 1) / 2 (rounded up). However, you can vary the numbers as you see fit. For example, a workload with few writes and many reads may benefit from setting w = n and r = 1. This makes reads faster, but has the disadvantage that just one failed node causes all database writes to fail.
在 Dynamo 风格的数据库中,参数 n、w 和 r 通常是可配置的。一个普遍的选择是将 n 设为奇数(通常是 3 或 5),并将 w = r = (n + 1) / 2(向上取整)。然而,你可以根据需要变化这些数字。例如,一个写入较少但读取较多的工作负载可能会从设置 w = n 和 r = 1 中获益。这会加快读取速度,但缺点是只要一个节点失败,所有数据库的写入就会失败。
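A small sketch of the quorum arithmetic described above: choosing w = r = ⌈(n + 1) / 2⌉ and counting acknowledgments against w. It only illustrates the bookkeeping, not a full datastore.
下面用一小段代码示意上述法定人数的计算:取 w = r = ⌈(n + 1) / 2⌉,并用 w 来判断确认数是否足够。它只演示这套计数逻辑,并不是完整的数据存储。

```python
import math

def quorum_params(n: int) -> tuple[int, int]:
    # Common default: w = r = ceil((n + 1) / 2), which guarantees w + r > n.
    w = r = math.ceil((n + 1) / 2)
    return w, r

def write_ok(acks: int, w: int) -> bool:
    # A write is reported as successful once at least w replicas acknowledge it.
    return acks >= w

n = 5
w, r = quorum_params(n)
print(w, r, w + r > n)          # 3 3 True -> read set always overlaps the write set
print(write_ok(acks=3, w=w))    # True: writes tolerate up to n - w = 2 unavailable nodes
```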
Note
There may be more than n nodes in the cluster, but any given value is stored only on n nodes. This allows the dataset to be partitioned, supporting datasets that are larger than you can fit on one node. We will return to partitioning in Chapter 6 .
集群中的节点数可能多于n个,但任何给定的值只存储在n个节点上。这使得数据集可以被分区,从而支持比单个节点所能容纳的更大的数据集。我们将在第6章回到分区的话题。
The quorum condition, w + r > n , allows the system to tolerate unavailable nodes as follows:
法定人数条件w+r>n允许系统在以下情况下容忍不可用节点:
-
If w < n , we can still process writes if a node is unavailable.
如果w < n,即使节点不可用,我们仍然可以处理写操作。
-
If r < n , we can still process reads if a node is unavailable.
如果 r < n,即使节点不可用,我们仍然可以处理读取。
-
With n = 3, w = 2, r = 2 we can tolerate one unavailable node.
当n = 3,w = 2,r = 2时,我们可以容忍一个不可用节点。
-
With n = 5, w = 3, r = 3 we can tolerate two unavailable nodes. This case is illustrated in Figure 5-11 .
当n = 5,w = 3,r = 3时,我们可以容忍两个不可用的节点。此情况在图5-11中说明。
-
Normally, reads and writes are always sent to all n replicas in parallel. The parameters w and r determine how many nodes we wait for—i.e., how many of the n nodes need to report success before we consider the read or write to be successful.
通常情况下,读写操作总是并行发送到所有的n个副本。参数w和r决定我们等待多少个节点——即多少个n个节点需要报告成功,我们才认为读写操作成功。
If fewer than the required w or r nodes are available, writes or reads return an error. A node could be unavailable for many reasons: because the node is down (crashed, powered down), due to an error executing the operation (can’t write because the disk is full), due to a network interruption between the client and the node, or for any number of other reasons. We only care whether the node returned a successful response and don’t need to distinguish between different kinds of fault.
如果可用的w或r节点少于所需数量,写入或读取将返回错误。节点可能无法使用的原因很多:因为节点宕机(崩溃,断电),由于执行操作时出现错误(因为磁盘已满而无法写入),由于客户端和节点之间的网络中断,或由于任何其他原因。我们只关心节点是否返回成功响应,无需区分不同类型的故障。
Limitations of Quorum Consistency
If you have n replicas, and you choose w and r such that w + r > n , you can generally expect every read to return the most recent value written for a key. This is the case because the set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That is, among the nodes you read there must be at least one node with the latest value (illustrated in Figure 5-11 ).
如果您有n个副本,并且选择w和r,使得w + r> n,则通常可以期望每次读取都返回针对键写入的最新值。这是因为您写入的节点集和您读取的节点集必须重叠。也就是说,您读取的节点集中必须至少有一个节点具有最新值(如图5-11所示)。
Often, r and w are chosen to be a majority (more than n /2) of nodes, because that ensures w + r > n while still tolerating up to n /2 node failures. But quorums are not necessarily majorities—it only matters that the sets of nodes used by the read and write operations overlap in at least one node. Other quorum assignments are possible, which allows some flexibility in the design of distributed algorithms [ 45 ].
通常,r 和 w 被选为多数(超过 n/2)节点,因为这样既能确保 w + r > n,又能容忍最多 n/2 个节点的故障。但法定人数不一定非要是多数——重要的只是读操作和写操作所使用的节点集合至少有一个节点重叠。其他的法定人数分配方式也是可能的,这为分布式算法的设计提供了一定的灵活性[45]。
You may also set w and r to smaller numbers, so that w + r ≤ n (i.e., the quorum condition is not satisfied). In this case, reads and writes will still be sent to n nodes, but a smaller number of successful responses is required for the operation to succeed.
你还可以将w和r设置为较小的数字,使得w + r ≤ n(即未满足法定人数条件)。在这种情况下,读取和写入仍将发送到n个节点,但操作成功所需的成功响应数量将减少。
With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value. On the upside, this configuration allows lower latency and higher availability: if there is a network interruption and many replicas become unreachable, there’s a higher chance that you can continue processing reads and writes. Only after the number of reachable replicas falls below w or r does the database become unavailable for writing or reading, respectively.
使用较小的w和r,你更有可能读到陈旧的值,因为你的读取更有可能没有包含持有最新值的节点。好处是,这种配置可以实现更低的延迟和更高的可用性:如果出现网络中断,许多副本变得无法访问,你能继续处理读写的机会就更大。只有当可达的副本数分别低于w或r时,数据库才会变得无法写入或读取。
However, even with w + r > n , there are likely to be edge cases where stale values are returned. These depend on the implementation, but possible scenarios include:
然而,即使w + r > n,仍然有可能出现返回过期值的边际情况。这些情况取决于实现,但可能的场景包括:
-
If a sloppy quorum is used (see “Sloppy Quorums and Hinted Handoff” ), the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes [ 46 ].
如果使用了松散的法定人数(参见“松散的法定人数与提示移交”),w个写入最终可能落在与r个读取不同的节点上,因此r个读取节点与w个写入节点之间不再有保证的重叠[46]。
-
If two writes occur concurrently, it is not clear which one happened first. In this case, the only safe solution is to merge the concurrent writes (see “Handling Write Conflicts” ). If a winner is picked based on a timestamp (last write wins), writes can be lost due to clock skew [ 35 ]. We will return to this topic in “Detecting Concurrent Writes” .
如果出现两个并发写入,不清楚哪一个先发生。在这种情况下,唯一安全的解决方案是合并并发的写入(参见“处理写入冲突”)。如果根据时间戳选择获胜者(最后一次写入获胜),由于时钟偏差可能会丢失写入[35]。我们将在“检测并发写入”中回到这个话题。
-
If a write happens concurrently with a read, the write may be reflected on only some of the replicas. In this case, it’s undetermined whether the read returns the old or the new value.
如果在读取发生并发写入时,写入可能只会反映在一些副本上。在这种情况下,无法确定读取是否会返回旧值还是新值。
-
If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write [ 47 ].
如果在某些副本上写入成功,但在其他副本上失败(例如因为某些节点上的磁盘已满),并且总体上成功的副本少于w个,则不会在成功的副本上回滚。这意味着如果写入报告为失败,则随后的读取可能会返回该写入的值或不返回。[47]。
-
If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w , breaking the quorum condition.
如果一个携带新值的节点发生故障,而其数据又从一个携带旧值的副本恢复,那么存储新值的副本数量可能会低于w,从而破坏法定人数条件。
-
Even if everything is working correctly, there are edge cases in which you can get unlucky with the timing, as we shall see in “Linearizability and quorums” .
即使一切都运行得正确,也有极端情况,在“线性化和仲裁”的内容中我们会看到可能会出现时机不利的情况。
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency. The parameters w and r allow you to adjust the probability of stale values being read, but it’s wise to not take them as absolute guarantees.
因此,尽管法定人数似乎保证读取最新写入的值,但实际上并不那么简单。 Dynamo样式的数据库通常针对可以容忍最终一致性的用例进行优化。参数w和r允许您调整读取过期值的概率,但明智的做法是不将它们视为绝对保证。
In particular, you usually do not get the guarantees discussed in “Problems with Replication Lag” (reading your writes, monotonic reads, or consistent prefix reads), so the previously mentioned anomalies can occur in applications. Stronger guarantees generally require transactions or consensus. We will return to these topics in Chapter 7 and Chapter 9 .
特别是,在“复制延迟问题”的讨论中通常不会得到“读取您的写入操作”、“单调读取”或“一致前缀读取”等保证,因此在应用程序中可能会出现前面提到的异常情况。更强的保证通常需要事务或共识。我们将在第7章和第9章返回这些主题。
Monitoring staleness
From an operational perspective, it’s important to monitor whether your databases are returning up-to-date results. Even if your application can tolerate stale reads, you need to be aware of the health of your replication. If it falls behind significantly, it should alert you so that you can investigate the cause (for example, a problem in the network or an overloaded node).
从操作角度看,监视数据库是否返回最新结果非常重要。即使您的应用程序可以容忍陈旧的读取,您也需要了解复制的健康状况。如果它落后很多,它应该向您发出警报,以便您可以调查原因(例如,在网络中出现问题或节点超载)。
For leader-based replication, the database typically exposes metrics for the replication lag, which you can feed into a monitoring system. This is possible because writes are applied to the leader and to followers in the same order, and each node has a position in the replication log (the number of writes it has applied locally). By subtracting a follower’s current position from the leader’s current position, you can measure the amount of replication lag.
针对基于领导者的复制,数据库通常公开用于复制滞后度的度量标准,您可以将其输入监控系统。这是可能的,因为写入按照相同顺序应用于领导者和关注者,并且每个节点在复制日志中都有一个位置(它已经本地应用的写入数量)。通过从关注者的当前位置减去领导者的当前位置,您可以测量复制滞后的量。
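A minimal sketch of that calculation, with made-up log positions; real databases expose these positions through their own metrics interfaces rather than through code like this.
下面是这一计算的最小草图,日志位置是虚构的;真实的数据库通过自己的监控指标接口暴露这些位置,而不是通过这样的代码。

```python
def replication_lag(leader_position: int, follower_position: int) -> int:
    # Both positions count writes applied in the same (leader-defined) order,
    # so their difference is the number of writes the follower is behind by.
    return leader_position - follower_position

# Illustrative numbers only: the leader has applied 1,000,000 writes,
# one follower has applied 999,950 of them.
print(replication_lag(1_000_000, 999_950))  # 50 writes of lag
```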
However, in systems with leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult. Moreover, if the database only uses read repair (no anti-entropy), there is no limit to how old a value might be—if a value is only infrequently read, the value returned by a stale replica may be ancient.
然而,在无领导者复制的系统中,写入的应用顺序没有固定的先后,这使得监控更加困难。此外,如果数据库只使用读修复(没有反熵),那么值的陈旧程度就没有上限——如果某个值很少被读取,陈旧副本返回的值可能已经非常古老。
There has been some research on measuring replica staleness in databases with leaderless replication and predicting the expected percentage of stale reads depending on the parameters n , w , and r [ 48 ]. This is unfortunately not yet common practice, but it would be good to include staleness measurements in the standard set of metrics for databases. Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be able to quantify “eventual.”
已经有一些研究针对无领导者复制的数据库测量副本陈旧度,并根据参数n、w和r预测读到陈旧值的预期百分比[48]。遗憾的是,这还不是普遍的做法,但把陈旧度度量纳入数据库的标准指标集会是件好事。最终一致性是一种刻意模糊的保证,但为了可运维性,能够量化“最终”是很重要的。
Sloppy Quorums and Hinted Handoff
Databases with appropriately configured quorums can tolerate the failure of individual nodes without the need for failover. They can also tolerate individual nodes going slow, because requests don’t have to wait for all n nodes to respond—they can return when w or r nodes have responded. These characteristics make databases with leaderless replication appealing for use cases that require high availability and low latency, and that can tolerate occasional stale reads.
配置适当仲裁的数据库可以容忍个别节点的失败,而无需进行故障转移。它们也可以容忍节点的缓慢,因为请求不必等待所有的n个节点响应,它们可以在w或r个节点响应时返回。这些特性使得具有无领导复制的数据库在需要高可用性和低延迟、可以容忍偶尔的陈旧读取的用例中非常有吸引力。
However, quorums (as described so far) are not as fault-tolerant as they could be. A network interruption can easily cut off a client from a large number of database nodes. Although those nodes are alive, and other clients may be able to connect to them, to a client that is cut off from the database nodes, they might as well be dead. In this situation, it’s likely that fewer than w or r reachable nodes remain, so the client can no longer reach a quorum.
然而,目前描述的仲裁(quorum)并没有尽可能实现容错性。网络中断很容易就会使客户端与大量数据库节点失去联系。虽然这些节点仍然存活,其他客户端可能仍能够连接它们,但对于与数据库节点失去联系的客户端来说,这些节点实际上就已经死亡。在这种情况下,可到达的节点数量很可能少于w或r,因此客户端无法再达到一个仲裁。
In a large cluster (with significantly more than n nodes) it’s likely that the client can connect to some database nodes during the network interruption, just not to the nodes that it needs to assemble a quorum for a particular value. In that case, database designers face a trade-off:
在一个非常大的群集(节点数量显著超过n)中,客户端可能可以在网络中断期间连接到一些数据库节点,只是无法连接到需要为特定值组装法定人数的节点。在这种情况下,数据库设计师面临一个权衡:
-
Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes?
对于所有无法达到w或r个节点法定人数的请求,直接返回错误是否更好?
-
Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?
或者我们应该接受写入请求,然后将其写入一些可达但不属于通常存储值的n个节点中的某些节点吗?
The latter is known as a sloppy quorum [ 37 ]: writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value. By analogy, if you lock yourself out of your house, you may knock on the neighbor’s door and ask whether you may stay on their couch temporarily.
后者被称为松散的法定人数(sloppy quorum)[37]:写入和读取仍然分别需要w和r个成功的响应,但这些响应可以来自不在某个值指定的n个“主”节点之列的节点。打个比方,如果你把自己锁在了家门外,你可能会敲邻居的门,问能否暂时借住在他们的沙发上。
Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes. This is called hinted handoff . (Once you find the keys to your house again, your neighbor politely asks you to get off their couch and go home.)
一旦网络中断被修复,一个节点代表另一个节点临时接受的任何写入都会被发送到相应的“主”节点。这被称为提示移交(hinted handoff)。(一旦你重新找到了家门钥匙,邻居就会礼貌地请你离开沙发回家。)
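A toy sketch of a sloppy quorum with hinted handoff: when a home node is unreachable, the write is accepted by a stand-in node together with a hint recording which node it really belongs to, and the hint is replayed once the home node recovers. The node class and function names are illustrative only.
下面是松散的法定人数与提示移交的玩具草图:当某个“主”节点不可达时,写入由替补节点接受,并附带一条记录其真正归属节点的提示;等主节点恢复后再重放该提示。其中的节点类和函数名仅作示意。

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}     # key -> value
        self.hints = []    # (home_node_name, key, value) accepted on behalf of others

    def write(self, key, value):
        self.data[key] = value

def sloppy_write(key, value, home_nodes, standby_nodes, w):
    acks = 0
    unreachable = [n for n in home_nodes if not n.up]
    for node in home_nodes:
        if node.up:
            node.write(key, value)
            acks += 1
    # Not enough home nodes reachable: borrow standby nodes, remembering
    # (as a "hint") which home node each write really belongs to.
    for home, standby in zip(unreachable, standby_nodes):
        if acks >= w:
            break
        standby.hints.append((home.name, key, value))
        acks += 1
    return acks >= w

def hinted_handoff(standby, nodes_by_name):
    # Once the home nodes are reachable again, replay the hinted writes to them.
    for home_name, key, value in standby.hints:
        nodes_by_name[home_name].write(key, value)
    standby.hints.clear()

a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
c.up = False
print(sloppy_write("x", 1, home_nodes=[a, b, c], standby_nodes=[d], w=3))  # True
c.up = True
hinted_handoff(d, {"c": c})
print(c.data)  # {'x': 1}
```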
Sloppy quorums are particularly useful for increasing write availability: as long as any w nodes are available, the database can accept writes. However, this means that even when w + r > n , you cannot be sure to read the latest value for a key, because the latest value may have been temporarily written to some nodes outside of n [ 47 ].
松散的法定人数对于提高写入可用性特别有用:只要有任意w个节点可用,数据库就可以接受写入。然而,这意味着即使w+r>n,你也不能确保读到某个键的最新值,因为最新值可能被临时写到了n之外的某些节点上[47]。
Thus, a sloppy quorum actually isn’t a quorum at all in the traditional sense. It’s only an assurance of durability, namely that the data is stored on w nodes somewhere. There is no guarantee that a read of r nodes will see it until the hinted handoff has completed.
因此,松散的法定人数实际上并不是传统意义上的法定人数。它只是一种持久性的保证,即数据被存储在了某处的w个节点上。在提示移交完成之前,无法保证从r个节点的读取能看到它。
Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default [ 46 , 49 , 50 ].
在所有通用的 Dynamo 实现中,松散的仲裁机制是可选的。在 Riak 中,它们默认启用,在Cassandra 和 Voldemort 中默认禁用。[46, 49, 50]。
Multi-datacenter operation
We previously discussed cross-datacenter replication as a use case for multi-leader replication (see “Multi-Leader Replication” ). Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.
我们之前讨论了跨数据中心复制作为多领袖复制的用例(参见“多首领复制”)。 无领导复制也适用于多数据中心操作,因为它被设计为容忍冲突的并发写入,网络中断和延迟波动。
Cassandra and Voldemort implement their multi-datacenter support within the normal leaderless model: the number of replicas n includes nodes in all datacenters, and in the configuration you can specify how many of the n replicas you want to have in each datacenter. Each write from a client is sent to all replicas, regardless of datacenter, but the client usually only waits for acknowledgment from a quorum of nodes within its local datacenter so that it is unaffected by delays and interruptions on the cross-datacenter link. The higher-latency writes to other datacenters are often configured to happen asynchronously, although there is some flexibility in the configuration [ 50 , 51 ].
Cassandra和Voldemort在常规的无领导者模型中实现了它们的多数据中心支持:副本数n包括所有数据中心中的节点,你可以在配置中指定每个数据中心要有多少个副本。客户端的每个写入都会发送到所有副本,无论副本在哪个数据中心,但客户端通常只等待来自其本地数据中心内法定数量节点的确认,因此不会受到跨数据中心链路上延迟和中断的影响。对其他数据中心的高延迟写入通常被配置为异步进行,不过配置上有一定的灵活性[50, 51]。
Riak keeps all communication between clients and database nodes local to one datacenter, so n describes the number of replicas within one datacenter. Cross-datacenter replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication [ 52 ].
Riak将客户端与数据库节点之间的所有通信都保持在一个数据中心本地,因此n描述的是一个数据中心内的副本数量。数据库集群之间的跨数据中心复制在后台异步进行,其方式类似于多领导者复制[52]。
Detecting Concurrent Writes
Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used. The situation is similar to multi-leader replication (see “Handling Write Conflicts” ), although in Dynamo-style databases conflicts can also arise during read repair or hinted handoff.
Dynamo式数据库允许多个客户端同时写入同一关键字,这意味着即使使用严格的仲裁,冲突仍将发生。这种情况类似于多领导者复制(请参阅“处理写冲突”),尽管在Dynamo式数据库中,冲突也可能在读取修复或提示式移交中出现。
The problem is that events may arrive in a different order at different nodes, due to variable network delays and partial failures. For example, Figure 5-12 shows two clients, A and B, simultaneously writing to a key X in a three-node datastore:
问题在于由于不同节点之间的网络延迟和部分故障,事件可能以不同的顺序到达不同的节点。例如,图5-12显示了两个客户端A和B同时向三节点数据存储中的键X写入的情况。
-
Node 1 receives the write from A, but never receives the write from B due to a transient outage.
节点1收到了来自A的写入,但由于短暂的服务中断,始终没有收到来自B的写入。
-
Node 2 first receives the write from A, then the write from B.
节点2首先收到来自A的写入,然后收到来自B的写入。
-
Node 3 first receives the write from B, then the write from A.
节点3先接收到来自B的写入,然后再接收来自A的写入。
If each node simply overwrote the value for a key whenever it received a write request from a client, the nodes would become permanently inconsistent, as shown by the final get request in Figure 5-12 : node 2 thinks that the final value of X is B, whereas the other nodes think that the value is A.
如果每个节点仅仅在收到来自客户端的写请求时覆盖键的值,节点将变得永久不一致,如图5-12中的最后一个get请求所示:节点2认为X的最终值是B,而其他节点则认为值为A。
In order to become eventually consistent, the replicas should converge toward the same value. How do they do that? One might hope that replicated databases would handle this automatically, but unfortunately most implementations are quite poor: if you want to avoid losing data, you—the application developer—need to know a lot about the internals of your database’s conflict handling.
为了实现最终一致性,副本应该收敛到同一值。如何实现这一点?人们可能希望复制的数据库能够自动处理这一点,但不幸的是,大多数实现都相当糟糕:如果您想避免丢失数据,您——应用程序开发者——需要了解很多有关数据库冲突处理的内部知识。
We briefly touched on some techniques for conflict resolution in “Handling Write Conflicts” . Before we wrap up this chapter, let’s explore the issue in a bit more detail.
在“处理写冲突”一章中,我们简要地介绍了一些冲突解决的技巧。在我们结束本章之前,让我们更详细地探讨这个问题。
Last write wins (discarding concurrent writes)
One approach for achieving eventual convergence is to declare that each replica need only store the most “recent” value and allow “older” values to be overwritten and discarded. Then, as long as we have some way of unambiguously determining which write is more “recent,” and every write is eventually copied to every replica, the replicas will eventually converge to the same value.
达到最终收敛的方法之一是声明每个副本只需存储最“新”的值,允许“旧”的值被覆盖和丢弃。然后,只要我们有一种明确确定哪个写入更“新”的方法,并且每个写入最终都复制到每个副本,副本最终将收敛到相同的值。
As indicated by the quotes around “recent,” this idea is actually quite misleading. In the example of Figure 5-12 , neither client knew about the other one when it sent its write requests to the database nodes, so it’s not clear which one happened first. In fact, it doesn’t really make sense to say that either happened “first”: we say the writes are concurrent , so their order is undefined.
正如“最近”二字的引号所暗示的,这个说法其实很有误导性。在图5-12的例子中,两个客户端在向数据库节点发送写请求时都不知道对方的存在,因此不清楚哪一个先发生。事实上,说其中任何一个“先”发生都没有意义:我们说这些写入是并发的,因此它们的顺序是未定义的。
Even though the writes don’t have a natural ordering, we can force an arbitrary order on them. For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm, called last write wins (LWW), is the only supported conflict resolution method in Cassandra [ 53 ], and an optional feature in Riak [ 35 ].
尽管写入没有自然的顺序,我们可以强制在它们上面施加任意的顺序。例如,我们可以为每个写入连接一个时间戳,选择最大的时间戳作为最“近期”的,丢弃任何具有较早时间戳的写入。这个冲突解决算法叫做最后写入获胜(LWW),是Cassandra中唯一支持的冲突解决方法[53],也是Riak中的可选功能[35]。
LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded. Moreover, LWW may even drop writes that are not concurrent, as we shall discuss in “Timestamps for ordering events” .
LWW实现了最终收敛的目标,但代价是持久性:如果对同一个键有多个并发写入,即使它们都向客户端报告为成功(因为它们被写入了w个副本),也只有一个写入会保留下来,其他的会被静默丢弃。此外,LWW甚至可能丢弃并非并发的写入,我们将在“用于排序事件的时间戳”中讨论这一点。
There are some situations, such as caching, in which lost writes are perhaps acceptable. If losing data is not acceptable, LWW is a poor choice for conflict resolution.
在某些情况下,比如缓存,丢失写入也许是可以接受的。如果丢失数据不可接受,LWW就不是一个好的冲突解决选择。
The only safe way of using a database with LWW is to ensure that a key is only written once and thereafter treated as immutable, thus avoiding any concurrent updates to the same key. For example, a recommended way of using Cassandra is to use a UUID as the key, thus giving each write operation a unique key [ 53 ].
使用LWW和数据库的唯一安全方式是确保一个键只被写入一次,然后被视为不可变,以避免任何并发更新相同的键。例如,建议使用Cassandra的一种方法是使用UUID作为键,从而为每个写操作提供一个唯一的键[53]。
The “happens-before” relationship and concurrency
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look at some examples:
我们如何确定两个操作是否并发?为了发展直觉,让我们来看几个例子:
-
In Figure 5-9 , the two writes are not concurrent: A’s insert happens before B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s operation builds upon A’s operation, so B’s operation must have happened later. We also say that B is causally dependent on A.
在图5-9中,两个写操作不是并发的:A的插入操作发生在B的递增操作之前,因为B递增的值是A插入的值。换句话说,B的操作建立在A的操作上,因此B的操作必须是后发生的。我们还可以说B在因果上依赖于A。
-
On the other hand, the two writes in Figure 5-12 are concurrent: when each client starts the operation, it does not know that another client is also performing an operation on the same key. Thus, there is no causal dependency between the operations.
另一方面,图5-12中的两个写操作是并发的:当每个客户端开始操作时,它并不知道另一个客户端也正在对同一键执行操作。因此,这些操作之间没有因果依赖关系。
An operation A happens before another operation B if B knows about A, or depends on A, or builds upon A in some way. Whether one operation happens before another operation is the key to defining what concurrency means. In fact, we can simply say that two operations are concurrent if neither happens before the other (i.e., neither knows about the other) [ 54 ].
如果B知道A,或者依赖于A,或者以某种方式建立在A之上,那么操作A就发生在操作B之前。一个操作是否发生在另一个操作之前,是定义并发含义的关键。事实上,我们可以简单地说,如果两个操作中没有一个发生在另一个之前(即双方都不知道对方),那么这两个操作就是并发的[54]。
Thus, whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us whether two operations are concurrent or not. If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.
因此,每当您有两个操作A和B时,有三种可能性:要么A发生在B之前,要么B发生在A之前,要么A和B是并发的。我们需要的是一种算法来告诉我们两个操作是否是并发的。如果一个操作在另一个操作之前发生,后面的操作应覆盖先前的操作,但如果操作是并发的,我们就有一个需要解决的冲突。
Capturing the happens-before relationship
Let’s look at an algorithm that determines whether two operations are concurrent, or whether one happened before another. To keep things simple, let’s start with a database that has only one replica. Once we have worked out how to do this on a single replica, we can generalize the approach to a leaderless database with multiple replicas.
让我们来看一个算法,确定两个操作是并发的还是一个发生在另一个之前。为了保持简单,让我们从只有一个副本的数据库开始。一旦我们弄清楚了如何在单个副本上执行此操作,我们就可以将其推广到具有多个副本的无领导数据库。
Figure 5-13 shows two clients concurrently adding items to the same shopping cart. (If that example strikes you as too inane, imagine instead two air traffic controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is empty. Between them, the clients make five writes to the database:
图5-13展示了两个客户端同时向同一个购物车中添加物品的情境。(如果这个例子听起来太无聊了,那么你可以想象一下两个空管员同时添加他们正在追踪的飞机到区域中。)初始时,购物车是空的。两个客户端总共在数据库上进行了五次写操作。
-
Client 1 adds milk to the cart. This is the first write to that key, so the server successfully stores it and assigns it version 1. The server also echoes the value back to the client, along with the version number.
客户端1将牛奶(milk)添加到购物车中。这是对该键的第一次写入,因此服务器成功存储了它并为其分配版本1。服务器还将该值连同版本号一起回显给客户端。
-
Client 2 adds eggs to the cart, not knowing that client 1 concurrently added milk (client 2 thought that its eggs were the only item in the cart). The server assigns version 2 to this write, and stores eggs and milk as two separate values. It then returns both values to the client, along with the version number of 2.
客户端2将鸡蛋(eggs)添加到购物车中,并不知道客户端1同时添加了牛奶(客户端2以为它的鸡蛋是购物车中唯一的物品)。服务器为这次写入分配版本2,并将鸡蛋和牛奶存储为两个单独的值,然后将这两个值连同版本号2一起返回给客户端。
-
Client 1, oblivious to client 2’s write, wants to add flour to the cart, so it thinks the current cart contents should be [milk, flour]. It sends this value to the server, along with the version number 1 that the server gave client 1 previously. The server can tell from the version number that the write of [milk, flour] supersedes the prior value of [milk] but that it is concurrent with [eggs]. Thus, the server assigns version 3 to [milk, flour], overwrites the version 1 value [milk], but keeps the version 2 value [eggs] and returns both remaining values to the client.
客户端1不知道客户端2的写入,想向购物车添加面粉(flour),因此它认为当前购物车的内容应该是[milk, flour]。它把这个值连同服务器之前给它的版本号1一起发送给服务器。服务器从版本号可以判断出,[milk, flour]这次写入取代了之前的值[milk],但与[eggs]是并发的。因此,服务器为[milk, flour]分配版本3,覆盖版本1的值[milk],但保留版本2的值[eggs],并把这两个剩余的值都返回给客户端。
-
Meanwhile, client 2 wants to add ham to the cart, unaware that client 1 just added flour. Client 2 received the two values [milk] and [eggs] from the server in the last response, so the client now merges those values and adds ham to form a new value, [eggs, milk, ham]. It sends that value to the server, along with the previous version number 2. The server detects that version 2 overwrites [eggs] but is concurrent with [milk, flour], so the two remaining values are [milk, flour] with version 3, and [eggs, milk, ham] with version 4.
与此同时,客户端2想把火腿(ham)加入购物车,并不知道客户端1刚刚添加了面粉。客户端2在上一次响应中从服务器收到了[milk]和[eggs]这两个值,于是它把这些值合并并加上火腿,形成新值[eggs, milk, ham]。它把该值连同之前的版本号2一起发送给服务器。服务器检测到这次写入覆盖了版本2的[eggs],但与[milk, flour]并发,于是剩下的两个值是版本3的[milk, flour]和版本4的[eggs, milk, ham]。
-
Finally, client 1 wants to add bacon. It previously received [milk, flour] and [eggs] from the server at version 3, so it merges those, adds bacon, and sends the final value [milk, flour, eggs, bacon] to the server, along with the version number 3. This overwrites [milk, flour] (note that [eggs] was already overwritten in the last step) but is concurrent with [eggs, milk, ham], so the server keeps those two concurrent values.
最后,客户端1想要加入培根(bacon)。它之前在版本3时从服务器收到了[milk, flour]和[eggs],于是把它们合并并加上培根,将最终值[milk, flour, eggs, bacon]连同版本号3一起发送给服务器。这次写入覆盖了[milk, flour](注意[eggs]已经在上一步被覆盖),但与[eggs, milk, ham]并发,因此服务器保留这两个并发的值。
The dataflow between the operations in Figure 5-13 is illustrated graphically in Figure 5-14 . The arrows indicate which operation happened before which other operation, in the sense that the later operation knew about or depended on the earlier one. In this example, the clients are never fully up to date with the data on the server, since there is always another operation going on concurrently. But old versions of the value do get overwritten eventually, and no writes are lost.
图5-13中各操作之间的数据流在图5-14中以图形方式展示。箭头表示哪个操作发生在哪个操作之前,也就是说,后面的操作知道或依赖于前面的操作。在这个例子中,客户端从来没有与服务器上的数据完全保持同步,因为总有另一个操作在并发进行。但是值的旧版本最终会被覆盖,而且没有写入会丢失。
Note that the server can determine whether two operations are concurrent by looking at the version numbers—it does not need to interpret the value itself (so the value could be any data structure). The algorithm works as follows:
请注意,服务器可以通过查看版本号来确定两个操作是否并发,它不需要解释值本身(因此值可以是任何数据结构)。该算法的工作原理如下:
-
The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.
服务器为每个键维护一个版本号,在每次写入该键时递增版本号,并将新版本号与写入的值一起存储。
-
When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.
当客户端读取一个键时,服务器返回所有未被覆盖的值,以及最新的版本号。客户端必须在写入之前先读取一个键。
-
When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)
当客户端写入一个键时,它必须包含先前读取的版本号,并将先前读取的所有值合并在一起。(写请求的响应可以像读取一样返回所有当前值,这使我们可以像购物车示例中那样链接几个写操作。)
-
When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).
当服务器接收到特定版本号的写操作时,它可以覆盖所有该版本号及以下的值(因为它知道它们已被合并到新值中),但必须保留所有高版本号的值(因为这些值与传入的写操作并发)。
When a write includes the version number from a prior read, that tells us which previous state the write is based on. If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything—it will just be returned as one of the values on subsequent reads.
当写操作包括先前读取的版本号时,这告诉我们写操作是基于哪个先前的状态。如果您进行写操作而不包括版本号,则它与所有其他写操作是并发的,因此它不会覆盖任何内容,它只会在后续读取中作为一个值返回。
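A minimal single-replica sketch of this algorithm, roughly following the rules above; persistence, concurrency control, and error handling are omitted, and passing version 0 stands for “the client has read nothing yet.”
下面是该算法在单副本上的最小草图,大致遵循上述规则;省略了持久化、并发控制和错误处理,传入版本0表示“客户端尚未读取任何内容”。

```python
class VersionedKey:
    def __init__(self):
        self.version = 0
        self.values = {}   # version number -> value written at that version

    def read(self):
        # Return every value that has not been overwritten, plus the latest version.
        return self.version, list(self.values.values())

    def write(self, value, based_on_version):
        # The client merged everything it read at `based_on_version` into `value`,
        # so all values at that version or below can be overwritten; values with
        # a higher version are concurrent with this write and must be kept.
        self.version += 1
        self.values = {v: val for v, val in self.values.items() if v > based_on_version}
        self.values[self.version] = value
        return self.read()

cart = VersionedKey()
cart.write(["milk"], based_on_version=0)                  # client 1, gets version 1
cart.write(["eggs"], based_on_version=0)                  # client 2, concurrent, version 2
print(cart.write(["milk", "flour"], based_on_version=1))  # (3, [['eggs'], ['milk', 'flour']])
```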
Merging concurrently written values
This algorithm ensures that no data is silently dropped, but it unfortunately requires that the clients do some extra work: if several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values. Riak calls these concurrent values siblings .
这个算法确保不会有数据被静默丢弃,但不幸的是,它要求客户端做一些额外的工作:如果多个操作并发发生,客户端必须在事后通过合并并发写入的值来进行清理。Riak把这些并发的值称为兄弟(siblings)。
Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication, which we discussed previously (see “Handling Write Conflicts” ). A simple approach is to just pick one of the values based on a version number or timestamp (last write wins), but that implies losing data. So, you may need to do something more intelligent in application code.
合并兄弟节点的值本质上与多个领导者复制中的冲突解决问题相同,我们之前讨论过(请参见“处理写入冲突”)。一个简单的方法是基于版本号或时间戳选择一个值(最后一个写入者获胜),但这意味着会丢失数据。因此,在应用程序代码中,您可能需要做一些更智能的处理。
With the example of a shopping cart, a reasonable approach to merging siblings is to just take the union. In Figure 5-14, the two final siblings are [milk, flour, eggs, bacon] and [eggs, milk, ham]; note that milk and eggs appear in both, even though they were each only written once. The merged value might be something like [milk, flour, eggs, bacon, ham], without duplicates.
在购物车的示例中,合并兄弟节点的合理方法是取并集。在图5-14中,最终的两个兄弟节点是[milk,flour,eggs,bacon]和[eggs,milk,ham];请注意,牛奶和鸡蛋出现在两者中,即使它们仅被写入一次。合并后的值可能类似于[milk,flour,eggs,bacon,ham],不包括重复项。
However, if you want to allow people to also remove things from their carts, and not just add things, then taking the union of siblings may not yield the right result: if you merge two sibling carts and an item has been removed in only one of them, then the removed item will reappear in the union of the siblings [ 37 ]. To prevent this problem, an item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone . (We previously saw tombstones in the context of log compaction in “Hash Indexes” .)
然而,如果你希望允许用户不仅能往购物车里添加物品,还能删除物品,那么对兄弟取并集可能不会得到正确的结果:如果合并两个兄弟购物车,而某件物品只在其中一个里被删除,那么被删除的物品会重新出现在兄弟的并集中[37]。为了防止这个问题,物品被删除时不能简单地从数据库中删掉;系统必须留下一个带有适当版本号的标记,以便在合并兄弟时表明该物品已被删除。这样的删除标记被称为墓碑(tombstone)。(我们之前在“哈希索引”的日志压缩中已经见过墓碑。)
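A small sketch of merging siblings by union while respecting removals: each sibling carries the items it contains plus tombstones for items it has removed (the removal of flour below is invented for illustration). A real implementation, such as a CRDT set, would also attach version information to tombstones so that a later re-addition can win over an older removal, which this simple sketch cannot express.
下面用一小段代码示意在尊重删除操作的前提下按并集合并兄弟:每个兄弟携带它包含的物品以及它删除过的物品的墓碑(下面对面粉的删除是为了说明而虚构的)。真实实现(例如CRDT集合)还会给墓碑附加版本信息,使得之后的重新添加能够胜过更早的删除,而这个简单草图无法表达这一点。

```python
def merge_siblings(siblings):
    # Each sibling carries the items it currently contains plus tombstones for
    # items it has removed. Taking the plain union of items alone would make
    # removed items reappear; subtracting the tombstones prevents that.
    items = set().union(*(s["items"] for s in siblings))
    tombstones = set().union(*(s["tombstones"] for s in siblings))
    return items - tombstones

sibling1 = {"items": {"milk", "flour", "eggs", "bacon"}, "tombstones": set()}
sibling2 = {"items": {"eggs", "milk", "ham"}, "tombstones": {"flour"}}  # flour removed here
print(sorted(merge_siblings([sibling1, sibling2])))  # ['bacon', 'eggs', 'ham', 'milk']
```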
As merging siblings in application code is complex and error-prone, there are some efforts to design data structures that can perform this merging automatically, as discussed in “Automatic Conflict Resolution” . For example, Riak’s datatype support uses a family of data structures called CRDTs [ 38 , 39 , 55 ] that can automatically merge siblings in sensible ways, including preserving deletions.
由于在应用程序代码中合并兄弟节点很复杂且容易出错,因此有一些努力设计数据结构,可以自动执行此合并,正如“自动冲突解决”所讨论的那样。例如,Riak 的数据类型支持使用称为 CRDTs 的一系列数据结构 [38,39,55],可以以明智的方式自动合并兄弟节点,包括保留删除操作。
Version vectors
The example in Figure 5-13 used only a single replica. How does the algorithm change when there are multiple replicas, but no leader?
图5-13中的示例只使用了单个副本。当存在多个副本但没有领导者时,算法会如何变化?
Figure 5-13 uses a single version number to capture dependencies between operations, but that is not sufficient when there are multiple replicas accepting writes concurrently. Instead, we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings.
图5-13使用一个版本号来捕捉操作之间的依赖关系,但当有多个副本同时接受写入时,这是不够的。相反,我们需要为每个副本和每个键使用一个版本号。每个副本在处理写入时都会增加自己的版本号,并且还会跟踪它从其他副本中看到的版本号。这些信息指示了要覆盖哪些值以及哪些值作为兄弟保留。
The collection of version numbers from all the replicas is called a version vector [ 56 ]. A few variants of this idea are in use, but the most interesting is probably the dotted version vector [ 57 ], which is used in Riak 2.0 [ 58 , 59 ]. We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
所有副本的版本号集合被称为版本向量[56]。这个想法有几种变体,但最有趣的可能是点分版本向量[57],它被用于Riak 2.0[58, 59]。我们不会深入细节,但它的工作方式与我们在购物车示例中看到的非常相似。
Like the version numbers in Figure 5-13 , version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls causal context .) The version vector allows the database to distinguish between overwrites and concurrent writes.
像图5-13中的版本号一样,当值被读取时,版本向量从数据库副本发送到客户端,并在随后写入值时需要发送回数据库。(Riak将版本向量编码为称为因果上下文的字符串。)版本向量允许数据库区分覆盖写和并发写。
Also, like in the single-replica example, the application may need to merge siblings. The version vector structure ensures that it is safe to read from one replica and subsequently write back to another replica. Doing so may result in siblings being created, but no data is lost as long as siblings are merged correctly.
同样地,就像在单副本示例中一样,应用程序可能需要合并兄弟节点。版本向量结构确保从一个副本中读取并随后写回另一个副本是安全的。这样做可能会导致创建兄弟节点,但只要正确合并兄弟节点,就不会丢失任何数据。
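A minimal sketch of comparing two version vectors to decide whether one write supersedes another or whether the two are concurrent siblings; vectors are represented here as plain dictionaries mapping replica identifiers to counters.
下面是比较两个版本向量的最小草图,用来判断一个写入是否取代另一个写入,或者两者是并发的兄弟;这里用把副本标识符映射到计数器的普通字典来表示版本向量。

```python
def dominates(a: dict, b: dict) -> bool:
    # a dominates b if a has seen at least everything that b has seen.
    return all(a.get(replica, 0) >= counter for replica, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a overwrites b"
    if dominates(b, a):
        return "b overwrites a"
    return "concurrent (keep both as siblings)"

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))  # a overwrites b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))  # concurrent (keep both as siblings)
```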
Version vectors and vector clocks
A version vector is sometimes also called a vector clock , even though they are not quite the same. The difference is subtle—please see the references for details [ 57 , 60 , 61 ]. In brief, when comparing the state of replicas, version vectors are the right data structure to use.
版本向量有时也被称为向量时钟,尽管它们并不完全相同。两者的区别很微妙——详情请参阅参考资料[57, 60, 61]。简而言之,在比较副本的状态时,版本向量才是正确的数据结构。
Summary
In this chapter we looked at the issue of replication. Replication can serve several purposes:
在本章中,我们讨论了复制的问题。复制可以有几个目的:
- High availability
-
Keeping the system running, even when one machine (or several machines, or an entire datacenter) goes down
保持系统运行,即使一个机器(或多个机器,或整个数据中心)挂掉。
- Disconnected operation
-
Allowing an application to continue working when there is a network interruption
当网络中断时,允许应用程序继续工作。
- Latency
-
Placing data geographically close to users, so that users can interact with it faster
将数据放置在用户附近,以便用户能够更快地与之交互。
- Scalability
-
Being able to handle a higher volume of reads than a single machine could handle, by performing reads on replicas
通过在副本上执行读取操作,能够处理比单台机器更高量的读取。
Despite being a simple goal—keeping a copy of the same data on several machines—replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we need to deal with unavailable nodes and network interruptions (and that’s not even considering the more insidious kinds of fault, such as silent data corruption due to software bugs).
尽管目标很简单——在几台机器上保留同一份数据的副本——复制却被证明是一个非常棘手的问题。它需要仔细考虑并发以及所有可能出错的事情,并处理这些故障的后果。至少,我们需要处理不可用的节点和网络中断(这还没有考虑更隐蔽的故障,例如由于软件缺陷导致的无声数据损坏)。
We discussed three main approaches to replication:
我们讨论了三种主要的复制方法:
- Single-leader replication
-
Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale.
客户端将所有的写操作发送到单个节点(领导者),该节点发送一系列的数据变更事件到其它副本(跟随者)。读取操作可以在任何副本中进行,但从跟随者读取可能过期。
- Multi-leader replication
-
Clients send each write to one of several leader nodes, any of which can accept writes. The leaders send streams of data change events to each other and to any follower nodes.
客户端将每个写入发送到几个领导者节点之一,其中任何一个都可以接受写入。这些领导者将数据变更事件流发送给彼此以及任何跟随者节点。
- Leaderless replication
-
Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data.
客户端将每个写操作发送到多个节点,并从多个节点并行读取,以便检测和纠正具有过时数据的节点。
Each approach has advantages and disadvantages. Single-leader replication is popular because it is fairly easy to understand and there is no conflict resolution to worry about. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the cost of being harder to reason about and providing only very weak consistency guarantees.
每种方法都有优缺点。单主复制很流行,因为它相当容易理解,而且不需要担心冲突解决。在存在故障节点、网络中断和延迟尖峰的情况下,多主复制和无主复制可以更加健壮,但代价是更难以推理,并且只能提供非常弱的一致性保证。
Replication can be synchronous or asynchronous, which has a profound effect on the system behavior when there is a fault. Although asynchronous replication can be fast when the system is running smoothly, it’s important to figure out what happens when replication lag increases and servers fail. If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost.
复制可以是同步或异步的,这对系统的行为有深远影响,特别是在出现故障时。尽管当系统运行顺畅时,异步复制可以很快,但重要的是要弄清楚当复制延迟增加和服务器故障时会发生什么。如果领导者失败并且您提升一个异步更新的追随者成为新领导者,则可能会丢失最近提交的数据。
We looked at some strange effects that can be caused by replication lag, and we discussed a few consistency models which are helpful for deciding how an application should behave under replication lag:
我们研究了一些由复制延迟引起的奇怪效应,并讨论了几种一致性模型,它们有助于确定应用程序在复制延迟下应当如何表现:
- Read-after-write consistency
-
Users should always see data that they submitted themselves.
用户应始终看到他们自己提交的数据。
- Monotonic reads
-
After users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time.
用户在某个时间点看到了数据后,不应该再看到一些早期时间点的数据。
- Consistent prefix reads
-
Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order.
用户看到的数据应当符合因果关系:例如,按正确的顺序先看到问题、再看到对它的回答。
Finally, we discussed the concurrency issues that are inherent in multi-leader and leaderless replication approaches: because they allow multiple writes to happen concurrently, conflicts may occur. We examined an algorithm that a database might use to determine whether one operation happened before another, or whether they happened concurrently. We also touched on methods for resolving conflicts by merging together concurrent updates.
最后,我们讨论了多主复制和无主复制方法所固有的并发问题:由于它们允许多个写入并发地发生,因此可能产生冲突。我们研究了数据库可以用来判断一个操作是发生在另一个操作之前,还是与之并发的算法。我们还谈到了通过合并并发更新来解决冲突的方法。
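As a concrete, simplified illustration of merging concurrent updates, the sketch below resolves siblings of a set-valued record (for example, a shopping cart) by taking the union of the sibling values. This is not any particular database's API, and a plain union cannot express removals, which is why real implementations also carry tombstones (deletion markers) so that a removed item does not reappear after the merge.

```python
# Simplified sibling merge for a set-valued record (e.g., a shopping cart).
# Each sibling is the set of items written by one of the concurrent writers.

def merge_siblings(siblings: list[set]) -> set:
    """Resolve concurrent writes by taking the union of all sibling values.

    A plain union loses removals: an item deleted in one sibling but still
    present in another comes back after merging. Real systems add tombstones
    (deletion markers) so that removals merge correctly.
    """
    merged: set = set()
    for sibling in siblings:
        merged |= sibling
    return merged

# Two clients concurrently updated the same cart starting from the same state:
sibling_a = {"milk", "flour", "eggs"}
sibling_b = {"milk", "soap"}
print(merge_siblings([sibling_a, sibling_b]))
# -> {'milk', 'flour', 'eggs', 'soap'}
```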
In the next chapter we will continue looking at data that is distributed across multiple machines, through the counterpart of replication: splitting a large dataset into partitions .
在下一章中,我们将继续讨论分布在多台机器上的数据,并探讨复制的对应问题:将大型数据集拆分成多个分区。
Footnotes
i Different people have different definitions for hot , warm , and cold standby servers. In PostgreSQL, for example, hot standby is used to refer to a replica that accepts reads from clients, whereas a warm standby processes changes from the leader but doesn’t process any queries from clients. For purposes of this book, the difference isn’t important.
不同的人对热备、温备和冷备服务器有不同的定义。例如,在PostgreSQL中,热备指可以接受客户端读取请求的副本,而温备只处理来自领导者的变更,不处理任何来自客户端的查询。就本书而言,这种区别并不重要。
ii This approach is known as fencing or, more emphatically, Shoot The Other Node In The Head (STONITH). We will discuss fencing in more detail in “The leader and the lock” .
这种方法被称为围栏(fencing),或者更强硬地说,Shoot The Other Node In The Head(STONITH)。我们将在“领导者和锁”中更详细地讨论围栏。
iii The term eventual consistency was coined by Douglas Terry et al. [ 24 ], popularized by Werner Vogels [ 22 ], and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually consistent: followers in an asynchronously replicated relational database have the same characteristics.
“最终一致性”这个术语是由Douglas Terry等人首创[24],由Werner Vogels[22]推广,并成为许多NoSQL项目的战斗口号。但是,不仅NoSQL数据库是最终一致性的:在异步复制的关系型数据库中,追随者具有相同的特点。
iv If the database is partitioned (see Chapter 6 ), each partition has one leader. Different partitions may have their leaders on different nodes, but each partition must nevertheless have one leader node.
如果数据库被分区(参见第6章),每个分区都有一个领导者。不同分区的领导者可能位于不同的节点上,但每个分区仍然必须有一个领导者节点。
v Not to be confused with a star schema (see “Stars and Snowflakes: Schemas for Analytics” ), which describes the structure of a data model, not the communication topology between nodes.
不要与星型模式混淆(参见“星型和雪花型:分析模式”),星型模式描述的是数据模型的结构,而不是节点之间的通信拓扑结构。
vi Dynamo is not available to users outside of Amazon. Confusingly, AWS offers a hosted database product called DynamoDB , which uses a completely different architecture: it is based on single-leader replication.
Dynamo不对亚马逊以外的用户开放。令人困惑的是,AWS提供了一个名为DynamoDB的托管数据库产品,它采用了完全不同的架构:基于单主复制。
vii Sometimes this kind of quorum is called a strict quorum , to contrast with sloppy quorums (discussed in “Sloppy Quorums and Hinted Handoff” ).
有时,这种类型的法定人数被称为严格的法定人数(strict quorum),与松散的法定人数(sloppy quorum)形成对比(在“松散的法定人数与提示移交”中讨论)。
References
[ 1 ] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “ Notes on Distributed Databases ,” IBM Research, Research Report RJ2571(33471), July 1979.
[1] 布鲁斯·林赛,帕特里夏·格里菲斯·塞林格,C. 卡尔蒂埃里等人: “分布式数据库注记”,IBM 研究,研究报告 RJ2571(33471),1979年7月。
[ 2 ] “ Oracle Active Data Guard Real-Time Data Protection and Availability ,” Oracle White Paper, June 2013.
[2] “Oracle Active Data Guard实时数据保护和可用性”,Oracle白皮书,2013年6月。
[ 3 ] “ AlwaysOn Availability Groups ,” in SQL Server Books Online , Microsoft, 2012.
[3] “AlwaysOn可用性组”,SQL Server Books Online,Microsoft,2012年。
[ 4 ] Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “ On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[4] Lin Qiao、Kapil Surlaker、Shirshanka Das等人: “关于制作新鲜的浓缩咖啡:LinkedIn的分布式数据服务平台”,于2013年6月ACM国际数据管理会议(SIGMOD)发表。
[ 5 ] Jun Rao: “ Intra-Cluster Replication for Apache Kafka ,” at ApacheCon North America , February 2013.
[5] 饶俊: “Apache Kafka的集群内复制”,在2013年2月的ApacheCon北美大会上。
[ 6 ] “ Highly Available Queues ,” in RabbitMQ Server Documentation , Pivotal Software, Inc., 2014.
[6] “高度可用的队列”,RabbitMQ服务器文档,Pivotal Software, Inc., 2014。
[ 7 ] Yoshinori Matsunobu: “ Semi-Synchronous Replication at Facebook ,” yoshinorimatsunobu.blogspot.co.uk , April 1, 2014.
[7] Yoshinori Matsunobu: “Facebook的半同步复制”,yoshinorimatsunobu.blogspot.co.uk,2014年4月1日。
[ 8 ] Robbert van Renesse and Fred B. Schneider: “ Chain Replication for Supporting High Throughput and Availability ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
[8] Robbert van Renesse 和 Fred B. Schneider: “链式复制以支持高吞吐量和可用性”,于第六届USENIX操作系统设计与实现研讨会(OSDI)上,于2004年12月发布。
[ 9 ] Jeff Terrace and Michael J. Freedman: “ Object Storage on CRAQ: High-Throughput Chain Replication for Read-Mostly Workloads ,” at USENIX Annual Technical Conference (ATC), June 2009.
[9] Jeff Terrace和Michael J. Freedman:“基于CRAQ的对象存储:面向以读为主的工作负载的高吞吐量链式复制”,于2009年6月的USENIX年度技术会议(ATC)上发表。
[ 10 ] Brad Calder, Ju Wang, Aaron Ogus, et al.: “ Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency ,” at 23rd ACM Symposium on Operating Systems Principles (SOSP), October 2011.
[10] Brad Calder、Ju Wang、Aaron Ogus等人:“Windows Azure存储:具有强一致性的高可用云存储服务”,发表在第23届ACM操作系统原理研讨会(SOSP)上,2011年10月。
[ 11 ] Andrew Wang: “ Windows Azure Storage ,” umbrant.com , February 4, 2016.
[11] Andrew Wang:“Windows Azure Storage”,umbrant.com,2016年2月4日。
[ 12 ] “ Percona Xtrabackup - Documentation ,” Percona LLC, 2014.
[12] "Percona Xtrabackup - 文档," Percona LLC,2014年。
[ 13 ] Jesse Newland: “ GitHub Availability This Week ,” github.com , September 14, 2012.
[13] Jesse Newland:“GitHub本周可用性”,github.com,2012年9月14日。
[ 14 ] Mark Imbriaco: “ Downtime Last Saturday ,” github.com , December 26, 2012.
[14] Mark Imbriaco:“上周六的停机时间”,github.com,2012年12月26日。
[ 15 ] John Hugg: “ ‘All in’ with Determinism for Performance and Testing in Distributed Systems ,” at Strange Loop , September 2015.
[15] 约翰·哈格(John Hugg):“在分布式系统的性能与测试中实现确定性”,于 2015 年 9 月在 Strange Loop 上演讲。
[ 16 ] Amit Kapila: “ WAL Internals of PostgreSQL ,” at PostgreSQL Conference (PGCon), May 2012.
[16] Amit Kapila:2012年5月在PostgreSQL Conference(PGCon)上发表的“PostgreSQL WAL的内部机制”。
[ 17 ] MySQL Internals Manual . Oracle, 2014.
[17] MySQL 内部手册。Oracle,2014年。
[ 18 ] Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “ Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services ,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.
[18] Yogeshwer Sharma、Philippe Ajoux、Petchean Ang等人:“Wormhole:可靠的发布-订阅以支持地理复制的互联网服务”,发表在第12届USENIX网络系统设计与实现研讨会(NSDI)上,2015年5月。
[ 19 ] “ Oracle GoldenGate 12c: Real-Time Access to Real-Time Information ,” Oracle White Paper, October 2013.
[19] “Oracle GoldenGate 12c:实时获取实时信息,”Oracle白皮书,2013年10月。
[ 20 ] Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “ All Aboard the Databus! ,” at ACM Symposium on Cloud Computing (SoCC), October 2012.
"[20] Shirshanka Das, Chavdar Botev, Kapil Surlaker等人: “全员上车数据总线!”,于2012年10月在ACM云计算研讨会(SoCC)上发表。"
[ 21 ] Greg Sabino Mullane: “ Version 5 of Bucardo Database Replication System ,” blog.endpoint.com , June 23, 2014.
[21] Greg Sabino Mullane: “Bucardo数据库复制系统的版本5”,blog.endpoint.com,2014年6月23日。
[ 22 ] Werner Vogels: “ Eventually Consistent ,” ACM Queue , volume 6, number 6, pages 14–19, October 2008. doi:10.1145/1466443.1466448
[22] Werner Vogels:“最终一致性”,ACM Queue,第6卷,第6期,第14-19页,2008年10月。doi:10.1145/1466443.1466448
[ 23 ] Douglas B. Terry: “ Replicated Data Consistency Explained Through Baseball ,” Microsoft Research, Technical Report MSR-TR-2011-137, October 2011.
[23] Douglas B. Terry:“通过棒球解释复制数据一致性”,微软研究,技术报告MSR-TR-2011-137,2011年10月。
[ 24 ] Douglas B. Terry, Alan J. Demers, Karin Petersen, et al.: “ Session Guarantees for Weakly Consistent Replicated Data ,” at 3rd International Conference on Parallel and Distributed Information Systems (PDIS), September 1994. doi:10.1109/PDIS.1994.331722
[24] Douglas B. Terry, Alan J. Demers, Karin Petersen等人: “弱一致性复制数据的会话保证”,发表于1994年9月第3届国际并行与分布式信息系统会议(PDIS)。 doi:10.1109/PDIS.1994.331722。
[ 25 ] Terry Pratchett: Reaper Man: A Discworld Novel . Victor Gollancz, 1991. ISBN: 978-0-575-04979-6
[25] 特里·普拉切特:死神来了:磁盘世界小说。维克多·戈兰茨,1991年。ISBN:978-0-575-04979-6。
[ 26 ] “ Tungsten Replicator ,” Continuent, Inc., 2014.
[26] “钨复制器”,Continuent公司,2014年。
[ 27 ] “ BDR 0.10.0 Documentation ,” The PostgreSQL Global Development Group, bdr-project.org , 2015.
[27] “BDR 0.10.0文档”,PostgreSQL全球开发组,bdr-project.org,2015年。
[ 28 ] Robert Hodges: “ If You *Must* Deploy Multi-Master Replication, Read This First ,” scale-out-blog.blogspot.co.uk , March 30, 2012.
[28] 罗伯特·霍奇斯: “如果您必须部署多主复制,请先阅读本文”,scale-out-blog.blogspot.co.uk,2012年3月30日。
[ 29 ] J. Chris Anderson, Jan Lehnardt, and Noah Slater: CouchDB: The Definitive Guide . O’Reilly Media, 2010. ISBN: 978-0-596-15589-6
[29] J. Chris Anderson,Jan Lehnardt和Noah Slater:CouchDB:权威指南。O'Reilly Media,2010年。 ISBN:978-0-596-15589-6。
[ 30 ] AppJet, Inc.: “ Etherpad and EasySync Technical Manual ,” github.com , March 26, 2011.
[30] AppJet, Inc.:“Etherpad和EasySync技术手册”,github.com,2011年3月26日。
[ 31 ] John Day-Richter: “ What’s Different About the New Google Docs: Making Collaboration Fast ,” googledrive.blogspot.com , 23 September 2010.
[31] 约翰·戴-里希特: “新版Google文档有何不同:加快协作速度”,googledrive.blogspot.com,2010年9月23日。
[ 32 ] Martin Kleppmann and Alastair R. Beresford: “ A Conflict-Free Replicated JSON Datatype ,” arXiv:1608.03960, August 13, 2016.
[32] Martin Kleppmann和Alastair R. Beresford: “无冲突复制JSON数据类型,” arXiv:1608.03960,2016年8月13日。
[ 33 ] Frazer Clement: “ Eventual Consistency – Detecting Conflicts ,” messagepassing.blogspot.co.uk , October 20, 2011.
"最终一致性–检测冲突",Frazer Clement,messagepassing.blogspot.co.uk,2011年10月20日。"
[ 34 ] Robert Hodges: “ State of the Art for MySQL Multi-Master Replication ,” at Percona Live: MySQL Conference & Expo , April 2013.
[34] Robert Hodges: “MySQL 多主复制的现状”,于 Percona Live:MySQL 会议和展览会,2013 年 4 月。
[ 35 ] John Daily: “ Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems ,” basho.com , November 12, 2013.
[35] 约翰·戴利: “时钟是不好的,或者说, 欢迎来到分布式系统的美好世界,” basho.com,2013年11月12日。
[ 36 ] Riley Berton: “ Is Bi-Directional Replication (BDR) in Postgres Transactional? ,” sdf.org , January 4, 2016.
[36] Riley Berton:“Postgres的Bi-Directional Replication (BDR)是否支持事务处理?”,sdf.org,2016年1月4日。
[ 37 ] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “ Dynamo: Amazon’s Highly Available Key-Value Store ,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
[37] 朱塞佩·德坎迪亚、德尼兹·哈斯图伦、马丹·贾姆帕尼等人: “Dynamo: 亚马逊高度可用性的键值存储”,发表于2007年10月的第21届ACM操作系统原理研讨会(SOSP)。
[ 38 ] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: “ A Comprehensive Study of Convergent and Commutative Replicated Data Types ,” INRIA Research Report no. 7506, January 2011.
[38] Marc Shapiro、Nuno Preguiça、Carlos Baquero和Marek Zawirski:“《收敛和交换复制数据类型的全面研究》”,INRIA研究报告7506,2011年1月。
[ 39 ] Sam Elliott: “ CRDTs: An UPDATE (or Maybe Just a PUT) ,” at RICON West , October 2013.
[39] 萨姆·艾略特:“CRDTs:更新(或者也许只是PUT)”,于2013年10月在RICON West上发表。
[ 40 ] Russell Brown: “ A Bluffers Guide to CRDTs in Riak ,” gist.github.com , October 28, 2013.
[40] Russell Brown:“Riak中CRDT的简明指南”,gist.github.com,2013年10月28日。
[ 41 ] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy: “ Mergeable Persistent Data Structures ,” at 26es Journées Francophones des Langages Applicatifs (JFLA), January 2015.
[41] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy:“可合并的持久化数据结构”,发表于2015年1月的第26届法语应用语言研讨会(JFLA)。
[ 42 ] Chengzheng Sun and Clarence Ellis: “ Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements ,” at ACM Conference on Computer Supported Cooperative Work (CSCW), November 1998.
[42] 孙成政和克拉伦斯·艾利斯:“实时群组编辑器中的操作转换:问题、算法和成就”,1998年11月,ACM计算机支持的合作工作会议(CSCW)。
[ 43 ] Lars Hofhansl: “ HBASE-7709: Infinite Loop Possible in Master/Master Replication ,” issues.apache.org , January 29, 2013.
[43] Lars Hofhansl:“HBASE-7709:主/主复制中可能存在无限循环”,issues.apache.org,2013年1月29日。
[ 44 ] David K. Gifford: “ Weighted Voting for Replicated Data ,” at 7th ACM Symposium on Operating Systems Principles (SOSP), December 1979. doi:10.1145/800215.806583
[44] David K. Gifford: “复制数据的加权投票”, 于1979年12月第7届ACM操作系统原理研讨会(SOSP)上发表。 doi:10.1145/800215.806583。
[ 45 ] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “ Flexible Paxos: Quorum Intersection Revisited ,” arXiv:1608.06696 , August 24, 2016.
[45] Heidi Howard,Dahlia Malkhi和Alexander Spiegelman:“灵活的Paxos: 重访仲裁交集”,arXiv:1608.06696,2016年8月24日。
[ 46 ] Joseph Blomstedt: “ Re: Absolute Consistency ,” email to riak-users mailing list, lists.basho.com , January 11, 2012.
[46] Joseph Blomstedt: “关于:绝对一致性”,发件人为riak-users邮件列表的电子邮件,lists.basho.com,2012年1月11日。
[ 47 ] Joseph Blomstedt: “ Bringing Consistency to Riak ,” at RICON West , October 2012.
[47] Joseph Blomstedt:2012年10月,在RICON West上,“将一致性带给Riak”。
[ 48 ] Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, et al.: “ Quantifying Eventual Consistency with PBS ,” Communications of the ACM , volume 57, number 8, pages 93–102, August 2014. doi:10.1145/2632792
[48] Peter Bailis、Shivaram Venkataraman、Michael J. Franklin等人:“使用PBS量化最终一致性”,Communications of the ACM,第57卷,第8期,第93-102页,2014年8月。doi:10.1145/2632792
[ 49 ] Jonathan Ellis: “ Modern Hinted Handoff ,” datastax.com , December 11, 2012.
[49] Jonathan Ellis:“现代提示式传递”,datastax.com,2012年12月11日。
[ 50 ] “ Project Voldemort Wiki ,” github.com , 2013.
[50] “Project Voldemort Wiki”,github.com,2013年。
[ 51 ] “ Apache Cassandra 2.0 Documentation ,” DataStax, Inc., 2014.
[51] “Apache Cassandra 2.0文档”,DataStax, Inc.,2014年。
[ 52 ] “ Riak Enterprise: Multi-Datacenter Replication .” Technical whitepaper, Basho Technologies, Inc., September 2014.
[52] “Riak企业版:多数据中心复制。” 技术白皮书,Basho Technologies, Inc.,2014年9月。
[ 53 ] Jonathan Ellis: “ Why Cassandra Doesn’t Need Vector Clocks ,” datastax.com , September 2, 2013.
[53] Jonathan Ellis: "为什么Cassandra不需要向量时钟",datastax.com,2013年9月2日。
[ 54 ] Leslie Lamport: “ Time, Clocks, and the Ordering of Events in a Distributed System ,” Communications of the ACM , volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563
[54] Leslie Lamport:“时间、时钟和分布式系统中事件的排序”,Communications of the ACM,第21卷,第7期,第558-565页,1978年7月。doi:10.1145/359545.359563
[ 55 ] Joel Jacobson: “ Riak 2.0: Data Types ,” blog.joeljacobson.com , March 23, 2014.
[55] Joel Jacobson:《Riak 2.0: 数据类型》,blog.joeljacobson.com,2014年3月23日。
[ 56 ] D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, et al.: “ Detection of Mutual Inconsistency in Distributed Systems ,” IEEE Transactions on Software Engineering , volume 9, number 3, pages 240–247, May 1983. doi:10.1109/TSE.1983.236733
[56] D. Stott Parker Jr.,Gerald J. Popek,Gerard Rudisin等人: “分布式系统中的相互不一致性检测”,IEEE软件工程交易, 第9卷,第3号,240-247页, 1983年5月。 doi:10.1109/TSE.1983.236733
[ 57 ] Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, et al.: “ Dotted Version Vectors: Logical Clocks for Optimistic Replication ,” arXiv:1011.5808, November 26, 2010.
[57] Nuno Preguiça、Carlos Baquero、Paulo Sérgio Almeida等人:“点状版本向量:乐观复制的逻辑时钟”,arXiv:1011.5808,2010年11月26日。
[ 58 ] Sean Cribbs: “ A Brief History of Time in Riak ,” at RICON , October 2014.
[58] Sean Cribbs:“Riak中的时间简史”,于2014年10月在RICON上发表。
[ 59 ] Russell Brown: “ Vector Clocks Revisited Part 2: Dotted Version Vectors ,” basho.com , November 10, 2015.
[59] 罗素·布朗:“向量时钟再探(第二部分):带点版本向量”,basho.com,2015年11月10日。
[ 60 ] Carlos Baquero: “ Version Vectors Are Not Vector Clocks ,” haslab.wordpress.com , July 8, 2011.
[60] Carlos Baquero:“版本向量不是向量时钟”,haslab.wordpress.com,2011年7月8日。
[ 61 ] Reinhard Schwarz and Friedemann Mattern: “ Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail ,” Distributed Computing , volume 7, number 3, pages 149–174, March 1994. doi:10.1007/BF02277859
[61] Reinhard Schwarz 和 Friedemann Mattern: “在分布式计算中检测因果关系:寻找圣杯”,《分布式计算》杂志,1994 年 3 月,卷 7,号 3,第 149-174 页。doi:10.1007/BF02277859