第九章一致性和共识

Mar 11, 2013

Designing Data-Intensive Applications

Chapter 9. Consistency and Consensus

Is it better to be alive and wrong or right and dead?

活着做错了还是死了做对了更好？

Jay Kreps, A Few Notes on Kafka and Jepsen (2013)

Jay Kreps, 关于Kafka和Jepsen的几点说明（2013年）

Lots of things can go wrong in distributed systems, as discussed in Chapter 8 . The simplest way of handling such faults is to simply let the entire service fail, and show the user an error message. If that solution is unacceptable, we need to find ways of tolerating faults—that is, of keeping the service functioning correctly, even if some internal component is faulty.

分布式系统中可能出现很多问题，如第8章所讨论的那样。处理此类故障的最简单方法是直接让整个服务失败，并向用户显示错误消息。如果该解决方案不可接受，则需要找到容错的方法——即使某些内部组件出现故障，也能正确地保持服务运行。

In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant distributed systems. We will assume that all the problems from Chapter 8 can occur: packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.

在本章中，我们将谈论一些算法和协议的示例，用于构建容错的分布式系统。我们将假设来自第8章的所有问题都可能发生：数据包可能在网络中丢失、重排序、重复或任意延迟；时钟充其量只是近似的；节点可以随时暂停（例如，由于垃圾收集）或崩溃。

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees. This is the same approach as we used with transactions in Chapter 7 : by using a transaction, the application can pretend that there are no crashes (atomicity), that nobody else is concurrently accessing the database (isolation), and that storage devices are perfectly reliable (durability). Even though crashes, race conditions, and disk failures do occur, the transaction abstraction hides those problems so that the application doesn’t need to worry about them.

构建容错系统的最佳方法是找到一些具有有用保证的通用抽象，仅实现一次，然后让应用程序依赖于这些保证。这与第7章中使用事务的方法相同：通过使用事务，应用程序可以假装没有崩溃（原子性），没有其他人同时访问数据库（隔离性）以及存储设备完全可靠（持久性）。尽管发生崩溃，竞争条件和磁盘故障，但事务抽象将这些问题隐藏起来，使应用程序不必担心它们。

We will now continue along the same lines, and seek abstractions that can allow an application to ignore some of the problems with distributed systems. For example, one of the most important abstractions for distributed systems is consensus : that is, getting all of the nodes to agree on something. As we shall see in this chapter, reliably reaching consensus in spite of network faults and process failures is a surprisingly tricky problem.

现在我们将沿用相同的思路，寻求可以让应用程序忽略分布式系统中某些问题的抽象。例如，分布式系统中最重要的抽象之一是一致性：即让所有节点都达成一致。正如我们将在本章中看到的那样，即使在网络故障和进程失败的情况下可靠地达成共识是一个非常棘手的问题。

Once you have an implementation of consensus, applications can use it for various purposes. For example, say you have a database with single-leader replication. If the leader dies and you need to fail over to another node, the remaining database nodes can use consensus to elect a new leader. As discussed in “Handling Node Outages” , it’s important that there is only one leader, and that all nodes agree who the leader is. If two nodes both believe that they are the leader, that situation is called split brain , and it often leads to data loss. Correct implementations of consensus help avoid such problems.

一旦你实现了共识，应用程序可以用它来实现各种目的。例如，假设你有一个单领导者复制的数据库。如果领导者死亡并且你需要转移到另一个节点，剩余的数据库节点可以使用共识来选举新领导者。如“处理节点故障”中所讨论的那样，重要的是只有一个领导者，并且所有节点都同意谁是领导者。如果两个节点都认为自己是领导者，那种情况被称为脑裂，并且通常会导致数据丢失。正确实现共识有助于避免这些问题。

Later in this chapter, in “Distributed Transactions and Consensus” , we will look into algorithms to solve consensus and related problems. But first we first need to explore the range of guarantees and abstractions that can be provided in a distributed system.

稍后在本章的“分布式事务与一致性”部分，我们将探讨解决一致性和相关问题的算法。但是首先，我们需要探索分布式系统中可以提供的保证和抽象范围。

We need to understand the scope of what can and cannot be done: in some situations, it’s possible for the system to tolerate faults and continue working; in other situations, that is not possible. The limits of what is and isn’t possible have been explored in depth, both in theoretical proofs and in practical implementations. We will get an overview of those fundamental limits in this chapter.

我们需要理解可行和不可行的范畴：在某些情况下，系统可以容忍故障并继续工作；在其他情况下，这是不可能的。这些可能与不可能的限制已经在理论证明和实际应用中深入探讨过。本章中将简要介绍这些基本限制。

Researchers in the field of distributed systems have been studying these topics for decades, so there is a lot of material—we’ll only be able to scratch the surface. In this book we don’t have space to go into details of the formal models and proofs, so we will stick with informal intuitions. The literature references offer plenty of additional depth if you’re interested.

分布式系统领域的研究人员已经研究这些主题数十年，因此有很多材料-我们只能刮到表面。在本书中，我们没有空间详细介绍正式模型和证明，因此我们将坚持非正式的直觉。如果您感兴趣，文献参考资料提供了大量额外的深度。

Consistency Guarantees

In “Problems with Replication Lag” we looked at some timing issues that occur in a replicated database. If you look at two database nodes at the same moment in time, you’re likely to see different data on the two nodes, because write requests arrive on different nodes at different times. These inconsistencies occur no matter what replication method the database uses (single-leader, multi-leader, or leaderless replication).

在“复制延迟问题”中，我们探讨了发生在复制数据库中的一些时间问题。如果您在同一时刻查看两个数据库节点，您可能会在这两个节点上看到不同的数据，因为写请求到达不同的节点的时间不同。这些不一致性发生无论数据库使用什么复制方法（单主复制，多主复制或无主复制）。

Most replicated databases provide at least eventual consistency , which means that if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value [ 1 ]. In other words, the inconsistency is temporary, and it eventually resolves itself (assuming that any faults in the network are also eventually repaired). A better name for eventual consistency may be convergence , as we expect all replicas to eventually converge to the same value [ 2 ].

大多数复制的数据库至少提供最终一致性，这意味着如果您停止写入数据库并等待一定时间，那么最终所有读取请求将返回相同的值。换句话说，不一致是暂时的，并且最终会自行解决（假设网络中的任何故障也最终得到修复）。最终一致性的更好名称可能是收敛，因为我们期望所有副本最终会收敛到相同的值。

However, this is a very weak guarantee—it doesn’t say anything about when the replicas will converge. Until the time of convergence, reads could return anything or nothing [ 1 ]. For example, if you write a value and then immediately read it again, there is no guarantee that you will see the value you just wrote, because the read may be routed to a different replica (see “Reading Your Own Writes” ).

然而，这是一个非常薄弱的保证-它并没有说明副本什么时候会汇聚。在汇聚的时候之前，读取可能会返回任何东西或什么都没有[1]。例如，如果您写入一个值然后立即再次读取它，就不保证您会看到刚刚写入的值，因为读取可能会路由到另一个副本（请参见“读取您自己的写入”）。

Eventual consistency is hard for application developers because it is so different from the behavior of variables in a normal single-threaded program. If you assign a value to a variable and then read it shortly afterward, you don’t expect to read back the old value, or for the read to fail. A database looks superficially like a variable that you can read and write, but in fact it has much more complicated semantics [ 3 ].

最终一致性对应用程序开发者来说很难，因为它与普通单线程程序中变量的行为完全不同。如果你给一个变量赋值，然后很快就读取它，你不会期望读回旧的值，或者读取失败。数据库表面上看起来像一个可以读写的变量，但实际上它具有更复杂的语义。

When working with a database that provides only weak guarantees, you need to be constantly aware of its limitations and not accidentally assume too much. Bugs are often subtle and hard to find by testing, because the application may work well most of the time. The edge cases of eventual consistency only become apparent when there is a fault in the system (e.g., a network interruption) or at high concurrency.

当与提供弱保证的数据库一起工作时，您需要时刻注意其限制，不要意外地假设太多。由于应用程序大多数时间表现良好，因此错误通常很微妙，难以通过测试找到。在系统出现故障（例如，网络中断）或高并发时，最终一致性的边界情况才会变得明显。

In this chapter we will explore stronger consistency models that data systems may choose to provide. They don’t come for free: systems with stronger guarantees may have worse performance or be less fault-tolerant than systems with weaker guarantees. Nevertheless, stronger guarantees can be appealing because they are easier to use correctly. Once you have seen a few different consistency models, you’ll be in a better position to decide which one best fits your needs.

在本章中，我们将探讨数据系统可能选择提供的更强一致性模型。它们并不是免费的：具有更强的保证的系统可能比具有更弱保证的系统性能更差或更不容错。尽管如此，更强的保证可以吸引人，因为它们更容易正确使用。一旦您看过几种不同的一致性模型，您将更好地决定哪种最适合您的需求。

There is some similarity between distributed consistency models and the hierarchy of transaction isolation levels we discussed previously [ 4 , 5 ] (see “Weak Isolation Levels” ). But while there is some overlap, they are mostly independent concerns: transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

分布式一致性模型和我们之前讨论过的交易隔离级别层次之间存在一些相似之处（参见“弱隔离级别”）。但是，虽然存在一定重叠，但它们主要是独立的问题：事务隔离主要是为了避免由并发执行事务引起的竞争条件，而分布式一致性主要是为了协调副本的状态，以应对延迟和故障的问题。

This chapter covers a broad range of topics, but as we shall see, these areas are in fact deeply linked:

这一章涵盖了广泛的话题，但是正如我们将看到的那样，这些领域实际上是深度相互关联的。

We will start by looking at one of the strongest consistency models in common use, linearizability , and examine its pros and cons.

我们将从最常用的一种最强一致性模型——线性一致性开始，探讨它的优缺点。
We’ll then examine the issue of ordering events in a distributed system ( “Ordering Guarantees” ), particularly around causality and total ordering.

我们将审视分布式系统中事件订购的问题（“订购保证”），特别关注因果关系和完全订购。
In the third section ( “Distributed Transactions and Consensus” ) we will explore how to atomically commit a distributed transaction, which will finally lead us toward solutions for the consensus problem.

在第三部分（“分布式事务和共识”）中，我们将探讨如何原子地提交分布式事务，最终将带领我们走向解决共识问题的解决方案。

Linearizability

In an eventually consistent database, if you ask two different replicas the same question at the same time, you may get two different answers. That’s confusing. Wouldn’t it be a lot simpler if the database could give the illusion that there is only one replica (i.e., only one copy of the data)? Then every client would have the same view of the data, and you wouldn’t have to worry about replication lag.

在一致性最终性数据库中，如果您在同一时间向两个不同的副本询问同一个问题，您可能会得到两个不同的答案，这很令人困惑。如果数据库能够给出只有一个副本（即只有一个数据副本）的假象，这不是更简单吗？那么每个客户机都将拥有相同的数据视图，您就不必担心复制延迟了。

This is the idea behind linearizability [ 6 ] (also known as atomic consistency [ 7 ], strong consistency , immediate consistency , or external consistency [ 8 ]). The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.

这就是线性一致性的理念（也称原子一致性，强一致性，即时一致性或外部一致性）。线性一致性的确切定义相当微妙，我们将在本节的其余部分中探讨它。但基本思想是使系统看起来好像只有一份数据副本，所有操作都是原子操作。通过这个保证，即使现实中可能有多个副本，应用程序也不必担心它们。

In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica. In other words, linearizability is a recency guarantee . To clarify this idea, let’s look at an example of a system that is not linearizable.

在一个可线性化的系统中，只要一个客户端成功地完成了写操作，所有从数据库中读取的客户端都必须能够看到刚刚写入的值。保持数据单一副本的幻觉意味着保证读取的值是最新的、最新的，而不是来自过时的高速缓存或副本。换句话说，可线性化是一个最近性保证。为了澄清这个想法，让我们看一个不可线性化的系统的例子。

Figure 9-1 shows an example of a nonlinearizable sports website [ 9 ]. Alice and Bob are sitting in the same room, both checking their phones to see the outcome of the 2014 FIFA World Cup final. Just after the final score is announced, Alice refreshes the page, sees the winner announced, and excitedly tells Bob about it. Bob incredulously hits reload on his own phone, but his request goes to a database replica that is lagging, and so his phone shows that the game is still ongoing.

图9-1展示了一个无法线性化的体育网站示例[9]。爱丽丝和鲍勃坐在同一个房间里，都在检查手机来查看2014年世界杯足球比赛的结果。在最终比分宣布之后，爱丽丝刷新页面，看到了获胜者的宣布，并激动地告诉鲍勃。鲍勃不可思议地在自己的手机上点击刷新，但他的请求发到了一个滞后的数据库副本，因此他的手机显示比赛仍在进行中。

If Alice and Bob had hit reload at the same time, it would have been less surprising if they had gotten two different query results, because they wouldn’t know at exactly what time their respective requests were processed by the server. However, Bob knows that he hit the reload button (initiated his query) after he heard Alice exclaim the final score, and therefore he expects his query result to be at least as recent as Alice’s. The fact that his query returned a stale result is a violation of linearizability.

如果Alice和Bob同时点击了刷新按钮，如果他们获得了两个不同的查询结果，那也不会太令人惊讶，因为他们无法确定各自的请求何时被服务器处理。然而，Bob知道他在听到Alice宣布最终比分之后才点击了刷新按钮（发起了他的查询），因此他期望他的查询结果至少与Alice的一样新。他的查询返回了过时的结果，这是线性可串行化的违反。

What Makes a System Linearizable?

The basic idea behind linearizability is simple: to make a system appear as if there is only a single copy of the data. However, nailing down precisely what that means actually requires some care. In order to understand linearizability better, let’s look at some more examples.

线性化的基本思想很简单：让系统看起来只有一个数据副本。不过，确切地理解这意味着仍然需要一些谨慎。为了更好地理解线性化，让我们看一些更多的例子。

Figure 9-2 shows three clients concurrently reading and writing the same key x in a linearizable database. In the distributed systems literature, x is called a register —in practice, it could be one key in a key-value store, one row in a relational database, or one document in a document database, for example.

图9-2展示了三个客户端同时在一个可线性化的数据库中读取和写入同一个键x。在分布式系统文献中，x被称为寄存器，在实践中，它可能是键值存储中的一个键，关系数据库中的一行或文档数据库中的一个文档，等等。

For simplicity, Figure 9-2 shows only the requests from the clients’ point of view, not the internals of the database. Each bar is a request made by a client, where the start of a bar is the time when the request was sent, and the end of a bar is when the response was received by the client. Due to variable network delays, a client doesn’t know exactly when the database processed its request—it only knows that it must have happened sometime between the client sending the request and receiving the response. ⁱ

简单起见，图9-2仅显示来自客户端的请求，而不显示数据库的内部情况。每个条形图表示客户端发出的一个请求，其中条形图的起始点是请求发送的时间，终点是客户端接收响应的时间。由于网络延迟的变化，客户机不知道数据库何时处理了请求，它只知道这必须发生在客户端发送请求和接收响应之间的某个时间。

In this example, the register has two types of operations:

在这个例子中，该寄存器有两种类型的操作：

read ( x ) ⇒ v means the client requested to read the value of register x , and the database returned the value v .

读取（x）⇒ v 表示客户端请求读取寄存器 x 的值，数据库返回值 v。
write ( x , v ) ⇒ r means the client requested to set the register x to value v , and the database returned response r (which could be ok or error ).

"write（x，v）⇒ r" 意为客户端请求将寄存器 x 设置为值 v，数据库返回响应 r（可能是“好”或“错误”）。

In Figure 9-2 , the value of x is initially 0, and client C performs a write request to set it to 1. While this is happening, clients A and B are repeatedly polling the database to read the latest value. What are the possible responses that A and B might get for their read requests?

在图9-2中，x的值最初为0，客户端C执行写请求将其设置为1。在此期间，客户端A和B正在反复轮询数据库以读取最新值。A和B可能会得到什么样的读取响应？

The first read operation by client A completes before the write begins, so it must definitely return the old value 0.

客户端A的第一次读操作在写入开始前完成，因此它一定会返回旧值0。
The last read by client A begins after the write has completed, so it must definitely return the new value 1 if the database is linearizable: we know that the write must have been processed sometime between the start and end of the write operation, and the read must have been processed sometime between the start and end of the read operation. If the read started after the write ended, then the read must have been processed after the write, and therefore it must see the new value that was written.

客户A最后读取的内容是在写入完成后进行的，因此，如果数据库是可线性化的，它一定会返回新值1：我们知道写操作必须在开始和结束之间处理，读操作必须在开始和结束之间处理。如果读取在写入结束后开始，则读取必须在写入后处理，因此它必须看到被写入的新值。
Any read operations that overlap in time with the write operation might return either 0 or 1, because we don’t know whether or not the write has taken effect at the time when the read operation is processed. These operations are concurrent with the write.

任何与写操作重叠的读取操作可能会返回0或1，因为我们不知道读操作在处理时写操作是否已经生效。这些操作与写操作并发进行。

However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent with a write can return either the old or the new value, then readers could see a value flip back and forth between the old and the new value several times while a write is going on. That is not what we expect of a system that emulates a “single copy of the data.” ⁱⁱ

然而，这还不足以完全描述线性化：如果与写操作并发的读取可以返回旧值或新值，则在写操作进行时读取程序可以多次看到值在旧值和新值之间切换，这不符合模拟“数据的单一副本”的系统的预期。

To make the system linearizable, we need to add another constraint, illustrated in Figure 9-3 .

为使该系统可线性化，我们需要添加另一个约束条件，如图9-3所示。

In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of x atomically flips from 0 to 1. Thus, if one client’s read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.

在线系统中，我们想象在写操作开始和结束之间必须有某个时间点，其中x的值会原子性地从0翻转为1。因此，如果一个客户端的读操作返回新值1，所有后续的读操作都必须返回新值，即使写操作还未完成。

This timing dependency is illustrated with an arrow in Figure 9-3 . Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read. Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is still ongoing. (It’s the same situation as with Alice and Bob in Figure 9-1 : after Alice has read the new value, Bob also expects to read the new value.)

在图9-3中，这种时序依赖性用箭头表示。客户端A首先读取了新值1。在A的读取返回后，B开始了新的读取。由于B的读取严格发生在A的读取之后，即使C的写入仍在进行中，它也必须返回1。（这与图9-1中的Alice和Bob的情况相同：在Alice读取新值后，Bob也希望读取新值。）

We can further refine this timing diagram to visualize each operation taking effect atomically at some point in time. A more complex example is shown in Figure 9-4 [ 10 ].

我们可以进一步细化这个时序图，以便在某个时间点上可视化每个操作的原子效果。更复杂的示例如图9-4所示。

In Figure 9-4 we add a third type of operation besides read and write :

在图9-4中，除了读和写之外，我们添加了第三种操作：

cas ( x , v _old , v _new ) ⇒ r means the client requested an atomic compare-and-set operation (see “Compare-and-set” ). If the current value of the register x equals v _old , it should be atomically set to v _new . If x ≠ v _old then the operation should leave the register unchanged and return an error. r is the database’s response ( ok or error ).

cas(x，vold，vnew)⇒r 表示客户端请求原子比较和交换操作（参见“比较和交换”）。如果寄存器x的当前值等于vold，则应将其原子地设置为vnew。如果x≠vold，则操作应使寄存器保持不变并返回错误。 r是数据库的响应（ok或error）。

Each operation in Figure 9-4 is marked with a vertical line (inside the bar for each operation) at the time when we think the operation was executed. Those markers are joined up in a sequential order, and the result must be a valid sequence of reads and writes for a register (every read must return the value set by the most recent write).

图9-4中每个操作都标有竖线（在每个操作的竖杠内），表示我们认为该操作执行的时间点。这些标记按顺序连接起来，结果必须是寄存器读写的有效序列（每次读取必须返回最新写入的值）。

The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward. This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.

线性可达性的要求是操作标记连线必须始终向前移动（从左到右），不得向后。此要求确保了我们早期讨论的最新性保证：一旦写入或读取新值，所有后续读取操作都将看到该值，直到再次被覆盖。

There are a few interesting details to point out in Figure 9-4 :

图9-4中有一些有趣的细节需要指出：

First client B sent a request to read x , then client D sent a request to set x to 0, and then client A sent a request to set x to 1. Nevertheless, the value returned to B’s read is 1 (the value written by A). This is okay: it means that the database first processed D’s write, then A’s write, and finally B’s read. Although this is not the order in which the requests were sent, it’s an acceptable order, because the three requests are concurrent. Perhaps B’s read request was slightly delayed in the network, so it only reached the database after the two writes.

首先客户端B发送了读取x的请求，然后客户端D发送了将x设置为0的请求，接着客户端A发送了将x设置为1的请求。然而，返回给B的值是1（由A写入的值）。这没问题：这意味着数据库首先处理了D的写入，然后是A的写入，最后是B的读取。尽管这不是请求发送的顺序，但这是一个可接受的顺序，因为这三个请求是并发的。也许B的读取请求在网络中略有延迟，因此它只在两个写入之后才到达数据库。
Client B’s read returned 1 before client A received its response from the database, saying that the write of the value 1 was successful. This is also okay: it doesn’t mean the value was read before it was written, it just means the ok response from the database to client A was slightly delayed in the network.

客户端-B的读取在客户端-A从数据库接收到响应之前返回了1，表示值1的写入成功。这也是可以的：并不意味着在写入之前就读取了该值，这只是意味着来自数据库对客户端A的“ok”响应在网络中稍有延迟。
This model doesn’t assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set ( cas ) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).

这种模型不假设事务隔离：另一个客户端可以随时更改一个值。例如，C首先读取1，然后读取2，因为该值在两次读取之间被B更改。原子比较和交换（cas）操作可用于检查该值是否已被另一个客户端并发更改：B和C的cas请求成功，但D的cas请求失败（当数据库处理它时，x的值不再是0）。
The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1 .

客户端B（在一个阴影条中）的最后读取结果不是线性化的。该操作与C的cas写操作同时进行，该操作将x从2更新为4。在没有其他请求的情况下，B的读取返回2是可以的。但是，在B的读取开始之前，客户端A已经读取了新值4，因此B不允许读取比A旧的值。这又是Figure 9-1中的Alice和Bob相同的情况。

That is the intuition behind linearizability; the formal definition [ 6 ] describes it more precisely. It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order [ 11 ].

这就是线性可操作性背后的直觉；正式定义[6]更加精确地描述了它。可以（尽管计算成本高昂）通过记录所有请求和响应的时间并检查它们是否可以被安排成有效的顺序[11]来测试系统的行为是否是线性可操作的。

Linearizability Versus Serializability

Linearizability is easily confused with serializability (see “Serializability” ), as both words seem to mean something like “can be arranged in a sequential order.” However, they are two quite different guarantees, and it is important to distinguish between them:

线性化很容易与串行化混淆（参见“串行化”），因为这两个词似乎都表示“可以按顺序安排”。然而，它们是两个非常不同的保证，因此区分它们很重要。

Serializability

Serializability is an isolation property of transactions , where every transaction may read and write multiple objects (rows, documents, records)—see “Single-Object and Multi-Object Operations” . It guarantees that transactions behave the same as if they had executed in some serial order (each transaction running to completion before the next transaction starts). It is okay for that serial order to be different from the order in which transactions were actually run [ 12 ].

序列化性是事务的孤立性属性，其中每个事务可能会读取和写入多个对象（行、文档、记录）——见“单对象和多对象操作”。它保证事务的行为与它们在某个连续顺序下执行时相同（每个事务在下一个事务开始之前运行到完成）。该顺序与实际运行事务的顺序不同也是可以的。

Linearizability

Linearizability is a recency guarantee on reads and writes of a register (an individual object ). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew (see “Write Skew and Phantoms” ), unless you take additional measures such as materializing conflicts (see “Materializing conflicts” ).

线性化可确保对寄存器（一个独立的对象）的读写保持最新性。它不会将操作分组为事务，因此无法防止诸如写偏斜（参见“写偏斜和幻象”）等问题，除非你采取其他措施，例如实现冲突（参见“实现冲突”）。

A database may provide both serializability and linearizability, and this combination is known as strict serializability or strong one-copy serializability ( strong-1SR ) [ 4 , 13 ]. Implementations of serializability based on two-phase locking (see “Two-Phase Locking (2PL)” ) or actual serial execution (see “Actual Serial Execution” ) are typically linearizable.

数据库可以同时提供序列化和线性化，这种组合被称为严格序列化或强一拷贝序列化（Strong-1SR）[4，13]。基于两阶段锁定（参见“两阶段锁定（2PL）”）或实际串行执行（参见“实际串行执行”）的序列化实现通常是可线性化的。

However, serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” ) is not linearizable: by design, it makes reads from a consistent snapshot, to avoid lock contention between readers and writers. The whole point of a consistent snapshot is that it does not include writes that are more recent than the snapshot, and thus reads from the snapshot are not linearizable.

然而，可序列化快照隔离（参见“可序列化快照隔离（SSI）”）不是线性化的：根据设计，它使读取从一致的快照，以避免读者和写者之间的锁争用。一致快照的整个重点在于它不包括比快照更近的写入，因此从快照读取不是线性化的。

Relying on Linearizability

In what circumstances is linearizability useful? Viewing the final score of a sporting match is perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any real harm in this situation. However, there a few areas in which linearizability is an important requirement for making a system work correctly.

在什么情况下，线性一致性是有用的？查看体育比赛的最终得分可能是一个轻浮的例子：在这种情况下，几秒钟前过时的结果不太可能造成任何实际伤害。但是，有一些领域需要线性一致性是使系统正常工作的重要要求。

Locking and leader election

A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader [ 14 ]. No matter how this lock is implemented, it must be linearizable: all nodes must agree which node owns the lock; otherwise it is useless.

使用单主副本的系统需要确保确实只有一个领导者而不是多个(分离脑). 选举领导者的一种方法是使用锁：每个启动的节点都尝试获得锁，成功的节点成为领导者[14]. 无论如何实现这个锁，它必须是线性化的：所有节点必须同意哪个节点拥有锁；否则它是无用的。

Coordination services like Apache ZooKeeper [ 15 ] and etcd [ 16 ] are often used to implement distributed locks and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter, in “Fault-Tolerant Consensus” ). ⁱⁱⁱ There are still many subtle details to implementing locks and leader election correctly (see for example the fencing issue in “The leader and the lock” ), and libraries like Apache Curator [ 17 ] help by providing higher-level recipes on top of ZooKeeper. However, a linearizable storage service is the basic foundation for these coordination tasks.

类似Apache ZooKeeper和etcd这样的协调服务通常用于实现分布式锁和领导者选举。它们使用共识算法以容错方式实现线性化操作（我们在本章后面“容错共识”中讨论这些算法）。实现正确的锁和领导者选举仍然存在许多微妙的细节（例如在“领导者和锁”中的栅栏问题），Apache Curator等库提供了在ZooKeeper之上提供更高级别的配方来帮助。然而，线性化存储服务是这些协调任务的基础。

Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC) [ 18 ]. RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system. Since these linearizable locks are on the critical path of transaction execution, RAC deployments usually have a dedicated cluster interconnect network for communication between database nodes.

分布式锁定还在某些分布式数据库中以更细粒度的级别使用，例如Oracle Real Application Clusters（RAC）[18]。 RAC每页磁盘使用一个锁，多个节点共享对同一磁盘存储系统的访问。由于这些可线性化锁定在事务执行的关键路径上，因此RAC部署通常具有专用的群集互连网络，用于数据库节点之间的通信。

Constraints and uniqueness guarantees

Uniqueness constraints are common in databases: for example, a username or email address must uniquely identify one user, and in a file storage service there cannot be two files with the same path and filename. If you want to enforce this constraint as the data is written (such that if two people try to concurrently create a user or a file with the same name, one of them will be returned an error), you need linearizability.

在数据库中，唯一性约束是很常见的：例如，用户名或电子邮件地址必须唯一地标识一个用户，在文件存储服务中不能有两个具有相同路径和文件名的文件。如果您想在写入数据时强制执行此约束（例如，如果两个人尝试同时创建具有相同名称的用户或文件，则其中一个将返回错误），则需要线性化。

This situation is actually similar to a lock: when a user registers for your service, you can think of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic compare-and-set, setting the username to the ID of the user who claimed it, provided that the username is not already taken.

这种情况实际上类似于一把锁：当用户注册您的服务时，您可以将其视为获得所选择的用户名的“锁定”。操作也非常类似于原子比较和设置，将用户名设置为声明它的用户ID，前提是该用户名尚未被占用。

Similar issues arise if you want to ensure that a bank account balance never goes negative, or that you don’t sell more items than you have in stock in the warehouse, or that two people don’t concurrently book the same seat on a flight or in a theater. These constraints all require there to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.

如果您想确保银行账户余额从不为负，或者仓库存货不会超售，或者两个人不会同时预订同一架飞机或剧院的座位，就会出现类似的问题。所有这些限制都需要有一个最新的值（账户余额、库存水平、座位占用率）得到所有节点的一致认可。

In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if a flight is overbooked, you can move customers to a different flight and offer them compensation for the inconvenience). In such cases, linearizability may not be needed, and we will discuss such loosely interpreted constraints in “Timeliness and Integrity” .

在实际应用中，有时候处理这些约束条件可以放宽一些（举个例子，如果一个航班过度预订，可以把乘客安排到另一个航班，并提供赔偿来安抚），在这种情况下，就不需要线性化，我们将在“及时性和完整性”中探讨这些宽松解释的约束条件。

However, a hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, can be implemented without requiring linearizability [ 19 ].

然而，像关系数据库中通常发现的硬独特性约束一样，需要线性化。其他类型的约束，如外键或属性约束，可以在不需要线性化的情况下实现[19]。

Cross-channel timing dependencies

Notice a detail in Figure 9-1 : if Alice hadn’t exclaimed the score, Bob wouldn’t have known that the result of his query was stale. He would have just refreshed the page again a few seconds later, and eventually seen the final score. The linearizability violation was only noticed because there was an additional communication channel in the system (Alice’s voice to Bob’s ears).

请注意图9-1中的一个细节：如果Alice没有大声喊出比分，Bob就不会知道他的查询结果已经过期了。他只会在几秒钟后再次刷新页面，并最终看到最终比分。只有因为系统中有一个额外的通信渠道（Alice的声音传到Bob的耳朵），才会注意到线性一致性的违规行为。

Similar situations can arise in computer systems. For example, say you have a website where users can upload a photo, and a background process resizes the photos to lower resolution for faster download (thumbnails). The architecture and dataflow of this system is illustrated in Figure 9-5 .

类似的情况也会在计算机系统中出现。例如，假设您有一个网站，用户可以上传照片，而一个后台进程会调整照片的分辨率以便更快地下载（缩略图）。此系统的体系结构和数据流程如图9-5所示。

The image resizer needs to be explicitly instructed to perform a resizing job, and this instruction is sent from the web server to the resizer via a message queue (see Chapter 11 ). The web server doesn’t place the entire photo on the queue, since most message brokers are designed for small messages, and a photo may be several megabytes in size. Instead, the photo is first written to a file storage service, and once the write is complete, the instruction to the resizer is placed on the queue.

图片调整器需要明确指示执行调整工作，此指示通过消息队列（见第11章）从Web服务器发送到调整器。Web服务器不会将整个照片放入队列中，因为大多数消息代理都设计用于小型消息，而一张照片可能有几兆字节。相反，照片首先被写入文件存储服务，一旦写入完成，就将调整器的指示放置在队列上。

If the file storage service is linearizable, then this system should work fine. If it is not linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in Figure 9-5 ) might be faster than the internal replication inside the storage service. In this case, when the resizer fetches the image (step 5), it might see an old version of the image, or nothing at all. If it processes an old version of the image, the full-size and resized images in the file storage become permanently inconsistent.

如果文件存储服务是可线性化的，那么系统应该可以正常工作。如果它不是可线性化的，就存在竞争条件的风险：消息队列（图9-5中的步骤3和4）可能比存储服务内部复制更快。在这种情况下，当调整大小程序获取图像（步骤5）时，它可能会看到旧版本的图像，或者根本什么都看不到。如果处理旧版本的图像，则文件存储中的全尺寸和调整大小后的图像将永久不一致。

This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to Figure 9-1 , where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Alice’s mouth and Bob’s ears.

这个问题的产生是因为Web服务器和调整器之间存在两个不同的通信渠道：文件存储和消息队列。没有线性化的新近性保证，两个渠道之间可能存在竞争条件。这种情况类似于图9-1，其中也存在两个通信渠道之间的竞赛条件：数据库复制和Alice的嘴和Bob的耳朵之间的现实音频通道。

Linearizability is not the only way of avoiding this race condition, but it’s the simplest to understand. If you control the additional communication channel (like in the case of the message queue, but not in the case of Alice and Bob), you can use alternative approaches similar to what we discussed in “Reading Your Own Writes” , at the cost of additional complexity.

线性一致性并不是避免这种竞争条件的唯一方法，但是它是最容易理解的方法。如果你控制了额外的通信渠道（就像在消息队列的情况下，但不是在Alice和Bob的情况下），你可以使用类似于我们在“读取您自己的写入”的讨论中所讨论的替代方法，但这需要额外的复杂性。

Implementing Linearizable Systems

Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we might implement a system that offers linearizable semantics.

现在我们已经看了几个证明线性化是有用的例子，让我们思考一下如何实现一个提供线性化语义的系统。

Since linearizability essentially means “behave as though there is only a single copy of the data, and all operations on it are atomic,” the simplest answer would be to really only use a single copy of the data. However, that approach would not be able to tolerate faults: if the node holding that one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.

由于“线性可化”实质上意味着“表现得好像只有一个数据副本，而所有对它的操作都是原子性的”，因此最简单的答案是确实只使用一个数据副本。但是，这种方法将无法容忍故障：如果持有该副本的节点出现故障，数据将丢失，或者至少在节点重新启动之前无法访问。

The most common approach to making a system fault-tolerant is to use replication. Let’s revisit the replication methods from Chapter 5 , and compare whether they can be made linearizable:

最常见的使系统具有容错能力的方法是使用复制。让我们回顾一下第5章的复制方法，并比较它们是否可以变成线性化：

Single-leader replication (potentially linearizable)

In a system with single-leader replication (see “Leaders and Followers” ), the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. ^iv However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs [ 10 ].

在单主复制系统中（参见“领导者和追随者”），领导者拥有用于写入的主要数据副本，而追随者在其他节点上维护数据的备份副本。如果您从领导者或同步更新的追随者中读取，则它们有可能是可线性化的。但是，并非每个单主数据库都实际上是可线性化的，这可能是由于设计缺陷（例如，它使用快照隔离）或由于并发性错误。

Using the leader for reads relies on the assumption that you know for sure who the leader is. As discussed in “The Truth Is Defined by the Majority” , it is quite possible for a node to think that it is the leader, when in fact it is not—and if the delusional leader continues to serve requests, it is likely to violate linearizability [ 20 ]. With asynchronous replication, failover may even lose committed writes (see “Handling Node Outages” ), which violates both durability and linearizability.

使用领袖节点进行读操作的前提是确定领袖节点的身份。如同在“真相由大多数决定”中所述，有可能节点认为自己是领袖，实际上却不是。如果虚妄的领袖节点继续处理请求，很可能会违反线性化条件[20]。在异步复制中，故障转移甚至可能会丢失已提交的写操作（见“处理节点故障”），这不仅违反了持久化条件，也违反了线性化条件。

Consensus algorithms (linearizable)

Some consensus algorithms, which we will discuss later in this chapter, bear a resemblance to single-leader replication. However, consensus protocols contain measures to prevent split brain and stale replicas. Thanks to these details, consensus algorithms can implement linearizable storage safely. This is how ZooKeeper [ 21 ] and etcd [ 22 ] work, for example.

一些共识算法与单领导者复制有相似之处，但共识协议包含了防止分裂大脑和陈旧副本的措施。由于这些细节，共识算法可以安全地实现可线性化存储。例如，ZooKeeper和etcd就是这样工作的。

Multi-leader replication (not linearizable)

Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution (see “Handling Write Conflicts” ). Such conflicts are an artifact of the lack of a single copy of the data.

多主复制系统通常不是可线性化的，因为它们同时在多个节点上处理写入，并异步将它们复制到其他节点。因此，它们可能会产生需要解决的冲突写入（参见“处理写入冲突”）。这些冲突是由于数据缺乏单一副本的缺陷所导致的。

Leaderless replication (probably not linearizable)

For systems with leaderless replication (Dynamo-style; see “Leaderless Replication” ), people sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes ( w + r > n ). Depending on the exact configuration of the quorums, and depending on how you define strong consistency, this is not quite true.

对于没有主节点复制的系统（Dynamo式; 请参阅“无主节点复制”），有些人声称可以通过要求仲裁读写（w+r>n）来获得“强一致性”。根据仲裁的确切配置和您如何定义强一致性，这并不完全正确。

“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra; see “Relying on Synchronized Clocks” ) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew. Sloppy quorums ( “Sloppy Quorums and Hinted Handoff” ) also ruin any chance of linearizability. Even with strict quorums, nonlinearizable behavior is possible, as demonstrated in the next section.

基于时间（例如在Cassandra中）的“最后一次写入成功”冲突解决方法是几乎肯定非线性的，因为由于时钟偏差，时钟时间戳不能保证与实际事件排序一致。松散的法定人数（“松散的法定人数和提示性转交”）也破坏了线性化的任何可能性。即使采用严格的法定人数，非线性化的行为也是可能的，这在下一节中有所展示。

Linearizability and quorums

Intuitively, it seems as though strict quorum reads and writes should be linearizable in a Dynamo-style model. However, when we have variable network delays, it is possible to have race conditions, as demonstrated in Figure 9-6 .

直觉上，严格的法定人数读写应该在Dynamo风格模型中是可线性化的。但是，当我们有可变的网络延迟时，可能会出现竞争条件，如图9-6所示。

In Figure 9-6 , the initial value of x is 0, and a writer client is updating x to 1 by sending the write to all three replicas ( n = 3, w = 3). Concurrently, client A reads from a quorum of two nodes ( r = 2) and sees the new value 1 on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two nodes, and gets back the old value 0 from both.

在图9-6中，x的初始值为0，并且写入客户端通过向所有三个副本（n=3，w=3）发送写操作来将x更新为1。同时，客户端A从两个节点的仲裁中读取（r=2），并在其中一个节点上看到新的值1。同时进行写入，客户端B从不同的两个节点的仲裁中进行读取，并从两个节点中获取旧值0。

The quorum condition is met ( w + r > n ), but this execution is nevertheless not linearizable: B’s request begins after A’s request completes, but B returns the old value while A returns the new value. (It’s once again the Alice and Bob situation from Figure 9-1 .)

达到了法定人数条件（w + r > n），但该执行仍然不是可线性化的：B的请求开始在A的请求完成之后，但B返回了旧值，而A返回了新值。（这又是图9-1中的爱丽丝和鲍勃情况。）

Interestingly, it is possible to make Dynamo-style quorums linearizable at the cost of reduced performance: a reader must perform read repair (see “Read repair and anti-entropy” ) synchronously, before returning results to the application [ 23 ], and a writer must read the latest state of a quorum of nodes before sending its writes [ 24 , 25 ]. However, Riak does not perform synchronous read repair due to the performance penalty [ 26 ]. Cassandra does wait for read repair to complete on quorum reads [ 27 ], but it loses linearizability if there are multiple concurrent writes to the same key, due to its use of last-write-wins conflict resolution.

有趣的是，Dynamo式环境下的法定人数可以在牺牲性能的前提下实现线性化: 读者必须在同步执行读修复之后才能返回结果给应用程序进行读修复(见“读修复和反熵”），而写者则必须在发送其写入之前读取节点法定人数的最新状态。然而，Riak由于性能惩罚不执行同步读修复。Cassandra会等待法定人数读完成读修复，但是由于使用了最后一次写决定冲突解决方法，如果有多个并发写入同一键，则会失去线性化。

Moreover, only linearizable read and write operations can be implemented in this way; a linearizable compare-and-set operation cannot, because it requires a consensus algorithm [ 28 ].

此外，仅有可线性化的读写操作可以以这种方式实现；线性化的比较和交换操作则不行，因为这需要共识算法 [28]。

In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability.

总的来说，最安全的做法是假设一个没有领导的系统，采用Dynamo风格的复制，不能提供线性可序性。

The Cost of Linearizability

As some replication methods can provide linearizability and others cannot, it is interesting to explore the pros and cons of linearizability in more depth.

由于一些复制方法可以提供线性化，而其他方法不能，因此更有趣的是深入探讨线性化的优缺点。

We already discussed some use cases for different replication methods in Chapter 5 ; for example, we saw that multi-leader replication is often a good choice for multi-datacenter replication (see “Multi-datacenter operation” ). An example of such a deployment is illustrated in Figure 9-7 .

在第5章中，我们已经讨论了使用不同复制方法的一些用例；例如，我们发现对于多数据中心复制，多领导者复制通常是一个好选择（请参见“多数据中心操作”）。这样的部署示例如图9-7所示。

Consider what happens if there is a network interruption between the two datacenters. Let’s assume that the network within each datacenter is working, and clients can reach the datacenters, but the datacenters cannot connect to each other.

考虑一下如果两个数据中心之间发生网络中断会发生什么。假设每个数据中心内部的网络都正常运行，客户端能够访问数据中心，但两个数据中心无法相互连接。

With a multi-leader database, each datacenter can continue operating normally: since writes from one datacenter are asynchronously replicated to the other, the writes are simply queued up and exchanged when network connectivity is restored.

使用多领导者数据库，每个数据中心都可以继续正常运行：由于来自一个数据中心的写入会异步地复制到其他数据中心，所以写入只是简单地排队并在网络连接恢复时交换。

On the other hand, if single-leader replication is used, then the leader must be in one of the datacenters. Any writes and any linearizable reads must be sent to the leader—thus, for any clients connected to a follower datacenter, those read and write requests must be sent synchronously over the network to the leader datacenter.

另一方面，如果使用单一领导者复制，则领导者必须在一个数据中心中。任何写操作和任何线性化读取操作必须发送到领导者 - 因此，对于连接到追随者数据中心的任何客户端，这些读取和写入请求必须同步通过网络发送到领导者数据中心。

If the network between datacenters is interrupted in a single-leader setup, clients connected to follower datacenters cannot contact the leader, so they cannot make any writes to the database, nor any linearizable reads. They can still make reads from the follower, but they might be stale (nonlinearizable). If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the datacenters that cannot contact the leader.

如果单领导者设置中的数据中心之间的网络中断，则连接到跟随者数据中心的客户端无法联系领导者，因此它们无法对数据库进行任何写入或任何线性读取。他们仍然可以从追随者那里读取，但可能会过期（非线性）。如果应用程序需要线性读取和写入，则网络中断会导致无法联系领导者的数据中心中的应用程序变得不可用。

If clients can connect directly to the leader datacenter, this is not a problem, since the application continues to work normally there. But clients that can only reach a follower datacenter will experience an outage until the network link is repaired.

如果客户可以直接连接到主数据中心，那么这不是问题，因为应用程序在那里仍然正常工作。但是，只能连接到从属数据中心的客户将经历停机，直到网络连接修复为止。

The CAP theorem

This issue is not just a consequence of single-leader and multi-leader replication: any linearizable database has this problem, no matter how it is implemented. The issue also isn’t specific to multi-datacenter deployments, but can occur on any unreliable network, even within one datacenter. The trade-off is as follows: ^v

这个问题不仅仅是单领导者和多领导者复制的结果：任何线性化数据库都会有这个问题，无论它是如何实现的。这个问题也不是针对多数据中心部署的，但是可能发生在任何不可靠的网络上，甚至在一个数据中心内部也是如此。权衡如下：

If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable ).

如果您的应用程序需要线性可操作性，而某些副本由于网络问题与其他副本断开连接，则某些副本在断开连接时无法处理请求：它们必须等待网络问题修复，或返回错误（无论哪种方式，它们都变得不可用）。
If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable.

如果您的应用程序不需要线性化，那么可以编写以使每个副本可以独立处理请求的方式，即使它与其他副本断开连接（例如，多领导者）。在这种情况下，应用程序可以在网络问题的情况下保持可用，但其行为不是线性的。

Thus, applications that don’t require linearizability can be more tolerant of network problems. This insight is popularly known as the CAP theorem [ 29 , 30 , 31 , 32 ], named by Eric Brewer in 2000, although the trade-off has been known to designers of distributed databases since the 1970s [ 33 , 34 , 35 , 36 ].

因此，不需要线性化的应用程序可以更容忍网络问题。这个洞见被称为CAP定理[29，30，31，32]，由Eric Brewer在2000年命名，尽管这种权衡已经为分布式数据库的设计者所知道自1970年代开始[33，34，35，36]。

CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases. At the time, many distributed databases focused on providing linearizable semantics on a cluster of machines with shared storage [ 18 ], and CAP encouraged database engineers to explore a wider design space of distributed shared-nothing systems, which were more suitable for implementing large-scale web services [ 37 ]. CAP deserves credit for this culture shift—witness the explosion of new database technologies since the mid-2000s (known as NoSQL).

CAP最初被提出时作为一个经验性的规则，并没有精确的定义，旨在开始一场关于数据库中权衡的讨论。当时，许多分布式数据库都专注于在共享存储的机群上提供可线性化的语义[18]，而CAP鼓励数据库工程师探索更广阔的分布式共享无关系统设计空间，这更适合实现大规模的Web服务[37]。 CAP应该得到这种文化转变的认可，这也证明了自2000年代中期以来新数据库技术的爆炸式增长（称为NoSQL）。

The Unhelpful CAP Theorem

CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3 . Unfortunately, putting it this way is misleading [ 32 ] because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not [ 38 ].

CAP有时被称为一致性(C),可用性(A),分区容错(P)，需要在三个中选择两个。然而，这种描述很容易引起误解，因为网络分区其实是一种故障，而不是一种你可以选择的事情：无论你喜不喜欢，它们都会发生。

At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between either linearizability or total availability. Thus, a better way of phrasing CAP would be either Consistent or Available when Partitioned [ 39 ]. A more reliable network needs to make this choice less often, but at some point the choice is inevitable.

有时当网络工作正常时，系统可以提供一致性（linearizability）和总可用性。当网络故障发生时，您必须在一致性或总可用性之间做出选择。因此，更好的表达 CAP 的方式是分区时要么一致，要么可用。一个更可靠的网络需要更少地做出这种选择，但在某些情况下选择是不可避免的。

In discussions of CAP there are several contradictory definitions of the term availability , and the formalization as a theorem [ 30 ] does not match its usual meaning [ 40 ]. Many so-called “highly available” (fault-tolerant) systems actually do not meet CAP’s idiosyncratic definition of availability. All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided.

在 CAP 的讨论中，有几个不同的定义与 “可用性” 一词相矛盾，而定理 [30] 的形式化并不符合其通常的含义 [40]。许多所谓的“高可用性” （容错性）系统实际上并不符合 CAP 特有的可用性定义。总之，CAP 周围存在很多的误解和混淆，它并不能帮助我们更好地了解系统，因此最好避免使用 CAP。

The CAP theorem as formally defined [ 30 ] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault ( network partitions , ^vi or nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [ 9 , 40 ].

CAP 定理在正式定义上非常狭窄[30]，只考虑了一种一致性模型（即线性一致性）和一种故障（网络分割、或节点之间无法连接）。它并未涉及网络延迟、宕机等其他权衡方面。因此，尽管 CAP 在历史上具有影响力，但它对于设计系统几乎没有实际价值[9，40]。

There are many more interesting impossibility results in distributed systems [ 41 ], and CAP has now been superseded by more precise results [ 2 , 42 ], so it is of mostly historical interest today.

在分布式系统中还有许多更有趣的不可能性结果[41]，而CAP理论现在已被更精确的结果[2,42]所取代，因此它今天主要是具有历史意义的。

Linearizability and network delays

Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice. For example, even RAM on a modern multi-core CPU is not linearizable [ 43 ]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a memory barrier or fence [ 44 ] is used).

尽管线性化是一个有用的保证，但实际上很少有系统是线性化的。例如，即使是现代多核CPU上的RAM也不是线性化的[43]：如果一个运行在一个CPU核心上的线程写入一个内存地址，另一个CPU核心上的线程短时间后读取相同的地址，则不能保证读取第一个线程写入的值（除非使用了内存屏障或栅栏[44]）。

The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory. Since accessing data in the cache is much faster than going to main memory [ 45 ], this feature is essential for good performance on modern CPUs. However, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost.

这种行为的原因是每个CPU核心都有自己的内存缓存和存储缓冲区。默认情况下，内存访问首先进入缓存，并且任何更改都会异步写入主内存。由于在缓存中访问数据比访问主内存要快得多，[45]，因此这个功能对于现代CPU的良好性能至关重要。然而，现在数据有了几个副本（一个在主内存中，可能还有几个在各种缓存中），这些副本会异步更新，因此失去了线性化。

Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is performance , not fault tolerance.

为什么要做出这种折衷？使用CAP定理来证明多核内存一致性模型是毫无意义的：在一台计算机内，我们通常假定通信可靠，而且如果一个CPU核心与其余部分断开连接，我们不指望它能继续正常操作。放弃线性化的原因是为了性能，而不是容错性。

The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance [ 46 ]. Linearizability is slow—and this is true all the time, not only during a network fault.

许多分布式数据库选择不提供线性化保证，主要是为了提高性能，而不是为了容错性[46]。线性化会降低速度，而且这种情况始终存在，不仅仅是在网络故障期间。

Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is no: Attiya and Welch [ 47 ] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see “Timeouts and Unbounded Delays” ), the response time of linearizable reads and writes is inevitably going to be high. A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems. In Chapter 12 we will discuss some approaches for avoiding linearizability without sacrificing correctness.

我们能否找到更高效的可线性化存储实现呢？看起来答案是否定的：Attiya和Welch在[47]中证明，如果你想要线性化，读写请求的响应时间至少与网络延迟的不确定性成比例。在延迟高度变化的网络中，像大多数计算机网络一样（见“超时和无界延迟”），线性化读写的响应时间不可避免地会很高。更快的线性化算法不存在，但弱一致性模型可以更快，因此这种折衷对于延迟敏感的系统非常重要。在第12章中，我们将讨论一些避免线性化而不牺牲正确性的方法。

Ordering Guarantees

We said previously that a linearizable register behaves as if there is only a single copy of the data, and that every operation appears to take effect atomically at one point in time. This definition implies that operations are executed in some well-defined order. We illustrated the ordering in Figure 9-4 by joining up the operations in the order in which they seem to have executed.

我们之前说过，线性寄存器的行为就好像只有一个数据的副本，每个操作都似乎原子地在某个时间点上生效。这个定义意味着操作按某种明确定义的顺序执行。我们通过按看起来执行的顺序连接操作的方式在图9-4中说明了这个排序。

Ordering has been a recurring theme in this book, which suggests that it might be an important fundamental idea. Let’s briefly recap some of the other contexts in which we have discussed ordering:

"订购已经是本书中经常出现的主题，这暗示着它可能是一个重要的基本思想。让我们简要回顾一些我们已经讨论过的其他订购上下文："

In Chapter 5 we saw that the main purpose of the leader in single-leader replication is to determine the order of writes in the replication log—that is, the order in which followers apply those writes. If there is no single leader, conflicts can occur due to concurrent operations (see “Handling Write Conflicts” ).

在第5章中，我们看到单领导副本中领导者的主要目的是确定复制日志中写入的顺序 - 即从属者应用这些写入的顺序。如果没有单个领导者，则由于并发操作可能会发生冲突（请参见“处理写冲突”）。
Serializability, which we discussed in Chapter 7 , is about ensuring that transactions behave as if they were executed in some sequential order . It can be achieved by literally executing transactions in that serial order, or by allowing concurrent execution while preventing serialization conflicts (by locking or aborting).

可串行化性是我们在第七章中讨论的一个概念，其目的是确保事务的行为就像它们是按某个顺序执行的一样。这可以通过以字面顺序执行事务来实现，也可以通过允许并发执行并同时防止序列化冲突（通过锁定或中止）来实现。
The use of timestamps and clocks in distributed systems that we discussed in Chapter 8 (see “Relying on Synchronized Clocks” ) is another attempt to introduce order into a disorderly world, for example to determine which one of two writes happened later.

我们在第8章讨论的分布式系统中使用时间戳和时钟（参见“依赖同步时钟”）是引入顺序到无序世界的另一种尝试，例如确定哪一个写操作发生在后面。

It turns out that there are deep connections between ordering, linearizability, and consensus. Although this notion is a bit more theoretical and abstract than the rest of this book, it is very helpful for clarifying our understanding of what systems can and cannot do. We will explore this topic in the next few sections.

原来，排序、线性化和共识之间存在深刻的关联。虽然这个想法比本书其他部分更理论和抽象，但它对于澄清系统能够做什么和不能做什么非常有帮助。在接下来的几节中，我们将探讨这个主题。结果表明，排序、线性化和共识之间存在深厚的联系。虽然此概念比本书其他部分略微更具理论性和抽象性，但它非常有助于澄清我们对系统能做什么和不能做什么的理解。我们将在接下来的几个部分中探讨这个主题。

Ordering and Causality

There are several reasons why ordering keeps coming up, and one of the reasons is that it helps preserve causality . We have already seen several examples over the course of this book where causality has been important:

有几个原因使得订单持续出现，其中一个原因是它有助于维护因果关系。在本书的过程中，我们已经看到了几个因果关系很重要的例子。

In “Consistent Prefix Reads” ( Figure 5-5 ) we saw an example where the observer of a conversation saw first the answer to a question, and then the question being answered. This is confusing because it violates our intuition of cause and effect: if a question is answered, then clearly the question had to be there first, because the person giving the answer must have seen the question (assuming they are not psychic and cannot see into the future). We say that there is a causal dependency between the question and the answer.

在“一致的前缀读取”（图5-5）中，我们看到一个例子，观察者先看到一个问题的答案，然后再看到问题被回答。这很令人困惑，因为它违反了我们对因果关系的直觉：如果一个问题被回答了，那么显然问题必须先存在，因为回答问题的人必须看到问题（假设他们不是通灵的，不能看到未来）。我们说，问题和答案之间存在因果依赖关系。
A similar pattern appeared in Figure 5-9 , where we looked at the replication between three leaders and noticed that some writes could “overtake” others due to network delays. From the perspective of one of the replicas it would look as though there was an update to a row that did not exist. Causality here means that a row must first be created before it can be updated.

Figure 5-9中出现了类似的模式，我们观察三个领导者之间的复制，发现由于网络延迟，一些写操作可能会“超越”其他操作。从一个副本的角度来看，这会让一行看起来像是被更新了，但实际上该行并不存在。在这里，因果关系意味着一行必须首先被创建才能被更新。
In “Detecting Concurrent Writes” we observed that if you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. This happened before relationship is another expression of causality: if A happened before B, that means B might have known about A, or built upon A, or depended on A. If A and B are concurrent, there is no causal link between them; in other words, we are sure that neither knew about the other.

在“检测并发写入”中，我们观察到如果你有两个操作A和B，有三种可能性：A发生在B之前，或B发生在A之前，或A和B是并发的。这种发生前关系是因果关系的另一种表达方式：如果A发生在B之前，那意味着B可能知道A，或者建立在A之上，或者依赖A。如果A和B是并发的，它们之间没有因果联系；换句话说，我们确定它们中没有一个知道另一个。
In the context of snapshot isolation for transactions ( “Snapshot Isolation and Repeatable Read” ), we said that a transaction reads from a consistent snapshot. But what does “consistent” mean in this context? It means consistent with causality : if the snapshot contains an answer, it must also contain the question being answered [ 48 ]. Observing the entire database at a single point in time makes it consistent with causality: the effects of all operations that happened causally before that point in time are visible, but no operations that happened causally afterward can be seen. Read skew (non-repeatable reads, as illustrated in Figure 7-6 ) means reading data in a state that violates causality.

在“快照隔离和可重复读”事务的快照隔离上下文中，我们说事务从一致的快照中读取。但在这个背景下，“一致”是什么意思呢？这意味着它符合因果关系：如果快照包含一个答案，它必须同时包含问题[48]。在单个时间点观察整个数据库可以保持因果一致性：所有在该时间点之前因果发生的操作的影响是可见的，但是之后因果发生的任何操作都将看不到。读取偏差（非重复读，如图7-6所示）意味着在违反因果关系的状态下读取数据。
Our examples of write skew between transactions (see “Write Skew and Phantoms” ) also demonstrated causal dependencies: in Figure 7-8 , Alice was allowed to go off call because the transaction thought that Bob was still on call, and vice versa. In this case, the action of going off call is causally dependent on the observation of who is currently on call. Serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” ) detects write skew by tracking the causal dependencies between transactions.

我们在事务之间举例说明写同步（请参见“写同步和幽灵”）也展示了因果依赖关系：在图7-8中，当事务认为Bob还在通话中时，允许Alice离开通话状态，反之亦然。在这种情况下，退出通话状态的动作在于观察当前在通话中的人。可序列化快照隔离（请参见“可序列化快照隔离（SSI）”）通过跟踪事务之间的因果依赖关系来检测写同步。
In the example of Alice and Bob watching football ( Figure 9-1 ), the fact that Bob got a stale result from the server after hearing Alice exclaim the result is a causality violation: Alice’s exclamation is causally dependent on the announcement of the score, so Bob should also be able to see the score after hearing Alice. The same pattern appeared again in “Cross-channel timing dependencies” in the guise of an image resizing service.

在阿丽斯和鲍勃观看足球的例子中（见图9-1），鲍勃在听到阿丽斯喊出分数后从服务器得到了陈旧的结果，这是一个因果性违规：阿丽斯的呼喊在因果上依赖于比分的宣布，因此鲍勃也应该在听到阿丽斯的声音后能看见比分。在“跨渠道时序依赖”中，同样的模式又以图像调整服务的形式出现了。

Causality imposes an ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer. And, like in real life, one thing leads to another: one node reads some data and then writes something as a result, another node reads the thing that was written and writes something else in turn, and so on. These chains of causally dependent operations define the causal order in the system—i.e., what happened before what.

因果关系对事件施加了排序：原因先于结果；在消息接收之前发送消息；问题先于答案。就像在现实生活中，一件事情引发另一件事情：一个节点读取一些数据，然后作为结果写入一些内容，另一个节点读取已写入的内容，接着又写入其他内容，等等。这些因果依赖操作的链条定义了系统中的因果顺序——也就是发生在先和发生在后的顺序。

If a system obeys the ordering imposed by causality, we say that it is causally consistent . For example, snapshot isolation provides causal consistency: when you read from the database, and you see some piece of data, then you must also be able to see any data that causally precedes it (assuming it has not been deleted in the meantime).

如果一个系统遵守因果关系所强加的顺序，我们称其为因果一致性。例如，快照隔离提供了因果一致性：当您从数据库读取数据时，您看到的某些数据，那么您也必须能够看到在它之前发生的任何数据（假设它在此期间未被删除）。

The causal order is not a total order

A total order allows any two elements to be compared, so if you have two elements, you can always say which one is greater and which one is smaller. For example, natural numbers are totally ordered: if I give you any two numbers, say 5 and 13, you can tell me that 13 is greater than 5.

全序允许比较任意两个元素，所以如果你有两个元素，总能判断哪个更大哪个更小。例如，自然数是全序的：如果我给你任意两个数字，比如5和13，你可以告诉我13比5大。

However, mathematical sets are not totally ordered: is { a , b } greater than { b , c }? Well, you can’t really compare them, because neither is a subset of the other. We say they are incomparable , and therefore mathematical sets are partially ordered : in some cases one set is greater than another (if one set contains all the elements of another), but in other cases they are incomparable.

数学集合并不完全有序：{a，b}是否比{b，c}大？由于它们互不包含，因此实际上无法进行比较。我们称它们是不可比较的，因此数学集合是部分有序的：在某些情况下，一个集合比另一个集合更大（如果一个集合包含另一个集合的所有元素），但在其他情况下，它们是不可比较的。

The difference between a total order and a partial order is reflected in different database consistency models:

全序和偏序之间的区别反映在不同的数据库一致性模型中：

Linearizability

In a linearizable system, we have a total order of operations: if the system behaves as if there is only a single copy of the data, and every operation is atomic, this means that for any two operations we can always say which one happened first. This total ordering is illustrated as a timeline in Figure 9-4 .

在线性化系统中，我们有操作的总序列：如果系统的行为就像只有一个数据副本一样，并且每个操作都是原子操作，这意味着对于任何两个操作，我们都可以确定哪个操作先发生。这个总序列被展示在图9-4中的时间轴上。

Causality

We said that two operations are concurrent if neither happened before the other (see “The “happens-before” relationship and concurrency” ). Put another way, two events are ordered if they are causally related (one happened before the other), but they are incomparable if they are concurrent. This means that causality defines a partial order , not a total order: some operations are ordered with respect to each other, but some are incomparable.

如果两个操作相互独立，那么我们称它们是并行的（参见“‘先于’关系和并发性”）。换句话说，如果两个事件是有因果关系的（一个事件在另一个事件之前发生），那么它们是有序的，但是如果它们是并发的，则它们是无法比较的。这意味着因果关系定义了部分顺序，而不是完全顺序：一些操作是有序的，而一些操作是无法比较的。

Therefore, according to this definition, there are no concurrent operations in a linearizable datastore: there must be a single timeline along which all operations are totally ordered. There might be several requests waiting to be handled, but the datastore ensures that every request is handled atomically at a single point in time, acting on a single copy of the data, along a single timeline, without any concurrency.

因此，根据这个定义，在一个可线性化的数据存储中不存在并发操作：必须存在一个时间轴，沿该时间轴所有操作都被完全排序。可能有几个请求正在等待处理，但数据存储确保每个请求在单一时间点以原子方式处理，作用于单一数据副本，沿单一时间轴，不包含任何并发。

Concurrency would mean that the timeline branches and merges again—and in this case, operations on different branches are incomparable (i.e., concurrent). We saw this phenomenon in Chapter 5 : for example, Figure 5-14 is not a straight-line total order, but rather a jumble of different operations going on concurrently. The arrows in the diagram indicate causal dependencies—the partial ordering of operations.

并发意味着时间线分支和再次合并 - 在这种情况下，不同分支上的操作是不可比较的（即并发）。我们在第5章中看到了这种现象：例如，图5-14不是一条直线总序，而是一堆不同的操作并发进行。图中的箭头表示因果依赖关系 - 操作的部分排序。

If you are familiar with distributed version control systems such as Git, their version histories are very much like the graph of causal dependencies. Often one commit happens after another, in a straight line, but sometimes you get branches (when several people concurrently work on a project), and merges are created when those concurrently created commits are combined.

如果您熟悉分布式版本控制系统，如Git，它们的版本历史记录非常类似于因果依赖关系的图形。通常一个提交发生在另一个提交之后，呈直线状态，但有时会出现分支(当几个人同时在一个项目上工作时)，并且当同时创建的提交合并时也会创建分支。

Linearizability is stronger than causal consistency

So what is the relationship between the causal order and linearizability? The answer is that linearizability implies causality: any system that is linearizable will preserve causality correctly [ 7 ]. In particular, if there are multiple communication channels in a system (such as the message queue and the file storage service in Figure 9-5 ), linearizability ensures that causality is automatically preserved without the system having to do anything special (such as passing around timestamps between different components).

什么是因果关系和可线性化之间的关系？答案是线性化意味着因果关系：任何可线性化的系统都会正确地保留因果关系。特别是，在系统中存在多个通信通道（如图9-5中的消息队列和文件存储服务），线性化确保因果关系自动保留，而不需要系统做任何特殊的事情（例如在不同组件之间传递时间戳）。

The fact that linearizability ensures causality is what makes linearizable systems simple to understand and appealing. However, as discussed in “The Cost of Linearizability” , making a system linearizable can harm its performance and availability, especially if the system has significant network delays (for example, if it’s geographically distributed). For this reason, some distributed data systems have abandoned linearizability, which allows them to achieve better performance but can make them difficult to work with.

线性化确保因果关系是使线性化系统易于理解和吸引人的原因。然而，如“线性化成本”所讨论的那样，使系统线性化可能会损害其性能和可用性，特别是如果系统存在重要的网络延迟（例如，如果它地理分布）。因此，一些分布式数据系统已经放弃了线性化，这使它们可以实现更好的性能，但可能使它们难以使用。

The good news is that a middle ground is possible. Linearizability is not the only way of preserving causality—there are other ways too. A system can be causally consistent without incurring the performance hit of making it linearizable (in particular, the CAP theorem does not apply). In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [ 2 , 42 ].

好消息是有一个中间地带是可能的。线性一致性并不是保持因果关系的唯一方法 - 还有其他方法。一个系统可以在不招致线性一致性的性能损失的情况下保持因果一致性（特别是CAP定理不适用）。实际上，因果一致性是最强大的一致性模型，它不会因网络延迟而变慢，并且在网络故障时仍然可用[2,42]。

In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems [ 49 , 50 , 51 ].

许多情况下，表面上需要线性化的系统实际上只需要因果一致性，这可以更有效地实现。基于这一发现，研究人员正在探索保留因果关系的新型数据库，其性能和可用性特性类似于最终一致性系统。

As this research is quite recent, not much of it has yet made its way into production systems, and there are still challenges to be overcome [ 52 , 53 ]. However, it is a promising direction for future systems.

由于这项研究相对较新，尚未在生产系统中得到广泛应用，仍然存在一些挑战 [52, 53]。然而，这是未来系统发展的一个有前途的方向。

Capturing causal dependencies

We won’t go into all the nitty-gritty details of how nonlinearizable systems can maintain causal consistency here, but just briefly explore some of the key ideas.

我们不会在这里详细讨论非线性系统如何维持因果一致性的所有细节，但只是简要探讨一些关键的思想。

In order to maintain causality, you need to know which operation happened before which other operation. This is a partial order: concurrent operations may be processed in any order, but if one operation happened before another, then they must be processed in that order on every replica. Thus, when a replica processes an operation, it must ensure that all causally preceding operations (all operations that happened before) have already been processed; if some preceding operation is missing, the later operation must wait until the preceding operation has been processed.

为了保持因果关系，你需要知道哪个操作在哪个操作前发生。这是部分顺序：并发操作可以以任何顺序进行处理，但如果一个操作在另一个操作之前发生，则它们必须在每个副本上以那个顺序进行处理。因此，当副本处理操作时，它必须确保所有因果关系前面的操作（所有已发生的操作）都已经被处理了；如果缺少某个前一操作，则后续操作必须等待前一操作已经被处理。

In order to determine causal dependencies, we need some way of describing the “knowledge” of a node in the system. If a node had already seen the value X when it issued the write Y, then X and Y may be causally related. The analysis uses the kinds of questions you would expect in a criminal investigation of fraud charges: did the CEO know about X at the time when they made decision Y?

为了确定因果依赖关系，我们需要描述系统中节点的“知识”的一些方式。如果节点在发出写入Y时已经看到了值X，则X和Y可能存在因果关系。分析使用了诈骗指控刑事调查中所期望的问题类型：CEO在做出决策Y时是否知道X的存在？

The techniques for determining which operation happened before which other operation are similar to what we discussed in “Detecting Concurrent Writes” . That section discussed causality in a leaderless datastore, where we need to detect concurrent writes to the same key in order to prevent lost updates. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this [ 54 ].

确定哪个操作在哪个其他操作之前发生的技术与我们在“检测并发写入”中讨论过的类似。该部分讨论了在无领导数据存储中的因果关系，我们需要检测对同一关键字的并发写入，以防止丢失更新。因果一致性更进一步：它需要跟踪整个数据库的因果依赖关系，而不仅仅是单个关键字。版本向量可以推广到这样做。

In order to determine the causal ordering, the database needs to know which version of the data was read by the application. This is why, in Figure 5-13 , the version number from the prior operation is passed back to the database on a write. A similar idea appears in the conflict detection of SSI, as discussed in “Serializable Snapshot Isolation (SSI)” : when a transaction wants to commit, the database checks whether the version of the data that it read is still up to date. To this end, the database keeps track of which data has been read by which transaction.

为了确定因果关系的顺序，数据库需要知道应用程序读取了哪个数据版本。因此，在图5-13中，先前操作的版本号在写入时传递回数据库。类似的想法出现在SSI的冲突检测中，如“可序列化快照隔离（SSI）”所讨论的那样：当一个事务想要提交时，数据库会检查它所读取的数据版本是否仍然是最新的。为此，数据库跟踪哪些事务已读取了哪些数据。

Sequence Number Ordering

Although causality is an important theoretical concept, actually keeping track of all causal dependencies can become impractical. In many applications, clients read lots of data before writing something, and then it is not clear whether the write is causally dependent on all or only some of those prior reads. Explicitly tracking all the data that has been read would mean a large overhead.

尽管因果关系是一个重要的理论概念，但实际上跟踪所有的因果依赖关系可能变得不切实际。在许多应用中，客户端在写入某些内容之前读取了大量数据，然后就不清楚写入是否对所有或仅某些先前读取的数据有因果依赖关系。显式跟踪所有已读取的数据将意味着巨大的开销。

However, there is a better way: we can use sequence numbers or timestamps to order events. A timestamp need not come from a time-of-day clock (or physical clock, which have many problems, as discussed in “Unreliable Clocks” ). It can instead come from a logical clock , which is an algorithm to generate a sequence of numbers to identify operations, typically using counters that are incremented for every operation.

然而，有一种更好的方法：我们可以使用序列号或时间戳来排序事件。时间戳不需要来自于时钟（或物理时钟，由于存在许多问题，如“不可靠的时钟”所述）。它可以来自于逻辑时钟，这是一种算法，用于生成一系列数字来识别操作，通常使用针对每个操作递增的计数器。

Such sequence numbers or timestamps are compact (only a few bytes in size), and they provide a total order : that is, every operation has a unique sequence number, and you can always compare two sequence numbers to determine which is greater (i.e., which operation happened later).

这样的序列号或时间戳非常紧凑（仅几个字节大小），并提供了完整的排序：也就是说，每个操作都有一个唯一的序列号，您始终可以比较两个序列号以确定哪个更大（即哪个操作发生得更晚）。

In particular, we can create sequence numbers in a total order that is consistent with causality : ^vii we promise that if operation A causally happened before B, then A occurs before B in the total order (A has a lower sequence number than B). Concurrent operations may be ordered arbitrarily. Such a total order captures all the causality information, but also imposes more ordering than strictly required by causality.

特别是，我们可以创建与因果关系一致的总序列号序列：我们保证，如果操作A在因果上先于B，那么A出现在总序列中早于B（A的序列号比B小）。并发操作可以任意排序。这种总序列捕获了所有因果关系信息，但也强加了比因果关系要求更多的排序。

In a database with single-leader replication (see “Leaders and Followers” ), the replication log defines a total order of write operations that is consistent with causality. The leader can simply increment a counter for each operation, and thus assign a monotonically increasing sequence number to each operation in the replication log. If a follower applies the writes in the order they appear in the replication log, the state of the follower is always causally consistent (even if it is lagging behind the leader).

在具有单领导者复制的数据库中（请参见“领导者和追随者”），复制日志定义了与因果一致的写操作总序。领导者可以简单地为每个操作增加一个计数器，因此为复制日志中的每个操作分配一个单调递增的序列号。如果追随者按照复制日志中出现的顺序应用写入，则追随者的状态始终是因果一致的（即使滞后于领导者）。

Noncausal sequence number generators

If there is not a single leader (perhaps because you are using a multi-leader or leaderless database, or because the database is partitioned), it is less clear how to generate sequence numbers for operations. Various methods are used in practice:

如果没有单个领导者（可能是因为使用了多个领导者或无领导者数据库，或者因为数据库被分区），则如何生成操作的序列号就不太清楚了。实践中使用各种方法：

Each node can generate its own independent set of sequence numbers. For example, if you have two nodes, one node can generate only odd numbers and the other only even numbers. In general, you could reserve some bits in the binary representation of the sequence number to contain a unique node identifier, and this would ensure that two different nodes can never generate the same sequence number.

每个节点可以生成自己独立的序列号集合。例如，如果有两个节点，则一个节点可以仅生成奇数，另一个节点仅生成偶数。通常，您可以保留序列号的二进制表示中的一些位来包含唯一的节点标识符，这将确保两个不同的节点永远不会生成相同的序列号。
You can attach a timestamp from a time-of-day clock (physical clock) to each operation [ 55 ]. Such timestamps are not sequential, but if they have sufficiently high resolution, they might be sufficient to totally order operations. This fact is used in the last write wins conflict resolution method (see “Timestamps for ordering events” ).

可以为每个操作附加一个来自时钟的时间戳（物理时钟）[55]。这些时间戳不是连续的，但如果它们具有足够高的分辨率，则可能足以完全有序操作。这个事实被用于最后写入赢得冲突解决方法（请参阅“时间戳用于事件排序”）。
You can preallocate blocks of sequence numbers. For example, node A might claim the block of sequence numbers from 1 to 1,000, and node B might claim the block from 1,001 to 2,000. Then each node can independently assign sequence numbers from its block, and allocate a new block when its supply of sequence numbers begins to run low.

你可以预分配一段序列号的块。例如，节点A可以声称从1到1,000的序列号块，而节点B可以声称从1,001到2,000的块。然后，每个节点可以独立地从其块中分配序列号，并在其序列号供应开始变得不足时分配一个新块。

These three options all perform better and are more scalable than pushing all operations through a single leader that increments a counter. They generate a unique, approximately increasing sequence number for each operation. However, they all have a problem: the sequence numbers they generate are not consistent with causality .

这三个选项都比通过单个领袖推送所有操作并增加计数器更有效，且更具可扩展性。它们为每个操作生成唯一的、近乎递增的序列号。然而，它们都存在一个问题：它们生成的序列号与因果关系不一致。

The causality problems occur because these sequence number generators do not correctly capture the ordering of operations across different nodes:

因果关系问题的发生是因为这些序列号生成器无法正确捕捉不同节点之间操作的顺序。

Each node may process a different number of operations per second. Thus, if one node generates even numbers and the other generates odd numbers, the counter for even numbers may lag behind the counter for odd numbers, or vice versa. If you have an odd-numbered operation and an even-numbered operation, you cannot accurately tell which one causally happened first.

每个节点可能每秒处理不同数量的操作。因此，如果一个节点生成偶数，另一个生成奇数，偶数计数器可能落后于奇数计数器，反之亦然。如果你有一个奇数操作和一个偶数操作，你不能准确地判断哪一个先发生了。
Timestamps from physical clocks are subject to clock skew, which can make them inconsistent with causality. For example, see Figure 8-3 , which shows a scenario in which an operation that happened causally later was actually assigned a lower timestamp. ^viii

物理时钟的时间戳受到时钟偏差的影响，可能与因果关系不一致。例如，参见图8-3，它展示了一个场景，在该场景中，一个因果上更晚发生的操作实际上被分配了一个较低的时间戳。
In the case of the block allocator, one operation may be given a sequence number in the range from 1,001 to 2,000, and a causally later operation may be given a number in the range from 1 to 1,000. Here, again, the sequence number is inconsistent with causality.

在块分配器的情况下，一个操作可能被分配一个序列号，范围从1,001到2,000，后续操作可能被分配一个编号范围在1到1,000之间。在这种情况下，序列号与因果关系不一致。

Lamport timestamps

Although the three sequence number generators just described are inconsistent with causality, there is actually a simple method for generating sequence numbers that is consistent with causality. It is called a Lamport timestamp , proposed in 1978 by Leslie Lamport [ 56 ], in what is now one of the most-cited papers in the field of distributed systems.

尽管刚才描述的三个序列号生成器与因果关系不一致，但实际上有一种简单的方法可以生成序列号，该方法与因果关系一致。这被称为Lamport时间戳，是由Leslie Lamport在1978年提出的，现在是分布式系统领域中最受引用的论文之一。

The use of Lamport timestamps is illustrated in Figure 9-8 . Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of ( counter , node ID ). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.

Lamport时间戳的使用如图9-8所示。每个节点都有一个唯一的标识符，并且每个节点都保留着它已处理的操作数的计数器。 Lamport时间戳只是一个（计数器，节点ID）的对。两个节点有时可能具有相同的计数器值，但通过在时间戳中包含节点ID，每个时间戳都变得唯一。

A Lamport timestamp bears no relationship to a physical time-of-day clock, but it provides total ordering: if you have two timestamps, the one with a greater counter value is the greater timestamp; if the counter values are the same, the one with the greater node ID is the greater timestamp.

Lamport时间戳与实际日历时间无关，但它提供了全局排序：如果你有两个时间戳，拥有更大计数器值的时间戳更大；如果计数器值相同，则拥有更大节点ID的时间戳更大。

So far this description is essentially the same as the even/odd counters described in the last section. The key idea about Lamport timestamps, which makes them consistent with causality, is the following: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

目前，这个描述与上一节中描述的奇偶计数器基本相同。 Lamport时间戳的关键思想是使它们与因果关系一致的：每个节点和每个客户端都会跟踪迄今为止看到的最大计数器值，并在每个请求中包括该最大值。当节点收到具有大于自己计数器值的最大计数器值的请求或响应时，它立即将自己的计数器增加到那个最大值。

This is shown in Figure 9-8 , where client A receives a counter value of 5 from node 2, and then sends that maximum of 5 to node 1. At that time, node 1’s counter was only 1, but it was immediately moved forward to 5, so the next operation had an incremented counter value of 6.

这在图9-8中所示，客户端A从节点2接收了计数器值5，然后将最大值5发送到节点1。当时，节点1的计数器仅为1，但它立即向前移动到5，因此下一个操作的计数器值为6。

As long as the maximum counter value is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased timestamp.

只要每次操作都携带最大计数器值，此方案将确保Lamport时间戳的顺序符合因果关系，因为每个因果依赖都会导致时间戳增加。

Lamport timestamps are sometimes confused with version vectors, which we saw in “Detecting Concurrent Writes” . Although there are some similarities, they have a different purpose: version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering. From the total ordering of Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent. The advantage of Lamport timestamps over version vectors is that they are more compact.

Lamport时间戳有时会与版本向量混淆，我们在“检测并发写入”中已经看到了。虽然有一些相似之处，但它们有不同的目的：版本向量可以区分两个操作是否并发或是否彼此因果相关，而Lamport时间戳始终强制执行完全排序。从Lamport时间戳的完全排序中，您无法确定两个操作是否并发或是否彼此因果相关。与版本向量相比，Lamport时间戳的优点在于它们更紧凑。

Timestamp ordering is not sufficient

Although Lamport timestamps define a total order of operations that is consistent with causality, they are not quite sufficient to solve many common problems in distributed systems.

虽然Lamport时间戳定义了与因果一致的操作总顺序，但它们并不足以解决许多分布式系统中的常见问题。

For example, consider a system that needs to ensure that a username uniquely identifies a user account. If two users concurrently try to create an account with the same username, one of the two should succeed and the other should fail. (We touched on this problem previously in “The leader and the lock” .)

例如，考虑一个需要确保用户名唯一标识用户账户的系统。如果两个用户同时尝试使用相同的用户名创建账户，其中一个应该成功，另一个应该失败。（我们之前在“领导者和锁”中提到过这个问题。）

At first glance, it seems as though a total ordering of operations (e.g., using Lamport timestamps) should be sufficient to solve this problem: if two accounts with the same username are created, pick the one with the lower timestamp as the winner (the one who grabbed the username first), and let the one with the greater timestamp fail. Since timestamps are totally ordered, this comparison is always valid.

乍一看，似乎全面排序操作（例如使用 Lamport 时间戳）足以解决此问题：如果创建了两个具有相同用户名的帐户，请选择时间戳较低的帐户作为获胜者（第一个占用用户名的人），并让时间戳较大的帐户失败。由于时间戳是完全有序的，因此此比较始终有效。

This approach works for determining the winner after the fact: once you have collected all the username creation operations in the system, you can compare their timestamps. However, it is not sufficient when a node has just received a request from a user to create a username, and needs to decide right now whether the request should succeed or fail. At that moment, the node does not know whether another node is concurrently in the process of creating an account with the same username, and what timestamp that other node may assign to the operation.

这种方法适用于事后确定胜者：一旦您收集了系统中所有的用户名创建操作，就可以比较它们的时间戳。但是，当一个节点刚刚收到用户创建用户名的请求，并且需要立即决定请求是成功还是失败时，这种方法是不够的。在那一刻，该节点不知道另一个节点是否正在同时创建具有相同用户名的帐户，以及另一个节点可能为该操作分配的时间戳。

In order to be sure that no other node is in the process of concurrently creating an account with the same username and a lower timestamp, you would have to check with every other node to see what it is doing [ 56 ]. If one of the other nodes has failed or cannot be reached due to a network problem, this system would grind to a halt. This is not the kind of fault-tolerant system that we need.

为确保没有其他节点同时使用相同的用户名和较低的时间戳创建帐户，您需要与每个其他节点检查其正在做什么[56]。如果其中一个其他节点失败或由于网络问题无法访问，则该系统将停止工作。这不是我们所需的容错系统。

The problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.

问题在于只有在收集了所有操作后，操作的总顺序才会出现。如果另一个节点已经生成了一些操作，但您还不知道它们是什么，您不能构建操作的最终顺序：来自其他节点的未知操作可能需要被插入到总顺序的各个位置。

To conclude: in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations—you also need to know when that order is finalized. If you have an operation to create a username, and you are sure that no other node can insert a claim for the same username ahead of your operation in the total order, then you can safely declare the operation successful.

为了实现像用户名唯一性约束这样的东西，仅仅拥有一种操作的完全排序是不够的 - 你还需要知道什么时候这个排序被最终确定。如果你有一个操作来创建一个用户名，并且你确信没有其他节点能在全序中在你的操作之前插入相同的用户名声明，那么你可以安全地声明该操作成功。

This idea of knowing when your total order is finalized is captured in the topic of total order broadcast .

这种在整个订单结束时知道的想法被称为总订单广播的话题所涵盖。

Total Order Broadcast

If your program runs only on a single CPU core, it is easy to define a total ordering of operations: it is simply the order in which they were executed by the CPU. However, in a distributed system, getting all nodes to agree on the same total ordering of operations is tricky. In the last section we discussed ordering by timestamps or sequence numbers, but found that it is not as powerful as single-leader replication (if you use timestamp ordering to implement a uniqueness constraint, you cannot tolerate any faults).

如果您的程序仅在单个CPU核心上运行，则很容易定义操作的完整排序：它只是CPU执行它们的顺序。然而，在分布式系统中，使所有节点都同意操作的完整排序很棘手。在最后一节中，我们讨论了按时间戳或序列号排序，但发现它不如单领导者复制强大（如果您使用时间戳排序来实现唯一性约束，则无法容忍任何故障）。

As discussed, single-leader replication determines a total order of operations by choosing one node as the leader and sequencing all operations on a single CPU core on the leader. The challenge then is how to scale the system if the throughput is greater than a single leader can handle, and also how to handle failover if the leader fails (see “Handling Node Outages” ). In the distributed systems literature, this problem is known as total order broadcast or atomic broadcast [ 25 , 57 , 58 ]. ^ix

如讨论过的，单领导者复制通过选择一个节点作为领导者，并在领导者的单个 CPU 核心上排序所有操作，从而确定操作的总顺序。然后，如果吞吐量超过单个领导者可以处理的范围，如何扩展系统，以及如何处理领导者失败（请参见“处理节点故障”）则是挑战所在。在分布式系统文献中，这个问题被称为总序广播或原子广播 [25,57,58]。

Scope of ordering guarantee

Partitioned databases with a single leader per partition often maintain ordering only per partition, which means they cannot offer consistency guarantees (e.g., consistent snapshots, foreign key references) across partitions. Total ordering across all partitions is possible, but requires additional coordination [ 59 ].

每个分区有一个单独的领导者的分区数据库通常只能保持每个分区的排序，这意味着它们无法在分区之间提供一致性保证（例如，一致的快照、外键引用）。跨所有分区的总排序是可能的，但需要额外的协调。

Total order broadcast is usually described as a protocol for exchanging messages between nodes. Informally, it requires that two safety properties always be satisfied:

总订单广播通常被描述为节点间交换消息的协议。简单来说，它要求始终满足两个安全属性。

Reliable delivery

No messages are lost: if a message is delivered to one node, it is delivered to all nodes.

没有遗失的消息：如果一条消息被传递到一个节点，那么它也会被传递到所有其他节点。

Totally ordered delivery

Messages are delivered to every node in the same order.

消息被按照相同的顺序传递到每个节点。

A correct algorithm for total order broadcast must ensure that the reliability and ordering properties are always satisfied, even if a node or the network is faulty. Of course, messages will not be delivered while the network is interrupted, but an algorithm can keep retrying so that the messages get through when the network is eventually repaired (and then they must still be delivered in the correct order).

总序播算法必须确保可靠性和排序属性始终得到满足，即使有节点或网络故障。当然，当网络中断时，消息将无法传递，但是算法可以不断尝试使得消息在网络最终修复时（并且仍然以正确的顺序交付）。

Using total order broadcast

Consensus services such as ZooKeeper and etcd actually implement total order broadcast. This fact is a hint that there is a strong connection between total order broadcast and consensus, which we will explore later in this chapter.

共识服务如ZooKeeper和etcd实际上实现了完全有序广播。这个事实提示完全有序广播和共识之间存在着很强的联系，在本章后面我们将会探讨这一点。

Total order broadcast is exactly what you need for database replication: if every message represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other (aside from any temporary replication lag). This principle is known as state machine replication [ 60 ], and we will return to it in Chapter 11 .

“全序广播”是数据库复制所需的关键因素：如果每条消息代表对数据库的写入，并且每个副本以相同的顺序处理相同的写入，则各个副本将保持一致（除了任何短暂的复制延迟）。这个原则被称为状态机复制[60]，我们将在第11章中回到它。”

Similarly, total order broadcast can be used to implement serializable transactions: as discussed in “Actual Serial Execution” , if every message represents a deterministic transaction to be executed as a stored procedure, and if every node processes those messages in the same order, then the partitions and replicas of the database are kept consistent with each other [ 61 ].

类似地，全序播送可以用于实现可串行化事务：如同“实际串行执行”中所讨论的，如果每个消息代表一个要作为存储过程执行的确定性事务，且每个节点按相同顺序处理这些消息，则数据库的分区和副本保持一致。

An important aspect of total order broadcast is that the order is fixed at the time the messages are delivered: a node is not allowed to retroactively insert a message into an earlier position in the order if subsequent messages have already been delivered. This fact makes total order broadcast stronger than timestamp ordering.

全序广播的一个重要方面是消息在传递时顺序固定：如果后续消息已经被传递，节点就不允许将消息反向插入到早期位置。这个事实使得全序广播比时间戳排序更加强大。

Another way of looking at total order broadcast is that it is a way of creating a log (as in a replication log, transaction log, or write-ahead log): delivering a message is like appending to the log. Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.

另一种理解全序广播的方式是将其视为创建日志的一种方式（例如复制日志、事务日志或预写式日志）：传递消息就像附加到日志中一样。由于所有节点必须按照相同的顺序传递相同的消息，因此所有节点可以读取日志并查看相同的消息序列。

Total order broadcast is also useful for implementing a lock service that provides fencing tokens (see “Fencing tokens” ). Every request to acquire the lock is appended as a message to the log, and all messages are sequentially numbered in the order they appear in the log. The sequence number can then serve as a fencing token, because it is monotonically increasing. In ZooKeeper, this sequence number is called zxid [ 15 ].

全序广播还可用于实现提供围栏令牌的锁定服务（请见“围栏令牌”）。每个请求获取锁都将作为一个消息附加到日志中，并按照它们在日志中出现的顺序进行顺序编号。这个序列号可以作为围栏令牌，因为它是单调递增的。在ZooKeeper中，这个序列号称为zxid [15]。

Implementing linearizable storage using total order broadcast

As illustrated in Figure 9-4 , in a linearizable system there is a total order of operations. Does that mean linearizability is the same as total order broadcast? Not quite, but there are close links between the two. ^x

如图9-4所示，在可线性化的系统中存在操作的总序。这是否意味着线性化与总序广播相同？并非完全如此，但两者之间存在密切联系。

Total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered (so one recipient may lag behind the others). By contrast, linearizability is a recency guarantee: a read is guaranteed to see the latest value written.

总订单广播是异步的：保证消息按照固定顺序可靠地传送，但不保证何时传送消息（因此一个收件人可能会落后于其他人）。相比之下，线性一致性是最新性保证：读取保证可以看到最新写入的值。

However, if you have total order broadcast, you can build linearizable storage on top of it. For example, you can ensure that usernames uniquely identify user accounts.

然而，如果您拥有完全有序的广播，则可以在其上构建可线性化的存储。例如，您可以确保用户名唯一地标识用户帐户。

Imagine that for every possible username, you can have a linearizable register with an atomic compare-and-set operation. Every register initially has the value null (indicating that the username is not taken). When a user wants to create a username, you execute a compare-and-set operation on the register for that username, setting it to the user account ID, under the condition that the previous register value is null . If multiple users try to concurrently grab the same username, only one of the compare-and-set operations will succeed, because the others will see a value other than null (due to linearizability).

假设对于每个可能的用户名，您可以拥有一个具有原子比较和设置操作的可线性化寄存器。每个寄存器最初值为 null（表示该用户名尚未被占用）。当用户想要创建用户名时，您执行对该用户名的寄存器的比较和设置操作，在先前的寄存器值为 null 的条件下将其设置为用户帐户 ID。如果多个用户尝试并发获取相同的用户名，则只有一个比较和设置操作将成功，因为其他人将看到一个不为 null 的值（由于可线性化性）。

You can implement such a linearizable compare-and-set operation as follows by using total order broadcast as an append-only log [ 62 , 63 ]:

您可以通过使用全序广播作为仅限追加日志来实现这样的可线性比较和设置操作。[62，63]：

Append a message to the log, tentatively indicating the username you want to claim.

向日志中追加一条消息，暂时指示您想要声明的用户名。
Read the log, and wait for the message you appended to be delivered back to you. ^xi

阅读日志，等待回传给您的消息。
Check for any messages claiming the username that you want. If the first message for your desired username is your own message, then you are successful: you can commit the username claim (perhaps by appending another message to the log) and acknowledge it to the client. If the first message for your desired username is from another user, you abort the operation.

检查是否有任何声称使用您所需用户名的消息。如果您所需用户名的第一条消息是您自己的消息，则您已经成功：您可以提交用户名声明（可能通过将另一条消息附加到日志中）并向客户确认。如果您所需用户名的第一条消息来自另一个用户，则应中止操作。

Because log entries are delivered to all nodes in the same order, if there are several concurrent writes, all nodes will agree on which one came first. Choosing the first of the conflicting writes as the winner and aborting later ones ensures that all nodes agree on whether a write was committed or aborted. A similar approach can be used to implement serializable multi-object transactions on top of a log [ 62 ].

由于日志条目以相同顺序传递到所有节点，如果有多个并发写入，则所有节点将就哪个写入先到达达成一致。选择冲突写入中的第一个作为赢家并放弃后续写入，确保所有节点都同意写入是成功还是中止。类似的方法可用于在日志之上实现可串行化的多对象事务。

While this procedure ensures linearizable writes, it doesn’t guarantee linearizable reads—if you read from a store that is asynchronously updated from the log, it may be stale. (To be precise, the procedure described here provides sequential consistency [ 47 , 64 ], sometimes also known as timeline consistency [ 65 , 66 ], a slightly weaker guarantee than linearizability.) To make reads linearizable, there are a few options:

尽管此过程确保了可线性化的写入，但并不保证可线性化的读取。如果您从记录异步更新的存储器中读取，它可能是过时的。为了使读取可线性化，有几个选择：（要精确，请注意此处描述的过程提供了“顺序一致性”，有时也称为“时间线一致性”，这是比线性一致性稍微弱一些的保证。）

You can sequence reads through the log by appending a message, reading the log, and performing the actual read when the message is delivered back to you. The message’s position in the log thus defines the point in time at which the read happens. (Quorum reads in etcd work somewhat like this [ 16 ].)

通过附加消息、读取日志并在消息被发送回您时执行实际读取操作，您可以通过日志对读取进行排序。因此，消息在日志中的位置定义了读取发生的时间点。（etcd 中的 Quorum 读取有些类似这种方式 [16]）。
If the log allows you to fetch the position of the latest log message in a linearizable way, you can query that position, wait for all entries up to that position to be delivered to you, and then perform the read. (This is the idea behind ZooKeeper’s sync() operation [ 15 ].)

如果日志允许您以线性化的方式获取最新日志消息的位置，您可以查询该位置，等待所有条目到达该位置，然后执行读取操作。（这就是ZooKeeper的sync()操作的背后思想）
You can make your read from a replica that is synchronously updated on writes, and is thus sure to be up to date. (This technique is used in chain replication [ 63 ]; see also “Research on Replication” .)

你可以从一个同步更新写操作的副本中阅读，因此保证它是最新的。这种技术在链式复制中被使用（参见“复制研究”）。

Implementing total order broadcast using linearizable storage

The last section showed how to build a linearizable compare-and-set operation from total order broadcast. We can also turn it around, assume that we have linearizable storage, and show how to build total order broadcast from it.

上一节展示了如何从全序广播构建可线性化的比较和交换操作。我们也可以反过来，假设我们拥有可线性化存储，并展示如何从中构建全序广播。

The easiest way is to assume you have a linearizable register that stores an integer and that has an atomic increment-and-get operation [ 28 ]. Alternatively, an atomic compare-and-set operation would also do the job.

最简单的方式就是假设有一个可以存储整数的可线性化寄存器，并且拥有原子的增加和获取操作[28]。另一种选择是使用原子的比较并设置操作。

The algorithm is simple: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.

算法很简单：对于想要通过总序传播发送的每一条消息，你需要递增并获取线性化整数，然后将你从寄存器中获取的值作为序列号附加到消息上。接着你可以将消息发送到所有节点（重新发送任何丢失的消息），收件人将按照序列号依次传递消息。

Note that unlike Lamport timestamps, the numbers you get from incrementing the linearizable register form a sequence with no gaps. Thus, if a node has delivered message 4 and receives an incoming message with a sequence number of 6, it knows that it must wait for message 5 before it can deliver message 6. The same is not the case with Lamport timestamps—in fact, this is the key difference between total order broadcast and timestamp ordering.

请注意，与Lamport时间戳不同，从递增线性寄存器获得的数字形成一个没有间隙的序列。因此，如果节点已经传送了消息4并接收到序列号为6的传入消息，则它知道在传送消息6之前必须等待消息5。对于Lamport时间戳来说，情况并非如此——实际上，这是总序列广播和时间戳排序之间的关键区别。

How hard could it be to make a linearizable integer with an atomic increment-and-get operation? As usual, if things never failed, it would be easy: you could just keep it in a variable on one node. The problem lies in handling the situation when network connections to that node are interrupted, and restoring the value when that node fails [ 59 ]. In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.

“使用原子增量和获取操作来生成可线性化的整数会有多难呢？通常情况下，如果不会出现故障，那么这非常容易实现：在一个节点上将其保留在一个变量中即可。问题在于当与该节点的网络连接中断时，如何处理并在该节点发生故障时恢复该值[59]。一般情况下，如果您足够努力地思考线性序列号生成器，最终会使用共识算法。”

This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [ 28 , 67 ]. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!

这不是巧合：可以证明，可线性化的比较并设置（或增量并获取）寄存器和全序广播都等价于共识[28，67]。也就是说，如果你可以解决其中一个问题，你可以将其转化为其他问题的解决方案。这是一个相当深刻和令人惊讶的见解！

It is time to finally tackle the consensus problem head-on, which we will do in the rest of this chapter.

到了终于直面共识问题的时候了，我们将在本章的其余部分来解决它。

Distributed Transactions and Consensus

Consensus is one of the most important and fundamental problems in distributed computing. On the surface, it seems simple: informally, the goal is simply to get several nodes to agree on something . You might think that this shouldn’t be too hard. Unfortunately, many broken systems have been built in the mistaken belief that this problem is easy to solve.

共识是分布式计算中最重要和基本的问题之一。表面上看，它似乎很简单：通俗地讲，目标就是让几个节点就某个事项达成一致。你可能会认为这不应该太难。不幸的是，很多破碎的系统在错误的信仰下构建，即解决这个问题很容易。

Although consensus is very important, the section about it appears late in this book because the topic is quite subtle, and appreciating the subtleties requires some prerequisite knowledge. Even in the academic research community, the understanding of consensus only gradually crystallized over the course of decades, with many misunderstandings along the way. Now that we have discussed replication ( Chapter 5 ), transactions ( Chapter 7 ), system models ( Chapter 8 ), linearizability, and total order broadcast (this chapter), we are finally ready to tackle the consensus problem.

尽管共识非常重要，但这本书中关于共识的部分出现得较晚，因为这个主题非常微妙，需要一些先决知识才能欣赏其微妙之处。即使在学术研究界，对共识的理解也是在数十年的过程中逐渐凝聚起来的，并伴随着许多误解。现在，我们已经讨论了复制（第5章）、事务（第7章）、系统模型（第8章）、线性化和全序广播（本章），终于可以着手解决共识问题。

There are a number of situations in which it is important for nodes to agree. For example:

有很多情况需要节点达成共识。例如：

Leader election

In a database with single-leader replication, all nodes need to agree on which node is the leader. The leadership position might become contested if some nodes can’t communicate with others due to a network fault. In this case, consensus is important to avoid a bad failover, resulting in a split brain situation in which two nodes both believe themselves to be the leader (see “Handling Node Outages” ). If there were two leaders, they would both accept writes and their data would diverge, leading to inconsistency and data loss.

在具有单领导副本的数据库中，所有节点都需要就哪个节点是领导者达成一致。如果由于网络故障某些节点无法与其他节点通信，则领导地位可能会受到争议。在这种情况下，共识对于避免糟糕的故障转移非常重要，导致分裂的大脑情况，即两个节点都认为自己是领导者（请参阅“处理节点故障”）。如果有两个领导者，他们都将接受写入并且它们的数据将分歧，导致不一致和数据丢失。

Atomic commit

In a database that supports transactions spanning several nodes or partitions, we have the problem that a transaction may fail on some nodes but succeed on others. If we want to maintain transaction atomicity (in the sense of ACID; see “Atomicity” ), we have to get all nodes to agree on the outcome of the transaction: either they all abort/roll back (if anything goes wrong) or they all commit (if nothing goes wrong). This instance of consensus is known as the atomic commit problem. ^xii

在支持跨多个节点或分区的事务的数据库中，我们面临的问题是该事务可能在某些节点上失败，但在其他节点上成功。如果我们想要维护事务的原子性（在ACID意义上；参见“原子性”），我们必须让所有节点就事务的结果达成一致：如果出现任何问题，则它们全部中止/回滚，或者如果没有任何问题，则它们全部提交。这种共识的实例称为原子提交问题。

The Impossibility of Consensus

You may have heard about the FLP result [ 68 ]—named after the authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms for achieving consensus. What is going on here?

你可能听说过FLP结果[68]，得名于Fischer、Lynch和Paterson，它证明如果存在节点崩溃的风险，就没有一种算法能始终达成共识。在分布式系统中，我们必须假设节点可能会崩溃，因此可靠的共识是不可能的。然而，我们现在正讨论达成共识的算法。这是怎么回事？

The answer is that the FLP result is proved in the asynchronous system model (see “System Model and Reality” ), a very restrictive model that assumes a deterministic algorithm that cannot use any clocks or timeouts. If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes (even if the suspicion is sometimes wrong), then consensus becomes solvable [ 67 ]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [ 69 ].

FLP 结果证明了在异步系统模型下（参见“系统模型与现实”），只要算法是确定性的且不使用任何时钟或超时，那么共识就是不可能的。如果允许算法使用超时或其他方法来识别疑似崩溃节点（即使有时候怀疑是错误的），那么共识就可以得到解决[67]。即使只允许算法使用随机数字也足以摆脱不可能性结果[69]。

Thus, although the FLP result about the impossibility of consensus is of great theoretical importance, distributed systems can usually achieve consensus in practice.

因此，尽管FLP关于共识不可能性的结果在理论上具有重要意义，分布式系统通常可以在实践中实现共识。

In this section we will first examine the atomic commit problem in more detail. In particular, we will discuss the two-phase commit (2PC) algorithm, which is the most common way of solving atomic commit and which is implemented in various databases, messaging systems, and application servers. It turns out that 2PC is a kind of consensus algorithm—but not a very good one [ 70 , 71 ].

在本节中，我们将首先更详细地研究原子提交问题。特别地，我们将讨论两阶段提交（2PC）算法，这是解决原子提交问题的最常见方式，并在各种数据库、消息系统和应用服务器中实现。事实证明，2PC是一种共识算法，但不是很好的算法[70，71]。

By learning from 2PC we will then work our way toward better consensus algorithms, such as those used in ZooKeeper (Zab) and etcd (Raft).

通过学习2PC，我们将逐步学习更好的共识算法，例如在ZooKeeper中使用的Zab和在etcd中使用的Raft。

Atomic Commit and Two-Phase Commit (2PC)

In Chapter 7 we learned that the purpose of transaction atomicity is to provide simple semantics in the case where something goes wrong in the middle of making several writes. The outcome of a transaction is either a successful commit , in which case all of the transaction’s writes are made durable, or an abort , in which case all of the transaction’s writes are rolled back (i.e., undone or discarded).

在第七章中，我们了解到交易的原子性的目的是在进行多个写操作时出现错误时使用简单的语义。事务的结果要么是成功提交，此时所有事务的写操作都是持久性的，要么是中止，此时所有事务的写操作都会被回滚（即撤销或丢弃）。

Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state. This is especially important for multi-object transactions (see “Single-Object and Multi-Object Operations” ) and databases that maintain secondary indexes. Each secondary index is a separate data structure from the primary data—thus, if you modify some data, the corresponding change needs to also be made in the secondary index. Atomicity ensures that the secondary index stays consistent with the primary data (if the index became inconsistent with the primary data, it would not be very useful).

原子性防止失败的事务在数据库中留下未完成的结果和未更新的状态。这对于多对象事务（请参见“单对象和多对象操作”）和维护辅助索引的数据库尤其重要。每个辅助索引都是与主数据结构分开的单独数据结构，因此，如果您修改了某些数据，则相应的更改也需要在辅助索引中进行。原子性确保辅助索引与主数据保持一致（如果索引与主数据不一致，它将没有用处）。

From single-node to distributed atomic commit

For transactions that execute at a single database node, atomicity is commonly implemented by the storage engine. When the client asks the database node to commit the transaction, the database makes the transaction’s writes durable (typically in a write-ahead log; see “Making B-trees reliable” ) and then appends a commit record to the log on disk. If the database crashes in the middle of this process, the transaction is recovered from the log when the node restarts: if the commit record was successfully written to disk before the crash, the transaction is considered committed; if not, any writes from that transaction are rolled back.

针对在单个数据库节点执行的交易，通常由存储引擎来实现原子性。当客户端要求数据库节点提交交易时，数据库会将交易写入持久化存储（通常是写前日志；参见“使B树可靠”），然后在磁盘上附加提交记录。如果数据库在此过程中崩溃，则节点重新启动时从日志中恢复交易：如果提交记录在崩溃之前成功写入磁盘，则认为交易已提交；否则，来自该交易的任何写操作都将回滚。

Thus, on a single node, transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record [ 72 ]. The key deciding moment for whether the transaction commits or aborts is the moment at which the disk finishes writing the commit record: before that moment, it is still possible to abort (due to a crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.

因此，在单个节点上，事务的提交关键取决于数据持久写入磁盘的顺序：首先是数据，然后是提交记录。确定事务提交或中止的关键决定时刻是磁盘完成写入提交记录的时刻：在此之前，由于崩溃可能仍然可以中止，但在此之后，即使数据库崩溃，事务也已提交。因此，它是单个设备（连接到一个特定节点的一个特定磁盘驱动器的控制器）使提交具有原子性。

However, what if multiple nodes are involved in a transaction? For example, perhaps you have a multi-object transaction in a partitioned database, or a term-partitioned secondary index (in which the index entry may be on a different node from the primary data; see “Partitioning and Secondary Indexes” ). Most “NoSQL” distributed datastores do not support such distributed transactions, but various clustered relational systems do (see “Distributed Transactions in Practice” ).

然而，如果涉及多个节点的交易怎么办？例如，在分区数据库中可能有多个对象事务，或者在术语分区的二级索引中（其中索引条目可能位于与主数据不同的节点上；请参阅“分区和二级索引”）。大多数“NoSQL”分布式数据存储不支持这样的分布式事务，但各种集群关系系统都支持（请参阅“实践中的分布式事务”）。

In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. In doing so, it could easily happen that the commit succeeds on some nodes and fails on other nodes, which would violate the atomicity guarantee:

在这些情况下，简单地向所有节点发送一个提交请求并在每个节点上独立提交交易是不够的。这样做可能会导致提交成功于某些节点，但在其他节点上失败，这将违反原子性保证。

Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.

一些节点可能会检测到约束违规或冲突，需要进行中止，而其他节点能够成功提交。
Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.

有些提交请求可能会在网络中丢失，最终由于超时而中止，而其他提交请求则得以通过。
Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.

有些节点在提交记录完全写入之前可能会崩溃并在恢复时回滚，而其他节点则可顺利提交。

If some nodes commit the transaction but others abort it, the nodes become inconsistent with each other (like in Figure 7-3 ). And once a transaction has been committed on one node, it cannot be retracted again if it later turns out that it was aborted on another node. For this reason, a node must only commit once it is certain that all other nodes in the transaction are also going to commit.

如果某些节点提交事务但其他节点中止它，则节点彼此不一致（如图7-3所示）。一旦事务在一个节点上提交，如果后来发现它在另一个节点上被中止，它就不能被撤回。因此，节点必须在确信所有其他节点也将提交后才能提交。

A transaction commit must be irrevocable—you are not allowed to change your mind and retroactively abort a transaction after it has been committed. The reason for this rule is that once data has been committed, it becomes visible to other transactions, and thus other clients may start relying on that data; this principle forms the basis of read committed isolation, discussed in “Read Committed” . If a transaction was allowed to abort after committing, any transactions that read the committed data would be based on data that was retroactively declared not to have existed—so they would have to be reverted as well.

提交的交易必须是不可撤销的——一旦事务提交后，您不允许更改想法或撤消交易。这个规则的原因是，一旦数据已经提交，它就会被其他交易可见，因此其他客户端可能会开始依赖这些数据；这个原则构成了"读已提交隔离"的基础，见“读已提交”。如果一个交易在提交后允许被中止，任何读取已提交数据的交易都将基于事后宣布不存在的数据——因此它们也必须被回滚。

(It is possible for the effects of a committed transaction to later be undone by another, compensating transaction [ 73 , 74 ]. However, from the database’s point of view this is a separate transaction, and thus any cross-transaction correctness requirements are the application’s problem.)

一项已提交的事务的影响可以被另一个补偿性事务撤销[73，74]。然而，从数据库的角度来看，这是一个分开的事务，因此任何跨事务的正确性要求都是应用程序的问题。

Introduction to two-phase commit

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort. It is a classic algorithm in distributed databases [ 13 , 35 , 75 ]. 2PC is used internally in some databases and also made available to applications in the form of XA transactions [ 76 , 77 ] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services [ 78 , 79 ].

两阶段提交是一种算法，用于实现跨多个节点的原子事务提交-即确保所有节点都提交或所有节点都中止。这是分布式数据库中经典的算法。2PC在某些数据库中被内部使用，并以XA事务（例如Java事务API支持的方式）或通过WS-AtomicTransaction供应用程序使用，以用于SOAP Web服务。

The basic flow of 2PC is illustrated in Figure 9-9 . Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).

2PC的基本流程如图9-9所示。与单节点事务不同的是，2PC的提交/撤销过程分为两个阶段（因此得名）。

Don’t confuse 2PC and 2PL

Two-phase commit (2PC) and two-phase locking (see “Two-Phase Locking (2PL)” ) are two very different things. 2PC provides atomic commit in a distributed database, whereas 2PL provides serializable isolation. To avoid confusion, it’s best to think of them as entirely separate concepts and to ignore the unfortunate similarity in the names.

两阶段提交（2PC）和两阶段锁定（参见“两阶段锁定（2PL）”）是两个非常不同的概念。2PC在分布式数据库中提供原子提交，而2PL提供可串行化隔离。为了避免混淆，最好将它们视为完全独立的概念，并忽略名称上的不幸相似之处。

2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager ). The coordinator is often implemented as a library within the same application process that is requesting the transaction (e.g., embedded in a Java EE container), but it can also be a separate process or service. Examples of such coordinators include Narayana, JOTM, BTM, or MSDTC.

2PC使用了一种在单节点交易中通常不出现的新组件：协调员（也称事务管理器）。协调员通常作为同一应用程序过程中的库来实现（例如嵌入在Java EE容器中），但它也可以是一个单独的进程或服务。这些协调员的示例包括Narayana、JOTM、BTM或MSDTC。

A 2PC transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:

2PC事务始于应用读取和写入多个数据库节点上的数据，就像平常一样。我们将这些数据库节点称为事务的参与者。当应用准备提交时，协调者开始第一阶段：向每个节点发送一个准备请求，询问它们是否能够提交。然后，协调者跟踪参与者的响应。

If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.

如果所有参与者都回复 “是”，表示他们已准备好承诺，那么协调员会在第二阶段发送承诺请求，并且实际承诺会发生。
If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.

如果任何参与者回答“否”，协调员将向第2阶段中的所有节点发送中止请求。

This process is somewhat like the traditional marriage ceremony in Western cultures: the minister asks the bride and groom individually whether each wants to marry the other, and typically receives the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say “yes,” the ceremony is aborted [ 73 ].

这个过程有点像西方文化传统的婚礼仪式：牧师单独问新郎和新娘是否希望嫁娶对方，通常两人都会回答“是的”。在获得双方确认后，牧师宣布他们成为夫妻：交易得以进行，这个喜悦的事实会向所有出席者广播。如果新娘或新郎有任何人没有回答“是”，婚礼仪式就会被取消 [73]。

A system of promises

From this short description it might not be clear why two-phase commit ensures atomicity, while one-phase commit across several nodes does not. Surely the prepare and commit requests can just as easily be lost in the two-phase case. What makes 2PC different?

从这个简短的描述中可能不清楚为什么双阶段提交确保原子性，而跨多个节点的单阶段提交则不是。在两阶段情况下，准备和提交请求同样容易丢失。是什么让2PC不同呢？

To understand why it works, we have to break down the process in a bit more detail:

要理解为什么它起作用，我们必须更详细地分解该过程：

When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.

当应用程序想要开始一个分布式事务时，它会向协调者请求一个事务ID。这个事务ID是全局唯一的。
The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage (for example, a node crashes or a request times out), the coordinator or any of the participants can abort.

应用程序在每个参与者上开始一个单节点事务，并将全局唯一的事务ID附加到单节点事务。所有读写操作都在这些单节点事务中完成。如果在此阶段出现任何问题（例如，节点崩溃或请求超时），协调者或任何参与者都可以中止。
When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.

当应用准备执行时，协调者会向所有参与者发送一个带有全局事务ID标记的准备请求。如果其中任何一个请求失败或超时，协调者会向所有参与者发送一个有关该事务ID的中止请求。
When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances. This includes writing all transaction data to disk (a crash, a power failure, or running out of disk space is not an acceptable excuse for refusing to commit later), and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it.

当参与者接收到准备请求时，它确保在所有情况下都能够确定地提交该事务。这包括将所有事务数据写入磁盘（崩溃、停电或磁盘空间不足不是拒绝稍后提交的合理理由），并检查是否存在任何冲突或约束违规。通过回复“是”给协调者，节点承诺在请求时能够提交事务而不出现错误。换句话说，参与者放弃了终止事务的权利，但并没有实际提交它。
When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction (committing only if all participants voted “yes”). The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point .

当协调员接收到所有准备请求的回复后，它会对事务作出最终的决定，即是提交还是中止事务（只有所有参与者都投票“是”才能提交）。协调员必须将该决定写入其磁盘上的事务日志，以便在随后发生崩溃时知道其决定的方向。这就是所谓的提交点。
Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers—since the participant voted “yes,” it cannot refuse to commit when it recovers.

一旦协调者的决定已被写入磁盘，提交或中止请求将发送给所有参与者。如果该请求失败或超时，协调者必须永远重试，直到成功。没有回头路：如果决定是提交，那么无论需要多少次重试，该决定都必须执行。如果参与者在此期间崩溃，那么当其恢复时，交易将被提交，因为参与者投了“是”票，因此在恢复时不能拒绝提交。

Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit record to the transaction log.)

因此，该协议包含两个至关重要的“无回头点”：当参与者投票“是”，它承诺它肯定能够稍后提交（虽然管理员仍可能选择放弃）; 一旦协调员做出决定，该决定就是不可撤销的。这些承诺确保了2PC的原子性。（单节点原子提交将这两个事件合并为一个：将提交记录写入事务日志。）

Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the minister speak the words “You are now husband and wife,” that doesn’t change the fact that the transaction was committed. When you recover consciousness later, you can find out whether you are married or not by querying the minister for the status of your global transaction ID, or you can wait for the minister’s next retry of the commit request (since the retries will have continued throughout your period of unconsciousness).

回到婚姻的类比，在说“我愿意”之前，你和你的新娘/新郎有自由地通过说“绝不！”或者其他类似的话终止交易。但是，一旦你说了“我愿意”，你就不能收回。如果在说“我愿意”之后你晕倒了，没有听到牧师说“你现在成为夫妻了”，这并不会改变交易已经完成的事实。当你恢复意识后，你可以通过查询牧师的全局交易ID的状态来了解你是否结婚了，或者你可以等待牧师下一次重试提交请求（因为在你失去意识的期间，重试将一直进行）。

Coordinator failure

We have discussed what happens if one of the participants or the network fails during 2PC: if any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely. However, it is less clear what happens if the coordinator crashes.

如果在两阶段提交期间，参与者或网络出现故障，我们已经讨论了出现的情况：如果任何一方的准备请求失败或超时，协调者将中止事务；如果提交或中止请求失败，协调者将无限重试。然而，如果协调者崩溃，会发生什么就不那么清楚了。

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain .

如果协调器在发送准备请求之前失败，参与者可以安全地中止交易。但是，一旦参与者收到准备请求并投票“是”，它就不能再单方面中止交易了——它必须等待协调器回复交易是已提交还是已中止。如果此时协调器崩溃或网络失败，则参与者除了等待外无能为力。这种情况下参与者的交易被称为不确定或不确定状态。

The situation is illustrated in Figure 9-10 . In this particular example, the coordinator actually decided to commit, and database 2 received the commit request. However, the coordinator crashed before it could send the commit request to database 1, and so database 1 does not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, it is not safe to unilaterally commit, because another participant may have aborted.

情况如图9-10所示。在这个特殊例子中，协调器实际上决定提交，数据库2接收到提交请求。然而，在协调器发送提交请求到数据库1之前，协调器崩溃了，所以数据库1不知道是要提交还是要中止。即使超时也无法解决这个问题：如果数据库1在超时后单方面中止，它将与已经提交的数据库2不一致。同样地，单方面提交也不安全，因为另一个参与者可能已经中止。

Without hearing from the coordinator, the participant has no way of knowing whether to commit or abort. In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.

如果没有听到协调员的消息，参与者就没有办法知道是要继续还是中止。原则上，参与者可以相互沟通，了解每个人的投票情况并达成共识，但这不是2PC协议的一部分。

The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.

2PC只能等待协调者恢复才能完成。这就是为什么在向参与者发送提交或中止请求之前，协调者必须将其提交或中止决策写入磁盘上的事务日志中的原因：当协调者恢复时，它通过读取其事务日志确定所有存疑事务的状态。在协调者的日志中没有提交记录的任何事务都会被中止。因此，2PC的提交点归结为协调者上的常规单节点原子提交。

Three-phase commit

Two-phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover. In theory, it is possible to make an atomic commit protocol nonblocking , so that it does not get stuck if a node fails. However, making this work in practice is not so straightforward.

两阶段提交被称为阻塞原子提交协议，因为2PC可能会被卡住，等待协调者恢复。理论上，可能会使原子提交协议非阻塞，这样即使节点失败也不会被卡住。然而，在实践中使其正常工作并不那么简单。

As an alternative to 2PC, an algorithm called three-phase commit (3PC) has been proposed [ 13 , 80 ]. However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most practical systems with unbounded network delay and process pauses (see Chapter 8 ), it cannot guarantee atomicity.

作为 2PC 的替代方案，提出了一种称为三阶段提交（3PC）的算法[13, 80]。然而，3PC 假设网络具有有限延迟和节点具有有限响应时间；在大多数具有无界网络延迟和进程暂停的实际系统中（请参见第 8 章），无法保证原子性。

In general, nonblocking atomic commit requires a perfect failure detector [ 67 , 71 ]—i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay a timeout is not a reliable failure detector, because a request may time out due to a network problem even if no node has crashed. For this reason, 2PC continues to be used, despite the known problem with coordinator failure.

一般而言，非阻塞原子提交需要完美的失效检测器[67，71]，即可靠的机制来判断节点是否已经崩溃。在延迟无限制的网络中，超时并不是可靠的故障检测器，因为由于网络问题，请求可能超时，即使没有节点崩溃。因此，尽管协调员故障是已知的问题，但2PC仍然被使用。

Distributed Transactions in Practice

Distributed transactions, especially those implemented with two-phase commit, have a mixed reputation. On the one hand, they are seen as providing an important safety guarantee that would be hard to achieve otherwise; on the other hand, they are criticized for causing operational problems, killing performance, and promising more than they can deliver [ 81 , 82 , 83 , 84 ]. Many cloud services choose not to implement distributed transactions due to the operational problems they engender [ 85 , 86 ].

分布式事务，特别是使用两阶段提交实现的事务，具有良好的声誉。一方面，它们被视为提供了一个重要的安全保证，否则很难达到；另一方面，它们被批评会引起操作问题，损害性能，并承诺无法实现的问题。许多云服务不选择实现分布式事务，因为它们会引起操作问题。

Some implementations of distributed transactions carry a heavy performance penalty—for example, distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions [ 87 ], so it is not surprising when people advise against using them. Much of the performance cost inherent in two-phase commit is due to the additional disk forcing ( fsync ) that is required for crash recovery [ 88 ], and the additional network round-trips.

一些分布式事务的实现会带来严重的性能惩罚，比如MySQL中的分布式事务被报道比单节点事务慢10倍以上，因此不难理解人们建议不要使用它们。两阶段提交中的许多性能成本归因于在崩溃恢复时需要的额外磁盘强制性操作（fsync），以及额外的网络往返。

However, rather than dismissing distributed transactions outright, we should examine them in some more detail, because there are important lessons to be learned from them. To begin, we should be precise about what we mean by “distributed transactions.” Two quite different types of distributed transactions are often conflated:

然而，我们不应该轻易地排斥分布式事务，而应该更加深入地研究它们，因为从中可以学到重要的教训。首先，我们应该明确我们所说的“分布式事务”是什么。通常会混淆两种非常不同的分布式事务类型：

Database-internal distributed transactions

Some distributed databases (i.e., databases that use replication and partitioning in their standard configuration) support internal transactions among the nodes of that database. For example, VoltDB and MySQL Cluster’s NDB storage engine have such internal transaction support. In this case, all the nodes participating in the transaction are running the same database software.

一些分布式数据库（即在其标准配置中使用复制和分区的数据库）支持该数据库节点间的内部事务。例如，VoltDB和MySQL Cluster的NDB存储引擎都具有此类内部事务支持。在这种情况下，参与事务的所有节点都运行相同的数据库软件。

Heterogeneous distributed transactions

In a heterogeneous transaction, the participants are two or more different technologies: for example, two databases from different vendors, or even non-database systems such as message brokers. A distributed transaction across these systems must ensure atomic commit, even though the systems may be entirely different under the hood.

在异构事务中，参与者是两个或更多不同的技术：例如，来自不同供应商的两个数据库，甚至非数据库系统，如消息代理。在这些系统之间的分布式事务必须确保原子提交，即使这些系统在内部完全不同。

Database-internal transactions do not have to be compatible with any other system, so they can use any protocol and apply optimizations specific to that particular technology. For that reason, database-internal distributed transactions can often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.

数据库内部事务不必与任何其他系统兼容，因此可以使用任何协议并应用特定于该特定技术的优化。因此，数据库内部分布式事务通常可以非常好地工作。另一方面，跨异构技术的事务要具有更大的挑战性。

Exactly-once message processing

Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For example, a message from a message queue can be acknowledged as processed if and only if the database transaction for processing the message was successfully committed. This is implemented by atomically committing the message acknowledgment and the database writes in a single transaction. With distributed transaction support, this is possible, even if the message broker and the database are two unrelated technologies running on different machines.

异构分布式事务使得不同的系统可以强大地集成在一起。例如，只有在成功提交处理消息的数据库事务后，消息队列中的消息才能被确认已处理。这是通过原子地在单个事务中提交消息确认和数据库写入来实现的。即使消息代理和数据库是运行在不同机器上的两种不相关技术，分布式事务支持也可以实现这一点。

If either the message delivery or the database transaction fails, both are aborted, and so the message broker may safely redeliver the message later. Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded. The abort discards any side effects of the partially completed transaction.

如果消息传递或数据库事务中任一一个失败，这两个都会被取消，因此消息中间件可以安全地稍后重发消息。因此，通过原子地提交消息和其处理的副作用，我们可以确保该消息被有效处理一次，即使在成功之前需要进行几次重试。中止会丢弃已部分完成事务的任何副作用。

Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol, however. For example, say a side effect of processing a message is to send an email, and the email server does not support two-phase commit: it could happen that the email is sent two or more times if message processing fails and is retried. But if all side effects of processing a message are rolled back on transaction abort, then the processing step can safely be retried as if nothing had happened.

如果事务受影响的所有系统都能够使用相同的原子提交协议，那么这样的分布式事务才有可能实现。例如，假设处理消息的一个副作用是发送电子邮件，而邮件服务器不支持两阶段提交：如果消息处理失败并重试，则可能发送两次或更多次电子邮件。但如果在事务中断时回滚处理消息的所有副作用，就可以安全地重试处理步骤，就像什么也没发生一样。

We will return to the topic of exactly-once message processing in Chapter 11 . Let’s look first at the atomic commit protocol that allows such heterogeneous distributed transactions.

我们将在第11章回到确保仅有一次消息处理的话题。首先让我们看一下允许这样异构分布式事务的原子提交协议。

XA transactions

X/Open XA (short for eXtended Architecture ) is a standard for implementing two-phase commit across heterogeneous technologies [ 76 , 77 ]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).

X/Open XA（扩展架构）是一种跨异构技术实现两阶段提交的标准 [76, 77]。它于1991年引入，并得到广泛应用：XA受到许多传统关系数据库（包括PostgreSQL、MySQL、DB2、SQL Server和Oracle）和消息代理（包括ActiveMQ、HornetQ、MSMQ和IBM MQ）的支持。

XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.

XA不是一个网络协议，它只是一个用于与事务协调器进行接口交互的C API。其他语言中也存在这个API的绑定；例如，在Java EE应用程序世界中，XA事务是使用Java事务API（JTA）实现的，而Java数据库连接（JDBC）和Java消息服务（JMS）API的许多数据库驱动程序和消息代理驱动程序都支持JTA。

XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. If the driver supports XA, that means it calls the XA API to find out whether an operation should be part of a distributed transaction—and if so, it sends the necessary information to the database server. The driver also exposes callbacks through which the coordinator can ask the participant to prepare, commit, or abort.

XA 假定您的应用程序使用网络驱动程序或客户端库来与参与者数据库或消息服务进行通信。如果该驱动程序支持 XA，则表示它会调用 XA API 来确定操作是否应是分布式事务的一部分，如果是，则向数据库服务器发送必要的信息。该驱动程序还通过回调公开回调功能，协调器可以请求参与者准备、提交或中止。

The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). It keeps track of the participants in a transaction, collects partipants’ responses after asking them to prepare (via a callback into the driver), and uses a log on the local disk to keep track of the commit/abort decision for each transaction.

事务协调器实现XA API。标准并未规定它应如何实现，但实际上，协调器通常只是一个库，加载到发出事务的应用程序的相同进程中（而不是一个单独的服务）。它跟踪事务中的参与者，在请求它们准备（通过驱动程序回调）后收集参与者的响应，并使用本地磁盘上的日志来跟踪每个事务的提交/中止决策。

If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction. Only then can the coordinator use the database driver’s XA callbacks to ask participants to commit or abort, as appropriate. The database server cannot contact the coordinator directly, since all communication must go via its client library.

如果应用程序崩溃，或运行应用程序的计算机死机，协调器也将失效。任何准备好但未提交的事务参与者此时就会陷入疑惑。由于协调器日志存储在应用服务器的本地磁盘上，因此必须重新启动该服务器，并且协调器库必须读取日志以恢复每个事务的提交/中止结果。只有这样，协调器才能使用数据库驱动程序的XA回调来要求参与者提交或中止。数据库服务器无法直接联系协调器，因为所有通信都必须通过其客户端库。

Holding locks while in doubt

Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?

为什么我们如此关注一笔交易是否被搁置在疑问当中？难道系统的其他部分不可以继续工作，而忽略最终会被清理的搁置交易吗？

The problem is with locking . As discussed in “Read Committed” , database transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty writes. In addition, if you want serializable isolation, a database using two-phase locking would also have to take a shared lock on any rows read by the transaction (see “Two-Phase Locking (2PL)” ).

问题出在锁定上。如在“读取已提交”中所讨论，数据库事务通常会对其修改的任何行进行行级独占锁定以防止脏写操作。此外，如果您想要可串行化的隔离级别，则使用两阶段锁定的数据库还必须对事务读取的任何行进行共享锁定（请参见“两阶段锁定（2PL）”）。

The database cannot release those locks until the transaction commits or aborts (illustrated as a shaded area in Figure 9-9 ). Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least until the situation is manually resolved by an administrator.

数据库在事务提交或中止前无法释放这些锁定（如图9-9中的阴影区所示）。因此，在使用两阶段提交时，事务必须在有疑问的情况下一直保持锁定。如果协调者崩溃并需要20分钟才能重新启动，则这些锁定将被保持20分钟。如果由于某种原因协调者的日志完全丢失，则这些锁将永远保持，或者至少直到管理员手动解决该情况。

While those locks are held, no other transaction can modify those rows. Depending on the database, other transactions may even be blocked from reading those rows. Thus, other transactions cannot simply continue with their business—if they want to access that same data, they will be blocked. This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.

当这些锁定被保留时，没有其他交易可以修改这些行。根据数据库，其他交易甚至可能被阻止阅读这些行。因此，其他交易不能简单地继续业务 - 如果他们想访问相同的数据，他们将被阻止。这可能导致您的应用程序的大部分部分变得不可用，直到未决交易解决。

Recovering from coordinator failure

In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the log and resolve any in-doubt transactions. However, in practice, orphaned in-doubt transactions do occur [ 89 , 90 ]—that is, transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because the transaction log has been lost or corrupted due to a software bug). These transactions cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions.

从理论上讲，如果协调者崩溃并重新启动，它应该从日志中清晰地恢复自己的状态并解决任何不确定的事务。然而，在实践中，孤立的不确定事务确实会发生[89, 90]--即无论出于什么原因（例如，由于软件漏洞而导致事务日志丢失或损坏），协调者无法决定结果的事务。这些事务无法自动解决，因此它们会永远停留在数据库中，持有锁并阻塞其他事务。

Even rebooting your database servers will not fix this problem, since a correct implementation of 2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk violating the atomicity guarantee). It’s a sticky situation.

即使重新启动数据库服务器也不会解决此问题，因为正确实现2PC必须保留不确定事务的锁甚至跨越重启 (否则它会冒失违反原子性保证)。这是一个棘手的情况。

The only way out is for an administrator to manually decide whether to commit or roll back the transactions. The administrator must examine the participants of each in-doubt transaction, determine whether any participant has committed or aborted already, and then apply the same outcome to the other participants. Resolving the problem potentially requires a lot of manual effort, and most likely needs to be done under high stress and time pressure during a serious production outage (otherwise, why would the coordinator be in such a bad state?).

唯一的解决方法是由管理员手动决定提交还是回滚交易。管理员必须检查每个不确定的事务的参与者，确定是否已经有任何参与者提交或中止，然后将相同的结果应用于其他参与者。解决这个问题可能需要大量的手动工作，很可能需要在严重的生产中断期间，在高压和时间压力下完成（否则，为什么协调者会处于如此糟糕的状态？）。

Many XA implementations have an emergency escape hatch called heuristic decisions : allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator [ 76 , 77 , 91 ]. To be clear, heuristic here is a euphemism for probably breaking atomicity , since it violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for getting out of catastrophic situations, and not for regular use.

许多XA实现都有一个紧急逃生方法，称为启发式决策：允许参与者在未从协调者得到明确定论决策的情况下单方面决定中止或提交一个未确定的事务[76，77，91]。需要指出的是，在这里，“启发式”是破坏原子性的委婉说法，因为它违反了两阶段提交的承诺系统。因此，启发式决策仅适用于紧急情况，而非常规情况下的使用。

Limitations of distributed transactions

XA transactions solve the real and important problem of keeping several participant data systems consistent with each other, but as we have seen, they also introduce major operational problems. In particular, the key realization is that the transaction coordinator is itself a kind of database (in which transaction outcomes are stored), and so it needs to be approached with the same care as any other important database:

XA事务解决了保持多个参与方数据系统相互一致的真正重要问题，但正如我们所看到的，它们也引入了重大的操作问题。特别是，关键的认识是事务协调器本身就像是一个数据库（其中存储着事务结果），因此需要像任何其他重要数据库一样小心对待。

If the coordinator is not replicated but runs only on a single machine, it is a single point of failure for the entire system (since its failure causes other application servers to block on locks held by in-doubt transactions). Surprisingly, many coordinator implementations are not highly available by default, or have only rudimentary replication support.

如果协调器未被复制，而只在单台机器上运行，那么整个系统都存在单点故障（因为它的故障会导致其他应用服务器因为维护不确定事务持有的锁而被阻塞）。令人惊讶的是，许多协调器实现默认情况下并非高可用性的，或者只有基本的复制支持。
Many server-side applications are developed in a stateless model (as favored by HTTP), with all persistent state stored in a database, which has the advantage that application servers can be added and removed at will. However, when the coordinator is part of the application server, it changes the nature of the deployment. Suddenly, the coordinator’s logs become a crucial part of the durable system state—as important as the databases themselves, since the coordinator logs are required in order to recover in-doubt transactions after a crash. Such application servers are no longer stateless.

许多服务器端应用程序采用无状态模型（正如HTTP所支持的），所有持久状态都存储在数据库中，这样的好处是应用程序服务器可以随意添加和删除。但是当协调器成为应用服务器的一部分时，它改变了部署的性质。突然间，协调器的日志成为了持久系统状态的关键部分，和数据库本身一样重要，因为在崩溃后需要使用协调器的日志来恢复未确定的事务。这样的应用程序服务器不再是无状态的。
Since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator. For example, it cannot detect deadlocks across different systems (since that would require a standardized protocol for systems to exchange information on the locks that each transaction is waiting for), and it does not work with SSI (see “Serializable Snapshot Isolation (SSI)” ), since that would require a protocol for identifying conflicts across different systems.

由于XA需要与各种数据系统兼容，因此它必然是最低公共分母。例如，它无法在不同系统之间检测死锁（因为这需要标准化协议来交换有关每个事务正在等待的锁的信息），并且它不能与SSI（参见“可串行快照隔离（SSI）”）一起工作，因为这需要一个用于识别不同系统之间冲突的协议。
For database-internal distributed transactions (not XA), the limitations are not so great—for example, a distributed version of SSI is possible. However, there remains the problem that for 2PC to successfully commit a transaction, all participants must respond. Consequently, if any part of the system is broken, the transaction also fails. Distributed transactions thus have a tendency of amplifying failures , which runs counter to our goal of building fault-tolerant systems.

对于数据库内部分布式事务（非XA事务），限制并不是很大，例如可以实现分布式的SSI版本。然而，仍然存在一个问题，就是为了使2PC成功提交一个事务，所有参与者都必须做出响应。因此，如果系统的任何部分出现故障，事务也会失败。分布式事务因此具有放大故障的趋势，这与我们构建容错系统的目标相悖。

Do these facts mean we should give up all hope of keeping several systems consistent with each other? Not quite—there are alternative methods that allow us to achieve the same thing without the pain of heterogeneous distributed transactions. We will return to these in Chapters 11 and 12 . But first, we should wrap up the topic of consensus.

这些事实是否意味着我们应该放弃所有希望让多个系统保持一致？并非如此——有替代方法可以让我们在不进行异构分布式事务的痛苦情况下实现相同的目标。我们将在第11和12章回到这些方法。但首先，我们应该结束共识主题。

Fault-Tolerant Consensus

Informally, consensus means getting several nodes to agree on something. For example, if several people concurrently try to book the last seat on an airplane, or the same seat in a theater, or try to register an account with the same username, then a consensus algorithm could be used to determine which one of these mutually incompatible operations should be the winner.

非正式地，共识意味着让几个节点就某事达成共识。例如，如果几个人同时尝试预订一架飞机上的最后一个座位，或者一个剧院里的同一个座位，或者尝试注册具有相同用户名的帐户，则可以使用共识算法来确定哪一个这些相互不兼容的操作应该胜出。

The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values. In the seat-booking example, when several customers are concurrently trying to buy the last seat, each node handling a customer request may propose the ID of the customer it is serving, and the decision indicates which one of those customers got the seat.

共识问题通常被形式化如下：一个或多个节点可能会提出值，而共识算法则决定其中的一个值。在预订座位的例子中，当几个客户同时尝试购买最后一个座位时，每个处理客户请求的节点可能会提出服务的客户的 ID，决策指示哪个客户得到了座位。

In this formalism, a consensus algorithm must satisfy the following properties [ 25 ]: ^xiii

在这个形式化表述中，共识算法必须满足以下性质[25]：xiii

Uniform agreement

No two nodes decide differently.

没有两个节点会做出不同的决定。

Integrity

No node decides twice.

没有节点决策两次。

Validity

If a node decides value v , then v was proposed by some node.

如果一个节点决定了值 v，那么 v 是由某个节点提出的。

Termination

Every node that does not crash eventually decides some value.

每个不崩溃的节点最终都会决定某个值。

The uniform agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind. The validity property exists mostly to rule out trivial solutions: for example, you could have an algorithm that always decides null , no matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but not the validity property.

一致性协议和完整性属性定义了共识的核心思想：每个人都决定相同的结果，一旦你做出决定，就不能改变主意。有效性属性主要存在于排除微不足道的解决方案：例如，你可以有一个算法，无论提议是什么，都决定为null；这个算法将满足一致性协议和完整性属性，但不符合有效性属性。

If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can just hardcode one node to be the “dictator,” and let that node make all of the decisions. However, if that one node fails, then the system can no longer make any decisions. This is, in fact, what we saw in the case of two-phase commit: if the coordinator fails, in-doubt participants cannot decide whether to commit or abort.

如果您不关心容错性，那么满足前三个属性是很容易的：您只需将一个节点硬编码为“独裁者”，并让该节点做出所有决策。然而，如果该节点失败，那么系统将无法做出任何决策。事实上，这就是我们在两阶段提交的情况下看到的情况：如果协调员失败，则存在疑问的参与者无法决定是提交还是中止。

The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a liveness property, whereas the other three are safety properties—see “Safety and liveness” .)

终止属性将容错的概念形式化。它基本上表明共识算法不能无休止地闲置下去，换句话说，它必须取得进展。即使某些节点失败，其他节点仍必须达成决定。（终止是活跃性质，而其他三种是安全性质——请参见“安全性和活跃性”。）

The system model of consensus assumes that when a node “crashes,” it suddenly disappears and never comes back. (Instead of a software crash, imagine that there is an earthquake, and the datacenter containing your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud and is never going to come back online.) In this system model, any algorithm that has to wait for a node to recover is not going to be able to satisfy the termination property. In particular, 2PC does not meet the requirements for termination.

共识的系统模型假设当节点"崩溃"时，它会突然消失并永远不会再回来。(不是软件崩溃，想象一下地震，包含您节点的数据中心被山体滑坡摧毁。您必须假设您的节点被30英尺深的泥土埋葬，永远不会再次联机。)在这个系统模型中，任何需要等待节点恢复的算法都无法满足终止属性。特别是，二阶段提交不符合终止要求。

Of course, if all nodes crash and none of them are running, then it is not possible for any algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination [ 67 ]. That majority can safely form a quorum (see “Quorums for reading and writing” ).

当然，如果所有节点都崩溃且没有任何一个正在运行，则任何算法都无法决定任何事情。算法能够容忍故障的次数是有限的：实际上，可以证明任何一种共识算法都需要至少半数的节点正常运行才能保证终止[67]。这种半数可以安全地形成一个法定人数(见“读写的法定人数”)。

Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable. However, most implementations of consensus ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem [ 92 ]. Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make invalid decisions.

因此，终止属性取决于假设节点数量少于一半会崩溃或无法访问。然而，大多数共识实现都确保安全性属性-协议、完整性和有效性-始终得到满足，即使大多数节点失败或存在严重的网络问题[92]。因此，大规模故障可能会使系统无法处理请求，但它不会通过导致做出无效决策来损坏共识系统。

Most consensus algorithms assume that there are no Byzantine faults, as discussed in “Byzantine Faults” . That is, if a node does not correctly follow the protocol (for example, if it sends contradictory messages to different nodes), it may break the safety properties of the protocol. It is possible to make consensus robust against Byzantine faults as long as fewer than one-third of the nodes are Byzantine-faulty [ 25 , 93 ], but we don’t have space to discuss those algorithms in detail in this book.

大多数共识算法假定没有拜占庭错误，如“拜占庭错误”中所讨论的那样。即，如果节点不正确地遵循协议（例如，如果它向不同节点发送矛盾的消息），它可能会破坏协议的安全性质。只要不到三分之一的节点存在拜占庭错误，就有可能使共识免受拜占庭错误的影响，但我们没有足够的空间在本书中详细讨论这些算法。

Consensus algorithms and total order broadcast

The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR) [ 94 , 95 ], Paxos [ 96 , 97 , 98 , 99 ], Raft [ 22 , 100 , 101 ], and Zab [ 15 , 21 , 102 ]. There are quite a few similarities between these algorithms, but they are not the same [ 103 ]. In this book we won’t go into full details of the different algorithms: it’s sufficient to be aware of some of the high-level ideas that they have in common, unless you’re implementing a consensus system yourself (which is probably not advisable—it’s hard [ 98 , 104 ]).

最为知名的容错共识算法包括Viewstamped Replication (VSR)[94, 95]、Paxos[96, 97, 98, 99]、Raft[22, 100, 101]和Zab[15, 21, 102]。这些算法有许多相似之处，但并非完全相同[103]。在本书中，我们不会详细介绍不同算法的细节：只需要了解它们在高级别想法方面的某些共性即可，除非你正在实现一个共识系统（这可能并不明智——很难[98, 104]）。

Most of these algorithms actually don’t directly use the formal model described here (proposing and deciding on a single value, while satisfying the agreement, integrity, validity, and termination properties). Instead, they decide on a sequence of values, which makes them total order broadcast algorithms, as discussed previously in this chapter (see “Total Order Broadcast” ).

大多数这些算法实际上并没有直接使用描述在这里的形式化模型（提出并决定单个值，同时满足协议、完整性、有效性和终止性属性）。相反，它们决定一个值的序列，使它们成为总排序广播算法，就像在本章之前讨论的那样（请参见“总排序广播”）。

Remember that total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, this is equivalent to performing several rounds of consensus: in each round, nodes propose the message that they want to send next, and then decide on the next message to be delivered in the total order [ 67 ].

全序广播要求所有消息被恰好传输一次，且顺序相同，发送到所有节点。如果你仔细想一想，这等同于进行多轮共识：在每一轮中，节点都会提出他们要发送的消息，然后决定下一个要传输到全序中的消息。

So, total order broadcast is equivalent to repeated rounds of consensus (each consensus decision corresponding to one message delivery):

因此，全局订单广播等同于重复的共识轮次（每个共识决策对应一条消息传递）。

Due to the agreement property of consensus, all nodes decide to deliver the same messages in the same order.

由于共识属性的协议，所有节点决定以相同顺序传递相同的消息。
Due to the integrity property, messages are not duplicated.

由于完整性属性，消息不会重复。
Due to the validity property, messages are not corrupted and not fabricated out of thin air.

由于有效性特性，消息不会被损坏或虚构出来。
Due to the termination property, messages are not lost.

由于终止属性，消息不会丢失。

Viewstamped Replication, Raft, and Zab implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus. In the case of Paxos, this optimization is known as Multi-Paxos.

视图共识复制、Raft和Zab直接实现了总序播，因为它比重复一次只传递一个值的共识更高效。在Paxos的情况下，这种优化被称为多Paxos。

Single-leader replication and consensus

In Chapter 5 we discussed single-leader replication (see “Leaders and Followers” ), which takes all the writes to the leader and applies them to the followers in the same order, thus keeping replicas up to date. Isn’t this essentially total order broadcast? How come we didn’t have to worry about consensus in Chapter 5 ?

在第5章，我们讨论了单领导副本复制(参见“领导者和追随者”)，它将所有写入操作发送到领导者并按照相同的顺序应用到追随者上，因此保持副本最新。这不是本质上的总序广播吗?为什么我们在第5章不必担心共识问题?

The answer comes down to how the leader is chosen. If the leader is manually chosen and configured by the humans in your operations team, you essentially have a “consensus algorithm” of the dictatorial variety: only one node is allowed to accept writes (i.e., make decisions about the order of writes in the replication log), and if that node goes down, the system becomes unavailable for writes until the operators manually configure a different node to be the leader. Such a system can work well in practice, but it does not satisfy the termination property of consensus because it requires human intervention in order to make progress.

答案取决于领导者是如何被选择的。如果领导者是由您的运营团队人工选择和配置的，那么您实际上拥有一种“独裁式”的共识算法：只允许一个节点接受写入（即决定复制日志中写入顺序的节点），如果该节点故障，则系统变得不可用，直到操作员手动配置一个不同的节点成为领导者。这样的系统在实践中可以很好地工作，但它并不满足共识的终止属性，因为它需要人为干预才能取得进展。

Some databases perform automatic leader election and failover, promoting a follower to be the new leader if the old leader fails (see “Handling Node Outages” ). This brings us closer to fault-tolerant total order broadcast, and thus to solving consensus.

一些数据库执行自动领导选举和故障转移，如果旧领导者失败，则晋升一个跟随者成为新领导者（参见“处理节点故障”）。这使我们更接近容错的完全排序广播，从而解决共识问题。

However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to be the leader, and consequently get the database into an inconsistent state. Thus, we need consensus in order to elect a leader. But if the consensus algorithms described here are actually total order broadcast algorithms, and total order broadcast is like single-leader replication, and single-leader replication requires a leader, then…

然而，存在一个问题。我们先前讨论了分裂大脑的问题，并说所有节点需要达成一致，以确定领导者，否则两个不同的节点都可能认为自己是领导者，从而使数据库进入不一致状态。因此，在选举领导者时，我们需要共识。但是，如果这里描述的共识算法实际上是总序广播算法，而总序广播就像单领导者复制，单领导者复制需要领导者，那么...

It seems that in order to elect a leader, we first need a leader. In order to solve consensus, we must first solve consensus. How do we break out of this conundrum?

似乎为了选举一位领袖，我们需要首先有一位领袖。为了解决共识问题，我们必须首先解决共识问题。我们如何打破这个困境？

Epoch numbering and quorums

All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number (called the ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.

迄今为止讨论的所有共识协议内部都以某种形式使用领导者，但他们并不保证该领导者是唯一的。相反，它们可以提供一个较弱的保证：该协议定义了一个时代号（在Paxos中称为投票号，在Viewstamped Replication中称为视图号，在Raft中称为任期号），并且保证在每个时代内，领导者是唯一的。

Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.

每当当前领导人被认为已经死亡时，节点之间就开始投票选举新领导人。此次选举将被赋予一个增加的时代号码，因此时代号码是完全有序和单调递增的。如果在两个不同时代中存在两个不同的领导者之间的冲突（也许是因为前任领导者实际上并未死亡），那么具有较高时代号码的领导者将占上风。

Before a leader is allowed to decide anything, it must first check that there isn’t some other leader with a higher epoch number which might take a conflicting decision. How does a leader know that it hasn’t been ousted by another node? Recall “The Truth Is Defined by the Majority” : a node cannot necessarily trust its own judgment—just because a node thinks that it is the leader, that does not necessarily mean the other nodes accept it as their leader.

领导想要做出决策，必须先检查是否有更高时期号的另一位领导可能做出冲突的决定。领导如何知道自己没有被其他节点取代？回想一下“真理由大多数定义”：节点不能必然信任自己的判断——仅仅因为一个节点认为自己是领导，这并不意味着其他节点也承认它是领导。

Instead, it must collect votes from a quorum of nodes (see “Quorums for reading and writing” ). For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal. The quorum typically, but not always, consists of a majority of nodes [ 105 ]. A node votes in favor of a proposal only if it is not aware of any other leader with a higher epoch.

相反，它必须从节点配额中收集投票（请参见“读写的配额”）。对于领导者想要做出的每个决策，它必须将提议的值发送给其他节点，并等待配额中的节点赞同提议。配额通常（但并不总是）由大多数节点组成[105]。仅当节点不知道任何具有较高时期的其他领导者时，节点才投赞成票。

Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The key insight is that the quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent leader election [ 105 ]. Thus, if the vote on a proposal does not reveal any higher-numbered epoch, the current leader can conclude that no leader election with a higher epoch number has happened, and therefore be sure that it still holds the leadership. It can then safely decide the proposed value.

因此，我们有两轮投票：第一次选举领导人，第二次投票表决领导人的提案。关键在于这两个投票的法定人数必须重叠：如果对提案的投票成功，至少投票支持者中的一个节点也必须参与了最近的领导人选举 [105]。因此，如果对提案的投票没有显示出任何更高的时期编号，现任领导人就可以得出结论，没有任何比当前更高时期编号的领导人选举发生过，因此可以确定仍然持有领导权。然后可以安全地决定建议的价值。

This voting process looks superficially similar to two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and that fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met. These differences are key to the correctness and fault tolerance of a consensus algorithm.

这个投票过程表面上看起来类似于两阶段提交。最大的区别在于，2PC中没有选举协调员，而容错一致性算法仅需要来自大多数节点的投票，而2PC要求每个参与者都投赞成票。此外，共识算法定义了恢复过程，使节点能够在选举新领导者后进入一致的状态，确保始终满足安全属性。这些差异对共识算法的正确性和容错性至关重要。

Limitations of consensus

Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant (able to make progress as long as a majority of nodes are working and reachable). They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way (see “Implementing linearizable storage using total order broadcast” ).

共识算法是分布式系统的重大突破：它们为一切不确定的系统带来了具体的安全属性（协议、完整性和有效性），并且它们依然具备容错性（只要大部分节点正常工作并且可达就可以实现进步）。它们提供总序列广播，因此也可以通过容错方式实现可线性操作（详见“使用总序列广播实现可线性存储”）。

Nevertheless, they are not used everywhere, because the benefits come at a cost.

然而，它们并非到处都被使用，因为好处需要付出代价。

The process by which nodes vote on proposals before they are decided is a kind of synchronous replication. As discussed in “Synchronous Versus Asynchronous Replication” , databases are often configured to use asynchronous replication. In this configuration, some committed data can potentially be lost on failover—but many people choose to accept this risk for the sake of better performance.

节点在决策之前对提案进行投票的过程属于同步复制的一种。如“同步与异步复制”所述，数据库通常被配置为使用异步复制。在此配置中，一些已提交的数据可能会在故障转移时丢失 - 但许多人选择为了更好的性能而接受这种风险。

Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure (the remaining two out of three form a majority), or a minimum of five nodes to tolerate two failures (the remaining three out of five form a majority). If a network failure cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest is blocked (see also “The Cost of Linearizability” ).

共识系统总是需要严格的多数来运作。这意味着你需要至少三个节点才能容忍一个故障（三个中的两个形成多数），或者至少五个节点来容忍两个故障（五个中的三个形成多数）。如果网络故障将一些节点从其余部分中隔离开来，则只有网络的多数部分才能取得进展，其余部分将被阻塞（也请参见“线性一致性成本”）。

Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster. Dynamic membership extensions to consensus algorithms allow the set of nodes in the cluster to change over time, but they are much less well understood than static membership algorithms.

大多数共识算法假定参与投票的节点集是固定的，这意味着您不能只是添加或移除集群中的节点。动态成员扩展共识算法允许集群中的节点集随时间变化，但它们比静态成员算法不太容易理解。

Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue. Although this error does not harm the safety properties, frequent leader elections result in terrible performance because the system can end up spending more time choosing a leader than doing any useful work.

共识系统通常依靠超时来检测故障节点。在具有高度可变网络延迟的环境中，特别是分布式系统中，往往会发生节点错误地认为领导者由于短暂的网络问题而失败的情况。尽管这种错误不会损害安全性能，但频繁的领导人选举会导致可怕的性能问题，因为系统最终可能会花费更多时间选择一个领导者而不是执行任何有用的工作。

Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft has been shown to have unpleasant edge cases [ 106 ]: if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes, or the current leader is continually forced to resign, so the system effectively never makes progress. Other consensus algorithms have similar problems, and designing algorithms that are more robust to unreliable networks is still an open research problem.

有时，共识算法特别敏感于网络问题。例如，Raft已经被证明有不愉快的边缘情况 [106]: 如果整个网络除了一个特定的网络链接出现问题以外都正常，Raft会陷入领导者不断在两个节点间跳转，或当前领导者被迫不断辞职的情况，因此系统实际上无法进展。其他共识算法也有类似的问题，设计更加适应不可靠网络的算法仍然是一项开放的研究问题。

Membership and Coordination Services

Projects like ZooKeeper or etcd are often described as “distributed key-value stores” or “coordination and configuration services.” The API of such a service looks pretty much like that of a database: you can read and write the value for a given key, and iterate over keys. So if they’re basically databases, why do they go to all the effort of implementing a consensus algorithm? What makes them different from any other kind of database?

像ZooKeeper或etcd这样的项目通常被称为“分布式键值存储”或“协调和配置服务”。这样的服务的API看起来非常像数据库：您可以读取和写入给定键的值，并迭代键。那么，如果它们基本上是数据库，为什么它们要付出所有实现共识算法的努力呢？它们与任何其他类型的数据库有何不同之处？

To understand this, it is helpful to briefly explore how a service like ZooKeeper is used. As an application developer, you will rarely need to use ZooKeeper directly, because it is actually not well suited as a general-purpose database. It is more likely that you will end up relying on it indirectly via some other project: for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background. What is it that these projects get from it?

为了理解这个，简单探索一下像ZooKeeper这样的服务的使用是有帮助的。作为一个应用程序开发人员，你很少需要直接使用ZooKeeper，因为它实际上不适合作为通用数据库使用。更有可能的是，你将间接地依赖它通过一些其他项目：例如，HBase，Hadoop YARN，OpenStack Nova和Kafka都依赖于ZooKeeper在后台运行。这些项目从中得到了什么？

ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability)—so you wouldn’t want to store all of your application’s data here. That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm. As discussed previously, total order broadcast is just what you need for database replication: if each message represents a write to the database, applying the same writes in the same order keeps replicas consistent with each other.

ZooKeeper和etcd被设计用于存储小量的数据，可以完全适应于内存（尽管它们仍然写入磁盘以实现耐久性） - 所以您不希望在这里存储应用程序的所有数据。这小小的数据量使用容错总序广播算法在所有节点中复制。如前所述，总序广播是数据库复制所需的：如果每个消息代表对数据库的写入，则按照相同的顺序应用相同的写入保持副本互相一致。

ZooKeeper is modeled after Google’s Chubby lock service [ 14 , 98 ], implementing not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:

ZooKeeper是根据Google的Chubby锁服务[14,98]建模的，不仅实现了完全有序广播（因此协商），而且还实现了一系列其他有趣的功能，这些功能在构建分布式系统时特别有用：

Linearizable atomic operations

Using an atomic compare-and-set operation, you can implement a lock: if several nodes concurrently try to perform the same operation, only one of them will succeed. The consensus protocol guarantees that the operation will be atomic and linearizable, even if a node fails or the network is interrupted at any point. A distributed lock is usually implemented as a lease , which has an expiry time so that it is eventually released in case the client fails (see “Process Pauses” ).

使用原子比较和设置操作，您可以实现一个锁：如果多个节点同时尝试执行相同的操作，只有其中一个会成功。共识协议保证操作是原子和可线性化的，即使节点失败或网络在任何时候中断。分布式锁通常作为租约实现，该租约具有过期时间，以便在客户端失败时最终释放（请参见“进程暂停”）。

Total ordering of operations

As discussed in “The leader and the lock” , when some resource is protected by a lock or lease, you need a fencing token to prevent clients from conflicting with each other in the case of a process pause. The fencing token is some number that monotonically increases every time the lock is acquired. ZooKeeper provides this by totally ordering all operations and giving each operation a monotonically increasing transaction ID ( zxid ) and version number ( cversion ) [ 15 ].

当一个资源通过锁或租赁得到保护时，需要一个围栏令牌来防止在进程暂停的情况下客户端相互冲突。围栏令牌是一个数字，每次获取锁时单调递增。ZooKeeper通过完全排序所有操作并为每个操作分配单调递增的事务ID（zxid）和版本号（cversion）[15]来提供此功能。

Failure detection

Clients maintain a long-lived session on ZooKeeper servers, and the client and server periodically exchange heartbeats to check that the other node is still alive. Even if the connection is temporarily interrupted, or a ZooKeeper node fails, the session remains active. However, if the heartbeats cease for a duration that is longer than the session timeout, ZooKeeper declares the session to be dead. Any locks held by a session can be configured to be automatically released when the session times out (ZooKeeper calls these ephemeral nodes ).

客户端在ZooKeeper服务器上维护一个长期的会话，客户端和服务器定期交换心跳以检查对方节点是否仍然活着。即使连接暂时中断或ZooKeeper节点失败，会话仍然保持活动状态。但是，如果心跳停止的持续时间超过会话超时时间，ZooKeeper会宣布会话已死亡。会话持有的任何锁都可以配置为在会话超时时自动释放（ZooKeeper将这些节点称为临时节点）。

Change notifications

Not only can one client read locks and values that were created by another client, but it can also watch them for changes. Thus, a client can find out when another client joins the cluster (based on the value it writes to ZooKeeper), or if another client fails (because its session times out and its ephemeral nodes disappear). By subscribing to notifications, a client avoids having to frequently poll to find out about changes.

不仅一个客户端可以读取由另一个客户端创建的锁和值，它也可以监视它们的变化。因此，一个客户端可以发现另一个客户端何时加入集群（基于它写入ZooKeeper的值），或者如果另一个客户端失败（因为其会话超时并且其短暂节点消失）。通过订阅通知，客户端避免了频繁轮询以了解更改的麻烦。

Of these features, only the linearizable atomic operations really require consensus. However, it is the combination of these features that makes systems like ZooKeeper so useful for distributed coordination.

其中，只有可线性化的原子操作确实需要共识。然而，正是这些特征的结合使得像ZooKeeper这样的系统在分布式协调方面非常有用。

Allocating work to nodes

One example in which the ZooKeeper/Chubby model works well is if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is of course useful for single-leader databases, but it’s also useful for job schedulers and similar stateful systems.

ZooKeeper/Chubby 模型运作良好的一个例子是，如果您有几个进程或服务的实例，并且需要选择其中一个作为领导者或主节点。如果领导者失败，另一个节点应该接管。当然，这对于单领导数据库非常有用，但对于工作调度程序和类似的有状态系统也非常有用。

Another example arises when you have some partitioned resource (database, message streams, file storage, distributed actor system, etc.) and need to decide which partition to assign to which node. As new nodes join the cluster, some of the partitions need to be moved from existing nodes to the new nodes in order to rebalance the load (see “Rebalancing Partitions” ). As nodes are removed or fail, other nodes need to take over the failed nodes’ work.

当您拥有一些分区资源（数据库、消息流、文件存储、分布式应用程序系统等），并且需要决定将哪个分区分配给哪个节点时，另一个例子就会出现。随着新节点加入群集，一些分区需要从现有节点移动到新节点，以便重新平衡负载（请参见“重新平衡分区”）。当节点被移除或失败时，其他节点需要接管失败节点的工作。

These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and notifications in ZooKeeper. If done correctly, this approach allows the application to automatically recover from faults without human intervention. It’s not easy, despite the appearance of libraries such as Apache Curator [ 17 ] that have sprung up to provide higher-level tools on top of the ZooKeeper client API—but it is still much better than attempting to implement the necessary consensus algorithms from scratch, which has a poor success record [ 107 ].

这些任务可以通过在ZooKeeper中谨慎使用原子操作、临时节点和通知来实现。如果正确地执行，这种方法允许应用程序在没有人为干预的情况下自动从故障中恢复。尽管出现了像Apache Curator [17]这样的库，为ZooKeeper客户端API提供了更高级别的工具，但这并不容易——但它仍然比尝试从头开始实现必要的一致性算法要好得多，后者的成功记录很差[107]。

An application may initially run only on a single node, but eventually may grow to thousands of nodes. Trying to perform majority votes over so many nodes would be terribly inefficient. Instead, ZooKeeper runs on a fixed number of nodes (usually three or five) and performs its majority votes among those nodes while supporting a potentially large number of clients. Thus, ZooKeeper provides a way of “outsourcing” some of the work of coordinating nodes (consensus, operation ordering, and failure detection) to an external service.

一个应用程序最初可能只在一个节点上运行，但最终可能会扩展到数千个节点。尝试在这么多节点上执行多数投票将非常低效。相反，ZooKeeper 在固定数量的节点上运行（通常为三个或五个），并在这些节点上执行其多数投票，同时支持潜在的大量客户端。因此，ZooKeeper 提供了一种将协调节点的一些工作（共识、操作排序和故障检测）“外包”给外部服务的方式。

Normally, the kind of data managed by ZooKeeper is quite slow-changing: it represents information like “the node running on 10.1.1.23 is the leader for partition 7,” which may change on a timescale of minutes or hours. It is not intended for storing the runtime state of the application, which may change thousands or even millions of times per second. If application state needs to be replicated from one node to another, other tools (such as Apache BookKeeper [ 108 ]) can be used.

通常，ZooKeeper 管理的数据类型变化非常缓慢：例如，“运行在 10.1.1.23 的节点是分区 7 的领导者”，这类信息的变化可能在几分钟或几小时内发生。它并不适用于存储应用程序的运行时状态，因为应用程序状态可能每秒钟甚至每毫秒都会发生数千次或数百万次的变化。如果需要将应用程序状态从一个节点复制到另一个节点，则可以使用其他工具（例如 Apache BookKeeper [108]）。

Service discovery

ZooKeeper, etcd, and Consul are also often used for service discovery —that is, to find out which IP address you need to connect to in order to reach a particular service. In cloud datacenter environments, where it is common for virtual machines to continually come and go, you often don’t know the IP addresses of your services ahead of time. Instead, you can configure your services such that when they start up they register their network endpoints in a service registry, where they can then be found by other services.

ZooKeeper、etcd 和 Consul 也常用于服务发现，即查找连接到特定服务所需的 IP 地址。在云数据中心环境中，由于虚拟机经常出现并消失，你通常无法提前知道服务的 IP 地址。因此，你可以配置服务，在它们启动时注册其网络端点到服务注册表中，在那里其他服务可以找到它们。

However, it is less clear whether service discovery actually requires consensus. DNS is the traditional way of looking up the IP address for a service name, and it uses multiple layers of caching to achieve good performance and availability. Reads from DNS are absolutely not linearizable, and it is usually not considered problematic if the results from a DNS query are a little stale [ 109 ]. It is more important that DNS is reliably available and robust to network interruptions.

然而，服务发现是否真的需要共识尚不太清楚。DNS是查找服务名称的IP地址的传统方法，它使用多层缓存来实现良好的性能和可用性。DNS的读取绝对不是线性化的，如果DNS查询的结果有点过时，通常不会被认为是有问题的。更重要的是，DNS可靠可用且能够抵御网络中断。

Although service discovery does not require consensus, leader election does. Thus, if your consensus system already knows who the leader is, then it can make sense to also use that information to help other services discover who the leader is. For this purpose, some consensus systems support read-only caching replicas. These replicas asynchronously receive the log of all decisions of the consensus algorithm, but do not actively participate in voting. They are therefore able to serve read requests that do not need to be linearizable.

虽然服务发现不需要共识，但领导者选举需要。因此，如果您的共识系统已经知道领导者是谁，那么使用该信息来帮助其他服务发现领导者是有意义的。为此，一些共识系统支持只读缓存副本。这些副本异步接收所有共识算法决策的日志，但不参与投票。因此，它们能够为不需要线性化的读取请求提供服务。

Membership services

ZooKeeper and friends can be seen as part of a long history of research into membership services , which goes back to the 1980s and has been important for building highly reliable systems, e.g., for air traffic control [ 110 ].

ZooKeeper和它的伙伴们可以被视为研究成员服务的悠久历史的一部分，这始于20世纪80年代，对于构建高度可靠的系统非常重要，例如空中交通管制[110]。

A membership service determines which nodes are currently active and live members of a cluster. As we saw throughout Chapter 8 , due to unbounded network delays it’s not possible to reliably detect whether another node has failed. However, if you couple failure detection with consensus, nodes can come to an agreement about which nodes should be considered alive or not.

会员服务确定当前哪些节点是集群的活动成员。就像我们在第8章中看到的那样，由于无限制的网络延迟，不可能可靠地检测另一个节点是否失败。但是，如果将故障检测与共识结合起来，节点可以就应认为是活动或非活动状态达成一致。

It could still happen that a node is incorrectly declared dead by consensus, even though it is actually alive. But it is nevertheless very useful for a system to have agreement on which nodes constitute the current membership. For example, choosing a leader could mean simply choosing the lowest-numbered among the current members, but this approach would not work if different nodes have divergent opinions on who the current members are.

可能仍然会发生共识错误地宣布某个节点已经死亡，即使它实际上是活着的。但是对于系统来说，达成关于哪些节点构成当前成员的一致性是非常有用的。例如，选择一个领导者可能意味着只需选择当前成员中编号最低的，但是如果不同节点对当前成员有不同的看法，这种方法就行不通了。

Summary

In this chapter we examined the topics of consistency and consensus from several different angles. We looked in depth at linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically. Although linearizability is appealing because it is easy to understand—it makes a database behave like a variable in a single-threaded program—it has the downside of being slow, especially in environments with large network delays.

在本章中，我们从多个角度探讨了一致性和共识的主题。我们深入研究了线性一致性这种流行的一致性模型：它的目标是让复制的数据看起来只有一个副本，并使所有操作在原子级别上执行。尽管线性一致性很吸引人，因为它易于理解-它使数据库表现得像单线程程序中的变量-但它的缺点是速度较慢，特别是在具有大网络延迟的环境中。

We also explored causality, which imposes an ordering on events in a system (what happened before what, based on cause and effect). Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.

我们还研究了因果关系，它对系统中事件的顺序施加了一个约束（根据因果关系发生在先）。与线性可操作性不同，它将所有操作放在单个完全有序的时间线上，因果一致性提供了一个更弱的一致性模型：有些事情可以同时发生，因此版本历史就像带有分支和合并的时间表。因果一致性没有线性可操作性的协调开销，并且对网络问题的敏感度要小得多。

However, even if we capture the causal ordering (for example using Lamport timestamps), we saw that some things cannot be implemented this way: in “Timestamp ordering is not sufficient” we considered the example of ensuring that a username is unique and rejecting concurrent registrations for the same username. If one node is going to accept a registration, it needs to somehow know that another node isn’t concurrently in the process of registering the same name. This problem led us toward consensus .

然而，即使我们捕捉到因果顺序（例如使用Lamport时间戳），我们发现有些事情不能用这种方式实现：在“时间戳顺序不足够”中，我们考虑了确保用户名是唯一的并拒绝同时注册相同用户名的示例。如果一个节点将要接受注册，它需要确切地知道另一个节点是否正在同时注册相同的名称。这个问题让我们思考到了共识。

We saw that achieving consensus means deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable. With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other (in the sense that if you have a solution for one of them, you can easily transform it into a solution for one of the others). Such equivalent problems include:

我们发现，达成共识意味着以一种让所有节点都同意所决定的事情的方式做出决定，并且决定是不可撤销的。经过一番调查，事实证明广泛的问题实际上都可以归结为共识，并且它们彼此之间是等效的（也就是说，如果你有其中一个问题的解决方案，你可以轻松地将它转化为另一个问题的解决方案）。这样的等效问题包括：

Linearizable compare-and-set registers

The register needs to atomically decide whether to set its value, based on whether its current value equals the parameter given in the operation.

寄存器需要在原子级别决定是否设置其值，基于当前值是否等于操作中给定的参数。

Atomic transaction commit

A database must decide whether to commit or abort a distributed transaction.

一个数据库必须决定是提交还是中止分布式事务。

Total order broadcast

The messaging system must decide on the order in which to deliver messages.

消息系统必须决定交付消息的顺序。

Locks and leases

When several clients are racing to grab a lock or lease, the lock decides which one successfully acquired it.

当多个客户端争夺锁或租赁时，锁会决定哪一个成功获得它。

Membership/coordination service

Given a failure detector (e.g., timeouts), the system must decide which nodes are alive, and which should be considered dead because their sessions timed out.

给定一个故障检测器（例如超时），系统必须决定哪些节点是存活的，哪些应该被认为是已经超时的死亡节点。

Uniqueness constraint

When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.

当多个事务同时尝试使用相同的关键字创建冲突记录时，约束必须决定哪个事务可以成功执行，哪个应该因为违反约束而失败。

All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations, uniqueness constraints, a totally ordered replication log, and more.

如果只有一个节点，或者愿意将决策能力分配给单个节点，所有这些都很简单。在单个领导者数据库中，所有决策权力都被授予领导者，这就是为什么这样的数据库能够提供线性化操作、唯一性约束、完全有序的复制日志等功能的原因。

However, if that single leader fails, or if a network interruption makes the leader unreachable, such a system becomes unable to make any progress. There are three ways of handling that situation:

然而，如果这个单一领导者失败，或者网络中断使领导者无法联系，这样的系统就无法继续前进。解决这种情况有三种方法：

Wait for the leader to recover, and accept that the system will be blocked in the meantime. Many XA/JTA transaction coordinators choose this option. This approach does not fully solve consensus because it does not satisfy the termination property: if the leader does not recover, the system can be blocked forever.

等待领导者恢复，并接受系统在此期间将被阻塞。许多XA/JTA事务协调器选择这个选项。这种方法并未完全解决共识，因为它并不满足终止属性：如果领导者无法恢复，系统可能会永远被阻塞。
Manually fail over by getting humans to choose a new leader node and reconfigure the system to use it. Many relational databases take this approach. It is a kind of consensus by “act of God”—the human operator, outside of the computer system, makes the decision. The speed of failover is limited by the speed at which humans can act, which is generally slower than computers.

通过人工选择新的领导节点并重新配置系统来手动进行故障转移。许多关系型数据库采取此方法。这是一种由“上帝行为”形成的共识 - 人类操作员在计算机系统之外做出决策。故障转移的速度受到人类行动速度的限制，一般比计算机慢。
Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions [ 107 ].

使用一种算法来自动选择新领袖。这种方法需要共识算法，建议使用能够正确处理网络异常情况的经过验证的算法[107]。

Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes. Thus, in some sense, having a leader only “kicks the can down the road”: consensus is still required, only in a different place, and less frequently. The good news is that fault-tolerant algorithms and systems for consensus exist, and we briefly discussed them in this chapter.

尽管单一领导数据库可以在每次写入时提供线性化而无需执行共识算法，但仍需要共识来维护其领导和领导变更。因此，在某种意义上，仅有领导者“拖延”问题：仍需要共识，只是在不同的地方，且较少频繁。好消息是，具备容错性的共识算法和系统存在，我们在本章中简要讨论了它们。

Tools like ZooKeeper play an important role in providing an “outsourced” consensus, failure detection, and membership service that applications can use. It’s not easy to use, but it is much better than trying to develop your own algorithms that can withstand all the problems discussed in Chapter 8 . If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.

像ZooKeeper这样的工具在提供“外包”的共识、故障检测和成员服务方面起着重要作用，应用程序可利用其功能。尽管不易使用，但它比尝试开发能够经受住第八章中讨论的所有问题的自有算法要好得多。如果您想要做类似于共识可简化的事情，并且希望它具备容错性，则最好使用类似于ZooKeeper的工具。

Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus. The conflicts that occur in these systems (see “Handling Write Conflicts” ) are a consequence of not having consensus across different leaders, but maybe that’s okay: maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories.

然而，并非每个系统都必须要求共识：例如，无领导或多领导复制系统通常不使用全局共识。在这些系统中发生的冲突（参见“处理写冲突”）是由于不同领导者之间没有共识所导致的，但也许这是可以接受的：也许我们只需要在没有线性可用性的情况下应对，并学会更好地处理具有分支和合并版本历史的数据。

This chapter referenced a large body of research on the theory of distributed systems. Although the theoretical papers and proofs are not always easy to understand, and sometimes make unrealistic assumptions, they are incredibly valuable for informing practical work in this field: they help us reason about what can and cannot be done, and help us find the counterintuitive ways in which distributed systems are often flawed. If you have the time, the references are well worth exploring.

本章引用了大量有关分布式系统理论的研究。虽然这些理论论文和证明不总是容易理解，有时做出了不切实际的假设，但它们对于指导实践工作非常有价值：它们帮助我们思考什么能够、什么不能够做，并帮助我们找到分布式系统常常存在的反直觉漏洞。如果您有时间，这些参考资料也值得探索。

This brings us to the end of Part II of this book, in which we covered replication ( Chapter 5 ), partitioning ( Chapter 6 ), transactions ( Chapter 7 ), distributed system failure models ( Chapter 8 ), and finally consistency and consensus ( Chapter 9 ). Now that we have laid a firm foundation of theory, in Part III we will turn once again to more practical systems, and discuss how to build powerful applications from heterogeneous building blocks.

这就是本书第二部分的结尾，我们在本部分中涉及到了复制（第5章），分区（第6章），事务（第7章），分布式系统故障模型（第8章）以及一致性和共识（第9章）。现在我们已经奠定了扎实的理论基础，在第三部分中，我们将再次转向更实际的系统，讨论如何从不同的构建块中构建强大的应用程序。

Footnotes

ⁱ A subtle detail of this diagram is that it assumes the existence of a global clock, represented by the horizontal axis. Even though real systems typically don’t have accurate clocks (see “Unreliable Clocks” ), this assumption is okay: for the purposes of analyzing a distributed algorithm, we may pretend that an accurate global clock exists, as long as the algorithm doesn’t have access to it [ 47 ]. Instead, the algorithm can only see a mangled approximation of real time, as produced by a quartz oscillator and NTP.

这个图表的微小细节假设了全局时钟的存在，由水平轴表示。尽管真实系统通常没有准确的时钟（参见“不可靠的时钟”），但是这种假设是可以的：为了分析分布式算法，我们可以假装存在一个准确的全局时钟，只要算法没有访问它。相反，算法只能看到由石英振荡器和NTP生成的真实时间的扭曲近似。

ⁱⁱ A register in which reads may return either the old or the new value if they are concurrent with a write is known as a regular register [ 7 , 25 ].

一个读取可能会与写并发的旧值或新值的寄存器被称为常规寄存器[7，25]。

ⁱⁱⁱ Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale, since by default they can be served by any one of the replicas. You can optionally request a linearizable read: etcd calls this a quorum read [ 16 ], and in ZooKeeper you need to call sync() before the read [ 15 ]; see “Implementing linearizable storage using total order broadcast” .

ZooKeeper和etcd在写入方面提供线性写入，但读取可能是陈旧的，因为默认情况下可以由任何一个副本提供服务。您可以选择请求线性读取：etcd称之为仲裁读取[16]，而在ZooKeeper中，您需要在读取之前调用sync（）[15]，请参见“使用总顺序广播实现线性存储”。

^iv Partitioning (sharding) a single-leader database, so that there is a separate leader per partition, does not affect linearizability, since it is only a single-object guarantee. Cross-partition transactions are a different matter (see “Distributed Transactions and Consensus” ).

将单主数据库进行 iv 分区（切片），以使每个分区有单独的主节点，并不影响线性一致性，因为它仅仅是单对象的保证。跨分区事务是另一回事（请参见“分布式事务和一致性”）。

^v These two choices are sometimes known as CP (consistent but not available under network partitions) and AP (available but not consistent under network partitions), respectively. However, this classification scheme has several flaws [ 9 ], so it is best avoided.

这两种选择有时被称为CP（在网络分区下保持一致但不可用）和AP（在网络分区下可用但不一致），但是这个分类方案存在几个缺陷 [9]，因此最好避免使用。

^vi As discussed in “Network Faults in Practice” , this book uses partitioning to refer to deliberately breaking down a large dataset into smaller ones ( sharding ; see Chapter 6 ). By contrast, a network partition is a particular type of network fault, which we normally don’t consider separately from other kinds of faults. However, since it’s the P in CAP, we can’t avoid the confusion in this case.

正如《实际网络故障》所讨论的，本书使用分区来指将一个大数据集故意分成较小的数据集（分片；参见第6章）。相比之下，网络分区是一种特定类型的网络故障，我们通常不将其与其他故障分别考虑。然而，由于这是CAP中的P，因此在这种情况下我们不能避免混淆。

^vii A total order that is inconsistent with causality is easy to create, but not very useful. For example, you can generate a random UUID for each operation, and compare UUIDs lexicographically to define the total ordering of operations. This is a valid total order, but the random UUIDs tell you nothing about which operation actually happened first, or whether the operations were concurrent.

一个与因果关系不一致的全序很容易创建，但并不是很有用。例如，你可以为每个操作生成一个随机 UUID，并按字典序比较 UUID 来定义操作的总排序。这是一个有效的总排序，但随机 UUID 并不能告诉你哪个操作实际上是先发生的，或者这些操作是否同时发生。

^viii It is possible to make physical clock timestamps consistent with causality: in “Synchronized clocks for global snapshots” we discussed Google’s Spanner, which estimates the expected clock skew and waits out the uncertainty interval before committing a write. This method ensures that a causally later transaction is given a greater timestamp. However, most clocks cannot provide the required uncertainty metric.

viii 物理时钟的时间戳可以与因果关系保持一致：在“全局快照同步时钟”中，我们讨论了谷歌的Spanner，它估计了预期的时钟偏差，并在提交写入之前等待不确定性间隔。这种方法确保因果关系更晚的事务被赋予更大的时间戳。然而，大多数时钟无法提供所需的不确定度量。

^ix The term atomic broadcast is traditional, but it is very confusing as it’s inconsistent with other uses of the word atomic : it has nothing to do with atomicity in ACID transactions and is only indirectly related to atomic operations (in the sense of multi-threaded programming) or atomic registers (linearizable storage). The term total order multicast is another synonym.

“原子广播”这个术语虽然传统，但与“原子性”在 ACID 事务中的使用不一致，与多线程编程中的“原子操作”或线性存储的“原子寄存器”只有间接关联，因此容易令人困惑。另一个同义词是“全局有序组播”。

^x In a formal sense, a linearizable read-write register is an “easier” problem. Total order broadcast is equivalent to consensus [ 67 ], which has no deterministic solution in the asynchronous crash-stop model [ 68 ], whereas a linearizable read-write register can be implemented in the same system model [ 23 , 24 , 25 ]. However, supporting atomic operations such as compare-and-set or increment-and-get in a register makes it equivalent to consensus [ 28 ]. Thus, the problems of consensus and a linearizable register are closely related.

从正式角度来看，可线性化的读写寄存器问题相对来说是“更简单”的。总序播送等同于共识[67]，而在异步宕机模型[68]中它没有确定性解。然而，可线性化的读写寄存器可以在同一系统模型中实现[23, 24, 25]。但是，将支持原子操作如比较-设置或增量和获取寄存器时，它就等效于共识[28]。因此，共识和可线性化寄存器的问题密切相关。

^xi If you don’t wait, but acknowledge the write immediately after it has been enqueued, you get something similar to the memory consistency model of multi-core x86 processors [ 43 ]. That model is neither linearizable nor sequentially consistent.

如果你不等待，而是在它被入队后立即确认写操作，你会得到类似于多核 x86 处理器的内存一致性模型[43]。该模型既不是线性化的，也不是顺序一致的。

^xii Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [ 70 , 71 ]. Nonblocking atomic commit is harder than consensus—see “Three-phase commit” .

原子提交和共识的规范略有不同：只有当所有参与方投票赞成提交时，原子事务才能提交，并且必须在任何参与方需要中止时中止。共识允许决定任何由参与方提出的值。但是，原子提交和共识可以互相简化。非阻塞式原子提交比共识更难 - 参见“三阶段提交”。

^xiii This particular variant of consensus is called uniform consensus , which is equivalent to regular consensus in asynchronous systems with unreliable failure detectors [ 71 ]. The academic literature usually refers to processes rather than nodes , but we use nodes here for consistency with the rest of this book.

这种共识的特定变化被称为统一共识，它在具有不可靠失败检测器的异步系统中等同于常规共识[71]。学术文献通常提到进程而不是节点，但我们在这里使用节点以保持本书的一致性。

References

[ 1 ] Peter Bailis and Ali Ghodsi: “ Eventual Consistency Today: Limitations, Extensions, and Beyond ,” ACM Queue , volume 11, number 3, pages 55-63, March 2013. doi:10.1145/2460276.2462076

[1] Peter Bailis和Ali Ghodsi：“如今的最终一致性：限制、扩展和未来”，ACM队列，卷11，号3，页55-63，2013年3月。doi：10.1145 / 2460276.2462076。

[ 2 ] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin: “ Consistency, Availability, and Convergence ,” University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011.

[2] Prince Mahajan，Lorenzo Alvisi和Mike Dahlin：“一致性、可用性和收敛性”，德克萨斯大学奥斯汀分校，计算机科学系，技术报告UTCS TR-11-22，2011年5月。

[ 3 ] Alex Scotti: “ Adventures in Building Your Own Database ,” at All Your Base , November 2015.

[3] Alex Scotti：在All Your Base 2015年11月的演讲“打造自己的数据库中的冒险”。

[ 4 ] Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “ Highly Available Transactions: Virtues and Limitations ,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014. Extended version published as pre-print arXiv:1302.0309 [cs.DB].

【4】Peter Bailis、Aaron Davidson、Alan Fekete等人：《高可用事务：优点与局限》，发表于第40届国际超大型数据库会议（VLDB），2014年9月。扩展版本已作为预印本arXiv:1302.0309 [cs.DB]发表。

[ 5 ] Paolo Viotti and Marko Vukolić: “ Consistency in Non-Transactional Distributed Storage Systems ,” arXiv:1512.00168, 12 April 2016.

[5] Paolo Viotti和Marko Vukolić： "非事务性分布式存储系统的一致性"，arXiv:1512.00168，2016年4月12日。

[ 6 ] Maurice P. Herlihy and Jeannette M. Wing: “ Linearizability: A Correctness Condition for Concurrent Objects ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 12, number 3, pages 463–492, July 1990. doi:10.1145/78969.78972

【6】Maurice P. Herlihy和 Jeannette M. Wing：“线性化：并发对象的正确性条件”，ACM Transactions on Programming Languages and Systems（TOPLAS），卷12，号3，页463-492，1990年7月。doi:10.1145/78969.78972。

[ 7 ] Leslie Lamport: “ On interprocess communication ,” Distributed Computing , volume 1, number 2, pages 77–101, June 1986. doi:10.1007/BF01786228

Leslie Lamport：“论进程间通信”，《分布式计算》杂志，第1卷，第2期，77-101页，1986年6月。DOI:10.1007/BF01786228。

[ 8 ] David K. Gifford: “ Information Storage in a Decentralized Computer System ,” Xerox Palo Alto Research Centers, CSL-81-8, June 1981.

[8] 大卫·基福德 (David K. Gifford): 《分散计算机系统中的信息存储》(Information Storage in a Decentralized Computer System)，施乐帕洛阿尔托研究中心 (Xerox Palo Alto Research Centers)，CSL-81-8，1981年6月。

[ 9 ] Martin Kleppmann: “ Please Stop Calling Databases CP or AP ,” martin.kleppmann.com , May 11, 2015.

请帮我翻译，“请不要再将数据库称为CP或AP”，马丁·克莱普曼（Martin Kleppmann），“martin.kleppmann.com”，2015年5月11日。

[ 10 ] Kyle Kingsbury: “ Call Me Maybe: MongoDB Stale Reads ,” aphyr.com , April 20, 2015.

[10] Kyle Kingsbury: “Call Me Maybe: MongoDB 过时读取”，aphyr.com，2015年4月20日。

[ 11 ] Kyle Kingsbury: “ Computational Techniques in Knossos ,” aphyr.com , May 17, 2014.

[11] Kyle Kingsbury： “Knossos中的计算技术”，aphyr.com，2014年5月17日。 [11] Kyle Kingsbury： “Knossos中的计算技术”，aphyr.com，2014年5月17日。

[ 12 ] Peter Bailis: “ Linearizability Versus Serializability ,” bailis.org , September 24, 2014.

[12] Peter Bailis：“线性化可靠性与序列化可靠性”，bailis.org，2014年9月24日。

[ 13 ] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems . Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com .

[13] Philip A. Bernstein，Vassos Hadzilacos和Nathan Goodman：数据库系统中的并发控制和恢复。 Addison-Wesley，1987年。 ISBN：978-0-201-10715-9，可在research.microsoft.com网站上在线获取。

[ 14 ] Mike Burrows: “ The Chubby Lock Service for Loosely-Coupled Distributed Systems ,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.

[14] Mike Burrows：“疏松分布式系统的Chubby Lock服务”，发表于第七届USENIX操作系统设计与实现研讨会（OSDI），2006年11月。

[ 15 ] Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination . O’Reilly Media, 2013. ISBN: 978-1-449-36130-3

[15] Flavio P. Junqueira和Benjamin Reed： ZooKeeper：分布式进程协调。O'Reilly Media，2013年。 ISBN：978-1-449-36130-3。

[ 16 ] “ etcd 2.0.12 Documentation ,” CoreOS, Inc., 2015.

「[16] “etcd 2.0.12 文档”，CoreOS, Inc.，2015年。」

[ 17 ] “ Apache Curator ,” Apache Software Foundation, curator.apache.org , 2015.

“Apache Curator”，Apache软件基金会，curator.apache.org，2015年。

[ 18 ] Morali Vallath: Oracle 10g RAC Grid, Services & Clustering . Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7

【18】Morali Vallath： Oracle 10g RAC格网、服务与集群技术。Elsevier Digital Press出版社，2006。 ISBN：978-1-555-58321-7。

[ 19 ] Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “ Coordination-Avoiding Database Systems ,” Proceedings of the VLDB Endowment , volume 8, number 3, pages 185–196, November 2014.

[19] Peter Bailis, Alan Fekete, Michael J Franklin等人： “避免协调的数据库系统”，《VLDB Endowment会议论文集》第8卷，第3期，185-196页，2014年11月。

[ 20 ] Kyle Kingsbury: “ Call Me Maybe: etcd and Consul ,” aphyr.com , June 9, 2014.

[20] Kyle Kingsbury：“Call Me Maybe：etcd和Consul”，aphyr.com，2014年6月9日。

[ 21 ] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini: “ Zab: High-Performance Broadcast for Primary-Backup Systems ,” at 41st IEEE International Conference on Dependable Systems and Networks (DSN), June 2011. doi:10.1109/DSN.2011.5958223

[21] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini: "Zab：高性能的主备系统广播协议"。41st IEEE 国际可靠系统和网络会议（DSN），2011年6月。doi:10.1109/DSN.2011.5958223。

[ 22 ] Diego Ongaro and John K. Ousterhout: “ In Search of an Understandable Consensus Algorithm (Extended Version) ,” at USENIX Annual Technical Conference (ATC), June 2014.

[22] Diego Ongaro和John K. Ousterhout： “在寻找可理解的一致性算法（扩展版）”，于2014年6月的USENIX年度技术会议上。

[ 23 ] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev: “ Sharing Memory Robustly in Message-Passing Systems ,” Journal of the ACM , volume 42, number 1, pages 124–142, January 1995. doi:10.1145/200836.200869

[23] Hagit Attiya, Amotz Bar-Noy 和 Danny Dolev：“在消息传递系统中可靠地共享内存”，ACM杂志，卷42期1，页码124-142，1995年1月。doi:10.1145/200836.200869。

[ 24 ] Nancy Lynch and Alex Shvartsman: “ Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts ,” at 27th Annual International Symposium on Fault-Tolerant Computing (FTCS), June 1997. doi:10.1109/FTCS.1997.614100

[24] 纳西·林奇（Nancy Lynch）和亚历克斯·施瓦茨曼（Alex Shvartsman）: “使用动态仲裁认可广播实现共享内存的强大仿真”，发表于1997年6月的第27届国际容错计算研讨会（FTCS）。 doi：10.1109/FTCS.1997.614100

[ 25 ] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues: Introduction to Reliable and Secure Distributed Programming , 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, doi:10.1007/978-3-642-15260-3

[25] Christian Cachin, Rachid Guerraoui, 和 Luís Rodrigues: 《可靠和安全的分布式编程导论》，第二版。Springer，2011年。ISBN: 978-3-642-15259-7， doi:10.1007/978-3-642-15260-3。

[ 26 ] Sam Elliott, Mark Allen, and Martin Kleppmann: personal communication , thread on twitter.com , October 15, 2015.

[26] Sam Elliott，Mark Allen和Martin Kleppmann：个人通信，Twitter.com上的主题，2015年10月15日。

[ 27 ] Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis: “ Possible Issue with Read Repair? ,” email thread on cassandra-dev mailing list, October 2012.

[27] 尼克拉斯·埃克斯特伦、米哈伊尔·潘琴科和乔纳森·埃利斯： “读修复可能存在问题？”，来自Cassandra开发者邮件列表的电子邮件线程，2012年10月。

[ 28 ] Maurice P. Herlihy: “ Wait-Free Synchronization ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 13, number 1, pages 124–149, January 1991. doi:10.1145/114005.102808

[28] Maurice P. Herlihy：「无等待同步」，ACM编程语言和系统事务（TOPLAS），卷13，号1，页124-149，1991年1月。DOI：10.1145/114005.102808。

[ 29 ] Armando Fox and Eric A. Brewer: “ Harvest, Yield, and Scalable Tolerant Systems ,” at 7th Workshop on Hot Topics in Operating Systems (HotOS), March 1999. doi:10.1109/HOTOS.1999.798396

请帮我翻译：[29] Armando Fox和Eric A. Brewer：“收割，产量和可扩展容错系统”，第7届操作系统热门话题研讨会（HotOS），1999年3月。doi:10.1109/HOTOS.1999.798396。答案：[29] Armando Fox和Eric A. Brewer：“收割，产量和可扩展容错系统”，第7届操作系统热门话题研讨会（HotOS），1999年3月。doi:10.1109/HOTOS.1999.798396。

[ 30 ] Seth Gilbert and Nancy Lynch: “ Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services ,” ACM SIGACT News , volume 33, number 2, pages 51–59, June 2002. doi:10.1145/564585.564601

[30] Seth Gilbert和Nancy Lynch: “Brewer Conjecture和一致、可用、容错的Web服务的可行性”, ACM SIGACT新闻，卷33，号2，页51-59，2002年6月。 doi:10.1145/564585.564601。

[ 31 ] Seth Gilbert and Nancy Lynch: “ Perspectives on the CAP Theorem ,” IEEE Computer Magazine , volume 45, number 2, pages 30–36, February 2012. doi:10.1109/MC.2011.389

「31」塞思·吉尔伯特和南希·琳奇：《CAP定理的视角》，IEEE计算机杂志，2012年2月第45卷第2期，第30至36页。doi:10.1109/MC.2011.389。

[ 32 ] Eric A. Brewer: “ CAP Twelve Years Later: How the ‘Rules’ Have Changed ,” IEEE Computer Magazine , volume 45, number 2, pages 23–29, February 2012. doi:10.1109/MC.2012.37

[32] Eric A. Brewer：“CAP十二年后：‘规则’如何改变”，《IEEE计算机杂志》，2012年2月，第45卷，第2期，23-29页。doi:10.1109/MC.2012.37。

[ 33 ] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen: “ Consistency in Partitioned Networks ,” ACM Computing Surveys , volume 17, number 3, pages 341–370, September 1985. doi:10.1145/5505.5508

[33] Susan B. Davidson、Hector Garcia-Molina和Dale Skeen：“分区网络中的一致性”，ACM Computing Surveys，第17卷，第3期，页341-370，1985年9月。doi:10.1145/5505.5508。

[ 34 ] Paul R. Johnson and Robert H. Thomas: “ RFC 677: The Maintenance of Duplicate Databases ,” Network Working Group, January 27, 1975.

[34] 保罗·R·约翰逊和罗伯特·H·托马斯： “RFC 677：重复数据库的维护，”网络工作组，1975年1月27日。

[ 35 ] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “ Notes on Distributed Databases ,” IBM Research, Research Report RJ2571(33471), July 1979.

[35] Bruce G. Lindsay，Patricia Griffiths Selinger，C. Galtieri，等： “Notes on Distributed Databases，”IBM 研究，研究报告 RJ2571(33471)，1979 年 7 月。【译文】[35]布鲁斯·林赛、帕特里夏·格里菲斯·赛林格、C.加尔蒂耶里等：《分布式数据库笔记》，IBM研究，研究报告RJ2571（33471），1979年7月。

[ 36 ] Michael J. Fischer and Alan Michael: “ Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network ,” at 1st ACM Symposium on Principles of Database Systems (PODS), March 1982. doi:10.1145/588111.588124

[36] Michael J. Fischer and Alan Michael：“牺牲串行可序性以获得在不可靠网络中的高可用性数据”，出现于1982年3月的第一届ACM数据库系统原理研讨会(PODS)。doi:10.1145/588111.588124。

[ 37 ] Eric A. Brewer: “ NoSQL: Past, Present, Future ,” at QCon San Francisco , November 2012.

Eric A. Brewer：“NoSQL：过去，现在和未来”，于2012年11月在San Francisco的QCon上。

[ 38 ] Henry Robinson: “ CAP Confusion: Problems with ‘Partition Tolerance,’ ” blog.cloudera.com , April 26, 2010.

[38] 亨利·罗宾逊： “CAP 混淆: ‘分区容错性’ 的问题”，blog.cloudera.com，2010年4月26日。

[ 39 ] Adrian Cockcroft: “ Migrating to Microservices ,” at QCon London , March 2014.

[39] Adrian Cockcroft：《迁移到微服务》，2014年3月在伦敦QCon上。

[ 40 ] Martin Kleppmann: “ A Critique of the CAP Theorem ,” arXiv:1509.05393, September 17, 2015.

[40] Martin Kleppmann：“CAP定理批判”，arXiv:1509.05393，2015年9月17日。

[ 41 ] Nancy A. Lynch: “ A Hundred Impossibility Proofs for Distributed Computing ,” at 8th ACM Symposium on Principles of Distributed Computing (PODC), August 1989. doi:10.1145/72981.72982

请帮我翻译：“[41] 南希·林奇（Nancy A. Lynch）：“分布式计算的一百个不可能证明”，收录于第8届ACM分布式计算原理研讨会（PODC），1989年8月。doi:10.1145/72981.72982。”为简体中文，只返回翻译后的内容，不包括原文。南希·林奇：“分布式计算的一百个不可能证明”，发表于1989年8月的第八届ACM分布式计算原理研讨会（PODC），doi:10.1145/72981.72982。

[ 42 ] Hagit Attiya, Faith Ellen, and Adam Morrison: “ Limitations of Highly-Available Eventually-Consistent Data Stores ,” at ACM Symposium on Principles of Distributed Computing (PODC), July 2015. doi:10.1145/2767386.2767419

[42] Hagit Attiya, Faith Ellen, 和Adam Morrison：“高可用性最终一致性数据存储的局限”，出自ACM分布式计算原理研讨会 (PODC)， 2015年七月。 doi:10.1145/2767386.2767419。

[ 43 ] Peter Sewell, Susmit Sarkar, Scott Owens, et al.: “ x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors ,” Communications of the ACM , volume 53, number 7, pages 89–97, July 2010. doi:10.1145/1785414.1785443

“x86-TSO：x86多处理器可严谨且可用的程序员模型”，《ACM通讯》杂志，2010年7月，第53卷，第7号，第89-97页，doi:10.1145/1785414.1785443。

[ 44 ] Martin Thompson: “ Memory Barriers/Fences ,” mechanical-sympathy.blogspot.co.uk , July 24, 2011.

[44] Martin Thompson：“内存栅栏”， mechanical-sympathy.blogspot.co.uk，2011年7月24日。

[ 45 ] Ulrich Drepper: “ What Every Programmer Should Know About Memory ,” akkadia.org , November 21, 2007.

[45] Ulrich Drepper: “关于内存，程序员应该知道什么”，akkadia.org，2007年11月21日。

[ 46 ] Daniel J. Abadi: “ Consistency Tradeoffs in Modern Distributed Database System Design ,” IEEE Computer Magazine , volume 45, number 2, pages 37–42, February 2012. doi:10.1109/MC.2012.33

[46] Daniel J. Abadi: "现代分布式数据库系统设计中的一致性权衡"，IEEE 计算机杂志，第 45 卷，第 2 期，页码为 37–42，2012 年 2 月。 doi:10.1109/MC.2012.33

[ 47 ] Hagit Attiya and Jennifer L. Welch: “ Sequential Consistency Versus Linearizability ,” ACM Transactions on Computer Systems (TOCS), volume 12, number 2, pages 91–122, May 1994. doi:10.1145/176575.176576

【47】Hagit Attiya和Jennifer L. Welch：“顺序一致性与线性化”，发表于ACM计算机系统交易（TOCS），卷12，号2，页码91-122，1994年5月。doi：10.1145/176575.176576。

[ 48 ] Mustaque Ahamad, Gil Neiger, James E. Burns, et al.: “ Causal Memory: Definitions, Implementation, and Programming ,” Distributed Computing , volume 9, number 1, pages 37–49, March 1995. doi:10.1007/BF01784241

[48] Mustaque Ahamad, Gil Neiger, James E. Burns等人：“因果存储器：定义、实现和编程”，《分布式计算》杂志，1995年3月，第9卷第1期，页码37-49。doi:10.1007/BF01784241。

[ 49 ] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen: “ Stronger Semantics for Low-Latency Geo-Replicated Storage ,” at 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2013.

【49】Wyatt Lloyd、Michael J. Freedman、Michael Kaminsky和David G. Andersen：《低延迟地理复制存储的更强语义》，2013年4月于第10届USENIX网络系统设计和实现研讨会（NSDI）上发表。

[ 50 ] Marek Zawirski, Annette Bieniusa, Valter Balegas, et al.: “ SwiftCloud: Fault-Tolerant Geo-Replication Integrated All the Way to the Client Machine ,” INRIA Research Report 8347, August 2013.

【50】Marek Zawirski，Annette Bieniusa，Valter Balegas，等： “SwiftCloud：实现从客户端机器到整个的容错地理复制”，INRIA研究报告8347，2013年8月。

[ 51 ] Peter Bailis, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica: “ Bolt-on Causal Consistency ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.

[51] Peter Bailis，Ali Ghodsi，Joseph M Hellerstein和Ion Stoica：“附加因果一致性”，在ACM数据管理国际会议（SIGMOD）上，2013年6月。

[ 52 ] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “ Challenges to Adopting Stronger Consistency at Scale ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[52] Philippe Ajoux，Nathan Bronson，Sanjeev Kumar，等： “在规模上采用更强一致性的挑战”，于2015年5月在第15届USENIX操作系统热门主题研讨会（HotOS）上发表。

[ 53 ] Peter Bailis: “ Causality Is Expensive (and What to Do About It) ,” bailis.org , February 5, 2014.

[53] Peter Bailis：“因果关系很昂贵（及如何应对）”，bailis.org，2014年2月5日。

[ 54 ] Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte: “ Concise Server-Wide Causality Management for Eventually Consistent Data Stores ,” at 15th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS), June 2015. doi:10.1007/978-3-319-19129-4_6

[54] Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero和Victor Fonte：“简洁的服务器端一致性管理的事件一致性数据存储”，第15届IFIP分布式应用和可互操作系统国际会议（DAIS），2015年6月。 doi：10.1007/978-3-319-19129-4_6。

[ 55 ] Rob Conery: “ A Better ID Generator for PostgreSQL ,” rob.conery.io , May 29, 2014.

“PostgreSQL更好的ID生成器”，Rob Conery，rob.conery.io，2014年5月29日。

[ 56 ] Leslie Lamport: “ Time, Clocks, and the Ordering of Events in a Distributed System ,” Communications of the ACM , volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563

“时间、时钟和分布式系统中的事件排序”，ACM通信，第21卷，第7期，1978年7月，558-565页。doi:10.1145/359545.359563。

[ 57 ] Xavier Défago, André Schiper, and Péter Urbán: “ Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey ,” ACM Computing Surveys , volume 36, number 4, pages 372–421, December 2004. doi:10.1145/1041680.1041682

【57】Xavier Défago、André Schiper和Péter Urbán：“总序广播和组播算法：分类和调查”，ACM计算调查，第36卷，第4号，372-421页，2004年12月。 doi：10.1145/1041680.1041682

[ 58 ] Hagit Attiya and Jennifer Welch: Distributed Computing: Fundamentals, Simulations and Advanced Topics , 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, doi:10.1002/0471478210

[58] Hagit Attiya和Jennifer Welch：分布式计算：基础，仿真和高级主题，第2版。 John Wiley＆Sons，2004年。 ISBN：978-0-471-45324-6，doi：10.1002/0471478210

[ 59 ] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, et al.: “ CORFU: A Shared Log Design for Flash Clusters ,” at 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.

[59] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran等: “CORFU:闪存集群的共享日志设计”，发表于2012年4月第9届USENIX 网络系统设计和实现研讨会（NSDI）.

[ 60 ] Fred B. Schneider: “ Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial ,” ACM Computing Surveys , volume 22, number 4, pages 299–319, December 1990.

"[60] Fred B. Schneider：“使用状态机方法实现容错服务：教程”，ACM Computing Surveys，卷22，号4，页299-319，1990年12月。"

[ 61 ] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, et al.: “ Calvin: Fast Distributed Transactions for Partitioned Database Systems ,” at ACM International Conference on Management of Data (SIGMOD), May 2012.

[61] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng等人： “Calvin：用于分区数据库系统的快速分布式事务”，发表于ACM数据管理国际会议(SIGMOD)，2012年5月。

[ 62 ] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, et al.: “ Tango: Distributed Data Structures over a Shared Log ,” at 24th ACM Symposium on Operating Systems Principles (SOSP), November 2013. doi:10.1145/2517349.2522732

【62】Mahesh Balakrishnan、Dahlia Malkhi、Ted Wobber等：“Tango：基于共享日志的分布式数据结构”，收录于2013年11月的第24届ACM操作系统原理研讨会（SOSP）。doi:10.1145/2517349.2522732。

[ 63 ] Robbert van Renesse and Fred B. Schneider: “ Chain Replication for Supporting High Throughput and Availability ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.

[63] Robbert van Renesse和Fred B. Schneider: “链式复制以支持高吞吐量和可用性”，发表于2004年12月的第六届USENIX操作系统设计和实现研讨会（OSDI）。

[ 64 ] Leslie Lamport: “ How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs ,” IEEE Transactions on Computers , volume 28, number 9, pages 690–691, September 1979. doi:10.1109/TC.1979.1675439

[64] 莱斯利·兰波特： “如何制造一个能正确执行多进程程序的多处理器计算机”，IEEE 计算机学报，第 28 卷，第 9 期，第 690-691 页，1979 年 9 月。 doi:10.1109/TC.1979.1675439。

[ 65 ] Enis Söztutar, Devaraj Das, and Carter Shanklin: “ Apache HBase High Availability at the Next Level ,” hortonworks.com , January 22, 2015.

"Apache HBase：下一代高可用性"，来自hortonworks.com的Enis Söztutar、Devaraj Das和Carter Shanklin，2015年1月22日。

[ 66 ] Brian F Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, et al.: “ PNUTS: Yahoo!’s Hosted Data Serving Platform ,” at 34th International Conference on Very Large Data Bases (VLDB), August 2008. doi:10.14778/1454159.1454167

[66] Brian F Cooper，Raghu Ramakrishnan，Utkarsh Srivastava等：“PNUTS：Yahoo!的托管数据服务平台” ，发表于2008年8月的第34届国际大数据会议（VLDB）。doi:10.14778/1454159.1454167。

[ 67 ] Tushar Deepak Chandra and Sam Toueg: “ Unreliable Failure Detectors for Reliable Distributed Systems ,” Journal of the ACM , volume 43, number 2, pages 225–267, March 1996. doi:10.1145/226643.226647

[67] Tushar Deepak Chandra 和 Sam Toueg： “可靠分布式系统的不可靠故障检测器”，ACM杂志，第43卷，第2期，页码225-267，1996年3月。 doi:10.1145/226643.226647

[ 68 ] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson: “ Impossibility of Distributed Consensus with One Faulty Process ,” Journal of the ACM , volume 32, number 2, pages 374–382, April 1985. doi:10.1145/3149.214121

[68] 迈克尔·J·费舍尔，南希·林奇和迈克尔·S·帕特森：《在一个故障进程的情况下无法实现分布式共识》，ACM期刊，第32卷，第2期，1985年4月，374-382页。doi:10.1145/3149.214121。

[ 69 ] Michael Ben-Or: “Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols,” at 2nd ACM Symposium on Principles of Distributed Computing (PODC), August 1983. doi:10.1145/800221.806707

[69] Michael Ben-Or: “自由选择的另一个优势：完全异步的协议协议”，发表于1983年8月的第二届ACM分布式计算原理研讨会(PODC)，doi:10.1145/800221.806707。

[ 70 ] Jim N. Gray and Leslie Lamport: “ Consensus on Transaction Commit ,” ACM Transactions on Database Systems (TODS), volume 31, number 1, pages 133–160, March 2006. doi:10.1145/1132863.1132867

【70】Jim N. Gray和Leslie Lamport： “事务提交上的共识”，ACM数据库系统交易（TODS），卷31，号1，页133-160，2006年3月。 doi：10.1145 / 1132863.1132867

[ 71 ] Rachid Guerraoui: “ Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus ,” at 9th International Workshop on Distributed Algorithms (WDAG), September 1995. doi:10.1007/BFb0022140

[71] Rachid Guerraoui：“非阻塞原子提交和一致性之间的关系重新审视”，发表于1995年9月的第9届国际分布式算法研讨会（WDAG）。doi：10.1007/BFb0022140。

[ 72 ] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “ All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications ,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.

【72】 Thanumalayan Sankaranarayana Pillai，Vijay Chidambaram，Ramnatthan Alagappan等人：“并非所有文件系统都是平等的：关于编写崩溃一致性应用程序的难度”，2014年10月在第11届USENIX操作系统设计与实现研讨会（OSDI）上发表。

[ 73 ] Jim Gray: “ The Transaction Concept: Virtues and Limitations ,” at 7th International Conference on Very Large Data Bases (VLDB), September 1981.

[73] 吉姆·格雷：《交易概念：优点和限制》，发表于1981年9月的第七届国际超大型数据库会议（VLDB）。

[ 74 ] Hector Garcia-Molina and Kenneth Salem: “ Sagas ,” at ACM International Conference on Management of Data (SIGMOD), May 1987. doi:10.1145/38713.38742

[74] Hector Garcia-Molina和Kenneth Salem： “Sagas”，于1987年5月在ACM国际数据管理会议(SIGMOD)上发表。 doi:10.1145/38713.38742

[ 75 ] C. Mohan, Bruce G. Lindsay, and Ron Obermarck: “ Transaction Management in the R* Distributed Database Management System ,” ACM Transactions on Database Systems , volume 11, number 4, pages 378–396, December 1986. doi:10.1145/7239.7266

[75] C. Mohan，Bruce G. Lindsay和Ron Obermarck：“R*分布式数据库管理系统中的事务管理”，ACM Transactions on Database Systems，卷11，号4，页378-396，1986年12月。 doi：10.1145 / 7239.7266

[ 76 ] “ Distributed Transaction Processing: The XA Specification ,” X/Open Company Ltd., Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3

“分布式事务处理：XA规范”，X/Open公司有限责任公司，技术标准XO/CAE/91/300，1991年12月。ISBN：978-1-872-63024-3。

[ 77 ] Mike Spille: “ XA Exposed, Part II ,” jroller.com , April 3, 2004.

[77] 迈克·斯皮莱：“XA曝光，第二部分”，jroller.com，2004年4月3日。

[ 78 ] Ivan Silva Neto and Francisco Reverbel: “ Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction ,” at 7th IEEE/ACIS International Conference on Computer and Information Science (ICIS), May 2008. doi:10.1109/ICIS.2008.75

"[78] Ivan Silva Neto和Francisco Reverbel：「从实施WS-Coordination和WS-AtomicTransaction中学到的经验教训」，发表于第七届IEEE / ACIS计算机和信息科学国际会议（ICIS），2008年5月。doi：10.1109 / ICIS.2008.75"

[ 79 ] James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt: “ Formal Specification of a Web Services Protocol ,” at 1st International Workshop on Web Services and Formal Methods (WS-FM), February 2004. doi:10.1016/j.entcs.2004.02.022

[79] James E. Johnson、David E. Langworthy、Leslie Lamport和Friedrich H. Vogt："Web服务协议的形式化规范"，发表于第一届Web服务和形式化方法研讨会(WS-FM)，2004年2月。doi:10.1016/j.entcs.2004.02.022。

[ 80 ] Dale Skeen: “ Nonblocking Commit Protocols ,” at ACM International Conference on Management of Data (SIGMOD), April 1981. doi:10.1145/582318.582339

[80] Dale Skeen: “非阻塞提交协议”，发表于1981年4月ACM国际数据管理会议(SIGMOD)，doi:10.1145/582318.582339。

[ 81 ] Gregor Hohpe: “ Your Coffee Shop Doesn’t Use Two-Phase Commit ,” IEEE Software , volume 22, number 2, pages 64–66, March 2005. doi:10.1109/MS.2005.52

"你的咖啡店不使用两阶段提交协议"，IEEE软件杂志，卷22，号码2，页码64-66，2005年3月。doi:10.1109/MS.2005.52。

[ 82 ] Pat Helland: “ Life Beyond Distributed Transactions: An Apostate’s Opinion ,” at 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.

帮我翻译一下：[82] Pat Helland：“超越分布式事务：一个背叛者的意见”，发表于2007年1月的第三届创新数据系统研究双年会。

[ 83 ] Jonathan Oliver: “ My Beef with MSDTC and Two-Phase Commits ,” blog.jonathanoliver.com , April 4, 2011.

[83] Jonathan Oliver： “我对MSDTC和两阶段提交的烦恼，”blog.jonathanoliver.com，2011年4月4日。 [83] 琼纳森·奥利弗： “我对 MSDTC 和两阶段提交的不满”，blog.jonathanoliver.com，2011年4月4日。

[ 84 ] Oren Eini (Ahende Rahien): “ The Fallacy of Distributed Transactions ,” ayende.com , July 17, 2014.

[84] Oren Eini (Ahende Rahien): “分布式事务的谬论”，ayende.com，2014年7月17日。

[ 85 ] Clemens Vasters: “ Transactions in Windows Azure (with Service Bus) – An Email Discussion ,” vasters.com , July 30, 2012.

[85] Clemens Vasters： “在Windows Azure中进行事务处理（使用Service Bus）-电子邮件讨论”，vasters.com，2012年7月30日。

[ 86 ] “ Understanding Transactionality in Azure ,” NServiceBus Documentation, Particular Software, 2015.

[86] “理解Azure中的交易性”，NServiceBus文档，Particular Software，2015年。

[ 87 ] Randy Wigginton, Ryan Lowe, Marcos Albe, and Fernando Ipar: “ Distributed Transactions in MySQL ,” at MySQL Conference and Expo , April 2013.

[87] Randy Wigginton, Ryan Lowe, Marcos Albe和Fernando Ipar：“MySQL中的分布式事务”，2013年4月在MySQL会议和博览会上。

[ 88 ] Mike Spille: “ XA Exposed, Part I ,” jroller.com , April 3, 2004.

[88] 迈克·斯皮尔： “XA揭秘，第一部分”， jroller.com，2004年4月3日。

[ 89 ] Ajmer Dhariwal: “ Orphaned MSDTC Transactions (-2 spids) ,” eraofdata.com , December 12, 2008.

"[89] Ajmer Dhariwal: “无主 MSDTC 交易 (-2 spids),” eraofdata.com，2008 年 12 月 12 日。"

[ 90 ] Paul Randal: “ Real World Story of DBCC PAGE Saving the Day ,” sqlskills.com , June 19, 2013.

[90] Paul Randal：“DBCC PAGE拯救一天的现实故事”，sqlskills.com，2013年6月19日。

[ 91 ] “ in-doubt xact resolution Server Configuration Option ,” SQL Server 2016 documentation, Microsoft, Inc., 2016.

"[91] “in-doubt xact resolution Server Configuration Option，” SQL Server 2016文档，Microsoft，Inc.，2016."，请将其翻译为简体中文，仅返回翻译内容，不包括原文。 "[91]“处于怀疑状态的事务解决服务器配置选项”，SQL Server 2016文档，Microsoft，Inc.，2016年。"

[ 92 ] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “ Consensus in the Presence of Partial Synchrony ,” Journal of the ACM , volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283

[92] Cynthia Dwork，Nancy Lynch和Larry Stockmeyer：“偏同步性下的达成一致”，ACM期刊，第35卷，第2期，第288–323页，1988年4月。doi：10.1145/42282.42283。

[ 93 ] Miguel Castro and Barbara H. Liskov: “ Practical Byzantine Fault Tolerance and Proactive Recovery ,” ACM Transactions on Computer Systems , volume 20, number 4, pages 396–461, November 2002. doi:10.1145/571637.571640

[93] Miguel Castro和Barbara H. Liskov： “实用拜占庭容错和积极恢复”，ACM计算机系统事务，卷20，第4期，页396-461，2002年11月。 doi：10.1145 / 571637.571640。

[ 94 ] Brian M. Oki and Barbara H. Liskov: “ Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems ,” at 7th ACM Symposium on Principles of Distributed Computing (PODC), August 1988. doi:10.1145/62546.62549

[94] Brian M. Oki和Barbara H. Liskov：“Viewstamped Replication：一种新的主副本方法，支持高可用性分布式系统”，收录于1988年8月第7届ACM分布式计算原理研讨会（PODC）。doi：10.1145/62546.62549。

[ 95 ] Barbara H. Liskov and James Cowling: “ Viewstamped Replication Revisited ,” Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012.

[95] Barbara H. Liskov和James Cowling：“Viewstamped Replication Revisited，” Massachusetts Institute of Technology，Tech Report MIT-CSAIL-TR-2012-021，2012年7月。

[ 96 ] Leslie Lamport: “ The Part-Time Parliament ,” ACM Transactions on Computer Systems , volume 16, number 2, pages 133–169, May 1998. doi:10.1145/279227.279229

[96] Leslie Lamport：“兼职议会”，ACM计算机系统交易，卷16，号2，页133-169，1998年5月。doi：10.1145/279227.279229。

[ 97 ] Leslie Lamport: “ Paxos Made Simple ,” ACM SIGACT News , volume 32, number 4, pages 51–58, December 2001.

"Leslie Lamport在《Paxos Made Simple》中指出，ACM SIGACT News，32卷4期，51-58页，2001年12月。"

[ 98 ] Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone: “ Paxos Made Live – An Engineering Perspective ,” at 26th ACM Symposium on Principles of Distributed Computing (PODC), June 2007.

【98】Tushar Deepak Chandra、Robert Griesemer和Joshua Redstone：“Paxos 在工程角度下的实现——26th ACM分布式计算原理研讨会（PODC）论文，2007年6月”

[ 99 ] Robbert van Renesse: “ Paxos Made Moderately Complex ,” cs.cornell.edu , March 2011.

[99] Robbert van Renesse：“ Paxos Made Moderately Complex”，cs.cornell.edu，2011年3月。 [99]罗伯特·范雷涅塞：《Paxos Made Moderately Complex》。Cornell大学计算机科学系，2011年3月。

[ 100 ] Diego Ongaro: “ Consensus: Bridging Theory and Practice ,” PhD Thesis, Stanford University, August 2014.

[100]迭戈·翁加罗：《共识：理论与实践的桥梁》博士论文，斯坦福大学，2014年8月。

[ 101 ] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft: “ Raft Refloated: Do We Have Consensus? ,” ACM SIGOPS Operating Systems Review , volume 49, number 1, pages 12–21, January 2015. doi:10.1145/2723872.2723876

[101] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy，和Jon Crowcroft：“Raft再浮现: 我们有共识吗？”，ACM SIGOPS操作系统评论，第49卷，第1期，页12-21，2015年1月。 doi:10.1145/2723872.2723876。

[ 102 ] André Medeiros: “ ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice ,” Aalto University School of Science, March 20, 2012.

[102] 安德烈·梅德罗斯: “ZooKeeper 的原子广播协议：理论与实践”，阿尔托大学科学学院，2012年3月20日。

[ 103 ] Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider: “ Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab ,” IEEE Transactions on Dependable and Secure Computing , volume 12, number 4, pages 472–484, September 2014. doi:10.1109/TDSC.2014.2355848

【103】Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider：“Vive La Différence: Paxos vs.Viewstamped Replication vs. Zab”，IEEE Transactions on Dependable and Secure Computing，vol. 12，no.4， pp. 472-484，2014年9月。 doi:10.1109/TDSC.2014.2355848

[ 104 ] Will Portnoy: “ Lessons Learned from Implementing Paxos ,” blog.willportnoy.com , June 14, 2012.

【104】威尔·波特劳伊：“从实施Paxos中学到的经验教训”，blog.willportnoy.com，2012年6月14日。

[ 105 ] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “ Flexible Paxos: Quorum Intersection Revisited ,” arXiv:1608.06696 , August 24, 2016.

[105] Heidi Howard、Dahlia Malkhi和Alexander Spiegelman： “灵活的Paxos：再探Quorum 交集”， arXiv:1608.06696，2016年8月24日。

[ 106 ] Heidi Howard and Jon Crowcroft: “ Coracle: Evaluating Consensus at the Internet Edge ,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2829988.2790010

[106] Heidi Howard and Jon Crowcroft: "Coracle: 在互联网边缘评估共识"，ACM数据通信专业兴趣小组(SIGCOMM)年会，2015年8月。doi:10.1145/2829988.2790010。

[ 107 ] Kyle Kingsbury: “ Call Me Maybe: Elasticsearch 1.5.0 ,” aphyr.com , April 27, 2015.

[107] Kyle Kingsbury: "Call Me Maybe: Elasticsearch 1.5.0," aphyr.com, April 27, 2015. [107] Kyle Kingsbury：“ 召唤我吧：Elasticsearch 1.5.0， ”aphyr.com，2015年4月27日。

[ 108 ] Ivan Kelly: “ BookKeeper Tutorial ,” github.com , October 2014.

[108] Ivan Kelly：“BookKeeper教程”，github.com，2014年10月。

[ 109 ] Camille Fournier: “ Consensus Systems for the Skeptical Architect ,” at Craft Conference , Budapest, Hungary, April 2015.

"[109] Camille Fournier: "怀疑论者的共识系统"，于2015年4月在匈牙利布达佩斯的Craft Conference上发表。"

[ 110 ] Kenneth P. Birman: “ A History of the Virtual Synchrony Replication Model ,” in Replication: Theory and Practice , Springer LNCS volume 5959, chapter 6, pages 91–120, 2010. ISBN: 978-3-642-11293-5, doi:10.1007/978-3-642-11294-2_6

[110] Kenneth P. Birman：“虚拟同步复制模型的历史”，载于《复制：理论与实践》，Springer LNCS 5959 卷，第6章，第91-120页，2010年。ISBN: 978-3-642-11293-5，doi:10.1007/978-3-642-11294-2_6。