第八章分布式系统的问题

Mar 15, 2013

Designing Data-Intensive Applications

Chapter 8. The Trouble with Distributed Systems

Hey I just met you
The network’s laggy
But here’s my data
So store it maybe

嗨我刚认识你网络有点慢但这是我的数据可以存储一下，也许

Kyle Kingsbury, Carly Rae Jepsen and the Perils of Network Partitions (2013)

Kyle Kingsbury，Carly Rae Jepsen和网络分区的危险（2013）

A recurring theme in the last few chapters has been how systems handle things going wrong. For example, we discussed replica failover ( “Handling Node Outages” ), replication lag ( “Problems with Replication Lag” ), and concurrency control for transactions ( “Weak Isolation Levels” ). As we come to understand various edge cases that can occur in real systems, we get better at handling them.

最近几章中反复出现的主题是系统如何处理出现故障的情况。例如，我们讨论了复制品故障转移（“处理节点故障”），复制滞后（“复制滞后问题”）和事务的并发控制（“弱隔离级别”）。随着我们逐渐了解实际系统可能出现的各种边缘情况，我们变得更加擅长处理它们。

However, even though we have talked a lot about faults, the last few chapters have still been too optimistic. The reality is even darker. We will now turn our pessimism to the maximum and assume that anything that can go wrong will go wrong. ⁱ (Experienced systems operators will tell you that is a reasonable assumption. If you ask nicely, they might tell you some frightening stories while nursing their scars of past battles.)

然而，即使我们谈论了很多故障，最近的几章仍然过于乐观。现实情况更加黑暗。我们现在将把悲观主义发挥到最大，假设任何可能出错的地方都会出错。（有经验的系统操作员会告诉你这是一个合理的假设。如果你很客气地问，他们可能会在抚摸着过去战斗中留下的伤疤时告诉你一些可怕的故事。）

Working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and exciting ways for things to go wrong [ 1 , 2 ]. In this chapter, we will get a taste of the problems that arise in practice, and an understanding of the things we can and cannot rely on.

与单个计算机上编写软件截然不同的是，处理分布式系统涉及许多新颖和令人兴奋的问题 [1, 2]。在本章中，我们将了解实践中出现的问题和我们可以依赖和不能依赖的事情的理解。

In the end, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting), in spite of everything going wrong. In Chapter 9 , we will look at some examples of algorithms that can provide such guarantees in a distributed system. But first, in this chapter, we must understand what challenges we are up against.

最終，作為工程師，我們的任務是建立系統，即使一切都出了差錯，也能完成他們的工作（即滿足用戶期望的保證）。在第9章中，我們將研究一些可以在分散系統中提供此類保證的算法示例。但首先，在本章中，我們必須了解我們面臨的挑戰是什麼。

This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks ( “Unreliable Networks” ); clocks and timing issues ( “Unreliable Clocks” ); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened ( “Knowledge, Truth, and Lies” ).

这一章是一个非常悲观和沮丧的概述，介绍分布式系统可能出现的问题。我们将观察网络问题（“不可靠的网络”），时钟和时间问题（“不可靠的时钟”），并探讨它们在多大程度上是可以避免的。所有这些问题的后果都是令人迷惑的，因此我们将探索如何思考分布式系统的状态以及如何推理发生的事情（“知识、真相和谎言”）。

Faults and Partial Failures

When you are writing a program on a single computer, it normally behaves in a fairly predictable way: either it works or it doesn’t. Buggy software may give the appearance that the computer is sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just a consequence of badly written software.

当您在单台计算机上编写程序时，它通常表现出相当可预测的方式：要么它有效，要么它无效。有错误的软件可能会让计算机看起来有时“度过糟糕的一天”（这个问题通常可以通过重新启动解决），但这主要是由于编写不良的软件造成的。

There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic ). If there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual computer with good software is usually either fully functional or entirely broken, but not something in between.

没有根本性的原因，使得单个计算机上的软件会出现问题：当硬件功能正常时，同样的操作总是会产生相同的结果（它是确定性的）。如果存在硬件问题（例如，内存损坏或松散的连接器），后果通常是整个系统崩溃（例如，内核恐慌，“蓝屏幕”，无法启动）。具有良好软件的单个计算机通常要么完全功能正常，要么完全损坏，而不会出现任何中间状态。

This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a computer to crash completely rather than returning a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection. A CPU instruction always does the same thing; if you write some data to memory or disk, that data remains intact and doesn’t get randomly corrupted. This design goal of always-correct computation goes all the way back to the very first digital computer [ 3 ].

这是计算机设计上的有意选择：如果出现内部故障，我们宁愿让计算机完全崩溃，也不愿意返回错误结果，因为处理错误结果都很困难和混乱。因此，计算机隐藏了它们实现的模糊物理现实，并呈现出理想化的系统模型，以数学完美的方式运作。CPU 指令总是执行同样的操作；如果你将某些数据写入内存或磁盘，那么数据会保持完好，不会随机损坏。这种始终正确计算的设计目标一直延续到最早的数字计算机 [3]。

When you are writing software that runs on several computers, connected by a network, the situation is fundamentally different. In distributed systems, we are no longer operating in an idealized system model—we have no choice but to confront the messy reality of the physical world. And in the physical world, a remarkably wide range of things can go wrong, as illustrated by this anecdote [ 4 ]:

当您编写在网络上连接多台计算机运行的软件时，情况是根本不同的。在分布式系统中，我们不再在理想化的系统模型中工作-我们别无选择，只能面对物理世界的混乱现实。而在物理世界中，如此之多的事情都可能出错，正如这个轶事所示[4]:

In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even an ops guy.

在我的有限经验中，我处理过单个数据中心（DC）内长时间存在的网络分区、PDU [电源分配单元] 故障、交换机故障、整个机架的意外断电、整个DC骨干网故障、整个DC电力故障，以及一位低血糖的驾驶员将他的福特皮卡撞进 DC 的 HVAC [供暖、通风和空调] 系统。我甚至不是一名运维人员。

Coda Hale

Coda Hale - 科达·哈尔

In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure . The difficulty is that partial failures are nondeterministic : if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even know whether something succeeded or not, as the time it takes for a message to travel across a network is also nondeterministic!

在分布式系统中，即使系统的其他部分运行良好，仍然可能存在某些组件以不可预测的方式损坏。这被称为部分故障。困难在于，部分故障是不确定的：如果您尝试执行涉及多个节点和网络的任何操作，有时可能会成功，有时则会出现不可预测的故障。正如我们将看到的，您甚至可能不知道某些操作是否成功，因为消息在网络上传输的时间也是不确定的！

This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with [ 5 ].

这种不确定性和部分故障的可能性使得分布式系统难以处理。

Cloud Computing and Supercomputing

There is a spectrum of philosophies on how to build large-scale computing systems:

如何构建大规模计算系统有各种不同的哲学观点：

At one end of the scale is the field of high-performance computing (HPC). Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks, such as weather forecasting or molecular dynamics (simulating the movement of atoms and molecules).

在一端的尺度上是高性能计算（HPC）领域。具有成千上万个CPU的超级计算机通常用于计算密集的科学计算任务，例如天气预报或分子动力学（模拟原子和分子的运动）。
At the other extreme is cloud computing , which is not very well defined [ 6 ] but is often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.

在另一个极端是云计算，它没有非常明确的定义，但通常与多租户数据中心、使用以太网连接的普通计算机、弹性/按需资源分配以及按量计费相关联。
Traditional enterprise datacenters lie somewhere between these extremes.

传统企业数据中心处于这两个极端之间。

With these philosophies come very different approaches to handling faults. In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time. If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint [ 7 , 8 ]. Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure—if any part of the system fails, just let everything crash (like a kernel panic on a single machine).

这些哲学观念带来了处理故障的非常不同的方法。在超级计算机中，作业通常会定期将计算状态检查点存储到持久性存储器中。如果一个节点出现故障，一种常见的解决方案是简单地停止整个群集的工作负载。在修复有故障的节点后，计算将从上一个检查点重新启动[7，8]。因此，超级计算机更像单节点计算机，而不是分布式系统：它通过让局部故障升级为总体故障来处理部分故障——如果系统的任何部分出现故障，就让一切崩溃（就像单个机器上的内核恐慌）。

In this book we focus on systems for implementing internet services, which usually look very different from supercomputers:

在本书中，我们专注于实现互联网服务的系统，这些系统通常与超级计算机大不相同。

Many internet-related applications are online , in the sense that they need to be able to serve users with low latency at any time. Making the service unavailable—for example, stopping the cluster for repair—is not acceptable. In contrast, offline (batch) jobs like weather simulations can be stopped and restarted with fairly low impact.

许多与互联网相关的应用程序是在线的，这意味着它们需要能够随时以低延迟为用户提供服务。使服务不可用，例如停止集群进行维修，是不可接受的。相反，像天气模拟这样的离线（批处理）作业可以停止和重新启动，影响相对较小。
Supercomputers are typically built from specialized hardware, where each node is quite reliable, and nodes communicate through shared memory and remote direct memory access (RDMA). On the other hand, nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but also have higher failure rates.

超级计算机通常使用专门的硬件构建，每个节点非常可靠，节点通过共享内存和远程直接内存存取(RDMA)进行通信。另一方面，云服务中的节点是由商品机器构建的，这些商品机器由于经济规模而能够以更低的成本提供相当的性能，但也具有更高的故障率。
Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth [ 9 ]. Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses [ 10 ], which yield better performance for HPC workloads with known communication patterns.

大型数据中心网络通常基于IP和以Clos拓扑结构排列的以太网，以提供高度双分带宽[9]。超级计算机通常使用专用的网络拓扑结构，例如多维网格和环形结构[10]，这些结构对HPC负载具有已知通信模式的性能更好。
The bigger a system gets, the more likely it is that one of its components is broken. Over time, broken things get fixed and new things break, but in a system with thousands of nodes, it is reasonable to assume that something is always broken [ 7 ]. When the error handling strategy consists of simply giving up, a large system can end up spending a lot of its time recovering from faults rather than doing useful work [ 8 ].

系统越大，它的每个组件出现问题的可能性也越大。随着时间的推移，坏的东西会被修复，新的东西则会出现问题，但在一个拥有成千上万个节点的系统中，可以合理地假设总会有东西是坏的[7]。当错误处理策略只是放弃时，大型系统最终可能会花费大量时间来恢复故障，而不是执行有用的工作[8]。
If the system can tolerate failed nodes and still keep working as a whole, that is a very useful feature for operations and maintenance: for example, you can perform a rolling upgrade (see Chapter 4 ), restarting one node at a time, while the service continues serving users without interruption. In cloud environments, if one virtual machine is not performing well, you can just kill it and request a new one (hoping that the new one will be faster).

如果系统能够容忍故障节点，且整体仍能正常运行，那么这将是运营和维护非常有用的特性。例如，你可以执行滚动升级（见第四章），一次重启一个节点，而服务仍能继续为用户提供服务而不会中断。在云环境中，如果一个虚拟机的表现不佳，你可以将其 “kill”，并请求一个新的虚拟机（希望新的虚拟机速度更快些）。
In a geographically distributed deployment (keeping data geographically close to your users to reduce access latency), communication most likely goes over the internet, which is slow and unreliable compared to local networks. Supercomputers generally assume that all of their nodes are close together.

在地理分布的部署中（使数据在地理上接近用户以减少访问延迟），通信最可能通过互联网进行，与本地网络相比，互联网速度慢且不可靠。超级计算机通常假定它们的所有节点都彼此靠近。

If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from unreliable components. (As discussed in “Reliability” , there is no such thing as perfect reliability, so we’ll need to understand the limits of what we can realistically promise.)

如果我们想要使分布式系统正常运作，我们必须接受部分失败的可能性，并将容错机制纳入软件中。换句话说，我们需要从不可靠的组件中构建可靠的系统。（正如在“可靠性”中讨论的那样，不存在完美的可靠性，因此我们需要了解我们可以现实承诺的限制。）

Even in smaller systems consisting of only a few nodes, it’s important to think about partial failure. In a small system, it’s quite likely that most of the components are working correctly most of the time. However, sooner or later, some part of the system will become faulty, and the software will have to somehow handle it. The fault handling must be part of the software design, and you (as operator of the software) need to know what behavior to expect from the software in the case of a fault.

即使是由只有几个节点组成的较小系统，考虑到部分故障也是很重要的。在小型系统中，大多数组件大部分时间运行正常是相当可能的。然而，迟早会有一些组件出现故障，软件必须以某种方式进行处理。故障处理必须是软件设计的一部分，您（作为软件的操作者）需要知道在发生故障的情况下可以从软件期望什么样的行为。

It would be unwise to assume that faults are rare and simply hope for the best. It is important to consider a wide range of possible faults—even fairly unlikely ones—and to artificially create such situations in your testing environment to see what happens. In distributed systems, suspicion, pessimism, and paranoia pay off.

认为故障很少并且只是希望一切顺利是不明智的。重要的是要考虑到各种可能的故障，即使是相对不太可能的故障，也要在测试环境中人为创建这种情况来观察发生了什么。在分布式系统中，怀疑，悲观和偏执是有回报的。

Building a Reliable System from Unreliable Components

You may wonder whether this makes any sense—intuitively it may seem like a system can only be as reliable as its least reliable component (its weakest link ). This is not the case: in fact, it is an old idea in computing to construct a more reliable system from a less reliable underlying base [ 11 ]. For example:

你可能会想知道这是否有任何意义 - 直觉上，似乎一个系统只能与其最不可靠的组件一样可靠（最弱的链接）。但事实并非如此：事实上，在计算机领域，从不可靠的基础建立更可靠的系统是一个古老的想法[11]。例如：

Error-correcting codes allow digital data to be transmitted accurately across a communication channel that occasionally gets some bits wrong, for example due to radio interference on a wireless network [ 12 ].

纠错码允许数字数据在通信渠道中准确传输，在这个渠道中有时会出现一些错误的位，例如在无线网络上由于电波干扰。
IP (the Internet Protocol) is unreliable: it may drop, delay, duplicate, or reorder packets. TCP (the Transmission Control Protocol) provides a more reliable transport layer on top of IP: it ensures that missing packets are retransmitted, duplicates are eliminated, and packets are reassembled into the order in which they were sent.

IP（Internet协议）是不可靠的：它可能会丢失、延迟、重复或重新排列数据包。TCP（传输控制协议）在IP之上提供更可靠的传输层：它确保丢失的数据包被重传，重复的被消除，并且将数据包重新组装为发送的顺序。

Although the system can be more reliable than its underlying parts, there is always a limit to how much more reliable it can be. For example, error-correcting codes can deal with a small number of single-bit errors, but if your signal is swamped by interference, there is a fundamental limit to how much data you can get through your communication channel [ 13 ]. TCP can hide packet loss, duplication, and reordering from you, but it cannot magically remove delays in the network.

尽管系统可以比其基础部分更可靠，但它可以更可靠的程度总是有限的。例如，纠错码可以处理少量单比特错误，但如果您的信号被干扰淹没，您能通过通信通道传输的数据量存在根本性的限制 [13]。TCP可以隐藏数据包丢失、重复和重新排序，但它不能神奇地消除网络延迟。

Although the more reliable higher-level system is not perfect, it’s still useful because it takes care of some of the tricky low-level faults, and so the remaining faults are usually easier to reason about and deal with. We will explore this matter further in “The end-to-end argument” .

尽管较为可靠的高层系统并不完美，但它仍然很有用，因为它能够解决一些棘手的低层故障，而剩余的故障通常更容易解决。我们将在“端到端论证”中进一步探讨这个问题。

Unreliable Networks

As discussed in the introduction to Part II , the distributed systems we focus on in this book are shared-nothing systems : i.e., a bunch of machines connected by a network. The network is the only way those machines can communicate—we assume that each machine has its own memory and disk, and one machine cannot access another machine’s memory or disk (except by making requests to a service over the network).

正如第二部分的介绍所讨论的，我们在本书中关注的分布式系统是共享无关系统：即由网络连接的一堆机器。网络是这些机器之间唯一的通信途径——我们假设每台机器都有自己的内存和磁盘，一台机器不能访问另一台机器的内存或磁盘（除非通过网络向服务发出请求）。

Shared-nothing is not the only way of building systems, but it has become the dominant approach for building internet services, for several reasons: it’s comparatively cheap because it requires no special hardware, it can make use of commoditized cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed datacenters.

共享无的构建系统不是唯一的方式，但已成为构建互联网服务的主要方法，原因如下：它相对便宜，因为不需要特殊的硬件，它可以利用商品化的云计算服务，并且可以通过多个地理分布的数据中心的冗余实现高可靠性。

The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks . In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong (some of which are illustrated in Figure 8-1 ):

互联网和大多数数据中心内部网络（通常使用以太网）是异步分组网络。在这种网络中，一个节点可以向另一个节点发送消息（分组），但是网络不能保证它何时到达或是否到达。如果您发送请求并期望得到响应，很多事情可能会出错（其中一些在图8-1中说明）：

Your request may have been lost (perhaps someone unplugged a network cable).

您的请求可能已经丢失（也许有人拔了一根网络电缆）。
Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).

您的请求可能正在等待队列中，并且稍后会被交付（也许是因为网络或接收方过载）。
The remote node may have failed (perhaps it crashed or it was powered down).

远程节点可能已经失效（可能是崩溃或关机）。
The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see “Process Pauses” ), but it will start responding again later.

远程节点可能暂时停止响应（可能正在经历长时间的垃圾回收暂停；请参阅“进程暂停”），但稍后它将重新开始响应。
The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).

远程节点可能已经处理了您的请求，但响应却在网络中丢失了（可能是由于网络交换机的错误配置）。
The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).

远程节点可能已经处理了您的请求，但是响应被延迟了，稍后会送达（可能是由于网络负载或您自己的计算机过载）。

The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information you have is that you haven’t received a response yet. If you send a request to another node and don’t receive a response, it is impossible to tell why.

发送者甚至无法知道数据包是否已经被传递：唯一的选择是由收件人发送响应信息，而这可能会被丢失或延迟。在异步网络中，这些问题是无法区分的：你唯一能得到的信息是你还没有收到响应。如果你向另一个节点发送请求但没有收到响应，那么你无法确定原因。

The usual way of handling this issue is a timeout : after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).

通常处理此问题的方法是超时：在一段时间后，您放弃等待并假设响应不会到达。然而，当超时发生时，您仍然不知道远程节点是否收到了您的请求（如果请求仍在某个位置排队，则可能仍会传递给收件人，即使发送者已经放弃它）。通常处理此问题的方法是超时：在一段时间后，您放弃等待并假设响应不会到达。然而，当超时发生时，您仍然不知道远程节点是否收到了您的请求（如果请求仍在某个位置排队，则可能仍会传递给收件人，即使发送者已经放弃它）。

Network Faults in Practice

We have been building computer networks for decades—one might hope that by now we would have figured out how to make them reliable. However, it seems that we have not yet succeeded.

数十年来，我们一直在建立计算机网络——人们或许希望我们早已学会如何使它们可靠。然而，现在看来我们还没有成功。

There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company [ 14 ]. One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack [ 15 ]. Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers [ 16 ]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.

一些系统研究和大量的个人经验表明，即使是由一家公司操作的数据中心这样的受控环境中，网络问题也可能很常见。一项在中型数据中心进行的研究发现，每月约有12次网络故障，其中一半将一个计算机断开连接，另一半将整个机架断开连接。另一项研究测量了顶部交换机、聚合交换机和负载均衡器等组件的故障率。它发现，添加冗余网络设备并不能像您希望的那样减少故障，因为它不能防止人为错误（例如，配置错误的交换机），这是故障的主要原因之一。

Public cloud services such as EC2 are notorious for having frequent transient network glitches [ 14 ], and well-managed private datacenter networks can be stabler environments. Nevertheless, nobody is immune from network problems: for example, a problem during a software upgrade for a switch could trigger a network topology reconfiguration, during which network packets could be delayed for more than a minute [ 17 ]. Sharks might bite undersea cables and damage them [ 18 ]. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully [ 19 ]: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.

公共云服务，比如EC2，以频繁的短暂网络故障著称[14]，而良好管理的私有数据中心网络可能会更加稳定。然而，无论如何，任何人都不能免于网络问题：例如，交换机软件升级期间出现问题可能会触发网络拓扑重构，在此期间，网络数据包可能会延迟超过一分钟[17]。鲨鱼可能会咬断海底电缆并损坏它们[18]。其他令人惊讶的故障包括某些时候会丢失所有入站数据包但能成功发送出站数据包的网络接口[19]：仅因为一个网络连接在一个方向工作并不能保证它在相反方向也能正常工作。

Network partitions

When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or netsplit . In this book we’ll generally stick with the more general term network fault , to avoid confusion with partitions (shards) of a storage system, as discussed in Chapter 6 .

当网络的一部分由于网络故障而与其他部分断开连接时，有时称为网络分区或netsplit。在本书中，我们通常会坚持使用更一般的术语网络故障，以避免与存储系统的分区（分片）混淆，如第6章讨论的那样。

Even if network faults are rare in your environment, the fact that faults can occur means that your software needs to be able to handle them. Whenever any communication happens over a network, it may fail—there is no way around it.

即使在您的环境中网络故障很少见，但故障可能发生的事实意味着您的软件需要能够处理它们。无论何时在网络上进行任何通信，都可能会失败-这是无法避免的。

If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, even when the network recovers [ 20 ], or it could even delete all of your data [ 21 ]. If software is put in an unanticipated situation, it may do arbitrary unexpected things.

如果网络故障的错误处理未定义和测试，可能会发生任意糟糕的事情：例如，集群可能会陷入死锁状态，永久无法提供请求服务，即使网络恢复[20]，或者甚至可能删除所有数据[21]。如果软件遇到未预料的情况，它可能会做出任意意想不到的事情。

Handling network faults doesn’t necessarily mean tolerating them: if your network is normally fairly reliable, a valid approach may be to simply show an error message to users while your network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them. It may make sense to deliberately trigger network problems and test the system’s response (this is the idea behind Chaos Monkey; see “Reliability” ).

处理网络故障并不一定意味着容忍它们：如果您的网络通常相当可靠，一种有效的方法可能是在网络遇到问题时向用户显示错误消息。然而，您需要了解您的软件如何应对网络问题，并确保系统可以从中恢复。有意诱发网络问题并测试系统的响应可能是有意义的（这是混沌猴子背后的想法；请参见“可靠性”）。

Detecting Faults

Many systems need to automatically detect faulty nodes. For example:

许多系统需要自动检测故障节点。例如：

A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation ).

负载均衡器需要停止向已经挂掉的节点发送请求（即将其从轮循列表中移除）。
In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader (see “Handling Node Outages” ).

在具有单领导复制的分布式数据库中，如果领导者失败，则需要将其中一个跟随者提升为新领导者（请参见“处理节点故障”）。

Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is working or not. In some specific circumstances you might get some feedback to explicitly tell you that something is not working:

不幸的是，由于网络的不确定性，很难确定节点是否正常工作。在某些特定情况下，您可能会得到一些反馈来明确告诉您某些东西没有正常工作。

If you can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed), the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply. However, if the node crashed while it was handling your request, you have no way of knowing how much data was actually processed by the remote node [ 22 ].

如果您可以连接到节点应该运行的机器，但目的端口上没有进程正在监听（例如，因为进程崩溃了），操作系统会发送RST或FIN数据包来关闭或拒绝TCP连接。然而，如果节点在处理您的请求时崩溃了，您就无法知道远程节点实际处理了多少数据。
If a node process crashed (or was killed by an administrator) but the node’s operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. For example, HBase does this [ 23 ].

如果节点进程崩溃（或被管理员终止），但节点的操作系统仍在运行，脚本可以通知其他节点关于崩溃的情况，以便另一个节点可以快速接管而不必等待超时过期。例如，HBase就是这样做的。
If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level (e.g., if the remote machine is powered down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared datacenter with no access to the switches themselves, or if you can’t reach the management interface due to a network problem.

如果您可以访问数据中心网络交换机的管理界面，您可以查询它们以检测硬件级别的链路故障（例如，如果远程计算机关闭电源）。如果您通过互联网连接，或者在一个共享数据中心中，无法访问交换机自身，或者由于网络问题无法到达管理界面，则此选项被排除。
If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply to you with an ICMP Destination Unreachable packet. However, the router doesn’t have a magic failure detection capability either—it is subject to the same limitations as other participants of the network.

如果路由器确定你要连接的IP地址无法到达，它可能会向你发送一个ICMP目的地不可达的数据包。但是，路由器也没有奇迹般的故障检测能力，它受到网络其他参与者同样的限制。

Rapid feedback about a remote node being down is useful, but you can’t count on it. Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it. If you want to be sure that a request was successful, you need a positive response from the application itself [ 24 ].

远程节点故障的快速反馈很有用，但不能依赖它。即使TCP确认消息已传递，应用程序在处理消息之前可能已经崩溃。如果您想确保请求成功，您需要从应用程序本身获得积极的响应[24]。

Conversely, if something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all. You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.

相反地，如果出现了问题，可能会在堆栈的某个层次上收到错误响应，但通常你必须假设你将根本不会收到任何响应。你可以尝试几次（TCP 会透明地重试，但您也可以在应用层重试），等待超时时间到期，如果在超时时间内没有收到回复，最终将节点设置为离线。

Timeouts and Unbounded Delays

If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There is unfortunately no simple answer.

如果一个超时是唯一能够确定故障的方法，那么超时时间应该是多少呢？不幸的是，没有简单的答案。

A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network).

长时间超时意味着节点被宣告为死亡需要等待很长时间（在此期间，用户可能需要等待或看到错误消息）。短时间超时可以更快地检测故障，但可能会错误地宣告节点死亡，而实际上它只是暂时减速（例如，由于节点或网络的负载峰值）的风险更高。

Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice. We will discuss this issue in more detail in “Knowledge, Truth, and Lies” , and in Chapters 9 and 11 .

过早宣布节点死亡是有问题的：如果节点实际上仍然存活并正在执行某些操作（例如发送电子邮件），而另一个节点接管了，那么该操作可能会被执行两次。我们将在“知识，真相和谎言”以及第9章和第11章中更详细地讨论这个问题。

When a node is declared dead, its responsibilities need to be transferred to other nodes, which places additional load on other nodes and the network. If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen that the node actually wasn’t dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).

当一个节点被宣布死亡时，它的责任需要转移到其他节点，这会给其他节点和网络增加额外的负载。如果系统已经处于高负载状态，过早地宣布节点死亡可能会使问题恶化。特别的，可能会发生节点实际上并没有死亡，只是因为负载过重而反应较慢的情况；将它的负载转移到其他节点可能会导致级联故障（在极端情况下，所有节点互相宣布死亡，一切都停止工作）。

Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet is either delivered within some time d , or it is lost, but delivery never takes longer than d . Furthermore, assume that you can guarantee that a non-failed node always handles a request within some time r . In this case, you could guarantee that every successful request receives a response within time 2 d + r —and if you don’t receive a response within that time, you know that either the network or the remote node is not working. If this was true, 2 d + r would be a reasonable timeout to use.

想象一个虚构的系统，其中网络保证了分组的最大延迟——每个分组都在某个时间d内被传送，否则就会丢失，但传送时间永远不会超过d。此外，请假设您可以保证非故障节点始终在某个时间r内处理请求。在这种情况下，您可以保证每个成功的请求都会在2d + r时间内接收到响应——如果您没有在该时间内收到响应，那么您就知道网络或远程节点不起作用了。如果这是真的，2d + r将是一个合理的超时时间。

Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive), and most server implementations cannot guarantee that they can handle requests within some maximum time (see “Response time guarantees” ). For failure detection, it’s not sufficient for the system to be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip times to throw the system off-balance.

很不幸，我们所涉及的大多数系统都没有这些保证：异步网络有无限延迟（也就是说，它们会尽可能快地传输数据包，但数据包到达所需的时间没有上限），并且大多数服务器实现不能保证在某个最大时间内处理请求（请参阅“响应时间保证”）。对于故障检测来说，系统大部分时间表现良好是不够的：如果您的超时时间很短，只需一次往返延迟的瞬时波动就足以使系统失衡。

Network congestion and queueing

When driving a car, travel times on road networks often vary most due to traffic congestion. Similarly, the variability of packet delays on computer networks is most often due to queueing [ 25 ]:

在驾驶汽车时，路网上的行驶时间往往最受交通拥堵的影响变化最大。同样，计算机网络中数据包延迟的变化也往往是由于队列延迟引起的。

If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 8-2 ). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion ). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent—even though the network is functioning fine.

如果有多个不同的节点同时尝试发送数据包到相同的目的地，网络交换机必须将它们排队并逐一输入目的地网络链接（如图8-2所示）。在繁忙的网络链接上，数据包可能需要等待一段时间才能获得槽位（这被称为网络拥塞）。如果有太多的传入数据，交换机队列就会填满，数据包就会被丢弃，因此需要重新发送，即使网络正常运行。
When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time.

当数据包到达目标机器时，如果所有CPU核心都在忙，那么操作系统会将来自网络的请求排队，直到应用程序准备好处理它。根据机器的负载情况，这可能需要任意长度的时间。
In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered) by the virtual machine monitor [ 26 ], further increasing the variability of network delays.

在虚拟化环境中，运行中的操作系统常常会因为另一个虚拟机在使用CPU核心而暂停数十毫秒。在此期间，虚拟机无法从网络中消耗任何数据，因此进入的数据将被虚拟机监视器排队缓冲，进一步增加了网络延迟的不确定性。
TCP performs flow control (also known as congestion avoidance or backpressure ), in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node [ 27 ]. This means additional queueing at the sender before the data even enters the network.

TCP执行流量控制（也称拥塞避免或反压），其中节点限制自己的发送速率，以避免超载网络链接或接收节点[27]。这意味着在数据甚至进入网络之前，在发送方进行额外的排队。

Moreover, TCP considers a packet to be lost if it is not acknowledged within some timeout (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. Although the application does not see the packet loss and retransmission, it does see the resulting delay (waiting for the timeout to expire, and then waiting for the retransmitted packet to be acknowledged).

此外，TCP 认为如果一份数据包在规定的超时时间内（这个超时时间是从已观察到的往返时间计算出来的）没有得到确认，那么这份数据包就被视为丢失了。并且，已经丢失的数据包会自动地被重新发送。虽然应用程序并没有意识到数据包的丢失和重新发送，但是它会感知到由此带来的延迟（因为需要等待超时时间的到来，然后再等待重新发送的数据包被确认）。

TCP Versus UDP

Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP rather than TCP. It’s a trade-off between reliability and variability of delays: as UDP does not perform flow control and does not retransmit lost packets, it avoids some of the reasons for variable network delays (although it is still susceptible to switch queues and scheduling delays).

一些对延迟敏感的应用程序，比如视频会议和Voice over IP（VoIP），采用UDP而不是TCP。这是可靠性和延迟变量性之间的权衡：由于UDP不执行数据流控制，也不重新传输丢失的数据包，因此它避免了某些原因导致的网络延迟变化（尽管仍然容易受到交换机队列和调度延迟的影响）。

UDP is a good choice in situations where delayed data is worthless. For example, in a VoIP phone call, there probably isn’t enough time to retransmit a lost packet before its data is due to be played over the loudspeakers. In this case, there’s no point in retransmitting the packet—the application must instead fill the missing packet’s time slot with silence (causing a brief interruption in the sound) and move on in the stream. The retry happens at the human layer instead. (“Could you repeat that please? The sound just cut out for a moment.”)

UDP在延迟数据无关紧要时是一个好选择。例如，在VoIP电话中，可能没有足够的时间在所需播放数据之前重新传输一次丢失的数据包。在这种情况下，重新传输数据包是没有意义的 - 应用程序必须用沉默填充缺失数据包的时间段（导致声音短暂间断）并继续进行数据流。重试在人类层面上发生。（“请再说一遍好吗？声音刚才短暂中断了。”）

All of these factors contribute to the variability of network delays. Queueing delays have an especially wide range when a system is close to its maximum capacity: a system with plenty of spare capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very quickly.

所有这些因素都会导致网络延迟的变化。排队延迟在系统接近最大容量时具有特别广泛的范围：有足够多备用容量的系统可以轻松地排空队列，而在高度利用的系统中，队列很快就会积累起来。

In public clouds and multi-tenant datacenters, resources are shared among many customers: the network links and switches, and even each machine’s network interface and CPUs (when running on virtual machines), are shared. Batch workloads such as MapReduce (see Chapter 10 ) can easily saturate network links. As you have no control over or insight into other customers’ usage of the shared resources, network delays can be highly variable if someone near you (a noisy neighbor ) is using a lot of resources [ 28 , 29 ].

在公共云和多租户数据中心中，资源被许多客户共享：网络链接和交换机，甚至每台机器的网络接口和 CPU（在虚拟机上运行时）都是共享的。批处理工作负载，例如MapReduce（见第十章），可以轻松饱和网络链接。由于您无法控制或了解其他客户对共享资源的使用，如果某个邻近的用户（嘈杂的邻居）正在使用大量资源，则网络延迟可能会变化很大 [28，29]。

In such environments, you can only choose timeouts experimentally: measure the distribution of network round-trip times over an extended period, and over many machines, to determine the expected variability of delays. Then, taking into account your application’s characteristics, you can determine an appropriate trade-off between failure detection delay and risk of premature timeouts.

在这种环境下，您只能通过实验选择适当的超时时间：长时间内测量网络往返延迟时间分布，并在多台设备上测量，从而确定延迟期望的可变性。然后，考虑您的应用特点，您可以确定适当的故障检测延迟和过早超时的风险之间的权衡。

Even better, rather than using configured constant timeouts, systems can continually measure response times and their variability ( jitter ), and automatically adjust timeouts according to the observed response time distribution. This can be done with a Phi Accrual failure detector [ 30 ], which is used for example in Akka and Cassandra [ 31 ]. TCP retransmission timeouts also work similarly [ 27 ].

更好的做法是，系统可以不使用配置的恒定超时时间，而是不断测量响应时间及其变异性（抖动），并根据观察到的响应时间分布自动调整超时时间。这可以通过Phi Accrual失效探测器来实现，例如Akka和Cassandra中使用的方法。TCP重传超时时间也类似地工作。

Synchronous Versus Asynchronous Networks

Distributed systems would be a lot simpler if we could rely on the network to deliver packets with some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level and make the network reliable so that the software doesn’t need to worry about it?

如果我们可以依赖网络以一定的最长延迟传递数据包，而不是丢弃数据包，分布式系统将会更简单。为什么我们不能在硬件层面解决这个问题，使网络变得可靠，从而软件不需要担心这些问题呢？

To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have similar reliability and predictability in computer networks?

回答这个问题，有趣的是将数据中心网络与传统的固定电话网络（非蜂窝、非VoIP）进行比较，后者非常可靠：延迟的音频帧和掉话非常少。电话呼叫需要始终保持低的端到端延迟和足够的带宽来传输您的语音样本。难道在计算机网络中拥有类似的可靠性和可预测性不好吗？

When you make a call over the telephone network, it establishes a circuit : a fixed, guaranteed amount of bandwidth is allocated for the call, along the entire route between the two callers. This circuit remains in place until the call ends [ 32 ]. For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every 250 microseconds [ 33 , 34 ].

当你通过电话网络打电话时，它建立了一个电路：在两个通话者之间的整个路线上，分配了一个固定的，保证的带宽量用于通话。这个电路一直保持到通话结束为止。例如，ISDN网络以固定速率运行为每秒4,000帧。当通话建立时，在每个帧（每个方向）内分配了16个位的空间。因此，在通话期间，每个部分都保证可以每250微秒传送确切的16位音频数据。

This kind of network is synchronous : even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved in the next hop of the network. And because there is no queueing, the maximum end-to-end latency of the network is fixed. We call this a bounded delay .

这种网络是同步的：即使数据经过多个路由器，由于调用的16位空间已在网络的下一个跳中预留，因此它不会受到排队的影响。由于没有排队，网络的最大端到端延迟是固定的。我们称之为有界时延。

Can we not simply make network delays predictable?

Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available. You can give TCP a variable-sized block of data (e.g., an email or a web page), and it will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t use any bandwidth. ⁱⁱ

请注意，电话网络中的电路与TCP连接非常不同：电路是一定量的预留带宽，当电路建立时，没有其他人可以使用该带宽，而TCP连接的数据包是机会主义地利用可用的网络带宽。您可以给TCP一个可变大小的数据块（例如，电子邮件或网页），它将尝试以最短的时间传输。当TCP连接空闲时，它不会使用任何带宽。

If datacenter networks and the internet were circuit-switched networks, it would be possible to establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not: Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. These protocols do not have the concept of a circuit.

如果数据中心网络和互联网是电路交换网络，那么在建立电路时可以确保最大往返时间。但是，实际上它们不是：以太网和IP是分组交换协议，因为在网络中存在排队和无限延迟而受到影响。这些协议没有电路的概念。

Why do datacenter networks and the internet use packet switching? The answer is that they are optimized for bursty traffic . A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement—we just want it to complete as quickly as possible.

数据中心网络和互联网为什么要使用分组交换？答案是它们被优化用于突发性的流量。电路对于音频或视频呼叫来说很好，这需要在呼叫期间传输相对恒定的比特数。另一方面，请求网页、发送电子邮件或传输文件并没有特定的带宽要求——我们只想尽快完成它。

If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.

如果你想在电路上传输文件，则需要猜测带宽分配。如果您猜测过低，则传输速度过慢，浪费网络容量。如果您猜测过高，则电路无法设置（因为如果无法保证其带宽分配，则网络不允许创建电路）。因此，对于突发数据传输使用电路将浪费网络容量，并使传输变慢。相比之下，TCP动态调整数据传输速率以适应可用的网络容量。

There have been some attempts to build hybrid networks that support both circuit switching and packet switching, such as ATM. ⁱⁱⁱ InfiniBand has some similarities [ 35 ]: it implements end-to-end flow control at the link layer, which reduces the need for queueing in the network, although it can still suffer from delays due to link congestion [ 36 ]. With careful use of quality of service (QoS, prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay [ 25 , 32 ].

有一些尝试构建混合网络以支持电路交换和分组交换，例如 ATM。InfiniBand 有一些相似之处：它在链路层实现端到端的流量控制，这减少了网络中排队的需求，尽管它仍然可能因为链路拥塞而遭受延迟。通过精心利用服务质量（QoS，分组的优先级和调度）和准入控制（限制发送者的速率），可以在分组网络上模拟电路交换或提供统计所限的延迟。

Latency and Resource Utilization

More generally, you can think of variable delays as a consequence of dynamic resource partitioning.

更一般而言，您可以将可变延迟视为动态资源分配的结果。

Say you have a wire between two telephone switches that can carry up to 10,000 simultaneous calls. Each circuit that is switched over this wire occupies one of those call slots. Thus, you can think of the wire as a resource that can be shared by up to 10,000 simultaneous users. The resource is divided up in a static way: even if you’re the only call on the wire right now, and all other 9,999 slots are unused, your circuit is still allocated the same fixed amount of bandwidth as when the wire is fully utilized.

假设你有一个连接两个电话交换机的电线，可以同时承载10,000个电话。每个被接入这条线的电路都占据其中一个通话插槽。因此，你可以将这个电线看作一种资源，最多可以为10,000个用户共享。这个资源是以静态方式分配的：即使现在只有你一个人在通话，其它9,999个插槽都没被使用，你的电路所分配的带宽量仍然和当电线被充分利用时一样固定。

By contrast, the internet shares network bandwidth dynamically . Senders push and jostle with each other to get their packets over the wire as quickly as possible, and the network switches decide which packet to send (i.e., the bandwidth allocation) from one moment to the next. This approach has the downside of queueing, but the advantage is that it maximizes utilization of the wire. The wire has a fixed cost, so if you utilize it better, each byte you send over the wire is cheaper.

相比之下，互联网动态共享网络带宽。发送方相互竞争，以尽快将数据包传输到线路上，网络交换机决定从一瞬间到另一瞬间发送哪个数据包（即带宽分配）。这种方法的缺点在于会造成排队，但它的优点在于最大限度地利用了线路。线路的成本是固定的，所以如果更好地利用它，则每个字节发送到线路上的成本更低。

A similar situation arises with CPUs: if you share each CPU core dynamically between several threads, one thread sometimes has to wait in the operating system’s run queue while another thread is running, so a thread can be paused for varying lengths of time. However, this utilizes the hardware better than if you allocated a static number of CPU cycles to each thread (see “Response time guarantees” ). Better hardware utilization is also a significant motivation for using virtual machines.

与CPU类似的情况出现了：如果你动态地在几个线程之间分享每个CPU核心，那么一个线程有时必须在操作系统的运行队列中等待，而另一个线程正在运行，所以一个线程可能会暂停不同长度的时间。然而，与为每个线程分配固定数量的CPU周期相比，这更好地利用了硬件（见“响应时间保证”）。更好的硬件利用率也是使用虚拟机的重要动力。

Latency guarantees are achievable in certain environments, if resources are statically partitioned (e.g., dedicated hardware and exclusive bandwidth allocations). However, it comes at the cost of reduced utilization—in other words, it is more expensive. On the other hand, multi-tenancy with dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays.

在某些环境下，如果资源被静态分区（如专用硬件和独占带宽分配），则可以实现延迟保证。然而，这是以降低利用率的代价为代价的，换句话说，它更昂贵。另一方面，具有动态资源分区的多租户提供更好的利用率，因此更便宜，但它的缺点是延迟不确定。

Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off.

网络中的可变延迟不是自然法则，而仅仅是成本效益权衡的结果。

However, such quality of service is currently not enabled in multi-tenant datacenters and public clouds, or when communicating via the internet. ^iv Currently deployed technology does not allow us to make any guarantees about delays or reliability of the network: we have to assume that network congestion, queueing, and unbounded delays will happen. Consequently, there’s no “correct” value for timeouts—they need to be determined experimentally.

然而，这样的服务质量目前在多租户数据中心、公共云或通过互联网通信时尚未启用。目前部署的技术不允许我们对网络的延迟或可靠性做出任何保证：我们必须假定网络拥塞、排队和无限延迟将会发生。因此，超时的“正确”值没有固定的标准，需要通过实验进行确定。

Unreliable Clocks

Clocks and time are important. Applications depend on clocks in various ways to answer questions like the following:

时钟和时间非常重要。应用程序以各种方式依赖于时钟来回答以下问题：

Has this request timed out yet?

这个请求超时了吗？
What’s the 99th percentile response time of this service?

这项服务的99分位响应时间是多少？
How many queries per second did this service handle on average in the last five minutes?

这项服务在过去五分钟内平均处理了多少个查询请求？
How long did the user spend on our site?

用户在我们的网站上花费了多长时间？
When was this article published?

这篇文章是什么时候发布的？
At what date and time should the reminder email be sent?

提醒邮件应该在什么日期和时间发送？
When does this cache entry expire?

这个缓存条目何时过期？
What is the timestamp on this error message in the log file?

这个错误信息在日志文件中的时间戳是什么？

Examples 1–4 measure durations (e.g., the time interval between a request being sent and a response being received), whereas examples 5–8 describe points in time (events that occur on a particular date, at a particular time).

例子1至4测量持续时间（例如，从发送请求到接收响应的时间间隔），而例子5至8描述时间点（在特定日期、特定时间发生的事件）。

In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another. The time when a message is received is always later than the time when it is sent, but due to variable delays in the network, we don’t know how much later. This fact sometimes makes it difficult to determine the order in which things happened when multiple machines are involved.

在分布式系统中，时间是一个棘手的问题，因为通信不是即时的：一条消息需要时间才能从一台机器传输到另一台机器。消息接收的时间总是晚于发送时间，但由于网络中的可变延迟，我们不知道晚了多少。这个事实有时会使得需要多台机器参与时难以确定事件发生的顺序。

Moreover, each machine on the network has its own clock, which is an actual hardware device: usually a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time, which may be slightly faster or slower than on other machines. It is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers [ 37 ]. The servers in turn get their time from a more accurate time source, such as a GPS receiver.

此外，网络上的每台机器都有自己的时钟，这是实际的硬件设备：通常是石英晶体振荡器。这些设备并不完全准确，因此每台机器都有自己的时间概念，可能比其他机器稍微快或慢一些。可以在某种程度上同步时钟：最常用的机制是网络时间协议（NTP），它允许计算机时钟根据一组服务器报告的时间进行调整[37]。这些服务器反过来从更精确的时间来源，如GPS接收器，获取它们的时间。

Monotonic Versus Time-of-Day Clocks

Modern computers have at least two different kinds of clocks: a time-of-day clock and a monotonic clock . Although they both measure time, it is important to distinguish the two, since they serve different purposes.

现代计算机至少有两种不同类型的时钟：一个是时间钟，另一个是单调钟。虽然它们都可以测量时间，但是将它们区分开来非常重要，因为它们有不同的用途。

Time-of-day clocks

A time-of-day clock does what you intuitively expect of a clock: it returns the current date and time according to some calendar (also known as wall-clock time ). For example, clock_gettime(CLOCK_REALTIME) on Linux ^v and System.currentTimeMillis() in Java return the number of seconds (or milliseconds) since the epoch : midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap seconds. Some systems use other dates as their reference point.

时钟可以做你所期望的事情：返回当前日期和时间，根据某种日历（也称为挂钟时间）。例如，在Linux上使用clock_gettime(CLOCK_REALTIME)以及在Java中使用System.currentTimeMillis()都会返回距离公元1970年1月1日UTC午夜（根据公历，不包括闰秒）的秒数（或毫秒数）。有些系统使用其他日期作为它们的参考点。

Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine (ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have various oddities, as described in the next section. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as the fact that they often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time [ 38 ].

时间戳通常与NTP同步，这意味着一个机器的时间戳尽可能地与另一个机器的时间戳相同。然而，时间戳也有一些奇怪的地方，如下一节所述。特别是，如果本地时钟比NTP服务器超前太多，它可能会被强制重置，并出现向回跳转到先前的时间点的情况。这些跳跃，以及它们经常忽略闰秒的事实，使时间戳不适合用于计量经过的时间。

Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward in steps of 10 ms on older Windows systems [ 39 ]. On recent systems, this is less of a problem.

时钟的计时精度过去历史上相当粗糙，例如，在旧版Windows系统中每次向前移动10毫秒。在最新系统上，这已经不是问题了。

Monotonic clocks

A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a service’s response time: clock_gettime(CLOCK_MONOTONIC) on Linux and System.nanoTime() in Java are monotonic clocks, for example. The name comes from the fact that they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).

单调时钟适用于测量持续时间（时间间隔），例如超时或服务的响应时间：例如在Linux上使用clock_gettime（CLOCK_MONOTONIC），在Java中使用System.nanoTime（）作为单调时钟。这个名字来源于它们保证始终向前运动（而日期时间时钟可能会向后跳转）。

You can check the value of the monotonic clock at one point in time, do something, and then check the clock again at a later time. The difference between the two values tells you how much time elapsed between the two checks. However, the absolute value of the clock is meaningless: it might be the number of nanoseconds since the computer was started, or something similarly arbitrary. In particular, it makes no sense to compare monotonic clock values from two different computers, because they don’t mean the same thing.

你可以在某个时间点检查单调时钟的值，然后执行某些操作，稍后再次检查时钟。两个检查之间的差值告诉您两个检查之间经过了多少时间。但是，时钟的绝对值没有意义：它可能是自计算机启动以来的纳秒数，或者类似的任意值。特别是，从两台不同计算机比较单调时钟值是没有意义的，因为它们表示的不是同一件事情。

On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not necessarily synchronized with other CPUs. Operating systems compensate for any discrepancy and try to present a monotonic view of the clock to application threads, even as they are scheduled across different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt [ 40 ].

在具有多个CPU插槽的服务器上，每个CPU可能有一个单独的定时器，不一定与其他CPU同步。操作系统会弥补任何差异，并尝试向应用线程呈现时钟的单调视图，即使它们跨越不同的CPU调度。但是，明智的做法是将此单调性保证略微怀疑[40]。

NTP may adjust the frequency at which the monotonic clock moves forward (this is known as slewing the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP server. By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward. The resolution of monotonic clocks is usually quite good: on most systems they can measure time intervals in microseconds or less.

NTP可以调整单调时钟向前移动的频率（这被称为时钟校准），如果它检测到计算机的本地石英晶体比NTP服务器运动得更快或更慢。默认情况下，NTP允许将时钟速度加速或减速最多0.05％，但是NTP无法使单调时钟向前或向后跳跃。单调时钟的分辨率通常很好：在大多数系统上，它们可以测量微秒或更短的时间间隔。

In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is not sensitive to slight inaccuracies of measurement.

在分布式系统中，使用单调时钟来测量经过的时间（例如，超时）通常是可以接受的，因为它不假设不同节点的时钟之间有任何同步，且对测量的轻微不准确性不敏感。

Clock Synchronization and Accuracy

Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful. Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:

单调时钟不需要同步，但是时钟需要按照NTP服务器或其他外部时间源设置，以便有用。不幸的是，我们校准时钟的方法并不像您希望的那样可靠或准确 - 硬件时钟和NTP可能会出现问题。举几个例子：

The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine. Google assumes a clock drift of 200 ppm (parts per million) for its servers [ 41 ], which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best possible accuracy you can achieve, even if everything is working correctly.

电脑中的石英钟不太准确：它会漂移（比应该的快或慢）。时钟漂移取决于设备的温度。谷歌的服务器假定钟漂移为200个百万分之一（ppm）[41]，相当于每30秒重新与服务器同步一次的时钟漂移为6毫秒，或每天重新同步一次的时钟漂移为17秒。即使一切正常，此漂移也限制了您能够实现的最佳精度。
If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset [ 37 ]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.

如果电脑时钟与NTP服务器差距太大，它可能会拒绝同步，或者本地时钟会被强制重置[37]。在此重置之前和之后观察时间的任何应用程序可能会看到时间倒流或突然向前跳跃。
If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time. Anecdotal evidence suggests that this does happen in practice.

如果某个节点意外地被防火墙隔离在NTP服务器之外，这种配置错误可能会被忽视一段时间。个人经验表明，实际上确实会发生这种情况。
NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays. One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet [ 42 ], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely.

NTP同步的准确度受网络延迟的影响，当网络拥塞且包传输时间变化时，其准确度将受到限制。一项实验表明，通过互联网同步可以达到最小误差为35毫秒的精度[42]，但是偶尔遇到的网络延迟峰值会导致误差大约为一秒。根据配置，大的网络延迟可能会导致NTP客户端完全放弃。
Some NTP servers are wrong or misconfigured, reporting time that is off by hours [ 43 , 44 ]. NTP clients are quite robust, because they query several servers and ignore outliers. Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you were told by a stranger on the internet.

有一些NTP服务器的时间是错误的或者配置不当，与实际时间相差数小时[43, 44]。 NTP客户端非常强大，因为它们会查询多个服务器并忽略异常值。尽管如此，但是将系统的正确性放在来自互联网上的陌生人的时间上还是令人担忧的。
Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing assumptions in systems that are not designed with leap seconds in mind [ 45 ]. The fact that leap seconds have crashed many large systems [ 38 , 46 ] shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second adjustment gradually over the course of a day (this is known as smearing ) [ 47 , 48 ], although actual NTP server behavior varies in practice [ 49 ].

闰秒会导致一分钟长为59秒或61秒，这会影响那些没有考虑闰秒的系统的时间假设。事实上，闰秒已经让许多大型系统崩溃，显示了时钟方面的错误假设多么容易潜入系统中。最好的处理闰秒的方法可能是让NTP服务器“撒谎”，在一天的时间内逐渐进行闰秒调整（这被称为扩散），尽管实际的NTP服务器行为因实践而异。
In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [ 50 ]. When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward [ 26 ].

在虚拟机中，硬件时钟被虚拟化，这给需要准确计时的应用程序带来了额外的挑战[50]。当一个CPU核心在虚拟机之间共享时，每个虚拟机都会暂停数十毫秒，而另一个虚拟机正在运行。从应用程序的角度来看，这种暂停表现为时钟突然向前跳动[26]。
If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you probably cannot trust the device’s hardware clock at all. Some users deliberately set their hardware clock to an incorrect date and time, for example to circumvent timing limitations in games. As a result, the clock might be set to a time wildly in the past or the future.

如果您在未完全控制的设备上运行软件（例如移动设备或嵌入式设备），您可能无法完全信任设备的硬件时钟。一些用户会故意将其硬件时钟设置为不正确的日期和时间，例如为了规避游戏中的时间限制。因此，时钟可能被设置为过去或未来的时间。

It is possible to achieve very good clock accuracy if you care about it sufficiently to invest significant resources. For example, the MiFID II draft European regulation for financial institutions requires all high-frequency trading funds to synchronize their clocks to within 100 microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help detect market manipulation [ 51 ].

如果你足够重视，愿意投入更多的资源，就有可能实现非常好的时钟精度。例如，欧洲金融机构的《MiFID II》草案要求所有高频交易基金将时钟同步到离协调世界时（UTC）100微秒以内，以帮助调试市场异常，例如“闪崩”，并帮助检测市场操纵[51]。

Such accuracy can be achieved using GPS receivers, the Precision Time Protocol (PTP) [ 52 ], and careful deployment and monitoring. However, it requires significant effort and expertise, and there are plenty of ways clock synchronization can go wrong. If your NTP daemon is misconfigured, or a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.

通过使用GPS接收器、精确时间协议（PTP）[52]和仔细的部署和监测可以达到这种准确度。然而，这需要大量的努力和专业技能，并且时钟同步有很多出错的可能。如果您的NTP守护程序配置不正确，或防火墙阻止NTP流量，时钟漂移引起的时钟误差可能会迅速变大。

Relying on Synchronized Clocks

The problem with clocks is that while they seem simple and easy to use, they have a surprising number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward in time, and the time on one node may be quite different from the time on another node.

时钟的问题在于它们看起来简单易用，但实际上却有很多难以预料的缺陷：一天可能并非恰好有86,400秒，时钟可能会倒退时间，而且一个节点上的时间可能与另一个节点上的时间相差很大。

Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though networks are well behaved most of the time, software must be designed on the assumption that the network will occasionally be faulty, and the software must handle such faults gracefully. The same is true with clocks: although they work quite well most of the time, robust software needs to be prepared to deal with incorrect clocks.

在本章早些时候，我们讨论了网络丢包和任意延迟数据包的问题。尽管网络大多数时候都表现良好，但软件必须在假设网络偶尔出现故障的基础上进行设计，并且软件必须能够优雅地处理这些故障。同样的情况也适用于时钟：尽管大多数时候时钟工作正常，但强健的软件需要准备好处理不正确的时钟。

Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash [ 53 , 54 ].

问题的一部分在于不正确的时钟很容易被忽略。如果机器的 CPU 有缺陷或其网络配置不正确，它很可能根本就不能工作，因此很快就会被发现并进行修复。另一方面，如果它的石英钟有缺陷或其 NTP 客户端配置不正确，大多数东西看起来似乎运行良好，尽管其时钟逐渐偏离现实。如果某些软件依靠准确同步的时钟，结果更可能是静默和微妙的数据丢失，而不是显着的崩溃[53、54]。

Thus, if you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead and removed from the cluster. Such monitoring ensures that you notice the broken clocks before they can cause too much damage.

因此，如果您使用需要同步时钟的软件，必须仔细监控所有机器之间的时钟偏移量。任何时钟偏离其他机器太远的节点都应被声明为失效并从集群中删除。这种监控确保您在时钟出现问题之前就能注意到，避免造成过度损伤。

Timestamps for ordering events

Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks: ordering of events across multiple nodes. For example, if two clients write to a distributed database, who got there first? Which write is the more recent one?

考虑一种情况，依赖于时钟会很诱人，但也很危险：多个节点之间事件的排序。例如，如果两个客户端写入一个分布式数据库，谁先到达那里？哪个写入是最新的？

Figure 8-3 illustrates a dangerous use of time-of-day clocks in a database with multi-leader replication (the example is similar to Figure 5-9 ). Client A writes x = 1 on node 1; the write is replicated to node 3; client B increments x on node 3 (we now have x = 2); and finally, both writes are replicated to node 2.

图 8-3 展示了在具有多主复制的数据库中使用时间戳时可能出现的危险情况（这个例子与图 5-9 类似）。客户端 A 在节点 1 上写入 x = 1；写入操作被复制到节点 3；客户端 B 在节点 3 上增加 x，使得 x = 2；最后，两个写入操作都被复制到节点 2。图 8-3 展示了在具有多主复制的数据库中使用时间戳时可能出现的危险情况（这个例子与图 5-9 类似）。客户端 A 在节点 1 上写入 x = 1；写入操作被复制到节点 3；客户端 B 在节点 3 上增加 x，使得 x = 2；最后，两个写入操作都被复制到节点 2。

In Figure 8-3 , when a write is replicated to other nodes, it is tagged with a timestamp according to the time-of-day clock on the node where the write originated. The clock synchronization is very good in this example: the skew between node 1 and node 3 is less than 3 ms, which is probably better than you can expect in practice.

在图8-3中，当写操作被复制到其他节点时，它会被标记上时间戳，该时间戳基于写操作源节点的当日时间。在此示例中，时钟同步非常好：节点1和节点3之间的偏差少于3毫秒，这可能比实际情况要好。

Nevertheless, the timestamps in Figure 8-3 fail to order the events correctly: the write x = 1 has a timestamp of 42.004 seconds, but the write x = 2 has a timestamp of 42.003 seconds, even though x = 2 occurred unambiguously later. When node 2 receives these two events, it will incorrectly conclude that x = 1 is the more recent value and drop the write x = 2. In effect, client B’s increment operation will be lost.

然而，图8-3中的时间戳未能正确排序事件：写入x = 1的时间戳为42.004秒，但写入x = 2的时间戳为42.003秒，尽管x = 2显然后发生。当节点2接收到这两个事件时，它将错误地得出x = 1是更近的值并删除写入x = 2。实际上，客户端B的增量操作将丢失。

This conflict resolution strategy is called last write wins (LWW), and it is widely used in both multi-leader replication and leaderless databases such as Cassandra [ 53 ] and Riak [ 54 ] (see “Last write wins (discarding concurrent writes)” ). Some implementations generate timestamps on the client rather than the server, but this doesn’t change the fundamental problems with LWW:

这种冲突解决策略被称为最后一次写入获胜(LWW)，它被广泛用于多领导者复制和无领导者数据库，比如Cassandra[53]和Riak[54](详见“最后一次写入获胜(丢弃并发写入)”）。一些实现会在客户端而非服务器上生成时间戳，但这不会改变LWW的根本问题：

Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [ 54 , 55 ]. This scenario can cause arbitrary amounts of data to be silently dropped without any error being reported to the application.

数据库写入可能会神秘消失：当一个节点的时钟落后时，它无法覆盖先前由具有快速时钟的节点写入的值，直到节点间的时钟偏差已经过去。此情况可能导致任意数量的数据被无声丢弃，而应用程序不会报告任何错误。
LWW cannot distinguish between writes that occurred sequentially in quick succession (in Figure 8-3 , client B’s increment definitely occurs after client A’s write) and writes that were truly concurrent (neither writer was aware of the other). Additional causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality (see “Detecting Concurrent Writes” ).

LWW不能区分在快速连续发生的顺序写（在图8-3中，客户端B的递增明显是在客户端A的写之后发生的）和真正并发的写（两个写者都不知道对方的写）。需要额外的因果跟踪机制，如版本向量，才能防止因果关系的违规（请参阅“检测并发写入”）。
It is possible for two nodes to independently generate writes with the same timestamp, especially when the clock only has millisecond resolution. An additional tiebreaker value (which can simply be a large random number) is required to resolve such conflicts, but this approach can also lead to violations of causality [ 53 ].

两个节点独立生成相同时间戳的写入是有可能的，特别是当时钟仅具有毫秒分辨率时。需要一个额外的决定胜负的值（可以简单地是一个大的随机数）来解决这种冲突，但这种方法也可能导致因果关系的违反。[53]。

Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and discarding others, it’s important to be aware that the definition of “recent” depends on a local time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet arrived before it was sent, which is impossible.

因此，尽管通过保留最“近期”的数值并丢弃其他数值来解决冲突是很诱人的，但重要的是要意识到“近期”定义取决于本地的时间时钟，可能是不正确的。即使使用紧密NTP同步的时钟，您可能会在时间戳100毫秒（根据发件人的时钟）发送数据包，并在时间戳99毫秒（根据收件人的时钟）到达，因此似乎该数据包在发送之前就已到达，这是不可能的。

Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur? Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip time, in addition to other sources of error such as quartz drift. For correct ordering, you would need the clock source to be significantly more accurate than the thing you are measuring (namely network delay).

NTP同步能否精确到无法发生错误排列的程度？可能不行，因为NTP的同步精度本身受到网络往返时间的限制，除了石英漂移等其他误差来源。为了正确排序，您需要的时钟源比您要测量的东西（即网络延迟）更准确。

So-called logical clocks [ 56 , 57 ], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events (see “Detecting Concurrent Writes” ). Logical clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one event happened before or after another). In contrast, time-of-day and monotonic clocks, which measure actual elapsed time, are also known as physical clocks . We’ll look at ordering a bit more in “Ordering Guarantees” .

所谓的逻辑时钟[56, 57]，它们基于递增计数器而不是振荡的石英晶体，是对于事件排序的一种安全替代方法（见“检测并发写入”）。逻辑时钟不测量日期或经过的秒数，只测量事件的相对顺序（一个事件在另一个事件之前或之后发生）。相反，测量实际经过时间的时钟，如当天时间和单调时钟，也被称为物理时钟。我们将在“排序保证”中更详细地讨论排序。

Clock readings have a confidence interval

You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with an NTP server on the local network every minute. With an NTP server on the public internet, the best possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over 100 ms when there is network congestion [ 57 ].

您可以使用微秒甚至纳秒的分辨率读取机器的时钟。但即使您可以获得如此精细的测量值，也并不意味着该值实际上具有如此高的精度。事实上，很可能不是这样的。正如以前提到的那样，即使您每分钟都使用本地网络上的NTP服务器进行同步，精度不准确的石英钟的漂移也很容易达到几毫秒。在公共互联网上使用NTP服务器，最好的可能精度可能只有几十毫秒，并且在网络拥塞时错误可能轻易飙升到100毫秒以上[57]。

Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [ 58 ]. If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.

因此，将时钟读数视为时间点是没有意义的，它更像是一个时间范围，处于置信区间内：例如，系统可能有95％的信心现在的时间在分钟过去的10.3和10.5秒之间，但它并不比那更精确[58]。如果我们只知道时间+/- 100毫秒，则时间戳中的微秒数字基本上是没有意义的。

The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or atomic (caesium) clock directly attached to your computer, the expected error range is reported by the manufacturer. If you’re getting the time from a server, the uncertainty is based on the expected quartz drift since your last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to the server (to a first approximation, and assuming you trust the server).

不确定性范围可以根据您的时间源进行计算。如果您的计算机直接连接了GPS接收器或原子（铯）钟，制造商会报告预期的误差范围。如果您从服务器获取时间，则不确定性是基于您上次与服务器同步以来预期的石英漂移，加上NTP服务器的不确定性，加上到服务器的网络往返时间（作为初步估计，并假定您信任该服务器）。

Unfortunately, most systems don’t expose this uncertainty: for example, when you call clock_gettime() , the return value doesn’t tell you the expected error of the timestamp, so you don’t know if its confidence interval is five milliseconds or five years.

不幸的是，大多数系统没有暴露这种不确定性：例如，当您调用 clock_gettime() 时，返回值不会告诉您时间戳的预期误差，因此您不知道其置信区间是五毫秒还是五年。

An interesting exception is Google’s TrueTime API in Spanner [ 41 ], which explicitly reports the confidence interval on the local clock. When you ask it for the current time, you get back two values: [ earliest , latest ] , which are the earliest possible and the latest possible timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is somewhere within that interval. The width of the interval depends, among other things, on how long it has been since the local quartz clock was last synchronized with a more accurate clock source.

Google的Spanner中有一个有趣的例外是TrueTime API，它明确报告本地时钟的置信区间。当你询问它当前的时间时，你会得到两个值：[最早的，最晚的]，它们是可能的最早和最晚时间戳。根据其不确定性计算，时钟知道实际当前时间在该区间内的某个位置。区间的宽度取决于许多因素，包括本地石英钟上一次与更准确的时钟源同步的时间。

Synchronized clocks for global snapshots

In “Snapshot Isolation and Repeatable Read” we discussed snapshot isolation , which is a very useful feature in databases that need to support both small, fast read-write transactions and large, long-running read-only transactions (e.g., for backups or analytics). It allows read-only transactions to see the database in a consistent state at a particular point in time, without locking and interfering with read-write transactions.

在“快照隔离与可重复读”一章中，我们讨论了快照隔离，它是数据库中非常有用的特性，支持小型，快速的读写事务和大型，长时间运行的只读事务（例如备份或分析）。它使只读事务在不需要锁定和干扰读写事务的情况下，在特定时间点上以一致的状态查看数据库。

The most common implementation of snapshot isolation requires a monotonically increasing transaction ID. If a write happened later than the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for generating transaction IDs.

最常见的快照隔离实现需要单调递增的事务ID。如果写操作发生在快照之后（即，写操作的事务ID大于快照的事务ID），那么该写操作对快照事务是不可见的。在单节点数据库中，一个简单的计数器就足以生成事务ID。

However, when a database is distributed across many machines, potentially in multiple datacenters, a global, monotonically increasing transaction ID (across all partitions) is difficult to generate, because it requires coordination. The transaction ID must reflect causality: if transaction B reads a value that was written by transaction A, then B must have a higher transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid transactions, creating transaction IDs in a distributed system becomes an untenable bottleneck. ^vi

然而，当数据库分布在许多机器上，可能在多个数据中心，生成一个全局的、单调递增的事务ID（跨所有分区）是困难的，因为它需要协调。事务ID必须反映因果关系：如果事务B读取了事务A写入的值，那么B必须具有比A更高的事务ID，否则快照将不一致。对于大量小的、快速的事务，在分布式系统中创建事务ID变得不可行。

Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get the synchronization good enough, they would have the right properties: later transactions have a higher timestamp. The problem, of course, is the uncertainty about clock accuracy.

我们能否使用同步计时钟的时间戳作为交易ID？如果我们能够把同步做好，它们会具有正确的特性：较晚的交易具有较高的时间戳。当然，问题在于关于时钟精度的不确定性。

Spanner implements snapshot isolation across datacenters in this way [ 59 , 60 ]. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation: if you have two confidence intervals, each consisting of an earliest and latest possible timestamp ( A = [ A _earliest , A _latest ] and B = [ B _earliest , B _latest ]), and those two intervals do not overlap (i.e., A _earliest < A _latest < B _earliest < B _latest ), then B definitely happened after A—there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.

Spanner在数据中心之间实现了快照隔离，其方法如下：它使用TrueTime API报告的时钟置信度间隔，并基于以下观察：如果您有两个置信区间，每个区间都由最早和最晚可能的时间戳（A = [Aearliest，Alatest]和B = [Bearliest，Blatest]）组成，并且这两个区间不重叠（即Aearliest < Alatest < Bearliest < Blatest），则B肯定发生在A之后 - 完全没有疑问。仅当区间重叠时，我们才不确定A和B的顺序。

In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [ 41 ].

为了确保交易时间戳反映因果关系，Spanner故意在提交读写事务之前等待置信区间的长度。通过这样做，确保任何可能读取数据的事务的时间足够晚，使它们的置信区间不重叠。为了尽可能缩短等待时间，Spanner需要尽可能减小时钟不确定性；为此，Google在每个数据中心部署GPS接收器或原子钟，使时钟同步在约7毫秒内[41]。

Using clock synchronization for distributed transaction semantics is an area of active research [ 57 , 61 , 62 ]. These ideas are interesting, but they have not yet been implemented in mainstream databases outside of Google.

使用时钟同步来传播事务语义是一个活跃的研究领域[57，61，62]。这些想法很有趣，但是除了Google以外，它们尚未在主流数据库中得到实现。

Process Pauses

Let’s consider another example of dangerous clock use in a distributed system. Say you have a database with a single leader per partition. Only the leader is allowed to accept writes. How does a node know that it is still leader (that it hasn’t been declared dead by the others), and that it may safely accept writes?

让我们考虑分布式系统中另一个危险的时钟使用例子。假设你有一个数据库，每个分区只有一个领导者。只有领导者可以接受写入。一个节点如何知道它仍然是领导者(它没有被其他节点宣布死亡)，并且它可以安全地接受写入呢？

One option is for the leader to obtain a lease from the other nodes, which is similar to a lock with a timeout [ 63 ]. Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease, so another node can take over when it expires.

一种选择是领导者从其他节点获得租赁，类似于带有超时的锁[63]。每次只能有一个节点持有租赁 - 因此，当一个节点获得租赁时，它知道自己在一段时间内是领导者，直到租赁到期。为了保持领导地位，节点必须在其到期之前定期更新租赁。如果节点失败，则停止更新租赁，因此在其到期时另一个节点可以接管。

You can imagine the request-handling loop looking something like this:

你可以想象请求处理循环大概长这样：

while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}

What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the lease is set by a different machine (where the expiry may be calculated as the current time plus 30 seconds, for example), and it’s being compared to the local system clock. If the clocks are out of sync by more than a few seconds, this code will start doing strange things.

这段代码有什么问题？首先，它依赖于同步的时钟：租约的到期时间是由另一台机器设置的（到期时间可能被计算为当前时间加30秒），并且与本地系统时钟进行比较。如果时钟相差超过几秒钟，这段代码就会开始出现奇怪的问题。

Secondly, even if we change the protocol to only use the local monotonic clock, there is another problem: the code assumes that very little time passes between the point that it checks the time ( System.currentTimeMillis() ) and the time when the request is processed ( process(request) ). Normally this code runs very quickly, so the 10 second buffer is more than enough to ensure that the lease doesn’t expire in the middle of processing a request.

其次，即使我们将协议更改为仅使用本地单调时钟，还有另一个问题：代码假定在检查时间（System.currentTimeMillis()）和处理请求（process(request)）之间经过了很少的时间。通常，此代码运行速度非常快，因此10秒缓冲区足以确保租约不会在处理请求的过程中过期。

However, what if there is an unexpected pause in the execution of the program? For example, imagine the thread stops for 15 seconds around the line lease.isValid() before finally continuing. In that case, it’s likely that the lease will have expired by the time the request is processed, and another node has already taken over as leader. However, there is nothing to tell this thread that it was paused for so long, so this code won’t notice that the lease has expired until the next iteration of the loop—by which time it may have already done something unsafe by processing the request.

然而，如果程序执行中出现了意外的暂停怎么办？例如，在线路lease.isValid()附近的线程停止了15秒，最终才继续执行。在这种情况下，租约很可能在请求被处理时已经过期，并且另一个节点已经成为了领导者。然而，对于该线程来说，没有任何东西能告诉它它被暂停了这么久，因此这段代码在下一次循环之前不会意识到租约已经过期，到那时它可能已经通过处理请求做了一些不安全的事情。

Is it crazy to assume that a thread might be paused for so long? Unfortunately not. There are various reasons why this could happen:

这么假定一个线程可能会暂停这么长时间是疯狂的吗？不幸的是不是。有各种原因会导致这种情况发生：

Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads. These “stop-the-world” GC pauses have sometimes been known to last for several minutes [ 64 ]! Even so-called “concurrent” garbage collectors like the HotSpot JVM’s CMS cannot fully run in parallel with the application code—even they need to stop the world from time to time [ 65 ]. Although the pauses can often be reduced by changing allocation patterns or tuning GC settings [ 66 ], we must assume the worst if we want to offer robust guarantees.

许多编程语言的运行时环境（比如Java虚拟机）都有垃圾回收器（GC），需要偶尔停止所有正在运行的线程。这些“停止世界”GC暂停有时知道持续几分钟！即使是所谓的“并发”垃圾回收器，如HotSpot JVM的CMS，也不能完全与应用程序代码并行运行，即它们也需要定期停止工作。虽然通过改变分配模式或调整GC设置经常可以减少暂停，但如果我们想要提供强大的保证，我们必须假设最坏情况。
In virtualized environments, a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed (restoring the contents of memory and continuing execution). This pause can occur at any time in a process’s execution and can last for an arbitrary length of time. This feature is sometimes used for live migration of virtual machines from one host to another without a reboot, in which case the length of the pause depends on the rate at which processes are writing to memory [ 67 ].

在虚拟化环境中，虚拟机可以被暂停（暂停所有进程执行并将内存内容保存到磁盘）和恢复（恢复存储在内存中的内容并继续执行）。此暂停可以发生在进程执行的任何时间，并且可以持续任意长的时间。该功能有时用于虚拟机的实时迁移，无需重新启动，此时暂停的长度取决于进程写入内存的速率[67]。
On end-user devices such as laptops, execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop.

在终端用户设备上，例如笔记本电脑，执行也可以任意暂停和恢复，例如当用户关闭笔记本电脑盖子时。
When the operating system context-switches to another thread, or when the hypervisor switches to a different virtual machine (when running in a virtual machine), the currently running thread can be paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time . If the machine is under heavy load—i.e., if there is a long queue of threads waiting to run—it may take some time before the paused thread gets to run again.

当操作系统切换到另一个线程时，或者当虚拟机中的运行时hypervisor切换到另一个虚拟机时，当前正在运行的线程可以在代码的任意点暂停。在虚拟机中，CPU在其他虚拟机中所花费的时间称为被窃取时间。如果机器负载很重，即有很长的线程等待运行的队列，那么在暂停的线程再次运行之前可能需要一些时间。
If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete [ 68 ]. In many languages, disk access can happen surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution. I/O pauses and GC pauses may even conspire to combine their delays [ 69 ]. If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays [ 29 ].

如果应用程序执行同步磁盘访问，则线程可能会因等待缓慢的磁盘I/O操作而暂停[68]。在许多语言中，即使代码没有明确提到文件访问，磁盘访问也可能会突然发生 - 例如，Java类装入器会在第一次使用类文件时惰性地加载它们，这可能会在程序执行期间的任何时候发生。I/O暂停和GC暂停甚至可能合并它们的延迟[69]。如果磁盘实际上是网络文件系统或网络块设备（例如Amazon的EBS），则I/O延迟还受网络延迟变化的影响[29]。
If the operating system is configured to allow swapping to disk ( paging ), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. If memory pressure is high, this may in turn require a different page to be swapped out to disk. In extreme circumstances, the operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (this is known as thrashing ). To avoid this problem, paging is often disabled on server machines (if you would rather kill a process to free up memory than risk thrashing).

如果操作系统配置为允许交换到磁盘（分页），一个简单的内存访问可能导致一个页面错误，需要将一个页面从磁盘加载到内存。在这个缓慢的I / O操作发生时，线程会暂停。如果内存压力很大，这可能需要将不同的页面交换到磁盘。在极端情况下，操作系统可能会花费大部分时间将页面从内存中换入和换出，并且实际上完成很少的工作（这称为抖动）。为了避免这个问题，服务器机器上通常禁用分页（如果您宁愿杀死一个进程来释放内存而不是冒险抖动）。
A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell. This signal immediately stops the process from getting any more CPU cycles until it is resumed with SIGCONT , at which point it continues running where it left off. Even if your environment does not normally use SIGSTOP , it might be sent accidentally by an operations engineer.

Unix进程可以通过发送SIGSTOP信号暂停，例如在shell中按下Ctrl-Z。此信号立即停止进程从获取更多的CPU周期，直到使用SIGCONT恢复为止，此时它将在离开的地方继续运行。即使您的环境通常不使用SIGSTOP，操作工程师可能会意外发送它。

All of these occurrences can preempt the running thread at any point and resume it at some later time, without the thread even noticing. The problem is similar to making multi-threaded code on a single machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and parallelism may occur.

所有这些情况都可能在任何时候抢占正在运行的线程，并在稍后的时间恢复它，而线程甚至都不会注意到。这个问题类似于在单个机器上编写多线程代码的线程安全性：您不能假设任何有关计时的事情，因为任意的上下文切换和并行性可能会发生。

When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don’t directly translate to distributed systems, because a distributed system has no shared memory—only messages sent over an unreliable network.

在单机上编写多线程代码时，我们有相当不错的工具可以使其线程安全：互斥锁、信号量、原子计数器、无锁数据结构、阻塞队列等。不幸的是，这些工具无法直接转换为分布式系统，因为分布式系统没有共享内存，只有通过不可靠网络发送的消息。

A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. During the pause, the rest of the world keeps moving and may even declare the paused node dead because it’s not responding. Eventually, the paused node may continue running, without even noticing that it was asleep until it checks its clock sometime later.

分布式系统中的一个节点必须假定其执行可以在任何时候暂停很长一段时间，甚至在函数中间。在暂停期间，其余世界会继续前进，甚至可能宣布暂停的节点已经死亡，因为它没有响应。最终，暂停的节点可能会继续运行，并且甚至在稍后检查其时钟时也没有注意到它睡着了。

Response time guarantees

In many programming languages and operating systems, threads and processes may pause for an unbounded amount of time, as discussed. Those reasons for pausing can be eliminated if you try hard enough.

在许多编程语言和操作系统中，线程和进程可能会无限期地暂停，如上文所述。如果你努力尝试，这些暂停的原因可以被消除。

Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs. In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system. These are so-called hard real-time systems.

一些软件运行在可能导致严重损坏的环境中：控制飞机、火箭、机器人、汽车和其他物理对象的计算机必须快速而可预测地响应它们的传感器输入。在这些系统中，软件必须在指定的截止日期前做出响应；如果没有遇到截止日期，可能会导致整个系统的失败。这些是所谓的硬实时系统。

Is real-time really real?

In embedded systems, real-time means that a system is carefully designed and tested to meet specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the term real-time on the web, where it describes servers pushing data to clients and stream processing without hard response time constraints (see Chapter 11 ).

在嵌入式系统中，实时意味着系统经过精心设计和测试，以满足所有情况下的指定时间保证。这个意义与网络上更模糊的实时术语有所区别，在这里它描述的是服务器向客户端推送数据和流处理而没有严格的响应时间约束(见第11章)。

For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag release system.

例如，如果您汽车上的传感器检测到您正在经历一场车祸，您不希望由于空袭释放系统中不恰当的GC暂停而延迟气囊的释放。

Providing real-time guarantees in a system requires support from all levels of the software stack: a real-time operating system (RTOS) that allows processes to be scheduled with a guaranteed allocation of CPU time in specified intervals is needed; library functions must document their worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely (real-time garbage collectors exist, but the application must still ensure that it doesn’t give the GC too much work to do); and an enormous amount of testing and measurement must be done to ensure that guarantees are being met.

提供系统实时保障需要软件堆栈的所有层面的支持：需要实时操作系统（RTOS）用于允许进程在指定的时间间隔内安排具有保证的CPU时间分配；库函数必须记录它们的最坏情况执行时间；动态内存分配可能会受到限制或完全禁止（实时垃圾收集器存在，但应用程序仍必须确保不要给GC太多工作要做）；必须进行大量的测试和测量，以确保满足保证。

All of this requires a large amount of additional work and severely restricts the range of programming languages, libraries, and tools that can be used (since most languages and tools do not provide real-time guarantees). For these reasons, developing real-time systems is very expensive, and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to prioritize timely responses above all else (see also “Latency and Resource Utilization” ).

所有的这些都需要大量的额外工作并严重限制可以使用的编程语言、库和工具的范围（因为大多数语言和工具不提供实时保证）。由于这些原因，开发实时系统非常昂贵，并且它们在安全关键的嵌入式设备中最常用。此外，“实时”不同于“高性能”——实际上，实时系统的吞吐量可能更低，因为它们必须把及时响应放在首位（还请参阅 “延迟和资源利用”）。

For most server-side data processing systems, real-time guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.

对于大多数服务器端数据处理系统而言，实时保障并不经济或合适。因此，这些系统必须忍受在非实时环境下操作所带来的暂停和时钟不稳定性。

Limiting the impact of garbage collection

The negative effects of process pauses can be mitigated without resorting to expensive real-time scheduling guarantees. Language runtimes have some flexibility around when they schedule garbage collections, because they can track the rate of object allocation and the remaining free memory over time.

处理暂停的负面影响可以通过不采用昂贵的实时调度保证来减轻。语言运行时在调度垃圾收集时有一定的灵活性，因为它们可以跟踪对象分配速率以及随时间剩余的空闲内存。

An emerging idea is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage. If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles of response time [ 70 , 71 ]. Some latency-sensitive financial trading systems [ 72 ] use this approach.

一个新兴的想法是将GC暂停视为节点的短暂计划停机，并让其他节点处理客户端请求，而其中一个节点正在收集垃圾。如果运行时可以警告应用程序某个节点即将需要GC暂停，应用程序可以停止向该节点发送新请求，等待其完成处理未完成的请求，然后在没有正在进行的请求时执行GC。这种技巧可将GC暂停隐藏起来，从而减少响应时间的高百分位数[70, 71]。一些具有延迟敏感的金融交易系统[72]使用了这种方法。

A variant of this idea is to use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically, before they accumulate enough long-lived objects to require a full GC of long-lived objects [ 65 , 73 ]. One node can be restarted at a time, and traffic can be shifted away from the node before the planned restart, like in a rolling upgrade (see Chapter 4 ).

一种此想法的变体是仅将垃圾收集器用于短寿命对象（这些对象易于收集），并定期重新启动进程，以防止它们积累足以需要长寿命对象的完整垃圾回收 [65，73]。可以逐个重新启动节点，并在计划的重新启动之前将流量从节点中转移，就像滚动升级一样（请参阅第4章）。

These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their impact on the application.

这些措施不能完全预防垃圾收集暂停，但它们可以有用地减少其对应用程序的影响。

Knowledge, Truth, and Lies

So far in this chapter we have explored the ways in which distributed systems are different from programs running on a single computer: there is no shared memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.

到目前为止，在这一章中，我们已经探讨了分布式系统与在单台计算机上运行的程序的不同之处：没有共享内存，只有通过不可靠的具有可变延迟的网络进行消息传递，而且系统可能会受到部分故障、不可靠的时钟和处理暂停的影响。

The consequences of these issues are profoundly disorienting if you’re not used to distributed systems. A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network. A node can only find out what state another node is in (what data it has stored, whether it is correctly functioning, etc.) by exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.

如果你不习惯分布式系统，这些问题的后果会给你带来深深的迷茫。网络中的节点无法确定任何事情 - 它只能根据接收到的消息（或未收到的消息）来做猜测。节点只能通过与其他节点交换信息才能了解另一个节点的状态（它存储的数据，是否正常运行等）。如果远程节点没有响应，就无法知道它处于什么状态，因为网络中的问题不能可靠地区分是节点的问题还是网络的问题。

Discussions of these systems border on the philosophical: What do we know to be true or false in our system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are unreliable? Should software systems obey the laws that we expect of the physical world, such as cause and effect?

这些系统的讨论涉及哲学问题：我们在系统中知道什么是真的或假的？如果感知和测量机制不可靠，我们对这种知识有多少把握？软件系统应该遵守我们期望物理世界的法则，比如因果关系吗？

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model ) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.

幸运的是，我们不需要深究生命的意义。在分布式系统中，我们可以陈述我们对行为的假设（系统模型），并设计实际系统以满足这些假设。算法可以被证明在某个系统模型内正确运行。这意味着可靠的行为是可以实现的，即使底层的系统模型提供的保证非常少。

However, although it is possible to make software well behaved in an unreliable system model, it is not straightforward to do so. In the rest of this chapter we will further explore the notions of knowledge and truth in distributed systems, which will help us think about the kinds of assumptions we can make and the guarantees we may want to provide. In Chapter 9 we will proceed to look at some examples of distributed systems, algorithms that provide particular guarantees under particular assumptions.

然而，尽管在不可靠的系统模型中使软件行为良好是可能的，但这并不是一件简单的事情。在本章的其余部分中，我们将进一步探讨分布式系统中知识和真相的概念，这将帮助我们考虑我们可以做出什么样的假设以及我们希望提供什么样的保证。在第9章中，我们将继续研究一些分布式系统的示例和算法，它们可以在特定假设下提供特定的保证。

The Truth Is Defined by the Majority

Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but any outgoing messages from that node are dropped or delayed [ 19 ]. Even though that node is working perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the funeral procession continues with stoic determination.

想像一下一个有不对称故障的网络：一个节点能够接收到对它发送的所有消息，但是从该节点发送的任何消息都被丢弃或延迟[19]。即使该节点运行得非常好，而且正在接收其他节点的请求，但其他节点无法听到其响应。经过一段时间后，其他节点会宣布它死亡，因为他们没有从该节点那里听到消息。这种情况就像一场噩梦一样展开：这个半断开的节点被拖入坟墓，大喊着“我还没死！”——但由于没有人能听到它的尖叫声，葬礼队伍会义无反顾地继续前行。

In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it is sending are not being acknowledged by other nodes, and so realize that there must be a fault in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the semi-disconnected node cannot do anything about it.

在一个稍微不那么可怕的情况下，半断开的节点可能会注意到它发送的消息未被其他节点确认，因此意识到网络中一定存在故障。然而，其他节点错误地宣布该节点死亡，而半断开的节点无法对此做出任何反应。

As a third scenario, imagine a node that experiences a long stop-the-world garbage collection pause. All of the node’s threads are preempted by the GC and paused for one minute, and consequently, no requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and eventually declare the node dead and load it onto the hearse. Finally, the GC finishes and the node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting with bystanders. At first, the GCing node doesn’t even realize that an entire minute has passed and that it was declared dead—from its perspective, hardly any time has passed since it was last talking to the other nodes.

作为第三种情景，想象一下一个节点经历了一次长时间的停机垃圾收集暂停。所有该节点的线程都被垃圾收集器抢占并暂停了一分钟，因此，没有任何请求被处理和响应被发送。其他节点等待、重试、变得不耐烦，并最终宣布该节点已经死亡并将其装入灵车。最后，垃圾收集完成，该节点的线程继续执行，就好像什么都没有发生过一样。其他节点会感到惊讶，因为这个本来被认为已经死亡的节点突然从棺材里探出头来，身体健康，并开始和旁观者愉快地聊天。一开始，正在进行垃圾收集的节点甚至没有意识到已经过去了整整一分钟，而它被宣布死亡-从它的角度来看，自上次与其他节点交流以来几乎没有经过任何时间。

The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum , that is, voting among the nodes (see “Quorums for reading and writing” ): decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.

这些故事的寓意是，一个节点不能仅依靠自己对情况的判断。分布式系统不能只依赖于单个节点，因为节点可能随时出现故障，导致系统出现故障并且无法恢复。相反，许多分布式算法依赖于法定人数，即节点之间的投票（见“用于读写的法定人数”）：决策需要来自多个节点的最小投票数，以减少对任何特定节点的依赖。

That includes decisions about declaring nodes dead. If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the quorum decision and step down.

这包括关于宣布节点死亡的决定。如果一组节点宣布另一个节点已经死亡，那么即使该节点仍然非常有生气，也必须被视为已死亡。个体节点必须遵守法定人数的决定并下台。

Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds of quorums are possible). A majority quorum allows the system to continue working if individual nodes have failed (with three nodes, one failure can be tolerated; with five nodes, two failures can be tolerated). However, it is still safe, because there can only be only one majority in the system—there cannot be two majorities with conflicting decisions at the same time. We will discuss the use of quorums in more detail when we get to consensus algorithms in Chapter 9 .

通常情况下，法定人数是超过一半以上节点的绝对多数（虽然其他种类的法定人数也是可能的）。多数派法定人数可以使系统在个别节点失败时继续工作（对于三个节点，可以容忍一个故障；对于五个节点，可以容忍两个故障）。但是，它仍然是安全的，因为系统只能有一个多数派—同时不能有两个有冲突决定的多数派。当我们在第9章讨论共识算法时，我们将更详细地讨论法定人数的使用。

The leader and the lock

Frequently, a system requires there to be only one of some thing. For example:

经常，系统要求有一些东西只能有一个。例如：

Only one node is allowed to be the leader for a database partition, to avoid split brain (see “Handling Node Outages” ).

一个数据库分区只允许存在一个领导者节点，以避免出现分裂脑的情况（参见“处理节点故障”）。
Only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrently writing to it and corrupting it.

只允许一个事务或客户端持有锁定特定资源或对象的锁，以防止同时写入并损坏它。
Only one user is allowed to register a particular username, because a username must uniquely identify a user.

一个用户名只能注册给一位用户，因为用户名必须唯一标识一个用户。

Implementing this in a distributed system requires care: even if a node believes that it is “the chosen one” (the leader of the partition, the holder of the lock, the request handler of the user who successfully grabbed the username), that doesn’t necessarily mean a quorum of nodes agrees! A node may have formerly been the leader, but if the other nodes declared it dead in the meantime (e.g., due to a network interruption or GC pause), it may have been demoted and another leader may have already been elected.

在分布式系统中实施这个需要小心：即使一个节点认为自己是“被选择的”（分区的领导者、锁定的持有者、成功抓取用户名称的请求处理器），这并不意味着节点的多数同意！一个节点可能曾经是领导者，但如果其他节点在此期间声明它已死（例如，由于网络中断或GC暂停），它可能已经被降级，另一个领导者可能已经被选举。

If a node continues acting as the chosen one, even though the majority of nodes have declared it dead, it could cause problems in a system that is not carefully designed. Such a node could send messages to other nodes in its self-appointed capacity, and if other nodes believe it, the system as a whole may do something incorrect.

如果一个节点即使大多数节点已经宣布它死亡，仍然继续担任被选择的节点，会在系统不小心设计的情况下造成问题。这样的节点可能会在自我任命的能力下向其他节点发送消息，如果其他节点相信它，整个系统可能会做出错误的操作。

For example, Figure 8-4 shows a data corruption bug due to an incorrect implementation of locking. (The bug is not theoretical: HBase used to have this problem [ 74 , 75 ].) Say you want to ensure that a file in a storage service can only be accessed by one client at a time, because if multiple clients tried to write to it, the file would become corrupted. You try to implement this by requiring a client to obtain a lease from a lock service before accessing the file.

例如，图8-4展示了由于锁定实现不正确而导致的数据损坏错误。（这个错误不是理论上的：HBase曾经有过这个问题[74, 75]。）假设你想确保存储服务中的文件一次只能被一个客户端访问，因为如果多个客户端尝试写入它，文件会变得损坏。你尝试通过要求客户端在访问文件之前从锁定服务中获取租约来实现这一点。

The problem is an example of what we discussed in “Process Pauses” : if the client holding the lease is paused for too long, its lease expires. Another client can obtain a lease for the same file, and start writing to the file. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file. As a result, the clients’ writes clash and corrupt the file.

问题是我们在“流程暂停”中讨论的一个例子：如果持有租约的客户端暂停时间过长，租约将会过期。另一个客户端可以获取相同文件的租约，并开始写入文件。当暂停的客户端回来时，它会错误地认为自己仍然拥有有效的租约，然后继续写入文件。结果，客户端的写入冲突并破坏了文件。

Fencing tokens

When using a lock or lease to protect access to some resource, such as the file storage in Figure 8-4 , we need to ensure that a node that is under a false belief of being “the chosen one” cannot disrupt the rest of the system. A fairly simple technique that achieves this goal is called fencing , and is illustrated in Figure 8-5 .

当使用锁定或租赁来保护某些资源的访问权限（例如图8-4中的文件存储）时，我们需要确保一个错误地认为自己是“被选中者”的节点不能破坏整个系统。一个相当简单的技术可以实现这个目标，它叫做栅栏，如图8-5所示。

Let’s assume that every time the lock server grants a lock or lease, it also returns a fencing token , which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token.

假设每次锁服务授予锁或租约时，它也返回一个围栏令牌，这是一个数字，每次授予锁时都会增加（例如，由锁服务增加）。然后，我们可以要求每次客户端向存储服务发送写入请求时，必须包含其当前的围栏令牌。

In Figure 8-5 , client 1 acquires the lease with a token of 33, but then it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the number always increases) and then sends its write request to the storage service, including the token of 34. Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.

在图8-5中，客户端1使用33个令牌获得租约，但之后进入长时间暂停并且租约到期了。客户端2使用34个令牌（数字一直增加）获得租约，然后将其写请求发送给存储服务，包括34个令牌。稍后，客户端1重新活跃并将其写请求发送到存储服务，包括其33个令牌值。但是，存储服务器记得它已经处理了一个更高令牌号（34）的写请求，因此拒绝了使用33个令牌的请求。

If ZooKeeper is used as lock service, the transaction ID zxid or the node version cversion can be used as fencing token. Since they are guaranteed to be monotonically increasing, they have the required properties [ 74 ].

如果ZooKeeper作为锁定服务使用，则事务ID zxid或节点版本cversion可用作分隔令牌。由于它们保证单调递增，因此具有所需的属性[74]。

Note that this mechanism requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed—it is not sufficient to rely on clients checking their lock status themselves. For resources that do not explicitly support fencing tokens, you might still be able work around the limitation (for example, in the case of a file storage service you could include the fencing token in the filename). However, some kind of check is necessary to avoid processing requests outside of the lock’s protection.

请注意，此机制要求资源自身在检查令牌时扮演积极角色，拒绝任何具有较旧令牌的写入，该令牌已经被处理，它并不能仅仅依赖客户端自己检查其锁定状态。对于不明确支持栅栏令牌的资源，您可能仍能绕过此限制（例如，在文件存储服务的情况下，您可以在文件名中包含栅栏令牌）。但必须进行某种检查，以避免处理锁定保护范围外的请求。

Checking a token on the server side may seem like a downside, but it is arguably a good thing: it is unwise for a service to assume that its clients will always be well behaved, because the clients are often run by people whose priorities are very different from the priorities of the people running the service [ 76 ]. Thus, it is a good idea for any service to protect itself from accidentally abusive clients.

在服务器端检查令牌可能似乎有缺点，但这也是一件好事：服务不应该假设其客户始终表现良好，因为客户的优先级通常与运行服务的人的优先级非常不同。因此，任何服务都应该保护自己免受意外虐待的客户。

Byzantine Faults

Fencing tokens can detect and block a node that is inadvertently acting in error (e.g., because it hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing token.

栅栏令牌可以检测和阻止错误地操作节点（例如，因为它尚未发现其租期已过期）的节点。然而，如果节点有意想要破坏系统的保证，它可以轻松地通过发送带有伪造栅栏令牌的消息来实现。

In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume that if a node does respond, it is telling the “truth”: to the best of its knowledge, it is playing by the rules of the protocol.

在这本书中，我们假设节点是不可靠但诚实的：它们可能很慢或从不响应（由于故障），它们的状态可能过时（由于GC暂停或网络延迟），但我们假设如果节点确实响应，它正在“真实”地讲述事实：据它所知，它正在按照协议的规则进行操作。

Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses)—for example, if a node may claim to have received a particular message when in fact it didn’t. Such behavior is known as a Byzantine fault , and the problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem [ 77 ].

如果存在节点可能“说谎”（发送任意错误或损坏的响应）的风险，分布式系统问题会变得更加困难 - 例如，如果一个节点可能声称已收到某个消息，而实际上没有。这种行为被称为拜占庭故障，而在这种不信任环境中达成共识的问题被称为拜占庭将军问题[77]。

The Byzantine Generals Problem

The Byzantine Generals Problem is a generalization of the so-called Two Generals Problem [ 78 ], which imagines a situation in which two army generals need to agree on a battle plan. As they have set up camp on two different sites, they can only communicate by messenger, and the messengers sometimes get delayed or lost (like packets in a network). We will discuss this problem of consensus in Chapter 9 .

拜占庭将军问题是所谓的两名将军问题的概括，即想象在两名军队将军需要达成战斗计划的情况下，他们在两个不同的地点扎营，只能通过信使进行交流，而信使有时会延迟或丢失（就像网络中的数据包一样）。我们将在第9章讨论这个共识问题。

In the Byzantine version of the problem, there are n generals who need to agree, and their endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals are loyal, and thus send truthful messages, but the traitors may try to deceive and confuse the others by sending fake or untrue messages (while trying to remain undiscovered). It is not known in advance who the traitors are.

在拜占庭版的问题中，有n个将军需要达成一致，但是其中存在一些叛徒妨碍了他们的努力。大多数将军是忠诚的，因此发送真实的信息，但是叛徒可能会试图发送虚假或不真实的信息来欺骗和混淆其他人（同时试图保持不被发现）。事先并不知道谁是叛徒。

Byzantium was an ancient Greek city that later became Constantinople, in the place which is now Istanbul in Turkey. There isn’t any historic evidence that the generals of Byzantium were any more prone to intrigue and conspiracy than those elsewhere. Rather, the name is derived from Byzantine in the sense of excessively complicated, bureaucratic, devious , which was used in politics long before computers [ 79 ]. Lamport wanted to choose a nationality that would not offend any readers, and he was advised that calling it The Albanian Generals Problem was not such a good idea [ 80 ].

拜占庭是一座古希腊城市，后来成为了土耳其伊斯坦布尔的君士坦丁堡。没有证据表明拜占庭将军比其他地方的将军更容易进行阴谋诡计。相反，这个名称源于“拜占庭式”的含义，即过于复杂、官僚、狡猾，这个词在电脑之前的政治领域中就被使用了。Lamport希望选择一个不会冒犯任何读者的国籍，并被建议称之为“阿尔巴尼亚将军问题”并不是个好主意。

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. This concern is relevant in certain specific circumstances. For example:

如果某些节点出现故障并且违反协议，或者有恶意攻击者干扰网络，一个系统是拜占庭容错的，可以继续正确运行。这个问题在某些具体情况下非常重要。例如：

In aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board, or a rocket colliding with the International Space Station), flight control systems must tolerate Byzantine faults [ 81 , 82 ].

在航空航天环境中，计算机的内存或CPU寄存器中的数据可能会因为辐射而损坏，导致其对其他节点以任意不可预测的方式做出响应。由于系统故障会非常昂贵（例如，飞机坠毁并导致所有人死亡，或火箭与国际空间站相撞），所以飞行控制系统必须容忍拜占庭故障[81, 82]。
In a system with multiple participating organizations, some participants may attempt to cheat or defraud others. In such circumstances, it is not safe for a node to simply trust another node’s messages, since they may be sent with malicious intent. For example, peer-to-peer networks like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority [ 83 ].

在多个参与组织的系统中，某些参与方可能尝试欺骗或欺诈其他人。在这种情况下，节点不能简单地相信另一个节点的信息，因为它们可能具有恶意意图。例如，像比特币和其他区块链一样的对等网络可以被视为一种让相互不信任的各方同意某个交易是否发生，而不依赖于中央机构的方式。

However, in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults. In your datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a major problem. Protocols for making systems Byzantine fault-tolerant are quite complicated [ 84 ], and fault-tolerant embedded systems rely on support from the hardware level [ 81 ]. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impractical.

然而，在本书讨论的系统中，我们通常可以安全地假设没有拜占庭故障。在您的数据中心中，所有节点都受到您组织的控制（因此希望它们是可信的），并且辐射水平低到足以使内存损坏不成为主要问题。使系统具有拜占庭容错性的协议非常复杂[84]，并且容错嵌入式系统依赖于硬件级别的支持[81]。在大多数服务器端数据系统中，部署拜占庭容错解决方案的成本使它们变得不切实际。

Web applications do need to expect arbitrary and malicious behavior of clients that are under end-user control, such as web browsers. This is why input validation, sanitization, and output escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where there is no such central authority, Byzantine fault tolerance is more relevant.

网络应用需要预料到受最终用户控制的客户端（如Web浏览器）的任意和恶意行为。这就是为什么输入验证、净化和输出转义非常重要的原因：例如，防止SQL注入和跨站点脚本。然而，在这里我们通常不使用拜占庭容错协议，只需将服务器作为决定允许哪些客户端行为的权威。在点对点网络中，没有这样的中央权威，因此拜占庭容错更加相关。

A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly (i.e., if you have four nodes, at most one may malfunction). To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.

软件中的漏洞可能被视为拜占庭故障，但如果你将同一软件部署到所有节点上，那么拜占庭容错算法将无法拯救你。大多数拜占庭容错算法需要超过三分之二的节点正常运行（例如，如果你有四个节点，则最多只能有一个节点出现故障）。如果想要使用这种方法来解决漏洞，你需要有四个独立的软件实现，并希望漏洞只出现在其中一个实现中。

Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if an attacker can compromise one node, they can probably compromise all of them, because they are probably running the same software. Thus, traditional mechanisms (authentication, access control, encryption, firewalls, and so on) continue to be the main protection against attackers.

同样，如果协议可以保护我们免受漏洞、安全妥协和恶意攻击的侵害，那将会很吸引人。不幸的是，这也是不现实的：在大多数系统中，如果攻击者可以入侵一个节点，他们可能可以入侵所有节点，因为它们可能运行相同的软件。因此，传统机制（身份验证、访问控制、加密、防火墙等）继续成为对抗攻击者的主要保护。

Weak forms of lying

Although we assume that nodes are generally honest, it can be worth adding mechanisms to software that guard against weak forms of “lying”—for example, invalid messages due to hardware issues, software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and pragmatic steps toward better reliability. For example:

尽管我们假设节点一般是诚实的，但添加机制以防止“说谎”的弱形式可能是值得的，例如由于硬件问题、软件漏洞和错误配置导致的无效消息。这些保护机制不是完整的拜占庭容错，因为它们不能抵御有决心的对手，但它们仍然是朝着更好的可靠性迈进的简单而实用的步骤。例如：

Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems, drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and UDP, but sometimes they evade detection [ 85 , 86 , 87 ]. Simple measures are usually sufficient protection against such corruption, such as checksums in the application-level protocol.

网络数据包有时会因为硬件问题或操作系统、驱动程序、路由器等的bugs而变为损坏状态。通常，TCP和UDP内置进行校验和检测的功能可以检测出损坏的数据包，但有时依然无法检测出损坏的数据包[85，86，87]。一些简单的措施通常足以防止这种损坏，例如在应用层协议中进行校验和检测。
A publicly accessible application must carefully sanitize any inputs from users, for example checking that a value is within a reasonable range and limiting the size of strings to prevent denial of service through large memory allocations. An internal service behind a firewall may be able to get away with less strict checks on inputs, but some basic sanity-checking of values (e.g., in protocol parsing [ 85 ]) is a good idea.

公共可访问的应用程序必须仔细清洗来自用户的任何输入，例如检查一个值是否在合理范围内并限制字符串的大小，以防止通过大内存分配导致拒绝服务攻击。位于防火墙后面的内部服务可能可以对输入进行较少的严格检查，但对值进行一些基本的合理性检查（例如，在协议解析中[85]）是一个好主意。
NTP clients can be configured with multiple server addresses. When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an incorrect time is detected as an outlier and is excluded from synchronization [ 37 ]. The use of multiple servers makes NTP more robust than if it only uses a single server.

NTP客户端可以配置多个服务器地址。同步时，客户端联系所有服务器，估计它们的误差，并检查大多数服务器是否在某个时间范围内达成一致。只要大多数服务器正常，就可以检测到错误时间的错误配置NTP服务器，并将其排除在同步之外。[37]使用多个服务器使NTP比仅使用单个服务器更加鲁棒。

System Model and Reality

Many algorithms have been designed to solve distributed systems problems—for example, we will examine solutions for the consensus problem in Chapter 9 . In order to be useful, these algorithms need to tolerate the various faults of distributed systems that we discussed in this chapter.

很多算法都被设计用来解决分布式系统问题——例如，在第9章中，我们将讨论共识问题的解决方案。为了有用，这些算法需要容忍我们在本章中讨论过的分布式系统的各种故障。

Algorithms need to be written in a way that does not depend too heavily on the details of the hardware and software configuration on which they are run. This in turn requires that we somehow formalize the kinds of faults that we expect to happen in a system. We do this by defining a system model , which is an abstraction that describes what things an algorithm may assume.

算法需要编写成不太依赖于硬件和软件配置的方式运行。因此，我们需要以某种方式将我们所预期发生的系统故障形式化。我们通过定义系统模型来实现此目的，这是描述算法可能假设的抽象概念。

With regard to timing assumptions, three system models are in common use:

关于时间假设，常用的有三种系统模型：

Synchronous model

The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound [ 88 ]. The synchronous model is not a realistic model of most practical systems, because (as discussed in this chapter) unbounded delays and pauses do occur.

同步模型假设有工作网络延迟、进程暂停和时钟误差的限制。这并不意味着完全同步的时钟或网络延迟为零；它只是意味着你知道网络延迟、暂停和时钟漂移永远不会超过某个固定的上限。同步模型不是大多数实际系统的现实模型，因为（正如本章所讨论的）无限的延迟和暂停确实会发生。

Partially synchronous model

Partial synchrony means that a system behaves like a synchronous system most of the time , but it sometimes exceeds the bounds for network delay, process pauses, and clock drift [ 88 ]. This is a realistic model of many systems: most of the time, networks and processes are quite well behaved—otherwise we would never be able to get anything done—but we have to reckon with the fact that any timing assumptions may be shattered occasionally. When this happens, network delay, pauses, and clock error may become arbitrarily large.

部分同步意味着系统大部分时间表现为同步系统，但有时会超出网络延迟、进程暂停和时钟漂移的限制。[88]这是许多系统的现实模型：大多数时间，网络和进程表现良好 -否则我们就无法完成任何事情- 但我们必须考虑到任何时间假设可能会偶尔破裂。当这种情况发生时，网络延迟、暂停和时钟误差可能变得任意大。

Asynchronous model

In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not even have a clock (so it cannot use timeouts). Some algorithms can be designed for the asynchronous model, but it is very restrictive.

在这个模型中，算法不允许做出任何时间假设——事实上，它甚至没有时钟（因此无法使用超时）。一些算法可以为异步模型设计，但它是非常严格的。

Moreover, besides timing issues, we have to consider node failures. The three most common system models for nodes are:

此外，除了时间问题外，我们还必须考虑节点故障。节点的三种最常见的系统模型是：

Crash-stop faults

In the crash-stop model, an algorithm may assume that a node can fail in only one way, namely by crashing. This means that the node may suddenly stop responding at any moment, and thereafter that node is gone forever—it never comes back.

在崩溃停止模型中，算法可能假设节点只能以一种方式失败，即崩溃。这意味着节点可能随时突然停止响应，然后该节点就永远离开了——它永远不会回来。

Crash-recovery faults

We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e., nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed to be lost.

我们假设节点随时可能崩溃，并可能在一段未知的时间后重新开始响应。在崩溃恢复模型中，假定节点具有稳定的存储（即，不揮发的磁盘存储），该存储在崩溃期间保持不变，而内存状态被认为是丢失的。

Byzantine (arbitrary) faults

Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described in the last section.

节点可以做任何事情，包括尝试欺骗其他节点，正如最后一节所描述的那样。

For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model. But how do distributed algorithms cope with that model?

对于建模实际系统来说，具有崩溃恢复错误的部分同步模型通常是最有用的模型。但是分布式算法如何应对这个模型呢？

Correctness of an algorithm

To define what it means for an algorithm to be correct , we can describe its properties . For example, the output of a sorting algorithm has the property that for any two distinct elements of the output list, the element further to the left is smaller than the element further to the right. That is simply a formal way of defining what it means for a list to be sorted.

定义算法正确性的意义，我们可以描述它的属性。例如，排序算法的输出具有以下属性：对于输出列表中的任意两个不同元素，左边的元素比右边的元素小。这只是简单地定义了列表排序的意义。

Similarly, we can write down the properties we want of a distributed algorithm to define what it means to be correct. For example, if we are generating fencing tokens for a lock (see “Fencing tokens” ), we may require the algorithm to have the following properties:

同样地，我们可以列出分布式算法需要具备的特性来定义正确性。例如，如果我们正在为一个锁生成围栏令牌（参见“围栏令牌”），我们可能需要算法具备以下特性：

Uniqueness

No two requests for a fencing token return the same value.

没有两个请求返回相同的围栏令牌值。

Monotonic sequence

If request x returned token t _x , and request y returned token t _y , and x completed before y began, then t _x < t _y .

如果请求x返回令牌tx，请求y返回令牌ty，并且x在y开始之前完成，则tx < ty。

Availability

A node that requests a fencing token and does not crash eventually receives a response.

请求围栏令牌的节点最终都会收到响应。

An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model. But how does this make sense? If all nodes crash, or all network delays suddenly become infinitely long, then no algorithm will be able to get anything done.

算法在某些系统模型中是正确的，如果它总是满足其在该系统模型中可能发生的所有情况下的属性。但这有意义吗？如果所有节点崩溃，或者所有网络延迟突然变得无限长，那么任何算法都无法完成任何任务。

Safety and liveness

To clarify the situation, it is worth distinguishing between two different kinds of properties: safety and liveness properties. In the example just given, uniqueness and monotonic sequence are safety properties, but availability is a liveness property.

为了弄清楚情况，值得区分两种不同的属性：安全属性和活性属性。在刚才给出的例子中，唯一性和单调序列是安全属性，而可用性是活性属性。

What distinguishes the two kinds of properties? A giveaway is that liveness properties often include the word “eventually” in their definition. (And yes, you guessed it— eventual consistency is a liveness property [ 89 ].)

两种属性的区别在哪里呢？一个很明显的区分是活性属性的定义通常包含“最终”一词。（是的，你猜对了——最终一致性是一种活性属性 [89]。）

Safety is often informally defined as nothing bad happens , and liveness as something good eventually happens . However, it’s best to not read too much into those informal definitions, because the meaning of good and bad is subjective. The actual definitions of safety and liveness are precise and mathematical [ 90 ]:

安全通常是非正式地定义为没有发生任何坏事，而活力则是最终会发生一些好事情。但是，最好不要过多解读这些非正式的定义，因为好和坏的意义是主观的。安全和生命力的实际定义是精确和数学的[90]。

If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular operation in which a duplicate fencing token was returned). After a safety property has been violated, the violation cannot be undone—the damage is already done.

如果安全属性被违反，我们可以指出违反发生的特定时间点（例如，如果唯一性属性被违反，则我们可以确定返回了重复围栏标记的特定操作）。安全属性一旦被违反，就无法撤消违规行为——损害已经发生。
A liveness property works the other way round: it may not hold at some point in time (for example, a node may have sent a request but not yet received a response), but there is always hope that it may be satisfied in the future (namely by receiving a response).

一个生存性质的工作方式正好相反：在某个时间点上可能不成立（例如，一个节点可能已经发送了一个请求但还没有收到响应），但总有希望它在未来会得到满足（即通过接收响应）。

An advantage of distinguishing between safety and liveness properties is that it helps us deal with difficult system models. For distributed algorithms, it is common to require that safety properties always hold, in all possible situations of a system model [ 88 ]. That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result (i.e., that the safety properties remain satisfied).

区分安全性和活性属性的优点是帮助我们处理复杂的系统模型。对于分布式算法，通常要求安全属性在系统模型的所有可能情况下始终保持[88]。也就是说，即使所有节点崩溃或整个网络失败，该算法仍必须确保不返回错误结果（即，安全属性仍保持满足）。

However, with liveness properties we are allowed to make caveats: for example, we could say that a request needs to receive a response only if a majority of nodes have not crashed, and only if the network eventually recovers from an outage. The definition of the partially synchronous model requires that eventually the system returns to a synchronous state—that is, any period of network interruption lasts only for a finite duration and is then repaired.

然而，使用活性属性，我们可以作出附加条件：例如，我们可以说只有在大多数节点未崩溃且网络最终从停机中恢复时，请求才需要收到响应。部分同步模型的定义需要系统最终返回同步状态，即任何网络中断期间仅持续有限时间，并且随后得到修复。

Mapping system models to the real world

Safety and liveness properties and system models are very useful for reasoning about the correctness of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of reality come back to bite you again, and it becomes clear that the system model is a simplified abstraction of reality.

安全性和活性性质以及系统模型对于推理分布式算法的正确性非常有用。然而，在实践中实现算法时，现实的复杂事实再次出现，系统模型就成为现实的简化抽象。

For example, algorithms in the crash-recovery model generally assume that data in stable storage survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration [ 91 ]? What happens if a server has a firmware bug and fails to recognize its hard drives on reboot, even though the drives are correctly attached to the server [ 92 ]?

例如，在崩溃恢复模型中的算法通常假定稳定存储器中的数据幸存于崩溃之后。但是，如果存储器中的数据受到损坏或由于硬件错误或错误配置而被抹去，会怎样呢？如果服务器存在固件漏洞，在重新启动时无法识别与服务器正确连接的硬盘，会怎么样呢？

Quorum algorithms (see “Quorums for reading and writing” ) rely on a node remembering the data that it claims to have stored. If a node may suffer from amnesia and forget previously stored data, that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new system model is needed, in which we assume that stable storage mostly survives crashes, but may sometimes be lost. But that model then becomes harder to reason about.

“Quorum算法（参见“用于读取和写入的Quorums”）依赖于节点记住其声称已存储的数据。如果一个节点可能会患有健忘症并忘记以前存储的数据，那就破坏了议员条件，从而破坏了算法的正确性。也许需要一个新的系统模型，其中我们假设稳定存储大多能够经受崩溃，但有时可能会丢失。但是，这个模型就变得更难思考。”

The theoretical description of an algorithm can declare that certain things are simply assumed not to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can and cannot happen. However, a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible, even if that handling boils down to printf("Sucks to be you") and exit(666) —i.e., letting a human operator clean up the mess [ 93 ]. (This is arguably the difference between computer science and software engineering.)

算法的理论描述可以声明某些事情是根本不可能发生的——在非拜占庭系统中，我们确实需要做出一些关于可能发生故障的假设和不可能发生故障的假设。然而，一个真实的实现仍可能需要包含处理假定不可能发生的情况的代码，即使该处理归结为printf("Sucks to be you")和exit(666)——即让一个人工操作员清理混乱[93]。（这可以说是计算机科学和软件工程之间的区别。）

That is not to say that theoretical, abstract system models are worthless—quite the opposite. They are incredibly helpful for distilling down the complexity of real systems to a manageable set of faults that we can reason about, so that we can understand the problem and try to solve it systematically. We can prove algorithms correct by showing that their properties always hold in some system model.

这并不意味着理论上的、抽象的系统模型毫无价值——恰恰相反。它们非常有用，可以将真实系统的复杂性简化成可管理的一组错误，以便我们可以理性地理解问题并尝试系统地解决它。我们可以通过在某些系统模型中证明算法的正确性，以展示其属性始终保持不变。

Proving an algorithm correct does not mean its implementation on a real system will necessarily always behave correctly. But it’s a very good first step, because the theoretical analysis can uncover problems in an algorithm that might remain hidden for a long time in a real system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due to unusual circumstances. Theoretical analysis and empirical testing are equally important.

证明算法正确并不意味着在实际系统中的实现一定会始终正确行为。但这是一个非常好的第一步，因为理论分析可以揭示算法中可能长期隐藏的问题，在非常罕见的情况下可能才会让您的假设（例如有关时间的假设）被击败。理论分析和经验测试同样重要。

Summary

In this chapter we have discussed a wide range of problems that can occur in distributed systems, including:

在本章中，我们讨论了分布式系统中可能出现的各种问题，包括：

Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.

每当您尝试通过网络发送数据包时，它都有可能会丢失或者任意延迟。同样，回复也有可能会丢失或者延迟，所以如果没有收到回复，您就不知道消息是否已经传递。
A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s error interval.

一个节点的时钟可能会与其他节点明显不同步（即使您尽力设置网络时钟协议（NTP）），它可能会突然向前或向后跳跃时间，依赖它是危险的，因为您很可能没有好的测量您时钟误差间隔的方法。
A process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused.

一个进程可能在执行的任何时刻暂停相当长的时间（可能是因为全停垃圾收集器），被其他节点声明为死亡，然后再次复活，而不知道它曾经暂停过。

The fact that such partial failures can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.

分布式系统的定义特点是可能会发生这种局部故障。每当软件涉及其他节点时，它偶尔会失败，轻微卡顿，或干脆不响应(最终超时)。在分布式系统中，我们尝试将软件的容错性建立起来，使系统在一些组成部分损坏时仍能正常运行。

To tolerate faults, the first step is to detect them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can’t distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crashing. Moreover, sometimes a node can be in a degraded state: for example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [ 94 ]. Such a node that is “limping” but not dead can be even more difficult to deal with than a cleanly failed node.

容忍故障的第一步是检测它们，但即使这也很难。大多数系统没有准确的机制来检测节点是否失败，因此大多数分布式算法依赖于超时来确定远程节点是否仍然可用。然而，超时无法区分网络和节点故障，并且可变的网络延迟有时会导致节点被错误地怀疑崩溃。此外，有时节点可能处于退化状态：例如，由于驱动程序错误，千兆网络接口可能突然降至1Kb/s的吞吐量[94]。这样一个“跛足”但并未死亡的节点比一台干净失败的节点更难处理。

Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.

一旦检测到故障，让系统容忍它也不容易：机器之间没有全局变量、共享内存、公共知识或任何其他共享状态。节点甚至不能达成一致的时间，更不用说任何更深刻的事情了。信息从一个节点流向另一个节点的唯一方式是通过不可靠的网络发送。重大决策不能由单个节点安全地做出，因此我们需要协议以征求其他节点的帮助，并试图获得多数同意。

If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer [ 5 ], and indeed a single computer can do a lot nowadays [ 95 ]. If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.

如果你已经习惯于在单一计算机的理想化数学完美中编写软件，其中相同操作始终确定性地返回相同的结果，那么转向分布式系统的混乱物理现实可能会有点震惊。相反，分布式系统工程师通常会认为，如果问题可以在单个计算机上解决，那么问题就很简单，而事实上，单个计算机现在可以做很多事情。如果您可以避免打开潘多拉的盒子，只需将事物保留在单台机器上，那么通常值得这样做。

However, as discussed in the introduction to Part II , scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node.

然而，正如第二部分介绍中所讨论的，可扩展性并不是想要使用分布式系统的唯一原因。容错性和低延迟（通过将数据放置在靠近用户的地方）同样重要，而这些目标无法通过单个节点实现。

In this chapter we also went on some tangents to explore whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.

在本章中，我们也进行了一些离题探讨，探讨网络、时钟和进程的不可靠性是否是自然法则。我们发现这不是自然法则：可以在网络中提供硬实时响应保证和有界延迟，但这样做非常昂贵，会导致硬件资源利用率降低。大多数非安全关键系统选择廉价而不可靠的解决方案，而非昂贵而可靠的解决方案。

We also touched on supercomputers, which assume reliable components and thus have to be stopped and restarted entirely when a component does fail. By contrast, distributed systems can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level—at least in theory. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring a distributed system to its knees.)

我们也谈及了超级计算机，它们需要可靠的组件，因此当一个组件出现故障时必须完全停止和重新启动。相比之下，分布式系统可以在服务级别上永远运行，因为所有故障和维护都可以在节点级别处理-至少在理论上是如此。（实际上，如果所有节点都推出了错误的配置更改，那仍然会使分布式系统崩溃。）

This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with all the problems in distributed systems.

这一章节全部关于问题，并给我们带来了灰暗的前景。在下一章中，我们将继续探讨解决方案，并讨论一些旨在应对分布式系统中所有问题的算法。

Footnotes

ⁱ With one exception: we will assume that faults are non-Byzantine (see “Byzantine Faults” ).

除一种情况外：我们将假定故障是非拜占庭式的（见“拜占庭故障”）。

ⁱⁱ Except perhaps for an occasional keepalive packet, if TCP keepalive is enabled.

除非TCP keepalive已启用，否则只有偶尔的保持活动数据包。

ⁱⁱⁱ Asynchronous Transfer Mode (ATM) was a competitor to Ethernet in the 1980s [ 32 ], but it didn’t gain much adoption outside of telephone network core switches. It has nothing to do with automatic teller machines (also known as cash machines), despite sharing an acronym. Perhaps, in some parallel universe, the internet is based on something like ATM—in that universe, internet video calls are probably a lot more reliable than they are in ours, because they don’t suffer from dropped and delayed packets.

iii 异步传输模式（ATM）在1980年代是以太网的竞争对手[32]，但它并没有在电话网络核心交换机以外得到广泛采用。尽管使用了同一个缩写，它与自动取款机没有任何关系。或许在某个平行宇宙里，互联网基于类似ATM的东西-在那个宇宙里，互联网视频通话可能比我们的更可靠，因为它们不会遭受丢失和延迟的数据包。

^iv Peering agreements between internet service providers and the establishment of routes through the Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of networks, not individual connections between hosts, and at a much longer timescale.

互联网服务供应商间的iv Peering协议和通过边界网关协议（BGP）建立路由的过程，更像是电路交换，而不是IP本身。在这个层面上，可以购买专用带宽。然而，互联网路由在网络层级上运行，而不是主机之间的个别连接，时间尺度也更长。

^v Although the clock is called real-time , it has nothing to do with real-time operating systems, as discussed in “Response time guarantees” .

尽管时钟被称为实时时钟，但它与实时操作系统无关，如“响应时间保证”所讨论的。

^vi There are distributed sequence number generators, such as Twitter’s Snowflake, that generate approximately monotonically increasing unique IDs in a scalable way (e.g., by allocating blocks of the ID space to different nodes). However, they typically cannot guarantee an ordering that is consistent with causality, because the timescale at which blocks of IDs are assigned is longer than the timescale of database reads and writes. See also “Ordering Guarantees” .

有分布式序列号生成器，如Twitter的Snowflake，可以可伸缩地生成大约单调递增的唯一ID（例如，通过将ID空间的块分配给不同的节点）。但是，它们通常无法保证顺序一致与因果关系一致，因为分配ID块的时间尺度比数据库读写的时间尺度长。请参见“排序保证”。

References

[ 1 ] Mark Cavage: “ There’s Just No Getting Around It: You’re Building a Distributed System ,” ACM Queue , volume 11, number 4, pages 80-89, April 2013. doi:10.1145/2466486.2482856

[1] Mark Cavage：「离不开分布式系统：你正在构建它」，ACM Queue，第11卷，第4期，第80-89页，2013年4月。doi:10.1145/2466486.2482856。

[ 2 ] Jay Kreps: “ Getting Real About Distributed System Reliability ,” blog.empathybox.com , March 19, 2012.

[2] Jay Kreps：“认真对待分布式系统的可靠性”，blog.empathybox.com，2012年3月19日。

[ 3 ] Sydney Padua: The Thrilling Adventures of Lovelace and Babbage: The (Mostly) True Story of the First Computer . Particular Books, April 2015. ISBN: 978-0-141-98151-2

[3] 悉尼·帕杜亚: 洛韦莱斯和巴贝奇的惊险历险：第一台计算机的“大部分”真实故事。宝特书籍，2015年4月。ISBN: 978-0-141-98151-2。

[ 4 ] Coda Hale: “ You Can’t Sacrifice Partition Tolerance ,” codahale.com , October 7, 2010.

[4] Coda Hale： “牺牲分区容错性是不可取的,” codahale.com，2010年10月7日。

[ 5 ] Jeff Hodges: “ Notes on Distributed Systems for Young Bloods ,” somethingsimilar.com , January 14, 2013.

[5] Jeff Hodges: "分布式系统简明指南", somethingsimilar.com, 2013年1月14日。

[ 6 ] Antonio Regalado: “ Who Coined ‘Cloud Computing’? ,” technologyreview.com , October 31, 2011.

"谁创造了‘云计算’？"，Antonio Regalado，2011年10月31日，technologyreview.com。"

[ 7 ] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle: “ The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition ,” Synthesis Lectures on Computer Architecture , volume 8, number 3, Morgan & Claypool Publishers, July 2013. doi:10.2200/S00516ED2V01Y201306CAC024 , ISBN: 978-1-627-05010-4

“数据中心作为一台计算机：大型仓库式机器设计介绍，第二版”，计算机架构综合论文集，第8卷，第3期， Morgan & Claypool出版社，2013年7月。 DOI：10.2200/S00516ED2V01Y201306CAC024， ISBN：978-1-627-05010-4。

[ 8 ] David Fiala, Frank Mueller, Christian Engelmann, et al.: “ Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing ,” at International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), November 2012.

David Fiala，Frank Mueller，Christian Engelmann等人：“大规模高性能计算中的静默数据损坏的检测和纠正”，于2012年11月在高性能计算、网络、存储和分析国际会议（SC12）上发表。

[ 9 ] Arjun Singh, Joon Ong, Amit Agarwal, et al.: “ Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network ,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2785956.2787508

"[9] 阿尔琼·辛格、朱恩·翁、阿米特·阿格拉沃尔等人：《木星升起：谷歌数据中心网络中十年来的Clos拓扑结构和集中控制》，发表于ACM数据通信特别兴趣小组(SIGCOMM)年会，2015年8月。doi：10.1145/2785956.2787508。"

[ 10 ] Glenn K. Lockwood: “ Hadoop’s Uncomfortable Fit in HPC ,” glennklockwood.blogspot.co.uk , May 16, 2014.

[10] Glenn K. Lockwood：“Hadoop在高性能计算中的不适合”，glennklockwood.blogspot.co.uk，2014年5月16日。

[ 11 ] John von Neumann: “ Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components ,” in Automata Studies (AM-34) , edited by Claude E. Shannon and John McCarthy, Princeton University Press, 1956. ISBN: 978-0-691-07916-5

[11] 约翰·冯·诺伊曼： “概率逻辑和可靠生物体从不可靠组件合成”，收录于《自动机研究》（AM-34），克洛德·E·香农和约翰·麦卡锡编辑，普林斯顿大学出版社，1956年。 ISBN: 978-0-691-07916-5。

[ 12 ] Richard W. Hamming: The Art of Doing Science and Engineering . Taylor & Francis, 1997. ISBN: 978-9-056-99500-3

[12] 理查德·W·汉明：做科学和工程的艺术。泰勒和弗朗西斯，1997年。 ISBN：978-9-056-99500-3。

[ 13 ] Claude E. Shannon: “ A Mathematical Theory of Communication ,” The Bell System Technical Journal , volume 27, number 3, pages 379–423 and 623–656, July 1948.

[13] 克劳德·E·香农：《通信的数学理论》，贝尔系统技术杂志，第27卷，第3期，1948年7月，379-423和623-656页。

[ 14 ] Peter Bailis and Kyle Kingsbury: “ The Network Is Reliable ,” ACM Queue , volume 12, number 7, pages 48-55, July 2014. doi:10.1145/2639988.2639988

[14] Peter Bailis 和 Kyle Kingsbury: “网络是可靠的”，ACM Queue，卷 12，号码 7，页码 48-55，2014 年 7 月。doi:10.1145/2639988.2639988。

[ 15 ] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish: “ Taming Uncertainty in Distributed Systems with Help from the Network ,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741976

"在分布式系统中通过网络协作抑制不确定性"，作者为Joshua B. Leners、Trinabh Gupta、Marcos K. Aguilera和Michael Walfish，发表于2015年4月的第10届欧洲计算机系统会议（EuroSys）。doi:10.1145/2741948.2741976。"

[ 16 ] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan: “ Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications ,” at ACM SIGCOMM Conference , August 2011. doi:10.1145/2018436.2018477

[16] Phillipa Gill, Navendu Jain和Nachiappan Nagappan：“理解数据中心网络故障：测量、分析和影响”，发表于ACM SIGCOMM会议，2011年8月。 doi:10.1145/2018436.2018477

[ 17 ] Mark Imbriaco: “ Downtime Last Saturday ,” github.com , December 26, 2012.

[17] Mark Imbriaco: "上周六的停机时间"，github.com，2012年12月26日。

[ 18 ] Will Oremus: “ The Global Internet Is Being Attacked by Sharks, Google Confirms ,” slate.com , August 15, 2014.

“全球互联网受到鲨鱼袭击，谷歌证实”，Slate.com，2014年8月15日。

[ 19 ] Marc A. Donges: “ Re: bnx2 cards Intermittantly Going Offline ,” Message to Linux netdev mailing list, spinics.net , September 13, 2012.

"[19] Marc A. Donges： “关于bnx2网络卡断开连接”的信息发送至Linux netdev邮件列表，spinics.net，2012年9月13日。"

[ 20 ] Kyle Kingsbury: “ Call Me Maybe: Elasticsearch ,” aphyr.com , June 15, 2014.

[20] 凯尔·金斯伯利：“Call Me Maybe：Elasticsearch”，aphyr.com，2014年6月15日。

[ 21 ] Salvatore Sanfilippo: “ A Few Arguments About Redis Sentinel Properties and Fail Scenarios ,” antirez.com , October 21, 2014.

[21] Salvatore Sanfilippo：“Redis Sentinel 属性和故障场景的几个论点”，antirez.com，2014年10月21日。

[ 22 ] Bert Hubert: “ The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable ,” blog.netherlabs.nl , January 18, 2009.

[22] Bert Hubert：“超级SO_LINGER页面，或：为什么我的TCP不可靠”，blog.netherlabs.nl，2009年1月18日。

[ 23 ] Nicolas Liochon: “ CAP: If All You Have Is a Timeout, Everything Looks Like a Partition ,” blog.thislongrun.com , May 25, 2015.

"CAP：如果你手上只有一个超时，那么所有事情都看起来像是一个分区"， Nicolas Liochon在他的博客“thislongrun”中写道，发布于2015年5月25日。"

[ 24 ] Jerome H. Saltzer, David P. Reed, and David D. Clark: “ End-To-End Arguments in System Design ,” ACM Transactions on Computer Systems , volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402

[24] Jerome H. Saltzer, David P. Reed, and David D. Clark：“系统设计中的端到端论证”，ACM 计算机系统交易，第 2 卷，第 4 期，页面 277-288，1984 年 11 月。doi:10.1145/357401.357402。

[ 25 ] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, et al.: “ Queues Don’t Matter When You Can JUMP Them! ,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.

"[25] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog等人: “Queues Don’t Matter When You Can JUMP Them!,” 发表于第12届USENIX网络系统设计与实现研讨会（NSDI），2015年5月。"

[ 26 ] Guohui Wang and T. S. Eugene Ng: “ The Impact of Virtualization on Network Performance of Amazon EC2 Data Center ,” at 29th IEEE International Conference on Computer Communications (INFOCOM), March 2010. doi:10.1109/INFCOM.2010.5461931

[26] 王国辉和T.S. Eugene Ng：“虚拟化对亚马逊EC2数据中心网络性能的影响”，发表于第29届IEEE国际计算机通信会议（INFOCOM），2010年3月。doi:10.1109/INFCOM.2010.5461931。

[ 27 ] Van Jacobson: “ Congestion Avoidance and Control ,” at ACM Symposium on Communications Architectures and Protocols (SIGCOMM), August 1988. doi:10.1145/52324.52356

[27]范·雅各布森： “拥塞避免和控制”，发表于1988年8月的ACM通信体系结构和协议（SIGCOMM）研讨会。 doi:10.1145/52324.52356

[ 28 ] Brandon Philips: “ etcd: Distributed Locking and Service Discovery ,” at Strange Loop , September 2014.

[28] Brandon Philips: "etcd: 分布式锁和服务发现"，在Stranger Loop，2014年9月。

[ 29 ] Steve Newman: “ A Systematic Look at EC2 I/O ,” blog.scalyr.com , October 16, 2012.

[29] Steve Newman： "系统地观察EC2 I/O"，blog.scalyr.com，2012年10月16日。

[ 30 ] Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: “ The ϕ Accrual Failure Detector ,” Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004.

【30】Naohiro Hayashibara、Xavier Défago、Rami Yared 和 Takuya Katayama：“Φ积累故障检测器”，日本高级科学技术学院信息科学学院技术报告 IS-RR-2004-010，2004 年 5 月。

[ 31 ] Jeffrey Wang: “ Phi Accrual Failure Detector ,” ternarysearch.blogspot.co.uk , August 11, 2013.

[31] Jeffrey Wang：“Phi Accrual故障探测器”，ternarysearch.blogspot.co.uk，2013年8月11日。

[ 32 ] Srinivasan Keshav: An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network . Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6

[32] Srinivasan Keshav:《计算机网络的工程方法：ATM网络、互联网和电话网络》Addison-Wesley Professional，1997年5月出版。ISBN: 978-0-201-63442-6。

[ 33 ] Cisco, “ Integrated Services Digital Network ,” docwiki.cisco.com .

[33] Cisco，“集成服务数字网”，docwiki.cisco.com。

[ 34 ] Othmar Kyas: ATM Networks . International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6

[34] Othmar Kyas: ATM网络。国际汤森出版，1995年。ISBN：978-1-850-32128-6。

[ 35 ] “ InfiniBand FAQ ,” Mellanox Technologies, December 22, 2014.

"无限带宽常见问题解答，" Mellanox Technologies，2014年12月22日。

[ 36 ] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman: “ End-to-End Congestion Control for InfiniBand ,” at 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. doi:10.1109/INFCOM.2003.1208949

「[36] Jose Renato Santos，Yoshio Turner 和 G. (John) Janakiraman：「InfiniBand 的端到端拥塞控制」，发表于第22届IEEE计算机和通信学会联合会议(INFOCOM)，2003年4月。同时也由惠普实验室Palo Alto发表，技术报告HPL-2002-359。 doi:10.1109/INFCOM.2003.1208949」

[ 37 ] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley: “ The NTP FAQ and HOWTO ,” ntp.org , November 2006.

[37] Ulrich Windl，David Dalton，Marc Martinec和Dale R. Worley： “NTP FAQ和HOWTO”，ntp.org，2006年11月。

[ 38 ] John Graham-Cumming: “ How and why the leap second affected Cloudflare DNS ,” blog.cloudflare.com , January 1, 2017.

[38] 约翰·格雷厄姆-卡明 (John Graham-Cumming): “‘闰秒’ 如何影响 Cloudflare DNS 以及其原因,” 博客地址为 blog.cloudflare.com，发表日期为 2017 年 1 月 1 日。

[ 39 ] David Holmes: “ Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows ,” blogs.oracle.com , October 2, 2006.

[39] David Holmes: "Hotspot VM 内部: 时钟，定时器和调度事件 - 第一部分 - Windows," blogs.oracle.com, 2006 年 10 月 2 日。

[ 40 ] Steve Loughran: “ Time on Multi-Core, Multi-Socket Servers ,” steveloughran.blogspot.co.uk , September 17, 2015.

[40] Steve Loughran: “在多核、多插槽服务器上的时间管理”，steveloughran.blogspot.co.uk，2015年9月17日。

[ 41 ] James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “ Spanner: Google’s Globally-Distributed Database ,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.

[41] James C. Corbett，Jeffrey Dean，Michael Epstein等人： “Spanner：Google 全球分布式数据库”，在第十届 USENIX 操作系统设计和实现研讨会（OSDI）上，2012年10月。

[ 42 ] M. Caporaloni and R. Ambrosini: “ How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet? ,” European Journal of Physics , volume 23, number 4, pages L17–L21, June 2012. doi:10.1088/0143-0807/23/4/103

[42] M. Caporaloni和R. Ambrosini：“个人电脑通过互联网可以以多大精度跟踪UTC时间标准？”，欧洲物理学杂志，第23卷，第4期，页码L17–L21，2012年6月。doi:10.1088/0143-0807/23/4/103。

[ 43 ] Nelson Minar: “ A Survey of the NTP Network ,” alumni.media.mit.edu , December 1999.

[43] 尼尔森·米纳尔：“NTP网络调查”，alumni.media.mit.edu，1999年12月。

[ 44 ] Viliam Holub: “ Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem ,” blog.logentries.com , March 14, 2014.

[44] 维利亚姆·霍鲁布： “在卡桑德拉集群中同步时钟问题，第一部分 - 问题”，blog.logentries.com，2014年3月14日。

[ 45 ] Poul-Henning Kamp: “ The One-Second War (What Time Will You Die?) ,” ACM Queue , volume 9, number 4, pages 44–48, April 2011. doi:10.1145/1966989.1967009

[45] Poul-Henning Kamp：“一秒战争（你将死于何时？）”，《ACM Queue》杂志，2011年4月第9卷第4期，44-48页。doi：10.1145 / 1966989.1967009。

[ 46 ] Nelson Minar: “ Leap Second Crashes Half the Internet ,” somebits.com , July 3, 2012.

[46] Nelson Minar：“闰秒导致一半的互联网崩溃”，somebits.com，2012年7月3日。

[ 47 ] Christopher Pascoe: “ Time, Technology and Leaping Seconds ,” googleblog.blogspot.co.uk , September 15, 2011.

克里斯托弗·帕斯科（Christopher Pascoe）：“时间，技术和跳秒”，googleblog.blogspot.co.uk，2011年9月15日。

[ 48 ] Mingxue Zhao and Jeff Barr: “ Look Before You Leap – The Coming Leap Second and AWS ,” aws.amazon.com , May 18, 2015.

[48] 明雪赵（Mingxue Zhao）和杰夫·巴尔（Jeff Barr）：“Before You Leap – 即将到来的跳秒和 AWS，”aws.amazon.com，2015年5月18日。

[ 49 ] Darryl Veitch and Kanthaiah Vijayalayan: “ Network Timing and the 2015 Leap Second ,” at 17th International Conference on Passive and Active Measurement (PAM), April 2016. doi:10.1007/978-3-319-30505-9_29

[49] Darryl Veitch和Kanthaiah Vijayalayan： “网络时序和2015年闰秒”，收录于第17届被动和主动测量国际会议（PAM），于2016年4月。 doi:10.1007/978-3-319-30505-9_29

[ 50 ] “ Timekeeping in VMware Virtual Machines ,” Information Guide, VMware, Inc., December 2011.

“VMware虚拟机中的时间管理”，信息指南，VMware公司，2011年12月。

[ 51 ] “ MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I (Draft) ,” European Securities and Markets Authority, Report ESMA/2015/1464, September 2015.

`[51] "MiFID II / MiFIR：监管技术和实施标准 - 附录一（草案）" ，欧洲证券和市场监管局，报告ESMA/2015/1464，2015年9月。`

[ 52 ] Luke Bigum: “ Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1) ,” lmax.com , November 27, 2015.

[52]卢克·比格姆：“以最少开支解决MiFID II时钟同步问题（第1部分）”， lmax.com，2015年11月27日。

[ 53 ] Kyle Kingsbury: “ Call Me Maybe: Cassandra ,” aphyr.com , September 24, 2013.

[53] Kyle Kingsbury：“Call Me Maybe：Cassandra”，aphyr.com，2013年9月24日。 [53] 凯尔·金斯伯利：“Call Me Maybe：Cassandra”，aphyr.com，2013年9月24日。

[ 54 ] John Daily: “ Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems ,” basho.com , November 12, 2013.

"钟表是坏的，或者，欢迎来到分布式系统的奇妙世界"，basho.com，2013年11月12日。

[ 55 ] Kyle Kingsbury: “ The Trouble with Timestamps ,” aphyr.com , October 12, 2013.

[55] Kyle Kingsbury：“时间戳的难题”，aphyr.com，2013年10月12日。

[ 56 ] Leslie Lamport: “ Time, Clocks, and the Ordering of Events in a Distributed System ,” Communications of the ACM , volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563

“时间、时钟和分布式系统中事件的排序”，ACM通讯，第21卷，第7期，1978年7月，第558-565页。doi:10.1145/359545.359563。

[ 57 ] Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, et al.: “ Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases ,” State University of New York at Buffalo, Computer Science and Engineering Technical Report 2014-04, May 2014.

[57] Sandeep Kulkarni，Murat Demirbas，Deepak Madeppa等人：“全球分布式数据库中的逻辑物理时钟和一致快照”，纽约州立大学布法罗分校，计算机科学和工程技术报告2014-04，2014年5月。

[ 58 ] Justin Sheehy: “ There Is No Now: Problems With Simultaneity in Distributed Systems ,” ACM Queue , volume 13, number 3, pages 36–41, March 2015. doi:10.1145/2733108

[58] Justin Sheehy：“分布式系统中的同时性问题：不存在“现在””，《ACM队列》杂志，第13卷，第3期，第36-41页，2015年3月。doi:10.1145/2733108。

[ 59 ] Murat Demirbas: “ Spanner: Google’s Globally-Distributed Database ,” muratbuffalo.blogspot.co.uk , July 4, 2013.

"[59] Murat Demirbas: "Spanner: Google的全球分布式数据库"，muratbuffalo.blogspot.co.uk，2013年7月4日。"

[ 60 ] Dahlia Malkhi and Jean-Philippe Martin: “ Spanner’s Concurrency Control ,” ACM SIGACT News , volume 44, number 3, pages 73–77, September 2013. doi:10.1145/2527748.2527767

【60】 Dahlia Malkhi和Jean-Philippe Martin：“Spanner的并发控制”，ACM SIGACT新闻，44卷3号，2013年9月，73-77页。doi:10.1145/2527748.2527767。

[ 61 ] Manuel Bravo, Nuno Diegues, Jingna Zeng, et al.: “ On the Use of Clocks to Enforce Consistency in the Cloud ,” IEEE Data Engineering Bulletin , volume 38, number 1, pages 18–31, March 2015.

[61] Manuel Bravo, Nuno Diegues, Jingna Zeng等人：《在云计算中使用时钟以实现一致性》，IEEE数据工程通报，第38卷，第1期，2015年3月，18-31页。

[ 62 ] Spencer Kimball: “ Living Without Atomic Clocks ,” cockroachlabs.com , February 17, 2016.

[62] Spencer Kimball：“没有原子钟的生活”，cockroachlabs.com，2016年2月17日。

[ 63 ] Cary G. Gray and David R. Cheriton: “ Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency ,” at 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989. doi:10.1145/74850.74870

[63] Cary G. Gray和David R. Cheriton：“租约：分布式文件缓存一致性的高效容错机制”，发表于1989年12月的第12届ACM操作系统原理研讨会（SOSP）。doi：10.1145/74850.74870。

[ 64 ] Todd Lipcon: “ Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1 ,” blog.cloudera.com , February 24, 2011.

[64] Todd Lipcon: "使用MemStore-Local分配缓冲区来避免Apache HBase的Full GC：第一部分"，blog.cloudera.com，2011年2月24日。

[ 65 ] Martin Thompson: “ Java Garbage Collection Distilled ,” mechanical-sympathy.blogspot.co.uk , July 16, 2013.

马丁·汤普森：《Java 垃圾回收概述》，mechanical-sympathy.blogspot.co.uk，2013年7月16日。

[ 66 ] Alexey Ragozin: “ How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater ,” java.dzone.com , June 28, 2011.

[66] Alexey Ragozin: “如何驯服Java GC暂停？在16GB堆和更大的情况下生存”， java.dzone.com，2011年6月28日。

[ 67 ] Christopher Clark, Keir Fraser, Steven Hand, et al.: “ Live Migration of Virtual Machines ,” at 2nd USENIX Symposium on Symposium on Networked Systems Design & Implementation (NSDI), May 2005.

[67] 克里斯托弗·克拉克（Christopher Clark），凯尔·弗雷泽（Keir Fraser），史蒂文·汉德（Steven Hand）等：“虚拟机的实时迁移”，发表于第二届USENIX协会网络系统设计与实现研讨会（NSDI），2005年5月。

[ 68 ] Mike Shaver: “ fsyncers and Curveballs ,” shaver.off.net , May 25, 2008.

[68] 迈克·沙弗(Mike Shaver)：“fsyncers和Curveballs”，shaver.off.net，2008年5月25日。

[ 69 ] Zhenyun Zhuang and Cuong Tran: “ Eliminating Large JVM GC Pauses Caused by Background IO Traffic ,” engineering.linkedin.com , February 10, 2016.

【69】庄振云和陈庆：《消除背景IO流量造成的JVM大GC暂停》，engineering.linkedin.com，2016年2月10日。

[ 70 ] David Terei and Amit Levy: “ Blade: A Data Center Garbage Collector ,” arXiv:1504.02578, April 13, 2015.

[70] David Terei 和 Amit Levy: “Blade: 数据中心垃圾回收器,” arXiv:1504.02578, 2015 年 4 月 13 日。

[ 71 ] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz: “ Trash Day: Coordinating Garbage Collection in Distributed Systems ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

【71】Martin Maas、Tim Harris、Krste Asanović和John Kubiatowicz：“Trash Day：协调分布式系统中的垃圾回收”，于2015年5月在第15届USENIX操作系统热点工作坊（HotOS）上发表。

[ 72 ] “ Predictable Low Latency ,” Cinnober Financial Technology AB, cinnober.com , November 24, 2013.

[72] “可预测的低延迟”，Cinnober金融科技有限公司，cinnober.com，2013年11月24日。

[ 73 ] Martin Fowler: “ The LMAX Architecture ,” martinfowler.com , July 12, 2011.

[73] 马丁·福勒: “LMAX 架构”, martinfowler.com, 2011 年 7 月 12 日。

[ 74 ] Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination . O’Reilly Media, 2013. ISBN: 978-1-449-36130-3

【74】Flavio P. Junqueira和Benjamin Reed: 《ZooKeeper：分布式进程协调》。O'Reilly Media，2013年。ISBN：978-1-449-36130-3。

[ 75 ] Enis Söztutar: “ HBase and HDFS: Understanding Filesystem Usage in HBase ,” at HBaseCon , June 2013.

[75] Enis Söztutar：在HBaseCon会议上，于2013年6月提出“HBase和HDFS：了解HBase中的文件系统使用”。

[ 76 ] Caitie McCaffrey: “ Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived ,” caitiem.com , June 23, 2015.

[76] Caitie McCaffrey：客户就是混蛋：又名《如何在《光环4》上线时抵抗DoS攻击》（How Halo 4 DoSed the Services at Launch & How We Survived），caitiem.com，2015年6月23日。

[ 77 ] Leslie Lamport, Robert Shostak, and Marshall Pease: “ The Byzantine Generals Problem ,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 4, number 3, pages 382–401, July 1982. doi:10.1145/357172.357176

“拜占庭将军问题”，这篇论文是由Leslie Lamport、Robert Shostak和Marshall Pease合著的，发表在1982年7月的ACM Transactions on Programming Languages and Systems (TOPLAS)杂志上，第4卷第3期，382-401页。DOI:10.1145/357172.357176。

[ 78 ] Jim N. Gray: “ Notes on Data Base Operating Systems ,” in Operating Systems: An Advanced Course , Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7

[78] 吉姆·格雷：《数据库操作系统笔记》，收录于《操作系统：高阶课程》，计算机科学讲义，卷 60，由 R. Bayer、R. M. Graham 和 G. Seegmüller 编辑，Springer-Verlag 出版，1978 年，页码 393-481。ISBN: 978-3-540-08755-7。

[ 79 ] Brian Palmer: “ How Complicated Was the Byzantine Empire? ,” slate.com , October 20, 2011.

[79] 布莱恩·帕尔默：“拜占庭帝国有多复杂？” slate.com，2011年10月20日。

[ 80 ] Leslie Lamport: “ My Writings ,” research.microsoft.com , December 16, 2014. This page can be found by searching the web for the 23-character string obtained by removing the hyphens from the string allla-mport-spubso-ntheweb .

“Leslie Lamport：我的著作”，research.microsoft.com，2014年12月16日。可以通过搜索从字符字符串allla-mport-spubso-ntheweb中删除连字符的23个字符字符串找到此页面。

[ 81 ] John Rushby: “ Bus Architectures for Safety-Critical Embedded Systems ,” at 1st International Workshop on Embedded Software (EMSOFT), October 2001.

[81] 约翰·拉什比：「安全关键嵌入式系统的总线架构」，于2001年10月举办的第一届嵌入式软件研讨会（EMSOFT）上发表。

[ 82 ] Jake Edge: “ ELC: SpaceX Lessons Learned ,” lwn.net , March 6, 2013.

"ELC: SpaceX Lessons Learned," lwn.net, March 6, 2013. "ELC：SpaceX的教训与经验," lwn.net，2013年3月6日。

[ 83 ] Andrew Miller and Joseph J. LaViola, Jr.: “ Anonymous Byzantine Consensus from Moderately-Hard Puzzles: A Model for Bitcoin ,” University of Central Florida, Technical Report CS-TR-14-01, April 2014.

[83] 安德鲁·米勒和约瑟夫·J·拉维奥拉： “中度难度谜题的匿名拜占庭共识：比特币的模型”，中佛罗里达大学，技术报告CS-TR-14-01，2014年4月。

[ 84 ] James Mickens: “ The Saddest Moment ,” USENIX ;login: logout , May 2013.

"最悲哀的时刻，" USENIX ;login: logout, 2013年5月。"

[ 85 ] Evan Gilman: “ The Discovery of Apache ZooKeeper’s Poison Packet ,” pagerduty.com , May 7, 2015.

[85] Evan Gilman：“Apache ZooKeeper的毒包的发现”，pagerduty.com，2015年5月7日。

[ 86 ] Jonathan Stone and Craig Partridge: “ When the CRC and TCP Checksum Disagree ,” at ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), August 2000. doi:10.1145/347059.347561

[86] Jonathan Stone和Craig Partridge： “当CRC和TCP校验和不一致时”，《计算机通信应用、技术、架构与协议ACM会议》(SIGCOMM)，2000年8月。 doi:10.1145/347059.347561

[ 87 ] Evan Jones: “ How Both TCP and Ethernet Checksums Fail ,” evanjones.ca , October 5, 2015.

[87] Evan Jones：“TCP和以太网校验和是如何失败的”，evanjones.ca，2015年10月5日。

[ 88 ] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “ Consensus in the Presence of Partial Synchrony ,” Journal of the ACM , volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283

[88] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “在部分同步状态下的共识”，ACM杂志，第35卷，第2期，页码288-323，1988年4月发表。doi:10.1145/42282.42283

[ 89 ] Peter Bailis and Ali Ghodsi: “ Eventual Consistency Today: Limitations, Extensions, and Beyond ,” ACM Queue , volume 11, number 3, pages 55-63, March 2013. doi:10.1145/2460276.2462076

[89] Peter Bailis 和 Ali Ghodsi：「最终一致性今日的局限性、扩展和未来」，ACM Queue，第 11 卷，第 3 期，第 55-63 页，2013 年 3 月。 doi:10.1145/2460276.2462076

[ 90 ] Bowen Alpern and Fred B. Schneider: “ Defining Liveness ,” Information Processing Letters , volume 21, number 4, pages 181–185, October 1985. doi:10.1016/0020-0190(85)90056-0

【90】Bowen Alpern和Fred B. Schneider：“定义活性”，《信息处理信函》，第21卷，第4期，181-185页，1985年10月。doi：10.1016/0020-0190（85）90056-0。

[ 91 ] Flavio P. Junqueira: “ Dude, Where’s My Metadata? ,” fpj.me , May 28, 2015.

[91] Flavio P. Junqueira：“兄弟们，我的元数据在哪里？”fpj.me，2015年5月28日。

[ 92 ] Scott Sanders: “ January 28th Incident Report ,” github.com , February 3, 2016.

[92] Scott Sanders：“1月28日事件报告”，github.com，2016年2月3日。

[ 93 ] Jay Kreps: “ A Few Notes on Kafka and Jepsen ,” blog.empathybox.com , September 25, 2013.

[93] 杰伊·克雷普斯：《有关Kafka和Jepsen的几点注意事项》，blog.empathybox.com，2013年9月25日。

[ 94 ] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “ Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems ,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523627

【94】邓青，郝明喆，Tanakorn Leesatapornwongsa 等著：“Limplock：了解Limpware对scale-out云系统的影响”，收录于2013年第四届ACM云计算研讨会（SoCC），2013年10月发表。DOI:10.1145/2523616.2523627。

[ 95 ] Frank McSherry, Michael Isard, and Derek G. Murray: “ Scalability! But at What COST? ,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[95] Frank McSherry, Michael Isard和Derek G. Murray："可伸缩性！但代价是什么？", 在第15届USENIX操作系统热点工作坊（HotOS），2015年5月。