For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

Richard Feynman, Rogers Commission Report (1986)

In Part I of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in Part II, we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data?

There are various reasons why you might want to distribute a database across multiple machines:

Scalability

If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.

Fault tolerance/high availability

If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.

Latency

If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world.

Scaling to Higher Load

If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called vertical scaling or scaling up). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of shared-memory architecture, all the components can be treated as a single machine [1].ⁱ

The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load.

A shared-memory architecture may offer limited fault tolerance—high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machines)—but it is definitely limited to a single geographic location.

Another approach is the shared-disk architecture, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network.ⁱⁱ This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [2].

Shared-Nothing Architectures

By contrast, shared-nothing architectures [3] (sometimes called horizontal scaling or scaling out) have gained a lot of popularity. In this approach, each machine or virtual machine running the database software is called a node. Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.

No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio. You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter. With cloud deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible.

In this part of the book, we focus on shared-nothing architectures—not because they are necessarily the best choice for every use case, but rather because they require the most caution from you, the application developer. If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you.

While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use. In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [4]. On the other hand, shared-nothing systems can be very powerful. The next few chapters go into details on the issues that arise when data is distributed.

Replication Versus Partitioning

There are two common ways data is distributed across multiple nodes:

Replication

Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. We discuss replication in Chapter 5.

Partitioning

Splitting a big database into smaller subsets called partitions, so that different partitions can be assigned to different nodes. Partitioning is also known as sharding. We discuss partitioning in Chapter 6.

These are separate mechanisms, but they often go hand in hand, as illustrated in Figure II-1.
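
To make the combination concrete, here is a minimal sketch of a layout like the one the figure describes: two partitions, each kept on two nodes. It assumes hash-based partitioning and a fixed round-robin placement, which is only one of several possible schemes. The node names, constants, and function names are hypothetical and do not come from any particular database.

```python
import hashlib

# Assumptions for illustration only: two partitions with two replicas each,
# placed on four hypothetical nodes.
NUM_PARTITIONS = 2
REPLICAS_PER_PARTITION = 2
NODES = ["node-a", "node-b", "node-c", "node-d"]


def partition_for(key: str) -> int:
    """Partitioning: map a key to exactly one partition via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def replicas_for(partition: int) -> list[str]:
    """Replication: list the nodes that each hold a full copy of a partition."""
    start = partition * REPLICAS_PER_PARTITION
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS_PER_PARTITION)]


def nodes_for_key(key: str) -> list[str]:
    """A key lives in one partition, and that partition lives on several nodes."""
    return replicas_for(partition_for(key))


if __name__ == "__main__":
    for key in ["alice", "bob", "carol"]:
        print(key, "-> partition", partition_for(key), "on", nodes_for_key(key))
```

In this sketch a read for a key could be served by either node returned from `nodes_for_key`, so losing one node does not make the partition unavailable, while the hash spreads different keys across partitions to share the load.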

With an understanding of those concepts, we can discuss the difficult trade-offs that you need to make in a distributed system. We’ll discuss transactions in Chapter 7, as that will help you understand the many things that can go wrong in a data system, and what you can do about them. We’ll conclude this part of the book by discussing the fundamental limitations of distributed systems in Chapters 8 and 9.

Later, in Part III of this book, we will discuss how you can take several (potentially distributed) datastores and integrate them into a larger system, satisfying the needs of a complex application. But first, let’s talk about distributed data.

Footnotes

ⁱ In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [1]). To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby, which means that partitioning is still required, even when ostensibly running on one machine.

ⁱⁱ Network Attached Storage (NAS) or Storage Area Network (SAN).

References

[1] Ulrich Drepper: “What Every Programmer Should Know About Memory,” akkadia.org, November 21, 2007.

[2] Ben Stopford: “Shared Nothing vs. Shared Disk Architectures: An Independent View,” benstopford.com, November 24, 2009.

[3] Michael Stonebraker: “The Case for Shared Nothing,” IEEE Database Engineering Bulletin, volume 9, number 1, pages 4–9, March 1986.

[4] Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

Part II. Distributed Data

Figure II-1. A database split into two partitions, with two replicas per partition.
