第一章。可靠、可扩展和易维护的应用程序
Chapter 1. Reliable, Scalable, and Maintainable Applications
The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?
互联网做得如此出色,以至于大多数人把它看作像太平洋一样的自然资源,而不是人造的东西。上一次出现规模如此之大而又几乎没有差错的技术,是什么时候的事了?
Alan Kay , in interview with Dr Dobb’s Journal (2012)
艾伦·凯(Alan Kay),2012年接受Dr Dobb’s Journal采访。
Many applications today are data-intensive , as opposed to compute-intensive . Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
当今许多应用程序都是数据密集型(data-intensive)的,而不是计算密集型(compute-intensive)的。对于这类应用,原始的CPU处理能力很少成为限制因素,更大的问题通常是数据量、数据的复杂度以及数据变化的速度。
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
一个数据密集型应用通常是由提供常见功能的标准构建块构建而成。例如,许多应用程序需要:
-
Store data so that they, or another application, can find it again later ( databases )
将数据存储起来,以便它们或其他应用程序可以在以后找到它(数据库)
-
Remember the result of an expensive operation, to speed up reads ( caches )
记住昂贵操作的结果以加快读取速度(缓存)。
-
Allow users to search data by keyword or filter it in various ways ( search indexes )
允许用户按关键字搜索数据,或以各种方式对数据进行过滤(搜索索引)。
-
Send a message to another process, to be handled asynchronously ( stream processing )
发送消息到另一个进程,进行异步处理(流处理)。
-
Periodically crunch a large amount of accumulated data ( batch processing )
定期处理大量累积的数据(批处理)
If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
如果这些听起来平淡无奇,那正是因为这些数据系统是非常成功的抽象:我们一直在不假思索地使用它们。构建应用程序时,大多数工程师做梦也不会想到从头编写一个新的数据存储引擎,因为数据库已经是完成这项工作足够好的工具。
But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.
然而现实并没有这么简单。不同的应用有不同的需求,因此存在许多特性各异的数据库系统;缓存有多种方法,构建搜索索引也有好几种方式,等等。在构建应用程序时,我们仍然需要弄清楚哪些工具和方法最适合手头的任务。而当单个工具无法独自完成你需要做的事情时,把多个工具组合起来也可能很困难。
This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.
这本书带你穿越数据系统的原则和实践,以及如何使用它们构建数据密集型应用。我们将探讨不同工具的共同点、区别以及它们如何实现其特性。
In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.
在本章中,我们将首先探讨我们所致力于实现的基本原理:可靠、可扩展和可维护的数据系统。我们将澄清这些事物的含义,概述一些思考方式,并介绍我们后续章节所需要的基础知识。在接下来的章节中,我们将逐层深入,探讨处理数据密集型应用时需要考虑的不同设计决策。
Thinking About Data Systems
We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.
我们通常认为数据库、队列、缓存等是非常不同的工具类别。 尽管数据库和消息队列有一些表面的相似之处——都可以存储数据一段时间——但它们具有非常不同的访问模式,这意味着不同的性能特征,因此实现也非常不同。
So why should we lump them all together under an umbrella term like data systems ?
那么我们为什么要把它们都归为一个数据系统的总称呢?
Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories [ 1 ]. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.
近年来涌现了许多新的数据存储和处理工具。它们针对不同的使用情境进行了优化,并不再简单地符合传统分类[1]。例如,有些数据存储库也用作消息队列(Redis),有些消息队列具备类似数据库的耐久性保证(Apache Kafka)。各种类别的边界变得模糊不清。
Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.
其次,越来越多的应用程序现在拥有如此苛刻或广泛的要求,以至于单个工具不能再满足其所有数据处理和存储需求。相反,工作被分解为可以在单个工具上高效执行的任务,并使用应用代码将这些不同的工具组合在一起。
For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).
例如,如果您拥有一个应用程序管理的缓存层(使用Memcached或类似工具)或一个与您的主数据库分开的全文搜索服务器(例如Elasticsearch或Solr),通常应用程序代码需要负责保持这些缓存和索引与主数据库同步。图1-1展示了这可能是什么样子(我们会在后面的章节详细说明)。
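To make this concrete, here is a minimal sketch of what such synchronization code might look like. It is only an illustration: plain in-memory dictionaries stand in for the primary database, the cache (e.g., Memcached), and the search index (e.g., Elasticsearch), and the function names are hypothetical.
为了更具体一些,下面给出此类同步代码可能的一个简化示意。它仅作说明之用:用普通的内存字典来代替主数据库、缓存(例如Memcached)和搜索索引(例如Elasticsearch),函数名也是假设的。

```python
# A sketch of application-managed synchronization between a primary
# database, a cache, and a search index (all faked with in-memory dicts).

db = {}            # primary database: user_id -> profile
cache = {}         # cache of profiles, keyed by user_id
search_index = {}  # word -> set of user_ids whose bio contains that word

def update_user_profile(user_id, profile):
    db[user_id] = profile                 # 1. write to the system of record
    cache.pop(user_id, None)              # 2. invalidate the cache entry
    for word in profile["bio"].split():   # 3. keep the search index in sync
        search_index.setdefault(word.lower(), set()).add(user_id)

def get_user_profile(user_id):
    if user_id not in cache:              # read-through: fill cache on a miss
        cache[user_id] = db[user_id]
    return cache[user_id]

update_user_profile(42, {"name": "Alice", "bio": "Distributed systems fan"})
print(get_user_profile(42)["name"])       # -> Alice
print(sorted(search_index["systems"]))    # -> [42]
```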
When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.
当你将几个工具组合起来提供一项服务时,服务的接口或应用程序编程接口(API)通常会向客户端隐藏这些实现细节。现在,你实际上已经用较小的通用组件创建了一个新的、专用的数据系统。你的复合数据系统可能会提供某些保证:例如,缓存会在写入时被正确地失效或更新,从而使外部客户端看到一致的结果。现在,你不仅是应用程序开发人员,还是数据系统设计者。
If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?
如果你正在设计一个数据系统或服务,会遇到很多棘手的问题。即使内部出现问题,如何确保数据仍然正确和完整?即使系统的某些部分出现降级,如何为客户端提供始终如一的良好性能?如何通过扩展来应对负载的增加?一个好的服务API应该是什么样的?
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.
有许多因素可能影响数据系统的设计,包括涉及的人员的技能和经验,遗留系统的依赖性,交付时间表,组织对不同风险容忍度,监管限制等。这些因素在很大程度上取决于具体情况。
In this book, we focus on three concerns that are important in most software systems:
在这本书中,我们关注的是大多数软件系统中重要的三个问题:
- Reliability
-
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability” .
即使面对逆境(硬件或软件故障,甚至人为错误),系统也应当继续正确地工作(在期望的性能水平上执行正确的功能)。参见“可靠性”。
- Scalability
-
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See “Scalability” .
随着系统规模的增长(数据量、流量或复杂性),应有合理的处理方式。参见“可扩展性”。
- Maintainability
-
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively . See “Maintainability” .
随着时间的推移,许多不同的人将在系统上工作(包括工程和运营,维护当前行为并适应新的用例),他们都应该能够以高效的方式工作。请参见:“可维护性”。
These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.
这些词经常被随意使用,而人们对其含义却缺乏清晰的理解。为了进行严谨的工程设计,我们将在本章余下的部分探讨思考可靠性、可扩展性和可维护性的方式。然后,在后续章节中,我们将研究为实现这些目标而使用的各种技术、架构和算法。
Reliability
Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:
每个人都对什么是可靠或不可靠的东西有一个直观的想法。对于软件来说,典型的期望包括:
-
The application performs the function that the user expected.
应用程序执行用户期望的功能。
-
It can tolerate the user making mistakes or using the software in unexpected ways.
它可以容忍用户犯错误或以意想不到的方式使用软件。
-
Its performance is good enough for the required use case, under the expected load and data volume.
在预期的负载和数据量下,它的性能足以满足所需的用例。
-
The system prevents any unauthorized access and abuse.
该系统防止未经授权的访问和滥用。
If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”
如果所有这些事情一起意味着“正常工作”,那么我们可以将可靠性大致理解为“即使在出现故障时仍能正确工作”。
The things that can go wrong are called faults , and systems that anticipate faults and can cope with them are called fault-tolerant or resilient . The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.
能够出错的事情被称为故障,而能够预见并应对故障的系统被称为容错或弹性系统。前者的术语略有误导:它暗示我们可以使系统容忍每种可能的故障,但实际上这是不可行的。如果整个地球(和所有服务器)都被黑洞吞噬,要容忍那种故障就需要在太空中进行网络托管——但很难拿到预算。因此,只有谈论容忍某些类型的故障才有意义。
Note that a fault is not the same as a failure [ 2 ]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.
请注意,故障与失效不同。[2] 故障通常被定义为系统中的一个组件偏离其规范,而失效是指整个系统停止向用户提供所需的服务。将故障的概率降至零是不可能的,因此通常最好设计能够防止故障导致失效的容错机制。在本书中,我们介绍了几种利用不可靠部件构建可靠系统的技术。
Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [ 3 ]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [ 4 ] is an example of this approach.
与直觉相反,在这样的容错系统中,通过故意触发故障来提高故障率反而是有意义的,例如在没有警告的情况下随机杀死单个进程。许多关键性的bug实际上源于糟糕的错误处理[3];通过故意引发故障,可以确保容错机制不断得到运用和测试,从而增强你的信心:当故障自然发生时,系统也能正确地处理它们。Netflix的Chaos Monkey[4]就是这种方法的一个例子。
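As a toy illustration of this idea (not Netflix's actual implementation), the sketch below runs a worker process, randomly terminates it to inject faults, and exercises the restart path each time; the process layout and probabilities are invented for the example.
作为这一思路的一个玩具示例(并非Netflix的实际实现),下面的示意代码运行一个工作进程,随机终止它以注入故障,并在每次故障后演练重启路径;其中的进程结构和概率都是为示例而假设的。

```python
import multiprocessing, random, time

def worker():
    while True:
        time.sleep(0.1)   # stand-in for useful work

if __name__ == "__main__":
    proc = multiprocessing.Process(target=worker)
    proc.start()
    for _ in range(20):
        time.sleep(0.2)
        if random.random() < 0.3:     # chaos step: deliberately kill the worker
            proc.terminate()
            proc.join()
        if not proc.is_alive():       # fault-tolerance machinery: detect and restart
            proc = multiprocessing.Process(target=worker)
            proc.start()
    proc.terminate()
    proc.join()
```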
Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.
虽然我们通常更喜欢容忍错误而不是预防错误,但在有些情况下,预防优于治疗(例如,因为没有治愈方法)。安全问题就是这种情况的例子:如果攻击者侵犯了某个系统并获得了敏感数据的访问权限,那么该事件就无法撤消。然而,本书大多数讨论的是可以治愈的故障类型,如下面的章节所描述的那样。
Hardware Faults
When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.
当我们考虑系统故障的原因时,硬件故障很快就会跳入脑海。硬盘崩溃,内存损坏,电网停电,有人拔了错误的网络电缆。任何曾在大型数据中心工作过的人都可以告诉你,当你有很多机器时,这些事情经常发生。
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [ 5 , 6 ]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
据报告,硬盘的平均无故障时间(MTTF)约为10到50年[5, 6]。因此,在一个拥有10,000块磁盘的存储集群上,我们应该预期平均每天有一块磁盘损坏。
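A quick back-of-the-envelope check of that estimate, assuming a constant failure rate and taking a value near the middle of the quoted MTTF range:
对这一估计做一个粗略的验算,假设故障率恒定,并取上述MTTF范围中间附近的一个值:

```python
num_disks = 10_000
mttf_years = 27                                   # roughly mid-way between 10 and 50
failures_per_day = num_disks / (mttf_years * 365)
print(round(failures_per_day, 1))                 # ≈ 1.0 disk failure per day
```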
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
我们通常的第一反应是将冗余添加到个别硬件组件中,以降低系统的故障率。可以将磁盘设置为RAID配置,服务器可以拥有双重电源和可热插拔的CPU,数据中心可以拥有备用电源的电池和柴油发电机。当一个组件出现故障时,冗余组件可以代替它的位置,而破损的组件则需要被更换。这种方法不能完全防止硬件问题导致故障,但是它是众所周知的,并且通常可以使一台机器连续运行多年。
Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.
直到最近,硬件组件的冗余对大多数应用来说已经足够,因为它使单台机器的完全失效变得相当罕见。只要能够相当快速地把备份恢复到新机器上,故障时的停机时间对大多数应用来说并不是灾难性的。因此,只有少数对高可用性有绝对要求的应用才需要多机冗余。
However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning [ 7 ], as the platforms are designed to prioritize flexibility and elasticity i over single-machine reliability.
然而,随着数据量和应用计算需求的增长,越来越多的应用开始使用更大数量的机器,这也成比例地增加了硬件故障的发生率。此外,在一些云平台(例如亚马逊网络服务,AWS)上,虚拟机实例在没有任何警告的情况下变得不可用是相当常见的[7],因为这些平台的设计优先考虑灵活性和弹性,而不是单机的可靠性。
Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade ; see Chapter 4 ).
因此,业界正在转向能够容忍整台机器故障的系统:优先使用或额外使用软件容错技术,而不仅仅依赖硬件冗余。这样的系统还有运维上的优势:单服务器系统在需要重启机器时(例如为了安装操作系统安全补丁)必须计划停机,而能够容忍机器故障的系统可以一次修补一个节点,整个系统无需停机(即滚动升级;参见第四章)。
Software Errors
We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.
通常我们认为硬件故障是随机和相互独立的:一台机器的硬盘故障并不意味着另一台机器的硬盘会发生故障。可能存在一些弱相关性(例如由于共同的原因,例如服务器机架中的温度),但除此之外,大量硬件组件同时发生故障的可能性很小。
Another class of fault is a systematic error within the system [ 8 ]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [ 5 ]. Examples include:
另一类故障是系统内的系统性错误[8]。 这种故障很难预测,因为它们在节点之间存在相关性,因此它们往往比不相关的硬件故障导致更多的系统故障[5]。 例如:
-
A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [ 9 ].
一个软件bug,导致应用服务器的每个实例在收到特定的坏输入时都崩溃。例如,2012年6月30日的闰秒,由于Linux内核中的一个bug,导致许多应用程序同时挂起[9]。
-
A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.
一个耗尽了一些共享资源的失控进程——CPU时间、内存、磁盘空间或网络带宽。
-
A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.
系统依赖的服务变慢、无响应或返回错误响应。
-
Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults [ 10 ].
级联故障,一个组件的小故障会引发另一个组件的故障,接着又会触发更多的故障[10]。
The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason [ 11 ].
导致这类软件故障的bug通常会潜伏很长时间,直到被一组不寻常的情况触发。在这种情况下才会暴露出:软件对其环境做出了某种假设,这个假设通常是成立的,但由于某种原因最终不再成立[11]。
There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [ 12 ].
软件中的系统性故障没有速效的解决办法,但有许多小办法可以起到帮助:仔细考虑系统中的假设和交互;进行彻底的测试;进程隔离;允许进程崩溃并重启;在生产环境中测量、监控并分析系统行为。如果系统要提供某种保证(例如,在消息队列中,进入的消息数量等于发出的消息数量),它可以在运行过程中不断进行自检,并在发现差异时发出警报[12]。
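The sketch below illustrates such a runtime self-check on a toy in-process queue; the class and its invariant are invented for the example, and a real system would report the discrepancy to its monitoring system rather than just logging it.
下面的示意代码在一个玩具式的进程内队列上演示了这种运行时自检;这个类及其不变式都是为示例虚构的,真实系统会把发现的差异上报给监控系统,而不仅仅是记录日志。

```python
import queue, logging

class AuditedQueue:
    """A toy queue that continuously checks its own accounting:
    messages_in must equal messages_out plus messages still queued."""

    def __init__(self):
        self._q = queue.Queue()
        self.messages_in = 0
        self.messages_out = 0

    def put(self, msg):
        self._q.put(msg)
        self.messages_in += 1
        self._audit()

    def get(self):
        msg = self._q.get()
        self.messages_out += 1
        self._audit()
        return msg

    def _audit(self):
        # Raise an alert if the accounting no longer adds up.
        if self.messages_in != self.messages_out + self._q.qsize():
            logging.error("queue invariant violated: in=%d out=%d queued=%d",
                          self.messages_in, self.messages_out, self._q.qsize())

q = AuditedQueue()
q.put("hello")
print(q.get())   # -> hello
```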
Human Errors
Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [ 13 ].
软件系统由人设计和构建,维持系统运行的运维人员也是人。即使抱着最好的意图,人也是出了名的不可靠。例如,一项针对大型互联网服务的研究发现,运维人员的配置错误是导致服务中断的首要原因,而硬件故障(服务器或网络)只在10%~25%的中断事件中起作用[13]。
How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:
我们如何在人类不可靠的情况下使我们的系统可靠?最好的系统结合了几种方法:
-
Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
以最小化错误机会的方式设计系统。例如,设计良好的抽象、API和管理界面可以使“正确的事情”变得容易,并避免“错误的事情”。然而,如果接口太严格,人们会绕过它们,抵消其好处,因此这是一个棘手的平衡问题。
-
Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
将人们经常犯错的地方与可能导致失败的地方分离开来。特别是,提供完整功能的非生产沙盒环境,人们可以在其中安全地探索和实验,使用真实数据,而不会影响真实用户。
-
Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [ 3 ]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
彻底测试各个层面,从单元测试到整个系统集成测试和手动测试[3]。自动化测试被广泛使用,被很好地理解,并且在覆盖在正常操作中很少出现的边缘情况时尤为有价值。
-
Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
允许快速轻松地从人为错误中恢复,以最小化故障的影响。例如,快速回滚配置更改,逐步推出新代码(以便任何意外错误只影响一小部分用户),并提供重新计算数据的工具(以防旧计算结果不正确)。
-
Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry . (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [ 14 ].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
建立详细而清晰的监控,例如性能指标和错误率。在其他工程领域,这被称为遥测(telemetry)。(火箭一旦离开地面,遥测对于跟踪正在发生的事情以及理解故障就至关重要[14]。)监控可以向我们展示早期的预警信号,并让我们检查是否有任何假设或约束被违反。当问题发生时,这些指标对于诊断问题非常有价值。
-
Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.
实施良好的管理实践和培训——这是一个复杂而重要的方面,超出本书的范围。
How Important Is Reliability?
Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.
可靠性不仅针对核电站和空中交通控制软件——更普通的应用程序同样需要保证可靠运行。商业应用程序中的漏洞会导致生产力下降(如果数字报告不正确,还会面临法律风险),电子商务网站的停机则会造成巨额损失,且声誉受损。
Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application [ 15 ]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
即使在“非关键”应用程序中,我们也有责任对我们的用户负责。考虑一个将他们所有的孩子的照片和视频存储在你的照片应用程序[15]中的父母。如果该数据库突然损坏,他们会感觉如何?他们知道如何从备份中恢复它吗?
There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.
在某些情况下,我们可能会选择牺牲可靠性来降低开发成本(例如,为尚未验证的市场开发原型产品)或运营成本(例如,利润率极低的服务),但我们应该非常清楚自己是在什么时候偷工减料。
Scalability
Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.
即使系统今天能够可靠地运行,也不意味着它未来也能保持可靠性。一个常见的退化原因是负载增加:可能系统从10,000个并发用户增长到100,000个并发用户,或者从1百万增长到1千万。也许它正在处理比以前更大量的数据。
Scalability is the term we use to describe a system’s ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
可扩展性是我们用来描述系统处理增加负载的能力的术语。然而,需要注意的是,它不是我们可以给系统附加的一维标签:说“X是可扩展的”或“Y不可扩展”是没有意义的。相反,讨论可扩展性意味着考虑问题,例如“如果系统以特定方式增长,我们处理增长的选择是什么?”和“我们如何添加计算资源来处理额外的负载?”
Describing Load
First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters . The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.
首先,我们需要简洁地描述系统上的当前负载;只有这样我们才能讨论增长问题(如果我们的负载翻倍会发生什么?)。负载可以用一些数字来描述,我们称之为负载参数。参数的最佳选择取决于您系统的架构:可能是每秒钟对Web服务器的请求,数据库中读写比例,聊天室中的同时活跃用户数量,缓存命中率或其他内容。也许对您来说平均情况最重要,或者您的瓶颈由少数几个极端情况主导。
To make this idea more concrete, let’s consider Twitter as an example, using data published in November 2012 [ 16 ]. Two of Twitter’s main operations are:
为了使这个想法更具体化,让我们以Twitter为例,使用2012年11月发表的数据[16]。 Twitter的两个主要操作是:
- Post tweet
-
A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
一个用户可以向他们的关注者发布一条新信息(平均每秒4.6k个请求,峰值时每秒超过12k个请求)。
- Home timeline
-
A user can view tweets posted by the people they follow (300k requests/sec).
用户可以查看他们所关注的人发布的推文(每秒30万次请求)。
Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out ii —each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:
仅仅处理每秒12,000次写入(发推的峰值速率)相当容易。然而,Twitter的扩展性挑战主要不是来自推文量,而是来自扇出(fan-out):每个用户关注了很多人,也被很多人关注。大体上有两种方式来实现这两种操作:
-
Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2 , you could write a query such as:
发布推文时,只需把新推文插入到一个全局的推文集合中。当用户请求自己的主页时间线时,先找出他们关注的所有人,再找出这些人的所有推文,然后按时间排序合并。在如图1-2所示的关系型数据库中,可以编写这样的查询:
```sql
SELECT tweets.*, users.*
  FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user
```
-
Maintain a cache for each user’s home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3 ). When a user posts a tweet , look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
为每个用户的主页时间线维护一个缓存,就像每个接收用户的推文信箱一样(参见图1-3)。当用户发布推文时,查找所有关注该用户的人,并将新推文插入到他们每个人的主页时间线缓存中。这样读取主页时间线的请求开销就很小,因为其结果已经提前计算好了(参见下面的示意代码)。
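A minimal in-memory sketch of this fan-out-on-write idea is shown below; the data structures and names are invented for illustration and ignore persistence, concurrency, and failure handling that a real system would need.
下面是这种“写时扇出”思路的一个最小化内存示意;其中的数据结构和名称都是为说明而虚构的,忽略了真实系统必须考虑的持久化、并发和故障处理。

```python
# Approach 2: fan out new tweets to each follower's home timeline cache.
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # user -> their followers
timelines = {}   # recipient -> list of tweet ids, newest first
tweets = {}      # tweet id -> (sender, text)

def post_tweet(sender, tweet_id, text):
    tweets[tweet_id] = (sender, text)
    for follower in followers.get(sender, []):          # write-time fan-out
        timelines.setdefault(follower, []).insert(0, tweet_id)

def home_timeline(user):
    # Reads are cheap: the timeline was precomputed at write time.
    return [tweets[t] for t in timelines.get(user, [])]

post_tweet("alice", 1, "hello")
post_tweet("bob", 2, "hi alice")
print(home_timeline("carol"))   # -> [('bob', 'hi alice'), ('alice', 'hello')]
```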
The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time.
Twitter的第一版使用方法1,但是那时系统很难跟上主页时间轴查询的负载,所以公司转而使用方法2。这个方法效果更好,因为发布推文的平均速率比主页时间轴读取的速率低了近两个数量级,所以在这种情况下,在写入时做更多的工作,而在读取时做更少的工作更好。
However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge.
然而,方法2的缺点是发布推文现在需要大量额外的工作。平均而言,一条推文会被投递给约75个关注者,因此每秒4.6k条推文变成了对主页时间线缓存每秒345k次写入。但这个平均值掩盖了一个事实:每个用户的关注者数量差异巨大,有些用户拥有超过3000万关注者。这意味着一条推文可能导致超过3000万次对主页时间线的写入!而Twitter力图在五秒内把推文投递给关注者,要及时完成这项工作是一个巨大的挑战。
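The arithmetic behind these figures is simple to spell out, and it shows why the celebrity case is so much worse than the average case:
这些数字背后的算术很容易展开,它也说明了为什么名人的情况比平均情况糟糕得多:

```python
tweets_per_sec = 4_600
avg_followers = 75
avg_writes_per_sec = tweets_per_sec * avg_followers
print(avg_writes_per_sec)                        # 345000 timeline writes/sec on average

celebrity_followers = 30_000_000
print(celebrity_followers / avg_writes_per_sec)  # one celebrity tweet ≈ 87 s of average write load
```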
In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.
以 Twitter 为例,用户的关注者分布(可能根据其推文频率加权)是讨论可扩展性的关键负载参数,因为它决定了扇出负载。你的应用程序可能具有非常不同的特点,但你可以应用类似的原则来推理它的负载。
The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.
Twitter轶事的最后一个转折:既然方法2已经稳健地实现,Twitter正在转向两种方法的混合。大多数用户的推文在发布时仍会被扇出到各个主页时间线,但少数拥有海量关注者的用户(即名人)被排除在这种扇出之外。用户所关注的名人的推文会被单独获取,并在读取时与该用户的主页时间线合并,就像方法1那样。这种混合方法能够提供持续的良好性能。在介绍了更多技术性内容之后,我们将在第12章重新回顾这个例子。
Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
一旦你描述了系统的负载,你就可以调查当负载增加时会发生什么。你可以从两个方面来看待它:
-
When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
当你增加一个负载参数并保持系统资源(CPU、内存、网络带宽等)不变时,系统的性能会受到怎样的影响?
-
When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
当你增加负载参数时,如果想保持性能不变,需要增加多少资源?
Both questions require performance numbers, so let’s look briefly at describing the performance of a system.
两个问题都需要性能数字,因此让我们简要地描述一下系统的性能。
In a batch processing system such as Hadoop, we usually care about throughput —the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. iii In online systems, what’s usually more important is the service’s response time —that is, the time between a client sending a request and receiving a response.
在像Hadoop这样的批处理系统中,我们通常关心吞吐量——每秒可以处理的记录数量,或者在特定大小的数据集上运行作业所需的总时间。在在线系统中,更重要的是服务的响应时间——即客户端发送请求和接收响应之间的时间。
Latency and response time
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time ), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent , awaiting service [ 17 ].
延迟(latency)和响应时间(response time)经常被当作同义词使用,但它们并不相同。响应时间是客户端看到的:除了实际处理请求的时间(服务时间)之外,还包括网络延迟和排队延迟。延迟是请求等待处理所经历的时间,在此期间它处于潜伏(latent)状态,等待服务[17]。
Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.
即使你一遍又一遍地做出相同的请求,每次尝试得到的响应时间也会稍有不同。在实践中,对于处理各种请求的系统,响应时间可能会有很大的差异。因此,我们需要将响应时间看作一组可以测量的值的分布,而不是一个单一的数字。
In Figure 1-4 , each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you’d think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [ 18 ], or many other causes.
在图1-4中,每个灰色条代表一次对服务的请求,其高度表示该请求花费了多长时间。大多数请求相当快,但偶尔会出现耗时长得多的异常值。也许慢请求本质上就开销更大,例如因为它们要处理更多数据。但即使在你认为所有请求都应该花费相同时间的场景中,也会出现波动:上下文切换到后台进程、网络数据包丢失与TCP重传、垃圾回收暂停、缺页中断导致从磁盘读取、服务器机架的机械振动[18],或许多其他原因,都可能引入随机的额外延迟。
It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn’t refer to any particular formula, but in practice it is usually understood as the arithmetic mean : given n values, add up all the values, and divide by n .) However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.
通常会报告服务的平均响应时间。(严格来说,“平均值”一词并不指代任何特定的公式,但在实践中通常被理解为算术平均值:给定 n 个值,将所有值相加,然后除以 n。)然而,如果你想知道你的“典型”响应时间,平均数不是一个很好的度量标准,因为它不告诉你有多少个用户实际上经历了那个延迟。
Usually it is better to use percentiles . If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.
通常最好使用百分位数。如果您将响应时间列表按从最快到最慢进行排序,则中位数是中间点:例如,如果中位数响应时间为200毫秒,则表示您一半的请求在200毫秒以下返回,一半的请求需要更长时间。
This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile , and sometimes abbreviated as p50 . Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
这使得中位数成为一种良好的度量方法,如果您想知道用户通常需要等待多长时间:一半的用户请求在低于中位响应时间的情况下得到服务,另一半需要比中位数更长的时间。中位数也被称为第50个百分位数,并有时缩写为p50。请注意,中位数是指单个请求;如果用户发出多个请求(在会话期间或因为单个页面包括多个资源),则至少有一个请求慢于中位数的概率远大于50%。
In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th , 99th , and 99.9th percentiles are common (abbreviated p95 , p99 , and p999 ). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4 .
为了弄清异常数据有多严重,你可以查看更高的百分位数:95%、99%和99.9%百分位数是常见的(简写为p95、p99和p999)。它们是响应时间阈值,其中95%、99%或99.9%的请求速度比特定阈值更快。举例来说,如果95%分位数的响应时间为1.5秒,就意味着100个请求中有95个会在1.5秒内完成,而有5个需要1.5秒或更长时间。这在图1-4中有所说明。
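As a small illustration, the sketch below computes p50, p95, and p99 from a list of measured response times using the simple nearest-rank method; production monitoring systems typically use streaming approximations rather than keeping every measurement.
作为一个小示例,下面的代码用简单的“最近秩”方法从一组测得的响应时间中计算p50、p95和p99;生产环境的监控系统通常使用流式近似算法,而不是保留每一次测量值。

```python
def percentile(sorted_times, p):
    """Nearest-rank percentile: p is a fraction, e.g. 0.95 for p95."""
    index = min(len(sorted_times) - 1, int(p * len(sorted_times)))
    return sorted_times[index]

response_times_ms = sorted([32, 45, 48, 51, 55, 60, 75, 90, 120, 1500])
for p in (0.50, 0.95, 0.99):
    print(f"p{int(p * 100)} = {percentile(response_times_ms, p)} ms")
# p50 = 60 ms, p95 = 1500 ms, p99 = 1500 ms: a single outlier dominates the tail
```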
High percentiles of response times, also known as tail latencies , are important because they directly affect users’ experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers [ 19 ]. It’s important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [ 20 ], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% [ 21 , 22 ].
响应时间的高百分位数,也称为尾部延迟(tail latency),非常重要,因为它们直接影响用户对服务的体验。例如,亚马逊用99.9百分位数来描述内部服务的响应时间要求,尽管它只影响千分之一的请求。这是因为请求最慢的客户往往是账户数据最多的客户,因为他们购买了很多东西,也就是说,他们是最有价值的客户[19]。保证网站对他们的响应迅速,对于保持这些客户的满意度非常重要:亚马逊还观察到,响应时间每增加100毫秒,销售额就会下降1%[20];另有报告称,1秒的延迟会使客户满意度指标下降16%[21,22]。
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.
另一方面,优化第99.99个百分位数(最慢的10000个请求中的1个)被认为对于亚马逊的目的来说太昂贵且利益不足。在非常高的百分位数上降低响应时间很困难,因为它们很容易受到无法控制的随机事件的影响,而且效益逐渐变小。
For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
例如,百分位数通常用于服务水平目标(SLO)和服务水平协议(SLA),这些协议定义了服务的预期性能和可用性。SLA 可以说明,如果服务的中位响应时间小于 200 毫秒且 99 百分位数小于 1 秒,则认为该服务处于上线状态(如果响应时间更长,则可以认为已经下线),并且该服务可能需要保持至少 99.9% 的可用性。这些指标为服务的客户设定了期望,并允许客户在未达到 SLA 的情况下要求退款。
Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking . Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.
排队延迟通常是高百分位数响应时间的重要组成部分。由于服务器只能并行处理少量事务(例如受CPU核心数限制),只需少量缓慢的请求就会阻塞后续请求的处理,这种效应有时被称为队头阻塞(head-of-line blocking)。即使后续请求在服务器上处理得很快,由于要等待先前请求完成,客户端看到的总体响应时间仍然很慢。由于这种效应,在客户端测量响应时间非常重要。
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [ 23 ].
当人工产生负载以测试系统的可扩展性时,产生负载的客户端需要独立于响应时间继续发送请求。如果客户端在发送下一个请求之前等待前一个请求完成,这种行为会在测试中人为地使队列比实际上要短,从而扭曲测量结果[23]。
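A minimal sketch of such an "open loop" load generator is shown below, using asyncio and a stand-in coroutine in place of a real network request; requests are issued on a fixed schedule regardless of whether earlier ones have completed.
下面是这种“开环”负载生成器的一个最小示意,使用asyncio,并用一个占位协程代替真实的网络请求;请求按固定的节奏发出,而不管先前的请求是否已经完成。

```python
import asyncio, random, time

async def call_service():
    await asyncio.sleep(random.uniform(0.01, 0.2))   # stand-in for a real request

async def open_loop_load(rate_per_sec, duration_sec):
    tasks, interval = [], 1.0 / rate_per_sec
    start = time.monotonic()
    while time.monotonic() - start < duration_sec:
        tasks.append(asyncio.create_task(call_service()))  # don't wait for replies
        await asyncio.sleep(interval)                      # next request is due regardless
    await asyncio.gather(*tasks)

asyncio.run(open_loop_load(rate_per_sec=50, duration_sec=2))
```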
Approaches for Coping with Load
Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?
现在我们已经讨论了描述负载的参数和测量性能的指标,我们可以认真讨论可扩展性了:即使我们的负载参数增加了一定量,我们如何保持良好的性能?
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase—or perhaps even more often than that.
如果一种架构适合某个负载水平,那么它很可能无法应对10倍于此的负载。如果你在开发一个快速增长的服务,那么在每一次负载增加一个数量级时,你很可能需要重新思考你的架构,甚至需要更频繁地重构。
People often talk of a dichotomy between scaling up ( vertical scaling , moving to a more powerful machine) and scaling out ( horizontal scaling , distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
人们经常谈论纵向扩展(scaling up,垂直扩展,换用更强大的机器)和横向扩展(scaling out,水平扩展,将负载分布到多台较小的机器上)之间的二分法。将负载分布到多台机器上也被称为无共享(shared-nothing)架构。能在单台机器上运行的系统通常更简单,但高端机器可能非常昂贵,因此非常密集的工作负载往往无法避免横向扩展。实际上,好的架构通常是务实地混合使用多种方法:例如,使用几台相当强大的机器仍然可能比大量小型虚拟机更简单、更便宜。
Some systems are elastic , meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see “Rebalancing Partitions” ).
一些系统是弹性的,意味着它们在检测到负载增加时可以自动添加计算资源,而其他系统则需要手动扩容(由人员分析容量并决定是否需要添加更多机器)。当负载高度不可预测时,弹性系统可能非常有用,但手动扩容的系统更简单,可能会有更少的操作意外(请参见“重新平衡分区”)。
While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.
将无状态服务分发到多台机器上相对来说相当简单,但将有状态数据系统从单节点迁移至分布式环境则可能引入大量额外复杂性。因此,直到不久之前,通常被认为明智之举是将数据库保持在单一节点上(纵向扩展)直至扩展成本或高可用性需求迫使你将其变为分布式环境。
As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don’t handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.
随着分布式系统的工具和抽象层的不断完善,这些普遍的认识可能会改变,至少对某些应用而言是如此。未来,分布式数据系统有可能成为默认选择,即使是用于不处理大量数据或流量的案例。在本书的其余部分,我们将涵盖许多种分布式数据系统,并讨论它们在可扩展性、易用性和可维护性方面的表现。
The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce ). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.
大规模操作系统的架构通常高度特定于应用程序——没有通用的、一刀切的可扩展架构(俗称神奇扩展酱)。问题可能是读取量、写入量、存储数据量、数据复杂性、响应时间要求、访问模式,或(通常)所有这些问题的混合,加上许多其他问题。
For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.
例如,一个被设计为每秒处理100,000个1kB请求的系统,看起来与一个被设计为每分钟处理3个2GB请求的系统非常不同,即使这两个系统具有相同的数据吞吐量。
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
一种可适用于特定应用的可扩展架构是基于哪些操作是常见的和哪些是罕见的假设——负载参数来构建的。如果这些假设被证明是错误的,那么扩展的工程努力将至少是浪费的,而在最坏的情况下可能是适得其反的。在初创阶段的创业公司或未经验证产品中,能够快速迭代产品特性通常比将来某个假设负载的扩展更加重要。
Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.
尽管可扩展架构是针对特定应用而设计的,但通常是由常见的通用构建块组成的,排列成熟悉的模式。在本书中,我们将讨论这些构建块和模式。
Maintainability
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
众所周知,软件大部分的成本并不在其初期的开发,而是在其持续的维护工作中——修复程序错误,保持系统运行,调查故障,适应新平台,修改新的用例,偿还技术债务以及添加新功能。
Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems—perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.
然而,不幸的是,许多从事软件系统工作的人不喜欢所谓的遗留系统维护——或许这涉及修复他人的错误,或者使用已经过时的平台,或者系统被迫执行其从未打算过的功能。每个遗留系统都有其自己的不愉快之处,因此很难对处理它们给出一般性建议。
However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:
然而,我们可以并且应该设计软件,使其在维护过程中尽可能减少痛苦,从而避免自己创建遗留软件。 为此,我们将特别注意软件系统的三个设计原则:
- Operability
-
Make it easy for operations teams to keep the system running smoothly.
让运维团队更轻松地保持系统运行流畅。
- Simplicity
-
Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
让新工程师更容易理解系统,尽量从系统中消除复杂性。(注意这与用户界面的简单性不同。)
- Evolvability
-
Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility , modifiability , or plasticity .
使工程师在未来能够轻松地对系统进行更改,当需求变化时使系统适应未曾预料的用例。也称为可延展性(extensibility)、可修改性(modifiability)或可塑性(plasticity)。
As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.
与前面讨论的可靠性和可扩展性一样,实现这些目标没有简单的解决方案。不过,我们会在设计系统时时刻把可操作性、简单性和可演化性放在心上。
Operability: Making Life Easy for Operations
It has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” [ 12 ]. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it’s working correctly.
有人指出,“好的运维常常可以绕过糟糕(或不完整)软件的局限,但再好的软件在糟糕的运维下也无法可靠地运行”[12]。虽然运维的某些方面可以而且应该被自动化,但首先搭建这些自动化并确保其正常运行,仍然是人的职责。
Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more [ 29 ]:
运维团队对于保持软件系统平稳运行非常重要。一个优秀的运维团队通常负责以下工作,以及更多其他事项 [29]:
-
Monitoring the health of the system and quickly restoring service if it goes into a bad state
监控系统的健康状况,如果系统状态不佳,迅速恢复服务。
-
Tracking down the cause of problems, such as system failures or degraded performance
追踪问题的原因,例如系统故障或性能降低。
-
Keeping software and platforms up to date, including security patches
保持软件和平台更新,包括安全补丁。
-
Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
追踪不同系统之间如何相互影响,以便在有问题的变更造成损害之前就避免它。
-
Anticipating future problems and solving them before they occur (e.g., capacity planning)
预料未来的问题并在它们发生之前解决它们(例如,容量规划)。
-
Establishing good practices and tools for deployment, configuration management, and more
建立好的实践和工具,用于部署、配置管理等方面。
-
Performing complex maintenance tasks, such as moving an application from one platform to another
执行复杂的维护任务,例如将一个应用程序从一个平台移动到另一个平台。
-
Maintaining the security of the system as configuration changes are made
随着配置变化的发生,维护系统的安全性。
-
Defining processes that make operations predictable and help keep the production environment stable
定义使操作可预测并帮助保持生产环境稳定的流程。
-
Preserving the organization’s knowledge about the system, even as individual people come and go
保留组织对系统的知识,即使个别人员进出。
Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:
良好的可操作性意味着使日常任务变得更加容易,使运营团队可以将精力集中在高价值的活动上。数据系统可以采取各种方式来简化日常任务,其中包括:
-
Providing visibility into the runtime behavior and internals of the system, with good monitoring
提供对系统运行行为和内部情况的可见性,带有良好的监控。
-
Providing good support for automation and integration with standard tools
为自动化提供良好支持并与标准工具进行集成。
-
Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
避免依赖特定的机器(使机器可以在维护期间关闭,而整个系统仍能无间断地运行)。
-
Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
提供良好的文档和易于理解的操作模型(“如果我做 X,就会发生 Y”)。
-
Providing good default behavior, but also giving administrators the freedom to override defaults when needed
提供良好的默认行为,但也给管理员在需要时覆盖默认值的自由。
-
Self-healing where appropriate, but also giving administrators manual control over the system state when needed
在必要时进行自愈,但也在需要时赋予管理员手动控制系统状态的能力。
-
Exhibiting predictable behavior, minimizing surprises
展示可预测的行为,最小化惊喜。
Simplicity: Managing Complexity
Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud [ 30 ].
小型软件项目可以拥有简洁明快、表达力强的代码,但随着项目变大,它们往往变得非常复杂、难以理解。这种复杂性拖慢了所有需要在这个系统上工作的人,进一步增加了维护成本。一个深陷复杂性泥潭的软件项目有时被形容为一个大泥球(big ball of mud)[30]。
There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already [ 31 , 32 , 33 ].
复杂性有各种可能的症状:状态空间的爆炸、模块间的紧密耦合、纠缠不清的依赖关系、不一致的命名和术语、为解决性能问题而采用的取巧手段(hack)、为绕过其他地方的问题而做的特殊处理,等等。关于这个话题已经有很多论述[31,32,33]。
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.
当复杂性使得维护困难时,预算和计划常常会超支。在复杂的软件中,更容易引入错误风险:当开发人员难以理解和推理系统时,隐藏的假设、意外的后果和意外的交互就很容易被忽视。相反,减少复杂性极大地提高了软件的可维护性,因此简单性应成为我们构建系统的关键目标。
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [ 32 ] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
让系统变得更简单并不一定意味着减少功能;它也可以意味着消除偶然复杂性(accidental complexity)。Moseley和Marks[32]把偶然复杂性定义为:并非软件所解决的问题(从用户角度看)本身固有的,而仅仅由实现方式引起的复杂性。
One of the best tools we have for removing accidental complexity is abstraction . A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.
我们用来消除无意的复杂性的最佳工具之一是抽象化。良好的抽象化可以在一个干净、简单易懂的立面后面隐藏许多实现细节。良好的抽象化也可以用于广泛的不同应用。这种重用不仅比多次重新实现相似的东西更有效率,而且还会导致高质量的软件,因为在被抽象化的组件中的质量改进将使所有使用它的应用受益。
For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly , because the programming language abstraction saves us from having to think about it.
例如,高级编程语言是抽象层,它们隐藏了机器码、CPU寄存器和系统调用。SQL是一个抽象层,它隐藏了复杂的磁盘和内存数据结构,其他客户端的并发请求,以及崩溃后的不一致性。当然,当使用高级语言编程时,我们仍然使用机器码;我们只是不直接使用它,因为编程语言的抽象层帮助我们不必考虑它。
However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.
然而,寻找好的抽象概念是非常困难的。在分布式系统领域,虽然有许多好的算法,但我们如何将它们打包成有助于我们将系统的复杂性保持在可控范围内的抽象概念,却不太清楚。
Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.
在本书中,我们将密切关注好的抽象,以便将大系统的部分提取到明确定义的可重用组件中。
Evolvability: Making Change Easy
It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.
你的系统的需求极不可能永远保持不变。它们更有可能处于持续的变化之中:你会了解到新的事实,出现以前未曾预料的用例,业务优先级发生变化,用户要求新功能,新平台取代旧平台,法律或监管要求发生改变,系统的增长迫使架构发生变化,等等。
In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.
在组织流程方面,敏捷工作模式提供了一个适应变化的框架。敏捷社区还开发了技术工具和模式,在频繁变化的环境中开发软件非常有帮助,例如测试驱动开发(TDD)和重构。
Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you “refactor” Twitter’s architecture for assembling home timelines ( “Describing Load” ) from approach 1 to approach 2?
大多数关于这些敏捷技术的讨论都聚焦于相当小的、局部的范围(同一应用程序中的几个源代码文件)。在本书中,我们寻求在更大的数据系统层面上增加敏捷性的方法,这些系统可能包括多个具有不同特点的应用程序或服务。例如,您将如何将 Twitter 的首页时间线组装架构(“描述负载”)从方法1改进到方法2?
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [ 34 ].
修改数据系统并使其适应不断变化需求的容易程度,与系统的简单性和抽象紧密相关:简单、易于理解的系统通常比复杂的系统更容易修改。但由于这是一个如此重要的理念,我们将用另一个词来指代数据系统层面上的敏捷性:可演化性(evolvability)[34]。
Summary
In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.
在这章中,我们探讨了一些关于数据密集型应用的基本思考方式。这些原则将指导我们进入本书的其余部分,深入了解技术细节。
An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.
一个应用程序必须满足各种要求才能发挥作用。这些要求包括功能需求(它应该做什么,例如允许以不同方式存储、检索、搜索和处理数据),以及一些非功能性需求(例如安全性、可靠性、合规性、可扩展性、兼容性和可维护性等一般属性)。在本章中,我们详细讨论了可靠性、可扩展性和可维护性。
Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.
可靠性意味着即使出现故障,系统也能正常工作。故障可能是硬件问题(通常是随机且不相关的),软件问题(错误通常是系统性的,并且很难处理)以及人为问题(人们不可避免地偶尔会犯错误)。容错技术可以隐藏某些类型的故障,使最终用户不受其影响。
Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
可扩展性意味着拥有保持良好性能的策略,即使负载增加。为了讨论可扩展性,我们首先需要用数量化的方式描述负载和性能。我们简要介绍了Twitter的主页时间线作为描述负载的示例,以及响应时间百分位作为衡量性能的方式。在可扩展系统中,您可以添加处理能力以保持高负载下的可靠性。
Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.
可维护性有许多方面,但本质上就是为工程和操作团队提供更好的工作体验。良好的抽象可以帮助降低复杂性,使系统更易于修改和适应新的用例。良好的可操作性意味着对系统的健康状况有良好的可见性,并且有有效的管理方式。
There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.
很遗憾,没有简单的解决方案使应用程序可靠、可扩展或易于维护。 然而,某些模式和技术在不同类型的应用程序中不断出现。在接下来的几章中,我们将看一些数据系统的示例,并分析它们如何朝着这些目标努力。
Later in the book, in Part III , we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1 .
书的后面,在第三部分,我们将研究由几个组件共同工作的系统模式,例如图1-1所示的模式。
Footnotes
i Defined in “Approaches for Coping with Load” .
i 定义见“应对负载的方法”。
ii A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.
"ii"这个术语源自电子工程,用于描述连接到另一个门输出的逻辑门输入数量。输出需要提供足够的电流来驱动所有连接的输入。在交易处理系统中,我们用它来描述为了处理一个传入请求,需要向其他服务发出的请求数量。
iii In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.
iii 在理想情况下,批处理作业的运行时间是数据集大小除以吞吐量。但在实践中,由于数据倾斜(数据没有均匀分布到各个工作进程)以及需要等待最慢的任务完成,运行时间往往更长。
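As a small worked example of footnote iii (all numbers invented): with 10 workers each processing 1 GB/s, a 1,000 GB dataset ideally finishes in 100 seconds, but if the partitioning is skewed, the job only finishes when the most heavily loaded worker does.

dataset_gb = 1000                  # total input size (hypothetical)
workers = 10
worker_throughput_gb_s = 1.0       # each worker processes 1 GB/s (hypothetical)

ideal_runtime_s = dataset_gb / (workers * worker_throughput_gb_s)            # 100.0 s

# Skewed partitioning: data is not spread evenly across the workers.
per_worker_gb = [80, 90, 95, 95, 100, 100, 100, 105, 110, 125]               # sums to 1000
actual_runtime_s = max(gb / worker_throughput_gb_s for gb in per_worker_gb)  # 125.0 s

print(ideal_runtime_s, actual_runtime_s)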
References
[ 1 ] Michael Stonebraker and Uğur Çetintemel: “ ‘One Size Fits All’: An Idea Whose Time Has Come and Gone ,” at 21st International Conference on Data Engineering (ICDE), April 2005.
[1] Michael Stonebraker和Uğur Çetintemel:“‘一刀切’:一个时代已经到来又逝去的想法”,发表于第21届国际数据工程会议(ICDE),2005年4月。
[ 2 ] Walter L. Heimerdinger and Charles B. Weinstock: “ A Conceptual Framework for System Fault Tolerance ,” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
[2] Walter L. Heimerdinger和Charles B. Weinstock:“系统容错性的概念框架”,技术报告CMU/SEI-92-TR-033,软件工程研究所,卡内基梅隆大学,1992年10月。
[ 3 ] Ding Yuan, Yu Luo, Xin Zhuang, et al.: “ Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems ,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
"简单的测试可以避免大部分关键故障:对分布式数据密集型系统生产故障的分析",丁原,骆煜,庄欣等,发表于2014年10月第11届USENIX操作系统设计与实现研讨会(OSDI)。"
[ 4 ] Yury Izrailevsky and Ariel Tseitlin: “ The Netflix Simian Army ,” techblog.netflix.com , July 19, 2011.
[4] Yury Izrailevsky和Ariel Tseitlin:“Netflix的猴子军队(Simian Army)”,techblog.netflix.com,2011年7月19日。
[ 5 ] Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “ Availability in Globally Distributed Storage Systems ,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.
[5] Daniel Ford, François Labelle, Florentina I. Popovici等:“全球分布式存储系统中的可用性”,发表于2010年10月第9届USENIX操作系统设计与实现研讨会(OSDI)。
[ 6 ] Brian Beach: “ Hard Drive Reliability Update – Sep 2014 ,” backblaze.com , September 23, 2014.
[6] Brian Beach:“硬盘可靠性更新 - 2014年9月”,backblaze.com,2014年9月23日。
[ 7 ] Laurie Voss: “ AWS: The Good, the Bad and the Ugly ,” blog.awe.sm , December 18, 2012.
[7] Laurie Voss:“AWS:优点、缺点和丑陋”,blog.awe.sm, 2012年12月18日。
[ 8 ] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “ What Bugs Live in the Cloud? ,” at 5th ACM Symposium on Cloud Computing (SoCC), November 2014. doi:10.1145/2670979.2670986
[8] Haryadi S. Gunawi、Mingzhe Hao、Tanakorn Leesatapornwongsa等:“云中存在哪些Bug?”,第5届ACM云计算研讨会(SoCC),2014年11月。doi:10.1145/2670979.2670986
[ 9 ] Nelson Minar: “ Leap Second Crashes Half the Internet ,” somebits.com , July 3, 2012.
[9] Nelson Minar:“闰秒使半个互联网崩溃”,somebits.com,2012年7月3日。
[ 10 ] Amazon Web Services: “ Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region ,” aws.amazon.com , April 29, 2011.
[10] Amazon Web Services:“美国东部地区Amazon EC2和Amazon RDS服务中断总结”,aws.amazon.com,2011年4月29日。
[ 11 ] Richard I. Cook: “ How Complex Systems Fail ,” Cognitive Technologies Laboratory, April 2000.
[11] Richard I. Cook: “复杂系统的失效”,认知技术实验室,2000年4月。
[ 12 ] Jay Kreps: “ Getting Real About Distributed System Reliability ,” blog.empathybox.com , March 19, 2012.
[12] Jay Kreps:“认真对待分布式系统可靠性”,blog.empathybox.com,2012年3月19日。
[ 13 ] David Oppenheimer, Archana Ganapathi, and David A. Patterson: “ Why Do Internet Services Fail, and What Can Be Done About It? ,” at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
[13] David Oppenheimer、Archana Ganapathi和David A. Patterson:“为什么互联网服务会出现故障,以及如何解决?”,发表于第四届USENIX互联网技术与系统研讨会(USITS),2003年3月。
[ 14 ] Nathan Marz: “ Principles of Software Engineering, Part 1 ,” nathanmarz.com , April 2, 2013.
[14] Nathan Marz:“软件工程原则,第一部分”,nathanmarz.com,2013年4月2日。
[ 15 ] Michael Jurewitz: “ The Human Impact of Bugs ,” jury.me , March 15, 2013.
[15] Michael Jurewitz:“Bug对人的影响”,jury.me,2013年3月15日。
[ 16 ] Raffi Krikorian: “ Timelines at Scale ,” at QCon San Francisco , November 2012.
[16] Raffi Krikorian:“大规模时间线”,发表于QCon San Francisco,2012年11月。
[ 17 ] Martin Fowler: Patterns of Enterprise Application Architecture . Addison Wesley, 2002. ISBN: 978-0-321-12742-6
[17] Martin Fowler:《企业应用架构模式》,Addison Wesley,2002年。ISBN: 978-0-321-12742-6
[ 18 ] Kelly Sommers: “ After all that run around, what caused 500ms disk latency even when we replaced physical server? ” twitter.com , November 13, 2014.
[18] Kelly Sommers:“折腾了这么久,为什么我们换了物理服务器之后磁盘延迟仍然高达500毫秒?”,twitter.com,2014年11月13日。
[ 19 ] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “ Dynamo: Amazon’s Highly Available Key-Value Store ,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
[19] Giuseppe DeCandia、Deniz Hastorun、Madan Jampani等:“Dynamo:亚马逊的高可用键值存储”,发表于第21届ACM操作系统原理研讨会(SOSP),2007年10月。
[ 20 ] Greg Linden: “ Make Data Useful ,” slides from presentation at Stanford University Data Mining class (CS345), December 2006.
[20] Greg Linden:“让数据有用起来”,2006年12月斯坦福大学数据挖掘课程(CS345)的演讲幻灯片。
[ 21 ] Tammy Everts: “ The Real Cost of Slow Time vs Downtime ,” webperformancetoday.com , November 12, 2014.
[21] Tammy Everts: "慢响应时间与停机时间的实际代价," webperformancetoday.com, 2014年11月12日。
[ 22 ] Jake Brutlag: “ Speed Matters for Google Web Search ,” googleresearch.blogspot.co.uk , June 22, 2009.
[22] Jake Brutlag:“速度对Google网页搜索至关重要”,googleresearch.blogspot.co.uk,2009年6月22日。
[ 23 ] Tyler Treat: “ Everything You Know About Latency Is Wrong ,” bravenewgeek.com , December 12, 2015.
[23] Tyler Treat:“你所知道的关于延迟的一切都是错的”,bravenewgeek.com,2015年12月12日。
[ 24 ] Jeffrey Dean and Luiz André Barroso: “ The Tail at Scale ,” Communications of the ACM , volume 56, number 2, pages 74–80, February 2013. doi:10.1145/2408776.2408794
"大规模下的尾部效应" Jeffrey Dean 和 Luiz André Barroso: 通信ACM,第56卷,第2期,第74-80页,2013年2月 DOI: 10.1145/2408776.2408794
[ 25 ] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “ Forward Decay: A Practical Time Decay Model for Streaming Systems ,” at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.
[25] Graham Cormode、Vladislav Shkapenyuk、Divesh Srivastava和Bojian Xu:“前向衰减:一种适用于流式系统的实用时间衰减模型”,发表于第25届IEEE国际数据工程会议(ICDE),2009年3月。
[ 26 ] Ted Dunning and Otmar Ertl: “ Computing Extremely Accurate Quantiles Using t-Digests ,” github.com , March 2014.
[26] Ted Dunning和Otmar Ertl: “使用t-Digest计算极其准确的分位数”,github.com,2014年3月。
[ 27 ] Gil Tene: “ HdrHistogram ,” hdrhistogram.org .
"[27] Gil Tene: “HdrHistogram,” hdrhistogram.org." "[27] Gil Tene: “HdrHistogram”,hdrhistogram.org。"
[ 28 ] Baron Schwartz: “ Why Percentiles Don’t Work the Way You Think ,” vividcortex.com , December 7, 2015.
[28] Baron Schwartz:“为什么百分位数不像你想象的那样工作”,vividcortex.com,2015年12月7日。
[ 29 ] James Hamilton: “ On Designing and Deploying Internet-Scale Services ,” at 21st Large Installation System Administration Conference (LISA), November 2007.
[29] James Hamilton:“关于设计和部署互联网规模服务”,发表于第21届大型安装系统管理会议(LISA),2007年11月。
[ 30 ] Brian Foote and Joseph Yoder: “ Big Ball of Mud ,” at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.
"[30] Brian Foote和Joseph Yoder: “Big Ball of Mud”,于 1997年9月的第四届程序模式语言会议(PLoP)上。"
[ 31 ] Frederick P Brooks: “No Silver Bullet – Essence and Accident in Software Engineering,” in The Mythical Man-Month , Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
[31] Frederick P. Brooks:“没有银弹:软件工程中的本质与偶然”,收录于《人月神话》周年纪念版,Addison-Wesley,1995年。ISBN: 978-0-201-83595-3
[ 32 ] Ben Moseley and Peter Marks: “ Out of the Tar Pit ,” at BCS Software Practice Advancement (SPA), 2006.
[32] Ben Moseley和Peter Marks:“走出焦油坑”,BCS软件实践促进会议(SPA),2006年。
[ 33 ] Rich Hickey: “ Simple Made Easy ,” at Strange Loop , September 2011.
[33] Rich Hickey:“让简单变得容易(Simple Made Easy)”,Strange Loop,2011年9月。
[ 34 ] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “ Analyzing Software Evolvability ,” at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50
[34] Hongyu Pei Breivold、Ivica Crnkovic和Peter J. Eriksson:“分析软件的可演化性”,发表于第32届IEEE国际计算机软件与应用年会(COMPSAC),2008年7月。doi:10.1109/COMPSAC.2008.50