Mar 12, 2013

Designing Data-Intensive Applications

Part III. Derived Data

In Parts I and II of this book, we assembled from the ground up all the major considerations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application.

In reality, data systems are often more complex. In a large application you often need to be able to access and process data in many different ways, and there is no one database that can satisfy all those different needs simultaneously. Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another.

In this final part of the book, we will examine the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture. This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.

Systems of Record and Derived Data

At a high level, systems that store and process data can be grouped into two broad categories:

Systems of record

A system of record, also known as source of truth, holds the authoritative version of your data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically normalized). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.

Derived data systems

Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs.
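
The cache example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `SystemOfRecord` and `ReadThroughCache` are hypothetical, both stores are plain in-memory dictionaries, and cache invalidation after later writes is deliberately left out.

```python
from typing import Dict, Optional


class SystemOfRecord:
    """Hypothetical authoritative store: every write lands here first."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        self.reads = 0  # counts how often the underlying store is consulted

    def write(self, key: str, value: str) -> None:
        self._rows[key] = value

    def read(self, key: str) -> Optional[str]:
        self.reads += 1
        return self._rows.get(key)


class ReadThroughCache:
    """Hypothetical derived store: holds copies of values from the source."""

    def __init__(self, source: SystemOfRecord) -> None:
        self._source = source
        self._entries: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Serve from the cache if the value is present.
        if key in self._entries:
            return self._entries[key]
        # Otherwise fall back to the system of record and remember the result.
        value = self._source.read(key)
        if value is not None:
            self._entries[key] = value
        return value

    def clear(self) -> None:
        # Losing derived data is safe: it can be recreated from the source.
        self._entries.clear()


database = SystemOfRecord()
database.write("user:1", "Ada")
cache = ReadThroughCache(database)

first = cache.get("user:1")   # cache miss, falls back to the database
second = cache.get("user:1")  # cache hit, the database is not consulted
cache.clear()                 # simulate losing the derived data
third = cache.get("user:1")   # recreated from the system of record
```

Clearing the cache loses nothing, because every entry is a copy of a value that still lives in the system of record.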

Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
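
As a rough sketch of deriving several datasets from one source, the Python below treats a small usage log as the system of record and builds two views from it. The `ViewEvent` type, the function names, and the sample events are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, NamedTuple


class ViewEvent(NamedTuple):
    """Hypothetical usage-log entry: one user viewed one item."""
    user: str
    item: str


def build_items_by_user(events: List[ViewEvent]) -> Dict[str, List[str]]:
    """Derived view 1: an index from each user to the items they viewed."""
    index: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        index[event.user].append(event.item)
    return dict(index)


def build_view_counts(events: List[ViewEvent]) -> Counter:
    """Derived view 2: how many times each item was viewed."""
    return Counter(event.item for event in events)


# The usage log plays the role of the system of record in this sketch.
usage_log = [
    ViewEvent("alice", "book"),
    ViewEvent("bob", "book"),
    ViewEvent("alice", "lamp"),
]

# Two different "points of view" on the same underlying facts.
items_by_user = build_items_by_user(usage_log)
view_counts = build_view_counts(usage_log)

# If a derived view is lost, rerunning the same function recreates it.
rebuilt_counts = build_view_counts(usage_log)
```

Because each view is a pure function of the log, it is redundant in exactly the sense described above: it adds no new facts, only a layout that makes a particular read query cheap.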

Not all systems make a clear distinction between systems of record and derived data in their architecture, but it’s a very helpful distinction to make, because it clarifies the dataflow through your system: it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other.

Most databases, storage engines, and query languages are not inherently either a system of record or a derived system. A database is just a tool: how you use it is up to you. The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application.

By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture. This point will be a running theme throughout this part of the book.

Overview of Chapters

We will start in Chapter 10 by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems. In Chapter 11 we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays. Chapter 12 concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
