第三章。存储和检索
Chapter 3. Storage and Retrieval
Wer Ordnung hält, ist nur zu faul zum Suchen.
谁有秩序,只是懒得寻找。
(If you keep things tidily ordered, you’re just too lazy to go searching.)
如果你把东西整理得井井有条,那就是因为懒得到处找。
German proverb
德国谚语。
On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.
在最基本的层级上,一个数据库需要做两件事情:当你提供一些数据时,它应该存储数据,当你以后再次询问时,它应该将数据返回给你。
In Chapter 2 we discussed data models and query languages—i.e., the format in which you (the application developer) give the database your data, and the mechanism by which you can ask for it again later. In this chapter we discuss the same from the database’s point of view: how we can store the data that we’re given, and how we can find it again when we’re asked for it.
在第2章中,我们讨论了数据模型和查询语言——也就是你(应用程序开发者)向数据库提供数据的格式以及以后可以再次请求它的机制。在本章中,我们从数据库的角度讨论相同的问题:如何存储我们收到的数据以及在被要求时如何找到它。
Why should you, as an application developer, care how the database handles storage and retrieval internally? You’re probably not going to implement your own storage engine from scratch, but you do need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.
作为应用程序开发人员,为什么要关心数据库内部如何处理存储和检索?您可能不会从头开始实现自己的存储引擎,但是您确实需要从众多可用的存储引擎中,选择一个适合您应用程序的。为了调整存储引擎,使其在您的工作负载类型上表现良好,您需要对存储引擎在幕后做的事情有一个大致的了解。
In particular, there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics. We will explore that distinction later in “Transaction Processing or Analytics?” , and in “Column-Oriented Storage” we’ll discuss a family of storage engines that is optimized for analytics.
特别是,那些优化事务工作负载的存储引擎和那些优化分析的存储引擎之间有很大的差异。我们将在“事务处理还是分析?”中探讨这一区别,并在“基于列的存储”中讨论一系列优化分析的存储引擎。
However, first we’ll start this chapter by talking about storage engines that are used in the kinds of databases that you’re probably familiar with: traditional relational databases, and also most so-called NoSQL databases. We will examine two families of storage engines: log-structured storage engines, and page-oriented storage engines such as B-trees.
然而,本章将首先讨论您可能熟悉的那几类数据库(传统关系型数据库,以及大多数所谓的NoSQL数据库)中所使用的存储引擎。我们将考察两大类存储引擎:日志结构(log-structured)存储引擎,以及面向页面(page-oriented)的存储引擎,例如B树。
Data Structures That Power Your Database
Consider the world’s simplest database, implemented as two Bash functions:
考虑作为两个Bash函数实现的世界上最简单的数据库:
#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
These two functions implement a key-value store. You can call db_set key value, which will store key and value in the database. The key and value can be (almost) anything you like—for example, the value could be a JSON document. You can then call db_get key, which looks up the most recent value associated with that particular key and returns it.
这两个函数实现了一个键值存储。您可以调用 db_set key value,它会将 key 和 value 存储在数据库中。键和值可以是(几乎)任何您喜欢的东西——例如,值可以是一个JSON文档。然后,您可以调用 db_get key,它会查找与该特定键关联的最新值并将其返回。
And it works:
它起作用了:
$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
The underlying storage format is very simple: a text file where each line contains a key-value pair, separated by a comma (roughly like a CSV file, ignoring escaping issues). Every call to db_set appends to the end of the file, so if you update a key several times, the old versions of the value are not overwritten—you need to look at the last occurrence of a key in a file to find the latest value (hence the tail -n 1 in db_get):
底层存储格式非常简单:文本文件中每一行都包含一个键值对,通过逗号分隔(类似 CSV 文件,忽略转义问题)。每次调用 db_set 都会将数据追加到文件末尾,所以如果多次更新同一个键,旧版本的值是不会被覆盖的 —— 你需要在文件中查找键的最后一次出现来获取最新的值(因此在 db_get 中使用了 tail -n 1 命令)。
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}
$ cat database
123456,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}
Our db_set function actually has pretty good performance for something that is so simple, because appending to a file is generally very efficient. Similarly to what db_set does, many databases internally use a log, which is an append-only data file. Real databases have more issues to deal with (such as concurrency control, reclaiming disk space so that the log doesn’t grow forever, and handling errors and partially written records), but the basic principle is the same. Logs are incredibly useful, and we will encounter them several times in the rest of this book.
我们的db_set函数在这么简单的情况下实际上有相当好的性能,因为向文件添加数据通常非常高效。与db_set类似的是,许多数据库内部使用日志,即追加模式的数据文件。真正的数据库会遇到更多问题(如并发控制,回收磁盘空间,以防日志无限增长,处理错误和部分写入记录等),但基本原理是相同的。日志非常有用,在本书的其余部分中我们将遇到它们多次。
Note
The word log is often used to refer to application logs, where an application outputs text that describes what’s happening. In this book, log is used in the more general sense: an append-only sequence of records. It doesn’t have to be human-readable; it might be binary and intended only for other programs to read.
日志一词通常用于指应用程序日志,应用程序输出描述正在发生的文本。在本书中,日志的更一般含义是:仅追加记录的序列。它不必人类可读;它可能是二进制的,只供其他程序阅读。
On the other hand, our db_get function has terrible performance if you have a large number of records in your database. Every time you want to look up a key, db_get has to scan the entire database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the cost of a lookup is O(n): if you double the number of records n in your database, a lookup takes twice as long. That’s not good.
另一方面,如果你的数据库中有大量记录,我们的db_get函数性能非常差。每当你想查找一个键时,db_get必须从头到尾扫描整个数据库文件,寻找键的出现。在算法术语中,查找的成本是O(n):如果你将数据库中的记录数量n翻倍,查找需要的时间就会增加一倍。这不好。
In order to efficiently find the value for a particular key in the database, we need a different data structure: an index . In this chapter we will look at a range of indexing structures and see how they compare; the general idea behind them is to keep some additional metadata on the side, which acts as a signpost and helps you to locate the data you want. If you want to search the same data in several different ways, you may need several different indexes on different parts of the data.
为了有效地在数据库中找到特定键的值,我们需要一个不同的数据结构:索引。在本章中,我们将研究一系列索引结构并看看它们的比较;它们背后的一般思想是在一侧保存一些额外的元数据,它作为标志牌,帮助您定位所需的数据。如果您想以几种不同的方式搜索相同的数据,则可能需要在数据的不同部分上具有几个不同的索引。
An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.
索引是从主要数据派生出来的额外结构。许多数据库允许您添加和删除索引,这不会影响数据库的内容,只会影响查询的性能。维护额外的结构会产生额外的负担,尤其是对写入操作。对于写入操作,很难击败简单地追加到文件的性能,因为那是最简单的写入操作。任何类型的索引通常会减慢写入速度,因为每次写入数据时也需要更新索引。
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. For this reason, databases don’t usually index everything by default, but require you—the application developer or database administrator—to choose indexes manually, using your knowledge of the application’s typical query patterns. You can then choose the indexes that give your application the greatest benefit, without introducing more overhead than necessary.
在存储系统中,这是一个重要的权衡:精心选择索引可以加速读取查询,但是每个索引都会减缓写入速度。因此,数据库通常不会默认索引所有内容,而需要你——应用程序开发人员或数据库管理员——根据应用程序的典型查询模式手动选择索引。你可以选择对你的应用程序带来最大收益的索引,而不会引入更多不必要的开销。
Hash Indexes
Let’s start with indexes for key-value data. This is not the only kind of data you can index, but it’s very common, and it’s a useful building block for more complex indexes.
让我们从键值数据的索引开始。这不是你可以索引的唯一类型,但是它非常常见,是更复杂索引的有用基石。
Key-value stores are quite similar to the dictionary type that you can find in most programming languages, and which is usually implemented as a hash map (hash table). Hash maps are described in many algorithms textbooks [ 1 , 2 ], so we won’t go into detail of how they work here. Since we already have hash maps for our in-memory data structures, why not use them to index our data on disk?
键值存储与大多数编程语言中的字典类型非常类似,通常实现为哈希表。哈希映射可以在许多算法教材中找到描述。既然我们已经在内存数据结构中使用了哈希映射,为什么不使用它们来索引我们的磁盘数据呢?
Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found, as illustrated in Figure 3-1 . Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
假设我们的数据存储仅仅是在一个文件中添加内容,和前面的例子一样。那么最简单的索引策略是:在内存中保持一个哈希映射,其中每个键都被映射到数据文件中的一个字节偏移量,也就是值所在的位置,如图3-1所示。每当你把一个新的键值对添加到文件中,你也要更新哈希映射,以反映出你刚刚写入的数据的偏移量(这对于插入新键和更新现有键都适用)。当你想要查找一个值时,使用哈希映射来找到数据文件中的偏移量,把指针定位到该位置,然后读取值。
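The following is a minimal sketch of this idea in Python, mirroring the bash example above; the file name database and the dictionary index are just assumptions for illustration, not any particular engine's API:
下面是这一思路的一个极简Python示意,与上文的bash示例相对应;文件名 database 和字典 index 只是为说明而做的假设,并非任何具体引擎的API:

DB_FILE = "database"   # append-only data file, as in the bash example
index = {}             # in-memory hash map: key -> byte offset of the latest record

def db_set(key, value):
    with open(DB_FILE, "ab") as f:
        offset = f.tell()                         # end of file = start of this record
        f.write(f"{key},{value}\n".encode("utf-8"))
    index[key] = offset                           # point the key at the newest record

def db_get(key):
    offset = index.get(key)
    if offset is None:
        return None
    with open(DB_FILE, "rb") as f:
        f.seek(offset)                            # one seek, then read one line
        line = f.readline().decode("utf-8").rstrip("\n")
    return line.split(",", 1)[1]                  # drop the "key," prefix

db_set("42", '{"name":"San Francisco"}')
print(db_get("42"))                               # {"name":"San Francisco"}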
This may sound simplistic, but it is a viable approach. In fact, this is essentially what Bitcask (the default storage engine in Riak) does [ 3 ]. Bitcask offers high-performance reads and writes, subject to the requirement that all the keys fit in the available RAM, since the hash map is kept completely in memory. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek. If that part of the data file is already in the filesystem cache, a read doesn’t require any disk I/O at all.
这听起来很简单,但这是一种可行的方法。实际上,这就是Bitcask(Riak的默认存储引擎)的基本操作[3]。 Bitcask提供高性能的读写功能,要求所有键都适合可用的RAM,因为哈希映射被完全保存在内存中。值可以使用比可用内存更多的空间,因为它们可以通过只进行一次磁盘搜索,从磁盘加载。如果该数据文件的部分已经在文件系统高速缓存中,那么读取就不需要任何磁盘I/O。
A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory.
像Bitcask这样的存储引擎非常适用于每个键的值经常更新的情况。例如,键可以是猫视频的URL,而值可以是播放次数(每次有人点击播放按钮就会增加)。在这种负载下,有很多写入操作,但是不会有太多不同的键 - 每个键有大量的写入操作,但是在内存中保留所有键是可行的。
As described so far, we only ever append to a file—so how do we avoid eventually running out of disk space? A good solution is to break the log into segments of a certain size by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file. We can then perform compaction on these segments, as illustrated in Figure 3-2 . Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.
到目前为止,我们只添加到文件,那么我们如何避免最终用尽磁盘空间?一个好的解决方案是将日志分成一定大小的段,当一个段文件达到一定大小时关闭它,并在新的段文件中进行后续写操作。然后我们可以对这些段执行压缩,如图3-2所示。压缩是指在日志中丢弃重复键,并仅保留每个键的最新更新。
Moreover, since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction, as shown in Figure 3-3 . Segments are never modified after they have been written, so the merged segment is written to a new file. The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted.
此外,由于紧缩通常会使段变得更小(假设一个键在一个段内被平均覆盖多次),所以我们也可以在紧缩的同时合并多个段,如图3-3所示。段一旦被写入后就不会被修改,因此合并的段被写入一个新文件中。冻结段的合并和紧缩可以在后台线程中完成,而在此期间,我们仍然可以像往常一样使用旧段文件来继续提供读写请求。在合并过程完成后,我们将读取请求切换到使用新合并的段而不是旧的段,然后旧的段文件可以被简单删除。
Each segment now has its own in-memory hash table, mapping keys to file offsets. In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on. The merging process keeps the number of segments small, so lookups don’t need to check many hash maps.
现在每个分段都有自己的内存哈希表,将键映射到文件偏移量。为了查找键的值,我们首先检查最近的分段哈希映射;如果键不存在,则检查第二个最近的分段,依此类推。合并过程使分段数保持较小,因此查找不需要检查许多哈希映射。
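As a rough sketch (not Bitcask's actual code), compaction and merging can be expressed as follows, with each segment modeled as a list of (key, value) records in write order:
作为一个粗略的示意(并非Bitcask的实际代码),压缩与合并可以如下表达,其中每个段被建模为按写入顺序排列的(键, 值)记录列表:

def compact(segment):
    """Keep only the most recent value for each key within one segment."""
    latest = {}
    for key, value in segment:          # later records overwrite earlier ones
        latest[key] = value
    return list(latest.items())

def merge(segments):
    """Merge several compacted segments, ordered oldest to newest; newer values win."""
    merged = {}
    for segment in segments:
        for key, value in segment:
            merged[key] = value
    return list(merged.items())

old = [("mew", "1078"), ("purr", "2103"), ("mew", "1079")]
new = [("purr", "2104"), ("yawn", "511")]
print(merge([compact(old), compact(new)]))
# [('mew', '1079'), ('purr', '2104'), ('yawn', '511')]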
Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that are important in a real implementation are:
要让这个简单的想法在实践中行之有效,需要考虑许多细节。简要来说,真实实现中需要处理的一些重要问题包括:
- File format
CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).
CSV不是日志的最佳格式。使用一种二进制格式更快,更简单,首先会对字符串的长度进行编码(以字节为单位),然后是原始字符串(无需转义)。
- Deleting records
If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone ). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
如果您想删除一个键和它相关联的值,必须向数据文件附加一个特殊的删除记录(有时称为墓碑)。当日志段合并时,墓碑会告诉合并过程丢弃已删除键的任何先前值。
- Crash recovery
If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
如果数据库重新启动,内存中的哈希映射会丢失。原则上,可以通过从头到尾读取每个段文件并记录每个键的最新值的偏移量来恢复每个段的哈希映射。但是,如果段文件很大,这样做可能需要很长时间,从而使服务器重新启动非常痛苦。Bitcask通过在磁盘上存储每个段的哈希映射快照来加速恢复,可以更快地将其加载到内存中。
- Partially written records
The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
数据库可能会在任何时候崩溃,包括在将记录附加到日志的一半之处。 Bitcask 文件包括校验和,使可以检测和忽略日志中的这些损坏的部分。
- Concurrency control
As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
由于写入操作必须严格按顺序附加到日志中,因此常见的实现选择是仅使用一个写入线程。数据文件段是只追加且不可变的,因此它们可以被多个线程同时读取。
An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value? But an append-only design turns out to be good for several reasons:
乍一看,仅追加日志似乎很浪费:为什么不在原地更新文件,用新值覆盖旧值?但事实证明,仅追加设计在多个方面都非常优秀:
-
Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives. To some extent sequential writes are also preferable on flash-based solid state drives (SSDs) [ 4 ]. We will discuss this issue further in “Comparing B-Trees and LSM-Trees” .
追加和段合并都是顺序写入操作,通常比随机写入快得多,尤其是在磁性旋转硬盘上。在基于闪存的固态硬盘(SSD)上,顺序写入在一定程度上也是更可取的[4]。我们将在“比较B-树和LSM-树”中进一步讨论这个问题。
-
Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
如果分段文件是只追加或不可变的,那么并发和崩溃恢复就要简单得多。例如,你不必担心在值被覆盖时发生崩溃的情况,使得你只得到一个包含新值和旧值的部分拼接在一起的文件。
-
Merging old segments avoids the problem of data files getting fragmented over time.
合并旧部分可以避免数据文件随着时间的推移而变得分散。
However, the hash table index also has limitations:
然而,哈希表索引也有一些限制:
-
The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [ 5 ].
哈希表必须适合内存,所以如果您有大量的键,您就没那么幸运了。原则上,您可以在磁盘上维护哈希映射,但不幸的是很难使磁盘上的哈希映射表现良好。它需要大量的随机访问I/O,当它变满时扩展它是昂贵的,并且哈希冲突需要费力的逻辑 [5]。
-
Range queries are not efficient. For example, you cannot easily scan over all keys between kitty00000 and kitty99999—you’d have to look up each key individually in the hash maps.
范围查询不是高效的。例如,您无法轻松地扫描kitty00000和kitty99999之间的所有键——您必须在哈希映射中逐个查找每个键。
In the next section we will look at an indexing structure that doesn’t have those limitations.
在下一部分中,我们将看到一种没有这些限制的索引结构。
SSTables and LSM-Trees
In Figure 3-3 , each log-structured storage segment is a sequence of key-value pairs. These pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file does not matter.
在图3-3中,每个日志结构化存储段都是键值对的序列。这些对按照它们被写入的顺序出现,日志中后面的值优先于日志中先前相同键的值。除此之外,文件中键值对的顺序并不重要。
Now we can make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key . At first glance, that requirement seems to break our ability to use sequential writes, but we’ll get to that in a moment.
现在我们可以对我们的段文件格式进行简单更改:要求键值对的顺序按照键来排序。乍一看,这个要求似乎可能会破坏我们使用顺序写的能力,但是我们马上会解决这个问题。
We call this format Sorted String Table , or SSTable for short. We also require that each key only appears once within each merged segment file (the compaction process already ensures that). SSTables have several big advantages over log segments with hash indexes:
我们称之为排序字符串表格式,简称SSTable。我们还要求每个键在合并的段文件中只出现一次(合并过程已经确保了这一点)。SSTable比具有哈希索引的日志段有几个重要优点:
-
Merging segments is simple and efficient, even if the files are bigger than the available memory. The approach is like the one used in the mergesort algorithm and is illustrated in Figure 3-4 : you start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. This produces a new merged segment file, also sorted by key.
合并段落是简单高效的,即使文件大小超过了可用内存。这种方法类似于归并排序算法中使用的方法,在图3-4中有图解:你可以一边读取输入文件,一边对比每个文件的第一个键值,复制最小的键值(根据排序顺序)到输出文件中,然后重复此操作。这将产生一个新的合并段落文件,也是按键值排序的。
What if the same key appears in several input segments? Remember that each segment contains all the values written to the database during some period of time. This means that all the values in one input segment must be more recent than all the values in the other segment (assuming that we always merge adjacent segments). When multiple segments contain the same key, we can keep the value from the most recent segment and discard the values in older segments. (A short sketch of this merging logic appears after this list.)
如果同一个键出现在多个输入段中怎么办?请记住,每个段都包含某一段时间内写入数据库的所有值。这意味着一个输入段中的所有值一定比另一个段中的所有值更新(假设我们总是合并相邻的段)。当多个段包含相同的键时,我们可以保留来自最新段的值,并丢弃旧段中的值。(这个列表之后有一个关于这种合并逻辑的简短示意。)
-
In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. See Figure 3-5 for an example: say you’re looking for the key handiwork, but you don’t know the exact offset of that key in the segment file. However, you do know the offsets for the keys handbag and handsome, and because of the sorting you know that handiwork must appear between those two. This means you can jump to the offset for handbag and scan from there until you find handiwork (or not, if the key is not present in the file).
为了在文件中查找特定的键,您不再需要在内存中保留所有键的索引。以图3-5为例:假设您正在查找键 handiwork,但您不知道该键在段文件中的确切偏移量。然而,您知道 handbag 和 handsome 这两个键的偏移量,并且由于排序,您知道 handiwork 必定出现在这两者之间。这意味着您可以跳转到 handbag 的偏移量,并从那里开始扫描,直到找到 handiwork(如果该键不存在于文件中,则找不到)。
You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly.
你仍需一个内存索引来告诉你一些键的偏移量,但它可以是稀疏的:每几千字节的段文件只需一个键就足够了,因为几千字节可以很快地扫描。
-
Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk (indicated by the shaded area in Figure 3-5 ). Each entry of the sparse in-memory index then points at the start of a compressed block. Besides saving disk space, compression also reduces the I/O bandwidth use.
由于读请求需要在请求范围内扫描多个键值对,因此可以将这些记录分组并在写入磁盘之前对其进行压缩(如图3-5中的阴影区域所示)。 稀疏的内存索引的每个条目都指向一个压缩块的开头。 除了节省磁盘空间外,压缩还可以减少I / O带宽使用。
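A minimal sketch of the mergesort-style segment merge described in the first point above (assuming segments are passed newest first; names are illustrative):
下面是上述第一点所描述的归并排序式段合并的一个极简示意(假设传入的段按从新到旧排列;名称仅为示意):

import heapq

def merge_segments(segments):
    """segments: sorted lists of (key, value), ordered newest first.
    Yields one (key, value) per key, taking the value from the newest segment."""
    # Tag each record with its segment's index (0 = newest) so that, for equal
    # keys, the record from the newest segment sorts first.
    tagged = (
        ((key, seg_idx, value) for key, value in segment)
        for seg_idx, segment in enumerate(segments)
    )
    last_key = object()                       # sentinel: nothing emitted yet
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:                   # first (newest) occurrence of a key wins
            yield key, value
            last_key = key

older = [("handbag", "8786"), ("handful", "40308"), ("handoff", "26603")]
newer = [("handcuffs", "2729"), ("handful", "42307")]
print(list(merge_segments([newer, older])))
# [('handbag', '8786'), ('handcuffs', '2729'), ('handful', '42307'), ('handoff', '26603')]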
Constructing and maintaining SSTables
Fine so far—but how do you get your data to be sorted by key in the first place? Our incoming writes can occur in any order.
到目前为止很好,但是您如何才能使您的数据首先按键排序?我们的传入写入可以以任何顺序发生。
Maintaining a sorted structure on disk is possible (see “B-Trees” ), but maintaining it in memory is much easier. There are plenty of well-known tree data structures that you can use, such as red-black trees or AVL trees [ 2 ]. With these data structures, you can insert keys in any order and read them back in sorted order.
在磁盘中维护排序结构是可能的(参见“B-树”),但在内存中维护它要容易得多。有很多众所周知的树形数据结构可以使用,例如红黑树或AVL树[2]。使用这些数据结构,您可以以任何顺序插入键,并以排序顺序读取它们。
We can now make our storage engine work as follows:
我们现在可以让我们的存储引擎按照以下方式工作:
-
When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable .
当有写入到来时,将其添加到内存中的平衡树数据结构中(例如红黑树)。这个内存中的树有时被称为内存表(memtable)。
-
When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
当内存表(memtable)大于某个阈值(通常为几兆字节)时,将其作为SSTable文件写出到磁盘。这可以高效地完成,因为树已经维护了按键排序的键值对。新的SSTable文件成为数据库的最新段。在SSTable被写出到磁盘的同时,写入可以继续进行到一个新的内存表实例上。
-
In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
为了响应读取请求,应首先在内存表中查找键,然后在最新的磁盘段、接下来较旧的段等逐步查找。
-
From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
时不时地在后台运行合并和压缩过程,以合并段文件并丢弃已被覆盖或删除的值。
This scheme works very well. It only suffers from one problem: if the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. In order to avoid that problem, we can keep a separate log on disk to which every write is immediately appended, just like in the previous section. That log is not in sorted order, but that doesn’t matter, because its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
这个方案运作得很好,唯一的问题是如果数据库崩溃,最近的写入(在内存表中但尚未写入磁盘)会丢失。为了避免这个问题,我们可以在磁盘上保留一个单独的日志,每次写入都立即附加到其中,就像在前一节中一样。该日志没有排序,但这并不重要,因为它的唯一目的是在崩溃后恢复内存表。每次内存表写入SSTable时,相应的日志可以被丢弃。
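Putting these pieces together, here is a deliberately simplified sketch of the scheme described in the list above; the class name, file names, and threshold are invented for illustration, and real engines such as LevelDB are far more involved:
把这些部分放在一起,下面是对上面列表所描述方案的一个刻意简化的示意;类名、文件名和阈值都是为说明而虚构的,LevelDB等真实引擎要复杂得多:

class TinyLSM:
    """Toy sketch: sorted memtable flushed to sorted segment files, plus a WAL."""
    def __init__(self, threshold=4):
        self.memtable = {}                        # in-memory key-value pairs; sorted on flush
        self.segments = []                        # flushed segments, oldest first
        self.threshold = threshold
        self.wal = open("wal.log", "a", encoding="utf-8")

    def set(self, key, value):
        self.wal.write(f"{key},{value}\n")        # append to the WAL for crash recovery
        self.wal.flush()
        self.memtable[key] = value                # update the memtable
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted "SSTable" segment, then start a fresh memtable.
        segment = dict(sorted(self.memtable.items()))
        with open(f"segment-{len(self.segments)}.sst", "w", encoding="utf-8") as f:
            for key, value in segment.items():
                f.write(f"{key},{value}\n")
        self.segments.append(segment)
        self.memtable = {}
        self.wal.truncate(0)                      # the flushed memtable's WAL can be discarded

    def get(self, key):
        if key in self.memtable:                  # check the memtable first...
            return self.memtable[key]
        for segment in reversed(self.segments):   # ...then segments, newest first
            if key in segment:
                return segment[key]
        return None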
Making an LSM-tree out of SSTables
The algorithm described here is essentially what is used in LevelDB [ 6 ] and RocksDB [ 7 ], key-value storage engine libraries that are designed to be embedded into other applications. Among other things, LevelDB can be used in Riak as an alternative to Bitcask. Similar storage engines are used in Cassandra and HBase [ 8 ], both of which were inspired by Google’s Bigtable paper [ 9 ] (which introduced the terms SSTable and memtable ).
这里描述的算法基本上是LevelDB和RocksDB所使用的,它们是键-值存储引擎库,旨在嵌入到其他应用程序中。除此之外,LevelDB还可作为Bitcask的替代方案在Riak中使用。类似的存储引擎在Cassandra和HBase中也被使用,两者都受Google的Bigtable论文启发(该论文介绍了SSTable和memtable这些术语)。
Originally this indexing structure was described by Patrick O’Neil et al. under the name Log-Structured Merge-Tree (or LSM-Tree) [ 10 ], building on earlier work on log-structured filesystems [ 11 ]. Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.
这种索引结构最初由Patrick O’Neil等人以日志结构合并树(Log-Structured Merge-Tree,即LSM-Tree)[10]之名描述,它建立在更早的日志结构文件系统[11]的工作之上。基于这种合并和压缩已排序文件原理的存储引擎,通常被称为LSM存储引擎。
Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary [ 12 , 13 ]. A full-text index is much more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents (web pages, product descriptions, etc.) that mention the word. This is implemented with a key-value structure where the key is a word (a term ) and the value is the list of IDs of all the documents that contain the word (the postings list ). In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed [ 14 ].
Lucene是一个索引引擎,用于Elasticsearch和Solr的全文搜索,它使用类似的方法存储其术语字典。全文索引比键值索引复杂得多,但是基于类似的想法:给定搜索查询中的一个单词,查找提到该单词的所有文档(网页,产品描述等)。这是通过一个键值结构来实现的,其中键是一个单词(术语),值是包含该单词的所有文档的ID列表(帖子列表)。在Lucene中,从术语到帖子列表的映射保留在类似SSTable的排序文件中,根据需要后台合并。
Performance optimizations
As always, a lot of detail goes into making a storage engine perform well in practice. For example, the LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist. In order to optimize this kind of access, storage engines often use additional Bloom filters [ 15 ]. (A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.)
一如既往,要让存储引擎在实践中表现良好,需要考虑许多细节。例如,当查找数据库中不存在的键时,LSM树算法可能会很慢:您必须先检查内存表,然后一路检查各个段直到最老的段(每个段都可能需要从磁盘读取),才能确定该键不存在。为了优化这种访问,存储引擎通常使用额外的布隆过滤器(Bloom filter)[15]。(布隆过滤器是一种内存效率很高的数据结构,用于近似表示集合的内容。它可以告诉您某个键是否不出现在数据库中,从而为不存在的键省去许多不必要的磁盘读取。)
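As an illustration of the idea only (not how any particular engine implements it), a Bloom filter can be sketched like this:
仅作为这一思路的示意(并非任何具体引擎的实现方式),布隆过滤器可以这样勾画:

import hashlib

class BloomFilter:
    """Toy Bloom filter: may answer "possibly present", but never misses a key that was added."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("handbag")
print(bf.might_contain("handbag"))    # True
print(bf.might_contain("handiwork"))  # almost certainly False, so skip the disk lookup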
There are also different strategies to determine the order and timing of how SSTables are compacted and merged. The most common options are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction (hence the name of LevelDB), HBase uses size-tiered, and Cassandra supports both [ 16 ]. In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space.
还有一些不同的策略用来决定SSTables被压缩和合并的顺序与时机。最常见的选择是大小分层(size-tiered)压缩和分层(leveled)压缩。LevelDB和RocksDB使用分层压缩(LevelDB因此得名),HBase使用大小分层压缩,Cassandra两者都支持[16]。在大小分层压缩中,较新、较小的SSTables会陆续被合并到较旧、较大的SSTables中。在分层压缩中,键范围被拆分成较小的SSTables,较旧的数据被移动到单独的“层级(level)”中,这使得压缩可以更加增量地进行,并使用更少的磁盘空间。
Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective. Even when the dataset is much bigger than the available memory it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential the LSM-tree can support remarkably high write throughput.
即使有许多微妙之处,LSM树的基本思想-保持背景中合并的SSTables级联-是简单而有效的。即使数据集比可用内存大得多,它仍然能够良好地工作。由于数据以排序顺序存储,因此可以高效地执行范围查询(扫描所有最小值以上且最大值以下的所有键),而且由于磁盘写入是顺序的,因此LSM树可以支持极高的写入吞吐量。
B-Trees
The log-structured indexes we have discussed so far are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure is quite different: the B-tree .
到目前为止,我们讨论的日志结构化索引已经得到了认可,但它们并不是最常见的索引类型。最广泛使用的索引结构是完全不同的:B树。
Introduced in 1970 [ 17 ] and called “ubiquitous” less than 10 years later [ 18 ], B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.
自1970年被引入[17]并在不到10年的时间内被称为“无处不在”, B-树经过了时间的考验。 它们仍然是几乎所有关系型数据库中的标准索引实现,许多非关系型数据库也使用它们。
Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. But that’s where the similarity ends: B-trees have a very different design philosophy.
就像SSTables一样,B树将键值对按键排序,从而支持高效的键值查找和范围查询。但相似之处仅此而已:B树有着非常不同的设计哲学。
The log-structured indexes we saw earlier break the database down into variable-size segments , typically several megabytes or more in size, and always write a segment sequentially. By contrast, B-trees break the database down into fixed-size blocks or pages , traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
早些时候我们看到的基于日志的索引将数据库拆分为可变大小的段,通常是几兆字节甚至更大,并且总是按顺序写入段。相比之下,B-树将数据库拆分为固定大小的块或页面,传统上为4 KB(有时更大),并一次读取或写入一个页面。这种设计更贴近底层硬件,因为磁盘也是按固定大小的块排列。
Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory. We can use these page references to construct a tree of pages, as illustrated in Figure 3-6 .
每个页面都可以通过地址或位置进行识别,这使得一个页面可以引用另一个页面,类似于指针,但是在磁盘上而不是在内存中。我们可以使用这些页面引用来构建页面树,如图3-6所示。
One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.
一个页面被指定为B-tree的根;每当你想在索引中查找一个键时,你从这里开始。这个页面包含几个键和指向子页面的引用。每个子页面负责一段连续的键范围,引用之间的键指示这些范围之间的边界在哪里。
In the example in Figure 3-6 , we are looking for the key 251, so we know that we need to follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking page that further breaks down the 200–300 range into subranges. Eventually we get down to a page containing individual keys (a leaf page ), which either contains the value for each key inline or contains references to the pages where the values can be found.
在图3-6的例子中,我们要查找键251,所以我们知道需要沿着200和300这两个边界之间的页面引用往下走。这会把我们带到一个外观类似的页面,它进一步把200-300的范围分解为子范围。最终我们会到达一个包含单个键的页面(叶子页),它要么直接内联包含每个键的值,要么包含指向可以找到这些值的页面的引用。
The number of references to child pages in one page of the B-tree is called the branching factor . For example, in Figure 3-6 the branching factor is six. In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
B树中一个页面引用子页面的数量被称为分支因子。例如,在图3-6中,分支因子为六。实际上,分支因子取决于存储页面引用和范围边界所需的空间量,但通常为数百个。
If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). If you want to add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there isn’t enough free space in the page to accommodate the new key, it is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges—see Figure 3-7.
如果您想更新B树中现有键的值,则需要搜索包含该键的叶子页,更改该页中的值,并将该页写回磁盘(对该页的任何引用仍然有效)。如果您想添加一个新键,则需要找到其范围包含新键的页面,并将其添加到该页面中。如果页面中没有足够的空闲空间来容纳新键,则把它分裂成两个半满的页面,并更新父页面以反映新的键范围划分——参见图3-7。
This algorithm ensures that the tree remains balanced : a B-tree with n keys always has a depth of O (log n ). Most databases can fit into a B-tree that is three or four levels deep, so you don’t need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
这种算法确保树保持平衡:具有n个键的B树始终具有O(log n)的深度。大多数数据库可以适应三到四层深度的B树,因此您不需要遵循许多页面引用来找到所需的页面。(具有500个分支因子的4 KB页面的四级树可存储高达256 TB。)
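A simplified sketch of the lookup procedure described above, with pages represented as in-memory dicts rather than 4 KB disk blocks (page splitting on insert is omitted):
下面是上述查找过程的简化示意,页面用内存中的字典表示,而不是4 KB的磁盘块(省略了插入时的页面分裂):

from bisect import bisect_right

# Toy B-tree pages for illustration only. An interior page holds boundary keys and
# child-page references; a leaf page holds keys with their values inline.
leaf1 = {"leaf": True,  "keys": [100, 150, 199], "values": ["a", "b", "c"]}
leaf2 = {"leaf": True,  "keys": [200, 251, 299], "values": ["d", "e", "f"]}
leaf3 = {"leaf": True,  "keys": [300, 350],      "values": ["g", "h"]}
root  = {"leaf": False, "keys": [200, 300],      "children": [leaf1, leaf2, leaf3]}

def btree_lookup(page, key):
    while not page["leaf"]:
        # bisect_right picks the child whose key range covers the search key
        page = page["children"][bisect_right(page["keys"], key)]
    if key in page["keys"]:
        return page["values"][page["keys"].index(key)]
    return None

print(btree_lookup(root, 251))   # 'e'
print(btree_lookup(root, 123))   # None (key not present)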
Making B-trees reliable
The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page; i.e., all references to that page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.
B-树的基本底层写操作是使用新数据覆盖磁盘上的页面。假设覆盖不改变页面的位置;即,在覆盖页面时,所有对该页面的引用仍保持不变。这与日志结构化索引(如 LSM-树)形成了鲜明的对比,后者仅追加到文件中(并最终删除过时的文件),但从不直接修改文件。
You can think of overwriting a page on disk as an actual hardware operation. On a magnetic hard drive, this means moving the disk head to the right place, waiting for the right position on the spinning platter to come around, and then overwriting the appropriate sector with new data. On SSDs, what happens is somewhat more complicated, due to the fact that an SSD must erase and rewrite fairly large blocks of a storage chip at a time [ 19 ].
你可以把在磁盘上覆盖一个页面看作是一个实际的硬件操作。在磁性硬盘上,这意味着把磁头移动到正确的位置,等待旋转盘上的正确位置到达,然后用新数据覆盖适当的扇区。在固态硬盘上,情况稍微复杂一些,因为固态硬盘必须一次擦除和重写存储芯片的相当大的块。
Moreover, some operations require several different pages to be overwritten. For example, if you split a page because an insertion caused it to be overfull, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages. This is a dangerous operation, because if the database crashes after only some of the pages have been written, you end up with a corrupted index (e.g., there may be an orphan page that is not a child of any parent).
此外,某些操作需要覆盖多个不同的页面。例如,如果因为一次插入导致页面过满而需要分裂页面,则需要写入分裂后的两个页面,并覆盖它们的父页面以更新对这两个子页面的引用。这是一个危险的操作,因为如果数据库在只写入了部分页面之后崩溃,最终会得到一个损坏的索引(例如,可能出现一个不属于任何父页面的孤儿页面)。
In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log ). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state [ 5 , 20 ].
为了使数据库具有抗崩溃能力,常见的B树实现方式是在磁盘上包含一个额外的数据结构:写前日志(WAL,也称重做日志)。这是一个只追加的文件,每次对B树的修改必须先被写入该文件,然后才能应用于树本身的页面。当数据库在崩溃后重新启动时,使用该日志将B树恢复到一致状态[5,20]。
An additional complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time—otherwise a thread may see the tree in an inconsistent state. This is typically done by protecting the tree’s data structures with latches (lightweight locks). Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries and atomically swap old segments for new segments from time to time.
在原地更新页面的一个额外复杂性是,如果多个线程同时访问B树,则需要小心的并发控制,否则线程可能会看到树处于不一致状态。这通常通过使用闩(轻量级锁)保护树的数据结构来实现。相比之下,基于日志的方法在这方面更简单,因为它们在后台执行所有合并操作,而不会干扰传入的查询,定期以原子方式交换旧段和新段。
B-tree optimizations
As B-trees have been around for so long, it’s not surprising that many optimizations have been developed over the years. To mention just a few:
由于B树已经存在了很长时间,多年来已开发出许多优化,仅举几例:
-
Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme [ 21 ]. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in “Snapshot Isolation and Repeatable Read” .
一些数据库(例如LMDB)不采用覆盖页面并维护WAL进行崩溃恢复的做法,而是使用写时复制方案[21]。修改后的页面被写入不同的位置,并且会为树中的父页面创建指向新位置的新版本。这种方法对于并发控制也很有用,我们将在“快照隔离和可重复读”中看到。
-
We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages on the interior of the tree, keys only need to provide enough information to act as boundaries between key ranges. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels.
我们可以通过不存储完整的键、而是对键进行缩写来节省页面空间。特别是在树内部的页面上,键只需要提供足够的信息来充当键范围之间的边界。在一个页面中塞进更多的键,可以让树具有更高的分支因子,从而减少层数。
-
In general, pages can be positioned anywhere on disk; there is nothing requiring pages with nearby key ranges to be nearby on disk. If a query needs to scan over a large part of the key range in sorted order, that page-by-page layout can be inefficient, because a disk seek may be required for every page that is read. Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk. However, it’s difficult to maintain that order as the tree grows. By contrast, since LSM-trees rewrite large segments of the storage in one go during merging, it’s easier for them to keep sequential keys close to each other on disk.
通常,页面可以位于磁盘上的任何位置;没有任何规定要求键范围相邻的页面也要在磁盘上相邻。如果查询需要按排序顺序扫描键范围的很大一部分,这种逐页的布局可能效率低下,因为每读取一个页面都可能需要一次磁盘寻道。因此,许多B树实现尝试对树进行布局,使叶子页面在磁盘上按顺序出现。然而,随着树的增长,维持这种顺序很困难。相比之下,由于LSM树在合并期间会一次性重写存储的大段内容,它们更容易让顺序相邻的键在磁盘上彼此靠近。
-
Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.
树上已添加了额外的指针。例如,每个叶页可能都有指向其左侧和右侧的兄弟页的引用,这样可以按顺序扫描键而不跳回父页。
-
B-tree variants such as fractal trees [ 22 ] borrow some log-structured ideas to reduce disk seeks (and they have nothing to do with fractals).
B树的变体,如分形树[22],借鉴了一些日志结构的思想来减少磁盘寻道(而且它们与分形无关)。
Comparing B-Trees and LSM-Trees
Even though B-tree implementations are generally more mature than LSM-tree implementations, LSM-trees are also interesting due to their performance characteristics. As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads [ 23 ]. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
尽管B树的实现通常比LSM树的实现更成熟,但LSM树因其性能特征也很值得关注。根据经验法则,LSM树通常写入更快,而B树被认为读取更快[23]。LSM树上的读取通常较慢,因为它必须检查几个不同的数据结构,以及处于不同压缩阶段的SSTables。
However, benchmarks are often inconclusive and sensitive to details of the workload. You need to test systems with your particular workload in order to make a valid comparison. In this section we will briefly discuss a few things that are worth considering when measuring the performance of a storage engine.
然而,基准测试的结果往往没有定论,并且对工作负载的细节很敏感。您需要用您特定的工作负载来测试系统,才能做出有效的比较。在本节中,我们将简要讨论一些在衡量存储引擎性能时值得考虑的事情。
Advantages of LSM-trees
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself (and perhaps again as pages are split). There is also overhead from having to write an entire page at a time, even if only a few bytes in that page changed. Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure [ 24 , 25 ].
B树索引必须把每条数据至少写入两次:一次写入预写日志(WAL),一次写入树的页面本身(页面分裂时可能还要再写一次)。即使一个页面中只有几个字节发生了变化,也必须一次写入整个页面,这也带来了开销。有些存储引擎甚至会把同一个页面覆写两次,以免在断电时留下只更新了一部分的页面[24,25]。
Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables. This effect—one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime—is known as write amplification . It is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.
由于反复的压缩和SSTable合并,日志结构索引也会多次重写数据。这种效应——数据库内的一次写入,在数据库的整个生命周期中导致对磁盘的多次写入——被称为写放大(write amplification)。这在SSD上尤其值得关注,因为SSD的块在磨损报废之前只能被覆写有限的次数。
In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk. In this case, write amplification has a direct performance cost: the more that a storage engine writes to disk, the fewer writes per second it can handle within the available disk bandwidth.
在写入密集的应用程序中,性能瓶颈可能是数据库写入磁盘的速率。在这种情况下,写放大会带来直接的性能代价:存储引擎写入磁盘的数据越多,在可用的磁盘带宽内每秒能处理的写入就越少。
Moreover, LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree [ 26 ]. This difference is particularly important on magnetic hard drives, where sequential writes are much faster than random writes.
此外,LSM树通常能够支持比B树更高的写入吞吐量,部分原因是它们有时具有较低的写放大(尽管这取决于存储引擎配置和工作负载),部分原因是它们按顺序写入紧凑的SSTable文件,而不必覆盖树中的多个页面 [26]。在磁性硬盘上,这种差异特别重要,因为顺序写入比随机写入快得多。
LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. B-tree storage engines leave some disk space unused due to fragmentation: when a page is split or when a row cannot fit into an existing page, some space in a page remains unused. Since LSM-trees are not page-oriented and periodically rewrite SSTables to remove fragmentation, they have lower storage overheads, especially when using leveled compaction [ 27 ].
LSM树可以更好地压缩,因此通常在磁盘上生成比B树更小的文件。B树存储引擎由于碎片化而留下一些未使用的磁盘空间:当页面被拆分或当一行无法适应现有页面时,页面中的一些空间将保持未使用。由于LSM树不是面向页面的,并且定期重写SSTables以消除碎片,因此它们具有较低的存储开销,特别是在使用级别压实时。
On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine’s write pattern is less pronounced [ 19 ]. However, lower write amplification and reduced fragmentation are still advantageous on SSDs: representing data more compactly allows more read and write requests within the available I/O bandwidth.
在许多SSD上,固件内部使用日志结构算法把随机写入转换为对底层存储芯片的顺序写入,因此存储引擎写入模式的影响不那么明显[19]。然而,更低的写放大和更少的碎片在SSD上仍然是有利的:以更紧凑的方式表示数据,可以在可用的I/O带宽内完成更多的读写请求。
Downsides of LSM-trees
A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes. Even though storage engines try to perform compaction incrementally and without affecting concurrent access, disks have limited resources, so it can easily happen that a request needs to wait while the disk finishes an expensive compaction operation. The impact on throughput and average response time is usually small, but at higher percentiles (see “Describing Performance” ) the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable [ 28 ].
日志结构存储的一个缺点是,压缩过程有时会干扰正在进行的读写操作的性能。尽管存储引擎会尝试增量地执行压缩,并尽量不影响并发访问,但磁盘的资源有限,所以很容易发生请求需要等待磁盘完成一次昂贵的压缩操作的情况。这对吞吐量和平均响应时间的影响通常很小,但在较高的百分位数上(参见“描述性能”),对日志结构存储引擎的查询响应时间有时会相当高,而B树的行为则相对更可预测[28]。
Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background. When writing to an empty database, the full disk bandwidth can be used for the initial write, but the bigger the database gets, the more disk bandwidth is required for compaction.
高写入吞吐量会导致压缩的另一个问题:磁盘有限的写入带宽需要在初始写入(将 memtable 记录和刷新到磁盘)和后台运行的压缩线程之间共享。在写入空数据库时,完整的磁盘带宽可以用于初始写入,但是数据库越大,压缩所需的磁盘带宽就越多。
If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes. In this case, the number of unmerged segments on disk keeps growing until you run out of disk space, and reads also slow down because they need to check more segment files. Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation [ 29 , 30 ].
如果写入吞吐量很高,而合并操作没有得到仔细配置,就可能出现合并不能跟上写入速率的情况。在这种情况下,磁盘上未合并段的数量会不断增加,直到磁盘空间不足,同时读取速度也会变慢,因为需要检查更多的段文件。通常,基于SSTable的存储引擎即使合并不能跟上也不会限制写入速率,因此需要明确的监控来检测此情况。
An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree [ 5 ]. In Chapter 7 we will discuss this point in more detail.
B树的一个优点是每个键在索引中仅存在一个位置,而日志结构化存储引擎可能会在不同的段中有多个相同的键的副本。这一方面使得B树在那些想要提供强事务语义的数据库中显得有吸引力:在许多关系型数据库中,事务隔离是通过对键范围上的锁进行实现的,在B树索引中,这些锁可以直接附加到树上[5]。在第7章中,我们将更详细地讨论这一点。
B-trees are very ingrained in the architecture of databases and provide consistently good performance for many workloads, so it’s unlikely that they will go away anytime soon. In new datastores, log-structured indexes are becoming increasingly popular. There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.
B树已经深深地融入了数据库的架构中,并为许多工作负载提供了持续良好的性能,因此它们不太可能很快被淘汰。在新的数据存储中,基于日志结构的索引(LSM)变得越来越受欢迎。没有一个快速而简单的规则来确定哪种存储引擎更适合您的用例,因此进行经验性测试是值得的。
Other Indexing Structures
So far we have only discussed key-value indexes, which are like a primary key index in the relational model. A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database. Other records in the database can refer to that row/document/vertex by its primary key (or ID), and the index is used to resolve such references.
到目前为止,我们只讨论了键值索引,这类似于关系模型中的主键索引。主键唯一标识关系表中的一行,文档数据库中的一个文档,或图形数据库中的一个顶点。数据库中的其他记录可以通过它的主键(或ID)引用该行/文档/顶点,索引用于解决这些引用。
It is also very common to have secondary indexes. In relational databases, you can create several secondary indexes on the same table using the CREATE INDEX command, and they are often crucial for performing joins efficiently. For example, in Figure 2-1 in Chapter 2 you would most likely have a secondary index on the user_id columns so that you can find all the rows belonging to the same user in each of the tables.
二级索引(secondary index)也非常常见。在关系型数据库中,您可以使用 CREATE INDEX 命令在同一个表上创建多个二级索引,它们通常对于高效地执行连接(join)至关重要。例如,对于第2章中的图2-1,您很可能会在 user_id 列上建立二级索引,以便在每个表中找到属于同一用户的所有行。
A secondary index can easily be constructed from a key-value index. The main difference is that keys are not unique; i.e., there might be many rows (documents, vertices) with the same key. This can be solved in two ways: either by making each value in the index a list of matching row identifiers (like a postings list in a full-text index) or by making each key unique by appending a row identifier to it. Either way, both B-trees and log-structured indexes can be used as secondary indexes.
二级索引可以轻松地从键值索引构建。 主要区别在于键不是唯一的; 即,可能有许多行(文档,顶点)具有相同的键。这可以通过两种方式解决:要么使索引中的每个值都是匹配行标识符的列表(例如全文索引中的发布列表),要么通过附加行标识符将每个键唯一。 无论哪种方法,B树和日志结构索引都可以用作二级索引。
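For example, the postings-list variant could look roughly like this (the table contents and names are made up for illustration):
例如,倒排列表(postings list)这一变体大致可以是下面这个样子(表内容和名称均为示意而虚构):

from collections import defaultdict

# Toy secondary index on a non-unique column: each indexed value maps to a list
# of matching row identifiers (a postings list).
rows = {
    1: {"user_id": 101, "comment": "great"},
    2: {"user_id": 102, "comment": "nice"},
    3: {"user_id": 101, "comment": "thanks"},
}

index_by_user = defaultdict(list)
for row_id, row in rows.items():
    index_by_user[row["user_id"]].append(row_id)   # keys are not unique

print(index_by_user[101])                          # [1, 3]
print([rows[r] for r in index_by_user[101]])       # the matching rows themselves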
Storing values within the index
The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere. In the latter case, the place where rows are stored is known as a heap file , and it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.
索引中的关键字是查询搜索的内容,但值可以是以下两种之一:它可能是实际的行(文档、顶点),或者是对其他地方存储的行的引用。在后一种情况下,行存储的位置被称为堆文件,它以任意顺序存储数据(可能只是追加,或者可能跟踪已删除的行以便稍后用新数据覆盖它们)。堆文件方法很常见,因为它避免了在存在多个辅助索引时复制数据:每个索引只引用堆文件中的一个位置,而实际数据只保存在一个地方。
When updating a value without changing the key, the heap file approach can be quite efficient: the record can be overwritten in place, provided that the new value is not larger than the old value. The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location [ 5 ].
更新值时,如果键不变,堆文件方法可以相当有效:只要新值不比旧值大,记录就可以被原地覆盖。如果新值较大,则情况就更加复杂了,因为它可能需要移动到堆中有足够空间的新位置。在这种情况下,要么更新所有索引以指向记录的新堆位置,要么在旧的堆位置留下转发指针。
In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index . For example, in MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key (rather than a heap file location) [ 31 ]. In SQL Server, you can specify one clustered index per table [ 32 ].
在某些情况下,从索引到堆文件的额外跳跃对于读取来说是过于性能损失,因此将希望将索引行直接存储在索引中。这就被称为聚集索引。例如,MySQL的InnoDB存储引擎中,表的主键始终是聚集索引,而次要索引引用主键(而不是堆文件位置)[31]。在SQL Server中,您可以对每个表指定一个聚集索引[32]。
A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns , which stores some of a table’s columns within the index [ 33 ]. This allows some queries to be answered by using the index alone (in which case, the index is said to cover the query) [ 32 ].
聚簇索引(在索引中存储所有行数据)和非聚簇索引(在索引中仅存储对数据的引用)之间的一种折中被称为覆盖索引(covering index)或包含列的索引(index with included columns),它在索引内存储表的部分列[33]。这使得一些查询只靠索引本身就能得到答案(在这种情况下,称该索引覆盖(cover)了这个查询)[32]。
As with any kind of duplication of data, clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes. Databases also need to go to additional effort to enforce transactional guarantees, because applications should not see inconsistencies due to the duplication.
与任何类型的数据重复一样,聚簇索引和覆盖索引可以加快读取速度,但它们需要额外的存储空间,并且会增加写入的开销。数据库还需要付出额外的努力来执行事务保证,因为应用程序不应该看到由这种重复带来的不一致。
Multi-column indexes
The indexes discussed so far only map a single key to a value. That is not sufficient if we need to query multiple columns of a table (or multiple fields in a document) simultaneously.
到目前为止,讨论的索引只能将单个键映射到一个值。如果我们需要同时查询表格中的多个列(或文档中的多个字段),这是不够的。
The most common type of multi-column index is called a concatenated index , which simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). This is like an old-fashioned paper phone book, which provides an index from ( lastname , firstname ) to phone number. Due to the sort order, the index can be used to find all the people with a particular last name, or all the people with a particular lastname-firstname combination. However, the index is useless if you want to find all the people with a particular first name.
最常见的多列索引类型是称为连接索引的索引,它简单地通过将一个列附加到另一个列来将几个字段组合成一个键(索引定义指定了字段连接的顺序)。这就像一个老式的电话簿,它提供了一个从(姓氏,名字)到电话号码的索引。由于排序顺序,索引可用于查找所有具有特定姓氏的人,或所有具有特定姓氏-名字组合的人。然而,如果您要查找所有具有特定名字的人,则索引无用。
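A rough sketch of why the sort order helps for a last-name prefix but not for the first name alone (names and numbers are invented):
下面粗略地示意了为什么这种排序顺序有助于按姓氏前缀查找,却无助于只按名字查找(姓名和号码均为虚构):

from bisect import bisect_left

# Toy concatenated index: the key is (lastname, firstname), kept in sorted order.
phone_book = sorted([
    (("Smith", "Alice"), "020 1111"),
    (("Smith", "Bob"),   "020 2222"),
    (("Young", "Carol"), "020 3333"),
])
keys = [k for k, _ in phone_book]

# Works: everyone with a particular last name (a prefix of the concatenated key).
start = bisect_left(keys, ("Smith", ""))
end   = bisect_left(keys, ("Smith", "\uffff"))   # "\uffff" as a crude upper sentinel
print(phone_book[start:end])                     # both Smiths

# Useless for "everyone named Alice": first names are scattered through the sort
# order, so the whole index would have to be scanned.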
Multi-dimensional indexes are a more general way of querying several columns at once, which is particularly important for geospatial data. For example, a restaurant-search website may have a database containing the latitude and longitude of each restaurant. When a user is looking at the restaurants on a map, the website needs to search for all the restaurants within the rectangular map area that the user is currently viewing. This requires a two-dimensional range query like the following:
多维索引是一种更通用的同时查询多列的方式,这对于地理空间数据尤其重要。例如,一个餐厅搜索网站可能有一个包含每个餐厅纬度和经度的数据库。当用户在地图上查看餐厅时,网站需要搜索用户当前查看的矩形地图区域内的所有餐厅。这需要一个二维范围查询,如下所示:
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
                            AND longitude > -0.1162 AND longitude < -0.1004;
A standard B-tree or LSM-tree index is not able to answer that kind of query efficiently: it can give you either all the restaurants in a range of latitudes (but at any longitude), or all the restaurants in a range of longitudes (but anywhere between the North and South poles), but not both simultaneously.
标准的B树或LSM树索引无法有效地回答这种查询:它可以给出纬度范围内的所有餐厅(但经度可以任意),或者经度范围内的所有餐厅(但在南北极之间的任何地方),但不能同时给出。
One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index [ 34 ]. More commonly, specialized spatial indexes such as R-trees are used. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s Generalized Search Tree indexing facility [ 35 ]. We don’t have space to describe R-trees in detail here, but there is plenty of literature on them.
一种选择是使用填充空间曲线将二维位置转换为单个数字,然后使用常规B树索引[34]。 更常见的是使用专门的空间索引,例如R-trees。例如,PostGIS将地理空间索引实现为R-trees,使用PostgreSQL的广义搜索树索引工具[35]。 我们没有空间在此处详细描述R-trees,但有大量文献可供参考。
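As a sketch of the space-filling-curve option mentioned above, here is a Z-order (Morton) encoding that interleaves the bits of the two coordinates into a single sortable number; the grid resolution and scaling are chosen arbitrarily for illustration:
作为上文提到的空间填充曲线方案的一个示意,下面是Z序(Morton)编码,它把两个坐标的比特交错成一个可排序的数字;网格分辨率和缩放方式都是为了说明而任意选择的:

def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of two coordinates into one number."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x contributes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y contributes the odd bit positions
    return z

def z_value(lat, lon, bits=16):
    # Map latitude/longitude onto an integer grid first (resolution is arbitrary here).
    x = int((lon + 180) / 360 * (2**bits - 1))
    y = int((lat + 90) / 180 * (2**bits - 1))
    return interleave_bits(x, y, bits)

print(z_value(51.5007, -0.1246))   # a single sortable key for a two-dimensional point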
An interesting idea is that multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce website you could use a three-dimensional index on the dimensions ( red , green , blue ) to search for products in a certain range of colors, or in a database of weather observations you could have a two-dimensional index on ( date , temperature ) in order to efficiently search for all the observations during the year 2013 where the temperature was between 25 and 30℃. With a one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex [ 36 ].
一个有趣的想法是,多维索引不仅仅适用于地理位置。例如,在电子商务网站上,你可以使用(红, 绿, 蓝)三个维度上的三维索引来搜索某个颜色范围内的产品;或者在一个天气观测数据库中,你可以使用(日期, 温度)上的二维索引,以便高效地搜索2013年中所有温度在25℃到30℃之间的观测记录。使用一维索引,你要么必须扫描2013年的所有记录(不管温度如何)然后再按温度过滤,要么反过来。二维索引则可以同时按时间戳和温度缩小范围。HyperDex使用了这种技术[36]。
Full-text search and fuzzy indexes
All the indexes discussed so far assume that you have exact data and allow you to query for exact values of a key, or a range of values of a key with a sort order. What they don’t allow you to do is search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques.
到目前为止讨论的所有索引都假定您具有精确数据,并允许您查询某个关键字的精确值或具有排序顺序的关键字值范围。但它们无法让你搜索相似的关键字,比如拼写错误的单词。这种模糊查询需要不同的技术。
For example, full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near each other in the same document, and support various other features that depend on linguistic analysis of the text. To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [ 37 ].
例如,全文搜索引擎通常允许将一个单词的搜索扩展到包括该词的同义词,忽略单词的语法变化,并在同一文档中搜索单词出现在彼此附近的情况,并支持其他各种依赖于文本语言分析的功能。为了处理文档或查询中的打字错误,Lucene能够在一定的编辑距离内搜索文字(编辑距离为1表示添加、删除或替换一个字母)[37]。
As mentioned in “Making an LSM-tree out of SSTables” , Lucene uses a SSTable-like structure for its term dictionary. This structure requires a small in-memory index that tells queries at which offset in the sorted file they need to look for a key. In LevelDB, this in-memory index is a sparse collection of some of the keys, but in Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie [ 38 ]. This automaton can be transformed into a Levenshtein automaton , which supports efficient search for words within a given edit distance [ 39 ].
正如前面“将SSTable制作成LSM树”中提到的,Lucene的词项字典使用类似SSTable的结构。这种结构需要一个小的内存索引,告诉查询需要到已排序文件中的哪个偏移量去查找某个键。在LevelDB中,这个内存索引是部分键的稀疏集合;而在Lucene中,这个内存索引是建立在键的字符之上的有限状态自动机,类似于字典树(trie)[38]。这个自动机可以转化为Levenshtein自动机,支持在给定编辑距离内高效地搜索单词[39]。
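Lucene's automaton construction is beyond the scope of a short example, but the underlying notion of edit distance is simple; a textbook dynamic-programming sketch (not Lucene's method):
Lucene的自动机构造超出了一个简短示例的范围,但编辑距离这一底层概念很简单;下面是教科书式的动态规划示意(并非Lucene的做法):

def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep a match)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitten"))  # 1: one substitution
print(edit_distance("hello", "helo"))     # 1: one deletion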
Other fuzzy search techniques go in the direction of document classification and machine learning. See an information retrieval textbook for more detail [e.g., 40 ].
其他模糊搜索技术越来越倾向于文档分类和机器学习。有关更多细节,请参阅信息检索教材 [例如,40]。
Keeping everything in memory
The data structures discussed so far in this chapter have all been answers to the limitations of disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs, data on disk needs to be laid out carefully if you want good performance on reads and writes. However, we tolerate this awkwardness because disks have two significant advantages: they are durable (their contents are not lost if the power is turned off), and they have a lower cost per gigabyte than RAM.
这一章讨论的数据结构都是为克服磁盘的限制而设计的。与主存储器相比,磁盘处理起来比较麻烦。对于磁性盘和固态盘,如果要读写数据具有良好的性能,就需要仔细布局磁盘上的数据。但我们忍受这种麻烦,是因为磁盘有两个重要的优点:它们是耐用的(如果断电,内容不会丢失),并且每GB成本比RAM低。
As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of in-memory databases .
随着RAM的价格越来越便宜,每GB成本的争议已经消失了。许多数据集并不是非常庞大,因此将它们完全存储在内存中是相当可行的,还可以在多台机器之间进行分布。这导致了内存数据库的发展。
Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it’s acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.
一些内存键值存储,如Memcached,仅用于缓存,如果机器重新启动,数据丢失是可以接受的。但是,其他内存数据库的目标是实现持久性,可以通过特殊硬件(例如电池供电的RAM),将更改日志写入磁盘,定期将快照写入磁盘或将内存状态复制到其他机器来实现。
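As a rough sketch of the “log of changes on disk” approach to durability (not the implementation of any particular product), the following Python example serves all reads from a plain dict, appends every write to a log file before acknowledging it, and replays that log on restart. The file name and JSON encoding are arbitrary choices for illustration.

    import json, os

    class DurableInMemoryKV:
        def __init__(self, log_path="changes.log"):
            self.log_path = log_path
            self.data = {}                    # all reads are served from memory
            if os.path.exists(log_path):      # on restart, rebuild state by replaying the log
                with open(log_path) as f:
                    for line in f:
                        key, value = json.loads(line)
                        self.data[key] = value

        def set(self, key, value):
            # Append the change to the log before applying it, so it survives a crash;
            # the disk is used only as an append-only log, never for reads.
            with open(self.log_path, "a") as f:
                f.write(json.dumps([key, value]) + "\n")
                f.flush()
                os.fsync(f.fileno())
            self.data[key] = value

        def get(self, key):
            return self.data.get(key)         # no disk access on the read path

    store = DurableInMemoryKV()
    store.set("42", {"name": "San Francisco"})
    print(store.get("42"))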
When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica (unless special hardware is used). Despite writing to disk, it’s still an in-memory database, because the disk is merely used as an append-only log for durability, and reads are served entirely from memory. Writing to disk also has operational advantages: files on disk can easily be backed up, inspected, and analyzed by external utilities.
当内存数据库重新启动时,它需要从磁盘或通过网络从副本重新加载其状态(除非使用了特殊硬件)。尽管会写入磁盘,它仍然是内存数据库,因为磁盘仅用作追加日志以实现持久性,读取则完全由内存提供。写入磁盘还具有运维上的优势:磁盘上的文件可以很容易地由外部工具进行备份、检查和分析。
Products such as VoltDB, MemSQL, and Oracle TimesTen are in-memory databases with a relational model, and the vendors claim that they can offer big performance improvements by removing all the overheads associated with managing on-disk data structures [ 41 , 42 ]. RAMCloud is an open source, in-memory key-value store with durability (using a log-structured approach for the data in memory as well as the data on disk) [ 43 ]. Redis and Couchbase provide weak durability by writing to disk asynchronously.
像VoltDB、MemSQL和Oracle TimesTen这样的产品是具有关系模型的内存数据库,供应商声称通过消除与管理磁盘数据结构相关的所有开销,它们可以提供巨大的性能提升[41, 42]。RAMCloud是一个开源的内存键值存储,具有持久性(对内存中的数据和磁盘上的数据都采用日志结构化方法)[43]。Redis和Couchbase通过异步写入磁盘提供较弱的持久性。
Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [ 44 ].
与直觉相反,内存数据库的性能优势并不是因为它们不需要从磁盘读取。只要有足够的内存,即使是基于磁盘的存储引擎也可能永远不需要从磁盘读取,因为操作系统无论如何都会在内存中缓存最近使用的磁盘块。相反,它们之所以更快,是因为可以避免将内存数据结构编码为可写入磁盘的形式所带来的开销[44]。
Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.
除了性能外,内存数据库的另一个有趣领域是提供难以用基于磁盘索引实现的数据模型。例如,Redis提供了类似数据库的接口,可以处理各种数据结构,如优先级队列和集合。由于它将所有数据存储在内存中,因此它的实现相对简单。
Recent research indicates that an in-memory database architecture could be extended to support datasets larger than the available memory, without bringing back the overheads of a disk-centric architecture [ 45 ]. The so-called anti-caching approach works by evicting the least recently used data from memory to disk when there is not enough memory, and loading it back into memory when it is accessed again in the future. This is similar to what operating systems do with virtual memory and swap files, but the database can manage memory more efficiently than the OS, as it can work at the granularity of individual records rather than entire memory pages. This approach still requires indexes to fit entirely in memory, though (like the Bitcask example at the beginning of the chapter).
最近的研究表明,内存数据库架构可以扩展以支持比可用内存更大的数据集,而不会重新引入以磁盘为中心架构的开销[45]。所谓的反缓存方法的工作原理是:当内存不足时,将最近最少使用的数据从内存驱逐到磁盘,并在将来再次访问时将其重新加载到内存中。这类似于操作系统对虚拟内存和交换文件的处理方式,但数据库可以比操作系统更高效地管理内存,因为它可以在单条记录的粒度上工作,而不是整个内存页面。不过,这种方法仍然要求索引完全放入内存(就像本章开头的 Bitcask 示例一样)。
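A toy sketch of the anti-caching idea, with an in-memory dict tracking recency and a shelve file standing in for disk; the tiny capacity and the records are invented, and a real system would track access patterns and batch evictions much more carefully.

    import shelve
    from collections import OrderedDict

    class AntiCachingStore:
        def __init__(self, capacity, disk_path="evicted.db"):
            self.capacity = capacity
            self.memory = OrderedDict()            # most recently used keys at the end
            self.disk = shelve.open(disk_path)     # stands in for on-disk storage

        def put(self, key, record):
            self.memory[key] = record
            self.memory.move_to_end(key)
            self._evict_if_needed()

        def get(self, key):
            if key in self.memory:
                self.memory.move_to_end(key)       # mark as recently used
                return self.memory[key]
            if key in self.disk:                   # load the record back into memory
                record = self.disk[key]
                del self.disk[key]
                self.put(key, record)
                return record
            return None

        def _evict_if_needed(self):
            # Evict least recently used records at the granularity of individual
            # records, rather than whole memory pages as OS swapping would.
            while len(self.memory) > self.capacity:
                old_key, old_record = self.memory.popitem(last=False)
                self.disk[old_key] = old_record

    store = AntiCachingStore(capacity=2)
    store.put("a", {"v": 1}); store.put("b", {"v": 2}); store.put("c", {"v": 3})
    print(sorted(store.memory))   # ['b', 'c'] stay in memory; 'a' was evicted to disk
    print(store.get("a"))         # {'v': 1}, reloaded from disk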
Further changes to storage engine design will probably be needed if non-volatile memory (NVM) technologies become more widely adopted [ 46 ]. At present, this is a new area of research, but it is worth keeping an eye on in the future.
如果非易失性存储器(NVM)技术得到更广泛的应用,存储引擎设计可能需要进一步改变。目前,这是一个新的研究领域,但未来值得注意。
Transaction Processing or Analytics?
In the early days of business data processing, a write to the database typically corresponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands, the term transaction nevertheless stuck, referring to a group of reads and writes that form a logical unit.
在商业数据处理的早期,对数据库的写入通常对应于商业交易的发生:销售商品,向供应商下订单,支付员工工资等。随着数据库扩展到不涉及资金流动的领域,事务这个术语仍然保持着,指的是形成逻辑单元的一组读写操作。
Note
A transaction needn’t necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day). We discuss the ACID properties in Chapter 7 and batch processing in Chapter 10 .
一笔交易不一定需要ACID (原子性、一致性、隔离性和持久性) 属性。 事务处理只是允许客户进行低延迟读写,而不是周期性运行的批处理作业(例如,每天一次)。 我们在第7章中讨论ACID属性,在第10章中讨论批处理。
Even though databases started being used for many different kinds of data—comments on blog posts, actions in a game, contacts in an address book, etc.—the basic access pattern remained similar to processing business transactions. An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user’s input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).
尽管数据库开始用于许多不同类型的数据-博客评论、游戏中的活动、地址簿中的联系人等-但基本的访问模式与处理业务交易类似。应用程序通常通过索引按某个键查找少量记录。根据用户的输入插入或更新记录。由于这些应用程序是交互式的,因此访问模式被称为在线事务处理(OLTP)。
However, databases also started being increasingly used for data analytics , which has very different access patterns. Usually an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user. For example, if your data is a table of sales transactions, then analytic queries might be:
然而,数据库也开始越来越多地用于数据分析,其访问模式也有很大差别。通常情况下,分析查询需要扫描大量的记录,每个记录只阅读少数列,并计算聚合统计信息(如计数、求和或平均数),而不是将原始数据返回给用户。例如,如果您的数据是销售交易表,则分析查询可能包括:
-
What was the total revenue of each of our stores in January?
我们每家店在一月份的总收入是多少?
-
How many more bananas than usual did we sell during our latest promotion?
我们最近促销期间卖了比平常多多少香蕉?
-
Which brand of baby food is most often purchased together with brand X diapers?
哪个品牌的婴儿食品最常与品牌 X 的尿布一起购买?
These queries are often written by business analysts, and feed into reports that help the management of a company make better decisions ( business intelligence ). In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP) [ 47 ]. iv The difference between OLTP and OLAP is not always clear-cut, but some typical characteristics are listed in Table 3-1 .
这些查询通常由业务分析师编写,并汇入帮助公司管理层做出更好决策的报告中(商业智能)。为了将这种使用数据库的模式与事务处理区分开来,它被称为在线分析处理(OLAP)[47]。OLTP和OLAP之间的区别并不总是泾渭分明,但表3-1列出了一些典型的特征。
Table 3-1. Comparing characteristics of transaction processing versus analytic systems 表3-1. 事务处理系统与分析系统的特性比较

| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Main read pattern 主要读取模式 | Small number of records per query, fetched by key 每次查询通过键获取少量记录 | Aggregate over large number of records 对大量记录进行聚合 |
| Main write pattern 主要写入模式 | Random-access, low-latency writes from user input 来自用户输入的随机访问、低延迟写入 | Bulk import (ETL) or event stream 批量导入(ETL)或事件流 |
| Primarily used by 主要使用者 | End user/customer, via web application 终端用户/客户,通过Web应用程序 | Internal analyst, for decision support 内部分析师,用于决策支持 |
| What data represents 数据代表什么 | Latest state of data (current point in time) 数据的最新状态(当前时间点) | History of events that happened over time 随时间发生的事件历史 |
| Dataset size 数据集大小 | Gigabytes to terabytes 千兆字节到太字节 | Terabytes to petabytes 太字节到拍字节 |
At first, the same databases were used for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for OLTP-type queries as well as OLAP-type queries. Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database instead. This separate database was called a data warehouse .
最初,同样的数据库被用于事务处理和分析查询。SQL 在这方面表现得相当灵活:它适用于 OLTP 类型查询和 OLAP 类型查询。然而,在 1980 年代末和 1990 年代初,公司采用停止在 OLTP 系统上进行分析,改为在另一个数据库上运行分析查询的趋势。这个单独的数据库被称为数据仓库。
Data Warehousing
An enterprise may have dozens of different transaction processing systems: systems powering the customer-facing website, controlling point of sale (checkout) systems in physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, administering employees, etc. Each of these systems is complex and needs a team of people to maintain it, so the systems end up operating mostly autonomously from each other.
企业可能有数十种不同的事务处理系统:为面向客户的网站提供支持的系统、控制实体商店销售点(结账)的系统、跟踪仓库库存的系统、规划车辆路线的系统、管理供应商的系统、管理员工的系统等等。每个系统都很复杂,需要一个团队来维护,因此这些系统大多彼此独立地运作。
These OLTP systems are usually expected to be highly available and to process transactions with low latency, since they are often critical to the operation of the business. Database administrators therefore closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions.
这些OLTP系统通常被期望具有高可用性,并以低延迟方式处理事务,因为它们通常对业务操作至关重要。因此,数据库管理员严密保管他们的OLTP数据库。他们通常不愿意让业务分析师在OLTP数据库上运行即席分析查询,因为这些查询通常很昂贵,需要扫描数据集的大部分内容,这可能会损害同时执行的事务的性能。
A data warehouse , by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations [ 48 ]. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL) and is illustrated in Figure 3-8 .
数据仓库是一个单独的数据库,分析师可以随心所欲地查询它,而不会影响 OLTP 操作。数据仓库中包含了公司所有各种 OLTP 系统的数据的只读副本。数据是从 OLTP 数据库中提取的(使用定期数据转储或持续更新流),转换为适合分析的模式,清理并加载到数据仓库中。将数据装入仓库的过程称为 Extract-Transform-Load (ETL),如图 3-8 所示。
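To make the ETL steps concrete, here is a toy sketch using Python's built-in sqlite3, with invented table names and columns: rows are extracted from a stand-in OLTP database, transformed into a flat, analysis-friendly shape, and loaded into a separate stand-in warehouse. Real ETL involves many source systems, extensive cleaning, and either periodic dumps or continuous streams of updates.

    import sqlite3

    # Stand-in OLTP database with a small orders table.
    oltp = sqlite3.connect(":memory:")
    oltp.execute("CREATE TABLE orders (id INTEGER, customer TEXT, product TEXT, price_cents INTEGER)")
    oltp.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                     [(1, "alice", "banana", 35), (2, "bob", "candy", 120)])

    # Stand-in data warehouse with a denormalized, analysis-friendly fact table.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fact_sales (order_id INTEGER, customer TEXT, product TEXT, price_dollars REAL)")

    # Extract: read the raw rows out of the OLTP system.
    rows = oltp.execute("SELECT id, customer, product, price_cents FROM orders").fetchall()

    # Transform: clean up and reshape into the warehouse schema (e.g. cents to dollars).
    transformed = [(order_id, customer.title(), product, price_cents / 100.0)
                   for order_id, customer, product, price_cents in rows]

    # Load: bulk-insert into the warehouse, where analysts can query it freely.
    warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", transformed)
    warehouse.commit()
    print(warehouse.execute("SELECT * FROM fact_sales").fetchall())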
Data warehouses now exist in almost all large enterprises, but in small companies they are almost unheard of. This is probably because most small companies don’t have so many different OLTP systems, and most small companies have a small amount of data—small enough that it can be queried in a conventional SQL database, or even analyzed in a spreadsheet. In a large company, a lot of heavy lifting is required to do something that is simple in a small company.
数据仓库现在几乎在所有大型企业中存在,但在小型公司中,它们几乎是不为人知的。这可能是因为大多数小型公司没有太多不同的OLTP系统,并且大多数小型公司具有较少的数据量-足够小,可以在传统的SQL数据库中查询,甚至可以在电子表格中进行分析。在大型公司中,需要进行大量的工作才能完成小型公司中的简单任务。
A big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns. It turns out that the indexing algorithms discussed in the first half of this chapter work well for OLTP, but are not very good at answering analytic queries. In the rest of this chapter we will look at storage engines that are optimized for analytics instead.
使用单独的数据仓库进行分析的一个巨大优势是,数据仓库可以为分析访问模式进行优化。事实证明,在本章的前半部分讨论的索引算法适用于 OLTP,但不适合回答分析查询。在本章的其余部分中,我们将研究专为分析而优化的存储引擎。
The divergence between OLTP databases and data warehouses
The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries. There are many graphical data analysis tools that generate SQL queries, visualize the results, and allow analysts to explore the data (through operations such as drill-down and slicing and dicing ).
数据仓库的数据模型通常是关系型的,因为SQL通常非常适合分析查询。有许多图形化数据分析工具可以生成SQL查询,可视化结果,并允许分析人员探索数据(通过钻取和切片)等操作。
On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.
表面上,数据仓库和关系型OLTP数据库看起来很相似,因为它们都有SQL查询界面。然而,系统内部可能会看起来非常不同,因为它们针对非常不同的查询模式进行了优化。许多数据库供应商现在专注于支持事务处理或分析工作负载,但不支持两者兼备。
Some databases, such as Microsoft SQL Server and SAP HANA, have support for transaction processing and data warehousing in the same product. However, they are increasingly becoming two separate storage and query engines, which happen to be accessible through a common SQL interface [ 49 , 50 , 51 ].
一些数据库,例如Microsoft SQL Server和SAP HANA,支持在同一产品中进行事务处理和数据仓库。然而,它们越来越成为两个独立的存储和查询引擎,通过共同的SQL接口访问[49,50,51]。
Data warehouse vendors such as Teradata, Vertica, SAP HANA, and ParAccel typically sell their systems under expensive commercial licenses. Amazon RedShift is a hosted version of ParAccel. More recently, a plethora of open source SQL-on-Hadoop projects have emerged; they are young but aiming to compete with commercial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill [ 52 , 53 ]. Some of them are based on ideas from Google’s Dremel [ 54 ].
数据仓库供应商,如Teradata、Vertica、SAP HANA和ParAccel通常以昂贵的商业许可证销售其系统。Amazon RedShift是ParAccel的托管版本。最近,涌现了许多开源的基于Hadoop的SQL项目,它们年轻但旨在与商业数据仓库系统竞争。其中包括Apache Hive、Spark SQL、Cloudera Impala、Facebook Presto、Apache Tajo和Apache Drill [52,53]。其中一些基于Google的Dremel的想法[54]。
Stars and Snowflakes: Schemas for Analytics
As explored in Chapter 2 , a wide range of different data models are used in the realm of transaction processing, depending on the needs of the application. On the other hand, in analytics, there is much less diversity of data models. Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling [ 55 ]).
正如第2章所探讨的,事务处理领域中使用了各种不同的数据模型,具体取决于应用程序的需求。另一方面,在分析领域中,数据模型的多样性要少得多。许多数据仓库采用了一种相当公式化的样式,称为星型模式(也称为维度建模)。
The example schema in Figure 3-9 shows a data warehouse that might be found at a grocery retailer. At the center of the schema is a so-called fact table (in this example, it is called fact_sales). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customer’s purchase of a product). If we were analyzing website traffic rather than retail sales, each row might represent a page view or a click by a user.
图3-9中的示例模式显示了一个可能在杂货零售商处找到的数据仓库。模式的中心是所谓的事实表(在本例中,它被称为fact_sales)。事实表的每一行表示发生在特定时间的事件(在这里,每一行表示客户购买产品)。如果我们分析的是网站流量而不是零售销售,每行可能表示用户的页面视图或点击。
Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later. However, this means that the fact table can become extremely large. A big enterprise like Apple, Walmart, or eBay may have tens of petabytes of transaction history in its data warehouse, most of which is in fact tables [ 56 ].
通常情况下,事实被捕捉为单独的事件,因为这样可以为以后的分析保留最大的灵活性。然而,这意味着事实表可能变得非常庞大。像苹果、沃尔玛或eBay这样的大企业,其数据仓库中可能有数十PB的交易历史,其中大部分存放在事实表中[56]。
Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit margin to be calculated). Other columns in the fact table are foreign key references to other tables, called dimension tables . As each row in the fact table represents an event, the dimensions represent the who , what , where , when , how , and why of the event.
事实表中的某些列是属性,例如产品销售的价格以及从供应商购买产品的成本(可计算利润率)。事实表中的其他列是指向其他表的外键引用,称为维度表。由于事实表中的每行代表一个事件,因此维度代表事件的谁,何物,何处,何时,如何和为什么。
For example, in Figure 3-9, one of the dimensions is the product that was sold. Each row in the dim_product table represents one type of product that is for sale, including its stock-keeping unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the fact_sales table uses a foreign key to indicate which product was sold in that particular transaction. (For simplicity, if the customer buys several different products at once, they are represented as separate rows in the fact table.)
例如,在图3-9中,一个维度是已售出的产品。dim_product表中的每一行代表一种销售产品,包括它的库存保持单位(SKU),描述,品牌名称,类别,脂肪含量,包装大小等。fact_sales表中的每一行使用外键表示在该特定交易中售出的产品。(为简单起见,如果客户同时购买多种不同产品,则在事实表中表示为单独的行。)
Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays.
日期和时间经常使用维度表来代表,因为这可以将有关日期的其他信息(例如公共假期)进行编码,从而使查询能够区别节假日和非节假日的销售。
The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.
“星型模式”这个名字来源于:当把表之间的关系可视化时,事实表位于中间,被各个维度表所包围;与这些表的连接就像星星的光芒。
A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the dim_product table could reference the brand and category as foreign keys, rather than storing them as strings in the dim_product table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with [55].
这个模板的一个变体称为雪花模式,其中维度被进一步分解为子维度。例如,品牌和产品类别可以有单独的表,并且dim_product表中的每一行都可以引用品牌和类别作为外键,而不是在dim_product表中将它们存储为字符串。雪花模式比星形模式更规范化,但星形模式通常更受分析师的欢迎,因为它们更简单[55]。
In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred [51]. Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysis. For example, the dim_store table may include details of which services are offered at each store, whether it has an in-store bakery, the square footage, the date when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.
在典型的数据仓库中,表通常非常宽:事实表通常拥有超过100个列,有时甚至几百个 [51]。维度表也可以非常宽,因为它们包括可能与分析相关的所有元数据 - 例如,dim_store表可能包括每个店铺提供的服务细节,是否有店内面包房,面积大小,店铺开业日期,上次翻新日期,距离最近的高速公路有多远等。
Column-Oriented Storage
If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of rows), so in this section we will concentrate primarily on storage of facts.
如果您的事实表中有数万亿行和数PB的数据,高效地存储和查询它们成为一个具有挑战性的问题。维度表通常要小得多(数百万行),因此在本节中,我们将主要关注事实的存储。
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time ("SELECT *" queries are rarely needed for analytics) [51]. Take the query in Example 3-1: it accesses a large number of rows (every occurrence of someone buying fruit or candy during the 2013 calendar year), but it only needs to access three columns of the fact_sales table: date_key, product_sk, and quantity. The query ignores all other columns.
虽然事实表通常有超过100列,但典型的数据仓库查询一次只访问4或5列(很少需要“SELECT *”查询进行分析)。在示例3-1中的查询中,虽然它访问了大量的行(每个人在2013年购买水果或糖果的每个事件),但它仅需要访问fact_sales表的三个列:date_key、product_sk和quantity。该查询忽略了所有其他列。
Example 3-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week
    SELECT
      dim_date.weekday, dim_product.category,
      SUM(fact_sales.quantity) AS quantity_sold
    FROM fact_sales
      JOIN dim_date    ON fact_sales.date_key   = dim_date.date_key
      JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
    WHERE
      dim_date.year = 2013 AND
      dim_product.category IN ('Fresh fruit', 'Candy')
    GROUP BY
      dim_date.weekday, dim_product.category;
How can we execute this query efficiently?
如何有效地执行这个查询?
In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes. You can see this in the CSV example of Figure 3-1 .
在大多数OLTP数据库中,存储是以行为导向的方式布置的:表的一行中的所有值都相互靠近存储。文档数据库类似:整个文档通常被存储为一个连续的字节序列。您可以在图3-1的CSV示例中看到这一点。
In order to process a query like Example 3-1, you may have indexes on fact_sales.date_key and/or fact_sales.product_sk that tell the storage engine where to find all the sales for a particular date or for a particular product. But then, a row-oriented storage engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into memory, parse them, and filter out those that don’t meet the required conditions. That can take a long time.
为了处理类似于Example 3-1的查询,您可能会在fact_sales.date_key和/或fact_sales.product_sk上拥有索引,告诉存储引擎在哪里找到特定日期或特定产品的所有销售。但是,行定向存储引擎仍然需要将所有这些行(每个行包含100多个属性)从磁盘加载到内存中,解析它们并过滤掉不满足要求条件的行。这可能需要很长时间。
The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work. This principle is illustrated in Figure 3-10 .
列式存储的想法很简单:不要把一行中的所有值都存放在一起,而是把每一列的值都存放在一起。如果每一列都存储在不同的文件中,查询只需要读取和解析其中用到的列,这样可以节省很多工作量。这个原则在图3-10中有所体现。
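A minimal sketch of this layout, using one plain text file per column (real column stores use compressed binary formats such as Parquet, but the principle is the same): the column names and row values are made up, a query opens only the column files it needs, and the kth line of every file belongs to the same row.

    import os

    rows = [  # a tiny, made-up slice of a fact_sales table
        {"date_key": 140102, "product_sk": 69, "quantity": 3},
        {"date_key": 140102, "product_sk": 30, "quantity": 1},
        {"date_key": 140103, "product_sk": 69, "quantity": 5},
    ]

    def write_column_files(rows, directory="columns"):
        # Store all the values from each column together, one file per column,
        # keeping the same row order in every file.
        os.makedirs(directory, exist_ok=True)
        for column in rows[0]:
            with open(os.path.join(directory, column + ".col"), "w") as f:
                f.writelines(str(row[column]) + "\n" for row in rows)

    def read_columns(columns, directory="columns"):
        # A query only reads and parses the column files it actually uses.
        result = {}
        for column in columns:
            with open(os.path.join(directory, column + ".col")) as f:
                result[column] = f.read().splitlines()
        return result

    write_column_files(rows)
    data = read_columns(["product_sk", "quantity"])       # date_key is never touched
    print(data["product_sk"])                             # ['69', '30', '69']
    # Reassembling row k means taking the kth entry from each column file:
    print({column: values[1] for column, values in data.items()})  # the second row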
Note
Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data. For example, Parquet [ 57 ] is a columnar storage format that supports a document data model, based on Google’s Dremel [ 54 ].
列式存储在关系数据模型中最容易理解,但它同样适用于非关系数据。例如,Parquet[57]是一种支持文档数据模型的列式存储格式,它基于Google的Dremel[54]。
The column-oriented storage layout relies on each column file containing the rows in the same order. Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the individual column files and put them together to form the 23rd row of the table.
列式存储布局依赖于每个列文件中行的顺序相同。因此,如果您需要重新组合整个行,您可以从每个单独的列文件中取出第23个条目并将它们放在一起形成表的第23行。
Column Compression
Besides only loading those columns from disk that are required for a query, we can further reduce the demands on disk throughput by compressing data. Fortunately, column-oriented storage often lends itself very well to compression.
除了只从磁盘加载查询所需的列之外,我们可以通过压缩数据进一步减少对磁盘吞吐量的需求。幸运的是,列式存储通常非常适合压缩。
Take a look at the sequences of values for each column in Figure 3-10 : they often look quite repetitive, which is a good sign for compression. Depending on the data in the column, different compression techniques can be used. One technique that is particularly effective in data warehouses is bitmap encoding , illustrated in Figure 3-11 .
看一下图3-10中每列的值序列:它们常常看起来非常重复,这对于压缩来说是一个好的迹象。根据列中的数据,可以使用不同的压缩技术。在数据仓库中特别有效的一种技术是位图编码,如图3-11所示。
Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can now take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.
通常,列中有不同值的数量相对于行数来说很小(例如,零售商可能有数十亿的销售交易,但只有10万个不同的产品)。我们现在可以将具有n个不同值的列转换为n个单独的位图:每个不同值都有一个位图,每行一个位,如果该行具有该值,则位为1,否则为0。
If n is very small (for example, a country column may have approximately 200 distinct values), those bitmaps can be stored with one bit per row. But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse ). In that case, the bitmaps can additionally be run-length encoded, as shown at the bottom of Figure 3-11 . This can make the encoding of a column remarkably compact.
如果n很小(例如,一个国家列大约有200个不同的值),那么这些位图可以以每行一个比特的方式存储。但是如果n更大,大部分位图中会有很多零(我们称之为稀疏)。在这种情况下,位图可以额外进行运行长度编码,如图3-11底部所示。这可以使列的编码非常紧凑。
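The following Python sketch shows both steps on a made-up product_sk column: one bitmap per distinct value, with one bit per row, and then a simple run-length encoding of a sparse bitmap. The run-length representation chosen here (a list of (bit, run length) pairs) is just one of several possibilities.

    from collections import defaultdict
    from itertools import groupby

    column = [69, 69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]

    def bitmap_encode(values):
        # One bitmap per distinct value; bit k is 1 if row k has that value, 0 otherwise.
        bitmaps = defaultdict(lambda: [0] * len(values))
        for row_index, value in enumerate(values):
            bitmaps[value][row_index] = 1
        return dict(bitmaps)

    def run_length_encode(bits):
        # Collapse runs of identical bits into (bit, run length) pairs.
        return [(bit, len(list(group))) for bit, group in groupby(bits)]

    bitmaps = bitmap_encode(column)
    print(bitmaps[31])                      # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
    print(run_length_encode(bitmaps[31]))   # [(0, 5), (1, 4), (0, 3), (1, 3), (0, 3)]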
Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data warehouse. For example:
这样的位图索引非常适合数据仓库中常见的查询类型。例如:
-
WHERE product_sk IN (30, 68, 69):
Load the three bitmaps for product_sk = 30, product_sk = 68, and product_sk = 69, and calculate the bitwise OR of the three bitmaps, which can be done very efficiently.
加载 product_sk = 30、product_sk = 68 和 product_sk = 69 的三个位图,然后计算这三个位图的按位或,这可以非常高效地完成。
-
WHERE product_sk = 31 AND store_sk = 3:
Load the bitmaps for product_sk = 31 and store_sk = 3, and calculate the bitwise AND. This works because the columns contain the rows in the same order, so the kth bit in one column’s bitmap corresponds to the same row as the kth bit in another column’s bitmap.
加载 product_sk = 31 和 store_sk = 3 的位图,并计算按位与。这样做之所以可行,是因为各列包含的行顺序相同,因此一个列位图中的第k位与另一个列位图中的第k位对应同一行。
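Continuing the sketch above, both WHERE clauses can be answered with plain bitwise operations. Here each bitmap is packed into a single Python integer, so the OR and the AND each run over all rows at once; the column contents are invented for the example.

    def to_bitmap(values, wanted):
        # Pack the "row has this value" flags into one integer: bit k = row k.
        bits = 0
        for row_index, value in enumerate(values):
            if value == wanted:
                bits |= 1 << row_index
        return bits

    def matching_rows(bits, num_rows):
        return [k for k in range(num_rows) if bits >> k & 1]

    product_sk = [69, 69, 30, 31, 68, 31, 74]   # made-up columns, same row order
    store_sk   = [ 1,  3,  3,  3,  2,  1,  3]

    # WHERE product_sk IN (30, 68, 69): OR the three bitmaps together.
    in_query = to_bitmap(product_sk, 30) | to_bitmap(product_sk, 68) | to_bitmap(product_sk, 69)
    print(matching_rows(in_query, len(product_sk)))    # [0, 1, 2, 4]

    # WHERE product_sk = 31 AND store_sk = 3: AND bitmaps from two different columns.
    # This works because both columns list their rows in the same order.
    and_query = to_bitmap(product_sk, 31) & to_bitmap(store_sk, 3)
    print(matching_rows(and_query, len(product_sk)))   # [3]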
There are also various other compression schemes for different kinds of data, but we won’t go into them in detail—see [ 58 ] for an overview.
还有许多不同种类数据的压缩方案,但我们不会详细介绍-请参阅[58]以获取概述。
Column-oriented storage and column families
Cassandra and HBase have a concept of column families , which they inherited from Bigtable [ 9 ]. However, it is very misleading to call them column-oriented: within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented.
Cassandra和HBase有一个来自Bigtable的列族的概念[9]。然而,称它们为面向列的非常误导:在每个列族中,它们将一行中的所有列与行键一起存储,并且它们不使用列压缩。因此,Bigtable模型仍然主要是面向行的。
Memory bandwidth and vectorized processing
For data warehouse queries that need to scan over millions of rows, a big bottleneck is the bandwidth for getting data from disk into memory. However, that is not the only bottleneck. Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache, avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline, and making use of single-instruction-multi-data (SIMD) instructions in modern CPUs [ 59 , 60 ].
数据仓库查询需要扫描数百万行数据,最大瓶颈是从磁盘读取数据到内存的带宽。不过,这并不是唯一的瓶颈。分析型数据库的开发人员还关注如何有效利用从主存储器到CPU高速缓存的带宽,避免分支预测失误和CPU指令处理流水线上的气泡,以及利用现代CPU的单指令多数据(SIMD)指令[59、60]。
Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles. For example, the query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache and iterate through it in a tight loop (that is, with no function calls). A CPU can execute such a loop much faster than code that requires a lot of function calls and conditions for each record that is processed. Column compression allows more rows from a column to fit in the same amount of L1 cache. Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing [ 58 , 49 ].
列式存储布局不仅可以减少需要从磁盘加载的数据量,还有助于高效利用CPU周期。例如,查询引擎可以取出一块恰好能放进CPU L1缓存的压缩列数据,并在一个紧凑的循环中(即没有函数调用)迭代处理它。与每处理一条记录都需要大量函数调用和条件判断的代码相比,CPU执行这样的循环要快得多。列压缩使得同样大小的L1缓存可以容纳一列中更多的行。前面描述的按位AND和OR等运算符,可以被设计为直接在这样的压缩列数据块上运行。这种技术被称为向量化处理[58, 49]。
Sort Order in Column Storage
In a column store, it doesn’t necessarily matter in which order the rows are stored. It’s easiest to store them in the order in which they were inserted, since then inserting a new row just means appending to each of the column files. However, we can choose to impose an order, like we did with SSTables previously, and use that as an indexing mechanism.
在列存储中,行的存储顺序并不重要。最简单的方法是按插入顺序进行存储,因为这样插入新行只需要将其追加到每个列文件中。然而,我们可以选择强制排序,就像之前使用SSTables一样,并将其用作索引机制。
Note that it wouldn’t make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row. We can only reconstruct a row because we know that the k th item in one column belongs to the same row as the k th item in another column.
请注意,独立对每一列进行排序是没有意义的,因为这样我们将不再知道每一列中的项目属于哪一行。我们只能重构一行,因为我们知道一个列中的第k个项目属于另一列中的第k个项目所在的同一行。
Rather, the data needs to be sorted an entire row at a time, even though it is stored by column. The administrator of the database can choose the columns by which the table should be sorted, using their knowledge of common queries. For example, if queries often target date ranges, such as the last month, it might make sense to make date_key the first sort key. Then the query optimizer can scan only the rows from the last month, which will be much faster than scanning all rows.
相反,数据需要整行排序,即使它是按列存储的。 数据库管理员可以选择按哪些列对表进行排序,利用常见查询的知识。例如,如果查询经常会针对日期范围,比如最近一个月,那么将日期键作为第一排序键可能是有意义的。然后查询优化器可以只扫描最近一个月的行,这比扫描所有行要快得多。
A second column can determine the sort order of any rows that have the same value in the first column. For example, if date_key is the first sort key in Figure 3-10, it might make sense for product_sk to be the second sort key so that all sales for the same product on the same day are grouped together in storage. That will help queries that need to group or filter sales by product within a certain date range.
第二列可以确定在第一列具有相同值的任何行的排序顺序。例如,在图3-10中,如果date_key是第一个排序键,那么product_sk可能成为第二个排序键,以便将同一天相同产品的所有销售分组在一起存储。这将有助于需要在特定日期范围内按产品对销售进行分组或筛选的查询。
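A small sketch of what sorting “an entire row at a time” means when data is stored by column: the columns are zipped back into rows, the rows are sorted by (date_key, product_sk), and the columns are split apart again, so the kth entry of every column still belongs to the same row. The values are made up.

    date_key   = [140103, 140102, 140102, 140103, 140102]
    product_sk = [    69,     31,     69,     30,     30]
    quantity   = [     5,      3,      1,      2,      8]

    # Reassemble rows, sort them as whole rows, then split back into columns.
    rows = sorted(zip(date_key, product_sk, quantity))   # sort key: (date_key, product_sk)
    date_key, product_sk, quantity = (list(column) for column in zip(*rows))

    print(date_key)     # [140102, 140102, 140102, 140103, 140103]
    print(product_sk)   # [30, 31, 69, 30, 69]
    print(quantity)     # [8, 3, 1, 2, 5], still aligned with the same rows
    # The long run of identical date_key values now compresses well with run-length encoding.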
Another advantage of sorted order is that it can help with compression of columns. If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row. A simple run-length encoding, like we used for the bitmaps in Figure 3-11 , could compress that column down to a few kilobytes—even if the table has billions of rows.
排序顺序的另一个优点是它有助于列的压缩。如果主排序列没有很多不同的值,那么排序后,它将有很长的序列,其中相同的值连续重复很多次。像图3-11中用于位图的简单运行长度编码可以将该列压缩到几千字节,即使表有数十亿行。
That compression effect is strongest on the first sort key. The second and third sort keys will be more jumbled up, and thus not have such long runs of repeated values. Columns further down the sorting priority appear in essentially random order, so they probably won’t compress as well. But having the first few columns sorted is still a win overall.
该压缩效应在第一个排序键上最为强烈。第二个和第三个排序键将会更加杂乱无章,因此不会有如此长的重复值序列。更靠后的排序优先级的列是基本上以随机顺序出现的,因此它们可能不会被很好地压缩。但是,总体而言,排序前几列仍然是一个胜利。
Several different sort orders
A clever extension of this idea was introduced in C-Store and adopted in the commercial data warehouse Vertica [ 61 , 62 ]. Different queries benefit from different sort orders, so why not store the same data sorted in several different ways? Data needs to be replicated to multiple machines anyway, so that you don’t lose data if one machine fails. You might as well store that redundant data sorted in different ways so that when you’re processing a query, you can use the version that best fits the query pattern.
这个思路的一个巧妙扩展在C-Store中被引入,并被商业数据仓库Vertica采用[61, 62]。不同的查询受益于不同的排序顺序,那么为什么不以几种不同的方式存储相同的数据呢?反正数据也需要复制到多台机器上,以防一台机器故障时丢失数据。不妨将这些冗余数据按不同的方式排序存储,这样在处理查询时,就可以使用最符合查询模式的那个版本。
Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store. But the big difference is that the row-oriented store keeps every row in one place (in the heap file or a clustered index), and secondary indexes just contain pointers to the matching rows. In a column store, there normally aren’t any pointers to data elsewhere, only columns containing values.
在面向列的存储中具有多个排序顺序有点类似于在面向行的存储中具有多个辅助索引。但是最大的区别在于面向行的存储将每行都保存在一个地方(在堆文件或聚集索引中),而辅助索引只包含指向匹配行的指针。在列存储中,通常没有指向其他数据的指针,只有包含值的列。
Writing to Column-Oriented Storage
These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult.
这些优化在数据仓库中是有意义的,因为大多数负载都是由分析师运行的大型只读查询。列导向存储、压缩和排序都有助于加快这些读查询的速度。然而,它们的缺点是使写更加困难。
An update-in-place approach, like B-trees use, is not possible with compressed columns. If you wanted to insert a row in the middle of a sorted table, you would most likely have to rewrite all the column files. As rows are identified by their position within a column, the insertion has to update all columns consistently.
压缩列不可能使用像B树那样的就地更新方法。如果您想要在排序表的中间插入一行,您很可能不得不重新编写所有列文件。由于行是通过它们在列中的位置来标识的,因此插入操作必须一致地更新所有列。
Fortunately, we have already seen a good solution earlier in this chapter: LSM-trees. All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn’t matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk. This is essentially what Vertica does [ 62 ].
幸运的是,我们已经在本章早些时候看到了一个好的解决方案:LSM树。所有写操作首先都会进入内存存储,添加到一个排序结构中,并做好写入磁盘的准备。无论内存存储是面向行还是面向列都没有关系。当收集到足够的写操作后,它们会与磁盘上的列文件合并,并批量地写入新文件中。这本质上就是Vertica的做法[62]。
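A highly simplified sketch of this write path (in the spirit of the approach described here, not Vertica's actual implementation): writes accumulate in a sorted in-memory buffer, and when enough have arrived they are merged with the rows already in the column files and written back out in bulk. The column names, file format, and flush threshold are all invented.

    import bisect, os

    COLUMNS = ["date_key", "product_sk", "quantity"]
    DIRECTORY = "column_store"

    class ColumnStoreWriter:
        def __init__(self, flush_threshold=1000):
            self.buffer = []                  # in-memory rows, kept in sorted order
            self.flush_threshold = flush_threshold
            os.makedirs(DIRECTORY, exist_ok=True)

        def insert(self, row):
            # Writes go to the sorted in-memory structure first.
            bisect.insort(self.buffer, tuple(row[c] for c in COLUMNS))
            if len(self.buffer) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # Merge buffered rows with the rows already on disk, then rewrite
            # each column file in bulk (no update-in-place of compressed columns).
            merged = sorted(self._read_rows() + self.buffer)
            for i, column in enumerate(COLUMNS):
                with open(os.path.join(DIRECTORY, column + ".col"), "w") as f:
                    f.writelines(str(row[i]) + "\n" for row in merged)
            self.buffer = []

        def _read_rows(self):
            columns = []
            for column in COLUMNS:
                path = os.path.join(DIRECTORY, column + ".col")
                if not os.path.exists(path):
                    return []
                with open(path) as f:
                    columns.append([int(v) for v in f.read().splitlines()])
            return list(zip(*columns))

    writer = ColumnStoreWriter(flush_threshold=2)
    writer.insert({"date_key": 140103, "product_sk": 69, "quantity": 1})
    writer.insert({"date_key": 140102, "product_sk": 31, "quantity": 4})   # triggers a flush
    writer.insert({"date_key": 140102, "product_sk": 69, "quantity": 2})
    writer.flush()
    with open(os.path.join(DIRECTORY, "date_key.col")) as f:
        print(f.read().splitlines())   # ['140102', '140102', '140103']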
Queries need to examine both the column data on disk and the recent writes in memory, and combine the two. However, the query optimizer hides this distinction from the user. From an analyst’s point of view, data that has been modified with inserts, updates, or deletes is immediately reflected in subsequent queries.
查询需要同时检查磁盘上的列数据和最近在内存中的写入,并将两者结合起来。然而,查询优化器会隐藏这种区别。从分析师的角度来看,插入、更新或删除的数据在后续查询中立即反映出来。
Aggregation: Data Cubes and Materialized Views
Not every data warehouse is necessarily a column store: traditional row-oriented databases and a few other architectures are also used. However, columnar storage can be significantly faster for ad hoc analytical queries, so it is rapidly gaining popularity [ 51 , 63 ].
并非所有数据仓库都必须是列式存储:传统的行式数据库和其他一些架构也在使用。然而,列式存储对于即席分析查询可以明显更快,因此正在迅速普及[51, 63]。
Another aspect of data warehouses that is worth mentioning briefly is materialized aggregates. As discussed earlier, data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often?
数据仓库中值得简要提及的另一个方面是物化聚合。如早先所述,数据仓库查询经常涉及SQL中的聚合函数,如COUNT、SUM、AVG、MIN或MAX。如果许多不同的查询使用相同的聚合,每次都通过原始数据进行计算可能是浪费的。为什么不缓存一些最常用的计数或总和?
One way of creating such a cache is a materialized view . In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual view, the SQL engine expands it into the view’s underlying query on the fly and then processes the expanded query.
创建这种缓存的一种方式是物化视图(materialized view)。在关系数据模型中,它通常像标准(虚拟)视图一样定义:一个类似表的对象,其内容是某个查询的结果。不同之处在于,物化视图是查询结果的实际副本,会被写入磁盘;而虚拟视图只是编写查询的一种快捷方式。当你从虚拟视图读取时,SQL引擎会即时将其展开为该视图底层的查询,然后处理展开后的查询。
When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data. The database can do that automatically, but such updates make writes more expensive, which is why materialized views are not often used in OLTP databases. In read-heavy data warehouses they can make more sense (whether or not they actually improve read performance depends on the individual case).
当底层数据发生变化时,需要更新物化视图,因为它是数据的非规范化副本。数据库可以自动执行此操作,但这会导致写入更加昂贵,这就是为什么物化视图在OLTP数据库中并不常用的原因。在读取密集的数据仓库中,它们可能更有意义(无论它们是否实际改善读取性能取决于不同情况)。
A common special case of a materialized view is known as a data cube or OLAP cube [ 64 ]. It is a grid of aggregates grouped by different dimensions. Figure 3-12 shows an example.
一种常见的物化视图特例称为数据立方体或OLAP立方体[64]。 它是按不同维度分组的聚合数据表格。图3-12显示了一个例子。
Imagine for now that each fact has foreign keys to only two dimension tables: in Figure 3-12, these are date and product. You can now draw a two-dimensional table, with dates along one axis and products along the other. Each cell contains the aggregate (e.g., SUM) of an attribute (e.g., net_price) of all facts with that date-product combination. Then you can apply the same aggregate along each row or column and get a summary that has been reduced by one dimension (the sales by product regardless of date, or the sales by date regardless of product).
现在想象一下,每个事实仅有两个维度表的外键 - 在图3-12中,这些是日期和产品。您现在可以绘制一个二维表,一个轴上是日期,另一个轴上是产品。每个单元格包含具有该日期-产品组合的所有事实属性(例如净价)的聚合(例如SUM)。然后,您可以沿每行或列应用相同的聚合并获得减少一个维度的摘要(不考虑产品的销售或不考虑日期的销售)。
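A toy Python sketch of such a two-dimensional cube: each cell holds SUM(net_price) for one (date, product) combination, and summing the cells along either dimension collapses the cube by one dimension. The fact rows (with prices in cents) are invented.

    from collections import defaultdict

    facts = [  # hypothetical fact_sales rows: (date_key, product_sk, net_price in cents)
        (140101, 31, 249), (140101, 69, 1399), (140101, 31, 249),
        (140102, 69, 1499), (140102, 74, 99),
    ]

    # Build the cube: one cell per (date, product) combination, holding SUM(net_price).
    cube = defaultdict(int)
    for date_key, product_sk, net_price in facts:
        cube[(date_key, product_sk)] += net_price

    # Collapse one dimension: total sales per date, regardless of product ...
    sales_by_date = defaultdict(int)
    for (date_key, product_sk), total in cube.items():
        sales_by_date[date_key] += total

    # ... and total sales per product, regardless of date.
    sales_by_product = defaultdict(int)
    for (date_key, product_sk), total in cube.items():
        sales_by_product[product_sk] += total

    print(dict(cube))               # precomputed cells, ready to answer queries without scanning facts
    print(dict(sales_by_date))      # {140101: 1897, 140102: 1598}
    print(dict(sales_by_product))   # {31: 498, 69: 2898, 74: 99}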
In general, facts often have more than two dimensions. In Figure 3-9 there are five dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a five-dimensional hypercube would look like, but the principle remains the same: each cell contains the sales for a particular date-product-store-promotion-customer combination. These values can then repeatedly be summarized along each of the dimensions.
通常来说,事实往往具有超过两个维度。在图3-9中,有五个维度:日期、产品、商店、促销和客户。想象一个五维超立方体是非常困难的,但原则仍然相同:每个单元格包含特定日期-产品-商店-促销-客户组合的销售额。然后可以沿着每个维度反复总结这些值。
The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed. For example, if you want to know the total sales per store yesterday, you just need to look at the totals along the appropriate dimension—no need to scan millions of rows.
物化数据立方体的优点在于,某些查询会变得非常快,因为它们实际上已经被预先计算好了。例如,如果你想知道昨天每家店铺的总销售额,只需要查看相应维度上的合计即可,无需扫描数百万行。
The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data. For example, there is no way of calculating which proportion of sales comes from items that cost more than $100, because the price isn’t one of the dimensions. Most data warehouses therefore try to keep as much raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.
数据立方体的劣势在于它没有像查询原始数据那样的灵活性。例如,没有办法计算销售中由成本高于100美元的商品组成的比例,因为价格不是其中一个维度。因此,大多数数据仓库尽可能地保留原始数据,仅将数据立方体等聚合方式用于某些查询的性能提升。
Summary
In this chapter we tried to get to the bottom of how databases handle storage and retrieval. What happens when you store data in a database, and what does the database do when you query for the data again later?
在这一章中,我们试图深入了解数据库如何处理存储和检索。当您将数据存储在数据库中时会发生什么,以及当您稍后查询数据时数据库会做什么?
On a high level, we saw that storage engines fall into two broad categories: those optimized for transaction processing (OLTP), and those optimized for analytics (OLAP). There are big differences between the access patterns in those use cases:
高层次上,我们看到存储引擎可以分为两个广泛的类别:针对事务处理(OLTP)进行优化的引擎和针对分析(OLAP)进行优化的引擎。这两种用例之间的访问模式存在很大的区别。
-
OLTP systems are typically user-facing, which means that they may see a huge volume of requests. In order to handle the load, applications usually only touch a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.
OLTP系统通常面向用户,这意味着它们可能会面临大量的请求。为了处理负载,应用程序通常只在每个查询中操作少量记录。应用程序使用某种键请求记录,存储引擎使用索引查找所请求的数据。磁盘查找时间通常是瓶颈。
-
Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column-oriented storage is an increasingly popular solution for this kind of workload.
数据仓库和类似的分析系统不那么为人所知,因为它们主要由业务分析师使用,而不是终端用户。它们处理的查询量远低于OLTP系统,但每个查询通常要求很高,需要在短时间内扫描数百万条记录。这里的瓶颈通常是磁盘带宽(而非寻道时间),列式存储是应对这类工作负载的日益流行的解决方案。
On the OLTP side, we saw storage engines from two main schools of thought:
在 OLTP 方面,我们看到两个主要思想派别的存储引擎:
-
The log-structured school, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group.
日志结构化学派,只允许对文件进行追加和删除已过时的文件,但永远不更新已写入的文件。Bitcask、SSTables、LSM-trees、LevelDB、Cassandra、HBase、Lucene和其他属于这个组。
-
The update-in-place school, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees are the biggest example of this philosophy, being used in all major relational databases and also many nonrelational ones.
“就地更新”学派将磁盘视为一组可以被覆写的固定大小页面。B树是这种思想的最大例证,被用于所有主要的关系型数据库,也被许多非关系型数据库所采用。
Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
日志结构化存储引擎是相对较新的发展。它们的关键思想是系统性地将随机访问写入转换为磁盘上的顺序写入;由于硬盘和固态硬盘的性能特征,这可以实现更高的写入吞吐量。
Finishing off the OLTP side, we did a brief tour through some more complicated indexing structures, and databases that are optimized for keeping all data in memory.
在结束OLTP部分之前,我们还简要介绍了一些更复杂的索引结构,以及为将所有数据保存在内存中而优化的数据库。
We then took a detour from the internals of storage engines to look at the high-level architecture of a typical data warehouse. This background illustrated why analytic workloads are so different from OLTP: when your queries require sequentially scanning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. We discussed how column-oriented storage helps achieve this goal.
然后,我们从存储引擎的内部细节转向了典型数据仓库的高层架构。这些背景说明了为什么分析型工作负载与OLTP如此不同:当查询需要在大量行上顺序扫描时,索引的作用就小得多了。相反,重要的是将数据编码得非常紧凑,以尽量减少查询需要从磁盘读取的数据量。我们讨论了列式存储如何帮助实现这一目标。
As an application developer, if you’re armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application. If you need to adjust a database’s tuning parameters, this understanding allows you to imagine what effect a higher or a lower value may have.
作为应用程序开发者,如果你掌握了关于存储引擎内部的知识,你就能更好地知道哪种工具最适合你的特定应用程序。如果你需要调整数据库的调优参数,这种了解能让你想象出较高或较低值可能带来的影响。
Although this chapter couldn’t make you an expert in tuning any one particular storage engine, it has hopefully equipped you with enough vocabulary and ideas that you can make sense of the documentation for the database of your choice.
虽然本章节不能使您成为调优任何一个特定存储引擎的专家,但是它希望能够为您提供足够的词汇和理念,以便您能够理解您选择的数据库的文档。
Footnotes
i If all keys and values had a fixed size, you could use binary search on a segment file and avoid the in-memory index entirely. However, they are usually variable-length in practice, which makes it difficult to tell where one record ends and the next one starts if you don’t have an index.
如果所有的键和值都有固定的大小,你可以在段文件上使用二分查找,完全避免内存索引。然而,实际上它们通常是可变长度的,如果没有索引,很难确定一个记录在哪里结束,下一个记录从哪里开始。
ii Inserting a new key into a B-tree is reasonably intuitive, but deleting one (while keeping the tree balanced) is somewhat more involved [ 2 ].
插入新键到B树中的方法相对直观,但删除一个键(同时保持树的平衡)则较为复杂[2]。
iii This variant is sometimes known as a B + tree, although the optimization is so common that it often isn’t distinguished from other B-tree variants.
这个变体有时被称为B+树,尽管这种优化是如此常见,以至于常常不能与其他B树变体区分开来。
iv The meaning of online in OLAP is unclear; it probably refers to the fact that queries are not just for predefined reports, but that analysts use the OLAP system interactively for explorative queries.
在线分析处理(OLAP)中“在线”的含义不清晰;它可能指的是查询不仅仅是针对预定义的报告,而是分析员通过交互式方式使用OLAP系统进行探索性查询的事实。
References
[ 1 ] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman: Data Structures and Algorithms . Addison-Wesley, 1983. ISBN: 978-0-201-00023-8
[1] 阿尔弗雷德·V·阿霍、约翰·E·霍普克罗夫特和杰弗里·D·阿尔曼:数据结构与算法。Addison-Wesley,1983年。ISBN:978-0-201-00023-8。
[ 2 ] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: Introduction to Algorithms , 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8
[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: 算法导论, 第三版. MIT出版社, 2009. ISBN: 978-0-262-53305-8
[ 3 ] Justin Sheehy and David Smith: “ Bitcask: A Log-Structured Hash Table for Fast Key/Value Data ,” Basho Technologies, April 2010.
[3] Justin Sheehy和David Smith: “Bitcask:快速键/值数据的日志结构哈希表,”Basho Technologies,2010年4月。
[ 4 ] Yinan Li, Bingsheng He, Robin Jun Yang, et al.: “ Tree Indexing on Solid State Drives ,” Proceedings of the VLDB Endowment , volume 3, number 1, pages 1195–1206, September 2010.
[4] Yinan Li, Bingsheng He, Robin Jun Yang等人:“固态硬盘上的树索引”,VLDB Endowment会议论文集,第3卷,第1号,1195–1206页,2010年9月。
[ 5 ] Goetz Graefe: “ Modern B-Tree Techniques ,” Foundations and Trends in Databases , volume 3, number 4, pages 203–402, August 2011. doi:10.1561/1900000028
【5】Goetz Graefe:“现代B-Tree技术”,数据库基础与趋势,第3卷,第4期,2011年8月,第203-402页,doi:10.1561/1900000028。
[ 6 ] Jeffrey Dean and Sanjay Ghemawat: “ LevelDB Implementation Notes ,” leveldb.googlecode.com .
[6] Jeffrey Dean和Sanjay Ghemawat:“LevelDB实现笔记”,leveldb.googlecode.com。
[ 7 ] Dhruba Borthakur: “ The History of RocksDB ,” rocksdb.blogspot.com , November 24, 2013.
[7] Dhruba Borthakur:“RocksDB的历史”,rocksdb.blogspot.com,2013年11月24日。
[ 8 ] Matteo Bertozzi: “ Apache HBase I/O – HFile ,” blog.cloudera.com , June, 29 2012.
[8] Matteo Bertozzi:“Apache HBase I/O – HFile”,blog.cloudera.com,2012年6月29日。
[ 9 ] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “ Bigtable: A Distributed Storage System for Structured Data ,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[9] Fay Chang、Jeffrey Dean、Sanjay Ghemawat等:「Bigtable:用於結構化數據的分散式存儲系統」,收錄於第7屆USENIX操作系統設計和實現研討會(OSDI),2006年11月。
[ 10 ] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil: “ The Log-Structured Merge-Tree (LSM-Tree) ,” Acta Informatica , volume 33, number 4, pages 351–385, June 1996. doi:10.1007/s002360050048
[10] Patrick O'Neil、Edward Cheng、Dieter Gawlick 和 Elizabeth O'Neil:“日志结构化合并树(LSM-Tree)”,Acta Informatica,卷33,编号4,页351-385,1996年6月。 doi:10.1007/s002360050048。
[ 11 ] Mendel Rosenblum and John K. Ousterhout: “ The Design and Implementation of a Log-Structured File System ,” ACM Transactions on Computer Systems , volume 10, number 1, pages 26–52, February 1992. doi:10.1145/146941.146943
"[11] Mendel Rosenblum 和 John K. Ousterhout: “日志结构文件系统的设计和实现”,ACM 计算机系统交易,卷10,号1,页26-52,1992年2月。doi:10.1145/146941.146943"
[ 12 ] Adrien Grand: “ What Is in a Lucene Index? ,” at Lucene/Solr Revolution , November 14, 2013.
[12] Adrien Grand: "Lucene索引中有什么?",于2013年11月14日在Lucene/Solr Revolution上发表。
[ 13 ] Deepak Kandepet: “ Hacking Lucene—The Index Format ,” hackerlabs.org , October 1, 2011.
[13] Deepak Kandepet:“Hacking Lucene-索引格式”,hackerlabs.org,2011年10月1日。
[ 14 ] Michael McCandless: “ Visualizing Lucene’s Segment Merges ,” blog.mikemccandless.com , February 11, 2011.
[14] Michael McCandless:“可视化Lucene的段合并”,blog.mikemccandless.com,2011年2月11日。
[ 15 ] Burton H. Bloom: “ Space/Time Trade-offs in Hash Coding with Allowable Errors ,” Communications of the ACM , volume 13, number 7, pages 422–426, July 1970. doi:10.1145/362686.362692
[15] Burton H.Bloom:"允许错误的哈希编码中的空间/时间折衷",ACM通信,卷13,号7,1970年7月,页码422-426,doi:10.1145/362686.362692。
[ 16 ] “ Operating Cassandra: Compaction ,” Apache Cassandra Documentation v4.0, 2016.
【16】“运行Cassandra:压缩”,Apache Cassandra文档v4.0,2016年。
[ 17 ] Rudolf Bayer and Edward M. McCreight: “ Organization and Maintenance of Large Ordered Indices ,” Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970.
[17] Rudolf Bayer和Edward M. McCreight: “大型有序索引的组织和维护”,波音科学研究实验室,数学和信息科学实验室,报告编号20,1970年7月。
[ 18 ] Douglas Comer: “ The Ubiquitous B-Tree ,” ACM Computing Surveys , volume 11, number 2, pages 121–137, June 1979. doi:10.1145/356770.356776
[18] Douglas Comer:“B-树的无处不在”,ACM计算机调查,第11卷,第2期,页121-137,1979年6月。doi:10.1145/356770.356776
[ 19 ] Emmanuel Goossaert: “ Coding for SSDs ,” codecapsule.com , February 12, 2014.
"[19] Emmanuel Goossaert: “SSD编码”,codecapsule.com,2014年2月12日。"
[ 20 ] C. Mohan and Frank Levine: “ ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging ,” at ACM International Conference on Management of Data (SIGMOD), June 1992. doi:10.1145/130283.130338
[20] C. Mohan 和 Frank Levine:“ARIES/IM:一种使用预写日志的高效且高并发的索引管理方法”,发表于ACM数据管理国际会议(SIGMOD),1992年6月。doi:10.1145/130283.130338
[ 21 ] Howard Chu: “ LDAP at Lightning Speed ,” at Build Stuff ’14 , November 2014.
[21] Howard Chu:'Lightning Speed下的LDAP',于2014年11月在Build Stuff '14展会上。
[ 22 ] Bradley C. Kuszmaul: “ A Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees ,” tokutek.com , April 22, 2014.
[22] Bradley C. Kuszmaul:“分形树与日志结构合并树(LSM)的比较”,tokutek.com,2014年4月22日。
[ 23 ] Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, et al.: “ Designing Access Methods: The RUM Conjecture ,” at 19th International Conference on Extending Database Technology (EDBT), March 2016. doi:10.5441/002/edbt.2016.42
[23] Manos Athanassoulis, Michael S. Kester, Lukas M. Maas等人:「设计访问方法:RUM猜想」,发表于第19届国际数据库技术扩展会议(EDBT),2016年3月。doi:10.5441/002/edbt.2016.42
[ 24 ] Peter Zaitsev: “ Innodb Double Write ,” percona.com , August 4, 2006.
"Innodb Double Write",彼得·扎伊采夫, percona.com,2006 年 8 月 4 日。
[ 25 ] Tomas Vondra: “ On the Impact of Full-Page Writes ,” blog.2ndquadrant.com , November 23, 2016.
[25] Tomas Vondra:“全页写入对性能的影响”,blog.2ndquadrant.com,2016年11月23日。
[ 26 ] Mark Callaghan: “ The Advantages of an LSM vs a B-Tree ,” smalldatum.blogspot.co.uk , January 19, 2016.
[26] Mark Callaghan: “LSM相对于B树的优势”,smalldatum.blogspot.co.uk, 2016年1月19日。
[ 27 ] Mark Callaghan: “ Choosing Between Efficiency and Performance with RocksDB ,” at Code Mesh , November 4, 2016.
[27] Mark Callaghan: 在Code Mesh会议上的“在Efficiency和Performance之间选择RocksDB”演讲,日期为2016年11月4日。
[ 28 ] Michi Mutsuzaki: “ MySQL vs. LevelDB ,” github.com , August 2011.
[28] Michi Mutsuzaki:“MySQL vs. LevelDB”,github.com,2011年8月。
[ 29 ] Benjamin Coverston, Jonathan Ellis, et al.: “ CASSANDRA-1608: Redesigned Compaction , issues.apache.org , July 2011.
[29] 本杰明·科尔弗斯顿、乔纳森·埃利斯等:“CASSANDRA-1608: 重新设计压缩”,issues.apache.org,2011年7月。
[ 30 ] Igor Canadi, Siying Dong, and Mark Callaghan: “ RocksDB Tuning Guide ,” github.com , 2016.
"RocksDB调优指南" Igor Canadi, Siying Dong和Mark Callaghan: github.com,2016年。
[ 31 ] MySQL 5.7 Reference Manual . Oracle, 2014.
[31] MySQL 5.7 参考手册。Oracle,2014年。
[ 32 ] Books Online for SQL Server 2012 . Microsoft, 2012.
[32]本书 SQL Server 2012 在线版。微软,2012年。
[ 33 ] Joe Webb: “ Using Covering Indexes to Improve Query Performance ,” simple-talk.com , 29 September 2008.
[33] Joe Webb:“使用覆盖索引来提高查询性能”,simple-talk.com,2008年9月29日。
[ 34 ] Frank Ramsak, Volker Markl, Robert Fenk, et al.: “ Integrating the UB-Tree into a Database System Kernel ,” at 26th International Conference on Very Large Data Bases (VLDB), September 2000.
[34] Frank Ramsak, Volker Markl, Robert Fenk等人: “将UB-Tree集成到数据库系统内核中”, 收录于第26届大数据国际会议(VLDB),2000年9月。
[ 35 ] The PostGIS Development Group: “ PostGIS 2.1.2dev Manual ,” postgis.net , 2014.
[35] The PostGIS 开发团队: “PostGIS 2.1.2dev 手册”, postgis.net,2014年。
[ 36 ] Robert Escriva, Bernard Wong, and Emin Gün Sirer: “ HyperDex: A Distributed, Searchable Key-Value Store ,” at ACM SIGCOMM Conference , August 2012. doi:10.1145/2377677.2377681
[36] Robert Escriva, Bernard Wong, and Emin Gün Sirer:"HyperDex:分布式、可搜索的键值存储",发表于ACM SIGCOMM会议,2012年8月。doi:10.1145/2377677.2377681。
[ 37 ] Michael McCandless: “ Lucene’s FuzzyQuery Is 100 Times Faster in 4.0 ,” blog.mikemccandless.com , March 24, 2011.
[37] Michael McCandless: “Lucene的FuzzyQuery在4.0中快了一百倍”,博客mikemccandless.com,2011年3月24日。
[ 38 ] Steffen Heinz, Justin Zobel, and Hugh E. Williams: “ Burst Tries: A Fast, Efficient Data Structure for String Keys ,” ACM Transactions on Information Systems , volume 20, number 2, pages 192–223, April 2002. doi:10.1145/506309.506312
[38] Steffen Heinz、Justin Zobel和Hugh E. Williams:《Burst Tries: 一种用于字符串键的快速高效的数据结构》,ACM Transactions on Information Systems,第20卷,第2期,页码192-223,2002年4月。doi:10.1145/506309.506312。
[ 39 ] Klaus U. Schulz and Stoyan Mihov: “ Fast String Correction with Levenshtein Automata ,” International Journal on Document Analysis and Recognition , volume 5, number 1, pages 67–85, November 2002. doi:10.1007/s10032-002-0082-8
【39】 Klaus U. Schulz和Stoyan Mihov:“利用Levenshtein自动机实现快速字符串纠错”,《国际文件分析与识别杂志》,第5卷第1期,2002年11月,第67-85页。doi:10.1007/s10032-002-0082-8。
[ 40 ] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval . Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at nlp.stanford.edu/IR-book
[40] 克里斯托弗·D·曼宁、普拉巴卡尔·拉加万和欣里奇·史密斯: 《信息检索导论》。 剑桥大学出版社,2008年。 ISBN:978-0-521-86571-5,可在线访问nlp.stanford.edu/IR-book。
[ 41 ] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “ The End of an Architectural Era (It’s Time for a Complete Rewrite) ,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[41] 迈克尔·斯通布雷克(Michael Stonebraker)、塞缪尔·马登(Samuel Madden)、丹尼尔·J·阿巴迪(Daniel J. Abadi)等人: “一个架构时代的终结(是时候进行彻底重写了),”于2007年9月举行的第33届极大数据库国际会议(VLDB)上。
[ 42 ] “ VoltDB Technical Overview White Paper ,” VoltDB, 2014.
[42] “VoltDB技术概述白皮书”,VoltDB,2014年。
[ 43 ] Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout: “ Log-Structured Memory for DRAM-Based Storage ,” at 12th USENIX Conference on File and Storage Technologies (FAST), February 2014.
【43】Stephen M. Rumble,Ankita Kejriwal和John K. Ousterhout:“基于DRAM存储器的日志结构化内存”,发表于2014年2月的第12届USENIX文件和存储技术会议(FAST)。
[ 44 ] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: “ OLTP Through the Looking Glass, and What We Found There ,” at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376713
[44] Stavros Harizopoulos,Daniel J. Abadi,Samuel Madden和Michael Stonebraker: “OLTP穿越镜子,我们在其中发现了什么”,ACM数据管理国际会议(SIGMOD),2008年6月。 doi:10.1145/1376616.1376713
[ 45 ] Justin DeBrabant, Andrew Pavlo, Stephen Tu, et al.: “ Anti-Caching: A New Approach to Database Management System Architecture ,” Proceedings of the VLDB Endowment , volume 6, number 14, pages 1942–1953, September 2013.
[45] Justin DeBrabant,Andrew Pavlo,Stephen Tu等: “反缓存:一种新的数据库管理系统架构方法”, 《VLDB年鉴》第6卷,第14期,2013年9月, 页码1942-1953。
[ 46 ] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor: “ Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems ,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2749441
“让我们谈论非易失性内存数据库系统的存储和恢复方法”,Joy Arulraj、Andrew Pavlo和Subramanya R. Dulloor在2015年6月的ACM数据管理国际会议(SIGMOD)上发表。doi:10.1145/2723372.2749441
[ 47 ] Edgar F. Codd, S. B. Codd, and C. T. Salley: “ Providing OLAP to User-Analysts: An IT Mandate ,” E. F. Codd Associates, 1993.
"[47] Edgar F. Codd, S. B. Codd, and C. T. Salley: "为用户分析员提供OLAP:IT命令",E.F. Codd Associates,1993年。"
[ 48 ] Surajit Chaudhuri and Umeshwar Dayal: “ An Overview of Data Warehousing and OLAP Technology ,” ACM SIGMOD Record , volume 26, number 1, pages 65–74, March 1997. doi:10.1145/248603.248616
[48] Surajit Chaudhuri和Umeshwar Dayal:“数据仓库和OLAP技术概述,”ACM SIGMOD Record,第26卷,第1期,页65-74,1997年3月。 doi:10.1145 / 248603.248616。
[ 49 ] Per-Åke Larson, Cipri Clinciu, Campbell Fraser, et al.: “ Enhancements to SQL Server Column Stores ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[49] Per-Åke Larson, Cipri Clinciu, Campbell Fraser等人:“增强SQL Server 列存储”,于2013年6月ACM数据管理国际会议(SIGMOD)发表。
[ 50 ] Franz Färber, Norman May, Wolfgang Lehner, et al.: “ The SAP HANA Database – An Architecture Overview ,” IEEE Data Engineering Bulletin , volume 35, number 1, pages 28–33, March 2012.
“SAP HANA数据库——架构概述”,作者:Franz Färber,Norman May,Wolfgang Lehner等(共50人),发表于2012年3月,IEEE数据工程通报第35卷第1期,第28-33页。
[ 51 ] Michael Stonebraker: “ The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong ,” presentation at EPFL , May 2013.
[51] Michael Stonebraker:“传统的关系型数据库管理系统的智慧(几乎可以肯定)都是错的”,2013年5月在EPFL的演讲。
[ 52 ] Daniel J. Abadi: “ Classifying the SQL-on-Hadoop Solutions ,” hadapt.com , October 2, 2013.
"[52] Daniel J. Abadi: “对 Hadoop 上的 SQL 解决方案进行分类”, hadapt.com,2013年10月2日。" Note: This is the translated version of the given text.
[ 53 ] Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “ Impala: A Modern, Open-Source SQL Engine for Hadoop ,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.
"“Impala: A Modern, Open-Source SQL Engine for Hadoop,” 在第七届创新数据系统研究双年会 (CIDR),2015年1月发布。"
[ 54 ] Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: “ Dremel: Interactive Analysis of Web-Scale Datasets ,” at 36th International Conference on Very Large Data Bases (VLDB), pages 330–339, September 2010.
[54] Sergey Melnik, Andrey Gubarev, Jing Jing Long等人: “Dremel:Web规模数据的交互式分析”, 在36届国际大数据(VLDB) 会议上发布,页码为 330-339,2010年9月。
[ 55 ] Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling , 3rd edition. John Wiley & Sons, July 2013. ISBN: 978-1-118-53080-1
[55] Ralph Kimball 和 Margy Ross:数据仓库工具箱:维度建模的权威指南,第三版。John Wiley & Sons,2013年7月。ISBN:978-1-118-53080-1。
[ 56 ] Derrick Harris: “ Why Apple, eBay, and Walmart Have Some of the Biggest Data Warehouses You’ve Ever Seen ,” gigaom.com , March 27, 2013.
"为什么苹果、eBay和沃尔玛拥有你从未见过的最大数据仓库",gigaom.com,2013年3月27日。"
[ 57 ] Julien Le Dem: “ Dremel Made Simple with Parquet ,” blog.twitter.com , September 11, 2013.
[57] Julien Le Dem:“使用 Parquet 简化 Dremel”,blog.twitter.com,2013 年 9 月 11 日。
[ 58 ] Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, et al.: “ The Design and Implementation of Modern Column-Oriented Database Systems ,” Foundations and Trends in Databases , volume 5, number 3, pages 197–280, December 2013. doi:10.1561/1900000024
[58] Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos等人:《现代基于列的数据库系统的设计与实现》, Foundations and Trends in Databases, 2013年12月,卷5,第3期,197-280页。 doi:10.1561/1900000024
[ 59 ] Peter Boncz, Marcin Zukowski, and Niels Nes: “ MonetDB/X100: Hyper-Pipelining Query Execution ,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.
[59] Peter Boncz, Marcin Zukowski 和 Niels Nes:“MonetDB/X100:超流水线化查询执行”,发表于第二届创新数据系统研究双年会(CIDR),2005年1月。
[ 60 ] Jingren Zhou and Kenneth A. Ross: “ Implementing Database Operations Using SIMD Instructions ,” at ACM International Conference on Management of Data (SIGMOD), pages 145–156, June 2002. doi:10.1145/564691.564709
[60] Jingren Zhou 和 Kenneth A. Ross:“使用SIMD指令实现数据库操作”,发表于ACM数据管理国际会议(SIGMOD),2002年6月,第145–156页。doi:10.1145/564691.564709
[ 61 ] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, et al.: “ C-Store: A Column-oriented DBMS ,” at 31st International Conference on Very Large Data Bases (VLDB), pages 553–564, September 2005.
[61] Michael Stonebraker, Daniel J. Abadi, Adam Batkin等人: “C-Store:一种列式数据库管理系统”, 于第31届国际大型数据库会议(VLDB)上,2005年9月,第553-564页。
[ 62 ] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, et al.: “ The Vertica Analytic Database: C-Store 7 Years Later ,” Proceedings of the VLDB Endowment , volume 5, number 12, pages 1790–1801, August 2012.
[62] Andrew Lamb,Matt Fuller,Ramakrishna Varadarajan等人: 《Vertica分析数据库:七年后的C-Store》,VLDB Endowment会议录,第5卷,第12期,1790-1801页,2012年8月。
[ 63 ] Julien Le Dem and Nong Li: “ Efficient Data Storage for Analytics with Apache Parquet 2.0 ,” at Hadoop Summit , San Jose, June 2014.
"Julien Le Dem和Nong Li:在Hadoop峰会上提出的“Apache Parquet 2.0用于分析的高效数据存储”演讲,于2014年6月在圣何塞举行。"
[ 64 ] Jim Gray, Surajit Chaudhuri, Adam Bosworth, et al.: “ Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals ,” Data Mining and Knowledge Discovery , volume 1, number 1, pages 29–53, March 2007. doi:10.1023/A:1009726021843
[64] Jim Gray, Surajit Chaudhuri, Adam Bosworth等: “数据立方体:一个泛化Group-By,Cross-Tab和Sub-Totals的关系聚合运算符”, 《数据挖掘与知识发现》,第1卷,第1期,页29–53,2007年3月。 doi:10.1023/A:1009726021843