Chapter 7. Transactions
Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.
James Corbett et al., Spanner: Google’s Globally-Distributed Database (2012)
In the harsh reality of data systems, many things can go wrong:
- The database software or hardware may fail at any time (including in the middle of a write operation).
- The application may crash at any time (including halfway through a series of operations).
- Interruptions in the network can unexpectedly cut off the application from the database, or one database node from another.
- Several clients may write to the database at the same time, overwriting each other’s changes.
- A client may read data that doesn’t make sense because it has only partially been updated.
- Race conditions between clients can cause surprising bugs.
In order to be reliable, a system has to deal with these faults and ensure that they don’t cause catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of testing to ensure that the solution actually works.
For decades, transactions have been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback). If it fails, the application can safely retry. With transactions, error handling becomes much simpler for an application, because it doesn’t need to worry about partial failure—i.e., the case where some operations succeed and some fail (for whatever reason).
If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them for granted. Transactions are not a law of nature; they were created with a purpose, namely to simplify the programming model for applications accessing a database. By using transactions, the application is free to ignore certain potential error scenarios and concurrency issues, because the database takes care of them instead (we call these safety guarantees ).
Not every application needs transactions, and sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability). Some safety properties can be achieved without transactions.
How do you figure out whether you need transactions? In order to answer that question, we first need to understand exactly what safety guarantees transactions can provide, and what costs are associated with them. Although transactions seem straightforward at first glance, there are actually many subtle but important details that come into play.
In this chapter, we will examine many examples of things that can go wrong, and explore the algorithms that databases use to guard against those issues. We will go especially deep in the area of concurrency control, discussing various kinds of race conditions that can occur and how databases implement isolation levels such as read committed, snapshot isolation, and serializability.
This chapter applies to both single-node and distributed databases; in Chapter 8 we will focus the discussion on the particular challenges that arise only in distributed systems.
The Slippery Concept of a Transaction
Almost all relational databases today, and some nonrelational databases, support transactions. Most of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [1, 2, 3]. Although some implementation details have changed, the general idea has remained virtually the same for 40 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily similar to that of System R.
In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to improve upon the relational status quo by offering a choice of new data models (see Chapter 2), and by including replication (Chapter 5) and partitioning (Chapter 6) by default. Transactions were the main casualty of this movement: many of this new generation of databases abandoned transactions entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood [4].
With the hype around this new crop of distributed databases, there emerged a popular belief that transactions were the antithesis of scalability, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability [5, 6]. On the other hand, transactional guarantees are sometimes presented by database vendors as an essential requirement for “serious applications” with “valuable data.” Both viewpoints are pure hyperbole.
The truth is not that simple: like every other technical design choice, transactions have advantages and limitations. In order to understand those trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal operation and in various extreme (but realistic) circumstances.
The Meaning of ACID
The safety guarantees provided by transactions are often described by the well-known acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability. It was coined in 1983 by Theo Härder and Andreas Reuter [7] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
However, in practice, one database’s implementation of ACID does not equal another’s implementation. For example, as we shall see, there is a lot of ambiguity around the meaning of isolation [8]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.
(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency [9]. This is even more vague than the definition of ACID. It seems that the only sensible definition of BASE is “not ACID”; i.e., it can mean almost anything you want.)
Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let us refine our idea of transactions.
Atomicity
In general, atomic refers to something that cannot be broken down into smaller parts. The word means similar but subtly different things in different branches of computing. For example, in multi-threaded programming, if one thread executes an atomic operation, that means there is no way that another thread could see the half-finished result of the operation. The system can only be in the state it was before the operation or after the operation, not something in between.
By contrast, in the context of ACID, atomicity is not about concurrency. It does not describe what happens if several processes try to access the same data at the same time, because that is covered under the letter I, for isolation (see “Isolation”).
Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed—for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.
Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to know which changes have taken effect and which haven’t. The application could try again, but that risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies this problem: if a transaction was aborted, the application can be sure that it didn’t change anything, so it can safely be retried.
The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity. Perhaps abortability would have been a better term than atomicity, but we will stick with atomicity since that’s the usual word.
Consistency
The word consistency is terribly overloaded:
- In Chapter 5 we discussed replica consistency and the issue of eventual consistency that arises in asynchronously replicated systems (see “Problems with Replication Lag”).
- Consistent hashing is an approach to partitioning that some systems use for rebalancing (see “Consistent Hashing”).
- In the CAP theorem (see Chapter 9), the word consistency is used to mean linearizability (see “Linearizability”).
- In the context of ACID, consistency refers to an application-specific notion of the database being in a “good state.”
It’s unfortunate that the same word is used with at least four different meanings.
The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true—for example, in an accounting system, credits and debits across all accounts must always be balanced. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied.
However, this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. This is not something that the database can guarantee: if you write bad data that violates your invariants, the database can’t stop you. (Some specific kinds of invariants can be checked by the database, for example using foreign key constraints or uniqueness constraints. However, in general, the application defines what data is valid or invalid—the database only stores it.)
Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID. i
Isolation
Most databases are accessed by several clients at the same time. That is no problem if they are reading and writing different parts of the database, but if they are accessing the same database records, you can run into concurrency problems (race conditions).
Figure 7-1 is a simple example of this kind of problem. Say you have two clients simultaneously incrementing a counter that is stored in a database. Each client needs to read the current value, add 1, and write the new value back (assuming there is no increment operation built into the database). In Figure 7-1 the counter should have increased from 42 to 44, because two increments happened, but it actually only went to 43 because of the race condition.
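The unlucky interleaving in Figure 7-1 can be replayed deterministically in a few lines (a plain dict standing in for the database; the names are purely illustrative):

```python
# Deterministic replay of the lost-update race in Figure 7-1.
# Both clients read the current value before either writes back.
counter = {"value": 42}

read_by_a = counter["value"]        # client A reads 42
read_by_b = counter["value"]        # client B also reads 42

counter["value"] = read_by_a + 1    # client A writes back 43
counter["value"] = read_by_b + 1    # client B writes back 43, clobbering A's increment

print(counter["value"])  # 43 rather than the expected 44
```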
Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability , which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently [ 10 ].
ACID中的隔离是指并发执行的事务相互隔离,它们不能相互影响。经典的数据库教材将隔离形式化为串行化,这意味着每个事务都可以假装它是整个数据库上运行的唯一事务。数据库确保在事务提交时,结果与它们串行运行(一个接一个地)时的结果相同,即使事实上它们可能已经并发运行[10]。
However, in practice, serializable isolation is rarely used, because it carries a performance penalty. Some popular databases, such as Oracle 11g, don’t even implement it. In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation, which is a weaker guarantee than serializability [8, 11]. We will explore snapshot isolation and other forms of isolation in “Weak Isolation Levels”.
Durability
The purpose of a database system is to provide a safe place where data can be stored without fear of losing it. Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
In a single-node database, durability typically means that the data has been written to nonvolatile storage such as a hard drive or SSD. It usually also involves a write-ahead log or similar (see “Making B-trees reliable”), which allows recovery in the event that the data structures on disk are corrupted. In a replicated database, durability may mean that the data has been successfully copied to some number of nodes. In order to provide a durability guarantee, a database must wait until these writes or replications are complete before reporting a transaction as successfully committed.
As discussed in “Reliability”, perfect durability does not exist: if all your hard disks and all your backups are destroyed at the same time, there’s obviously nothing your database can do to save you.
Single-Object and Multi-Object Operations
To recap, in ACID, atomicity and isolation describe what the database should do if a client makes several writes within the same transaction:
- Atomicity: If an error occurs halfway through a sequence of writes, the transaction should be aborted, and the writes made up to that point should be discarded. In other words, the database saves you from having to worry about partial failure, by giving an all-or-nothing guarantee.
- Isolation: Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset.
These definitions assume that you want to modify several objects (rows, documents, records) at once. Such multi-object transactions are often needed if several pieces of data need to be kept in sync. Figure 7-2 shows an example from an email application. To display the number of unread messages for a user, you could query something like:
    SELECT COUNT(*) FROM emails
    WHERE recipient_id = 2 AND unread_flag = true
However, you might find this query to be too slow if there are many emails, and decide to store the number of unread messages in a separate field (a kind of denormalization). Now, whenever a new message comes in, you have to increment the unread counter as well, and whenever a message is marked as read, you also have to decrement the unread counter.
In Figure 7-2 , user 2 experiences an anomaly: the mailbox listing shows an unread message, but the counter shows zero unread messages because the counter increment has not yet happened. ii Isolation would have prevented this issue by ensuring that user 2 sees either both the inserted email and the updated counter, or neither, but not an inconsistent halfway point.
Figure 7-3 illustrates the need for atomicity: if an error occurs somewhere over the course of the transaction, the contents of the mailbox and the unread counter might become out of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted and the inserted email is rolled back.
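The all-or-nothing behavior Figure 7-3 calls for can be sketched with SQLite via Python’s built-in sqlite3 module (the schema here is made up for illustration):

```python
import sqlite3

# Hypothetical schema for the mailbox example: an emails table plus a
# denormalized unread counter that must be kept in sync.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (recipient_id INT, body TEXT, unread_flag INT)")
conn.execute("CREATE TABLE counters (recipient_id INT PRIMARY KEY, unread INT)")
conn.execute("INSERT INTO counters VALUES (2, 0)")
conn.commit()

# Happy path: both writes commit together.
with conn:  # BEGIN ... COMMIT; an exception inside triggers ROLLBACK instead
    conn.execute("INSERT INTO emails VALUES (2, 'Hello', 1)")
    conn.execute("UPDATE counters SET unread = unread + 1 WHERE recipient_id = 2")

# Simulated fault partway through: the email is inserted, then an error
# aborts the transaction before the counter update, so the insert is undone.
try:
    with conn:
        conn.execute("INSERT INTO emails VALUES (2, 'Second email', 1)")
        raise RuntimeError("simulated fault before the counter update")
except RuntimeError:
    pass  # transaction aborted; it would be safe to retry

emails = conn.execute("SELECT COUNT(*) FROM emails WHERE recipient_id = 2").fetchone()[0]
unread = conn.execute("SELECT unread FROM counters WHERE recipient_id = 2").fetchone()[0]
print(emails, unread)  # 1 1: mailbox and counter remain in sync
```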
Multi-object transactions require some way of determining which read and write operations belong to the same transaction. In relational databases, that is typically done based on the client’s TCP connection to the database server: on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction. iii
On the other hand, many nonrelational databases don’t have such a way of grouping operations together. Even if there is a multi-object API (for example, a key-value store may have a multi-put operation that updates several keys in one operation), that doesn’t necessarily mean it has transaction semantics: the command may succeed for some keys and fail for others, leaving the database in a partially updated state.
Single-object writes
Atomicity and isolation also apply when a single object is being changed. For example, imagine you are writing a 20 KB JSON document to a database:
- If the network connection is interrupted after the first 10 KB have been sent, does the database store that unparseable 10 KB fragment of JSON?
- If the power fails while the database is in the middle of overwriting the previous value on disk, do you end up with the old and new values spliced together?
- If another client reads that document while the write is in progress, will it see a partially updated value?
Those issues would be incredibly confusing, so storage engines almost universally aim to provide atomicity and isolation on the level of a single object (such as a key-value pair) on one node. Atomicity can be implemented using a log for crash recovery (see “Making B-trees reliable”), and isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).
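That last idea, a lock serializing access to an object, can be illustrated with a toy in-memory store (real storage engines are far more elaborate, but the principle is the same):

```python
import threading

# Toy in-memory store illustrating single-object isolation via locking:
# at most one thread performs a read-modify-write at a time.
class LockingStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def increment(self, key, delta=1):
        with self._lock:  # the read-modify-write happens atomically
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]

store = LockingStore()

def worker():
    for _ in range(1000):
        store.increment("counter")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = store.increment("counter", 0)  # read back without changing the value
print(total)  # 4000: no increments were lost
```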
Some databases also provide more complex atomic operations, iv such as an increment operation, which removes the need for a read-modify-write cycle like that in Figure 7-1. Similarly popular is a compare-and-set operation, which allows a write to happen only if the value has not been concurrently changed by someone else (see “Compare-and-set”).
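A compare-and-set write can be sketched as follows (a plain in-memory dictionary, not any particular database’s API; in a real engine the check and the write would themselves be executed atomically):

```python
# In-memory sketch of compare-and-set: the write succeeds only if the value
# has not changed since the caller read it.
def compare_and_set(store, key, expected, new_value):
    if store.get(key) == expected:
        store[key] = new_value
        return True
    return False

store = {"counter": 42}

old = store["counter"]                        # read 42
ok = compare_and_set(store, "counter", old, old + 1)
print(ok, store["counter"])                   # True 43

# A client still holding the stale value 42 now fails instead of clobbering:
stale = compare_and_set(store, "counter", 42, 43)
print(stale, store["counter"])                # False 43
```

The stale client can then re-read the current value and decide whether to retry, which is exactly how compare-and-set prevents the lost update of Figure 7-1.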
These single-object operations are useful, as they can prevent lost updates when several clients try to write to the same object concurrently (see “Preventing Lost Updates”). However, they are not transactions in the usual sense of the word. Compare-and-set and other single-object operations have been dubbed “lightweight transactions” or even “ACID” for marketing purposes [20, 21, 22], but that terminology is misleading. A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.
The need for multi-object transactions
Many distributed datastores have abandoned multi-object transactions because they are difficult to implement across partitions, and they can get in the way in some scenarios where very high availability or performance is required. However, there is nothing that fundamentally prevents transactions in a distributed database, and we will discuss implementations of distributed transactions in Chapter 9.
But do we need multi-object transactions at all? Would it be possible to implement any application with only a key-value data model and single-object operations?
There are some use cases in which single-object inserts, updates, and deletes are sufficient. However, in many other cases writes to several different objects need to be coordinated:
- In a relational data model, a row in one table often has a foreign key reference to a row in another table. (Similarly, in a graph-like data model, a vertex has edges to other vertices.) Multi-object transactions allow you to ensure that these references remain valid: when inserting several records that refer to one another, the foreign keys have to be correct and up to date, or the data becomes nonsensical.
- In a document data model, the fields that need to be updated together are often within the same document, which is treated as a single object—no multi-object transactions are needed when updating a single document. However, document databases lacking join functionality also encourage denormalization (see “Relational Versus Document Databases Today”). When denormalized information needs to be updated, like in the example of Figure 7-2, you need to update several documents in one go. Transactions are very useful in this situation to prevent denormalized data from going out of sync.
- In databases with secondary indexes (almost everything except pure key-value stores), the indexes also need to be updated every time you change a value. These indexes are different database objects from a transaction point of view: for example, without transaction isolation, it’s possible for a record to appear in one index but not another, because the update to the second index hasn’t happened yet.
Such applications can still be implemented without transactions. However, error handling becomes much more complicated without atomicity, and the lack of isolation can cause concurrency problems. We will discuss those in “Weak Isolation Levels”, and explore alternative approaches in Chapter 12.
Handling errors and aborts
A key feature of a transaction is that it can be aborted and safely retried if an error occurred. ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished.
Not all systems follow that philosophy, though. In particular, datastores with leaderless replication (see “Leaderless Replication”) work much more on a “best effort” basis, which could be summarized as “the database will do as much as it can, and if it runs into an error, it won’t undo something it has already done”—so it’s the application’s responsibility to recover from errors.
Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling. For example, popular object-relational mapping (ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message. This is a shame, because the whole point of aborts is to enable safe retries.
Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:
- If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
- If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).
- It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.
- If the transaction also has side effects outside of the database, those side effects may happen even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to send the email again every time you retry the transaction. If you want to make sure that several different systems either commit or abort together, two-phase commit can help (we will discuss this in “Atomic Commit and Two-Phase Commit (2PC)”).
- If the client process fails while retrying, any data it was trying to write to the database is lost.
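The caveats above about bounded retries, backoff, and transient-versus-permanent errors suggest a retry wrapper along these lines (the exception classes and helper name are hypothetical; real database drivers signal transient failures with their own error types or SQLSTATE codes):

```python
import random
import time

class TransientError(Exception):
    """e.g. deadlock, serialization failure, brief network interruption."""

class PermanentError(Exception):
    """e.g. constraint violation: retrying cannot help."""

def run_with_retries(transaction, max_retries=3, base_delay=0.01):
    """Run a transaction, retrying transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return transaction()
        except PermanentError:
            raise                    # pointless to retry
        except TransientError:
            if attempt == max_retries:
                raise                # give up: bounded number of retries
            # Exponential backoff with jitter avoids retry storms under overload.
            time.sleep(base_delay * (2 ** attempt) * random.random())

attempts = []
def flaky_transaction():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("deadlock detected")
    return "committed"

result = run_with_retries(flaky_transaction)
print(result, len(attempts))  # committed 3
```

Note that this only addresses retries; the remaining caveats (duplicate commits after a lost acknowledgment, side effects outside the database) need deduplication or two-phase commit, as discussed above.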
Weak Isolation Levels
If two transactions don’t touch the same data, they can safely be run in parallel, because neither depends on the other. Concurrency issues (race conditions) only come into play when one transaction reads data that is concurrently modified by another transaction, or when two transactions try to simultaneously modify the same data.
Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to reproduce. Concurrency is also very difficult to reason about, especially in a large application where you don’t necessarily know which other pieces of code are accessing the database. Application development is difficult enough if you just have one user at a time; having many concurrent users makes it much harder still, because any piece of data could unexpectedly change at any time.
For that reason, databases have long tried to hide concurrency issues from application developers by providing transaction isolation . In theory, isolation should make your life easier by letting you pretend that no concurrency is happening: serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially (i.e., one at a time, without any concurrency).
In practice, isolation is unfortunately not that simple. Serializable isolation has a performance cost, and many databases don’t want to pay that price [ 8 ]. It’s therefore common for systems to use weaker levels of isolation, which protect against some concurrency issues, but not all. Those levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are nevertheless used in practice [ 23 ].
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have caused substantial loss of money [ 24 , 25 ], led to investigation by financial auditors [ 26 ], and caused customer data to be corrupted [ 27 ]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling financial data!”—but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these bugs from occurring.
Rather than blindly relying on tools, we need to develop a good understanding of the kinds of concurrency problems that exist, and how to prevent them. Then we can build applications that are reliable and correct, using the tools at our disposal.
In this section we will look at several weak (nonserializable) isolation levels that are used in practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can decide what level is appropriate to your application. Once we’ve done that, we will discuss serializability in detail (see “Serializability” ). Our discussion of isolation levels will be informal, using examples. If you want rigorous definitions and analyses of their properties, you can find them in the academic literature [ 28 , 29 , 30 ].
Read Committed
The most basic level of transaction isolation is read committed. It makes two guarantees:
-
When reading from the database, you will only see data that has been committed (no dirty reads ).
-
When writing to the database, you will only overwrite data that has been committed (no dirty writes ).
Let’s discuss these two guarantees in more detail.
No dirty reads
Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted. Can another transaction see that uncommitted data? If yes, that is called a dirty read [ 2 ].
Transactions running at the read committed isolation level must prevent dirty reads. This means that any writes by a transaction only become visible to others when that transaction commits (and then all of its writes become visible at once). This is illustrated in Figure 7-4 , where user 1 has set x = 3, but user 2’s get x still returns the old value, 2, while user 1 has not yet committed.
There are a few reasons why it’s useful to prevent dirty reads:
-
If a transaction needs to update several objects, a dirty read means that another transaction may see some of the updates but not others. For example, in Figure 7-2 , the user sees the new unread email but not the updated counter. This is a dirty read of the email. Seeing the database in a partially updated state is confusing to users and may cause other transactions to take incorrect decisions.
-
If a transaction aborts, any writes it has made need to be rolled back (like in Figure 7-3 ). If the database allows dirty reads, that means a transaction may see data that is later rolled back—i.e., which is never actually committed to the database. Reasoning about the consequences quickly becomes mind-bending.
No dirty writes
What happens if two transactions concurrently try to update the same object in a database? We don’t know in which order the writes will happen, but we normally assume that the later write overwrites the earlier write.
However, what happens if the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value? This is called a dirty write [ 28 ]. Transactions running at the read committed isolation level must prevent dirty writes, usually by delaying the second write until the first write’s transaction has committed or aborted.
By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:
-
If transactions update multiple objects, dirty writes can lead to a bad outcome. For example, consider Figure 7-5 , which illustrates a used car sales website on which two people, Alice and Bob, are simultaneously trying to buy the same car. Buying a car requires two database writes: the listing on the website needs to be updated to reflect the buyer, and the sales invoice needs to be sent to the buyer. In the case of Figure 7-5 , the sale is awarded to Bob (because he performs the winning update to the listings table), but the invoice is sent to Alice (because she performs the winning update to the invoices table). Read committed prevents such mishaps.
-
However, read committed does not prevent the race condition between two counter increments in Figure 7-1 . In this case, the second write happens after the first transaction has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in “Preventing Lost Updates” we will discuss how to make such counter increments safe.
Implementing read committed
Read committed is a very popular isolation level. It is the default setting in Oracle 11g, PostgreSQL, SQL Server 2012, MemSQL, and many other databases [ 8 ].
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object. It must then hold that lock until the transaction is committed or aborted. Only one transaction can hold the lock for any given object; if another transaction wants to write to the same object, it must wait until the first transaction is committed or aborted before it can acquire the lock and continue. This locking is done automatically by databases in read committed mode (or stronger isolation levels).
How do we prevent dirty reads? One option would be to use the same lock, and to require any transaction that wants to read an object to briefly acquire the lock and then release it again immediately after reading. This would ensure that a read couldn’t happen while an object has a dirty, uncommitted value (because during that time the lock would be held by the transaction that has made the write).
However, the approach of requiring read locks does not work well in practice, because one long-running write transaction can force many read-only transactions to wait until the long-running transaction has completed. This harms the response time of read-only transactions and is bad for operability: a slowdown in one part of an application can have a knock-on effect in a completely different part of the application, due to waiting for locks.
For that reason, most databases prevent dirty reads using the approach illustrated in Figure 7-4 : for every object that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. While the transaction is ongoing, any other transactions that read the object are simply given the old value. Only when the new value is committed do transactions switch over to reading the new value.
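The read committed machinery just described can be sketched as a toy in-process model (all class and method names here are invented for illustration; real engines differ in many details): writers take a per-object lock that is held until commit or abort, so dirty writes are impossible, while readers are simply handed the remembered old committed value, so they never block and never see dirty data.

```python
import threading

class ReadCommittedStore:
    """Toy key-value store providing read committed isolation.

    Writers hold a per-key lock until commit/abort (no dirty writes);
    readers get the old committed value (no dirty reads, no blocking).
    """
    def __init__(self):
        self.committed = {}     # key -> last committed value
        self.uncommitted = {}   # key -> (txid, pending value)
        self.locks = {}         # key -> write lock (non-reentrant sketch)
        self.mutex = threading.Lock()

    def write(self, txid, key, value):
        with self.mutex:
            lock = self.locks.setdefault(key, threading.Lock())
        lock.acquire()  # blocks while another txn holds the write lock
        self.uncommitted[key] = (txid, value)

    def read(self, txid, key):
        # A transaction sees its own pending writes; everyone else
        # is served the old committed value.
        pending = self.uncommitted.get(key)
        if pending and pending[0] == txid:
            return pending[1]
        return self.committed.get(key)

    def commit(self, txid):
        for key, (owner, value) in list(self.uncommitted.items()):
            if owner == txid:
                self.committed[key] = value
                del self.uncommitted[key]
                self.locks[key].release()

    def abort(self, txid):
        for key, (owner, _) in list(self.uncommitted.items()):
            if owner == txid:
                del self.uncommitted[key]
                self.locks[key].release()
```

Replaying Figure 7-4 with this sketch: user 1 sets x = 3, and user 2's reads keep returning the old value 2 until user 1 commits.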
Snapshot Isolation and Repeatable Read
If you look superficially at read committed isolation, you could be forgiven for thinking that it does everything that a transaction needs to do: it allows aborts (required for atomicity), it prevents reading the incomplete results of transactions, and it prevents concurrent writes from getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can get from a system that has no transactions.
However, there are still plenty of ways in which you can have concurrency bugs when using this isolation level. For example, Figure 7-6 illustrates a problem that can occur with read committed.
Say Alice has $1,000 of savings at a bank, split across two accounts with $500 each. Now a transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her list of account balances in the same moment as that transaction is being processed, she may see one account balance at a time before the incoming payment has arrived (with a balance of $500), and the other account after the outgoing transfer has been made (the new balance being $400). To Alice it now appears as though she only has a total of $900 in her accounts—it seems that $100 has vanished into thin air.
This anomaly is called a nonrepeatable read or read skew : if Alice were to read the balance of account 1 again at the end of the transaction, she would see a different value ($600) than she saw in her previous query. Read skew is considered acceptable under read committed isolation: the account balances that Alice saw were indeed committed at the time when she read them.
Note
The term skew is unfortunately overloaded: we previously used it in the sense of an unbalanced workload with hot spots (see “Skewed Workloads and Relieving Hot Spots” ), whereas here it means timing anomaly .
In Alice’s case, this is not a lasting problem, because she will most likely see consistent account balances if she reloads the online banking website a few seconds later. However, some situations cannot tolerate such temporary inconsistency:
- Backups
-
Taking a backup requires making a copy of the entire database, which may take hours on a large database. During the time that the backup process is running, writes will continue to be made to the database. Thus, you could end up with some parts of the backup containing an older version of the data, and other parts containing a newer version. If you need to restore from such a backup, the inconsistencies (such as disappearing money) become permanent.
- Analytic queries and integrity checks
-
Sometimes, you may want to run a query that scans over large parts of the database. Such queries are common in analytics (see “Transaction Processing or Analytics?” ), or may be part of a periodic integrity check that everything is in order (monitoring for data corruption). These queries are likely to return nonsensical results if they observe parts of the database at different points in time.
Snapshot isolation [ 28 ] is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.
Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It is very hard to reason about the meaning of a query if the data on which it operates is changing at the same time as the query is executing. When a transaction can see a consistent snapshot of the database, frozen at a particular point in time, it is much easier to understand.
Snapshot isolation is a popular feature: it is supported by PostgreSQL, MySQL with the InnoDB storage engine, Oracle, SQL Server, and others [ 23 , 31 , 32 ].
Implementing snapshot isolation
Like read committed isolation, implementations of snapshot isolation typically use write locks to prevent dirty writes (see “Implementing read committed” ), which means that a transaction that makes a write can block the progress of another transaction that writes to the same object. However, reads do not require any locks. From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers . This allows a database to handle long-running read queries on a consistent snapshot at the same time as processing writes normally, without any lock contention between the two.
To implement snapshot isolation, databases use a generalization of the mechanism we saw for preventing dirty reads in Figure 7-4 . The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is known as multi-version concurrency control (MVCC).
If a database only needed to provide read committed isolation, but not snapshot isolation, it would be sufficient to keep two versions of an object: the committed version and the overwritten-but-not-yet-committed version. However, storage engines that support snapshot isolation typically use MVCC for their read committed isolation level as well. A typical approach is that read committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction.
Figure 7-7 illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL [ 31 ] (other implementations are similar). When a transaction is started, it is given a unique, always-increasing transaction ID (txid). Whenever a transaction writes anything to the database, the data it writes is tagged with the transaction ID of the writer.
Each row in a table has a created_by field, containing the ID of the transaction that inserted this row into the table. Moreover, each row has a deleted_by field, which is initially empty. If a transaction deletes a row, the row isn’t actually deleted from the database, but it is marked for deletion by setting the deleted_by field to the ID of the transaction that requested the deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space.
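A minimal model of this row format can make the mechanism concrete (an illustrative sketch only, not PostgreSQL's actual on-disk layout; the helper function name is invented):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowVersion:
    """One version of a row in an MVCC table, with the created_by and
    deleted_by fields described in the text."""
    key: str
    value: int
    created_by: int                   # txid that inserted this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any

def mvcc_update(table, key, new_value, txid):
    """An update is internally a delete plus a create: mark the live
    version as deleted by this txid, and append a new version."""
    for row in table:
        if row.key == key and row.deleted_by is None:
            row.deleted_by = txid
    table.append(RowVersion(key, new_value, created_by=txid))
```

Replaying Figure 7-7: when transaction 13 deducts $100 from account 2, the table afterwards holds two versions of the row, one deleted by txid 13 and one created by it.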
An update is internally translated into a delete and a create. For example, in Figure 7-7 , transaction 13 deducts $100 from account 2, changing the balance from $500 to $400. The accounts table now actually contains two rows for account 2: a row with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of $400 which was created by transaction 13.
Visibility rules for observing a consistent snapshot
When a transaction reads from the database, transaction IDs are used to decide which objects it can see and which are invisible. By carefully defining visibility rules, the database can present a consistent snapshot of the database to the application. This works as follows:
-
At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those transactions have made are ignored, even if the transactions subsequently commit.
-
Any writes made by aborted transactions are ignored.
-
Any writes made by transactions with a later transaction ID (i.e., which started after the current transaction started) are ignored, regardless of whether those transactions have committed.
-
All other writes are visible to the application’s queries.
These rules apply to both creation and deletion of objects. In Figure 7-7 , when transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500 balance was made by transaction 13 (according to rule 3, transaction 12 cannot see a deletion made by transaction 13), and the creation of the $400 balance is not yet visible (by the same rule).
Put another way, an object is visible if both of the following conditions are true:
-
At the time when the reader’s transaction started, the transaction that created the object had already committed.
-
The object is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed at the time when the reader’s transaction started.
A long-running transaction may continue using a snapshot for a long time, continuing to read values that (from other transactions’ point of view) have long been overwritten or deleted. By never updating values in place but instead creating a new version every time a value is changed, the database can provide a consistent snapshot while incurring only a small overhead.
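These visibility rules can be written down as a single predicate (a sketch under simplifying assumptions: txids assign snapshot order, and the `in_progress` and `aborted` sets were captured when the reader's transaction started; the function and field names are invented):

```python
from collections import namedtuple

# A row version tagged with the txids that created and deleted it.
Row = namedtuple("Row", "value created_by deleted_by")

def visible(row, reader_txid, in_progress, aborted):
    """Return True if `row` is visible in the reader's snapshot,
    applying the four rules from the text."""
    def effects_visible(txid):
        # Rules 1-3: ignore writers that were still in progress when
        # the snapshot was taken, that aborted, or that started later.
        return (txid not in in_progress
                and txid not in aborted
                and txid <= reader_txid)

    if not effects_visible(row.created_by):
        return False  # the row's creation is not yet visible
    if row.deleted_by is not None and effects_visible(row.deleted_by):
        return False  # a visible deletion hides the row
    return True       # rule 4: all other writes are visible
```

Replaying Figure 7-7: transaction 12 reads account 2 while transaction 13 is still in progress, so it sees the $500 version (whose deletion by txid 13 is ignored) and not the $400 version created by txid 13.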
Indexes and snapshot isolation
How do indexes work in a multi-version database? One option is to have the index simply point to all versions of an object and require an index query to filter out any object versions that are not visible to the current transaction. When garbage collection removes old object versions that are no longer visible to any transaction, the corresponding index entries can also be removed.
In practice, many implementation details determine the performance of multi-version concurrency control. For example, PostgreSQL has optimizations for avoiding index updates if different versions of the same object can fit on the same page [ 31 ].
Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see “B-Trees” ), they use an append-only/copy-on-write variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and remain immutable [ 33 , 34 , 35 ].
With append-only B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it was created. There is no need to filter out objects based on transaction IDs because subsequent writes cannot modify an existing B-tree; they can only create new tree roots. However, this approach also requires a background process for compaction and garbage collection.
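The copy-on-write idea is easiest to see on a plain binary search tree rather than a full B-tree (a simplified sketch; real append-only B-trees page to disk and batch transactions, and all names here are invented): every insert copies only the path from the root down, returns a new root, and leaves every old root intact as a consistent snapshot.

```python
class Node:
    """Immutable-by-convention tree node; untouched subtrees are shared."""
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    """Copy-on-write insert: never mutates existing nodes, only copies
    the nodes on the root-to-leaf path and returns a new root."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # replace value, share subtrees

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None
```

Each root is a snapshot: updating a key through one root leaves reads through the older roots unchanged, with no txid filtering needed.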
Repeatable read and naming confusion
Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable , and in PostgreSQL and MySQL it is called repeatable read [ 23 ].
The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot isolation, because the standard is based on System R’s 1975 definition of isolation levels [ 2 ] and snapshot isolation hadn’t yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL and MySQL call their snapshot isolation level repeatable read because it meets the requirements of the standard, and so they can claim standards compliance.
Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous, imprecise, and not as implementation-independent as a standard should be [ 28 ]. Even though several databases implement repeatable read, there are big differences in the guarantees they actually provide, despite being ostensibly standardized [ 23 ]. There has been a formal definition of repeatable read in the research literature [ 29 , 30 ], but most implementations don’t satisfy that formal definition. And to top it off, IBM DB2 uses “repeatable read” to refer to serializability [ 8 ].
As a result, nobody really knows what repeatable read means.
Preventing Lost Updates
The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored the issue of two transactions writing concurrently—we have only discussed dirty writes (see “No dirty writes” ), one particular type of write-write conflict that can occur.
There are several other interesting kinds of conflicts that can occur between concurrently writing transactions. The best known of these is the lost update problem, illustrated in Figure 7-1 with the example of two concurrent counter increments.
The lost update problem can occur if an application reads some value from the database, modifies it, and writes back the modified value (a read-modify-write cycle ). If two transactions do this concurrently, one of the modifications can be lost, because the second write does not include the first modification. (We sometimes say that the later write clobbers the earlier write.) This pattern occurs in various different scenarios:
-
Incrementing a counter or updating an account balance (requires reading the current value, calculating the new value, and writing back the updated value)
-
Making a local change to a complex value, e.g., adding an element to a list within a JSON document (requires parsing the document, making the change, and writing back the modified document)
-
Two users editing a wiki page at the same time, where each user saves their changes by sending the entire page contents to the server, overwriting whatever is currently in the database
Because this is such a common problem, a variety of solutions have been developed.
Atomic write operations
Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code. They are usually the best solution if your code can be expressed in terms of those operations. For example, the following instruction is concurrency-safe in most relational databases:
UPDATE counters SET value = value + 1 WHERE key = 'foo';
Similarly, document databases such as MongoDB provide atomic operations for making local modifications to a part of a JSON document, and Redis provides atomic operations for modifying data structures such as priority queues. Not all writes can easily be expressed in terms of atomic operations—for example, updates to a wiki page involve arbitrary text editing—but in situations where atomic operations can be used, they are usually the best choice.
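The atomic UPDATE above can be tried directly with SQLite via Python's standard-library sqlite3 module (a sketch: SQLite serializes writers within one process anyway, but the point is that the increment is a single statement, so there is no application-level read-modify-write cycle for another client to interleave with):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 0)")

# The read and the write happen inside the database, in one statement.
for _ in range(3):
    conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'foo'")
conn.commit()

(value,) = conn.execute(
    "SELECT value FROM counters WHERE key = 'foo'").fetchone()
print(value)  # 3
```

Compare this with fetching the value into a variable, adding 1 in Python, and writing it back: under concurrency the latter can lose increments, the single-statement form cannot.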
Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied. This technique is sometimes known as cursor stability [ 36 , 37 ]. Another option is to simply force all atomic operations to be executed on a single thread.
Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database [ 38 ]. That’s not a problem if you know what you are doing, but it is potentially a source of subtle bugs that are difficult to find by testing.
Explicit locking
Another option for preventing lost updates, if the database’s built-in atomic operations don’t provide the necessary functionality, is for the application to explicitly lock objects that are going to be updated. Then the application can perform a read-modify-write cycle, and if any other transaction tries to concurrently read the same object, it is forced to wait until the first read-modify-write cycle has completed.
For example, consider a multiplayer game in which several players can move the same figure concurrently. In this case, an atomic operation may not be sufficient, because the application also needs to ensure that a player’s move abides by the rules of the game, which involves some logic that you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two players from concurrently moving the same piece, as illustrated in Example 7-1 .
Example 7-1. Explicitly locking rows to prevent lost updates
BEGIN TRANSACTION;

SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE;

-- Check whether move is valid, then update the position
-- of the piece that was returned by the previous SELECT.

UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;
This works, but to get it right, you need to carefully think about your application logic. It’s easy to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.
Automatically detecting lost updates
Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle.
An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [ 23 ]. Some authors [ 28 , 30 ] argue that a database must prevent lost updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot isolation under this definition.
这种方法的一个优点是,数据库可以结合快照隔离高效地执行这种检查。事实上,PostgreSQL 的可重复读、Oracle 的可串行化和 SQL Server 的快照隔离级别都会自动检测丢失更新并中止引发问题的事务。然而,MySQL/InnoDB 的可重复读并不检测丢失更新[23]。一些作者[28,30]认为,数据库必须防止丢失更新才算得上提供快照隔离,因此按照这个定义,MySQL 并不提供快照隔离。
Lost update detection is a great feature, because it doesn’t require application code to use any special database features—you may forget to use a lock or an atomic operation and thus introduce a bug, but lost update detection happens automatically and is thus less error-prone.
丢失更新检测是一个很棒的功能,因为它不需要应用代码使用任何特殊的数据库功能:你可能会忘记使用锁或原子操作从而引入错误,而丢失更新检测是自动发生的,因此不易出错。
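As a rough illustration of the idea, the following sketch shows how a transaction manager might detect a lost update at commit time. The `VersionedStore` class and its method names are hypothetical, not any real database's API: each object carries a version, and a write whose read version has since been overwritten is aborted.

```python
# Sketch of lost update detection at commit time (hypothetical store,
# not a real database API): each object carries a version number, and
# a transaction that read a now-stale version must abort and retry.
class LostUpdateError(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def commit_write(self, key, read_version, new_value):
        current_version, _ = self.data.get(key, (0, None))
        if current_version != read_version:
            # Another transaction committed a write in between: lost update.
            raise LostUpdateError(key)
        self.data[key] = (current_version + 1, new_value)

store = VersionedStore()
store.data["counter"] = (1, 41)
version, value = store.read("counter")
store.commit_write("counter", version, value + 1)  # succeeds: version unchanged
```

A second `commit_write` using the stale version 1 would raise `LostUpdateError`, forcing the application to repeat its read-modify-write cycle.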
Compare-and-set
In databases that don’t provide transactions, you sometimes find an atomic compare-and-set operation (previously mentioned in “Single-object writes” ). The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried.
在不提供事务处理的数据库中,有时可以找到原子比较和设置操作(在“单对象写入”中提到过)。此操作的目的是通过仅在上次读取后的值未更改时允许更新来避免丢失更新。如果当前值与您以前读取的值不匹配,则更新无效,并且必须重试读取-修改-写入循环。
For example, to prevent two users concurrently updating the same wiki page, you might try something like this, expecting the update to occur only if the content of the page hasn’t changed since the user started editing it:
例如,为了防止两个用户同时更新同一个Wiki页面,您可以尝试类似以下的方法,只有在页面内容自用户开始编辑时没有更改时,才能期望更新成功:
-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
  WHERE id = 1234 AND content = 'old content';
If the content has changed and no longer matches 'old content', this update will have no effect, so you need to check whether the update took effect and retry if necessary. However, if the database allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates, because the condition may be true even though another concurrent write is occurring. Check whether your database’s compare-and-set operation is safe before relying on it.
如果内容已更改且不再与 'old content' 匹配,则此更新将不起作用,因此你需要检查更新是否生效,必要时重试。但是,如果数据库允许 WHERE 子句从旧快照中读取,则此语句可能无法防止丢失更新,因为即使另一个并发写入正在发生,该条件也可能为真。在依赖数据库的比较并设置操作之前,请先确认它是安全的。
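The read-check-retry cycle around a compare-and-set operation can be sketched as follows. The `Store` class and its `cas` method are hypothetical stand-ins for whatever atomic compare-and-set primitive your database provides; the point is the retry loop around it.

```python
# Sketch of a compare-and-set retry loop. The in-memory Store below is
# hypothetical; in a real database the atomic check happens inside the
# conditional UPDATE itself.
class Store:
    def __init__(self):
        self.pages = {}  # page id -> content

    def cas(self, page_id, expected, new):
        """Atomically set new content only if the current value matches."""
        if self.pages.get(page_id) == expected:
            self.pages[page_id] = new
            return True
        return False

def save_page(store, page_id, edit, retries=3):
    # Read-modify-write cycle: retry if a concurrent writer got in first.
    for _ in range(retries):
        current = store.pages.get(page_id, "")
        if store.cas(page_id, current, edit(current)):
            return True
    return False

store = Store()
store.pages[1234] = "old content"
save_page(store, 1234, lambda current: "new content")
```

If another writer changes the page between the read and the `cas` call, the `cas` fails and the loop re-reads and retries, rather than silently clobbering the concurrent edit.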
Conflict resolution and replication
In replicated databases (see Chapter 5 ), preventing lost updates takes on another dimension: since they have copies of the data on multiple nodes, and the data can potentially be modified concurrently on different nodes, some additional steps need to be taken to prevent lost updates.
在复制的数据库中(见第五章),防止丢失更新会有另一个层面:由于它们在多个节点上拥有数据的副本,并且数据可能会在不同节点上并发修改,因此需要采取一些额外的步骤来防止丢失更新。
Locks and compare-and-set operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context. (We will revisit this issue in more detail in “Linearizability” .)
锁定和比较并设置操作假定数据有一个最新的副本。然而,具有多个领导者或无领导者复制的数据库通常允许同时进行多个写操作,并异步复制它们,因此无法保证数据有一个最新的副本。因此,基于锁定或比较并设置的技术在这种情况下不适用。(我们将在“线性化”中更详细地讨论这个问题。)
Instead, as discussed in “Detecting Concurrent Writes” , a common approach in such replicated databases is to allow concurrent writes to create several conflicting versions of a value (also known as siblings ), and to use application code or special data structures to resolve and merge these versions after the fact.
相反,正如“检测并发写入”中所讨论的,这类复制数据库中常见的方法是允许并发写入创建一个值的多个冲突版本(也称为兄弟,siblings),并使用应用代码或特殊数据结构在事后解决和合并这些版本。
Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind Riak 2.0 datatypes, which prevent lost updates across replicas. When a value is concurrently updated by different clients, Riak automatically merges together the updates in such a way that no updates are lost [ 39 ].
原子操作在复制环境中可以工作得很好,尤其当它们满足交换律时(即,在不同副本上以不同顺序应用它们,仍能得到相同的结果)。例如,递增计数器或向集合中添加元素都是可交换的操作。这正是 Riak 2.0 数据类型背后的思想,它可以防止副本之间丢失更新。当一个值被不同客户端并发更新时,Riak 会自动将这些更新合并在一起,确保不丢失任何更新[46,47,48]中未涉及,见[39]。
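A commutative replicated counter can be sketched as a grow-only counter in the spirit of (but not an exact reproduction of) Riak's datatypes: each replica tracks its own increments, and merging takes the per-replica maximum, so merges can happen in any order without losing updates.

```python
# Sketch of a commutative replicated counter (a "G-counter"), illustrating
# the idea behind CRDT-style datatypes. Each replica records only its own
# increments; merge takes the per-replica maximum, so no update is lost
# regardless of merge order.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments originating there

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Element-wise max is idempotent and commutative.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(2)  # concurrent updates on two replicas
b.increment(3)
a.merge(b)
b.merge(a)      # merge order doesn't matter: both converge to 5
```

By contrast, a last-write-wins register in the same situation would keep one replica's update and silently discard the other.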
On the other hand, the last write wins (LWW) conflict resolution method is prone to lost updates, as discussed in “Last write wins (discarding concurrent writes)” . Unfortunately, LWW is the default in many replicated databases.
另一方面,最后写入胜出(LWW)冲突解决方法容易丢失更新,就像在“最后写入获胜(丢弃并行写入)”中讨论的那样。不幸的是,在许多复制数据库中,LWW是默认设置。
Write Skew and Phantoms
In the previous sections we saw dirty writes and lost updates , two kinds of race conditions that can occur when different transactions concurrently try to write to the same objects. In order to avoid data corruption, those race conditions need to be prevented—either automatically by the database, or by manual safeguards such as using locks or atomic write operations.
在前面的章节中,我们看到了脏写和丢失更新这两种竞态条件,当不同的事务同时尝试写入相同的对象时,可能会发生。为了避免数据损坏,需要防止这些竞争条件--可以通过数据库自动实现,也可以通过使用锁或原子写操作等手动保障实现。
However, that is not the end of the list of potential race conditions that can occur between concurrent writes. In this section we will see some subtler examples of conflicts.
然而,这并不是并发写入之间可能发生的潜在竞态条件列表的终点。在本节中,我们将看到一些更微妙的冲突示例。
To begin, imagine this example: you are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if they are sick themselves), provided that at least one colleague remains on call in that shift [ 40 , 41 ].
首先,想象这样一个例子:你正在为医生编写一个管理医院值班班次的应用程序。医院通常会尽量让多名医生同时值班,但绝对必须至少有一名医生在值班。医生可以放弃值班(例如,自己生病时),前提是该班次中至少还有一名同事继续值班[40,41]。
Now imagine that Alice and Bob are the two on-call doctors for a particular shift. Both are feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button to go off call at approximately the same time. What happens next is illustrated in Figure 7-8 .
现在想象一下,艾丽斯和鲍勃是特定班次的两名值班医生。他们俩都不太舒服,于是都决定请假。不幸的是,他们恰巧在大约同一时间点击了下班按钮。接下来发生的事情如图7-8所示。
In each transaction, your application first checks that two or more doctors are currently on call; if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot isolation, both checks return 2, so both transactions proceed to the next stage. Alice updates her own record to take herself off call, and Bob updates his own record likewise. Both transactions commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.
在每个事务中,你的应用程序首先检查当前是否有两名或以上的医生在值班;如果是,它就认为一名医生停止值班是安全的。由于数据库使用了快照隔离,两次检查都返回 2,所以两个事务都进入下一阶段。Alice 更新自己的记录以停止值班,Bob 也同样更新自己的记录。两个事务都提交了,现在没有医生在值班。至少有一名医生值班的要求已被违反。
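The sequence of events can be simulated in a few lines. This is a toy model, not a real database: the shared `snapshot` dict stands in for the consistent snapshot that both transactions read from.

```python
# Toy simulation of the on-call write skew under snapshot isolation:
# both transactions read the same consistent snapshot, each sees two
# doctors on call, and each updates a *different* row.
snapshot = {"alice": True, "bob": True}  # on_call flags at transaction start

def on_call_count(state):
    return sum(1 for on_call in state.values() if on_call)

db = dict(snapshot)  # current database state

# Transaction 1 (Alice): check against the snapshot, then write.
if on_call_count(snapshot) >= 2:
    db["alice"] = False

# Transaction 2 (Bob): its snapshot predates Alice's commit, so the
# check still passes, and it writes to a different row.
if on_call_count(snapshot) >= 2:
    db["bob"] = False

# Neither write was a dirty write or a lost update, yet the invariant
# "at least one doctor on call" is violated:
assert on_call_count(db) == 0
```

Run serially, the second transaction's check would have seen only one doctor on call and refused the request; the anomaly depends entirely on the concurrent snapshots.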
Characterizing write skew
This anomaly is called write skew [ 28 ]. It is neither a dirty write nor a lost update, because the two transactions are updating two different objects (Alice’s and Bob’s on-call records, respectively). It is less obvious that a conflict occurred here, but it’s definitely a race condition: if the two transactions had run one after another, the second doctor would have been prevented from going off call. The anomalous behavior was only possible because the transactions ran concurrently.
这种异常被称为写入偏斜(write skew)[28]。它既不是脏写也不是丢失更新,因为两个事务更新的是两个不同的对象(分别是 Alice 和 Bob 的值班记录)。这里发生冲突并不那么明显,但这肯定是一种竞态条件:如果两个事务一个接一个地运行,第二名医生就会被阻止停止值班。这种异常行为只有在事务并发运行时才可能出现。
You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).
可以将写入偏斜看作丢失更新问题的泛化。如果两个事务读取相同的对象,然后更新其中的一些对象(不同的事务可能更新不同的对象),就可能发生写入偏斜。在不同事务更新同一个对象的特殊情况下,你会得到脏写或丢失更新异常(取决于时序)。
We saw that there are various different ways of preventing lost updates. With write skew, our options are more restricted:
我们看到了许多不同的方法来防止丢失的更新。但在写入偏差的情况下,我们的选择比较有限:
-
Atomic single-object operations don’t help, as multiple objects are involved.
原子单对象操作无济于事,因为涉及多个对象。
-
The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL Server’s snapshot isolation level [ 23 ]. Automatically preventing write skew requires true serializable isolation (see “Serializability” ).
遗憾的是,某些快照隔离实现中的丢失更新自动检测在这里也帮不上忙:PostgreSQL 的可重复读、MySQL/InnoDB 的可重复读、Oracle 的可串行化以及 SQL Server 的快照隔离级别都不会自动检测写入偏斜[23]。自动防止写入偏斜需要真正的可串行化隔离(见“可串行化”)。
-
Some databases allow you to configure constraints, which are then enforced by the database (e.g., uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to specify that at least one doctor must be on call, you would need a constraint that involves multiple objects. Most databases do not have built-in support for such constraints, but you may be able to implement them with triggers or materialized views, depending on the database [ 42 ].
一些数据库允许你配置约束,然后由数据库强制执行(例如唯一性约束、外键约束或对特定值的限制)。然而,要指定至少有一名医生必须值班,你需要一个涉及多个对象的约束。大多数数据库没有内置对这类约束的支持,但根据所用数据库,你或许可以用触发器或物化视图来实现它们[42]。
-
If you can’t use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. In the doctors example, you could write something like the following:
如果您无法使用可序列化隔离级别,在这种情况下,第二好的选择可能是显式锁定事务所依赖的行。在医生示例中,您可以编写以下内容:
BEGIN TRANSACTION;

SELECT * FROM doctors
  WHERE on_call = true AND shift_id = 1234
  FOR UPDATE;

UPDATE doctors SET on_call = false
  WHERE name = 'Alice' AND shift_id = 1234;

COMMIT;
More examples of write skew
Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice more situations in which it can occur. Here are some more examples:
写偏斜一开始可能看起来像是一个深奥的问题,但是一旦你意识到它,你可能会注意到更多可能发生这种现象的情况。以下是一些例子:
- Meeting room booking system
-
Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [ 43 ]. When someone wants to make a booking, you first check for any conflicting bookings (i.e., bookings for the same room with an overlapping time range), and if none are found, you create the meeting (see Example 7-2 ). ix
假设您想强制要求同一时间不能有两个预订同一个会议室的情况[43]。当有人想要进行预订时,您首先要检查是否存在任何冲突的预订(即按时间重叠的同一房间预订),如果不存在,您就可以创建会议(参见示例7-2)。
Example 7-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)
BEGIN TRANSACTION;

-- Check for any existing bookings that overlap with the period of noon-1pm
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123 AND
    end_time > '2015-01-01 12:00' AND start_time < '2015-01-01 13:00';

-- If the previous query returned zero:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);

COMMIT;
Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable isolation.
遗憾的是,快照隔离并不能防止其他用户同时插入冲突会议。为了确保您不会遇到调度冲突,您再次需要可序列化隔离。
- Multiplayer game
-
In Example 7-1 , we used a lock to prevent lost updates (that is, making sure that two players can’t move the same figure at the same time). However, the lock doesn’t prevent players from moving two different figures to the same position on the board or potentially making some other move that violates the rules of the game. Depending on the kind of rule you are enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write skew.
在示例 7-1 中,我们使用了锁来防止丢失更新(即确保两个玩家不能同时移动同一个棋子)。然而,这个锁并不能防止玩家将两个不同的棋子移到棋盘上的同一个位置,或者做出其他违反游戏规则的移动。取决于你要强制执行的规则类型,你或许可以使用唯一约束,否则你就容易受到写入偏斜的影响。
- Claiming a username
-
On a website where each user has a unique username, two users may try to create accounts with the same username at the same time. You may use a transaction to check whether a name is taken and, if not, create an account with that name. However, like in the previous examples, that is not safe under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second transaction that tries to register the username will be aborted due to violating the constraint).
在一个每个用户都有唯一用户名的网站上,可能会有两个用户同时尝试创建一个同名的帐户。您可以使用事务来检查用户名是否已被使用,如果没有,则使用该名称创建帐户。但是,像之前的例子一样,在快照隔离下不安全。幸运的是,唯一约束是一个简单的解决方案(尝试注册用户名的第二个事务将因违反约束而被中止)。
- Preventing double-spending
-
A service that allows users to spend money or points needs to check that a user doesn’t spend more than they have. You might implement this by inserting a tentative spending item into a user’s account, listing all the items in the account, and checking that the sum is positive [ 44 ]. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but that neither transaction notices the other.
一个允许用户花费金钱或积分的服务需要检查用户的花费不超过其余额。你可以这样实现:向用户账户中插入一个暂定的消费项,列出账户中的所有项目,并检查总和为正[44]。在写入偏斜的情况下,可能会有两个消费项被并发插入,二者加起来导致余额变为负数,但两个事务都没有注意到对方。
Phantoms causing write skew
All of these examples follow a similar pattern:
所有这些例子都遵循相似的模式:
-
A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition (there are at least two doctors on call, there are no existing bookings for that room at that time, the position on the board doesn’t already have another figure on it, the username isn’t already taken, there is still money in the account).
一个 SELECT 查询通过搜索符合某些条件的行来检查是否满足要求(当时至少有两名医生值班,该房间此时没有已有的预订,棋盘上该位置没有其他棋子,用户名没有被占用,该账户还有余额)。
-
Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort).
根据第一个查询的结果,应用程序代码决定如何继续(可能继续执行操作,或者向用户报告错误并中止操作)。
-
If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction.
如果应用程序决定继续,它将向数据库进行写入(INSERT、UPDATE 或 DELETE),并提交事务。
The effect of this write changes the precondition of the decision of step 2. In other words, if you were to repeat the SELECT query from step 1 after committing the write, you would get a different result, because the write changed the set of rows matching the search condition (there is now one fewer doctor on call, the meeting room is now booked for that time, the position on the board is now taken by the figure that was moved, the username is now taken, there is now less money in the account).
这个写操作的效果改变了第 2 步决策的前提条件。换句话说,如果你在提交写操作之后重复执行第 1 步的 SELECT 查询,你会得到不同的结果,因为写操作改变了符合搜索条件的行集合(现在少了一名值班医生,会议室在该时段已被预订,棋盘上的位置已被移动过去的棋子占据,用户名已被占用,账户中的余额变少了)。
The steps may occur in a different order. For example, you could first make the write, then the SELECT query, and finally decide whether to abort or commit based on the result of the query.
步骤的顺序可能不同。例如,你可以先执行写入,然后执行 SELECT 查询,最后根据查询结果决定是中止还是提交。
In the case of the doctor on call example, the row being modified in step 3 was one of the rows returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows in step 1 (SELECT FOR UPDATE). However, the other four examples are different: they check for the absence of rows matching some search condition, and the write adds a row matching the same condition. If the query in step 1 doesn’t return any rows, SELECT FOR UPDATE can’t attach locks to anything.
对于值班医生的例子,第 3 步中被修改的行是第 1 步返回的行之一,因此我们可以通过在第 1 步中锁定这些行(SELECT FOR UPDATE)来使事务安全,避免写入偏斜。然而,其他四个例子则不同:它们检查的是不存在匹配某个搜索条件的行,而写入操作添加了一个匹配同一条件的行。如果第 1 步的查询没有返回任何行,SELECT FOR UPDATE 就没有可以加锁的对象。
This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom [ 3 ]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew.
这种一个事务中的写入改变另一个事务的搜索查询结果的效应,称为幻读(phantom)[3]。快照隔离避免了只读查询中的幻读,但在像我们讨论的这类读写事务中,幻读会导致特别棘手的写入偏斜情况。
Materializing conflicts
If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database?
如果幻读的问题在于没有对象可供我们加锁,或许我们可以人为地在数据库中引入一个锁对象?
For example, in the meeting room booking case you could imagine creating a table of time slots and rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15 minutes). You create rows for all possible combinations of rooms and time periods ahead of time, e.g. for the next six months.
例如,在预订会议室的情况下,您可以想象创建一个时间槽和房间的表格。该表中的每一行对应于特定时间段(例如,15分钟)内的特定房间。您提前为所有可能的房间和时间段组合创建行,例如未来六个月的时间。
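Generating the lock rows ahead of time is a mechanical step; it can be sketched as follows. The `(room, slot)` tuples and the 15-minute granularity mirror the example in the text; the function name is illustrative.

```python
# Sketch of materializing conflicts for the booking example: pre-create
# one lock row per (room, 15-minute slot) so that a booking transaction
# has concrete rows to SELECT ... FOR UPDATE. Names are illustrative.
from datetime import datetime, timedelta

def slot_rows(rooms, start, end, slot_minutes=15):
    """Enumerate (room_id, slot_start) pairs covering [start, end)."""
    rows = []
    t = start
    while t < end:
        for room in rooms:
            rows.append((room, t))
        t += timedelta(minutes=slot_minutes)
    return rows

# For two rooms over one hour: 4 slots/hour x 2 rooms = 8 lock rows.
rows = slot_rows([123, 124],
                 datetime(2015, 1, 1, 12, 0),
                 datetime(2015, 1, 1, 13, 0))
```

A transaction booking room 123 from noon to 1pm would lock the four rows for that room before checking for overlaps, so a concurrent booking for an overlapping period blocks on the same lock rows.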
Now a transaction that wants to create a booking can lock (SELECT FOR UPDATE) the rows in the table that correspond to the desired room and time period. After it has acquired the locks, it can check for overlapping bookings and insert a new booking as before. Note that the additional table isn’t used to store information about the booking—it’s purely a collection of locks which is used to prevent bookings on the same room and time range from being modified concurrently.
现在,想要创建预订的事务可以锁定(SELECT FOR UPDATE)表中与所需房间和时间段相对应的行。获取锁之后,它可以像之前一样检查重叠的预订并插入新的预订。请注意,这个额外的表并不用于存储预订信息,它纯粹是一个锁的集合,用于防止同一房间和时间范围内的预订被并发修改。
This approach is called materializing conflicts , because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database [ 11 ]. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. A serializable isolation level is much preferable in most cases.
这种方法被称为物化冲突(materializing conflicts),因为它把幻读转变为数据库中一组具体存在的行上的锁冲突[11]。不幸的是,弄清楚如何物化冲突可能既困难又容易出错,而让并发控制机制渗入应用数据模型也很不优雅。出于这些原因,只有在没有其他可行方案时,才应把物化冲突当作最后的手段。在大多数情况下,可串行化隔离级别要好得多。
Serializability
In this chapter we have seen several examples of transactions that are prone to race conditions. Some race conditions are prevented by the read committed and snapshot isolation levels, but others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s a sad situation:
在本章中,我们看到了几个容易出现竞态条件的事务示例。读已提交和快照隔离级别可以防止某些竞态条件,但其他的则不能。我们遇到了一些特别棘手的写入偏斜和幻读的例子。这是一个令人沮丧的局面:
-
Isolation levels are hard to understand, and inconsistently implemented in different databases (e.g., the meaning of “repeatable read” varies significantly).
隔离级别很难理解,在不同的数据库中实现也不一致 (例如,“可重复读”的意义存在显著的差异)。
-
If you look at your application code, it’s difficult to tell whether it is safe to run at a particular isolation level—especially in a large application, where you might not be aware of all the things that may be happening concurrently.
如果你看你的应用程序代码,很难判断它是否可以在特定的隔离级别下运行,特别是在大型应用程序中,在那里你可能不知道所有可能同时发生的事情。
-
There are no good tools to help us detect race conditions. In principle, static analysis may help [ 26 ], but research techniques have not yet found their way into practical use. Testing for concurrency issues is hard, because they are usually nondeterministic—problems only occur if you get unlucky with the timing.
目前没有好的工具来帮助我们检测竞态条件。从原理上讲,静态分析可能有所帮助[26],但研究技术尚未进入实际应用。测试并发问题很困难,因为它们通常是不确定的,只有在时间不利的情况下才会出现问题。
This is not a new problem—it has been like this since the 1970s, when weak isolation levels were first introduced [ 2 ]. All along, the answer from researchers has been simple: use serializable isolation!
这不是一个新问题 - 自从1970年代引入了弱隔离级别以来,一直存在[2]。一直以来,研究者的答案很简单:使用可串行化隔离!
Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially , without any concurrency. Thus, the database guarantees that if the transactions behave correctly when run individually, they continue to be correct when run concurrently—in other words, the database prevents all possible race conditions.
可序列化隔离通常被认为是最强的隔离级别。它保证即使事务并行执行,最终结果与逐个按顺序串行执行的结果相同,没有任何并发。因此,数据库保证如果事务在单独运行时正确,它们在并发运行时仍然正确,换句话说,数据库排除了所有可能的竞态条件。
But if serializable isolation is so much better than the mess of weak isolation levels, then why isn’t everyone using it? To answer this question, we need to look at the options for implementing serializability, and how they perform. Most databases that provide serializability today use one of three techniques, which we will explore in the rest of this chapter:
但如果可序列化隔离比弱隔离级别混乱的方式好得多,为什么不是每个人都在使用它?要回答这个问题,我们需要看一下实现可串行性的选项以及它们的性能。今天提供可串行性的大多数数据库使用三种技术之一,我们将在本章的其余部分探讨这些技术:
-
Literally executing transactions in a serial order (see “Actual Serial Execution” )
字面意义上按串行顺序执行事务(参见“实际串行执行”)。
-
Two-phase locking (see “Two-Phase Locking (2PL)” ), which for several decades was the only viable option
两阶段锁(请参见“两阶段锁定(2PL)”),数十年来一直是唯一可行的选择。
-
Optimistic concurrency control techniques such as serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” )
乐观并发控制技术,例如可串行化快照隔离(参见“可串行化快照隔离(SSI)”)。
For now, we will discuss these techniques primarily in the context of single-node databases; in Chapter 9 we will examine how they can be generalized to transactions that involve multiple nodes in a distributed system.
目前,我们将主要在单节点数据库的背景下讨论这些技术;在第9章中,我们将研究如何将它们推广到涉及分布式系统中多个节点的事务中。
Actual Serial Execution
The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely sidestep the problem of detecting and preventing conflicts between transactions: the resulting isolation is by definition serializable.
避免并发问题的最简单方法是完全消除并发:在单个线程上按序执行一次只有一个事务。这样做,我们完全回避了检测和防止事务之间冲突的问题:由此产生的隔离是可串行化的。
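The single-threaded execution loop can be sketched in a few lines: transactions are submitted as whole functions to a queue and executed one at a time by a single worker thread, so no locks or conflict detection are needed. This is a toy model, not any particular database's design.

```python
# Sketch of a single-threaded serial transaction loop: whole transactions
# are queued as functions and executed one at a time, so the resulting
# isolation is serializable by construction.
import queue
import threading

db = {"counter": 0}
work = queue.Queue()

def worker():
    while True:
        txn = work.get()
        if txn is None:  # sentinel: shut down
            break
        txn(db)  # each transaction runs to completion before the next starts

t = threading.Thread(target=worker)
t.start()
for _ in range(1000):
    # Each "transaction" is a read-modify-write; no lock is needed because
    # nothing else ever touches db concurrently.
    work.put(lambda d: d.update(counter=d["counter"] + 1))
work.put(None)
t.join()
```

Run with true concurrency, 1,000 unsynchronized read-modify-write cycles could lose updates; run through the serial loop, all 1,000 increments are applied.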
Even though this seems like an obvious idea, database designers only fairly recently—around 2007—decided that a single-threaded loop for executing transactions was feasible [ 45 ]. If multi-threaded concurrency was considered essential for getting good performance during the previous 30 years, what changed to make single-threaded execution possible?
即使这似乎是个显而易见的想法,但数据库设计师直到近年来(大约在2007年)才决定使用单线程循环来执行事务 [45]。如果在之前的30年中,多线程并发被认为是获得良好性能的必要条件,那么什么改变了,使得单线程执行成为可能?
Two developments caused this rethink:
两个发展导致了这种重新思考。
-
RAM became cheap enough that for many use cases it is now feasible to keep the entire active dataset in memory (see “Keeping everything in memory” ). When all data that a transaction needs to access is in memory, transactions can execute much faster than if they have to wait for data to be loaded from disk.
RAM的价格越来越便宜,现在在许多情况下,将整个活动数据集保留在内存中变得可行(请参阅“全部保留在内存中”)。当事务需要访问的所有数据都在内存中时,与必须等待从磁盘加载数据相比,事务可以执行得更快。
-
Database designers realized that OLTP transactions are usually short and only make a small number of reads and writes (see “Transaction Processing or Analytics?” ). By contrast, long-running analytic queries are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.
数据库设计师发现OLTP事务通常很短,只进行少量读写操作(参见“事务处理还是分析?”)。相比之下,长时间运行的分析查询通常是只读的,因此它们可以在一致的快照上运行(使用快照隔离)而不需要在串行执行循环内运行。
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic [ 46 , 47 , 48 ]. A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking. However, its throughput is limited to that of a single CPU core. In order to make the most of that single thread, transactions need to be structured differently from their traditional form.
以串行方式执行事务的方法在 VoltDB/H-Store、Redis 和 Datomic 中得到了实现[46,47,48]。为单线程执行而设计的系统有时比支持并发的系统表现更好,因为它可以避免锁带来的协调开销。然而,它的吞吐量受限于单个 CPU 核心。为了充分利用这一个线程,事务需要以不同于传统形式的方式来组织。
Encapsulating transactions in stored procedures
In the early days of databases, the intention was that a database transaction could encompass an entire flow of user activity. For example, booking an airline ticket is a multi-stage process (searching for routes, fares, and available seats; deciding on an itinerary; booking seats on each of the flights of the itinerary; entering passenger details; making payment). Database designers thought that it would be neat if that entire process was one transaction so that it could be committed atomically.
在数据库的早期阶段,其目的是使一个数据库事务涵盖一个完整的用户活动流程。例如,预订航空票是一个多阶段的过程(搜索路线、票价和可用座位;决定行程;预订行程中每个航班的座位;输入旅客详情;进行支付)。数据库设计师认为,如果整个过程都是一个事务,那将是一件很好的事情,以便可以原子地提交。
Unfortunately, humans are very slow to make up their minds and respond. If a database transaction needs to wait for input from a user, the database needs to support a potentially huge number of concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction. On the web, this means that a transaction is committed within the same HTTP request—a transaction does not span multiple requests. A new HTTP request starts a new transaction.
不幸的是,人类在做决定和响应方面非常缓慢。如果数据库事务需要等待用户的输入,数据库需要支持潜在的大量并发事务,其中大部分处于空闲状态。大多数数据库不能高效地完成这项工作,因此几乎所有OLTP应用程序通过避免在事务中与用户交互等待而使事务短暂。在web上,这意味着一个事务在同一个HTTP请求中提交-一个事务不跨越多个请求。一个新的HTTP请求开始一个新的事务。
Even though the human has been taken out of the critical path, transactions have continued to be executed in an interactive client/server style, one statement at a time. An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on. The queries and results are sent back and forth between the application code (running on one machine) and the database server (on another machine).
即使人类已被移出关键路径,事务仍然以交互式的客户端/服务器风格执行,一次一条语句。应用程序发出一个查询,读取结果,也许根据第一个查询的结果再发出另一个查询,依此类推。查询和结果在应用代码(运行在一台机器上)和数据库服务器(在另一台机器上)之间来回传送。
In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful because the database would spend most of its time waiting for the application to issue the next query for the current transaction. In this kind of database, it’s necessary to process multiple transactions concurrently in order to get reasonable performance.
在这种交互式的事务模式中,应用程序和数据库之间需要花费大量时间进行网络通信。如果禁止数据库并发处理,只能一次处理一个事务,那么吞吐量会非常糟糕,因为数据库会花费大部分时间等待应用程序发出当前事务的下一个查询。在这种类型的数据库中,必须同时处理多个事务,才能获得合理的性能。
For this reason, systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure . The differences between these approaches is illustrated in Figure 7-9 . Provided that all data required by a transaction is in memory, the stored procedure can execute very fast, without waiting for any network or disk I/O.
因此,采用单线程串行事务处理的系统不允许交互式的多语句事务。相反,应用程序必须把整个事务的代码作为存储过程提前提交给数据库。这些方法之间的差异如图 7-9 所示。只要事务所需的所有数据都在内存中,存储过程就可以执行得非常快,无需等待任何网络或磁盘 I/O。
Pros and cons of stored procedures
Stored procedures have existed for some time in relational databases, and they have been part of the SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:
存储过程在关系型数据库中已经存在了一段时间,并且自 1999 年以来一直是 SQL 标准(SQL/PSM)的一部分。由于各种原因,它们的名声不太好:
-
Each database vendor has its own language for stored procedures (Oracle has PL/SQL, SQL Server has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in general-purpose programming languages, so they look quite ugly and archaic from today’s point of view, and they lack the ecosystem of libraries that you find with most programming languages.
每个数据库供应商都有自己的存储过程语言(Oracle有PL/SQL,SQL Server有T-SQL,PostgreSQL有PL/pgSQL等)。这些语言没有跟上通用编程语言的发展,因此从今天的角度来看它们看起来非常丑陋和过时,缺乏大多数编程语言所具有的库生态系统。
-
Code running in a database is difficult to manage: compared to an application server, it’s harder to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to integrate with a metrics collection system for monitoring.
在数据库中运行的代码难以管理:与应用服务器相比,更难调试,版本控制和部署更麻烦,测试更棘手,以及难以集成指标收集系统进行监控。
-
A database is often much more performance-sensitive than an application server, because a single database instance is often shared by many application servers. A badly written stored procedure (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent badly written code in an application server.
一个数据库通常比应用服务器更注重性能,因为单个数据库实例通常由许多应用服务器共享。一个糟糕编写的存储过程(例如,使用大量内存或CPU时间)可能比应用服务器中相当糟糕的代码引起更多的麻烦。
However, those issues can be overcome. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, Datomic uses Java or Clojure, and Redis uses Lua.
然而,这些问题都可以克服。现代的存储过程实现已经放弃了PL/SQL,转而使用现有的通用编程语言:VoltDB使用Java或Groovy,Datomic使用Java或Clojure,Redis使用Lua。
With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. As they don’t need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.
通过存储过程和内存数据,使用单线程执行所有事务变得可行。因为它们不需要等待I/O,并且避免了其他并发控制机制的开销,它们可在单线程上实现相当好的吞吐量。
VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from one node to another, it executes the same stored procedure on each replica. VoltDB therefore requires that stored procedures are deterministic (when run on different nodes, they must produce the same result). If a transaction needs to use the current date and time, for example, it must do so through special deterministic APIs.
VoltDB也使用存储过程进行复制:而不是将一个事务的写操作从一个节点复制到另一个节点,它会在每个副本上执行相同的存储过程。因此,VoltDB要求存储过程是确定性的(在不同的节点上运行时,它们必须产生相同的结果)。例如,如果一个事务需要使用当前日期和时间,它必须通过特殊的确定性API来实现。
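The determinism requirement can be illustrated with a small sketch. The procedure and its parameters are hypothetical, not VoltDB's actual API: the point is that a nondeterministic input such as the current time is chosen once by the coordinator and passed in, rather than sampled independently on each replica.

```python
# Sketch of deterministic replication of a stored procedure (hypothetical
# names, not a real VoltDB API): each replica runs the same procedure with
# the same inputs, so any nondeterministic value, such as the current time,
# must be chosen once and passed in as a parameter.
from datetime import datetime

def record_login(db, user, now):
    # If each replica called datetime.now() itself, the replicas would
    # diverge; passing 'now' in keeps the procedure deterministic.
    db[user] = now

coordinator_time = datetime(2015, 1, 1, 12, 0)  # chosen once, shipped to all
replica1, replica2 = {}, {}
for replica in (replica1, replica2):
    record_login(replica, "alice", coordinator_time)

# Both replicas end up in exactly the same state.
assert replica1 == replica2
```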
Partitioning
Executing all transactions serially makes concurrency control much simpler, but limits the transaction throughput of the database to the speed of a single CPU core on a single machine. Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with high write throughput, the single-threaded transaction processor can become a serious bottleneck.
串行执行所有事务使并发控制简单得多,但把数据库的事务吞吐量限制在单台机器上单个 CPU 核心的速度。只读事务可以使用快照隔离在别处执行,但对于写入吞吐量较高的应用,单线程的事务处理器可能成为严重的瓶颈。
In order to scale to multiple CPU cores, and multiple nodes, you can potentially partition your data (see Chapter 6 ), which is supported in VoltDB. If you can find a way of partitioning your dataset so that each transaction only needs to read and write data within a single partition, then each partition can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own partition, which allows your transaction throughput to scale linearly with the number of CPU cores [ 47 ].
为了适应多个CPU核心和多个节点,你可以将你的数据分区(参见第6章),这是VoltDB支持的。如果你能找到一种分区数据集的方法,使得每个事务只需要读写单个分区内的数据,那么每个分区可以有自己独立运行的事务处理线程。在这种情况下,你可以给每个CPU核心分配一个分区,从而使你的事务吞吐量能够与CPU核心数成线性比例的增长[47]。
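The routing of single-partition transactions can be sketched as follows. The hash-based `partition_for` function and the per-partition queues are illustrative assumptions, not any particular system's implementation; in practice each queue would be drained serially by its own thread or core.

```python
# Sketch of routing single-partition transactions to per-partition serial
# queues (illustrative, not a real VoltDB mechanism): each partition
# processes its queue serially and independently, so throughput scales
# with the number of partitions, as long as a transaction touches only
# one partition.
N_PARTITIONS = 4
partition_queues = [[] for _ in range(N_PARTITIONS)]

def partition_for(key):
    return hash(key) % N_PARTITIONS

def submit(key, txn):
    # A transaction that touches only this key can run on its partition's
    # own serial loop; a cross-partition transaction would instead require
    # coordinated, lock-step execution across several queues.
    partition_queues[partition_for(key)].append(txn)

for user_id in range(100):
    submit(user_id, lambda db: None)  # placeholder transactions

total_queued = sum(len(q) for q in partition_queues)
```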
However, for any transaction that needs to access multiple partitions, the database must coordinate the transaction across all the partitions that it touches. The stored procedure needs to be performed in lock-step across all partitions to ensure serializability across the whole system.
然而,对于任何需要访问多个分区的事务,数据库必须在它所涉及的所有分区之间协调该事务。存储过程需要在所有分区上锁步(lock-step)执行,以确保整个系统的可串行性。
Since cross-partition transactions have additional coordination overhead, they are vastly slower than single-partition transactions. VoltDB reports a throughput of about 1,000 cross-partition writes per second, which is orders of magnitude below its single-partition throughput and cannot be increased by adding more machines [ 49 ].
由于跨分区事务具有额外的协调开销,它们比单分区事务慢得多。VoltDB报告的跨分区写入吞吐量约为每秒1,000次,比其单分区吞吐量低几个数量级,并且无法通过增加机器来提高[49]。
Whether transactions can be single-partition depends very much on the structure of the data used by the application. Simple key-value data can often be partitioned very easily, but data with multiple secondary indexes is likely to require a lot of cross-partition coordination (see “Partitioning and Secondary Indexes” ).
事务能否只在单个分区内执行,在很大程度上取决于应用所使用数据的结构。简单的键值数据通常很容易分区,但带有多个辅助索引的数据很可能需要大量的跨分区协调(参见"分区和辅助索引")。
Summary of serial execution
Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:
串行执行事务已成为在某些限制条件下实现可串行隔离的可行方式。
-
Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
每个事务都必须小而快,因为只要有一个缓慢的事务,就会拖慢所有的事务处理。
-
It is limited to use cases where the active dataset can fit in memory. Rarely accessed data could potentially be moved to disk, but if it needed to be accessed in a single-threaded transaction, the system would get very slow.
仅限于活跃数据集可以放入内存的场景。很少访问的数据可以移到磁盘上,但如果需要在单线程事务中访问这些数据,系统就会变得非常慢。
-
Write throughput must be low enough to be handled on a single CPU core, or else transactions need to be partitioned without requiring cross-partition coordination.
写入吞吐量必须低到单个CPU核心能够处理的程度,否则就需要对事务进行分区,并且分区后不需要跨分区协调。
-
Cross-partition transactions are possible, but there is a hard limit to the extent to which they can be used.
跨分区事务是可能的,但其使用程度有着严格的限制。
Two-Phase Locking (2PL)
For around 30 years, there was only one widely used algorithm for serializability in databases: two-phase locking (2PL).
大约30年来,数据库中只有一种广泛使用的可串行化算法:两阶段锁定(2PL)。
2PL is not 2PC
Note that while two-phase locking (2PL) sounds very similar to two-phase commit (2PC), they are completely different things. We will discuss 2PC in Chapter 9 .
请注意,虽然两阶段锁定(2PL)听起来与两阶段提交(2PC)非常相似,但它们是完全不同的事情。我们将在第9章讨论2PC。
We saw previously that locks are often used to prevent dirty writes (see “No dirty writes” ): if two transactions concurrently try to write to the same object, the lock ensures that the second writer must wait until the first one has finished its transaction (aborted or committed) before it may continue.
我们之前看到过,锁经常被用来防止脏写(参见“无脏写”):如果两个事务同时尝试写入同一对象,锁确保第二个写入者必须等待第一个完成其事务(中止或提交)才能继续。
Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:
两阶段锁定与之类似,但锁的要求强得多。只要没有人写入,就允许多个事务并发读取同一对象。但只要有人想要写入(修改或删除)一个对象,就需要独占访问:
-
If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B can’t change the object unexpectedly behind A’s back.)
如果事务A已经读取了一个对象,而事务B想要写入该对象,B必须等待A提交或中止后才能继续执行。这确保了B不能在A背后意外地更改对象。
-
If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object, like in Figure 7-1 , is not acceptable under 2PL.)
如果事务A已经写入了一个对象,事务B想要读取该对象,则B必须等待A提交或中止后才能继续。(像图7-1中读取旧版本的对象不符合2PL条件。)
In 2PL, writers don’t just block other writers; they also block readers and vice versa. Snapshot isolation has the mantra readers never block writers, and writers never block readers (see “Implementing snapshot isolation” ), which captures this key difference between snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability, it protects against all the race conditions discussed earlier, including lost updates and write skew.
在2PL中,写入者不仅会阻塞其他写入者,还会阻塞读取者,反之亦然。快照隔离的口号是"读不阻塞写,写不阻塞读"(参见"实现快照隔离"),这正是快照隔离与两阶段锁定之间的关键区别。另一方面,由于2PL提供了可串行化,它可以防止前面讨论的所有竞争条件,包括丢失更新和写偏斜。
Implementation of two-phase locking
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the repeatable read isolation level in DB2 [ 23 , 36 ].
2PL被MySQL(InnoDB)和SQL Server中的可串行化隔离级别,以及DB2中的可重复读隔离级别所使用。
The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can either be in shared mode or in exclusive mode . The lock is used as follows:
通过在数据库中的每个对象上设置锁实现读取器和写入器的阻塞。该锁可以处于共享模式或独占模式。该锁的使用如下:
-
If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.
如果一个事务想要读取一个对象,它必须首先以共享模式获取锁。多个事务可以同时持有共享锁,但如果另一个事务已经以排他锁的方式锁定了该对象,这些事务必须等待。
-
If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No other transaction may hold the lock at the same time (either in shared or in exclusive mode), so if there is any existing lock on the object, the transaction must wait.
如果事务想要写入一个对象,它必须先以独占模式获得锁。在同一时间内,没有其他事务可以以共享或独占模式持有该锁,因此如果对象已经存在锁,则该事务必须等待。
-
If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock. The upgrade works the same as getting an exclusive lock directly.
如果一个事务首先读取,然后写入一个对象,它可以将其共享锁升级为独占锁。升级的过程与直接获取独占锁相同。
-
After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort). This is where the name “two-phase” comes from: the first phase (while the transaction is executing) is when the locks are acquired, and the second phase (at the end of the transaction) is when all the locks are released.
事务获得锁之后,必须一直持有锁直到事务结束(提交或中止)。这就是"两阶段"这个名字的由来:第一阶段(事务执行期间)获取锁,第二阶段(事务结束时)释放所有锁。
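The locking rules above can be sketched as a toy lock manager (an illustration only, with hypothetical names, not any real database's implementation): shared locks may be held by many transactions at once, an exclusive lock by only one, and all locks are held until the transaction ends.

上述加锁规则可以用一个玩具锁管理器来示意(仅作说明,名称均为虚构,并非任何真实数据库的实现):共享锁可由多个事务同时持有,独占锁只能由一个事务持有,所有锁都持有到事务结束:

```python
# Sketch of a 2PL lock table: shared/exclusive locks held until commit/abort.
import threading
from collections import defaultdict

class LockManager:
    def __init__(self):
        self.cond = threading.Condition()
        self.shared = defaultdict(set)   # object -> set of txns holding shared lock
        self.exclusive = {}              # object -> txn holding exclusive lock

    def acquire_shared(self, txn, obj):
        with self.cond:
            # Wait while another transaction holds the exclusive lock.
            while obj in self.exclusive and self.exclusive[obj] != txn:
                self.cond.wait()
            self.shared[obj].add(txn)

    def acquire_exclusive(self, txn, obj):
        with self.cond:
            # Wait until no *other* txn holds the lock in any mode; a shared
            # lock held by txn itself is upgraded to exclusive.
            while (self.exclusive.get(obj, txn) != txn
                   or self.shared[obj] - {txn}):
                self.cond.wait()
            self.exclusive[obj] = txn

    def release_all(self, txn):
        # Phase 2: at commit or abort, release every lock at once.
        with self.cond:
            for holders in self.shared.values():
                holders.discard(txn)
            for obj in [o for o, t in self.exclusive.items() if t == txn]:
                del self.exclusive[obj]
            self.cond.notify_all()
```

Note that a transaction blocked in `acquire_*` simply waits; it is this waiting that produces both the safety of 2PL and the deadlocks discussed next.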
Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for transaction B to release its lock, and vice versa. This situation is called deadlock . The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.
由于有许多锁被使用,很容易发生事务A被卡在等待事务B释放锁的状态,反之亦然。这种情况称为死锁。数据库会自动检测事务之间的死锁并中止其中一个,以便其他事务可以进行。被中止的事务需要由应用程序重试。
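The application-side retry loop implied here might look like the following sketch, where `DeadlockAbort` is a stand-in for whatever exception your database driver actually raises on a deadlock or serialization failure:

这里所暗示的应用侧重试循环大致如下(`DeadlockAbort`只是一个占位,代表你的数据库驱动在死锁或串行化失败时实际抛出的异常):

```python
# Retrying a transaction that was aborted due to deadlock.
import random
import time

class DeadlockAbort(Exception):
    """Placeholder for the driver's deadlock/serialization-failure error."""

def run_with_retries(txn_fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return txn_fn()
        except DeadlockAbort:
            if attempt == max_attempts - 1:
                raise
            # Back off briefly with jitter so the competing transactions
            # don't immediately collide again.
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))
```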
Performance of two-phase locking
The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the 1970s, is performance: transaction throughput and response times of queries are significantly worse under two-phase locking than under weak isolation.
两阶段锁定的最大缺点,也是它自20世纪70年代以来没有被所有人使用的原因,就是性能:在两阶段锁定下,事务吞吐量和查询响应时间都比弱隔离下差得多。
This is partly due to the overhead of acquiring and releasing all those locks, but more importantly due to reduced concurrency. By design, if two concurrent transactions try to do anything that may in any way result in a race condition, one has to wait for the other to complete.
这部分是由于获取与释放所有锁的开销,但更重要的是由于并发性降低。按设计,如果两个并发的事务尝试做可能以任何方式导致竞争条件的任何事情,其中一个必须等待另一个完成。
Traditional relational databases don’t limit the duration of a transaction, because they are designed for interactive applications that wait for human input. Consequently, when one transaction has to wait on another, there is no limit on how long it may have to wait. Even if you make sure that you keep all your transactions short, a queue may form if several transactions want to access the same object, so a transaction may have to wait for several others to complete before it can do anything.
传统的关系型数据库不限制事务的持续时间,因为它们是为等待人类输入的交互式应用设计的。因此,当一个事务需要等待另一个事务时,等待时间没有上限。即使你确保所有事务都很短,当多个事务想要访问同一个对象时也可能形成队列,因此一个事务可能需要等待其他几个事务完成之后才能做任何事情。
For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at high percentiles (see “Describing Performance” ) if there is contention in the workload. It may take just one slow transaction, or one transaction that accesses a lot of data and acquires many locks, to cause the rest of the system to grind to a halt. This instability is problematic when robust operation is required.
因此,运行2PL的数据库的延迟可能相当不稳定,并且如果工作负载中存在争用,它们在高百分位点可能非常慢(参见"描述性能")。可能只需要一个缓慢的事务,或者一个访问大量数据并获取许多锁的事务,就能使系统的其余部分陷入停顿。当需要稳健运行时,这种不稳定性是个问题。
Although deadlocks can happen with the lock-based read committed isolation level, they occur much more frequently under 2PL serializable isolation (depending on the access patterns of your transaction). This can be an additional performance problem: when a transaction is aborted due to deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can mean significant wasted effort.
虽然基于锁的"读提交"隔离级别也可能发生死锁,但在2PL可串行化隔离下(取决于事务的访问模式),死锁发生的频率要高得多。这可能带来额外的性能问题:当一个事务因死锁而被中止并重试时,它需要从头重做所有工作。如果死锁频繁发生,这可能意味着大量的工作被白白浪费。
Predicate locks
In the preceding description of locks, we glossed over a subtle but important detail. In “Phantoms causing write skew” we discussed the problem of phantoms —that is, one transaction changing the results of another transaction’s search query. A database with serializable isolation must prevent phantoms.
在前面对锁的描述中,我们忽略了一个微妙但重要的细节。在"幻读导致写偏斜"中,我们讨论了幻读的问题,即一个事务改变了另一个事务搜索查询的结果。具有可串行化隔离的数据库必须防止幻读。
In the meeting room booking example this means that if one transaction has searched for existing bookings for a room within a certain time window (see Example 7-2 ), another transaction is not allowed to concurrently insert or update another booking for the same room and time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a different time that doesn’t affect the proposed booking.)
在会议室预订的例子中,这意味着如果一个事务搜索了某个房间在特定时间窗口内的现有预订(参见示例7-2),则不允许另一个事务并发地插入或更新同一房间同一时间段的另一个预订。(可以并发插入其他房间的预订,或者同一房间在不影响该预订的其他时间的预订。)
How do we implement this? Conceptually, we need a predicate lock [ 3 ]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition, such as:
我们如何实现这个?从概念上讲,我们需要一个谓词锁[3]。它的工作方式类似于之前描述的共享/独占锁,但它不属于特定对象(例如,在表中的一行),而是属于所有符合某些搜索条件的对象,例如:
SELECT * FROM bookings
WHERE room_id = 123
  AND end_time > '2018-01-01 12:00'
  AND start_time < '2018-01-01 13:00';
A predicate lock restricts access as follows:
谓词锁限制访问如下:
-
If transaction A wants to read objects matching some condition, like in that SELECT query, it must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B currently has an exclusive lock on any object matching those conditions, A must wait until B releases its lock before it is allowed to make its query.
如果事务A想要读取匹配某个条件的对象,就像那个SELECT查询中那样,它必须在查询条件上获取共享模式的谓词锁。如果另一个事务B当前持有匹配这些条件的任何对象的独占锁,A必须等到B释放锁之后才被允许执行查询。
-
If transaction A wants to insert, update, or delete any object, it must first check whether either the old or the new value matches any existing predicate lock. If there is a matching predicate lock held by transaction B, then A must wait until B has committed or aborted before it can continue.
如果事务A想要插入、更新或删除任何对象,它必须首先检查旧值或新值是否匹配任何现有的谓词锁。如果事务B持有匹配的谓词锁,那么A必须等到B提交或中止之后才能继续。
The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.
这里的关键思想是,谓词锁甚至适用于数据库中尚不存在、但将来可能被添加的对象(幻读)。如果两阶段锁定包含谓词锁,数据库就能防止所有形式的写偏斜和其他竞争条件,因此其隔离变得可串行化。
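Conceptually, a predicate lock table could be sketched like this (illustrative only; real databases represent predicates symbolically rather than as functions): a lock covers every object matching a condition, and a writer checks both the old and the new value of a row against all held predicates.

从概念上讲,谓词锁表可以这样示意(仅作说明;真实数据库以符号形式而非函数来表示谓词):一个锁覆盖匹配某条件的所有对象,写入者用行的旧值和新值去检查所有已持有的谓词:

```python
# Sketch of a predicate lock table. A predicate is a function over a row,
# so it also "covers" rows that don't exist yet (phantoms).

class PredicateLockTable:
    def __init__(self):
        self.locks = []   # list of (txn, predicate) pairs

    def lock_predicate(self, txn, predicate):
        self.locks.append((txn, predicate))

    def conflicting_txns(self, txn, old_row, new_row):
        """Transactions whose predicate matches the old or new value of a
        row that `txn` wants to insert, update, or delete."""
        conflicts = set()
        for holder, pred in self.locks:
            if holder == txn:
                continue
            if (old_row is not None and pred(old_row)) or \
               (new_row is not None and pred(new_row)):
                conflicts.add(holder)
        return conflicts
```

With the booking query's condition as the predicate, an insert of an overlapping booking for room 123 conflicts, while a booking for another room does not.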
Index-range locks
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions, checking for matching locks becomes time-consuming. For that reason, most databases with 2PL actually implement index-range locking (also known as next-key locking ), which is a simplified approximation of predicate locking [ 41 , 50 ].
不幸的是,谓词锁的性能并不好:如果活跃事务持有很多锁,检查匹配的锁就会变得非常耗时。因此,大多数使用2PL的数据库实际上实现的是索引范围锁(也称为next-key锁),它是谓词锁的一种简化近似[41, 50]。
It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just room 123) between noon and 1 p.m. This is safe, because any write that matches the original predicate will definitely also match the approximations.
将谓词简化,使其与更大的对象集匹配是安全的。例如,如果你有一个谓词锁定了房间123在中午至下午1点之间的预订,你可以通过锁定任何时间的房间123的预订或锁定在中午至下午1点之间的所有房间(不仅仅是房间123)来近似它。这是安全的,因为与原始谓词匹配的任何写入肯定也与近似匹配。
In the room bookings database you would probably have an index on the room_id column, and/or indexes on start_time and end_time (otherwise the preceding query would be very slow on a large database):
在房间预订数据库中,你可能会在room_id列上建立索引,和/或在start_time和end_time上建立索引(否则在大型数据库上,前面的查询会非常慢):
-
Say your index is on room_id, and the database uses this index to find existing bookings for room 123. Now the database can simply attach a shared lock to this index entry, indicating that a transaction has searched for bookings of room 123.
假设你的索引在room_id列上,数据库使用这个索引来查找房间123的现有预订。现在数据库可以简单地将一个共享锁附加到这个索引条目上,表示有一个事务搜索过房间123的预订。
-
Alternatively, if the database uses a time-based index to find existing bookings, it can attach a shared lock to a range of values in that index, indicating that a transaction has searched for bookings that overlap with the time period of noon to 1 p.m. on January 1, 2018.
或者,如果数据库使用基于时间的索引来查找现有预订,则可以将共享锁附加到该索引中的一系列值上,指示事务已搜索与2018年1月1日中午到下午1点的时间段重叠的预订。
Either way, an approximation of the search condition is attached to one of the indexes. Now, if another transaction wants to insert, update, or delete a booking for the same room and/or an overlapping time period, it will have to update the same part of the index. In the process of doing so, it will encounter the shared lock, and it will be forced to wait until the lock is released.
无论哪种方式,搜索条件的近似值都会被附加到其中一个索引上。现在,如果另一个事务想要插入、更新或删除同一房间和/或重叠时间段的预订,它将不得不更新索引的同一部分。在这个过程中,它将遇到共享锁,并被迫等待直到锁被释放。
This provides effective protection against phantoms and write skew. Index-range locks are not as precise as predicate locks would be (they may lock a bigger range of objects than is strictly necessary to maintain serializability), but since they have much lower overheads, they are a good compromise.
这提供了对幻读和写偏斜的有效防护。索引范围锁不如谓词锁精确(它们锁定的对象范围可能比维持可串行化所严格需要的更大),但由于其开销低得多,它们是一种不错的折衷。
If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table. This will not be good for performance, since it will stop all other transactions writing to the table, but it’s a safe fallback position.
如果没有适合的索引来附加一个范围锁,则数据库可以回退到对整个表进行共享锁。这对性能来说不是很好,因为它会停止所有其他正在对表进行写入的事务,但这是一种安全的备份方案。
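The safe widening described above, locking all bookings for room 123 at any time via its room_id index entry, can be sketched as a lock on a contiguous range of index keys (a toy illustration, not a real implementation):

上面描述的安全放宽近似,即通过room_id索引条目锁定房间123在任何时间的所有预订,可以示意为对一段连续索引键范围的锁(玩具示例,并非真实实现):

```python
# Sketch of index-range locks: a lock covers an inclusive range of index
# keys; any write whose key falls inside a locked range conflicts with
# the reader that attached the lock.

class IndexRangeLocks:
    def __init__(self):
        self.ranges = []   # (txn, lo, hi) inclusive key ranges

    def lock_range(self, txn, lo, hi):
        self.ranges.append((txn, lo, hi))

    def conflicts(self, txn, key):
        return {t for (t, lo, hi) in self.ranges
                if t != txn and lo <= key <= hi}

locks = IndexRangeLocks()
# Reader A searched bookings of room 123 from noon to 1 p.m.; approximate
# the predicate by locking the room_id index entry for room 123 entirely
# (all times) -- a superset of the original predicate, hence safe.
locks.lock_range("A", 123, 123)
```

Checking a candidate write is now a cheap range comparison instead of evaluating an arbitrary predicate, which is the point of the approximation.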
Serializable Snapshot Isolation (SSI)
This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we have implementations of serializability that don’t perform well (two-phase locking) or don’t scale well (serial execution). On the other hand, we have weak isolation levels that have good performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are serializable isolation and good performance fundamentally at odds with each other?
这一章描绘了数据库并发控制的一幅黯淡图景。一方面,我们有性能不好(两阶段锁定)或扩展性不好(串行执行)的可串行化实现。另一方面,我们有性能良好的弱隔离级别,但它们容易出现各种竞争条件(丢失更新、写偏斜、幻读等)。可串行化隔离与良好的性能是否从根本上相互矛盾?
Perhaps not: an algorithm called serializable snapshot isolation (SSI) is very promising. It provides full serializability, but has only a small performance penalty compared to snapshot isolation. SSI is fairly new: it was first described in 2008 [ 40 ] and is the subject of Michael Cahill’s PhD thesis [ 51 ].
也许不是:一种叫做可串行化快照隔离(SSI)的算法非常有前途。它提供完全的可串行化,但与快照隔离相比只有很小的性能损失。SSI相当新:它在2008年首次被提出[40],并且是Michael Cahill博士论文[51]的主题。
Today SSI is used both in single-node databases (the serializable isolation level in PostgreSQL since version 9.1 [ 41 ]) and distributed databases (FoundationDB uses a similar algorithm). As SSI is so young compared to other concurrency control mechanisms, it is still proving its performance in practice, but it has the possibility of being fast enough to become the new default in the future.
今天,SSI既用于单节点数据库(PostgreSQL自9.1版本起的可串行化隔离级别[41]),也用于分布式数据库(FoundationDB使用类似的算法)。与其他并发控制机制相比,SSI还很年轻,它仍在实践中证明自己的性能,但它有可能快到足以在未来成为新的默认选项。
Pessimistic versus optimistic concurrency control
Two-phase locking is a so-called pessimistic concurrency control mechanism: it is based on the principle that if anything might possibly go wrong (as indicated by a lock held by another transaction), it’s better to wait until the situation is safe again before doing anything. It is like mutual exclusion , which is used to protect data structures in multi-threaded programming.
两阶段锁定是一种所谓的悲观并发控制机制:它基于这样的原则,即如果有任何事情可能出错(如另一个事务持有的锁所示),那么最好等到情况再次安全之后再做任何事情。这就像多线程编程中用来保护数据结构的互斥(mutual exclusion)。
Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each transaction having an exclusive lock on the entire database (or one partition of the database) for the duration of the transaction. We compensate for the pessimism by making each transaction very fast to execute, so it only needs to hold the “lock” for a short time.
串行执行在某种意义上是悲观到了极致:它本质上等价于每个事务在其持续期间对整个数据库(或数据库的一个分区)持有独占锁。我们通过让每个事务执行得非常快来补偿这种悲观,这样它只需要短时间持有"锁"。
By contrast, serializable snapshot isolation is an optimistic concurrency control technique. Optimistic in this context means that instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions that executed serializably are allowed to commit.
相比之下,可串行化快照隔离是一种乐观的并发控制技术。在这个上下文中,乐观意味着即使可能发生危险的事情,事务也照样继续执行,寄希望于最终一切都会好起来。当一个事务想要提交时,数据库会检查是否发生了什么不好的事情(即隔离是否被违反);如果是,该事务将被中止并需要重试。只有以可串行化方式执行的事务才被允许提交。
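A minimal sketch of optimistic commit-time validation (a simplified version-check scheme for illustration, not PostgreSQL's actual SSI algorithm): each object carries a version number, a transaction records the versions it read and proceeds without blocking, and at commit it aborts if anything it read has changed.

乐观的提交时验证的最小示意(一个用于说明的简化版本检查方案,并非PostgreSQL实际的SSI算法):每个对象带有版本号,事务记录其读取的版本并且不加阻塞地继续执行,提交时如果它读过的任何内容发生了变化则中止:

```python
# Sketch of optimistic concurrency control with commit-time validation.

class OCCStore:
    def __init__(self):
        self.data = {}      # key -> (value, version)

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        value, version = self.data.get(key, (None, 0))
        txn["reads"][key] = version   # remember the premise we acted on
        return value

    def write(self, txn, key, value):
        txn["writes"][key] = value    # buffered until commit

    def commit(self, txn):
        # Validation: did anything we read change in the meantime?
        for key, seen_version in txn["reads"].items():
            if self.data.get(key, (None, 0))[1] != seen_version:
                return False          # abort; the caller should retry
        for key, value in txn["writes"].items():
            _, version = self.data.get(key, (None, 0))
            self.data[key] = (value, version + 1)
        return True
```

Two transactions that both read and then write the same key illustrate "first committer wins": the second commit fails validation and must retry.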
Optimistic concurrency control is an old idea [ 52 ], and its advantages and disadvantages have been debated for a long time [ 53 ]. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make performance worse.
乐观并发控制是一个古老的想法[52],它的优缺点已经争论了很长时间[53]。如果存在高度争用(许多事务试图访问相同的对象),它的表现会很差,因为这会导致很高比例的事务需要中止。如果系统已经接近最大吞吐量,重试事务带来的额外负载会使性能变得更差。
However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention can be reduced with commutative atomic operations: for example, if several transactions concurrently want to increment a counter, it doesn’t matter in which order the increments are applied (as long as the counter isn’t read in the same transaction), so the concurrent increments can all be applied without conflicting.
然而,如果有足够的备用容量,而且事务之间的争用不是太高,乐观并发控制技术的性能往往优于悲观的技术。通过交换原子操作可以减少争用:例如,如果几个事务同时想要递增一个计数器,递增的顺序并不重要(只要在同一事务中没有读取计数器),因此,可以应用所有并发的递增而不会产生冲突。
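The counter example can be shown concretely with SQLite: expressing the increment as a single UPDATE keeps the read-modify-write inside the database, so there is no value read into the application that could be overwritten, and increments from different transactions commute.

计数器的例子可以用SQLite具体演示:把递增表达为单条UPDATE,使读取-修改-写入留在数据库内部,应用程序不会读出一个可能被覆盖的值,来自不同事务的递增可以任意交换顺序:

```python
# Commutative atomic increment: no application-side read-modify-write.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('page_views', 0)")

def increment(name):
    # The increment happens atomically inside the database; the order in
    # which concurrent increments are applied does not matter.
    conn.execute("UPDATE counters SET value = value + 1 WHERE name = ?",
                 (name,))

for _ in range(10):
    increment("page_views")
value, = conn.execute(
    "SELECT value FROM counters WHERE name = 'page_views'").fetchone()
```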
As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction are made from a consistent snapshot of the database (see “Snapshot Isolation and Repeatable Read” ). This is the main difference compared to earlier optimistic concurrency control techniques. On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort.
正如其名称所示,SSI基于快照隔离,即事务中的所有读取均从数据库的一致快照中进行(请参见“快照隔离和可重复读”)。这是与早期乐观并发控制技术相比的主要差异。在快照隔离的基础上,SSI添加了一种算法,用于检测写入之间的序列化冲突并确定要中止哪些事务。
Decisions based on an outdated premise
When we previously discussed write skew in snapshot isolation (see “Write Skew and Phantoms” ), we observed a recurring pattern: a transaction reads some data from the database, examines the result of the query, and decides to take some action (write to the database) based on the result that it saw. However, under snapshot isolation, the result from the original query may no longer be up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
在之前讨论的快照隔离中的写倾斜(见“写倾斜和幻读”)中,我们观察到一个经常出现的模式:一个事务从数据库读取一些数据,检查查询结果,并根据所看到的结果来采取某些操作(写入数据库)。但是,在快照隔离下,原始查询的结果在事务提交时可能已经不再是最新的,因为数据可能已在此期间被修改。
Put another way, the transaction is taking an action based on a premise (a fact that was true at the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the transaction wants to commit, the original data may have changed—the premise may no longer be true.
换句话说,事务基于一个前提采取行动(即事务开始时为真的事实,例如"当前有两名医生在值班")。之后,当事务想要提交时,原始数据可能已经改变,前提可能不再成立。
When the application makes a query (e.g., “How many doctors are currently on call?”), the database doesn’t know how the application logic uses the result of that query. To be safe, the database needs to assume that any change in the query result (the premise) means that writes in that transaction may be invalid. In other words, there may be a causal dependency between the queries and the writes in the transaction. In order to provide serializable isolation, the database must detect situations in which a transaction may have acted on an outdated premise and abort the transaction in that case.
当应用发出查询(例如"当前有多少医生在值班?")时,数据库并不知道应用逻辑如何使用该查询的结果。为了安全起见,数据库需要假设查询结果(前提)的任何变化都意味着该事务中的写入可能无效。换句话说,事务中的查询和写入之间可能存在因果依赖。为了提供可串行化隔离,数据库必须检测事务可能基于过时前提采取行动的情况,并在这种情况下中止该事务。
How does the database know if a query result might have changed? There are two cases to consider:
数据库如何判断查询结果是否可能已经改变?这里需要考虑两种情况:
-
Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
检测到读取了过期的MVCC对象版本(未提交写入在读取之前发生)。
-
Detecting writes that affect prior reads (the write occurs after the read)
检测影响先前读取的写入(写入发生在读取之后)。
Detecting stale MVCC reads
Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC; see Figure 7-10). When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed at the time when the snapshot was taken. In Figure 7-10, transaction 43 sees Alice as having on_call = true, because transaction 42 (which modified Alice’s on-call status) is uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already committed. This means that the write that was ignored when reading from the consistent snapshot has now taken effect, and transaction 43’s premise is no longer true.
回想一下,快照隔离通常通过多版本并发控制(MVCC)来实现(见图7-10)。当事务从MVCC数据库的一致快照中读取时,它会忽略在创建快照时尚未提交的其他事务所做的写入。在图7-10中,事务43看到Alice的on_call = true,因为事务42(修改了Alice的值班状态)尚未提交。然而,当事务43想要提交时,事务42已经提交了。这意味着在从一致快照读取时被忽略的写入现在已经生效,事务43的前提不再成立。
In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.
为了防止这种异常情况,数据库需要跟踪何时事务由于MVCC可见性规则而忽略了另一个事务的写操作。当事务想要提交时,数据库会检查是否有任何被忽略的写操作现在已经被提交。如果是,则必须中止该事务。
Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot isolation’s support for long-running reads from a consistent snapshot.
为什么要等到提交时才检查?为什么不在检测到陈旧读取时立即中止事务43?如果事务43是只读事务,它就不需要被中止,因为不存在写偏斜的风险。在事务43进行读取的时刻,数据库还不知道该事务之后是否会执行写入。此外,事务42在事务43提交时可能还会中止,或者可能仍未提交,因此这次读取最终可能并不陈旧。通过避免不必要的中止,SSI保留了快照隔离对从一致快照进行长时间运行读取的支持。
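This rule can be sketched as simple bookkeeping (an illustration only, not PostgreSQL's actual data structures): record which writers a transaction's snapshot ignored, and at commit time abort if any of those writers has meanwhile committed.

这条规则可以示意为简单的簿记(仅作说明,并非PostgreSQL实际的数据结构):记录一个事务的快照忽略了哪些写入者,在提交时,如果其中任何写入者已经提交,则中止:

```python
# Sketch: detecting stale MVCC reads at commit time.

class StaleReadTracker:
    def __init__(self):
        self.committed = set()   # ids of committed transactions
        self.ignored = {}        # txn -> writers whose writes its snapshot hid

    def begin(self, txn):
        self.ignored[txn] = set()

    def note_ignored_write(self, txn, writer):
        # Called when txn's snapshot hides writer's uncommitted write.
        self.ignored[txn].add(writer)

    def commit(self, txn, read_only=False):
        # Read-only transactions never need to abort here: no write skew risk.
        if not read_only and self.ignored[txn] & self.committed:
            return False         # premise may be outdated: abort
        self.committed.add(txn)
        return True
```

Replaying the Figure 7-10 scenario: transaction 43's snapshot hid transaction 42's write; 42 commits first, so 43's commit is refused.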
Detecting writes that affect prior reads
The second case to consider is when another transaction modifies data after it has been read. This case is illustrated in Figure 7-11 .
需要考虑的第二种情况是另一个事务在数据被读取之后修改了数据。这种情况如图7-11所示。
In the context of two-phase locking we discussed index-range locks (see “Index-range locks”), which allow the database to lock access to all rows matching some search query, such as WHERE shift_id = 1234. We can use a similar technique here, except that SSI locks don’t block other transactions.
在两阶段锁定的上下文中,我们讨论了索引范围锁(参见"索引范围锁"),它允许数据库锁定对匹配某个搜索查询的所有行的访问,例如WHERE shift_id = 1234。我们可以在这里使用类似的技术,只不过SSI的锁不会阻塞其他事务。
In Figure 7-11, transactions 42 and 43 both search for on-call doctors during shift 1234. If there is an index on shift_id, the database can use the index entry 1234 to record the fact that transactions 42 and 43 read this data. (If there is no index, this information can be tracked at the table level.) This information only needs to be kept for a while: after a transaction has finished (committed or aborted), and all concurrent transactions have finished, the database can forget what data it read.
在图7-11中,事务42和43都在查找班次1234期间值班的医生。如果shift_id上有索引,数据库可以用索引条目1234来记录事务42和43读取过这份数据的事实。(如果没有索引,这些信息可以在表级别跟踪。)这些信息只需保留一段时间:当一个事务结束(提交或中止),并且所有并发事务也都结束之后,数据库就可以忘记它读取过哪些数据。
When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the transactions that the data they read may no longer be up to date.
当一个事务向数据库写入时,它必须在索引中查找最近读取过受影响数据的任何其他事务。这个过程类似于在受影响的键范围上获取写锁,但它不会阻塞到读取者提交为止,而是像绊线(tripwire)一样:它只是通知这些事务,它们读取的数据可能已经不是最新的了。
In Figure 7-11 , transaction 43 notifies transaction 42 that its prior read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect. However, when transaction 43 wants to commit, the conflicting write from 42 has already been committed, so 43 must abort.
在图7-11中,事务43通知事务42其先前的读取已过时,反之亦然。事务42首先提交,并且成功:尽管事务43的写入会影响42,但43尚未提交,因此写入尚未生效。然而,当事务43想要提交时,42的冲突写入已经提交,因此43必须中止。
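The tripwire mechanism can be sketched like this (an illustration only; real SSI implementations track reads per index entry with more refined rules): writers never block, they just mark prior readers as conflicted, and a conflicted transaction aborts only if a conflicting writer committed first.

绊线机制可以这样示意(仅作说明;真实的SSI实现以更精细的规则按索引条目跟踪读取):写入者从不阻塞,只是把先前的读取者标记为存在冲突,而存在冲突的事务只有在冲突写入者先行提交时才会中止:

```python
# Sketch: detecting writes that affect prior reads (the SSI "tripwire").

class ReadTracker:
    def __init__(self):
        self.readers = {}      # index entry -> txns that read it
        self.conflicted = {}   # txn -> txns that later wrote what it read
        self.committed = set()

    def begin(self, txn):
        self.conflicted[txn] = set()

    def read(self, txn, entry):
        self.readers.setdefault(entry, set()).add(txn)

    def write(self, txn, entry):
        # Trip the wire: notify prior readers, but don't block them.
        for reader in self.readers.get(entry, set()):
            if reader != txn:
                self.conflicted[reader].add(txn)

    def commit(self, txn):
        # Abort only if a conflicting writer has already committed.
        if self.conflicted[txn] & self.committed:
            return False
        self.committed.add(txn)
        return True
```

Replaying Figure 7-11: both transactions read the shift_id = 1234 entry and then write it; the first to commit succeeds, and the second must abort.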
Performance of serializable snapshot isolation
As always, many engineering details affect how well an algorithm works in practice. For example, one trade-off is the granularity at which transactions’ reads and writes are tracked. If the database keeps track of each transaction’s activity in great detail, it can be precise about which transactions need to abort, but the bookkeeping overhead can become significant. Less detailed tracking is faster, but may lead to more transactions being aborted than strictly necessary.
一如既往,许多工程细节会影响算法在实践中的表现。例如,一个权衡是跟踪事务读写的粒度。如果数据库非常详细地跟踪每个事务的活动,它可以精确地确定哪些事务需要中止,但簿记开销可能变得显著。不那么详细的跟踪更快,但可能导致比严格必要更多的事务被中止。
In some cases, it’s okay for a transaction to read information that was overwritten by another transaction: depending on what else happened, it’s sometimes possible to prove that the result of the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of unnecessary aborts [ 11 , 41 ].
在某些情况下,事务读取被另一个事务覆盖的信息是可以的:根据其他情况,有时可以证明执行结果仍然是可串行化的。 PostgreSQL使用这个理论来减少不必要的终止[11, 41]。
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa. This design principle makes query latency much more predictable and less variable. In particular, read-only queries can run on a consistent snapshot without requiring any locks, which is very appealing for read-heavy workloads.
与两阶段锁定相比,可串行化快照隔离的最大优点是一个事务不需要阻塞等待另一个事务持有的锁。和快照隔离下一样,写入者不阻塞读取者,反之亦然。这个设计原则使查询延迟更可预测、波动更小。特别是,只读查询可以在一致快照上运行而无需任何锁,这对读密集的工作负载非常有吸引力。
Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a single CPU core: FoundationDB distributes the detection of serialization conflicts across multiple machines, allowing it to scale to very high throughput. Even though data may be partitioned across multiple machines, transactions can read and write data in multiple partitions while ensuring serializable isolation [ 54 ].
与串行执行相比,可串行化快照隔离不受单个CPU核心吞吐量的限制:FoundationDB将串行化冲突的检测分布在多台机器上,使其能够扩展到非常高的吞吐量。即使数据可能跨多台机器分区,事务也可以在多个分区中读写数据,同时保证可串行化隔离[54]。
The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so SSI requires that read-write transactions be fairly short (long-running read-only transactions may be okay). However, SSI is probably less sensitive to slow transactions than two-phase locking or serial execution.
中止率显著影响SSI的整体性能。例如,长时间读写数据的事务很可能遇到冲突并中止,因此SSI要求读写事务相当短(长时间运行的只读事务可能没问题)。不过,SSI对慢事务的敏感度可能低于两阶段锁定或串行执行。
Summary
Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist. A large class of errors is reduced down to a simple transaction abort , and the application just needs to try again.
事务是一个抽象层,允许应用程序假装某些并发问题和某些硬件和软件故障不存在。许多错误被降低到简单的事务中止,并且应用程序只需要重试即可。
In this chapter we saw many examples of problems that transactions help prevent. Not all applications are susceptible to all those problems: an application with very simple access patterns, such as reading and writing only a single record, can probably manage without transactions. However, for more complex access patterns, transactions can hugely reduce the number of potential error cases you need to think about.
在本章中,我们看到了许多事务帮助防止的问题的例子。并非所有应用都容易受到所有这些问题的影响:访问模式非常简单的应用(例如只读写单条记录)可能不需要事务也能应付。但是,对于更复杂的访问模式,事务可以大大减少你需要考虑的潜在错误情况的数量。
Without transactions, various error scenarios (processes crashing, network interruptions, power outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various ways. For example, denormalized data can easily go out of sync with the source data. Without transactions, it becomes very difficult to reason about the effects that complex interacting accesses can have on the database.
没有事务,各种错误情况(进程崩溃、网络中断、停电、磁盘满、意外并发等)意味着数据可能以各种不一致的方式变得不一致。例如,非规范化数据很容易与源数据不同步。没有事务,很难推断复杂交互访问对数据库的影响。
In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular read committed , snapshot isolation (sometimes called repeatable read ), and serializable . We characterized those isolation levels by discussing various examples of race conditions:
在这一章中,我们深入探讨了并发控制的主题。我们讨论了几种广泛使用的隔离级别,特别是读提交、快照隔离(有时也称为可重复读)和可串行化。我们通过讨论各种竞争条件的例子来表征这些隔离级别。
- Dirty reads
-
One client reads another client’s writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads.
一个客户端在另一个客户端写入数据未被提交前就读取,而读提交隔离级别及更强力的级别可以防止脏读。
- Dirty writes
-
One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes.
一个客户端覆盖了另一个客户端尚未提交的数据,几乎所有的事务实现都会防止脏写操作。
- Read skew (nonrepeatable reads)
-
A client sees different parts of the database at different points in time. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot at one point in time. It is usually implemented with multi-version concurrency control (MVCC).
客户端在不同时间点看到的数据库部分不同。这个问题最常见地通过快照隔离来解决,它允许事务在某个时间点从一致的快照中读取。通常使用多版本并发控制(MVCC)实现。
- Lost updates
-
Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (SELECT FOR UPDATE).
两个客户端并发执行读取-修改-写入循环。其中一个覆盖了另一个的写入而没有合并其更改,导致数据丢失。某些快照隔离的实现会自动防止这种异常,而另一些则需要手动加锁(SELECT FOR UPDATE)。
- Write skew
-
A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. Only serializable isolation prevents this anomaly.
一个事务读取某些数据,根据所读到的值做出决定,并将该决定写入数据库。然而,在进行写入时,决定的前提已不再成立。只有可串行化隔离才能防止这种异常。
- Phantom reads
-
A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.
一个事务读取符合某些搜索条件的对象。另一个客户端执行写入,影响了该搜索的结果。快照隔离可以防止简单的幻读,但写偏差场景中的幻读需要特殊处理,例如索引范围锁。
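The snapshot-isolation mechanism mentioned above (MVCC) can be illustrated with a toy sketch. This is a hypothetical in-memory store, not any real database's implementation: every write appends a new version tagged with the writer's transaction ID, and a reader sees only versions created by its own transaction or by transactions that had committed when its snapshot was taken.

```python
# A minimal MVCC sketch (assumed, simplified): versions are append-only, and
# visibility is decided by the snapshot taken at transaction start.

class MVCCStore:
    def __init__(self):
        self.versions = {}      # key -> list of (writer_txid, value)
        self.next_txid = 1
        self.committed = set()  # txids of committed transactions

    def begin(self):
        txid = self.next_txid
        self.next_txid += 1
        # the snapshot is the set of transactions committed at begin time
        return txid, set(self.committed)

    def write(self, txid, key, value):
        self.versions.setdefault(key, []).append((txid, value))

    def commit(self, txid):
        self.committed.add(txid)

    def read(self, txid, snapshot, key):
        # newest version first: see your own writes, plus the snapshot's
        for writer, value in reversed(self.versions.get(key, [])):
            if writer == txid or writer in snapshot:
                return value
        return None
```

A transaction that began before a concurrent writer committed keeps seeing the old value, which is exactly the read-skew protection described above. Garbage collection of old versions (like PostgreSQL's vacuum) is omitted.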
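The lost-update cycle described above can be sketched in a few lines of plain Python (no real database involved). The `threading.Lock` at the end stands in for the row lock a database would take on `SELECT ... FOR UPDATE`:

```python
# A toy sketch of the lost-update anomaly: two clients interleave
# read-modify-write cycles on a counter, and one increment vanishes.

import threading

counter = {"value": 0}

# Unsafe: both clients read the old value before either writes.
seen_by_a = counter["value"]        # client A reads 0
seen_by_b = counter["value"]        # client B reads 0
counter["value"] = seen_by_a + 1    # A writes 1
counter["value"] = seen_by_b + 1    # B writes 1, clobbering A's update

lost_update_result = counter["value"]   # 1, not 2: one update was lost

# Safe: serialize the whole read-modify-write cycle with a lock.
row_lock = threading.Lock()

def safe_increment():
    with row_lock:                  # analogous to SELECT ... FOR UPDATE
        counter["value"] = counter["value"] + 1

safe_increment()
safe_increment()
```

With the lock held across the cycle, the second client cannot read until the first has written, so both increments survive.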
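Write skew, as defined above, can be demonstrated with the classic on-call doctors scenario. This is a plain-Python simulation of two snapshot-isolated transactions, under the assumed invariant "at least one doctor on call":

```python
# A toy sketch of write skew: each transaction checks the invariant against
# its own snapshot, so both checks pass, both writes happen, and the
# invariant ends up violated.

on_call = {"alice", "bob"}

# Both transactions take a snapshot before either one writes.
snapshot_1 = set(on_call)
snapshot_2 = set(on_call)

# Transaction 1: Alice goes off call because someone else is still on call.
if len(snapshot_1) >= 2:
    on_call.discard("alice")

# Transaction 2: Bob does the same, but his premise is now stale.
if len(snapshot_2) >= 2:
    on_call.discard("bob")

# Result: nobody is on call, even though each check looked safe.
```

Neither transaction wrote to an object the other wrote, so dirty-write and lost-update detection do not fire; only serializable isolation catches this.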
Weak isolation levels protect against some of those anomalies but leave you, the application developer, to handle others manually (e.g., using explicit locking). Only serializable isolation protects against all of these issues. We discussed three different approaches to implementing serializable transactions:
弱隔离级别可以防止其中一些异常,但其余异常需要您(应用程序开发者)手动处理(例如,使用显式加锁)。只有可串行化隔离才能防止所有这些问题。我们讨论了实现可串行化事务的三种不同方法:
- Literally executing transactions in a serial order
-
If you can make each transaction very fast to execute, and the transaction throughput is low enough to process on a single CPU core, this is a simple and effective option.
如果您能够使每个事务的执行都非常快,并且事务吞吐量低到单个CPU核心就可以处理,那么这是一种简单而有效的选择。
- Two-phase locking
-
For decades this has been the standard way of implementing serializability, but many applications avoid using it because of its performance characteristics.
几十年来,这一直是实现可串行化的标准方式,但由于其性能特性,许多应用程序避免使用它。
- Serializable snapshot isolation (SSI)
-
A fairly new algorithm that avoids most of the downsides of the previous approaches. It uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction wants to commit, it is checked, and it is aborted if the execution was not serializable.
一种相当新的算法,避免了前面方法的大部分缺点。它采用乐观方式,允许事务在不阻塞的情况下继续执行。当事务想要提交时进行检查,如果执行不可串行化,该事务就会被中止。
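The first approach, literally serial execution, can be sketched as follows. This is a hypothetical toy, not VoltDB or any real system: transactions are submitted as whole functions (like stored procedures), and a single worker thread runs them one at a time, so each one is trivially atomic and isolated without any locks (error handling omitted for brevity):

```python
# A toy sketch of serial transaction execution on a single worker thread.

import queue
import threading

class SerialExecutor:
    def __init__(self, state):
        self.state = state
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            txn, result_box = self._queue.get()
            # the whole transaction runs to completion before the next starts
            result_box["result"] = txn(self.state)
            self._queue.task_done()

    def submit(self, txn):
        result_box = {}
        self._queue.put((txn, result_box))
        self._queue.join()   # wait for the single worker to finish it
        return result_box["result"]
```

Because no two transactions ever interleave, none of the anomalies above can occur; the cost, as the text notes, is that throughput is limited to what one core can process.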
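The core of two-phase locking can also be sketched. The lock manager below is an assumed simplification: shared locks for reads, exclusive locks for writes, everything released only at commit (strict 2PL), and a "no-wait" policy that aborts on conflict instead of blocking (real implementations block and detect deadlocks):

```python
# A toy sketch of strict two-phase locking with shared/exclusive lock modes.

class LockManager:
    def __init__(self):
        self.locks = {}  # key -> (mode "S"|"X", set of holder txids)

    def acquire(self, txid, key, mode):
        held = self.locks.get(key)
        if held is None:
            self.locks[key] = (mode, {txid})
            return
        held_mode, holders = held
        if holders == {txid}:
            # sole holder may upgrade its shared lock to exclusive
            self.locks[key] = ("X" if mode == "X" else held_mode, holders)
        elif held_mode == "S" and mode == "S":
            holders.add(txid)            # shared locks are compatible
        else:
            # any S/X or X/* conflict: abort instead of waiting
            raise RuntimeError(f"conflict on {key!r}: abort txn {txid}")

    def release_all(self, txid):
        # phase two: release every lock at once, at commit or abort
        for key in list(self.locks):
            mode, holders = self.locks[key]
            holders.discard(txid)
            if not holders:
                del self.locks[key]
```

The performance problems mentioned in the text come from exactly this structure: readers and writers block (or here, abort) each other whenever their lock modes conflict.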
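The optimistic, commit-time validation used by SSI can be sketched in the same toy style. This is a deliberately simplified stand-in for the real algorithm: each transaction records its read set, and at commit it is aborted if any transaction that committed in the meantime wrote to something it read (i.e., its premise may be outdated):

```python
# A toy sketch of optimistic commit-time validation in the spirit of
# serializable snapshot isolation (SSI): proceed without blocking, then
# check for stale premises at commit.

class OptimisticDB:
    def __init__(self):
        self.data = {}
        self.commit_log = []   # list of (commit_seq, set of written keys)
        self.seq = 0

    def begin(self):
        return {"start": self.seq, "reads": set(), "writes": {}}

    def read(self, txn, key):
        txn["reads"].add(key)
        return txn["writes"].get(key, self.data.get(key))

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # abort if a concurrent committer wrote anything this txn read
        for seq, written in self.commit_log:
            if seq > txn["start"] and written & txn["reads"]:
                return False   # execution was not serializable: abort
        self.seq += 1
        self.data.update(txn["writes"])
        self.commit_log.append((self.seq, set(txn["writes"])))
        return True
```

Run through the write-skew example from earlier in the chapter: both transactions read the on-call set and each removes a different doctor; the first commit succeeds, and validation aborts the second, preserving the invariant.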
The examples in this chapter used a relational data model. However, as discussed in “The need for multi-object transactions” , transactions are a valuable database feature, no matter which data model is used.
本章的例子使用了一个关系数据模型。然而,正如“需要多对象事务”中所讨论的那样,无论使用哪种数据模型,事务仍然是一个有价值的数据库特性。
In this chapter, we explored ideas and algorithms mostly in the context of a database running on a single machine. Transactions in distributed databases open a new set of difficult challenges, which we’ll discuss in the next two chapters.
在本章中,我们主要探讨了在单台计算机上运行的数据库的思想和算法。分布式数据库中的事务则带来了一系列新的挑战,这将在接下来的两章中进行讨论。
Footnotes
i Joe Hellerstein has remarked that the C in ACID was “tossed in to make the acronym work” in Härder and Reuter’s paper [ 7 ], and that it wasn’t considered important at the time.
i 乔·海勒斯坦(Joe Hellerstein)指出,在Härder和Reuter的论文[7]中,ACID中的C是“为了凑成这个缩写词而加进去的”,在当时并不被认为很重要。
ii Arguably, an incorrect counter in an email application is not a particularly critical problem. Alternatively, think of a customer account balance instead of an unread counter, and a payment transaction instead of an email.
ii 可以说,电子邮件应用中一个不正确的计数器并不是特别严重的问题。不妨换个角度:把未读计数器换成客户账户余额,把电子邮件换成付款事务来考虑。
iii This is not ideal. If the TCP connection is interrupted, the transaction must be aborted. If the interruption happens after the client has requested a commit but before the server acknowledges that the commit happened, the client doesn’t know whether the transaction was committed or not. To solve this issue, a transaction manager can group operations by a unique transaction identifier that is not bound to a particular TCP connection. We will return to this topic in “The End-to-End Argument for Databases” .
iii 这并不理想。如果TCP连接中断,事务必须被中止。如果中断发生在客户端请求提交之后、服务器确认提交完成之前,客户端就不知道事务是否已提交。为解决此问题,事务管理器可以用一个不绑定到特定TCP连接的唯一事务标识符来对操作进行分组。我们将在“数据库的端到端论证”中回到这个话题。
iv Strictly speaking, the term atomic increment uses the word atomic in the sense of multi-threaded programming. In the context of ACID, it should actually be called isolated or serializable increment. But that’s getting nitpicky.
iv 严格来说,原子递增(atomic increment)一词中的“原子”用的是多线程编程中的含义。在ACID的语境下,它其实应该被称为隔离的或可串行化的递增。但这就有点吹毛求疵了。
v Some databases support an even weaker isolation level called read uncommitted . It prevents dirty writes, but does not prevent dirty reads.
v 有些数据库支持一种更弱的隔离级别,称为读未提交。它可以防止脏写,但不能防止脏读。
vi At the time of writing, the only mainstream databases that use locks for read committed isolation are IBM DB2 and Microsoft SQL Server in the read_committed_snapshot=off configuration [ 23 , 36 ].
vi 在撰写本文时,使用锁来实现读已提交隔离的主流数据库只有IBM DB2,以及read_committed_snapshot=off配置下的Microsoft SQL Server [23, 36]。
vii To be precise, transaction IDs are 32-bit integers, so they overflow after approximately 4 billion transactions. PostgreSQL’s vacuum process performs cleanup which ensures that overflow does not affect the data.
vii 确切地说,事务ID是32位整数,因此在大约40亿个事务之后会发生溢出。PostgreSQL的vacuum过程会执行清理,以确保溢出不会影响数据。
viii It is possible, albeit fairly complicated, to express the editing of a text document as a stream of atomic mutations. See “Automatic Conflict Resolution” for some pointers.
viii 可以将文本文档的编辑表示为一个原子变更流,尽管这相当复杂。请参阅“自动冲突解决”中的一些参考资料。
ix In PostgreSQL you can do this more elegantly using range types, but they are not widely supported in other databases.
ix 在PostgreSQL中,您可以使用范围类型更优雅地完成此操作,但其他数据库对范围类型的支持并不广泛。
x If a transaction needs to access data that’s not in memory, the best solution may be to abort the transaction, asynchronously fetch the data into memory while continuing to process other transactions, and then restart the transaction when the data has been loaded. This approach is known as anti-caching , as previously mentioned in “Keeping everything in memory” .
x 如果事务需要访问不在内存中的数据,最好的解决方案可能是中止该事务,异步地将数据取入内存,同时继续处理其他事务,然后在数据加载完成后重启该事务。这种方法被称为反缓存,如之前在“将所有内容保存在内存中”中提到的那样。
xi Sometimes called strong strict two-phase locking (SS2PL) to distinguish it from other variants of 2PL.
xi 有时被称为强严格两阶段锁定(SS2PL),以区别于2PL的其他变体。
References
[ 1 ] Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, et al.: “ A History and Evaluation of System R ,” Communications of the ACM , volume 24, number 10, pages 632–646, October 1981. doi:10.1145/358769.358784
[1] 唐纳德·D·钱伯林、莫顿·M·阿斯特罕、迈克尔·W·布拉斯根等人: “System R的历史与评价”,ACM通讯杂志,第24卷,第10期,页码632-646,1981年10月。 doi:10.1145/358769.358784
[ 2 ] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: “ Granularity of Locks and Degrees of Consistency in a Shared Data Base ,” in Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems , edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in Readings in Database Systems , 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1
[2] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu 和 Irving L. Traiger: “共享数据库中的锁的颗粒度和一致性程度”,收录于 G.M. Nijssen 编辑的《数据库管理系统建模》会议论文集,第 364-394 页,Elsevier/North Holland Publishing,1976。同时还收录于 MIT Press 出版的第四版《数据库系统阅读》(Joseph M. Hellerstein 和 Michael Stonebraker 编辑),ISBN: 978-0-262-69314-1。
[ 3 ] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger: “ The Notions of Consistency and Predicate Locks in a Database System ,” Communications of the ACM , volume 19, number 11, pages 624–633, November 1976.
Kapali P. Eswaran,Jim N. Gray,Raymond A. Lorie和Irving L. Traiger:“数据库系统中的一致性和谓词锁定概念”,ACM通讯,第19卷,第11号,第624-633页,1976年11月。
[ 4 ] “ ACID Transactions Are Incredibly Helpful ,” FoundationDB, LLC, 2013.
“ACID事务非常有用”,FoundationDB,LLC,2013。
[ 5 ] John D. Cook: “ ACID Versus BASE for Database Transactions ,” johndcook.com , July 6, 2009.
[5] 约翰·D·库克(John D. Cook):“数据库事务中的ACID与BASE”,johndcook.com,2009年7月6日。
[ 6 ] Gavin Clarke: “ NoSQL’s CAP Theorem Busters: We Don’t Drop ACID ,” theregister.co.uk , November 22, 2012.
[6] Gavin Clarke:“NoSQL的CAP定理挑战者:我们不会放弃ACID”,theregister.co.uk,2012年11月22日。
[ 7 ] Theo Härder and Andreas Reuter: “ Principles of Transaction-Oriented Database Recovery ,” ACM Computing Surveys , volume 15, number 4, pages 287–317, December 1983. doi:10.1145/289.291
[7] Theo Härder 和 Andreas Reuter: “基于事务的数据库恢复原则”,ACM Computing Surveys,第15卷,第4期,页码287-317,1983年12月。 doi:10.1145/289.291
[ 8 ] Peter Bailis, Alan Fekete, Ali Ghodsi, et al.: “ HAT, not CAP: Towards Highly Available Transactions ,” at 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2013.
[8] Peter Bailis、Alan Fekete、Ali Ghodsi 等人: “HAT,而非CAP:走向高度可用交易,” 于第14届USENIX操作系统热门话题研讨会(HotOS)中,2013年5月。
[ 9 ] Armando Fox, Steven D. Gribble, Yatin Chawathe, et al.: “ Cluster-Based Scalable Network Services ,” at 16th ACM Symposium on Operating Systems Principles (SOSP), October 1997.
[9] Armando Fox, Steven D. Gribble, Yatin Chawathe等人:“基于集群的可伸缩网络服务”,发表于1997年10月的第16届ACM操作系统原则研讨会(SOSP)。
[ 10 ] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems . Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com .
【10】Philip A. Bernstein、Vassos Hadzilacos 和 Nathan Goodman:《数据库系统的并发控制与恢复》。Addison-Wesley,1987年。ISBN:978-0-201-10715-9,可在research.microsoft.com网站上在线获取。
[ 11 ] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, et al.: “ Making Snapshot Isolation Serializable ,” ACM Transactions on Database Systems , volume 30, number 2, pages 492–528, June 2005. doi:10.1145/1071610.1071615
[11] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil等人:《使快照隔离可串行化》,《ACM数据系统交易》,第30卷,第2期,2005年6月,页码492-528。doi:10.1145 / 1071610.1071615
[ 12 ] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge: “ Understanding the Robustness of SSDs Under Power Fault ,” at 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.
[12] Mai Zheng、Joseph Tucek、Feng Qin 和 Mark Lillibridge:“理解SSD在电源故障下的健壮性”,发表于2013年2月第11届USENIX文件与存储技术会议(FAST)。
[ 13 ] Laurie Denness: “ SSDs: A Gift and a Curse ,” laur.ie , June 2, 2015.
[13] Laurie Denness:“SSD:礼物与诅咒”,laur.ie,2015年6月2日。
[ 14 ] Adam Surak: “ When Solid State Drives Are Not That Solid ,” blog.algolia.com , June 15, 2015.
[14] 亚当·苏拉克: “当固态硬盘并不稳定”,博客.algolia.com,2015年6月15日。
[ 15 ] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “ All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications ,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
【15】Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan等人:《并非所有文件系统都是平等的:关于开发崩溃一致应用程序的复杂性》,2014年10月在第11届USENIX操作系统设计与实现研讨会(OSDI)上发表。
[ 16 ] Chris Siebenmann: “ Unix’s File Durability Problem ,” utcc.utoronto.ca , April 14, 2016.
[16] Chris Siebenmann:“Unix的文件耐久性问题”,utcc.utoronto.ca,2016年4月14日。
[ 17 ] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, et al.: “ An Analysis of Data Corruption in the Storage Stack ,” at 6th USENIX Conference on File and Storage Technologies (FAST), February 2008.
[17] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder等: “存储栈中数据损坏的分析”,发表于2008年2月的第6届USENIX文件和存储技术会议(FAST)。
[ 18 ] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant: “ Flash Reliability in Production: The Expected and the Unexpected ,” at 14th USENIX Conference on File and Storage Technologies (FAST), February 2016.
[18] Bianca Schroeder、Raghav Lagisetty 和 Arif Merchant:“生产环境中的闪存可靠性:预期与意外”,发表于2016年2月第14届USENIX文件与存储技术会议(FAST)。
[ 19 ] Don Allison: “ SSD Storage – Ignorance of Technology Is No Excuse ,” blog.korelogic.com , March 24, 2015.
[19] Don Allison:“SSD存储:对技术的无知不是借口”,blog.korelogic.com,2015年3月24日。
[ 20 ] Dave Scherer: “ Those Are Not Transactions (Cassandra 2.0) ,” blog.foundationdb.com , September 6, 2013.
[20] Dave Scherer:“那不是事务(Cassandra 2.0)”,blog.foundationdb.com,2013年9月6日。
[ 21 ] Kyle Kingsbury: “ Call Me Maybe: Cassandra ,” aphyr.com , September 24, 2013.
[21] Kyle Kingsbury:“Call Me Maybe: Cassandra”,aphyr.com,2013年9月24日。
[ 22 ] “ ACID Support in Aerospike ,” Aerospike, Inc., June 2014.
"[22] Aerospike公司,“Aerospike中的ACID支持”,2014年6月。"
[ 23 ] Martin Kleppmann: “ Hermitage: Testing the ‘I’ in ACID ,” martin.kleppmann.com , November 25, 2014.
[23] Martin Kleppmann:“Hermitage:测试ACID中的‘I’”,martin.kleppmann.com,2014年11月25日。
[ 24 ] Tristan D’Agosta: “ BTC Stolen from Poloniex ,” bitcointalk.org , March 4, 2014.
[24] 特里斯坦·达戈斯塔: “Poloniex被盗的比特币。” bitcointalk.org,2014年3月4日。
[ 25 ] bitcointhief2: “ How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More! ,” reddit.com , February 2, 2014.
[25] bitcointhief2:“我如何从一家交易所偷走了大约100个比特币,以及我本可以偷走更多!”,reddit.com,2014年2月2日。
[ 26 ] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “ Automating the Detection of Snapshot Isolation Anomalies ,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[26] Sudhir Jorwekar、Alan Fekete、Krithi Ramamritham 和 S. Sudarshan:“自动检测快照隔离异常”,发表于2007年9月第33届超大规模数据库国际会议(VLDB)。
[ 27 ] Michael Melanson: “ Transactions: The Limits of Isolation ,” michaelmelanson.net , March 20, 2014.
[27] 迈克尔·梅兰森: “交易: 隔离的限制”,michaelmelanson.net,2014年3月20日。
[ 28 ] Hal Berenson, Philip A. Bernstein, Jim N. Gray, et al.: “ A Critique of ANSI SQL Isolation Levels ,” at ACM International Conference on Management of Data (SIGMOD), May 1995.
[28] Hal Berenson, Philip A. Bernstein, Jim N. Gray等人: “对ANSI SQL隔离级别的批评”, 于ACM数据管理国际会议(SIGMOD)于1995年5月举行。
[ 29 ] Atul Adya: “ Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions ,” PhD Thesis, Massachusetts Institute of Technology, March 1999.
[29] Atul Adya:“弱一致性:分布式事务的广义理论和乐观实现”,麻省理工学院博士论文,1999年3月。
[ 30 ] Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “ Highly Available Transactions: Virtues and Limitations (Extended Version) ,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.
[30] Peter Bailis, Aaron Davidson, Alan Fekete等: “高可用事务:优点和限制(扩展版本)”,于2014年9月在40届国际大数据管理系统大会(VLDB)上发布。
[ 31 ] Bruce Momjian: “ MVCC Unmasked ,” momjian.us , July 2014.
[31] Bruce Momjian: "MVCC Unmasked", momjian.us, 2014年7月。
[ 32 ] Annamalai Gurusami: “ Repeatable Read Isolation Level in InnoDB – How Consistent Read View Works ,” blogs.oracle.com , January 15, 2013.
“在InnoDB中可重复读隔离级别 - 一致性读取视图的工作原理”,Annamalai Gurusami,blogs.oracle.com,2013年1月15日。
[ 33 ] Nikita Prokopov: “ Unofficial Guide to Datomic Internals ,” tonsky.me , May 6, 2014.
[33] 尼基塔·普罗科波夫: "Datomic 内部非官方指南",tonsky.me,2014年5月6日。
[ 34 ] Baron Schwartz: “ Immutability, MVCC, and Garbage Collection ,” xaprb.com , December 28, 2013.
巴伦·施瓦茨:“不变性、MVCC和垃圾回收”,xaprb.com,2013年12月28日。
[ 35 ] J. Chris Anderson, Jan Lehnardt, and Noah Slater: CouchDB: The Definitive Guide . O’Reilly Media, 2010. ISBN: 978-0-596-15589-6
[35] J. Chris Anderson,Jan Lehnardt,和Noah Slater:CouchDB:权威指南。O'Reilly Media,2010年。 ISBN:978-0-596-15589-6
[ 36 ] Rikdeb Mukherjee: “ Isolation in DB2 (Repeatable Read, Read Stability, Cursor Stability, Uncommitted Read) with Examples ,” mframes.blogspot.co.uk , July 4, 2013.
"[36] Rikdeb Mukherjee:“用例分别展示在DB2中的隔离(可重复读、读取稳定性、游标稳定性、未提交读)”,mframes.blogspot.co.uk,2013年7月4日。"
[ 37 ] Steve Hilker: “ Cursor Stability (CS) – IBM DB2 Community ,” toadworld.com , March 14, 2013.
[37] Steve Hilker:IBM DB2社区,“游标稳定性 (CS)”,toadworld.com,2013年3月14日。
[ 38 ] Nate Wiger: “ An Atomic Rant ,” nateware.com , February 18, 2010.
[38] 纳特·维格: “一个原子抱怨”,nateware.com, 2010年2月18日。
[ 39 ] Joel Jacobson: “ Riak 2.0: Data Types ,” blog.joeljacobson.com , March 23, 2014.
[39] Joel Jacobson: “Riak 2.0: 数据类型,” blog.joeljacobson.com, 2014年3月23日.
[ 40 ] Michael J. Cahill, Uwe Röhm, and Alan Fekete: “ Serializable Isolation for Snapshot Databases ,” at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376690
[40] Michael J. Cahill、Uwe Röhm 和 Alan Fekete:“快照数据库的可串行化隔离”,发表于2008年6月ACM国际数据管理会议(SIGMOD)。doi:10.1145/1376616.1376690
[ 41 ] Dan R. K. Ports and Kevin Grittner: “ Serializable Snapshot Isolation in PostgreSQL ,” at 38th International Conference on Very Large Databases (VLDB), August 2012.
[41] Dan R. K. Ports 和 Kevin Grittner: “在 PostgreSQL 中实现串行化快照隔离”, 发表于第 38 届国际大型数据库会议(VLDB),2012年8月。
[ 42 ] Tony Andrews: “ Enforcing Complex Constraints in Oracle ,” tonyandrews.blogspot.co.uk , October 15, 2004.
[42] Tony Andrews:“在Oracle中实施复杂的约束,” tonyandrews.blogspot.co.uk,2004年10月15日。
[ 43 ] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, et al.: “ Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System ,” at 15th ACM Symposium on Operating Systems Principles (SOSP), December 1995. doi:10.1145/224056.224070
【43】Douglas B. Terry, Marvin M. Theimer, Karin Petersen等: “在Bayou中管理更新冲突:一种弱连接的复制存储系统”,发表于 1995年12月第15届ACM操作系统原则研讨会(SOSP)。doi:10.1145/224056.224070
[ 44 ] Gary Fredericks: “ Postgres Serializability Bug ,” github.com , September 2015.
"Postgres串行化漏洞",Gary Fredericks,github.com,2015年9月。
[ 45 ] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “ The End of an Architectural Era (It’s Time for a Complete Rewrite) ,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.
[45] Michael Stonebraker, Samuel Madden, Daniel J. Abadi等人: “一个架构时代的结束(是时候进行完全重写了)”,发表于2007年9月的第33届超大型数据管理会议(VLDB)。
[ 46 ] John Hugg: “ H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures ,” at Data @Scale Boston , November 2014.
[46] John Hugg:“H-Store/VoltDB架构与CEP系统及新一代流处理架构的对比”,发表于Data @Scale Boston,2014年11月。
[ 47 ] Robert Kallman, Hideaki Kimura, Jonathan Natkins, et al.: “ H-Store: A High-Performance, Distributed Main Memory Transaction Processing System ,” Proceedings of the VLDB Endowment , volume 1, number 2, pages 1496–1499, August 2008.
[47] Robert Kallman、Hideaki Kimura、Jonathan Natkins 等人:“H-Store:一种高性能的分布式主内存事务处理系统”,Proceedings of the VLDB Endowment,第1卷,第2期,第1496-1499页,2008年8月。
[ 48 ] Rich Hickey: “ The Architecture of Datomic ,” infoq.com , November 2, 2012.
[48] Rich Hickey: “Datomic 的架构”,infoq.com,2012年11月2日。
[ 49 ] John Hugg: “ Debunking Myths About the VoltDB In-Memory Database ,” voltdb.com , May 12, 2014.
约翰·哈格:“揭开 VoltDB 内存数据库的谣言”,voltdb.com,2014 年 5 月 12 日。
[ 50 ] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “ Architecture of a Database System ,” Foundations and Trends in Databases , volume 1, number 2, pages 141–259, November 2007. doi:10.1561/1900000002
【50】Joseph M. Hellerstein、Michael Stonebraker和James Hamilton: “数据库系统的架构,” 《数据库基础与趋势》,卷1,号2,页141-259,2007年11月。 doi:10.1561/1900000002
[ 51 ] Michael J. Cahill: “ Serializable Isolation for Snapshot Databases ,” PhD Thesis, University of Sydney, July 2009.
[51] Michael J. Cahill:“快照数据库的可串行化隔离”,悉尼大学博士论文,2009年7月。
[ 52 ] D. Z. Badal: “ Correctness of Concurrency Control and Implications in Distributed Databases ,” at 3rd International IEEE Computer Software and Applications Conference (COMPSAC), November 1979.
[52] D. Z. 巴达尔: “并发控制的正确性及其在分布式数据库中的意义”,出版于1979年11月第3届国际IEEE计算机软件和应用大会(COMPSAC)上。
[ 53 ] Rakesh Agrawal, Michael J. Carey, and Miron Livny: “ Concurrency Control Performance Modeling: Alternatives and Implications ,” ACM Transactions on Database Systems (TODS), volume 12, number 4, pages 609–654, December 1987. doi:10.1145/32204.32220
[53] Rakesh Agrawal,Michael J. Carey和Miron Livny:“并发控制性能建模:替代方案和影响”,ACM Transactions on Database Systems(TODS),第12卷,第4期,1987年12月,页609-654。 doi:10.1145/32204.32220
[ 54 ] Dave Rosenthal: “ Databases at 14.4MHz ,” blog.foundationdb.com , December 10, 2014.
[54] Dave Rosenthal:“14.4MHz下的数据库”,blog.foundationdb.com,2014年12月10日。