Chapter 4. Encoding and Evolution
Everything changes and nothing stands still.
Heraclitus of Ephesus, as quoted by Plato in Cratylus (360 BCE)
Applications inevitably change over time. Features are added or modified as new products are launched, user requirements become better understood, or business circumstances change. In Chapter 1 we introduced the idea of evolvability: we should aim to build systems that make it easy to adapt to change (see “Evolvability: Making Change Easy”).
In most cases, a change to an application’s features also requires a change to data that it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way.
The data models we discussed in Chapter 2 have different ways of coping with such change. Relational databases generally assume that all data in the database conforms to one schema: although that schema can be changed (through schema migrations; i.e., ALTER statements), there is exactly one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases don’t enforce a schema, so the database can contain a mixture of older and newer data formats written at different times (see “Schema flexibility in the document model”).
When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously:
- With server-side applications you may want to perform a rolling upgrade (also known as a staged rollout), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.
- With client-side applications you’re at the mercy of the user, who may not install the update for some time.
This means that old and new versions of the code, and old and new data formats, may potentially all coexist in the system at the same time. In order for the system to continue running smoothly, we need to maintain compatibility in both directions:
- Backward compatibility: Newer code can read data that was written by older code.
- Forward compatibility: Older code can read data that was written by newer code.
Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it (if necessary by simply keeping the old code to read the old data). Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.
In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. In particular, we will look at how they handle schema changes and how they support systems where old and new data and code need to coexist. We will then discuss how those formats are used for data storage and for communication: in web services, Representational State Transfer (REST), and remote procedure calls (RPC), as well as message-passing systems such as actors and message queues.
Formats for Encoding Data
Programs usually work with data in (at least) two different representations:
- In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).
- When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
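In Python, for example, the round trip between an in-memory structure and a self-contained byte sequence might look like this (a minimal sketch using JSON as the encoding; the record contents are illustrative):

```python
import json

# In-memory representation: a dict, with references under the hood
record = {"userName": "Martin", "favoriteNumber": 1337}

# Encoding (serialization): in-memory structure -> self-contained bytes
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization): bytes -> a fresh in-memory structure
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record
```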
Terminology clash
Serialization is unfortunately also used in the context of transactions (see Chapter 7), with a completely different meaning. To avoid overloading the word we’ll stick with encoding in this book, even though serialization is perhaps a more common term.
As this is such a common problem, there are a myriad different libraries and encoding formats to choose from. Let’s do a brief overview.
Language-Specific Formats
Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example, Java has java.io.Serializable [1], Ruby has Marshal [2], Python has pickle [3], and so on. Many third-party libraries also exist, such as Kryo for Java [4].
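A quick sketch of what this built-in convenience looks like with Python’s pickle (the object is illustrative):

```python
import pickle

obj = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

data = pickle.dumps(obj)        # encode the in-memory object to bytes
restored = pickle.loads(data)   # restore it with no schema or mapping code
assert restored == obj
```

As the list of problems below explains, this convenience comes at a price: for example, calling pickle.loads on untrusted input can execute arbitrary code.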
These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code. However, they also have a number of deep problems:
- The encoding is often tied to a particular programming language, and reading the data in another language is very difficult. If you store or transmit data in such an encoding, you are committing yourself to your current programming language for potentially a very long time, and precluding integrating your systems with those of other organizations (which may use different languages).
- In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems [5]: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to do terrible things such as remotely executing arbitrary code [6, 7].
- Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility.
- Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought. For example, Java’s built-in serialization is notorious for its bad performance and bloated encoding [8].
For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes.
JSON, XML, and Binary Variants
Moving to standardized encodings that can be written and read by many programming languages, JSON and XML are the obvious contenders. They are widely known, widely supported, and almost as widely disliked. XML is often criticized for being too verbose and unnecessarily complicated [9]. JSON’s popularity is mainly due to its built-in support in web browsers (by virtue of being a subset of JavaScript) and simplicity relative to XML. CSV is another popular language-independent format, albeit less powerful.
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a popular topic of debate). Besides the superficial syntactic issues, they also have some subtle problems:
- There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become inaccurate when parsed in a language that uses floating-point numbers (such as JavaScript). An example of numbers larger than 2^53 occurs on Twitter, which uses a 64-bit number to identify each tweet. The JSON returned by Twitter’s API includes tweet IDs twice, once as a JSON number and once as a decimal string, to work around the fact that the numbers are not correctly parsed by JavaScript applications [10].
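You can see the loss of precision directly in any language whose numbers are IEEE 754 doubles; in Python (where integers are exact) the rounding becomes visible as soon as you convert:

```python
# 2**53 is the largest power of two up to which every integer is exactly
# representable as an IEEE 754 double; 2**53 + 1 rounds back down
assert float(2**53) == float(2**53 + 1)
assert int(float(2**53 + 1)) == 2**53

# A JavaScript JSON parser reads all numbers as doubles, which is why it
# silently corrupts 64-bit IDs such as Twitter's tweet IDs
```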
- JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it’s somewhat hacky and increases the data size by 33%.
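The 33% overhead follows from Base64 mapping every 3 bytes of binary data to 4 ASCII characters:

```python
import base64

binary = bytes(300)                 # 300 arbitrary bytes
text = base64.b64encode(binary)
assert len(text) == 400             # 300 bytes -> 400 characters (+33%)
```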
- There is optional schema support for both XML [11] and JSON [12]. These schema languages are quite powerful, and thus quite complicated to learn and implement. Use of XML schemas is fairly widespread, but many JSON-based tools don’t bother using schemas. Since the correct interpretation of data (such as numbers and binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas need to potentially hardcode the appropriate encoding/decoding logic instead.
- CSV does not have any schema, so it is up to the application to define the meaning of each row and column. If an application change adds a new row or column, you have to handle that change manually. CSV is also a quite vague format (what happens if a value contains a comma or a newline character?). Although its escaping rules have been formally specified [13], not all parsers implement them correctly.
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will remain popular, especially as data interchange formats (i.e., for sending data from one organization to another). In these situations, as long as people agree on what the format is, it often doesn’t matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything outweighs most other concerns.
Binary encoding
For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format. For example, you could choose a format that is more compact or faster to parse. For a small dataset, the gains are negligible, but once you get into the terabytes, the choice of data format can have a big impact.
JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This observation led to the development of a profusion of binary encodings for JSON (MessagePack, BSON, BJSON, UBJSON, BISON, and Smile, to name a few) and for XML (WBXML and Fast Infoset, for example). These formats have been adopted in various niches, but none of them are as widely adopted as the textual versions of JSON and XML.
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In particular, since they don’t prescribe a schema, they need to include all the object field names within the encoded data. That is, in a binary encoding of the JSON document in Example 4-1, they will need to include the strings userName, favoriteNumber, and interests somewhere.
Example 4-1. Example record which we will encode in several binary formats in this chapter
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
Let’s look at an example of MessagePack, a binary encoding for JSON. Figure 4-1 shows the byte sequence that you get if you encode the JSON document in Example 4-1 with MessagePack [14]. The first few bytes are as follows:
- The first byte, 0x83, indicates that what follows is an object (top four bits = 0x80) with three fields (bottom four bits = 0x03). (In case you’re wondering what happens if an object has more than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type indicator, and the number of fields is encoded in two or four bytes.)
- The second byte, 0xa8, indicates that what follows is a string (top four bits = 0xa0) that is eight bytes long (bottom four bits = 0x08).
- The next eight bytes are the field name userName in ASCII. Since the length was indicated previously, there’s no need for any marker to tell us where the string ends (or any escaping).
- The next seven bytes encode the six-letter string value Martin with a prefix 0xa6, and so on.
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in this regard. It’s not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of human-readability.
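The 81-byte figure for the textual encoding is easy to verify:

```python
import json

doc = {"userName": "Martin", "favoriteNumber": 1337,
       "interests": ["daydreaming", "hacking"]}

compact = json.dumps(doc, separators=(",", ":"))  # whitespace removed
assert len(compact) == 81
```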
In the following sections we will see how we can do much better, and encode the same record in just 32 bytes.
Thrift and Protocol Buffers
Apache Thrift [15] and Protocol Buffers (protobuf) [16] are binary encoding libraries that are based on the same principle. Protocol Buffers was originally developed at Google, Thrift was originally developed at Facebook, and both were made open source in 2007–08 [17].
Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Example 4-1 in Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:
struct Person {
    1: required string userName,
    2: optional i64 favoriteNumber,
    3: optional list<string> interests
}
The equivalent schema definition for Protocol Buffers looks very similar:
message Person {
    required string user_name = 1;
    optional int64 favorite_number = 2;
    repeated string interests = 3;
}
Thrift and Protocol Buffers each come with a code generation tool that takes a schema definition like the ones shown here, and produces classes that implement the schema in various programming languages [18]. Your application code can call this generated code to encode or decode records of the schema.
What does data encoded with this schema look like? Confusingly, Thrift has two different binary encoding formats, called BinaryProtocol and CompactProtocol, respectively. Let’s look at BinaryProtocol first. Encoding Example 4-1 in that format takes 59 bytes, as shown in Figure 4-2 [19].
Similarly to Figure 4-1 , each field has a type annotation (to indicate whether it is a string, integer, list, etc.) and, where required, a length indication (length of a string, number of items in a list). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded as ASCII (or rather, UTF-8), similar to before.
The big difference compared to Figure 4-1 is that there are no field names (userName, favoriteNumber, interests). Instead, the encoded data contains field tags, which are numbers (1, 2, and 3). Those are the numbers that appear in the schema definition. Field tags are like aliases for fields—they are a compact way of saying what field we’re talking about, without having to spell out the field name.
The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see in Figure 4-3, it packs the same information into only 34 bytes. It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between –64 and 63 are encoded in one byte, numbers between –8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.
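The variable-length integer scheme described here combines a zigzag step (so that small negative numbers stay small) with 7-bits-per-byte encoding. A sketch, for values that fit in a signed 64-bit integer:

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    # 7 bits per byte; the top bit set means "more bytes follow"
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode(n: int) -> bytes:
    return varint(zigzag(n))

assert len(encode(63)) == 1 and len(encode(-64)) == 1    # one byte: -64..63
assert len(encode(8191)) == 2 and len(encode(-8192)) == 2
assert len(encode(1337)) == 2                            # not eight bytes
```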
Finally, Protocol Buffers (which has only one binary encoding format) encodes the same data as shown in Figure 4-4. It does the bit packing slightly differently, but is otherwise very similar to Thrift’s CompactProtocol. Protocol Buffers fits the same record in 33 bytes.
One detail to note: in the schemas shown earlier, each field was marked either required or optional, but this makes no difference to how the field is encoded (nothing in the binary data indicates whether a field was required). The difference is simply that required enables a runtime check that fails if the field is not set, which can be useful for catching bugs.
Field tags and schema evolution
We said previously that schemas inevitably need to change over time. We call this schema evolution . How do Thrift and Protocol Buffers handle schema changes while keeping backward and forward compatibility?
As you can see from the examples, an encoded record is just the concatenation of its encoded fields. Each field is identified by its tag number (the numbers 1, 2, 3 in the sample schemas) and annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply omitted from the encoded record. From this you can see that field tags are critical to the meaning of the encoded data. You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag, since that would make all existing encoded data invalid.
You can add new fields to the schema, provided that you give each field a new tag number. If old code (which doesn’t know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field. The datatype annotation allows the parser to determine how many bytes it needs to skip. This maintains forward compatibility: old code can read records that were written by new code.
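A simplified decoder shows how the datatype (wire-type) annotation lets old code skip a field it doesn’t recognize. This sketch handles only Protocol Buffers’ varint and length-delimited wire types, not the full wire format:

```python
def read_varint(buf: bytes, i: int):
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def decode(buf: bytes, known_tags: set) -> dict:
    fields, i = {}, 0
    while i < len(buf):
        key, i = read_varint(buf, i)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                 # varint
            value, i = read_varint(buf, i)
        elif wire_type == 2:               # length-delimited (string, etc.)
            length, i = read_varint(buf, i)
            value, i = buf[i:i + length], i + length
        if tag in known_tags:              # unknown tags are simply skipped
            fields[tag] = value
    return fields

# Field 1 = varint 150; field 2 = string "hi", added by newer code
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi"
assert decode(msg, known_tags={1}) == {1: 150}   # field 2 safely ignored
```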
What about backward compatibility? As long as each field has a unique tag number, new code can always read old data, because the tag numbers still have the same meaning. The only detail is that if you add a new field, you cannot make it required. If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added. Therefore, to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.
Removing a field is just like adding a field, with backward and forward compatibility concerns reversed. That means you can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again (because you may still have data written somewhere that includes the old tag number, and that field must be ignored by new code).
Datatypes and schema evolution
What about changing the datatype of a field? That may be possible—check the documentation for details—but there is a risk that values will lose precision or get truncated. For example, say you change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code, because the parser can fill in any missing bits with zeros. However, if old code reads data written by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won’t fit in 32 bits, it will be truncated.
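The truncation risk is easy to see by simulating a 32-bit variable receiving a decoded 64-bit value:

```python
value_written_by_new_code = 5_000_000_000   # needs more than 32 bits

# Old code stores the value in a 32-bit field: only the low 32 bits survive
truncated = value_written_by_new_code & 0xFFFFFFFF
assert truncated == 705_032_704             # silently wrong
assert truncated != value_written_by_new_code
```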
A curious detail of Protocol Buffers is that it does not have a list or array datatype, but instead has a repeated marker for fields (which is a third option alongside required and optional). As you can see in Figure 4-4, the encoding of a repeated field is just what it says on the tin: the same field tag simply appears multiple times in the record. This has the nice effect that it’s okay to change an optional (single-valued) field into a repeated (multi-valued) field. New code reading old data sees a list with zero or one elements (depending on whether the field was present); old code reading new data sees only the last element of the list.
Thrift has a dedicated list datatype, which is parameterized with the datatype of the list elements. This does not allow the same evolution from single-valued to multi-valued as Protocol Buffers does, but it has the advantage of supporting nested lists.
Avro
Apache Avro [20] is another binary encoding format that is interestingly different from Protocol Buffers and Thrift. It was started in 2009 as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop’s use cases [21].
Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.
Our example schema, written in Avro IDL, might look like this:
record Person {
    string userName;
    union { null, long } favoriteNumber = null;
    array<string> interests;
}
The equivalent JSON representation of that schema is as follows:
{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests", "type": {"type": "array", "items": "string"}}
    ]
}
First of all, notice that there are no tag numbers in the schema. If we encode our example record (Example 4-1) using this schema, the Avro binary encoding is just 32 bytes long—the most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown in Figure 4-5.
If you examine the byte sequence, you can see that there is nothing to identify fields or their datatypes. The encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that tells you that it is a string. It could just as well be an integer, or something else entirely. An integer is encoded using a variable-length encoding (the same as Thrift’s CompactProtocol).
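For example, per the Avro specification a string is encoded as a zigzag-varint length followed by its UTF-8 bytes, with nothing marking it as a string:

```python
def zigzag_varint(n: int) -> bytes:
    n = (n << 1) ^ (n >> 63)        # zigzag, then 7-bits-per-byte varint
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

encoded = zigzag_varint(len("Martin")) + "Martin".encode("utf-8")
assert encoded == b"\x0cMartin"     # 0x0c = zigzag(6); no type information
```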
To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. This means that the binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly decoded data.
So, how does Avro support schema evolution?
The writer’s schema and the reader’s schema
With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about—for example, that schema may be compiled into the application. This is known as the writer’s schema .
When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader’s schema . That is the schema the application code is relying on—code may have been generated from that schema during the application’s build process.
The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible. When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema. The Avro specification [20] defines exactly how this resolution works, and it is illustrated in Figure 4-6.
For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does not contain a field of that name, it is filled in with a default value declared in the reader’s schema.
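The resolution rules in this paragraph can be sketched as a function over decoded records. This is a simplification of what the Avro library does: fields are matched by name, writer-only fields are dropped, and missing fields get the reader’s default (the photoURL field below is a hypothetical addition by a newer writer):

```python
def resolve(writer_record: dict, reader_schema: list) -> dict:
    # reader_schema: list of (field_name, default_value) pairs
    result = {}
    for name, default in reader_schema:
        if name in writer_record:
            result[name] = writer_record[name]   # matched by field name
        else:
            result[name] = default               # fill in reader's default
    return result                                # writer-only fields ignored

writer_record = {"userName": "Martin", "photoURL": "..."}   # newer writer
reader_schema = [("userName", None), ("favoriteNumber", None)]
assert resolve(writer_record, reader_schema) == {
    "userName": "Martin", "favoriteNumber": None}
```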
Schema evolution rules
With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader. Conversely, backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
使用 Avro,向前兼容性意味着可以将新版本的模式作为写入者,并将旧版本的模式作为读取者。反之,向后兼容性意味着可以将新版本的模式作为读取者,并将旧版本的模式作为写入者。
To maintain compatibility, you may only add or remove a field that has a default value. (The field favoriteNumber in our Avro schema has a default value of null.) For example, say you add a field with a default value, so this new field exists in the new schema but not the old one. When a reader using the new schema reads a record written with the old schema, the default value is filled in for the missing field.
为了保持兼容性,您只能添加或删除具有默认值的字段。(在我们的Avro模式中,favoriteNumber字段具有null的默认值)。例如,假设您添加了一个具有默认值的字段,因此此新字段存在于新模式中但不存在于旧模式中。当使用新模式的读取器读取使用旧模式编写的记录时,缺失字段的默认值将被填充。
If you were to add a field that has no default value, new readers wouldn’t be able to read data written by old writers, so you would break backward compatibility. If you were to remove a field that has no default value, old readers wouldn’t be able to read data written by new writers, so you would break forward compatibility.
如果你添加一个没有默认值的字段,新的读取者将无法读取旧的写入者写入的数据,这会破坏向后兼容性。如果你删除一个没有默认值的字段,旧的读取者将无法读取新的写入者写入的数据,这会破坏向前兼容性。
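These rules can be captured in a small compatibility checker. The sketch below is a simplification under the same dict-based schema representation as before (an assumption, not Avro's actual API); it only looks at added and removed fields, ignoring type changes:

```python
# Compatibility rule sketch: a schema change is backward compatible if
# every added field has a default (new readers can read old data), and
# forward compatible if every removed field had a default (old readers
# can read new data).
def added_and_removed(old_schema, new_schema):
    old = {f["name"]: f for f in old_schema}
    new = {f["name"]: f for f in new_schema}
    added = [new[n] for n in new.keys() - old.keys()]
    removed = [old[n] for n in old.keys() - new.keys()]
    return added, removed

def is_backward_compatible(old_schema, new_schema):
    added, _ = added_and_removed(old_schema, new_schema)
    return all("default" in f for f in added)

def is_forward_compatible(old_schema, new_schema):
    _, removed = added_and_removed(old_schema, new_schema)
    return all("default" in f for f in removed)

v1 = [{"name": "userName"}]
v2 = [{"name": "userName"}, {"name": "favoriteNumber", "default": None}]
v3 = [{"name": "userName"}, {"name": "favoriteNumber"}]  # no default!
```

Here `v1 → v2` is compatible in both directions, whereas `v1 → v3` breaks backward compatibility because the new field lacks a default.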
In some programming languages, null is an acceptable default for any variable, but this is not the case in Avro: if you want to allow a field to be null, you have to use a union type. For example, union { null, long, string } field; indicates that field can be a number, or a string, or null. You can only use null as a default value if it is one of the branches of the union.iv This is a little more verbose than having everything nullable by default, but it helps prevent bugs by being explicit about what can and cannot be null [22].
在一些编程语言中,null 是任何变量的可接受默认值,但在 Avro 中不是这种情况:如果您想要允许某个字段为空,则必须使用联合类型。例如,union {null,long,string} field; 表示字段可以是数字、字符串或 null。只有当 null 是联合类型的一个分支时,才能将其用作默认值。这比默认情况下所有内容都可为空要冗长一些,但通过明确规定哪些内容可以为空以及哪些内容不可以为空可以帮助预防出现错误 [22]。
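In Avro's JSON schema representation, a union is written as a JSON array of its branches, and (per the Avro specification) a field's default value must match the first branch of the union, so nullable fields conventionally put "null" first. A small sketch, parsing such a schema with Python's standard library:

```python
import json

# JSON-representation counterpart of the Avro IDL shown above: a union is
# an array of branches; a field defaulting to null lists "null" first.
schema_json = """
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null}
  ]
}
"""
schema = json.loads(schema_json)
```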
Consequently, Avro doesn’t have optional and required markers in the same way as Protocol Buffers and Thrift do (it has union types and default values instead).
因此,Avro没有像Protocol Buffers和Thrift那样的可选和必需标记(它有联合类型和默认值)。
Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the name of a field is possible but a little tricky: the reader’s schema can contain aliases for field names, so it can match an old writer’s schema field names against the aliases. This means that changing a field name is backward compatible but not forward compatible. Similarly, adding a branch to a union type is backward compatible but not forward compatible.
改变字段的数据类型是可能的,前提是Avro可以转换类型。改变字段名称是可能的,但有点棘手:读取器模式可以包含字段名称的别名,因此它可以将旧写入器模式字段名称与别名进行匹配。这意味着更改字段名称是向后兼容但不是向前兼容。同样,向联合类型添加分支是向后兼容但不是向前兼容。
But what is the writer’s schema?
There is an important question that we’ve glossed over so far: how does the reader know the writer’s schema with which a particular piece of data was encoded? We can’t just include the entire schema with every record, because the schema would likely be much bigger than the encoded data, making all the space savings from the binary encoding futile.
这里有一个重要的问题我们之前忽略了:读者如何知道写手用哪种模式来编码特定的数据?我们不能每个记录都包含整个模式,因为模式很可能比编码出来的数据还要大,这样二进制编码所节省的空间就变得无意义了。
The answer depends on the context in which Avro is being used. To give a few examples:
答案取决于Avro使用的上下文。举几个例子:
- Large file with lots of records
-
A common use for Avro—especially in the context of Hadoop—is for storing a large file containing millions of records, all encoded with the same schema. (We will discuss this kind of situation in Chapter 10 .) In this case, the writer of that file can just include the writer’s schema once at the beginning of the file. Avro specifies a file format (object container files) to do this.
Avro的一种常见用途,尤其是在Hadoop的背景下,是用于存储包含数百万条记录、所有记录都使用相同模式进行编码的大型文件。(我们将在第10章中讨论这种情况。)在这种情况下,该文件的作者只需在文件开头包含一次作者的模式即可。Avro指定了一个文件格式(对象容器文件),用于执行此操作。
- Database with individually written records
-
In a database, different records may be written at different points in time using different writer’s schemas—you cannot assume that all the records will have the same schema. The simplest solution is to include a version number at the beginning of every encoded record, and to keep a list of schema versions in your database. A reader can fetch a record, extract the version number, and then fetch the writer’s schema for that version number from the database. Using that writer’s schema, it can decode the rest of the record. (Espresso [ 23 ] works this way, for example.)
在数据库中,不同的记录可能使用不同的编写者模式在不同的时间点编写 - 您不能假设所有记录都具有相同的模式。最简单的解决方案是在每个编码记录的开头包含一个版本号,并在数据库中保留模式版本列表。读取器可以获取记录,提取版本号,然后从数据库中获取该版本号的编写者模式。使用该编写者模式,它可以解码其余的记录。(例如 Espresso 以此方式工作。)
- Sending records over a network connection
-
When two processes are communicating over a bidirectional network connection, they can negotiate the schema version on connection setup and then use that schema for the lifetime of the connection. The Avro RPC protocol (see “Dataflow Through Services: REST and RPC” ) works like this.
当两个进程通过双向网络连接通信时,它们可以在连接设置期间协商模式版本,然后在连接的整个生命周期内使用该模式。Avro RPC协议(请参见“服务的数据流:REST和RPC”)是这样工作的。
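The “version number at the beginning of every encoded record” approach described for databases above can be sketched as follows. This is an illustrative toy, not Espresso's actual format: the helper names are made up, a 4-byte big-endian integer stands in for the version prefix, and JSON stands in for the Avro binary encoding of the record body:

```python
import json
import struct

# Hypothetical schema registry: maps a version number to the writer's
# schema (in the same toy dict representation used earlier).
schema_registry = {
    1: [{"name": "userName"}],
    2: [{"name": "userName"}, {"name": "favoriteNumber", "default": None}],
}

def encode_record(version, record):
    body = json.dumps(record).encode("utf-8")   # stand-in for Avro binary
    return struct.pack(">I", version) + body    # 4-byte version prefix

def decode_record(blob):
    (version,) = struct.unpack(">I", blob[:4])
    writers_schema = schema_registry[version]   # look up writer's schema
    record = json.loads(blob[4:].decode("utf-8"))
    return version, writers_schema, record

blob = encode_record(2, {"userName": "Martin", "favoriteNumber": 1337})
version, writers_schema, record = decode_record(blob)
```

The reader never has to guess: the prefix tells it exactly which writer's schema to fetch before decoding the rest of the record.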
A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility [ 24 ]. As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
无论如何,拥有一个模式版本数据库都是有用的,因为它可以充当文档,并让你有机会检查模式的兼容性[24]。作为版本号,你可以使用简单的递增整数,也可以使用模式的哈希值。
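Using a hash of the schema as the version identifier might look like the following sketch. Note the canonicalization here (JSON with sorted keys) is a simplifying assumption; Avro itself defines a stricter “Parsing Canonical Form” and fingerprint algorithms in its specification:

```python
import hashlib
import json

# Derive a deterministic fingerprint from a schema (given as a dict):
# serialize it in a canonical way, then hash the bytes.
def schema_fingerprint(schema):
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

person_v1 = {
    "type": "record",
    "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
}
fingerprint = schema_fingerprint(person_v1)
```

Because the keys are sorted before hashing, two schemas that differ only in JSON key order get the same fingerprint.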
Dynamically generated schemas
One advantage of Avro’s approach, compared to Protocol Buffers and Thrift, is that the schema doesn’t contain any tag numbers. But why is this important? What’s the problem with keeping a couple of numbers in the schema?
Avro的做法相对于Protocol Buffers和Thrift的优势之一就是模式中没有任何标签号。但这为什么重要呢?把几个数字写进模式有什么问题呢?
The difference is that Avro is friendlier to dynamically generated schemas. For example, say you have a relational database whose contents you want to dump to a file, and you want to use a binary format to avoid the aforementioned problems with textual formats (JSON, CSV, SQL). If you use Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the relational schema and encode the database contents using that schema, dumping it all to an Avro object container file [ 25 ]. You generate a record schema for each database table, and each column becomes a field in that record. The column name in the database maps to the field name in Avro.
不同之处在于Avro更友好地支持动态生成模式。例如,假设您有一个关系型数据库,希望将其内容转储到文件中,并且想要使用二进制格式避免文本格式(JSON、CSV、SQL)的问题。如果使用Avro,您可以相对容易地从关系模式生成一个Avro模式(使用我们之前看到的JSON表示),并使用该模式对数据库内容进行编码,将其全部转储到Avro对象容器文件[25]中。您为每个数据库表生成一个记录模式,每个列成为该记录中的一个字段。数据库中的列名映射到Avro中的字段名。
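A minimal sketch of this kind of schema generation is shown below. The SQL-to-Avro type mapping is a deliberately incomplete assumption (a real exporter would cover many more types), and the column-description format is made up for illustration:

```python
# Generate an Avro record schema (JSON representation, as a Python dict)
# from a relational table definition. Assumed, simplified type mapping:
SQL_TO_AVRO = {
    "bigint": "long",
    "integer": "long",
    "text": "string",
    "varchar": "string",
    "boolean": "boolean",
    "double precision": "double",
}

def table_to_avro_schema(table_name, columns):
    """columns: list of (column_name, sql_type, nullable) tuples."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # Nullable columns become a union with null, defaulting to null.
            fields.append({"name": name, "type": ["null", avro_type],
                           "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table_name, "fields": fields}

schema = table_to_avro_schema("users", [
    ("id", "bigint", False),
    ("user_name", "text", False),
    ("favorite_number", "integer", True),
])
```

Each column becomes a field named after the column, which is exactly what lets Avro's by-name schema resolution cope when columns are later added or removed.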
Now, if the database schema changes (for example, a table has one column added and one column removed), you can just generate a new Avro schema from the updated database schema and export data in the new Avro schema. The data export process does not need to pay any attention to the schema change—it can simply do the schema conversion every time it runs. Anyone who reads the new data files will see that the fields of the record have changed, but since the fields are identified by name, the updated writer’s schema can still be matched up with the old reader’s schema.
现在,如果数据库模式发生变化(例如,添加了一列和删除了一列),您只需从更新的数据库模式生成一个新的Avro模式,并按新的Avro模式导出数据。数据导出过程不需要关注模式变化 - 它可以每次运行时进行模式转换。读取新数据文件的任何人都会看到记录的字段已更改,但由于字段是通过名称标识的,因此更新的写入器模式仍然可以与旧的读者模式匹配。
By contrast, if you were using Thrift or Protocol Buffers for this purpose, the field tags would likely have to be assigned by hand: every time the database schema changes, an administrator would have to manually update the mapping from database column names to field tags. (It might be possible to automate this, but the schema generator would have to be very careful to not assign previously used field tags.) This kind of dynamically generated schema simply wasn’t a design goal of Thrift or Protocol Buffers, whereas it was for Avro.
相比之下,如果您使用Thrift或Protocol Buffers来实现此目的,字段标签很可能需要手动分配:每当数据库模式更改时,管理员都必须手动更新从数据库列名称到字段标签的映射。(这可能是可以自动化的,但模式生成器必须非常小心,以不分配先前使用过的字段标签。)这种动态生成的模式根本不是Thrift或Protocol Buffers的设计目标,但却是Avro的设计目标。
Code generation and dynamically typed languages
Thrift and Protocol Buffers rely on code generation: after a schema has been defined, you can generate code that implements this schema in a programming language of your choice. This is useful in statically typed languages such as Java, C++, or C#, because it allows efficient in-memory structures to be used for decoded data, and it allows type checking and autocompletion in IDEs when writing programs that access the data structures.
Thrift和Protocol Buffers依赖于代码生成:在定义了模式之后,你可以用所选的编程语言生成实现该模式的代码。这在Java、C++或C#等静态类型语言中非常有用,因为它允许为解码后的数据使用高效的内存数据结构,并且在编写访问这些数据结构的程序时,可以在IDE中进行类型检查和自动补全。
In dynamically typed programming languages such as JavaScript, Ruby, or Python, there is not much point in generating code, since there is no compile-time type checker to satisfy. Code generation is often frowned upon in these languages, since they otherwise avoid an explicit compilation step. Moreover, in the case of a dynamically generated schema (such as an Avro schema generated from a database table), code generation is an unnecessary obstacle to getting to the data.
在动态类型的编程语言中,例如JavaScript,Ruby或Python,生成代码几乎没有意义,因为没有编译时类型检查器需要满足。在这些语言中,代码生成通常是不被赞成的,因为它们可以避免显式编译步骤。此外,在动态生成模式的情况下(例如从数据库表生成的Avro模式),代码生成是一个不必要的障碍,阻碍获取数据。
Avro provides optional code generation for statically typed programming languages, but it can be used just as well without any code generation. If you have an object container file (which embeds the writer’s schema), you can simply open it using the Avro library and look at the data in the same way as you could look at a JSON file. The file is self-describing since it includes all the necessary metadata.
Avro提供可选的代码生成,用于静态类型的编程语言,但是即使没有任何代码生成,它也可以很好地使用。如果您有一个对象容器文件(其中嵌入了作者模式),则可以使用Avro库打开它,并查看数据,就像您可以查看JSON文件一样。该文件是自描述的,因为它包含了所有必要的元数据。
This property is especially useful in conjunction with dynamically typed data processing languages like Apache Pig [ 26 ]. In Pig, you can just open some Avro files, start analyzing them, and write derived datasets to output files in Avro format without even thinking about schemas.
这个属性在动态类型的数据处理语言,比如Apache Pig[26]方面非常有用。在Pig中,你可以直接打开一些Avro文件,开始分析它们,并把得出的数据集以Avro格式写入输出文件,而无需考虑模式。
The Merits of Schemas
As we saw, Protocol Buffers, Thrift, and Avro all use a schema to describe a binary encoding format. Their schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed validation rules (e.g., “the string value of this field must match this regular expression” or “the integer value of this field must be between 0 and 100”). As Protocol Buffers, Thrift, and Avro are simpler to implement and simpler to use, they have grown to support a fairly wide range of programming languages.
正如我们所看到的,Protocol Buffers、Thrift和Avro都使用模式来描述二进制编码格式。它们的模式语言比XML Schema或JSON Schema简单得多,后者支持详细得多的验证规则(例如,“此字段的字符串值必须匹配该正则表达式”或“此字段的整数值必须在0到100之间”)。由于Protocol Buffers、Thrift和Avro实现起来更简单,使用起来也更简单,它们已经发展到支持相当广泛的编程语言。
The ideas on which these encodings are based are by no means new. For example, they have a lot in common with ASN.1, a schema definition language that was first standardized in 1984 [ 27 ]. It was used to define various network protocols, and its binary encoding (DER) is still used to encode SSL certificates (X.509), for example [ 28 ]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers and Thrift [ 29 ]. However, it’s also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
这些编码所基于的想法并不是新概念。例如,它们与 ASN.1 有很多共同点,这是一种模式定义语言,于 1984 年首次标准化[27]。它被用于定义各种网络协议,其二进制编码(DER)仍然用于编码 SSL 证书(X.509)等[28]。ASN.1 支持使用标签号进行模式演化,类似于 Protocol Buffers 和 Thrift[29]。但是,ASN.1 也非常复杂,文档不完整,因此 ASN.1 对于新应用程序可能不是一个好的选择。
Many data systems also implement some kind of proprietary binary encoding for their data. For example, most relational databases have a network protocol over which you can send queries to the database and get back responses. Those protocols are generally specific to a particular database, and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses from the database’s network protocol into in-memory data structures.
许多数据系统还实现了某种专有的二进制编码来处理其数据。例如,大多数关系型数据库都有一种网络协议,您可以通过该协议发送查询到数据库并获取响应。这些协议通常是特定于特定数据库的,并且数据库供应商提供了一个驱动程序(例如使用ODBC或JDBC API),该驱动程序将数据库的网络协议解码为内存中的数据结构。
So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary encodings based on schemas are also a viable option. They have a number of nice properties:
因此,我们可以看到,尽管JSON、XML和CSV等文本数据格式很常见,但基于模式的二进制编码也是一个可行的选择。它们具有许多优点:
-
They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.
它们可以比各种“二进制JSON”变体更紧凑,因为它们可以在编码数据中省略字段名称。
-
The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).
模式是一种有价值的文档形式,而且由于解码时必须用到模式,你可以确信它是最新的(而手动维护的文档很容易与现实脱节)。
-
Keeping a database of schemas allows you to check forward and backward compatibility of schema changes, before anything is deployed.
保留模式数据库可以让您在部署任何更改之前对模式进行前向和后向兼容性检查。
-
For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.
对于静态类型编程语言的用户而言,从模式中生成代码的能力是有用的,因为它可以使编译时进行类型检查。
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON databases provide (see “Schema flexibility in the document model” ), while also providing better guarantees about your data and better tooling.
简而言之,模式演化提供了与无模式/读时模式(schema-on-read)的JSON数据库相同的灵活性(参见“文档模型中的模式灵活性”),同时还能为你的数据提供更好的保证和更好的工具支持。
Modes of Dataflow
At the beginning of this chapter we said that whenever you want to send some data to another process with which you don’t share memory—for example, whenever you want to send data over the network or write it to a file—you need to encode it as a sequence of bytes. We then discussed a variety of different encodings for doing this.
在本章的开头,我们说过,当你想要将数据发送给另一个进程,而该进程与你不共享内存(例如,当你想要通过网络发送数据或将其写入文件时),你需要将其编码为一系列字节。我们随后讨论了许多不同的编码方式来实现这一点。
We talked about forward and backward compatibility, which are important for evolvability (making change easy by allowing you to upgrade different parts of your system independently, and not having to change everything at once). Compatibility is a relationship between one process that encodes the data, and another process that decodes it.
我们谈到了向前和向后兼容性,这对于可发展性非常重要(通过允许您独立升级系统的不同部分,而不必同时更改所有内容,从而使更改变得容易)。兼容性是编码数据的一个进程与解码其的另一个进程之间的关系。
That’s a fairly abstract idea—there are many ways data can flow from one process to another. Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the most common ways in which data flows between processes:
这是一个相当抽象的概念——数据可以以许多方式从一个进程流向另一个进程。是谁编码数据,谁解码数据?在本章的其余部分,我们将探索数据在进程之间流动的一些最常见的方式:
-
Via databases (see “Dataflow Through Databases” )
通过数据库(参见“数据库中的数据流”)
-
Via service calls (see “Dataflow Through Services: REST and RPC” )
通过服务调用(见“通过服务进行数据流:REST和RPC”)
-
Via asynchronous message passing (see “Message-Passing Dataflow” )
通过异步消息传递(参见“消息传递数据流”)
Dataflow Through Databases
In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it. There may just be a single process accessing the database, in which case the reader is simply a later version of the same process—in that case you can think of storing something in the database as sending a message to your future self .
在数据库中,写入数据库的进程对数据进行编码,从数据库读取的进程对数据进行解码。可能只有一个进程访问数据库,在这种情况下,读取者只是同一进程的较新版本——此时,可以把向数据库中存储数据看作是给未来的自己发送一条消息。
Backward compatibility is clearly necessary here; otherwise your future self won’t be able to decode what you previously wrote.
向后兼容性在这里是明显必要的;否则,您未来的自己将无法解码您之前所写的内容。
In general, it’s common for several different processes to be accessing a database at the same time. Those processes might be several different applications or services, or they may simply be several instances of the same service (running in parallel for scalability or fault tolerance). Either way, in an environment where the application is changing, it is likely that some processes accessing the database will be running newer code and some will be running older code—for example because a new version is currently being deployed in a rolling upgrade, so some instances have been updated while others haven’t yet.
一般而言,多个不同的进程同时访问数据库是很常见的。这些进程可能是多个不同的应用程序或服务,也可能只是同一服务的多个实例(为了可扩展性或容错而并行运行)。无论哪种方式,在应用程序不断变化的环境中,访问数据库的某些进程很可能运行着较新的代码,而另一些进程运行着较旧的代码——例如,因为正在通过滚动升级部署新版本,所以一些实例已经更新,而另一些实例还没有。
This means that a value in the database may be written by a newer version of the code, and subsequently read by an older version of the code that is still running. Thus, forward compatibility is also often required for databases.
这意味着数据库中的值可能被新版本的代码写入,而后被仍在运行的旧版本代码读取。因此,对于数据库也经常需要前向兼容性。
However, there is an additional snag. Say you add a field to a record schema, and the newer code writes a value for that new field to the database. Subsequently, an older version of the code (which doesn’t yet know about the new field) reads the record, updates it, and writes it back. In this situation, the desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t be interpreted.
然而,还有一个额外的问题。假设您向记录模式添加一个字段,而新代码向数据库写入该新字段的值。随后,一个不知道新字段的旧版本代码读取记录,更新它,并将其写回。在这种情况下,通常希望旧代码保留新字段的完整性,尽管旧代码无法解释新字段。
The encoding formats discussed previously support such preservation of unknown fields, but sometimes you need to take care at an application level, as illustrated in Figure 4-7 . For example, if you decode a database value into model objects in the application, and later reencode those model objects, the unknown field might be lost in that translation process. Solving this is not a hard problem; you just need to be aware of it.
之前讨论的编码格式支持未知字段的保留,但有时需要在应用程序级别上进行注意,如图4-7所示。例如,如果您在应用程序中将数据库值解码为模型对象,然后稍后重新对这些模型对象进行编码,则该未知字段可能会在翻译过程中丢失。解决这个问题并不困难,你只需要意识到这个问题即可。
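The pitfall can be made concrete with a small sketch. The `User` class and its methods are hypothetical stand-ins for an application's model layer; the point is only that a fixed model class silently drops fields it doesn't know about, while merging back over the original record preserves them:

```python
class User:                      # old model code, unaware of the new field
    def __init__(self, user_name):
        self.user_name = user_name

    def to_record(self):
        return {"user_name": self.user_name}

# A record written by newer code, which added favorite_number:
record = {"user_name": "Martin", "favorite_number": 1337}

# Naive round-trip through the old model: the unknown field is lost.
lossy = User(record["user_name"]).to_record()

# Being careful at the application level: merge the model's fields over
# the original record, so fields the old code doesn't know about survive.
updated = {**record, **User("Martina").to_record()}
```

Here `lossy` has dropped `favorite_number` entirely, whereas `updated` carries it through even though the old code never interpreted it.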
Different values written at different times
A database generally allows any value to be updated at any time. This means that within a single database you may have some values that were written five milliseconds ago, and some values that were written five years ago.
一个数据库通常允许任何值在任何时间进行更新。这意味着在单个数据库中,你可能会有一些值是五毫秒前写入的,也可能会有一些值是五年前写入的。
When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code .
当您部署应用程序的新版本时(至少对服务器端应用程序而言),您可以在几分钟内用新版本完全替换旧版本。但数据库的内容并非如此:五年前的数据仍然以原始编码存在,除非此后您明确地重写过它。这一观察有时被总结为:数据比代码更长寿(data outlives code)。
Rewriting ( migrating ) data into a new schema is certainly possible, but it’s an expensive thing to do on a large dataset, so most databases avoid it if possible. Most relational databases allow simple schema changes, such as adding a new column with a null default value, without rewriting existing data.v When an old row is read, the database fills in nulls for any columns that are missing from the encoded data on disk. LinkedIn’s document database Espresso uses Avro for storage, allowing it to use Avro’s schema evolution rules [23].
将数据重写(迁移)到新模式中肯定是可能的,但对于大型数据集而言是一项昂贵的工作,因此大多数数据库在可能的情况下避免这种情况。大多数关系型数据库允许简单的模式更改,例如添加一个默认值为 null 的新列,而无需重写现有数据。当读取旧行时,数据库会自动为任何缺失的列填充 null。 LinkedIn 的文档数据库 Espresso 使用 Avro 进行存储,从而可以使用 Avro 的模式演化规则[23]。
Schema evolution thus allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.
模式演化因此允许整个数据库看起来像是使用单一模式编码,尽管底层存储可能包含使用各种历史版本模式编码的记录。
Archival storage
Perhaps you take a snapshot of your database from time to time, say for backup purposes or for loading into a data warehouse (see “Data Warehousing” ). In this case, the data dump will typically be encoded using the latest schema, even if the original encoding in the source database contained a mixture of schema versions from different eras. Since you’re copying the data anyway, you might as well encode the copy of the data consistently.
也许您会定期快照您的数据库,比如用于备份或加载到数据仓库(请参见“数据仓库”)。在这种情况下,数据转储通常将使用最新的模式进行编码,即使源数据库中的原始编码包含不同年代的模式版本的混合物。既然您已经在复制数据,那么也可以对数据副本进行一致的编码。
As the data dump is written in one go and is thereafter immutable, formats like Avro object container files are a good fit. This is also a good opportunity to encode the data in an analytics-friendly column-oriented format such as Parquet (see “Column Compression” ).
由于数据转储是一次性写入的,并且之后是不可变的,因此诸如Avro对象容器文件之类的格式非常适合。这也是将数据编码为适合分析的列定向格式(例如Parquet)的绝佳机会(请参见“列压缩”)。
In Chapter 10 we will talk more about using data in archival storage.
在第10章中,我们将更多地谈论如何在档案存储中使用数据。
Dataflow Through Services: REST and RPC
When you have processes that need to communicate over a network, there are a few different ways of arranging that communication. The most common arrangement is to have two roles: clients and servers . The servers expose an API over the network, and the clients can connect to the servers to make requests to that API. The API exposed by the server is known as a service .
当您需要通过网络进行通信时,有几种不同的方式可以安排通信。最常见的方式是有两个角色:客户端和服务器。服务器在网络上公开API,客户端可以连接到服务器以向该API发出请求。服务器公开的API称为服务。
The web works this way: clients (web browsers) make requests to web servers, making GET requests to download HTML, CSS, JavaScript, images, etc., and making POST requests to submit data to the server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS, HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards, you can use any web browser to access any website (at least in theory!).
网络的工作方式:客户端(Web浏览器)向Web服务器发出请求,进行GET请求下载HTML、CSS、JavaScript、图像等,并进行POST请求提交数据到服务器。API由一组标准协议和数据格式(HTTP、URL、SSL/TLS、HTML等)组成。因为Web浏览器、Web服务器和网站作者大多数都同意这些标准,所以理论上你可以使用任何Web浏览器访问任何网站!
Web browsers are not the only type of client. For example, a native app running on a mobile device or a desktop computer can also make network requests to a server, and a client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (this technique is known as Ajax [ 30 ]). In this case, the server’s response is typically not HTML for displaying to a human, but rather data in an encoding that is convenient for further processing by the client-side application code (such as JSON). Although HTTP may be used as the transport protocol, the API implemented on top is application-specific, and the client and server need to agree on the details of that API.
网络浏览器并非唯一的客户端类型。例如,运行在移动设备或台式计算机上的本地应用程序也可以向服务器发送网络请求,而在 Web 浏览器内运行的客户端 JavaScript 应用程序可以使用 XMLHttpRequest 变成 HTTP 客户端(这种技术称为 Ajax [30])。 在这种情况下,服务器的响应通常不是用于向人显示的 HTML,而是以一种对客户端端应用程序代码进一步处理方便的编码形式呈现的数据(例如 JSON)。虽然 HTTP 可以用作传输协议,但在其上实现的 API 是应用程序特定的,客户端和服务器需要就该 API 的详细信息达成一致。
Moreover, a server can itself be a client to another service (for example, a typical web app server acts as client to a database). This approach is often used to decompose a large application into smaller services by area of functionality, such that one service makes a request to another when it requires some functionality or data from that other service. This way of building applications has traditionally been called a service-oriented architecture (SOA), more recently refined and rebranded as microservices architecture [ 31 , 32 ].
此外,一个服务器本身也可以作为另一个服务的客户端(例如,一个典型的Web应用服务器作为数据库的客户端)。这种方法通常用于按功能区域将大型应用程序分解为较小的服务,其中一个服务在需要另一个服务的功能或数据时向另一个服务发出请求。构建应用程序的这种方式传统上被称为面向服务的架构(SOA),最近被改进和重新品牌为微服务架构[31,32]。
In some ways, services are similar to databases: they typically allow clients to submit and query data. However, while databases allow arbitrary queries using the query languages we discussed in Chapter 2 , services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic (application code) of the service [ 33 ]. This restriction provides a degree of encapsulation: services can impose fine-grained restrictions on what clients can and cannot do.
有些方面,服务类似于数据库:它们通常允许客户端提交和查询数据。然而,尽管数据库允许使用我们在第2章中讨论的查询语言进行任意查询,但服务公开了一个特定于应用程序的API,只允许输入和输出的数据由业务逻辑(应用程序代码)预先确定[33]。这种限制提供了一定的封装:服务可以对客户端可以和不可以做的事情施加细粒度的限制。
A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable. For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams. In other words, we should expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API—precisely what we’ve been talking about in this chapter.
服务导向/微服务架构的一个主要设计目标是通过使服务独立部署和演化来使应用程序更易于修改和维护。例如,每个服务应由一个团队拥有,并且该团队应能够频繁发布该服务的新版本,而无需与其他团队协调。换句话说,我们应该期望旧版和新版服务器和客户端同时运行,因此服务器和客户端使用的数据编码必须在服务API的各个版本中兼容 - 这正是我们在本章中谈论的内容。
Web services
When HTTP is used as the underlying protocol for talking to the service, it is called a web service . This is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:
当HTTP被用作与服务通信的基础协议时,就被称为web服务。这可能略微有误,因为web服务不仅在网络上使用,而且在几种不同的上下文中使用。例如:
-
A client application running on a user’s device (e.g., a native app on a mobile device, or JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.
用户设备上运行的客户端应用程序(例如移动设备上的本机应用程序或使用Ajax的JavaScript Web应用程序),通过HTTP向服务发出请求。这些请求通常通过公共互联网进行。
-
One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware .)
一项服务向同一组织拥有的另一个服务发出请求,通常位于相同的数据中心,作为面向服务/微服务体系结构的一部分。(支持此类用例的软件有时称为中间件。)
-
One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations’ backend systems. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.
一种服务向不同组织拥有的服务发出请求,通常通过互联网进行。这用于不同组织后端系统之间的数据交换。这个类别包括在线服务提供的公共API,例如信用卡处理系统,或用于共享访问用户数据的OAuth。
There are two popular approaches to web services: REST and SOAP . They are almost diametrically opposed in terms of philosophy, and often the subject of heated debate among their respective proponents. vi
有两种流行的 Web 服务方法:REST 和 SOAP。它们在哲学上几乎是完全相反的,常常是各自支持者之间的激烈辩论的主题。
REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP [ 34 , 35 ]. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration [ 36 ], and is often associated with microservices [ 31 ]. An API designed according to the principles of REST is called RESTful .
REST不是一种协议,而是一种基于HTTP原则的设计哲学。 它强调使用简单的数据格式,使用URL来识别资源,并利用HTTP的缓存控制、认证和内容类型协商等特性。与SOAP相比,REST在跨组织服务集成的上下文中越来越受欢迎,并经常与微服务相关联。 根据REST原则设计的API称为RESTful。
By contrast, SOAP is an XML-based protocol for making network API requests. vii Although it is most commonly used over HTTP, it aims to be independent from HTTP and avoids using most HTTP features. Instead, it comes with a sprawling and complex multitude of related standards (the web service framework , known as WS-* ) that add various features [ 37 ].
相比之下,SOAP是一种基于XML的协议,用于进行网络API请求。尽管它最常用于HTTP上,但它旨在独立于HTTP,并避免使用大多数HTTP功能。相反,它带有庞大而复杂的相关标准(称为WS-*的Web服务框架),添加了各种功能。
The API of a SOAP web service is described using an XML-based language called the Web Services Description Language, or WSDL. WSDL enables code generation so that a client can access a remote service using local classes and method calls (which are encoded to XML messages and decoded again by the framework). This is useful in statically typed programming languages, but less so in dynamically typed ones (see “Code generation and dynamically typed languages” ).
SOAP网络服务的API是使用名为Web Services Description Language或WSDL的基于XML的语言来描述的。 WSDL使代码生成成为可能,以便客户端可以使用本地类和方法调用访问远程 服务(这些调用被编码为XML消息,并由框架进行解码)。 这对于静态类型的编程语言很有用,但在动态 类型的编程语言中不太有用(请参见“代码生成和动态类型的语言”)。
As WSDL is not designed to be human-readable, and as SOAP messages are often too complex to construct manually, users of SOAP rely heavily on tool support, code generation, and IDEs [ 38 ]. For users of programming languages that are not supported by SOAP vendors, integration with SOAP services is difficult.
由于 WSDL 不是以人类可读的方式设计的,而且 SOAP 消息经常太复杂而无法手工构造,因此使用 SOAP 的用户严重依赖工具支持、代码生成和 IDE [38]。对于不受 SOAP 供应商支持的编程语言用户来说,与 SOAP 服务集成是困难的。
Even though SOAP and its various extensions are ostensibly standardized, interoperability between different vendors’ implementations often causes problems [ 39 ]. For all of these reasons, although SOAP is still used in many large enterprises, it has fallen out of favor in most smaller companies.
尽管SOAP及其各种扩展显然是标准化的,但不同厂商实现间的互操作性经常会引起问题[39]。出于所有这些原因,虽然SOAP在许多大型企业中仍在使用,但在大多数较小的公司中已不再受欢迎。
RESTful APIs tend to favor simpler approaches, typically involving less code generation and automated tooling. A definition format such as OpenAPI, also known as Swagger [ 40 ], can be used to describe RESTful APIs and produce documentation.
RESTful APIs倾向于采用简单的方法,通常涉及较少的代码生成和自动化工具。定义格式,例如OpenAPI,也称为Swagger [40],可用于描述RESTful APIs并生成文档。
The problems with remote procedure calls (RPCs)
Web services are merely the latest incarnation of a long line of technologies for making API requests over a network, many of which received a lot of hype but have serious problems. Enterprise JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker Architecture (CORBA) is excessively complex, and does not provide backward or forward compatibility [ 41 ].
网络服务仅是一系列通过网络进行API请求的技术中最新的一种,其中许多技术受到过大量宣传,但存在严重问题。企业JavaBeans(EJB) 和Java的远程方法调用(RMI)仅限于Java。分布式组件对象模型(DCOM)仅限于微软平台。通用对象请求代理架构(CORBA)过于复杂,并且不提供向前或向后兼容性[41]。
All of these are based on the idea of a remote procedure call (RPC), which has been around since the 1970s [ 42 ]. The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called location transparency ). Although RPC seems convenient at first, the approach is fundamentally flawed [ 43 , 44 ]. A network request is very different from a local function call:
所有这些都是基于远程过程调用(RPC)的概念,自1970年代以来一直存在[42]。 RPC模型试图使对远程网络服务的请求看起来与在同一进程中调用函数或方法相同(这个抽象称为位置透明度)。尽管RPC乍一看似乎很方便,但这种方法在根本上存在缺陷[43,44]。 网络请求与本地函数调用非常不同:
-
A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control. A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control. Network problems are common, so you have to anticipate them, for example by retrying a failed request.
本地函数调用是可预测的,仅取决于您控制的参数的成功或失败。网络请求不可预测:请求或响应可能由于网络问题丢失,远程计算机可能缓慢或不可用,并且此类问题完全超出您的控制范围。网络问题很常见,因此您必须预先考虑它们,例如通过重试失败的请求。
-
A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes). A network request has another possible outcome: it may return without a result, due to a timeout . In that case, you simply don’t know what happened: if you don’t get a response from the remote service, you have no way of knowing whether the request got through or not. (We discuss this issue in more detail in Chapter 8 .)
本地函数调用要么返回结果,要么抛出异常,要么永远不返回(因为进入了无限循环,或者进程崩溃了)。网络请求还有另一种可能的结果:它可能由于超时而没有返回任何结果。在这种情况下,你根本不知道发生了什么:如果没有从远程服务收到响应,你就无法知道请求是否送达。(我们将在第8章中更详细地讨论这个问题。)
-
If you retry a failed network request, it could happen that the requests are actually getting through, and only the responses are getting lost. In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication ( idempotence ) into the protocol. Local function calls don’t have this problem. (We discuss idempotence in more detail in Chapter 11 .)
如果重新尝试一个失败的网络请求,可能会发生请求实际上能够通过,只是响应丢失的情况。在这种情况下,重新尝试会导致动作被执行多次,除非您在协议中构建重复消除(幂等性)机制。本地函数调用不会有这个问题。(我们将在第11章中更详细地讨论幂等性。)
-
Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is also wildly variable: at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.
每次调用本地函数,它通常需要大致相同的时间来执行。网络请求比函数调用慢得多,而且延迟也极其不稳定:在良好的情况下,它可能在不到一毫秒的时间内完成,但当网络拥塞或远程服务过载时,完成相同的操作可能需要数秒钟的时间。
-
When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network. That’s okay if the parameters are primitives like numbers or strings, but quickly becomes problematic with larger objects.
当您调用本地函数时,您可以有效地将本地内存中的对象的引用(指针)传递给它。当您发出网络请求时,所有这些参数都需要被编码为可以发送到网络的字节序列。如果参数是原始类型如数字或字符串,那是没问题的,但是如果是大型对象,这会很快变得棘手。
-
The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another. This can end up ugly, since not all languages have the same types (recall JavaScript’s problems with numbers greater than 2^53, for example; see “JSON, XML, and Binary Variants”). This problem doesn’t exist in a single process written in a single language.
客户端和服务可能是用不同的编程语言实现的,因此RPC框架必须将数据类型从一种语言转换为另一种语言。这可能会非常难看,因为并非所有语言都有相同的类型,例如回想一下JavaScript处理大于2^53的数字时的问题(参见“JSON,XML和二进制变体”)。这个问题在用单一语言编写的单个进程中是不存在的。
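The retry problem described in the list above can be sketched in a few lines. This is a minimal illustration, not a real RPC framework: `send_request` is a hypothetical transport function that may raise `TimeoutError`, and the server is assumed to deduplicate requests by the idempotency key.

```python
import time
import uuid

def call_with_retries(send_request, payload, max_attempts=3):
    """Retry a network request safely by attaching an idempotency key.

    The same key is reused on every attempt, so a server that
    deduplicates on it will perform the action at most once even if
    only the responses were lost.
    """
    idempotency_key = str(uuid.uuid4())  # generated once, not per attempt
    for attempt in range(max_attempts):
        try:
            return send_request(payload, idempotency_key=idempotency_key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(2 ** attempt * 0.1)  # exponential backoff
```

Without the stable key, a retried request whose original actually got through would be executed twice on the server.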
All of these factors mean that there’s no point trying to make a remote service look too much like a local object in your programming language, because it’s a fundamentally different thing. Part of the appeal of REST is that it doesn’t try to hide the fact that it’s a network protocol (although this doesn’t seem to stop people from building RPC libraries on top of REST).
所有这些因素意味着,在编程语言中没有必要试图让远程服务看起来过于像本地对象,因为它是完全不同的东西。 REST 的吸引力之一在于它不试图隐藏它是一个网络协议的事实(尽管这似乎并不能阻止人们在 REST 之上构建 RPC 库)。
Current directions for RPC
Despite all these problems, RPC isn’t going away. Various RPC frameworks have been built on top of all the encodings mentioned in this chapter: for example, Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP.
尽管存在这些问题,RPC并不会消失。本章提到的所有编码之上都构建了各种RPC框架:例如,Thrift和Avro自带RPC支持,gRPC是使用Protocol Buffers的RPC实现,Finagle也使用Thrift,而Rest.li使用基于HTTP的JSON。
This new generation of RPC frameworks is more explicit about the fact that a remote request is different from a local function call. For example, Finagle and Rest.li use futures ( promises ) to encapsulate asynchronous actions that may fail. Futures also simplify situations where you need to make requests to multiple services in parallel, and combine their results [ 45 ]. gRPC supports streams , where a call consists of not just one request and one response, but a series of requests and responses over time [ 46 ].
这一新一代的RPC框架更明确地表明远程请求与本地函数调用是不同的。例如,Finagle和Rest.li使用futures(承诺)来封装可能失败的异步操作。Futures还简化了需要并行向多个服务发出请求并合并其结果的情况[45]。 gRPC支持流式传输,在这种情况下,一个调用不仅仅包括一个请求和一个响应,还包括随时间的一系列请求和响应[46]。
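The futures-based style described here can be illustrated with Python's asyncio (the actual Finagle and Rest.li APIs differ; `fetch_user` and `fetch_orders` are hypothetical stand-ins for network calls to two separate services):

```python
import asyncio

async def fetch_user(user_id):
    # stand-in for a request to a hypothetical user service
    await asyncio.sleep(0.01)
    return {"id": user_id}

async def fetch_orders(user_id):
    # stand-in for a request to a hypothetical order service
    await asyncio.sleep(0.01)
    return [{"order": 1}]

async def profile_page(user_id):
    # issue both requests in parallel and combine their results,
    # in the futures/promises style described above
    user, orders = await asyncio.gather(fetch_user(user_id),
                                        fetch_orders(user_id))
    return {"user": user, "orders": orders}
```

Because the two awaitables run concurrently, the combined latency is roughly the slower of the two calls rather than their sum.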
Some of these frameworks also provide service discovery —that is, allowing a client to find out at which IP address and port number it can find a particular service. We will return to this topic in “Request Routing” .
其中一些框架还提供服务发现功能,即允许客户端查找特定服务的 IP 地址和端口号。我们在“请求路由”中会再次探讨这个话题。
Custom RPC protocols with a binary encoding format can achieve better performance than something generic like JSON over REST. However, a RESTful API has other significant advantages: it is good for experimentation and debugging (you can simply make requests to it using a web browser or the command-line tool curl, without any code generation or software installation), it is supported by all mainstream programming languages and platforms, and there is a vast ecosystem of tools available (servers, caches, load balancers, proxies, firewalls, monitoring, debugging tools, testing tools, etc.).
采用二进制编码格式的自定义RPC协议可以比通用的JSON over REST实现更好的性能。然而,RESTful API有其他重要的优势:它适用于实验和调试(您可以使用Web浏览器或命令行工具curl轻松地对其进行请求,无需任何代码生成或软件安装),它被所有主流编程语言和平台支持,并且有大量的工具生态系统可用(服务器、缓存、负载平衡器、代理、防火墙、监控、调试工具、测试工具等等)。
For these reasons, REST seems to be the predominant style for public APIs. The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.
因此,REST似乎是公共API的主要风格。 RPC框架的主要重点是同一组织拥有的服务之间的请求,通常位于相同的数据中心。
Data encoding and evolution for RPC
For evolvability, it is important that RPC clients and servers can be changed and deployed independently. Compared to data flowing through databases (as described in the last section), we can make a simplifying assumption in the case of dataflow through services: it is reasonable to assume that all the servers will be updated first, and all the clients second. Thus, you only need backward compatibility on requests, and forward compatibility on responses.
对于可演化性而言,RPC客户端和服务器能够独立地更改和部署非常重要。与流经数据库的数据(如上一节所述)相比,对于流经服务的数据,我们可以做一个简化假设:合理地假设所有服务器会先更新,所有客户端随后更新。因此,您只需要请求具有向后兼容性,响应具有向前兼容性。
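This compatibility split can be sketched minimally; the field names (`language`, `server_version`) are invented for the example. The server, deployed first, tolerates old requests that lack a new field, while an old client simply ignores response fields it does not know about.

```python
def handle_request(request: dict) -> dict:
    # Backward compatibility on requests: the server (updated first)
    # must accept requests from old clients that lack the new field.
    lang = request.get("language", "en")  # default for old clients
    return {
        "greeting": "hello",
        "language": lang,
        "server_version": 2,  # new field, added in the server upgrade
    }

def parse_response_old_client(response: dict) -> str:
    # Forward compatibility on responses: an old client reads only
    # the fields it knows about and ignores the rest.
    return response["greeting"]
```

If either side instead rejected unexpected or missing fields outright, the servers and clients could no longer be deployed independently.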
The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
RPC方案的向后和向前兼容性属性是从其使用的编码继承来的。
-
Thrift, gRPC (Protocol Buffers), and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.
Thrift、gRPC(Protocol Buffers)和Avro RPC都可以根据各自编码格式的兼容性规则进行升级。
-
In SOAP, requests and responses are specified with XML schemas. These can be evolved, but there are some subtle pitfalls [ 47 ].
在SOAP中,请求和响应是使用XML模式指定的。它们可以演变,但有一些微妙的陷阱[47]。
-
RESTful APIs most commonly use JSON (without a formally specified schema) for responses, and JSON or URI-encoded/form-encoded request parameters for requests. Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.
RESTful API 最常用 JSON(没有正式指定的模式)作为响应,JSON 或 URI 编码/表单编码请求参数作为请求。添加可选请求参数和向响应对象添加新字段通常被认为是保持兼容性的变化。
Service compatibility is made harder by the fact that RPC is often used for communication across organizational boundaries, so the provider of a service often has no control over its clients and cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps indefinitely. If a compatibility-breaking change is required, the service provider often ends up maintaining multiple versions of the service API side by side.
RPC经常被用于跨组织边界的通信,这使得服务的兼容性更难维护:服务提供者通常无法控制其客户端,也不能强制它们升级。因此,兼容性需要长期保持,甚至可能无限期保持。如果必须进行破坏兼容性的更改,服务提供者往往最终需要并排维护多个版本的服务API。
There is no agreement on how API versioning should work (i.e., how a client can indicate which version of the API it wants to use [ 48 ]). For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP Accept header. For services that use API keys to identify a particular client, another option is to store a client’s requested API version on the server and to allow this version selection to be updated through a separate administrative interface [ 49 ].
关于API版本控制应当如何工作(即客户端如何指明它想使用哪个版本的API[48]),目前还没有共识。对于RESTful API,常见的方法是在URL或HTTP Accept头中使用版本号。对于使用API密钥标识特定客户端的服务,另一种选择是将客户端请求的API版本存储在服务器上,并允许通过单独的管理接口更新这一版本选择[49]。
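As a sketch of the Accept-header approach, assuming an illustrative vendor media type such as `application/vnd.example.v2+json` (the naming scheme is an example, not a standard mandated by REST):

```python
import re

def negotiate_version(headers: dict, default: int = 1) -> int:
    """Pick an API version from the HTTP Accept header.

    Assumes clients send a vendor media type like
    'application/vnd.example.v2+json'; clients that send nothing
    recognizable fall back to the default version.
    """
    accept = headers.get("Accept", "")
    match = re.search(r"vnd\.example\.v(\d+)\+json", accept)
    return int(match.group(1)) if match else default
```

A URL-based scheme (e.g. a `/v2/` path prefix) is the other common option; the trade-off is that Accept-header versioning keeps URLs stable across versions.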
Message-Passing Dataflow
We have been looking at the different ways encoded data flows from one process to another. So far, we’ve discussed REST and RPC (where one process sends a request over the network to another process and expects a response as quickly as possible), and databases (where one process writes encoded data, and another process reads it again sometime in the future).
我们一直在研究编码数据从一个进程流向另一个进程的不同方式。到目前为止,我们已经讨论了REST和RPC(其中一个进程通过网络发送请求到另一个进程,并期望尽快得到响应),以及数据库(其中一个进程编写编码数据,而另一个进程在将来的某个时候再次读取它)。
In this final section, we will briefly look at asynchronous message-passing systems, which are somewhere between RPC and databases. They are similar to RPC in that a client’s request (usually called a message ) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called a message queue or message-oriented middleware ), which stores the message temporarily.
在最后这一节中,我们将简要介绍异步消息传递系统,它们介于RPC和数据库之间。它们与RPC相似,因为客户端的请求(通常称为消息)以低延迟交付给另一个进程。它们与数据库相似,因为消息不是通过直接的网络连接发送的,而是经由一个称为消息代理(也称消息队列或面向消息的中间件)的中介,后者会临时存储消息。
Using a message broker has several advantages compared to direct RPC:
使用消息代理与直接RPC相比具有以下几个优点:
-
It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
如果接收者不可用或超负荷,它可以起到缓冲作用,从而提高系统的可靠性。
-
It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
它可以自动将消息重新发送给已崩溃的进程,从而防止消息丢失。
-
It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).
这避免了发送方需要知道接收方的IP地址和端口号(这在云部署中尤其有用,因为虚拟机经常会出现变动)。
-
It allows one message to be sent to several recipients.
它可以将一条消息发送给多个收件人。
-
It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).
它逻辑上将发送者与接收者分离开来(发送者只发布消息,不关心谁消费它们)。
However, a difference compared to RPC is that message-passing communication is usually one-way: a sender normally doesn’t expect to receive a reply to its messages. It is possible for a process to send a response, but this would usually be done on a separate channel. This communication pattern is asynchronous : the sender doesn’t wait for the message to be delivered, but simply sends it and then forgets about it.
然而,与RPC相比的一个区别是,消息传递通信通常是单向的:发送方通常不希望接收其消息的回复。进程可能会发送响应,但通常会在单独的通道上进行。这种通信模式是异步的:发送方不等待消息被传送,而只是发送它并忘记它。
Message brokers
In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular. We will compare them in more detail in Chapter 11 .
过去,消息代理的市场被TIBCO、IBM WebSphere和webMethods等公司的商业企业软件所主导。近年来,像RabbitMQ、ActiveMQ、HornetQ、NATS和Apache Kafka这样的开源实现变得流行起来。我们将在第11章中对它们进行更详细的比较。
The detailed delivery semantics vary by implementation and configuration, but in general, message brokers are used as follows: one process sends a message to a named queue or topic , and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic.
具体的传递语义因实现和配置而异,但通常情况下,消息代理如下使用:一个进程向命名队列或主题发送一条消息,代理确保该消息传递给一个或多个订阅该队列或主题的消费者。同一个主题上可以有多个生产者和多个消费者。
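The publish/subscribe flow just described can be sketched with a toy in-memory broker. Real brokers add durability, acknowledgements, and redelivery; this only illustrates named topics with multiple consumers:

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy message broker: producers publish to a named topic, and
    the broker delivers each message to every subscriber of that
    topic. (A sketch of the delivery model only, not of RabbitMQ,
    Kafka, or any specific product.)"""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # register a consumer callback for the topic
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # deliver the message to every consumer of the topic
        for callback in self._subscribers[topic]:
            callback(message)
```

Note that the producer never learns who the consumers are: it only names a topic, which is the decoupling described above.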
A topic provides only one-way dataflow. However, a consumer may itself publish messages to another topic (so you can chain them together, as we shall see in Chapter 11 ), or to a reply queue that is consumed by the sender of the original message (allowing a request/response dataflow, similar to RPC).
一个主题只提供单向数据流。然而,消费者本身可以发布消息到另一个主题(因此可以将它们链接在一起,如第11章所示),或者发布到由原始消息发送者消费的回复队列(允许请求/响应数据流,类似于RPC)。
Message brokers typically don’t enforce any particular data model—a message is just a sequence of bytes with some metadata, so you can use any encoding format. If the encoding is backward and forward compatible, you have the greatest flexibility to change publishers and consumers independently and deploy them in any order.
通常,消息代理不强制任何特定的数据模型:消息只是带有一些元数据的字节序列,因此您可以使用任何编码格式。如果编码是向前和向后兼容的,您就可以最灵活地独立更改发布者和消费者,并以任意顺序部署它们。
If a consumer republishes messages to another topic, you may need to be careful to preserve unknown fields, to prevent the issue described previously in the context of databases ( Figure 4-7 ).
如果消费者将消息重新发布到另一个主题,则需要小心保留未知字段,以防止在数据库上下文中描述的问题(图4-7)。
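One way to preserve unknown fields when republishing is to decode into a generic structure, touch only the fields this consumer owns, and re-encode everything. A minimal JSON sketch (the field names are illustrative):

```python
import json

def update_and_republish(raw_message: bytes) -> bytes:
    """Modify a message without dropping fields added by newer producers.

    Decoding into a plain dict keeps every field, known or not, so a
    'status' written by a v2 producer alongside a new 'added_by_v2'
    field survives this older consumer's round trip.
    """
    record = json.loads(raw_message)
    record["status"] = "processed"  # the one field this consumer owns
    return json.dumps(record).encode("utf-8")
```

Decoding into a fixed class with a closed set of attributes would silently drop the unknown fields on re-encoding, which is exactly the failure mode shown in Figure 4-7.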
Distributed actor frameworks
The actor model is a programming model for concurrency in a single process. Rather than dealing directly with threads (and the associated problems of race conditions, locking, and deadlock), logic is encapsulated in actors . Each actor typically represents one client or entity, it may have some local state (which is not shared with any other actor), and it communicates with other actors by sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t need to worry about threads, and each actor can be scheduled independently by the framework.
演员模型是一种用于单个进程内并发的编程模型。与直接处理线程(以及相关的竞态条件、加锁和死锁问题)不同,逻辑被封装在演员中。每个演员通常代表一个客户端或实体,它可能有一些本地状态(不与任何其他演员共享),并通过发送和接收异步消息与其他演员通信。消息传递没有保证:在某些错误情况下,消息会丢失。由于每个演员一次只处理一条消息,它不需要担心线程问题,而每个演员都可以由框架独立调度。
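The model can be sketched in a few lines: each actor owns private state and a mailbox, and processes one message at a time on its own thread. This is a toy illustration of the idea, not the API of Akka, Orleans, or Erlang:

```python
import queue
import threading

class CounterActor:
    """A minimal actor: private state, a mailbox, one message at a time."""

    def __init__(self):
        self._count = 0                 # local state, never shared
        self._mailbox = queue.Queue()   # asynchronous message queue
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        # fire-and-forget: the sender does not wait for processing
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                # no locking needed: only this thread touches _count
                self._count += 1

    def stop_and_get(self):
        self._mailbox.put("stop")
        self._thread.join()
        return self._count
```

Because all state changes happen sequentially inside `_run`, concurrent senders never race on `_count`, which is the property the actor model trades thread primitives for.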
In distributed actor frameworks , this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes. If they are on different nodes, the message is transparently encoded into a byte sequence, sent over the network, and decoded on the other side.
在分布式 actor 框架中,该编程模型被用于跨多个节点扩展应用程序。无论发送方和接收方是在同一节点还是不同节点,都使用相同的消息传递机制。如果它们在不同的节点上,消息将被透明地编码为字节序列,通过网络发送,并在另一侧进行解码。
Location transparency works better in the actor model than in RPC, because the actor model already assumes that messages may be lost, even within a single process. Although latency over the network is likely higher than within the same process, there is less of a fundamental mismatch between local and remote communication when using the actor model.
在Actor模型中,位置透明度比RPC更好,因为Actor模型已经假定消息可能会丢失,即使在单个进程中也是如此。虽然网络延迟可能比同一进程内的延迟要高,但使用Actor模型时,本地和远程通信之间的基本不匹配要少得多。
A distributed actor framework essentially integrates a message broker and the actor programming model into a single framework. However, if you want to perform rolling upgrades of your actor-based application, you still have to worry about forward and backward compatibility, as messages may be sent from a node running the new version to a node running the old version, and vice versa.
分布式演员框架基本上将消息代理和演员编程模型集成到单个框架中。然而,如果您想要执行基于演员的应用程序的滚动升级,则仍然需要考虑前向和后向兼容性,因为消息可能会从运行新版本的节点发送到运行旧版本的节点,反之亦然。
Three popular distributed actor frameworks handle message encoding as follows:
三个流行的分布式 actor 框架的信息编码方式如下:
-
Akka uses Java’s built-in serialization by default, which does not provide forward or backward compatibility. However, you can replace it with something like Protocol Buffers, and thus gain the ability to do rolling upgrades [ 50 ].
Akka默认使用Java内置的序列化,但是它没有提供向前或向后兼容性。不过,您可以使用类似Protocol Buffers的东西替换它,从而获得滚动升级的能力[50].
-
Orleans by default uses a custom data encoding format that does not support rolling upgrade deployments; to deploy a new version of your application, you need to set up a new cluster, move traffic from the old cluster to the new one, and shut down the old one [ 51 , 52 ]. Like with Akka, custom serialization plug-ins can be used.
Orleans默认使用自定义数据编码格式,不支持滚动升级部署;要部署应用程序的新版本,需要建立一个新集群,将流量从旧集群转移到新集群,然后关闭旧集群[51,52]。与Akka一样,也可以使用自定义序列化插件。
-
In Erlang OTP it is surprisingly hard to make changes to record schemas (despite the system having many features designed for high availability); rolling upgrades are possible but need to be planned carefully [ 53 ]. An experimental new maps datatype (a JSON-like structure, introduced in Erlang R17 in 2014) may make this easier in the future [ 54 ].
在Erlang OTP中,尽管该系统有许多为高可用性而设计的功能,但更改记录模式仍然出奇地困难;滚动升级是可能的,但需要仔细规划[53]。一个实验性的新maps数据类型(类似JSON的结构,于2014年在Erlang R17中引入)可能会在未来使这一点变得更容易[54]。
Summary
In this chapter we looked at several ways of turning data structures into bytes on the network or bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more importantly also the architecture of applications and your options for deploying them.
在本章中,我们探讨了将数据结构转换为网络上的字节或磁盘上的字节的几种方法。我们看到了这些编码的细节不仅影响它们的效率,而且更重要的是对应用程序的架构和部署选项产生影响。
In particular, many services need to support rolling upgrades, where a new version of a service is gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously. Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging frequent small releases over rare big releases) and make deployments less risky (allowing faulty releases to be detected and rolled back before they affect a large number of users). These properties are hugely beneficial for evolvability , the ease of making changes to an application.
特别是,许多服务需要支持滚动升级:服务的新版本每次逐步部署到少数节点,而不是同时部署到所有节点。滚动升级允许在不停机的情况下发布服务的新版本(从而鼓励频繁的小版本发布,而不是罕见的大版本发布),并降低部署风险(允许在有问题的版本影响大量用户之前将其检测出来并回滚)。这些特性对可演化性,即对应用程序进行更改的容易程度,大有裨益。
During rolling upgrades, or for various other reasons, we must assume that different nodes are running the different versions of our application’s code. Thus, it is important that all data flowing around the system is encoded in a way that provides backward compatibility (new code can read old data) and forward compatibility (old code can read new data).
在滚动升级期间或出于其他原因,我们必须假设不同的节点运行着我们应用程序代码的不同版本。因此,重要的是所有在系统中流动的数据都以一种提供向后兼容性(新代码可以读取旧数据)和向前兼容性(旧代码可以读取新数据)的方式进行编码。
We discussed several data encoding formats and their compatibility properties:
我们讨论了几种数据编码格式及其兼容性属性:
-
Programming language–specific encodings are restricted to a single programming language and often fail to provide forward and backward compatibility.
特定于编程语言的编码局限于单一编程语言,并且常常无法提供向前和向后兼容性。
-
Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you use them. They have optional schema languages, which are sometimes helpful and sometimes a hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things like numbers and binary strings.
文本格式,如JSON、 XML和CSV,是广泛使用的,它们的兼容性取决于你如何使用它们。它们具有可选的模式语言,有时有帮助,有时是一种阻力。这些格式对于数据类型有些模糊,所以对于像数字和二进制字符串这样的东西,你必须小心。
-
Binary schema–driven formats like Thrift, Protocol Buffers, and Avro allow compact, efficient encoding with clearly defined forward and backward compatibility semantics. The schemas can be useful for documentation and code generation in statically typed languages. However, they have the downside that data needs to be decoded before it is human-readable.
二进制schema驱动格式(例如Thrift,Protocol Buffers和Avro)允许紧凑、高效的编码,并具有明确定义的向前和向后兼容性语义。在静态类型语言中,这些模式可以用于文档和代码生成。但是,它们的缺点是需要在数据变得可读之前进行解码。
We also discussed several modes of dataflow, illustrating different scenarios in which data encodings are important:
我们还讨论了几种数据流模式,它们说明了数据编码在哪些不同场景下十分重要:
-
Databases, where the process writing to the database encodes the data and the process reading from the database decodes it
数据库,写入数据库的过程对数据进行编码,读取数据库的过程对其进行解码。
-
RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
RPC和REST API,客户端编码一个请求,服务器解码请求并编码响应,客户端最后解码响应。
-
Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient
异步消息传递(使用消息代理或演员),其中节点通过发送由发送者编码的消息,由接收者解码的方式进行通信。
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are quite achievable. May your application’s evolution be rapid and your deployments be frequent.
我们可以得出结论,借助一些小心谨慎,实现向前/向后兼容和滚动升级是完全可行的。愿你的应用能够快速迭代,部署频繁。
Footnotes
i With the exception of some special cases, such as certain memory-mapped files or when operating directly on compressed data (as described in “Column Compression” ).
除了某些特殊情况,例如某些内存映射文件或在直接操作压缩数据时(如“列压缩”中所述)。
ii Note that encoding has nothing to do with encryption . We don’t discuss encryption in this book.
请注意编码与加密无关。本书不讨论加密。
iii Actually, it has three—BinaryProtocol, CompactProtocol, and DenseProtocol—although DenseProtocol is only supported by the C++ implementation, so it doesn’t count as cross-language [ 18 ]. Besides those, it also has two different JSON-based encoding formats [ 19 ]. What fun!
其实,Thrift有三种协议:BinaryProtocol、CompactProtocol和DenseProtocol,不过DenseProtocol只有C++实现支持,因此不能算作跨语言[18]。此外,它还有两种不同的基于JSON的编码格式[19]。多么有趣!
iv To be precise, the default value must be of the type of the first branch of the union, although this is a specific limitation of Avro, not a general feature of union types.
准确地说,默认值必须是联合(union)类型中第一个分支的类型,不过这是Avro的特定限制,而不是联合类型的一般特征。
v Except for MySQL, which often rewrites an entire table even though it is not strictly necessary, as mentioned in “Schema flexibility in the document model” .
除了MySQL,如“文档模型中的架构灵活性”中提到的那样,即使没有严格必要,它经常重写整个表。
vi Even within each camp there are plenty of arguments. For example, HATEOAS ( hypermedia as the engine of application state ), often provokes discussions [ 35 ].
甚至在每个阵营内部也有很多争论。例如,HATEOAS(将超媒体作为应用程序状态的引擎)经常引发讨论。
vii Despite the similarity of acronyms, SOAP is not a requirement for SOA. SOAP is a particular technology, whereas SOA is a general approach to building systems.
虽然这些首字母缩写很相似,但SOAP不是SOA的要求。SOAP只是一种特定技术,而SOA是一种通用的系统构建方法。
References
[ 1 ] “ Java Object Serialization Specification ,” docs.oracle.com , 2010.
[1] “Java对象序列化规范”,docs.oracle.com,2010。
[ 2 ] “ Ruby 2.2.0 API Documentation ,” ruby-doc.org , Dec 2014.
“Ruby 2.2.0 API 文档”,ruby-doc.org,2014年12月。
[ 3 ] “ The Python 3.4.3 Standard Library Reference Manual ,” docs.python.org , February 2015.
[3] “Python 3.4.3标准库参考手册”,docs.python.org,2015年2月。
[ 4 ] “ EsotericSoftware/kryo ,” github.com , October 2014.
[4] “EsotericSoftware/kryo”,github.com,2014年10月。
[ 5 ] “ CWE-502: Deserialization of Untrusted Data ,” Common Weakness Enumeration, cwe.mitre.org , July 30, 2014.
CWE-502:“不受信任的数据反序列化”,通用弱点枚举,cwe.mitre.org,2014年7月30日。
[ 6 ] Steve Breen: “ What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability ,” foxglovesecurity.com , November 6, 2015.
史蒂夫·布林: “WebLogic、WebSphere、JBoss、Jenkins、OpenNMS和你的应用程序有什么共同点?这种漏洞,“foxglovesecurity.com,2015年11月6日。
[ 7 ] Patrick McKenzie: “ What the Rails Security Issue Means for Your Startup ,” kalzumeus.com , January 31, 2013.
[7] Patrick McKenzie:“Rails 安全问题对你的创业公司意味着什么”,kalzumeus.com,2013 年 1 月 31 日。
[ 8 ] Eishay Smith: “ jvm-serializers wiki ,” github.com , November 2014.
“jvm-serializers 维基”,github.com,2014年11月。
[ 9 ] “ XML Is a Poor Copy of S-Expressions ,” c2.com wiki.
[9] “XML是S-表达式的劣质复制品”,c2.com维基。
[ 10 ] Matt Harris: “ Snowflake: An Update and Some Very Important Information ,” email to Twitter Development Talk mailing list, October 19, 2010.
[10] 马特·哈里斯: “雪花: 更新和一些非常重要的信息”,电子邮件发送给Twitter开发者讨论组,2010年10月19日。
[ 11 ] Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson: “ XML Schema 1.1 ,” W3C Recommendation, May 2001.
[11] 高舒迪(桑迪)Gao、C.M.斯珀伯格-麦奎因和Henry S.汤普森:“XML Schema 1.1”,W3C建议,2001年5月。
[ 12 ] Francis Galiegue, Kris Zyp, and Gary Court: “ JSON Schema ,” IETF Internet-Draft, February 2013.
[12] Francis Galiegue,Kris Zyp和Gary Court:“JSON模式”,IETF网络草案,2013年2月。
[ 13 ] Yakov Shafranovich: “ RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files ,” October 2005.
[13] Yakov Shafranovich:“RFC 4180:逗号分隔值(CSV)文件的通用格式和MIME类型”,2005年10月。
[ 14 ] “ MessagePack Specification ,” msgpack.org .
[14] “MessagePack规范”,msgpack.org。
[ 15 ] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski: “ Thrift: Scalable Cross-Language Services Implementation ,” Facebook technical report, April 2007.
[15] Mark Slee, Aditya Agarwal和Marc Kwiatkowski:“Thrift: 可扩展的跨语言服务实现”,Facebook技术报告,2007年4月。
[ 16 ] “ Protocol Buffers Developer Guide ,” Google, Inc., developers.google.com .
[16] “Protocol Buffers开发者指南”,Google, Inc.,developers.google.com。
[ 17 ] Igor Anishchenko: “ Thrift vs Protocol Buffers vs Avro - Biased Comparison ,” slideshare.net , September 17, 2012.
[17] Igor Anishchenko:“Thrift vs Protocol Buffers vs Avro - 有偏见的比较”,slideshare.net,2012年9月17日。
[ 18 ] “ A Matrix of the Features Each Individual Language Library Supports ,” wiki.apache.org .
[18] “各语言库支持特性的矩阵”,wiki.apache.org。
[ 19 ] Martin Kleppmann: “ Schema Evolution in Avro, Protocol Buffers and Thrift ,” martin.kleppmann.com , December 5, 2012.
"[19] Martin Kleppmann: “Avro、Protocol Buffers 和 Thrift中的模式演化”,martin.kleppmann.com,2012年12月5日。"
[ 20 ] “ Apache Avro 1.7.7 Documentation ,” avro.apache.org , July 2014.
“Apache Avro 1.7.7 文档”,avro.apache.org,2014年7月。
[ 21 ] Doug Cutting, Chad Walters, Jim Kellerman, et al.: “ [PROPOSAL] New Subproject: Avro ,” email thread on hadoop-general mailing list, mail-archives.apache.org , April 2009.
【提案】新子项目:Avro,Doug Cutting、Chad Walters、Jim Kellerman等人在hadoop-general邮件列表上的电子邮件线程,mail-archives.apache.org,2009年4月。
[ 22 ] Tony Hoare: “ Null References: The Billion Dollar Mistake ,” at QCon London , March 2009.
[22] 托尼·霍尔:“空引用:十亿美元的错误”,于2009年3月在伦敦QCon上发表。
[ 23 ] Aditya Auradkar and Tom Quiggle: “ Introducing Espresso—LinkedIn’s Hot New Distributed Document Store ,” engineering.linkedin.com , January 21, 2015.
[23] Aditya Auradkar 和 Tom Quiggle: “介绍 Espresso - LinkedIn 最新的分布式文档数据库”,engineering.linkedin.com,2015年1月21日。
[ 24 ] Jay Kreps: “ Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2) ,” blog.confluent.io , February 25, 2015.
[24] Jay Kreps:“使用Apache Kafka:构建流数据平台实践指南(第2部分)”,blog.confluent.io,2015年2月25日。
[ 25 ] Gwen Shapira: “ The Problem of Managing Schemas ,” radar.oreilly.com , November 4, 2014.
[25] Gwen Shapira: “管理模式的问题”,radar.oreilly.com,2014年11月4日。
[ 26 ] “ Apache Pig 0.14.0 Documentation ,” pig.apache.org , November 2014.
[26] “Apache Pig 0.14.0 文档”,pig.apache.org,2014年11月。
[ 27 ] John Larmouth: ASN.1 Complete . Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1
[27] 约翰·拉默斯: ASN.1 完整版。摩根·考夫曼,1999年。ISBN:978-0-122-33435-1。
[ 28 ] Russell Housley, Warwick Ford, Tim Polk, and David Solo: “ RFC 2459: Internet X.509 Public Key Infrastructure: Certificate and CRL Profile ,” IETF Network Working Group, Standards Track, January 1999.
[28] Russell Housley,Warwick Ford,Tim Polk和David Solo: “RFC 2459:Internet X.509公钥基础设施:证书和CRL配置文件”,IETF网络工作组,标准跟踪,1999年1月。
[ 29 ] Lev Walkin: “ Question: Extensibility and Dropping Fields ,” lionet.info , September 21, 2010.
[29] 列夫·沃尔金: “问题:可扩展性和删除字段,” lionet.info,2010年9月21日。
[ 30 ] Jesse James Garrett: “ Ajax: A New Approach to Web Applications ,” adaptivepath.com , February 18, 2005.
[30] Jesse James Garrett:“Ajax:一种构建Web应用程序的新方法”,adaptivepath.com,2005年2月18日。
[ 31 ] Sam Newman: Building Microservices . O’Reilly Media, 2015. ISBN: 978-1-491-95035-7
[31] Sam Newman: 构建微服务。O’Reilly Media,2015年。ISBN:978-1-491-95035-7。
[ 32 ] Chris Richardson: “ Microservices: Decomposing Applications for Deployability and Scalability ,” infoq.com , May 25, 2014.
[32] 克里斯·理查德森:「微服务:为了可部署性和可扩展性分解应用程序」,infoq.com,2014年5月25日。
[ 33 ] Pat Helland: “ Data on the Outside Versus Data on the Inside ,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.
[33] Pat Helland:“外部数据与内部数据”,发表于第二届创新数据系统研究双年会(CIDR),2005年1月。
[ 34 ] Roy Thomas Fielding: “ Architectural Styles and the Design of Network-Based Software Architectures ,” PhD Thesis, University of California, Irvine, 2000.
[34] 罗伊·托马斯·菲尔丁: “体系结构风格与基于网络的软件架构的设计”,加州大学欧文分校博士论文,2000年。
[ 35 ] Roy Thomas Fielding: “ REST APIs Must Be Hypertext-Driven ,” roy.gbiv.com , October 20 2008.
[35] Roy Thomas Fielding:“REST API必须是超文本驱动的”,roy.gbiv.com,2008年10月20日。
[ 36 ] “ REST in Peace, SOAP ,” royal.pingdom.com , October 15, 2010.
[36] “安息吧,SOAP”,royal.pingdom.com,2010年10月15日。
[ 37 ] “ Web Services Standards as of Q1 2007 ,” innoq.com , February 2007.
“2007年第一季度的Web服务标准”,innoq.com,2007年2月。
[ 38 ] Pete Lacey: “ The S Stands for Simple ,” harmful.cat-v.org , November 15, 2006.
[38] Pete Lacey:“S代表简单”,harmful.cat-v.org,2006年11月15日。
[ 39 ] Stefan Tilkov: “ Interview: Pete Lacey Criticizes Web Services ,” infoq.com , December 12, 2006.
[39] Stefan Tilkov:“访谈:Pete Lacey 批评 Web Services”,infoq.com,2006 年 12 月 12 日。
[ 40 ] “ OpenAPI Specification (fka Swagger RESTful API Documentation Specification) Version 2.0 ,” swagger.io , September 8, 2014.
[40] “OpenAPI规范(原名Swagger RESTful API文档规范)2.0版”,swagger.io,2014年9月8日。
[ 41 ] Michi Henning: “ The Rise and Fall of CORBA ,” ACM Queue , volume 4, number 5, pages 28–34, June 2006. doi:10.1145/1142031.1142044
[41] 米奇·亨宁:“CORBA 的兴起与衰落”,ACM Queue,第4卷,第5期,28-34页,2006年6月。doi:10.1145/1142031.1142044。
[ 42 ] Andrew D. Birrell and Bruce Jay Nelson: “ Implementing Remote Procedure Calls ,” ACM Transactions on Computer Systems (TOCS), volume 2, number 1, pages 39–59, February 1984. doi:10.1145/2080.357392
[42] 安德鲁·D·比雷尔和布鲁斯·杰伊·尼尔森:“实现远程过程调用”,ACM计算机系统交易(TOCS),第2卷,第1号,页39-59,1984年2月。doi:10.1145/2080.357392。
[ 43 ] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall: “ A Note on Distributed Computing ,” Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994.
[43] Jim Waldo、Geoff Wyant、Ann Wollrath和Sam Kendall:“关于分布式计算的说明”,Sun Microsystems Laboratories, Inc.,技术报告TR-94-29,1994年11月。
[ 44 ] Steve Vinoski: “ Convenience over Correctness ,” IEEE Internet Computing , volume 12, number 4, pages 89–92, July 2008. doi:10.1109/MIC.2008.75
[44] Steve Vinoski:“方便性胜于正确性”,IEEE Internet Computing,第12卷,第4期,89–92页,2008年7月。doi:10.1109/MIC.2008.75
[ 45 ] Marius Eriksen: “ Your Server as a Function ,” at 7th Workshop on Programming Languages and Operating Systems (PLOS), November 2013. doi:10.1145/2525528.2525538
[45] Marius Eriksen:“作为函数的服务器”,于2013年11月第7届编程语言和操作系统研讨会(PLOS)上发表,doi:10.1145/2525528.2525538。
[ 46 ] “ grpc-common Documentation ,” Google, Inc., github.com , February 2015.
“grpc-common文档”,Google,Inc.,github.com,2015年2月。
[ 47 ] Aditya Narayan and Irina Singh: “ Designing and Versioning Compatible Web Services ,” ibm.com , March 28, 2007.
[47] Aditya Narayan和Irina Singh:“设计和版本化兼容的Web服务”,ibm.com,2007年3月28日。
[ 48 ] Troy Hunt: “ Your API Versioning Is Wrong, Which Is Why I Decided to Do It 3 Different Wrong Ways ,” troyhunt.com , February 10, 2014.
[48] Troy Hunt:“你的API版本控制是错的,这就是为什么我决定用3种不同的错误方式来做”,troyhunt.com,2014年2月10日。
[ 49 ] “ API Upgrades ,” Stripe, Inc., April 2015.
“API升级”,Stripe公司,2015年4月。
[ 50 ] Jonas Bonér: “ Upgrade in an Akka Cluster ,” email to akka-user mailing list, grokbase.com , August 28, 2013.
[50] Jonas Bonér:“在 Akka 集群中升级”,电子邮件发送至 akka-user 邮件列表,grokbase.com,2013 年 8 月 28 日。
[ 51 ] Philip A. Bernstein, Sergey Bykov, Alan Geller, et al.: “ Orleans: Distributed Virtual Actors for Programmability and Scalability ,” Microsoft Research Technical Report MSR-TR-2014-41, March 2014.
菲利普·A·伯恩斯坦、谢尔盖·拜科夫、艾伦·盖勒等人写道:“奥尔良:分布式虚拟执行器用于可编程性和可扩展性”,微软研究技术报告MSR-TR-2014-41,2014年3月。
[ 52 ] “ Microsoft Project Orleans Documentation ,” Microsoft Research, dotnet.github.io , 2015.
“Microsoft Project Orleans文档”,微软研究,dotnet.github.io,2015年。
[ 53 ] David Mercer, Sean Hinde, Yinso Chen, and Richard A O’Keefe: “ beginner: Updating Data Structures ,” email thread on erlang-questions mailing list, erlang.com , October 29, 2007.
[53] 大卫·默瑟、肖恩·欣德、陈银锁和理查德·A·奥基夫:“初学者:更新数据结构”,电子邮件线程在erlang-questions邮件列表上,erlang.com,2007年10月29日。
[ 54 ] Fred Hebert: “ Postscript: Maps ,” learnyousomeerlang.com , April 9, 2014.
[54] Fred Hebert:“Postscript: Maps”,learnyousomeerlang.com,2014年4月9日。