第二章。数据模型和查询语言
Chapter 2. Data Models and Query Languages
The limits of my language mean the limits of my world.
我语言的界限即为我世界的界限。
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)
路德维希·维特根斯坦,《逻辑哲学论》(1922年)。
Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
数据模型可能是开发软件最重要的部分,因为它们具有如此深远的影响:不仅影响软件的编写方式,还影响我们思考解决问题的方式。
Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer? For example:
大多数应用程序都是通过在另一个数据模型之上进行分层构建的。对于每个层次,关键问题是:在下一层次中如何表示它?例如:
- As an application developer, you look at the real world (in which there are people, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data structures. Those structures are often specific to your application.
作为应用程序开发人员,您观察现实世界(其中包括人员、组织、商品、行动、货币流动、传感器等)并将其建模为对象或数据结构和操作这些数据结构的API。这些结构通常是特定于您的应用程序的。
- When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.
当您想要存储那些数据结构时,您将它们表示为通用数据模型,例如JSON或XML文档,关系型数据库中的表,或者图形模型。
- The engineers who built your database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated, and processed in various ways.
构建您的数据库软件的工程师们决定了用字节在内存、磁盘或网络上表示JSON/XML/关系型/图形数据的方式。这种表示方式可以让数据以不同的方式被查询、搜索、操作和处理。
- On yet lower levels, hardware engineers have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.
在更低的层面上,硬件工程师已经想出了如何利用电流、光脉冲、磁场等来表示字节。
In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people—for example, the engineers at the database vendor and the application developers using their database—to work together effectively.
在复杂的应用程序中,可能会存在更多的中间层,例如建立在API之上的API,但基本思想仍然相同:每个层次通过提供清晰的数据模型来隐藏其下面的层次的复杂性。这些抽象允许不同的人群 - 例如,数据库供应商的工程师和使用他们的数据库的应用程序开发人员 - 有效地合作。
There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward.
有许多不同种类的数据模型,每个数据模型都必然包含着使用的假设。有一些使用方式比较容易,有些则不被支持;有些操作运行速度较快,而有些则表现较差;有些数据转换方式感觉自然,而有些则不太合适。
It can take a lot of effort to master just one data model (think how many books there are on relational data modeling). Building software is hard enough, even when working with just one data model and without worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and can’t do, it’s important to choose one that is appropriate to the application.
掌握一个数据模型可能需要很多努力(想一想,关于关系数据建模有多少本书)。即使只使用一个数据模型且不必担心其内部运作,构建软件也很困难。但是由于数据模型在软件的所能与所不能影响深远,因此选择适合应用程序的模型非常重要。
In this chapter we will look at a range of general-purpose data models for data storage and querying (point 2 in the preceding list). In particular, we will compare the relational model, the document model, and a few graph-based data models. We will also look at various query languages and compare their use cases. In Chapter 3 we will discuss how storage engines work; that is, how these data models are actually implemented (point 3 in the list).
在这一章节中,我们将研究一系列通用数据模型,用于数据存储和查询(在前面列表中的第2点)。特别是,我们将比较关系模型、文档模型和几个基于图形的数据模型。我们还将研究各种查询语言,并比较它们的使用情况。在第3章中,我们将讨论存储引擎的工作原理,即这些数据模型如何实际实现(在列表中的第3点)。
Relational Model Versus Document Model
The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [ 1 ]: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
SQL目前最知名的数据模型可能是基于1970年Edgar Codd提出的关系模型的,数据被组织成关系(在SQL中称为表),每个关系都是元组(在SQL中为行)的无序集合。
The relational model was a theoretical proposal, and many people at the time doubted whether it could be implemented efficiently. However, by the mid-1980s, relational database management systems (RDBMSes) and SQL had become the tools of choice for most people who needed to store and query data with some kind of regular structure. The dominance of relational databases has lasted around 25‒30 years—an eternity in computing history.
关系模型最初只是一个理论提议,许多人当时怀疑它是否能够高效地实现。然而,在20世纪80年代中期,关系数据库管理系统(RDBMSes)和SQL已经成为大多数需要存储和查询具有某种规律结构数据的人所选择的工具。关系数据库的主导地位已经持续了大约25-30年,在计算历史上是一个永恒的时代。
The roots of relational databases lie in business data processing, which was performed on mainframe computers in the 1960s and ’70s. The use cases appear mundane from today’s perspective: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).
关系型数据库的根源可以追溯到20世纪60年代和70年代的大型机上进行的企业数据处理。从今天的角度来看,使用案例似乎很平凡:通常是交易处理(输入销售或银行交易、航空公司预订、仓库库存管理)和批处理(客户发票、工资单、报告)。
Other databases at that time forced application developers to think a lot about the internal representation of the data in the database. The goal of the relational model was to hide that implementation detail behind a cleaner interface.
其他数据库当时迫使应用程序开发人员在数据库中考虑数据的内部表示方式。关系模型的目标是将这种实现细节隐藏在更清晰的界面后面。
Over the years, there have been many competing approaches to data storage and querying. In the 1970s and early 1980s, the network model and the hierarchical model were the main alternatives, but the relational model came to dominate them. Object databases came and went again in the late 1980s and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each competitor to the relational model generated a lot of hype in its time, but it never lasted [ 2 ].
多年来,数据存储和查询一直存在很多竞争性的方法。在20世纪70年代和80年代早期,网络模型和分层模型是主要的选择,但是关系模型成为了主流。对象数据库在1980年代末和1990年代初出现,但又逐渐消失。 XML数据库在21世纪初出现,但只被少数人使用。每一种与关系模型竞争的方法在当时都引起了很大轰动,但它们从未持续下去。
As computers became vastly more powerful and networked, they started being used for increasingly diverse purposes. And remarkably, relational databases turned out to generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, ecommerce, games, software-as-a-service productivity applications, or much more.
随着计算机变得更加强大和联网,它们开始被用于越来越多样化的目的。值得注意的是,关系型数据库被证明具有很强的通用性,超越了其最初的业务数据处理范围,适用于各种各样的用例。今天你在网上看到的许多内容仍然由关系型数据库提供支持,无论是在线发布、讨论、社交网络、电子商务、游戏、以软件为服务的生产力应用,还是其他更多。
The Birth of NoSQL
Now, in the 2010s, NoSQL is the latest attempt to overthrow the relational model’s dominance. The name “NoSQL” is unfortunate, since it doesn’t actually refer to any particular technology—it was originally intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, nonrelational databases in 2009 [ 3 ]. Nevertheless, the term struck a nerve and quickly spread through the web startup community and beyond. A number of interesting database systems are now associated with the #NoSQL hashtag, and it has been retroactively reinterpreted as Not Only SQL [ 4 ].
现在,在2010年代, NoSQL是最新的尝试,试图推翻关系模型的统治地位。由于最初它只是一个流行的 Twitter 标签,用于描述开放源代码、分布式和非关系型数据库的会议,因此“NoSQL”这个名称有些不幸。然而,这个术语引起了人们的注意,并很快在网络创业社区和其他领域传播开来。现在,一些有趣的数据库系统与 #NoSQL 标签相关联,并被重新解释为“Not Only SQL”。
There are several driving forces behind the adoption of NoSQL databases, including:
NoSQL数据库被采用的原因有几个驱动力,包括:
- A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
需要更大规模的可扩展性,超过关系数据库可以轻松实现的范围,包括非常大的数据集或非常高的写入吞吐量。
- A widespread preference for free and open source software over commercial database products
使用免费和开放源码软件的普遍选择,胜过于商业数据库产品。
- Specialized query operations that are not well supported by the relational model
不受关系模型很好支持的专业查询操作。
- Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model [ 5 ]
对关系模式限制性的挫败感,以及对更动态和富有表现力的数据模型的渴望。
Different applications have different requirements, and the best choice of technology for one use case may well be different from the best choice for another use case. It therefore seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of nonrelational datastores—an idea that is sometimes called polyglot persistence [ 3 ].
不同的应用有不同的需求,而一个用例的最佳技术选择可能与另一个用例的最佳选择不同。因此,有可能在可预见的未来,关系型数据库将继续与各种非关系型数据存储一起使用,这个想法有时被称为多语言持久性。
The Object-Relational Mismatch
Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an impedance mismatch . i
如今,大多数应用程序开发都是使用面向对象的编程语言完成的,这导致了对 SQL 数据模型的普遍批评:如果数据存储在关系表中,则应用程序代码中的对象和基于表、行和列的数据库模型之间需要使用一种笨拙的翻译层来建立联系。这种模型之间的不连贯有时被称为阻抗不匹配。
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they can’t completely hide the differences between the two models.
对象关系映射(ORM)框架如ActiveRecord和Hibernate可减少翻译层所需的样板代码量,但它们无法完全隐藏两个模型之间的差异。
For example, Figure 2-1 illustrates how a résumé (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole can be identified by a unique identifier, user_id. Fields like first_name and last_name appear exactly once per user, so they can be modeled as columns on the users table. However, most people have had more than one job in their career (positions), and people may have varying numbers of periods of education and any number of pieces of contact information. There is a one-to-many relationship from the user to these items, which can be represented in various ways:
例如,图2-1说明了如何将简历(LinkedIn个人资料)表达为关系模式。整个个人资料可以通过唯一标识符user_id进行识别。像first_name和last_name这样的字段每个用户只出现一次,因此它们可以被建模为用户表中的列。然而,大多数人在他们的职业生涯中都有过多个工作(职位),人们可能会有不同数量的教育期间和任何数量的联系方式。从用户到这些项目的关系是一对多的,可以用不同的方法来表示。
- In the traditional SQL model (prior to SQL:1999), the most common normalized representation is to put positions, education, and contact information in separate tables, with a foreign key reference to the users table, as in Figure 2-1.
在传统的SQL模型(SQL:1999之前),最常见的标准化表示法是将职位,教育和联系信息放置在单独的表中,并参照用户表进行外键引用,如图2-1。
- Later versions of the SQL standard added support for structured datatypes and XML data; this allowed multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL [ 6 , 7 ]. A JSON datatype is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL [ 8 ].
SQL标准的更新版本支持结构化数据类型和XML数据;这使得可以将多值数据存储在单个行中,并支持在这些文档内进行查询和索引。这些功能在Oracle、IBM DB2、MS SQL Server和PostgreSQL [6、7]中的支持程度有所不同。JSON数据类型也被多个数据库支持,包括IBM DB2、MySQL和PostgreSQL [8]。
- A third option is to encode jobs, education, and contact info as a JSON or XML document, store it in a text column in the database, and let the application interpret its structure and content. In this setup, you typically cannot use the database to query for values inside that encoded column.
第三个选项是将工作、教育和联系信息编码为JSON或XML文档,将其存储在数据库的文本列中,并让应用程序解释其结构和内容。在这种设置中,您通常无法使用数据库查询该编码列内部的值。
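The first option in the list above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the column layout is an assumption loosely modeled on Example 2-1, not the actual schema of Figure 2-1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    -- Assumed layout: each position row points back at its user.
    CREATE TABLE positions (
        id           INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL REFERENCES users (user_id),
        job_title    TEXT,
        organization TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (251, 'Bill', 'Gates')")
conn.executemany(
    "INSERT INTO positions (user_id, job_title, organization) VALUES (?, ?, ?)",
    [
        (251, "Co-chair", "Bill & Melinda Gates Foundation"),
        (251, "Co-founder, Chairman", "Microsoft"),
    ],
)

# One user row, many position rows: the one-to-many relationship.
positions = conn.execute(
    "SELECT job_title, organization FROM positions WHERE user_id = ? ORDER BY id",
    (251,),
).fetchall()
```

Education and contact information would follow the same pattern, each in its own table keyed by user_id.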
For a data structure like a résumé, which is mostly a self-contained document, a JSON representation can be quite appropriate: see Example 2-1. JSON has the appeal of being much simpler than XML. Document-oriented databases like MongoDB [ 9 ], RethinkDB [ 10 ], CouchDB [ 11 ], and Espresso [ 12 ] support this data model.
对于像简历这样的数据结构,它主要是一个自包含的文档,JSON表示可以非常适合:参见例子2-1。JSON比XML简单得多,具有吸引力。文档导向型数据库,如MongoDB、RethinkDB、CouchDB和Espresso都支持这种数据模型。
Example 2-1. Representing a LinkedIn profile as a JSON document
{
  "user_id":     251,
  "first_name":  "Bill",
  "last_name":   "Gates",
  "summary":     "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id":   "us:91",
  "industry_id": 131,
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University",       "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog":    "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
Some developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer. However, as we shall see in Chapter 4, there are also problems with JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss this in “Schema flexibility in the document model”.
一些开发人员认为,JSON模型减少了应用程序代码和存储层之间的阻抗失配。然而,正如我们将在第4章中看到的那样,JSON作为数据编码格式也存在问题。缺乏模式通常被引用作为优点;我们将在“文档模型中的模式灵活性”中讨论这个问题。
The JSON representation has better locality than the multi-table schema in Figure 2-1. If you want to fetch a profile in the relational example, you need to either perform multiple queries (query each table by user_id) or perform a messy multi-way join between the users table and its subordinate tables. In the JSON representation, all the relevant information is in one place, and one query is sufficient.
JSON表示法比2-1图中的多表模式具有更好的局部性。如果您想在关系示例中获取配置文件,您需要执行多个查询(通过user_id查询每个表)或在用户表及其从属表之间执行混乱的多向连接。在JSON表示法中,所有相关信息都在一个地方,只需要一次查询即可。
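The difference in locality can be made concrete with a small sketch. It assumes a cut-down version of the schema (only positions and education) and uses a SQLite table holding one JSON string per user as a stand-in for a document store; the function names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users     (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE positions (user_id INTEGER, job_title TEXT, organization TEXT);
    CREATE TABLE education (user_id INTEGER, school_name TEXT);
    -- Stand-in for a document store: one JSON document per user.
    CREATE TABLE profiles  (user_id INTEGER PRIMARY KEY, doc TEXT);

    INSERT INTO users     VALUES (251, 'Bill', 'Gates');
    INSERT INTO positions VALUES (251, 'Co-chair', 'Bill & Melinda Gates Foundation');
    INSERT INTO education VALUES (251, 'Harvard University');
""")
document = {
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"}
    ],
    "education": [{"school_name": "Harvard University"}],
}
conn.execute("INSERT INTO profiles VALUES (?, ?)", (251, json.dumps(document)))


def fetch_relational(user_id):
    """Assemble a profile from three tables: one query per table."""
    first_name, last_name = conn.execute(
        "SELECT first_name, last_name FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    positions = conn.execute(
        "SELECT job_title, organization FROM positions WHERE user_id = ?", (user_id,)
    ).fetchall()
    education = conn.execute(
        "SELECT school_name FROM education WHERE user_id = ?", (user_id,)
    ).fetchall()
    return {
        "first_name": first_name,
        "last_name": last_name,
        "positions": [{"job_title": t, "organization": o} for t, o in positions],
        "education": [{"school_name": s} for (s,) in education],
    }


def fetch_document(user_id):
    """Fetch the same profile with a single lookup."""
    (doc,) = conn.execute(
        "SELECT doc FROM profiles WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(doc)
```

Both functions return the same dictionary; the relational version simply needs more round trips and more stitching code to get there.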
The one-to-many relationships from the user profile to the user’s positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit (see Figure 2-2).
用户档案与用户职位,教育历史和联系方式之间的一对多关系意味着数据中的树形结构,JSON表示明确了这种树形结构(请参见图2-2)。
Many-to-One and Many-to-Many Relationships
In Example 2-1 in the preceding section, region_id and industry_id are given as IDs, not as plain-text strings "Greater Seattle Area" and "Philanthropy". Why?
为什么在前面的章节中的示例2-1中,region_id和industry_id是作为ID给出的,而不是作为纯文本字符串"Greater Seattle Area"和"Philanthropy"呢?
If the user interface has free-text fields for entering the region and the industry, it makes sense to store them as plain-text strings. But there are advantages to having standardized lists of geographic regions and industries, and letting users choose from a drop-down list or autocompleter:
如果用户界面有用于输入地区和行业的自由文本字段,则将它们存储为纯文本字符串是有意义的。但是,具有标准化的地理区域和行业列表,并让用户从下拉列表或自动完成器中选择是有优势的:
- Consistent style and spelling across profiles
个人资料中风格和拼写要保持一致。
- Avoiding ambiguity (e.g., if there are several cities with the same name)
避免歧义(例如,如果有几个同名城市)
- Ease of updating: the name is stored in only one place, so it is easy to update across the board if it ever needs to be changed (e.g., change of a city name due to political events)
易于更新——名称只存储在一个地方,如果需要更改(例如,由于政治事件更改城市名称),则可以轻松在整个平台上进行更新。
- Localization support: when the site is translated into other languages, the standardized lists can be localized, so the region and industry can be displayed in the viewer’s language
本地化支持-当网站翻译为其他语言时,标准化列表可以本地化,以便在观看者的语言中显示地区和行业。
- Better search: for example, a search for philanthropists in the state of Washington can match this profile, because the list of regions can encode the fact that Seattle is in Washington (which is not apparent from the string "Greater Seattle Area")
更好的搜索-例如,在华盛顿州搜寻慈善家可以匹配此档案,因为地区列表可以编码西雅图是华盛顿州的事实(这一点在“大西雅图地区”这个字符串中不明显)。
Whether you store an ID or a text string is a question of duplication. When you use an ID, the information that is meaningful to humans (such as the word Philanthropy) is stored in only one place, and everything that refers to it uses an ID (which only has meaning within the database). When you store the text directly, you are duplicating the human-meaningful information in every record that uses it.
无论你存储一个ID还是文本字符串,都是一个重复的问题。当你使用ID时,对于人类有意义的信息(例如“慈善”这个词)只存储在一个位置,而所有引用它的内容都使用ID(它只有在数据库中有意义)。当你直接存储文本时,你将在每个使用它的记录中复制人类有意义的信息。
The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future—and if that information is duplicated, all the redundant copies need to be updated. That incurs write overheads, and risks inconsistencies (where some copies of the information are updated but others aren’t). Removing such duplication is the key idea behind normalization in databases. ii
使用ID的优点在于,由于它对人类没有意义,所以它永远不需要更改:即使所识别的信息发生更改,ID仍然可以保持不变。对人类有意义的任何内容都可能在将来需要更改 - 如果该信息被复制,所有冗余副本都需要更新。这会增加写入开销,并存在不一致性的风险(其中一些信息的副本得到更新,但其他副本没有)。消除这种重复是数据库规范化的关键思想。
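The trade-off between duplicating text and storing an ID can be sketched in a few lines. User 252 and the new region name are hypothetical, introduced only to show what a rename costs in each layout.

```python
# Denormalized: the human-readable region name is copied into every profile.
profiles_denormalized = [
    {"user_id": 251, "region": "Greater Seattle Area"},
    {"user_id": 252, "region": "Greater Seattle Area"},  # hypothetical second user
]

# Normalized: profiles hold only an ID; the name lives in exactly one place.
regions = {"us:91": "Greater Seattle Area"}
profiles_normalized = [
    {"user_id": 251, "region_id": "us:91"},
    {"user_id": 252, "region_id": "us:91"},
]


def rename_region_denormalized(profiles, old_name, new_name):
    """Rewrite every copy of the name; return how many records were written."""
    writes = 0
    for profile in profiles:
        if profile["region"] == old_name:
            profile["region"] = new_name
            writes += 1
    return writes


def rename_region_normalized(region_table, region_id, new_name):
    """Change the single authoritative copy; the ID itself never changes."""
    region_table[region_id] = new_name
    return 1


# "Seattle Metro Area" is a made-up new name for illustration.
denormalized_writes = rename_region_denormalized(
    profiles_denormalized, "Greater Seattle Area", "Seattle Metro Area"
)
normalized_writes = rename_region_normalized(regions, "us:91", "Seattle Metro Area")
```

In the denormalized layout the number of writes grows with the number of profiles, and any copy missed by the update becomes an inconsistency.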
Note
Database administrators and developers love to argue about normalization and denormalization, but we will suspend judgment for now. In Part III of this book we will return to this topic and explore systematic ways of dealing with caching, denormalization, and derived data.
数据库管理员和开发人员喜欢就标准化和反标准化进行争论,但我们暂不考虑这个问题。本书的第三部分将回到这个主题,并探讨处理缓存、反标准化和派生数据的系统方法。
Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don’t fit nicely into the document model. In relational databases, it’s normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak. iii
不幸的是,将这些数据进行规范化需要多对一的关系(许多人居住在一个特定地区,许多人在一个特定行业工作),这些关系不适合文档模型。在关系数据库中,通常通过ID引用其他表中的行,因为连接很容易。在文档数据库中,连接不需要用于一对多树结构,并且对连接的支持通常较弱。
If the database itself does not support joins, you have to emulate a join in application code by making multiple queries to the database. (In this case, the lists of regions and industries are probably small and slow-changing enough that the application can simply keep them in memory. But nevertheless, the work of making the join is shifted from the database to the application code.)
如果数据库本身不支持连接,你需要在应用程序代码中通过对数据库进行多次查询来模拟连接。(在这种情况下,地区和行业列表可能很小且变化缓慢,应用程序可以将它们保留在内存中处理。但是,将连接的工作从数据库转移到应用程序代码。)
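A join emulated in application code might look like the following sketch. The in-memory dictionaries stand in for a document store and for the small reference lists; load_profile is a hypothetical helper, not an API of any particular database.

```python
# Small, slow-changing reference lists that the application keeps in memory.
regions = {"us:91": "Greater Seattle Area"}
industries = {131: "Philanthropy"}

# Stand-in for a document store keyed by user_id.
profiles = {
    251: {
        "first_name": "Bill",
        "last_name": "Gates",
        "region_id": "us:91",
        "industry_id": 131,
    }
}


def load_profile(user_id):
    """Fetch the document, then resolve its references by hand."""
    doc = dict(profiles[user_id])                     # the one database query
    doc["region"] = regions[doc["region_id"]]         # "join" done in the application
    doc["industry"] = industries[doc["industry_id"]]
    return doc


profile = load_profile(251)
```

The lookups are cheap here, but the logic that a database join would have provided now lives in, and must be maintained by, the application.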
Moreover, even if the initial version of an application fits well in a join-free document model, data tends to become more interconnected as features are added to applications. For example, consider some changes we could make to the résumé example:
此外,即使应用程序的初始版本适合无连接文档模型,随着功能的增加,数据往往会变得更加相互关联。例如,考虑以下我们可以对在线简历进行的更改:
- Organizations and schools as entities
  In the previous description, organization (the company where the user worked) and school_name (where they studied) are just strings. Perhaps they should be references to entities instead? Then each organization, school, or university could have its own web page (with logo, news feed, etc.); each résumé could link to the organizations and schools that it mentions, and include their logos and other information (see Figure 2-3 for an example from LinkedIn).
在先前的描述中,组织(用户所在的公司)和学校名称仅仅是字符串。也许它们应该是实体的引用?这样每个组织、学校或大学都可以有自己的网页(包含标志、新闻源等)。每份简历都可以链接到所提及的组织和学校,并包含它们的标志和其他信息(参见LinkedIn上的示例,如图2-3)。
- Recommendations
  Say you want to add a new feature: one user can write a recommendation for another user. The recommendation is shown on the résumé of the user who was recommended, together with the name and photo of the user making the recommendation. If the recommender updates their photo, any recommendations they have written need to reflect the new photo. Therefore, the recommendation should have a reference to the author’s profile.
假设你想添加一个新的功能:一位用户可以给另一位用户写一份推荐。这份推荐会显示在被推荐用户的个人简历上,同时还会附带推荐人的姓名和照片。如果推荐人更新了自己的照片,那么他所写的所有推荐都应该显示最新的照片。因此,推荐应该引用作者的个人资料。
Figure 2-4 illustrates how these new features require many-to-many relationships. The data within each dotted rectangle can be grouped into one document, but the references to organizations, schools, and other users need to be represented as references, and require joins when queried.
图2-4说明了这些新功能需要多对多的关系。每个虚线矩形中的数据可以分组成一个文档,但对组织、学校和其他用户的引用必须表示为引用,并在查询时需要连接。
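The recommendations feature shows why such references matter. In this sketch, user 252, the field names author_id and recommendations, and the photo paths for user 252 are all hypothetical; the point is only that the author's photo is looked up at read time rather than copied.

```python
# Stand-in for a document store; each value is one user's document.
users = {
    251: {
        "name": "Bill Gates",
        "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
        # Each recommendation stores a reference to its author, not a copy.
        "recommendations": [{"author_id": 252, "text": "Great to work with."}],
    },
    252: {  # hypothetical recommender
        "name": "Example Recommender",
        "photo_url": "/p/old-photo.jpg",
        "recommendations": [],
    },
}


def render_recommendations(user_id):
    """Resolve each author reference at read time (a join done by hand)."""
    rendered = []
    for rec in users[user_id]["recommendations"]:
        author = users[rec["author_id"]]  # follow-up lookup per reference
        rendered.append(
            {
                "text": rec["text"],
                "author_name": author["name"],
                "author_photo": author["photo_url"],
            }
        )
    return rendered


before = render_recommendations(251)
users[252]["photo_url"] = "/p/new-photo.jpg"  # the recommender updates their photo
after = render_recommendations(251)
```

Because the recommendation holds only author_id, the photo change needs a single write and is visible on every résumé that shows the recommendation.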
Are Document Databases Repeating History?
While many-to-many relationships and joins are routinely used in relational databases, document databases and NoSQL reopened the debate on how best to represent such relationships in a database. This debate is much older than NoSQL—in fact, it goes back to the very earliest computerized database systems.
尽管在关系型数据库中经常使用多对多关系和连接,但文档数据库和NoSQL重新打开了在数据库中表示这种关系的最佳方式的辩论。这个辩论比NoSQL要古老得多,实际上可以追溯到最早的计算机化数据库系统。
The most popular database for business data processing in the 1970s was IBM’s Information Management System (IMS), originally developed for stock-keeping in the Apollo space program and first commercially released in 1968 [ 13 ]. It is still in use and maintained today, running on OS/390 on IBM mainframes [ 14 ].
20世纪70年代,商业数据处理的最流行数据库是IBM的信息管理系统(IMS),最初为阿波罗航天计划的存货管理开发,于1968年首次商业发布[13]。它仍在使用和维护,运行在IBM主机的OS/390上[14]。
The design of IMS used a fairly simple data model called the hierarchical model, which has some remarkable similarities to the JSON model used by document databases [ 2 ]. It represented all data as a tree of records nested within records, much like the JSON structure of Figure 2-2.
IMS的设计使用了一个相当简单的数据模型,称为分层模型,这与文档数据库使用的JSON模型有一些相似之处。它将所有数据表示为嵌套在记录中的记录树,就像图2-2中的JSON结构一样。
Like document databases, IMS worked well for one-to-many relationships, but it made many-to-many relationships difficult, and it didn’t support joins. Developers had to decide whether to duplicate (denormalize) data or to manually resolve references from one record to another. These problems of the 1960s and ’70s were very much like the problems that developers are running into with document databases today [ 15 ].
像文档数据库一样,IMS适用于一对多关系,但却使得多对多关系更加困难,并且不支持连接查询。开发者们不得不决定是复制(去规范化)数据还是手动解决从一个记录到另一个记录的引用。这些上世纪60年代和70年代的问题与今天开发者们在文档数据库中遇到的问题非常相似[15]。
Various solutions were proposed to solve the limitations of the hierarchical model. The two most prominent were the relational model (which became SQL, and took over the world) and the network model (which initially had a large following but eventually faded into obscurity). The “great debate” between these two camps lasted for much of the 1970s [ 2 ].
有许多解决分层模型限制的解决方案被提出。其中最著名的两个是关系模型(成为SQL并统治了世界)和网络模型(最初有很大的追随者,但最终逐渐被遗忘)。这两个阵营之间的“大辩论”持续了很长一段时间,一直到1970年代晚期。
Since the problem that the two models were solving is still so relevant today, it’s worth briefly revisiting this debate in today’s light.
由于两个模型所解决的问题仍然如此相关,因此在今天的光环下简要回顾这场争论是值得的。
The network model
The network model was standardized by a committee called the Conference on Data Systems Languages (CODASYL) and implemented by several different database vendors; it is also known as the CODASYL model [ 16 ].
网络模型是由一个称为数据系统语言会议(CODASYL)的委员会进行标准化,并由多个不同的数据库供应商实现;它也被称为CODASYL模型[16]。
The CODASYL model was a generalization of the hierarchical model. In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record could have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.
CODASYL模型是层次模型的一般化。层次模型的树状结构中,每个记录都仅有一个父节点;而在网络模型中,一个记录可以有多个父节点。例如,一个代表“大西雅图地区”的记录可以与生活在该地区的每个用户建立关联。这种方式使得可以建立一对多和多对多的关系。
The links between records in the network model were not foreign keys, but more like pointers in a programming language (while still being stored on disk). The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.
网络模型中记录之间的链接不是外键,而更像编程语言中的指针(同时仍存储在磁盘上)。访问记录的唯一方法是沿着这些链接链从根记录开始遵循路径。这被称为访问路径。
In the simplest case, an access path could be like the traversal of a linked list: start at the head of the list, and look at one record at a time until you find the one you want. But in a world of many-to-many relationships, several different paths can lead to the same record, and a programmer working with the network model had to keep track of these different access paths in their head.
在最简单的情况下,一个访问路径就像遍历链表一样:从列表头开始,逐个查看记录,直到找到所需的记录。但在多对多的关系世界中,有几条不同的路径可以导致相同的记录,一个使用网络模型的程序员必须在脑中跟踪这些不同的访问路径。
A query in CODASYL was performed by moving a cursor through the database by iterating over lists of records and following access paths. If a record had multiple parents (i.e., multiple incoming pointers from other records), the application code had to keep track of all the various relationships. Even CODASYL committee members admitted that this was like navigating around an n-dimensional data space [ 17 ].
在CODASYL中进行查询是通过迭代记录列表并遵循访问路径来移动游标来完成的。如果记录具有多个父项(即来自其他记录的多个入站指针),应用程序代码必须跟踪各种关系。即使CODASYL委员会成员也承认,这就像在n维数据空间中导航。
Although manual access path selection was able to make the most efficient use of the very limited hardware capabilities in the 1970s (such as tape drives, whose seeks are extremely slow), the problem was that it made the code for querying and updating the database complicated and inflexible. With both the hierarchical and the network model, if you didn’t have a path to the data you wanted, you were in a difficult situation. You could change the access paths, but then you had to go through a lot of handwritten database query code and rewrite it to handle the new access paths. It was difficult to make changes to an application’s data model.
虽然手动访问路径选择能够在20世纪70年代最有效地利用非常有限的硬件能力(例如磁带驱动器,它们的搜索非常缓慢),但问题在于它们使查询和更新数据库的代码变得复杂和僵硬。对于分层和网络模型,如果你没有到达所需数据的路径,那么你就会陷入困境。你可以更改访问路径,但是你必须经过大量手写的数据库查询代码并且重写它以处理新访问路径。很难改变应用程序的数据模型。
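To give a feel for what hand-written access paths involve, here is a loose sketch in Python. It is not CODASYL's actual data manipulation language; the Record class, its next and members links, and the second region are assumptions made purely for illustration.

```python
class Record:
    """A record with pointer-like links to other records."""

    def __init__(self, data):
        self.data = data
        self.next = None     # next record in the same list
        self.members = None  # head of the list of records linked under this one


def follow_access_path(root, region_name, last_name):
    """Navigate: list of regions -> chosen region -> that region's list of users."""
    region = root
    while region is not None and region.data["name"] != region_name:
        region = region.next  # walk the region list one record at a time
    if region is None:
        return None
    user = region.members
    while user is not None and user.data["last_name"] != last_name:
        user = user.next      # walk the user list one record at a time
    return user


# Build a tiny database: two regions, one user linked under the second.
other_region = Record({"name": "Hypothetical Region"})
seattle = Record({"name": "Greater Seattle Area"})
other_region.next = seattle
gates = Record({"last_name": "Gates"})
seattle.members = gates

found = follow_access_path(other_region, "Greater Seattle Area", "Gates")
```

The traversal order is baked into follow_access_path, so finding users by any other route (say, by industry) would require new links and a new hand-written function.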
The relational model
What the relational model did, by contrast, was to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that’s it. There are no labyrinthine nested structures, no complicated access paths to follow if you want to look at the data. You can read any or all of the rows in a table, selecting those that match an arbitrary condition. You can read a particular row by designating some columns as a key and matching on those. You can insert a new row into any table without worrying about foreign key relationships to and from other tables. iv
相比之下,关系模型所做的是将所有数据公开展示出来:一个关系(表)只是元组(行)的集合,仅此而已。没有复杂嵌套的结构,也没有需要跟随的复杂访问路径,您可以查看任何或所有表中的行,并选择与任意条件匹配的行。只需指定一些列作为键并在这些列上进行匹配,即可读取特定行。您可以向任何表中插入新行,而不必担心与其他表之间的外键关系。
In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. Those choices are effectively the “access path,” but the big difference is that they are made automatically by the query optimizer, not by the application developer, so we rarely need to think about them.
在关系型数据库中,查询优化器会自动决定查询的哪些部分以什么顺序执行,以及使用哪些索引。这些选择实际上是“访问路径”,但是这些选择是由查询优化器自动完成的,而不是由应用程序开发人员完成的,因此我们很少需要考虑它们。
If you want to query your data in new ways, you can just declare a new index, and queries will automatically use whichever indexes are most appropriate. You don’t need to change your queries to take advantage of a new index. (See also “Query Languages for Data”.) The relational model thus made it much easier to add new features to applications.
如果你想以新的方式查询数据,你只需要声明一个新的索引,查询将自动使用最合适的索引。你不需要改变你的查询来利用一个新的索引。(另请参见“数据查询语言”.) 因此,关系模型使应用程序添加新功能变得更加容易。
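This behavior can be observed directly in SQLite, used here only as a convenient example of a relational engine. The table layout and the index name idx_positions_org are assumptions; the query text is identical before and after the index is declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positions (user_id INTEGER, job_title TEXT, organization TEXT)"
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?)",
    [
        (251, "Co-chair", "Bill & Melinda Gates Foundation"),
        (251, "Co-founder, Chairman", "Microsoft"),
    ],
)

QUERY = "SELECT user_id, job_title FROM positions WHERE organization = ?"


def plan(sql, params):
    """Return the optimizer's chosen plan as text (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, ("Microsoft",))
rows_before = conn.execute(QUERY, ("Microsoft",)).fetchall()

# Declare a new index; the query itself is not touched.
conn.execute("CREATE INDEX idx_positions_org ON positions (organization)")

plan_after = plan(QUERY, ("Microsoft",))
rows_after = conn.execute(QUERY, ("Microsoft",)).fetchall()
```

Printing plan_before and plan_after shows the optimizer switching from a full table scan to a search that names the new index, while the results stay the same.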
Query optimizers for relational databases are complicated beasts, and they have consumed many years of research and development effort [ 18 ]. But a key insight of the relational model was this: you only need to build a query optimizer once, and then all applications that use the database can benefit from it. If you don’t have a query optimizer, it’s easier to handcode the access paths for a particular query than to write a general-purpose optimizer—but the general-purpose solution wins in the long run.
关系数据库的查询优化器是非常复杂的东西,需要花费多年的研究和开发工作[18]。但关系模型的一个关键认识是:你只需要建立一个查询优化器,然后所有使用数据库的应用程序都可以从中受益。如果你没有查询优化器,那么手工编码一个特定查询的访问路径比编写通用优化器更容易,但长远来看,通用方案胜出。
Comparison to document databases
Document databases reverted to the hierarchical model in one aspect: storing nested records (one-to-many relationships, like positions, education, and contact_info in Figure 2-1) within their parent record rather than in a separate table.
文档数据库在某个方面回归了分层模型:将嵌套记录(例如图2-1中的职位、教育和联系信息等一对多关系)存储在其父记录中,而不是在单独的表中。
However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model [ 9 ]. That identifier is resolved at read time by using a join or follow-up queries. To date, document databases have not followed the path of CODASYL.
然而,当涉及到表示多对一和多对多关系时,关系型和文档数据库并没有根本的不同:在两种情况下,相关项目都有一个唯一标识符,它在关系模型中被称为外键,在文档模型中被称为文档引用[9]。这个标识符通过使用联接或后续查询在读取时解析。迄今为止,文档数据库没有遵循CODASYL的路径。
Relational Versus Document Databases Today
There are many differences to consider when comparing relational databases to document databases, including their fault-tolerance properties (see Chapter 5) and handling of concurrency (see Chapter 7). In this chapter, we will concentrate only on the differences in the data model.
当比较关系型数据库和文档数据库时,需要考虑许多差异,包括它们的容错性质(见第五章)和并发处理(见第七章)。在本章中,我们将仅集中讨论数据模型的差异。
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
文档数据模型的主要优点是模式灵活性、由于局部性能更好,某些应用程序更接近于数据结构。关系模型则通过提供更好的联接支持以及一对多和多对多关系来反驳这些优点。
Which data model leads to simpler application code?
If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model. The relational technique of shredding (splitting a document-like structure into multiple tables such as positions, education, and contact_info in Figure 2-1) can lead to cumbersome schemas and unnecessarily complicated application code.
如果您的应用程序数据具有类似文档的结构(即一个一对多关系树,通常整个树会一次性加载),那么使用文档模型可能是一个好主意。将文档式结构分解为多个表(例如Figure 2-1中的职位、教育以及联系方式)的关系技术可能导致笨重的模式和不必要复杂的应用程序代码。
The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like “the second item in the list of positions for user 251” (much like an access path in the hierarchical model). However, as long as documents are not too deeply nested, that is not usually a problem.
文档模型存在局限性:例如,您无法直接引用文档中的嵌套项,而是需要类似于分层模型中的访问路径,比如说“用户251的职位列表中的第二个项目”。然而,只要文档嵌套不太深,通常不会有问题。
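Addressing a nested item by its position in the document can be sketched as below. The get_path helper is hypothetical; it simply walks a list of keys and list indexes from the root of the document.

```python
profile = {
    "user_id": 251,
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
}


def get_path(doc, path):
    """Follow a sequence of keys and list indexes down from the document root."""
    for step in path:
        doc = doc[step]
    return doc


# "The second item in the list of positions for user 251" (index 1 is the second item).
second_position = get_path(profile, ["positions", 1])
```

The nested item has no identifier of its own, so the reference breaks if an earlier position is inserted or removed and the indexes shift.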
The poor support for joins in document databases may or may not be a problem, depending on the application. For example, many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred at which time [ 19 ].
文档数据库中连接的支持较差可能是一个问题,也可能不是,这取决于应用程序。例如,在使用文档数据库记录哪些事件发生在何时的分析应用程序中,可能永远不需要多对多关系。[19]。
However, if your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent. Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database. In such cases, using a document model can lead to significantly more complex application code and worse performance [ 15 ].
然而,如果你的应用程序使用多对多关系,文档模型就会变得不那么吸引人了。可以通过去标准化来减少使用连接的需求,但是应用程序代码需要做额外的工作来保持非标准化数据的一致性。连接可以通过应用程序代码模拟,通过多次请求数据库来实现,但这也将复杂性转移到应用程序中,并且通常比数据库内部的专门代码执行的连接还要慢。在这种情况下,使用文档模型可能会导致应用程序代码变得更加复杂,并且性能更差 [15]。
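As a rough illustration of emulating a join in application code, the sketch below resolves a user's organization references with one simulated request per lookup. The collections, field names, and the functions fetchUser and fetchOrganization are assumptions made for this example; a real application would call its database driver instead of reading from in-memory maps.

```javascript
// Hypothetical in-memory "collections"; a real application would issue
// one database request per lookup instead of reading from a Map.
const usersById = new Map([
  [1, { id: 1, name: "Alice", organizationIds: [10, 20] }],
  [2, { id: 2, name: "Bob", organizationIds: [20] }]
]);
const organizationsById = new Map([
  [10, { id: 10, name: "Acme" }],
  [20, { id: 20, name: "Globex" }]
]);

let requestCount = 0; // counts simulated round-trips to the database

function fetchUser(id) {
  requestCount++;
  return usersById.get(id);
}

function fetchOrganization(id) {
  requestCount++;
  return organizationsById.get(id);
}

// Emulated join: one request for the user, then one per referenced organization.
function userWithOrganizations(userId) {
  const user = fetchUser(userId);
  const organizations = user.organizationIds.map((id) => fetchOrganization(id));
  return { name: user.name, organizations: organizations.map((o) => o.name) };
}

const result = userWithOrganizations(1);
console.log(result, "requests:", requestCount);
```

The number of requests grows with the number of references, which is one reason this approach tends to be slower than a join executed inside the database.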
It’s not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models (see “Graph-Like Data Models” ) are the most natural.
通常无法一般化地说哪种数据模型可以导致更简单的应用代码;这取决于数据项之间存在何种关系。对于高度相互关联的数据而言,文档模型不太方便,关系模型可接受,而图形模型(参见“类似图形的数据模型”)则最为自然。
Schema flexibility in the document model
Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.
大多数文档数据库以及关系型数据库中的JSON支持,在文档数据中不强制执行任何模式。关系型数据库中的XML支持通常带有可选模式验证。没有模式意味着文档中可以添加任意键和值,而在阅读时,客户端无法保证文档可能包含哪些字段。
Document databases are sometimes called schemaless , but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database [ 20 ]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it) [ 21 ].
文档数据库有时被称为无模式,但这是具有误导性的,因为读取数据的代码通常会假设某种结构 - 即存在隐式模式,但是它不由数据库强制执行[20]。更准确的术语是读时模式(数据结构是隐式的,只有在读取数据时才解释),与写时模式相对(关系数据库的传统方法,其中模式是明确的,数据库确保所有写入的数据符合它) [21]。
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits [ 22 ], enforcement of schemas in databases is a contentious topic, and in general there’s no right or wrong answer.
读取模式的模式类似于程序语言中的动态(运行时)类型检查,而写入模式则类似于静态(编译时)类型检查。正如静态和动态类型检查的拥护者对它们的相对优点进行大辩论[22]一样,数据库模式的执行是一个有争议的话题,通常没有正确或错误的答案。
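The contrast between the two approaches can be sketched in a few lines of JavaScript. This is only an illustration under simplifying assumptions: the "schema" is a single hypothetical rule (a record must have a string name), and the stores are plain arrays rather than a database.

```javascript
// Hypothetical schema: every user record must have a string "name".
function conformsToSchema(record) {
  return typeof record === "object" && record !== null && typeof record.name === "string";
}

// Schema-on-write: reject nonconforming data before it is stored.
function writeWithSchema(store, record) {
  if (!conformsToSchema(record)) {
    throw new Error("record does not conform to schema");
  }
  store.push(record);
}

// Schema-on-read: store anything, interpret the structure only when reading.
function writeWithoutSchema(store, record) {
  store.push(record);
}

function readName(record) {
  // The implicit schema lives here, in the code that reads the data.
  return conformsToSchema(record) ? record.name : "(unknown)";
}

const strictStore = [];
const looseStore = [];

writeWithoutSchema(looseStore, { name: "Lucy" });
writeWithoutSchema(looseStore, { full_name: "Alain" }); // accepted as-is

let rejected = false;
try {
  writeWithSchema(strictStore, { full_name: "Alain" });
} catch (e) {
  rejected = true; // the write fails up front
}

console.log(looseStore.map(readName), rejected);
```

With schema-on-write the problem surfaces at write time; with schema-on-read the same record is stored and the reading code has to cope with it.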
The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. For example, say you are currently storing each user’s full name in one field, and you instead want to store the first name and last name separately [ 23 ]. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example:
不同方法的差异在应用程序想要更改其数据格式的情况下尤其明显。例如,假设您当前在一个字段中存储每个用户的全名,而您希望将名字和姓氏分别存储[23]。 在文档数据库中,您只需开始编写具有新字段的新文档,并在应用程序中编写处理旧文档读取的代码。例如:
if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}
On the other hand, in a “statically typed” database schema, you would typically perform a migration along the lines of:
另一方面,在“静态类型”数据库模式中,您通常会执行以下类型的迁移:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception: it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table, although various tools exist to work around this limitation [ 24, 25, 26 ].
模式更改有一个不好的名声,被认为是慢且需要停机时间。这种声誉并不完全有理:大多数关系型数据库系统可以在几毫秒内执行 ALTER TABLE 语句。 MySQL 是一个值得注意的例外 - ALTER TABLE时它会复制整个表,这意味着在更改大型表时可能需要几分钟甚至几个小时的停机时间 - 尽管存在各种工具来解决这个限制。 [24,25,26]。
Running the UPDATE statement on a large table is likely to be slow on any database, since every row needs to be rewritten. If that is not acceptable, the application can leave first_name set to its default of NULL and fill it in at read time, like it would with a document database.
在任何数据库上运行大型表的UPDATE语句都很可能很慢,因为每行都需要重写。如果这是不可接受的,应用程序可以将first_name保留为NULL的默认值,并在读取时填充,就像在文档数据库中一样。
The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason (i.e., the data is heterogeneous)—for example, because:
如果集合中的项目由于某种原因并不都具有相同的结构(即数据是异构的),读时模式的方法就很有优势,例如,因为:
-
There are many different types of objects, and it is not practical to put each type of object in its own table.
有许多不同类型的对象,把每种类型的对象都放在自己的表格中是不现实的。
-
The structure of the data is determined by external systems over which you have no control and which may change at any time.
数据的结构由你无法控制并且随时可能发生改变的外部系统决定。
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure. We will discuss schemas and schema evolution in more detail in Chapter 4 .
在这种情况下,模式可能会带来更多的麻烦,而无模式文档可以成为一种更自然的数据模型。但是,如果预计所有记录都具有相同的结构,则模式是记录和强制执行结构的有用机制。我们将在第4章中更详细地讨论模式和模式演化。
Data locality for queries
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON). If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality . If data is split across multiple tables, like in Figure 2-1 , multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.
一个文档通常被存储为一个连续的字符串,编码为JSON、XML或其二进制变体(如MongoDB的BSON)。如果你的应用程序经常需要访问整个文档(例如,在网页上呈现它),那么这种存储本地性具有性能优势。如果数据分布在多个表中,如图2-1所示,需要多次索引查找才能检索所有数据,这可能需要更多的磁盘寻道并花费更长时间。
The locality advantage only applies if you need large parts of the document at the same time. The database typically needs to load the entire document, even if you access only a small portion of it, which can be wasteful on large documents. On updates to a document, the entire document usually needs to be rewritten—only modifications that don’t change the encoded size of a document can easily be performed in place [ 19 ]. For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document [ 9 ]. These performance limitations significantly reduce the set of situations in which document databases are useful.
在同一时间需要大部分文档时,地理位置优势才适用。数据库通常需要加载整个文档,即使你只访问其中的一小部分,这在大文档上可能是浪费的。在对文档进行更新时,通常需要重写整个文档 - 只有不改变文档编码大小的修改才能轻松地在原地执行[19]。因此,通常建议将文档保持较小,避免增加文档大小的写入[9]。这些性能限制显着减少了文档数据库有用的情况集。
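The point about encoded size can be illustrated with a deliberately simplified sketch. It assumes documents are stored as UTF-8 JSON text and that an update fits in place only when the encoded length is unchanged; real storage engines use different encodings and allocation strategies, so the functions encodedSize and canUpdateInPlace are hypothetical.

```javascript
// Size in bytes of a document when encoded as UTF-8 JSON text.
function encodedSize(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Simplifying assumption: an update can be done in place only if the
// encoded size does not change.
function canUpdateInPlace(oldDoc, newDoc) {
  return encodedSize(oldDoc) === encodedSize(newDoc);
}

const original = { name: "Lucy", city: "London" };
const sameSize = { name: "Lucy", city: "Berlin" };   // same encoded length as the original
const larger = { name: "Lucy", city: "Edinburgh" };  // longer value, larger document

console.log(canUpdateInPlace(original, sameSize));
console.log(canUpdateInPlace(original, larger));
```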
It’s worth pointing out that the idea of grouping related data together for locality is not limited to the document model. For example, Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table [ 27 ]. Oracle allows the same, using a feature called multi-table index cluster tables [ 28 ]. The column-family concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality [ 29 ].
值得指出的是,将相关数据分组以实现本地化的理念并不仅限于文档模型。例如,谷歌的Spanner数据库在关系数据模型中也提供了相同的本地化特性,通过允许架构声明表的行应嵌套在父表中进行交错排列[27]。Oracle也可以使用称为多表索引群集表的功能实现相同的功能[28]。Bigtable数据模型(用于Cassandra和HBase)中的列族概念具有管理本地性的类似目的[29]。
We will also see more on locality in Chapter 3 .
我们还将在第3章中更多地了解局部性。
Convergence of document and relational databases
Most relational database systems (other than MySQL) have supported XML since the mid-2000s. This includes functions to make local modifications to XML documents and the ability to index and query inside XML documents, which allows applications to use data models very similar to what they would do when using a document database.
大多数关系型数据库系统(除了MySQL)自2000年代中期以来就支持XML。这包括对XML文档进行本地修改的功能以及索引和查询XML文档的能力,这使得应用程序可以使用非常类似于使用文档数据库时的数据模型。
PostgreSQL since version 9.3 [ 8 ], MySQL since version 5.7, and IBM DB2 since version 10.5 [ 30 ] also have a similar level of support for JSON documents. Given the popularity of JSON for web APIs, it is likely that other relational databases will follow in their footsteps and add JSON support.
自从PostgreSQL 9.3版本[8],MySQL 5.7版本和IBM DB2 10.5版本 [30],也开始支持JSON文档的类似级别的支持。鉴于JSON在Web API中的流行,其他关系型数据库很可能也会效仿并增加JSON支持。
On the document database side, RethinkDB supports relational-like joins in its query language, and some MongoDB drivers automatically resolve database references (effectively performing a client-side join, although this is likely to be slower than a join performed in the database since it requires additional network round-trips and is less optimized).
在文档数据库方面,RethinkDB的查询语言支持类似关系型数据库的联接,而一些MongoDB驱动程序会自动解析数据库引用(实际上是执行客户端联接,但这可能比在数据库中执行联接要慢,因为它需要额外的网络往返并且不太优化)。
It seems that relational and document databases are becoming more similar over time, and that is a good thing: the data models complement each other. If a database is able to handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.
看起來關聯型和文件型數據庫隨著時間越來越相似,這是好事情:數據模型相互補充。如果一個數據庫能夠處理類似文件的數據並且還能對其進行關聯查詢,應用程序可以使用最適合其需求的功能組合。
A hybrid of the relational and document models is a good route for databases to take in the future.
关系模型和文档模型的混合是数据库未来发展的良好选择。
Query Languages for Data
When the relational model was introduced, it included a new way of querying data: SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code. What does that mean?
当关系模型被引入时,它包括了一种新的查询数据的方式:SQL是一种声明式查询语言,而IMS和CODASYL使用命令式代码来查询数据库。这是什么意思?
Many commonly used programming languages are imperative. For example, if you have a list of animal species, you might write something like this to return only the sharks in the list:
许多常用编程语言是命令式的。例如,如果你有一个动物种类的列表,你可能会写出这样的代码仅返回列表中的鲨鱼:
function getSharks() {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}
In the relational algebra, you would instead write:
在关系代数中,您应该写:
sharks = σ_{family = “Sharks”} (animals)
where σ (the Greek letter sigma) is the selection operator, returning only those animals that match the condition family = “Sharks” .
其中σ(希腊字母西格马)是选择运算符,仅返回符合条件“家族=“鲨鱼”的动物”。
When SQL was defined, it followed the structure of the relational algebra fairly closely:
当SQL被定义时,它相当紧密地遵循了关系代数的结构:
SELECT * FROM animals WHERE family = 'Sharks';
An imperative language tells the computer to perform certain operations in a certain order. You can imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
一种命令式编程语言会告诉计算机按照一定的顺序执行某些操作。你可以想象逐行地浏览代码,评估条件,更新变量,然后决定是否再次执行循环。
In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
在声明式查询语言(如SQL或关系代数)中,您只需指定所需的数据模式 - 结果必须满足哪些条件以及如何转换数据(例如排序、分组和聚合) - 而不是如何实现该目标。由数据库系统的查询优化器决定使用哪些索引和连接方法,以及以哪种顺序执行查询的各个部分。
A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
声明性查询语言有吸引力,因为通常比命令式API更简洁易用。但更重要的是,它也隐藏了数据库引擎的实现细节,这使得数据库系统能够引入性能改进而无需对查询进行任何更改。
For example, in the imperative code shown at the beginning of this section, the list of animals appears in a particular order. If the database wants to reclaim unused disk space behind the scenes, it might need to move records around, changing the order in which the animals appear. Can the database do that safely, without breaking queries?
在本节开头所示的命令式代码中,动物列表以特定的顺序出现。如果数据库想要在后台回收未使用的磁盘空间,则可能需要移动记录并更改动物出现的顺序。数据库能安全地这样做,而不会破坏查询吗?
The SQL example doesn’t guarantee any particular ordering, and so it doesn’t mind if the order changes. But if the query is written as imperative code, the database can never be sure whether the code is relying on the ordering or not. The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.
这个SQL示例并不保证特定的排序,所以它不在意顺序的改变。但是如果查询被写成命令式代码,数据库就不能确定代码是否依赖于排序。事实上,SQL的功能更加有限,这给数据库留下了更多自动优化的空间。
Finally, declarative languages often lend themselves to parallel execution. Today, CPUs are getting faster by adding more cores, not by running at significantly higher clock speeds than before [ 31 ]. Imperative code is very hard to parallelize across multiple cores and multiple machines, because it specifies instructions that must be performed in a particular order. Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results. The database is free to use a parallel implementation of the query language, if appropriate [ 32 ].
最终,声明性语言通常适合并行执行。如今,CPU 通过添加更多核心来加快速度,而不是以比以前显著更高的时钟速度运行[31]。 命令式代码很难跨多个核心和多个机器并行化,因为它指定必须按特定顺序执行的指令。声明性语言在并行执行方面有更大的机会变得更快,因为它们仅指定结果的模式,而不是用于确定结果的算法。如果合适,数据库可以自由地使用查询语言的并行实现[32]。
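The sketch below illustrates why a condition that states only which records are wanted is easy to split up. It does not actually run in parallel: the partitions are processed one after another in a single thread, in a deliberately shuffled order, simply to show that the result does not depend on the order of evaluation. The sample data and the helpers partition and selectPartitioned are hypothetical.

```javascript
const animals = [
  { name: "Great White Shark", family: "Sharks" },
  { name: "Blue Whale", family: "Whales" },
  { name: "Tiger Shark", family: "Sharks" },
  { name: "Fin Whale", family: "Whales" },
  { name: "Hammerhead Shark", family: "Sharks" }
];

// Split the records into n partitions (round-robin); this stands in for
// data spread over several cores or machines.
function partition(records, n) {
  const parts = Array.from({ length: n }, () => []);
  records.forEach((record, i) => parts[i % n].push(record));
  return parts;
}

// The condition states only which records are wanted, not how to find them.
const isShark = (animal) => animal.family === "Sharks";

// Each partition can be filtered independently and in any order.
function selectPartitioned(records, predicate, n) {
  return partition(records, n)
    .reverse() // deliberately process partitions in a different order
    .map((part) => part.filter(predicate))
    .flat();
}

const names = (rows) => rows.map((r) => r.name).sort();
const sequential = names(animals.filter(isShark));
const partitioned = names(selectPartitioned(animals, isShark, 3));
console.log(sequential, partitioned);
```

Because the query says nothing about ordering, both evaluations yield the same set of sharks once the ordering of the output is ignored.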
Declarative Queries on the Web
The advantages of declarative query languages are not limited to just databases. To illustrate the point, let’s compare declarative and imperative approaches in a completely different environment: a web browser.
声明式查询语言的优点不仅限于数据库。为了说明这一点,让我们比较声明式和命令式方法在一个完全不同的环境下:Web浏览器。
Say you have a website about animals in the ocean. The user is currently viewing the page on sharks, so you mark the navigation item “Sharks” as currently selected, like this:
假设你有一个关于海洋动物的网站。用户目前正在查看关于鲨鱼的页面,那么你可以将导航条目“鲨鱼”标记为当前选中状态,就像这样:
<ul>
    <li class="selected">
        <p>Sharks</p>
        <ul>
            <li>Great White Shark</li>
            <li>Tiger Shark</li>
            <li>Hammerhead Shark</li>
        </ul>
    </li>
    <li>
        <p>Whales</p>
        <ul>
            <li>Blue Whale</li>
            <li>Humpback Whale</li>
            <li>Fin Whale</li>
        </ul>
    </li>
</ul>
-
The selected item is marked with the CSS class "selected".
所选择的项目带有CSS类“selected”。
-
<p>Sharks</p> is the title of the currently selected page.
当前选定页面的标题是“鲨鱼”。
Now say you want the title of the currently selected page to have a blue background, so that it is visually highlighted. This is easy, using CSS:
现在假设您想要当前选定页面的标题具有蓝色背景,以便在视觉上突出显示。使用 CSS 很容易实现:
li.selected > p {
    background-color: blue;
}
Here the CSS selector li.selected > p declares the pattern of elements to which we want to apply the blue style: namely, all <p> elements whose direct parent is an <li> element with a CSS class of selected. The element <p>Sharks</p> in the example matches this pattern, but <p>Whales</p> does not match because its <li> parent lacks class="selected".
此处 CSS 选择器 li.selected > p 声明我们要应用蓝色样式的元素模式:即所有直接父元素为 class="selected" 的 <li> 元素的 <p> 元素。在示例中,元素 <p>Sharks</p> 符合此模式,但 <p>Whales</p> 不符合,因为其 <li> 父元素缺少 class="selected"。
If you were using XSL instead of CSS, you could do something similar:
如果你使用XSL而不是CSS,你可以做类似的事情:
<xsl:template match="li[@class='selected']/p">
    <fo:block background-color="blue">
        <xsl:apply-templates/>
    </fo:block>
</xsl:template>
Here, the XPath expression li[@class='selected']/p is equivalent to the CSS selector li.selected > p in the previous example. What CSS and XSL have in common is that they are both declarative languages for specifying the styling of a document.
这里,XPath表达式li[@class='selected']/p相当于前面例子中的CSS选择器li.selected > p。CSS和XSL的共同点在于它们都是用于指定文档样式的声明性语言。
Imagine what life would be like if you had to use an imperative approach. In JavaScript, using the core Document Object Model (DOM) API, the result might look something like this:
想象一下,如果你必须使用命令式方法来生活会是什么样子。在 JavaScript 中,使用核心文档对象模型 (DOM) API,结果可能看起来像这样:
var liElements = document.getElementsByTagName("li");
for (var i = 0; i < liElements.length; i++) {
    if (liElements[i].className === "selected") {
        var children = liElements[i].childNodes;
        for (var j = 0; j < children.length; j++) {
            var child = children[j];
            if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
                child.setAttribute("style", "background-color: blue");
            }
        }
    }
}
This JavaScript imperatively sets the element <p>Sharks</p> to have a blue background, but the code is awful. Not only is it much longer and harder to understand than the CSS and XSL equivalents, but it also has some serious problems:
这个JavaScript命令式地将元素<p>鲨鱼</p>的背景设为蓝色,但代码十分糟糕。它不仅比CSS和XSL等价物更长,更难理解,而且还有一些严重的问题:
-
If the selected class is removed (e.g., because the user clicks a different page), the blue color won’t be removed, even if the code is rerun, and so the item will remain highlighted until the entire page is reloaded. With CSS, the browser automatically detects when the li.selected > p rule no longer applies and removes the blue background as soon as the selected class is removed.
如果所选的类别被删除(例如,因为用户点击了不同的页面),即使重新运行代码,蓝色也不会删除,因此该项将保持突出显示,直到整个页面重新加载。使用CSS,浏览器会自动检测li.selected > p规则不再适用并在移除所选类别时立即删除蓝色背景。
-
If you want to take advantage of a new API, such as document.getElementsByClassName("selected") or even document.evaluate() (which may improve performance), you have to rewrite the code. On the other hand, browser vendors can improve the performance of CSS and XPath without breaking compatibility.
如果您想利用新API,例如document.getElementsByClassName(“selected”)或document.evaluate(),可能会提高性能,那么您必须重新编写代码。另一方面,浏览器厂商可以提高CSS和XPath的性能,而不会破坏兼容性。
In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript. Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs.
在Web浏览器中,使用声明性CSS样式比在JavaScript中使用命令式样式要好得多。同样,在数据库中,声明性查询语言例如SQL被证明比命令式查询API更好用。
MapReduce Querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google [ 33 ]. A limited form of MapReduce is supported by some NoSQL datastores, including MongoDB and CouchDB, as a mechanism for performing read-only queries across many documents.
MapReduce是一种编程模型,用于在许多计算机上批量处理大量数据,由Google流行化[33]。一些NoSQL数据存储支持MapReduce的有限形式,包括MongoDB和CouchDB,作为在许多文档上执行只读查询的机制。
MapReduce in general is described in more detail in Chapter 10 . For now, we’ll just briefly discuss MongoDB’s use of the model.
总体上说,MapReduce在第10章中有更详细的描述。现在,我们只是简要讨论MongoDB使用该模型的情况。
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
MapReduce不是声明性查询语言也不是完全的命令式查询API,而是介于两者之间:查询逻辑用代码片段表达,这些代码片段在处理框架中被重复调用。它基于在许多功能编程语言中存在的map(也称为collect)和reduce(也称为fold或inject)函数。
To give an example, imagine you are a marine biologist, and you add an observation record to your database every time you see animals in the ocean. Now you want to generate a report saying how many sharks you have sighted per month.
举个例子,比如你是一位海洋生物学家,每当你在海洋中看到动物时,你都会添加一条观察记录到你的数据库中。现在你想生成一份报告,说明每个月你观察到了多少只鲨鱼。
In PostgreSQL you might express that query like this:
在PostgreSQL中,您可以像这样表达该查询:
SELECT date_trunc('month', observation_timestamp) AS observation_month,
       sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
-
The date_trunc('month', timestamp) function determines the calendar month containing timestamp, and returns another timestamp representing the beginning of that month. In other words, it rounds a timestamp down to the nearest month.
date_trunc('month', timestamp)函数确定包含timestamp的日历月份,并返回表示该月份开头的另一个时间戳。换句话说,它将时间戳向下舍入到最近的月份。
This query first filters the observations to only show species in the Sharks family, then groups the observations by the calendar month in which they occurred, and finally adds up the number of animals seen in all observations in that month.
该查询首先按照“鲨鱼”物种筛选观察记录,然后按照所发生的日历月份对观测进行分组,最后计算该月份所有观测中看到的动物数量。
The same can be expressed with MongoDB’s MapReduce feature as follows:
可以使用MongoDB的MapReduce功能以如下方式表达相同的内容:
db.observations.mapReduce(
    function map() {
        var year  = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals);
    },
    function reduce(key, values) {
        return Array.sum(values);
    },
    {
        query: { family: "Sharks" },
        out: "monthlySharkReport"
    }
);
-
The filter to consider only shark species can be specified declaratively (this is a MongoDB-specific extension to MapReduce).
可以通过声明性的方式指定只考虑鲨鱼物种的过滤器(这是MongoDB特有的MapReduce扩展功能)。
-
The JavaScript function map is called once for every document that matches query, with this set to the document object.
JavaScript 函数map会针对每个与查询匹配的文档调用一次,将this设置为文档对象。
-
The map function emits a key (a string consisting of year and month, such as "2013-12" or "2014-1") and a value (the number of animals in that observation).
map函数发射一个键(由年份和月份组成的字符串,如“2013-12”或“2014-1”)和一个值(观察中动物的数量)。
-
The key-value pairs emitted by map are grouped by key. For all key-value pairs with the same key (i.e., the same month and year), the reduce function is called once.
map 发出的键值对根据键分组。对于所有具有相同键(即相同的月份和年份)的键值对,会调用一次 reduce 函数。
-
The reduce function adds up the number of animals from all observations in a particular month.
reduce函数将特定月份内所有观察结果的动物数量相加。
-
The final output is written to the collection monthlySharkReport.
最终输出将被写入集合monthlySharkReport。
For example, say the observations collection contains these two documents:
例如,假设观察集合包含以下两个文档:
{
    observationTimestamp: Date.parse("Mon, 25 Dec 1995 12:34:56 GMT"),
    family:     "Sharks",
    species:    "Carcharodon carcharias",
    numAnimals: 3
}
{
    observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
    family:     "Sharks",
    species:    "Carcharias taurus",
    numAnimals: 4
}
The map function would be called once for each document, resulting in emit("1995-12", 3) and emit("1995-12", 4). Subsequently, the reduce function would be called with reduce("1995-12", [3, 4]), returning 7.
“map”函数将针对每个文档调用一次,导致“emit(“1995-12”,3)”和“emit(“1995-12”,4)” 。随后,reduce函数将使用reduce(“1995-12”,[3,4])进行调用,返回7。
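The flow just described can be imitated outside the database. The following is a much-simplified, single-machine sketch, not MongoDB's implementation: the mapReduce helper is hypothetical, emit is passed to map as an argument rather than being provided globally, the timestamps are Date objects, and the UTC getters are used so that the result does not depend on the local time zone.

```javascript
// Hypothetical stand-in for a MapReduce framework: filter the documents,
// call map on each one, group the emitted pairs by key, reduce each group.
function mapReduce(documents, mapFn, reduceFn, query) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  documents
    .filter((doc) => Object.keys(query).every((field) => doc[field] === query[field]))
    .forEach((doc) => mapFn.call(doc, emit)); // "this" is the document, as in the text
  const output = {};
  for (const [key, values] of groups) {
    output[key] = reduceFn(key, values);
  }
  return output;
}

const observations = [
  { observationTimestamp: new Date("Mon, 25 Dec 1995 12:34:56 GMT"),
    family: "Sharks", species: "Carcharodon carcharias", numAnimals: 3 },
  { observationTimestamp: new Date("Tue, 12 Dec 1995 16:17:18 GMT"),
    family: "Sharks", species: "Carcharias taurus", numAnimals: 4 }
];

const report = mapReduce(
  observations,
  function map(emit) {
    const year = this.observationTimestamp.getUTCFullYear();
    const month = this.observationTimestamp.getUTCMonth() + 1;
    emit(year + "-" + month, this.numAnimals);
  },
  function reduce(key, values) {
    return values.reduce((total, n) => total + n, 0);
  },
  { family: "Sharks" }
);

console.log(report); // { '1995-12': 7 }
```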
The map and reduce functions are somewhat restricted in what they are allowed to do. They must be pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects. These restrictions allow the database to run the functions anywhere, in any order, and rerun them on failure. However, they are nevertheless powerful: they can parse strings, call library functions, perform calculations, and more.
映射和归约函数在其允许执行的操作上有一定的限制。它们必须是纯函数,这意味着它们只能使用作为输入传递给它们的数据,不能执行额外的数据库查询,也不能产生任何副作用。这些限制使得数据库可以在任何地方以任何顺序运行这些函数,并在失败时重新运行它们。然而,它们仍然非常强大:它们可以解析字符串、调用库函数、执行计算等等。
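A small sketch shows what purity buys. The reduce function below depends only on its arguments, so calling it again, or calling it on the values in a different order, gives the same answer. Combining separately computed partial results additionally relies on addition being associative and commutative, which is a property of this particular example rather than of pure functions in general.

```javascript
// A pure reduce function: the result depends only on its arguments.
function reduce(key, values) {
  return values.reduce((total, n) => total + n, 0);
}

const key = "1995-12";
const values = [3, 4, 2, 5];

// All values at once.
const allAtOnce = reduce(key, values);

// The same values in a different order.
const reordered = reduce(key, [5, 2, 4, 3]);

// Partial results computed separately (as if on different machines), then combined.
const combined = reduce(key, [reduce(key, [3, 4]), reduce(key, [2, 5])]);

// Rerunning after a simulated failure gives the same answer.
const rerun = reduce(key, values);

console.log(allAtOnce, reordered, combined, rerun);
```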
MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations (see Chapter 10 ), but there are also many distributed implementations of SQL that don’t use MapReduce. Note there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.
MapReduce是一个基于集群机器分布式执行的相对较低级的编程模型。高级查询语言如SQL可以作为MapReduce操作的流水线来实现(见第10章),但也有许多分布式的SQL实现不使用MapReduce。需要注意的是,SQL并不局限于单机运行,而MapReduce也不是唯一的分布式查询执行方案。
Being able to use JavaScript code in the middle of a query is a great feature for advanced queries, but it’s not limited to MapReduce—some SQL databases can be extended with JavaScript functions too [ 34 ].
JavaScript代码在查询过程中的使用是高级查询的重要功能,但它并不局限于MapReduce—某些SQL数据库也可以使用JavaScript函数扩展。[34]。
A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query. Moreover, a declarative query language offers more opportunities for a query optimizer to improve the performance of a query. For these reasons, MongoDB 2.2 added support for a declarative query language called the aggregation pipeline [ 9 ]. In this language, the same shark-counting query looks like this:
MapReduce存在的一个可用性问题是你需要编写两个精心协调的JavaScript函数,通常比编写单个查询更困难。此外,声明性查询语言提供了更多的机会让查询优化器提高查询性能。出于这些原因,MongoDB 2.2添加了支持称为聚合管道的声明性查询语言[9]。在这种语言中,同样的鲨鱼计数查询看起来像这样:
db.observations.aggregate([
    { $match: { family: "Sharks" } },
    { $group: {
        _id: {
            year:  { $year:  "$observationTimestamp" },
            month: { $month: "$observationTimestamp" }
        },
        totalAnimals: { $sum: "$numAnimals" }
    } }
]);
The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a matter of taste. The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise.
聚合管道语言在表达能力上与 SQL 的子集类似,但它使用基于 JSON 的语法,而不是 SQL 的英语句子样式语法;这种差异可能只是口味问题。故事的道德是,NoSQL 系统可能会意外地重新发明 SQL,尽管是伪装的。
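To show how such a pipeline reads as a sequence of declarative stages, here is an in-memory imitation in plain JavaScript. It is only a sketch: the helpers match, groupSum, and aggregate are hypothetical and mimic the shape of the stages, not MongoDB's implementation, and the third observation (a whale) is invented to show the filter at work.

```javascript
// Hypothetical stage helpers; each returns a function from rows to rows.
const match = (criteria) => (docs) =>
  docs.filter((doc) => Object.keys(criteria).every((f) => doc[f] === criteria[f]));

const groupSum = (keyFn, valueFn) => (docs) => {
  const totals = new Map();
  for (const doc of docs) {
    const id = keyFn(doc);
    const k = JSON.stringify(id); // use the serialized id as the grouping key
    const entry = totals.get(k) || { _id: id, totalAnimals: 0 };
    entry.totalAnimals += valueFn(doc);
    totals.set(k, entry);
  }
  return [...totals.values()];
};

// Run the stages in order, feeding each stage's output into the next.
const aggregate = (docs, stages) => stages.reduce((rows, stage) => stage(rows), docs);

const observations = [
  { observationTimestamp: new Date("1995-12-25T12:34:56Z"), family: "Sharks", numAnimals: 3 },
  { observationTimestamp: new Date("1995-12-12T16:17:18Z"), family: "Sharks", numAnimals: 4 },
  { observationTimestamp: new Date("1995-12-30T09:00:00Z"), family: "Whales", numAnimals: 1 }
];

const result = aggregate(observations, [
  match({ family: "Sharks" }),
  groupSum(
    (doc) => ({
      year: doc.observationTimestamp.getUTCFullYear(),
      month: doc.observationTimestamp.getUTCMonth() + 1
    }),
    (doc) => doc.numAnimals
  )
]);

console.log(JSON.stringify(result));
```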
Graph-Like Data Models
We saw earlier that many-to-many relationships are an important distinguishing feature between different data models. If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
我们之前看到,多对多关系是不同数据模型之间的一个重要区别特征。如果你的应用程序主要有一对多关系(树形结构数据)或记录之间没有关系,那么文档模型是适合的。
But what if many-to-many relationships are very common in your data? The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
如果你的数据中有很多多对多的关系会怎样?关系型模型可以处理简单的多对多关系,但是随着你的数据内部连接变得更加复杂,建模你的数据成为一个图形就更加自然了。
A graph consists of two kinds of objects: vertices (also known as nodes or entities ) and edges (also known as relationships or arcs ). Many kinds of data can be modeled as a graph. Typical examples include:
一个图表由两种对象组成:顶点(也称为节点或实体)和边(也称为关系或弧)。许多种数据可以建模为图。典型的例子包括:
- Social graphs
-
Vertices are people, and edges indicate which people know each other.
顶点是人,边表示哪些人彼此认识。
- The web graph
-
Vertices are web pages, and edges indicate HTML links to other pages.
顶点是网页,而边表示 HTML 链接到其他页面。
- Road or rail networks
-
Vertices are junctions, and edges represent the roads or railway lines between them.
顶点是交汇点,边代表它们之间的公路或铁路线。
Well-known algorithms can operate on these graphs: for example, car navigation systems search for the shortest path between two points in a road network, and PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results.
著名算法可以在这些图上运作:例如,汽车导航系统在道路网络中寻找两个点之间的最短路径,而PageRank可以在Web图上用于确定网页的受欢迎程度,从而确定其在搜索结果中的排名。
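As a small illustration of an algorithm operating on a graph, the sketch below finds a path with the fewest edges between two junctions using breadth-first search. The road network is hypothetical, and real navigation systems work with weighted edges (distances or travel times) and more sophisticated algorithms; this is only the unweighted special case.

```javascript
// Hypothetical road network as an adjacency list (junction -> neighbouring junctions).
const roads = {
  A: ["B", "C"],
  B: ["A", "D"],
  C: ["A", "D"],
  D: ["B", "C", "E"],
  E: ["D"]
};

// Breadth-first search: finds a path with the fewest edges between two vertices.
function shortestPath(graph, start, goal) {
  const previous = new Map([[start, null]]); // vertex -> vertex we reached it from
  const queue = [start];
  while (queue.length > 0) {
    const vertex = queue.shift();
    if (vertex === goal) {
      // Walk back from the goal to the start to reconstruct the path.
      const path = [];
      for (let v = goal; v !== null; v = previous.get(v)) path.unshift(v);
      return path;
    }
    for (const neighbour of graph[vertex] || []) {
      if (!previous.has(neighbour)) {
        previous.set(neighbour, vertex);
        queue.push(neighbour);
      }
    }
  }
  return null; // goal is unreachable
}

console.log(shortestPath(roads, "A", "E"));
```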
In the examples just given, all the vertices in a graph represent the same kind of thing (people, web pages, or road junctions, respectively). However, graphs are not limited to such homogeneous data: an equally powerful use of graphs is to provide a consistent way of storing completely different types of objects in a single datastore. For example, Facebook maintains a single graph with many different types of vertices and edges: vertices represent people, locations, events, checkins, and comments made by users; edges indicate which people are friends with each other, which checkin happened in which location, who commented on which post, who attended which event, and so on [ 35 ].
在刚才的例子中,图中的所有顶点代表的是同一种类型的事物(人、网页或者路口)。然而,图并不限于这样的同质数据:使用图的同样强大的方式是在一个单一的数据存储中提供一种一致的方式来存储完全不同类型的对象。例如,Facebook维护一个包含多种不同类型的顶点和边的单个图:顶点代表人、位置、事件、签到和用户做出的评论;边表示哪些人是彼此的朋友,哪个位置发生了哪个签到,谁评论了哪篇帖子,谁参加了哪个活动等等[35]。
In this section we will use the example shown in Figure 2-5 . It could be taken from a social network or a genealogical database: it shows two people, Lucy from Idaho and Alain from Beaune, France. They are married and living in London.
在这个部分,我们将使用图2-5中展示的例子。它可以来自社交网络或家谱数据库:它展示了两个人,来自爱达荷州的露西和来自法国博纳的阿兰。他们结婚并住在伦敦。
There are several different, but related, ways of structuring and querying data in graphs. In this section we will discuss the property graph model (implemented by Neo4j, Titan, and InfiniteGraph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). We will look at three declarative query languages for graphs: Cypher, SPARQL, and Datalog. Besides these, there are also imperative graph query languages such as Gremlin [ 36 ] and graph processing frameworks like Pregel (see Chapter 10 ).
有几种不同但相关的方式来在图中对数据进行结构化和查询。在本节中,我们将讨论属性图模型(由Neo4j、Titan和InfiniteGraph实现)和三元组存储模型(由Datomic、AllegroGraph等实现)。我们将看一下三种图形的声明性查询语言:Cypher、SPARQL和Datalog。除此之外,还有类似Gremlin的命令式图形查询语言以及Pregel等图形处理框架(请参见第10章)。
Property Graphs
In the property graph model, each vertex consists of:
在属性图模型中,每个顶点由以下内容组成:
-
A unique identifier
一个唯一标识符
-
A set of outgoing edges
出边集合
-
A set of incoming edges
一组入边
-
A collection of properties (key-value pairs)
一组属性(键值对)
Each edge consists of:
每个边缘由:
-
A unique identifier
一个唯一标识符
-
The vertex at which the edge starts (the tail vertex )
边起始的顶点(尾部顶点)
-
The vertex at which the edge ends (the head vertex )
边缘的终点顶点(头顶点)。
-
A label to describe the kind of relationship between the two vertices
两个各自描述不同的实体之间关系类型的标签。
-
A collection of properties (key-value pairs)
一组属性(键值对)
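The two lists above translate fairly directly into data structures. The following in-memory sketch is one possible shape, not how any particular graph database stores its data; the identifiers, labels, and property names are illustrative assumptions.

```javascript
// Hypothetical in-memory property graph.
const vertices = new Map();
const edges = new Map();

// A vertex has an id, properties, and sets of outgoing and incoming edge ids.
function addVertex(id, properties) {
  vertices.set(id, { id, properties, outgoing: new Set(), incoming: new Set() });
}

// An edge has an id, a tail vertex, a head vertex, a label, and properties.
function addEdge(id, tailId, headId, label, properties = {}) {
  if (!vertices.has(tailId) || !vertices.has(headId)) {
    throw new Error("both endpoints must exist");
  }
  edges.set(id, { id, tail: tailId, head: headId, label, properties });
  vertices.get(tailId).outgoing.add(id); // the edge starts at the tail vertex
  vertices.get(headId).incoming.add(id); // the edge ends at the head vertex
}

addVertex(1, { type: "person", name: "Lucy" });
addVertex(2, { type: "state", name: "Idaho" });
addVertex(3, { type: "city", name: "London" });

addEdge(100, 1, 2, "born_in");
addEdge(101, 1, 3, "lives_in");

console.log(vertices.get(1).outgoing.size, vertices.get(2).incoming.size);
```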
You can think of a graph store as consisting of two relational tables, one for vertices and one for edges, as shown in Example 2-2 (this schema uses the PostgreSQL json datatype to store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if you want the set of incoming or outgoing edges for a vertex, you can query the edges table by head_vertex or tail_vertex, respectively.
你可以将图存储视为由两个关系表组成,一个存储顶点,另一个存储边,如示例2-2所示(此架构使用PostgreSQL json数据类型来存储每个顶点或边的属性)。每个边都存储头部和尾部顶点;如果您想要一个顶点的传入或传出边的集合,则可以分别通过head_vertex或tail_vertex查询edges表。
Example 2-2. Representing a property graph using a relational schema
CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
Some important aspects of this model are:
这个模型的一些重要方面是:
-
Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.
任何顶点都可以与任何其他顶点连接一条边。没有任何模式限制哪些物品可以或不能关联。
-
Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph (i.e., follow a path through a chain of vertices) both forward and backward. (That’s why Example 2-2 has indexes on both the tail_vertex and head_vertex columns.)
给定任何一个顶点,你可以高效地找到它的入边和出边,从而可以在图中遍历,即沿着顶点链条向前和向后走。(这就是为什么示例2-2在tail_vertex和head_vertex列上都有索引的原因。)
-
By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.
通过为不同类型的关系使用不同的标签,您可以在单个图中存储多种不同类型的信息,同时仍然保持清晰的数据模型。
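The role of the two indexes can be sketched with plain JavaScript maps. The edge rows, vertex names, and labels below are hypothetical; the two Map objects play the part of the indexes on tail_vertex and head_vertex, making it cheap to follow edges in either direction.

```javascript
// Hypothetical edge rows mirroring the columns of the edges table.
const edges = [
  { edge_id: 1, tail_vertex: "lucy", head_vertex: "idaho", label: "born_in" },
  { edge_id: 2, tail_vertex: "idaho", head_vertex: "usa", label: "within" },
  { edge_id: 3, tail_vertex: "alain", head_vertex: "beaune", label: "born_in" }
];

// Build one index per direction, analogous to the two CREATE INDEX statements.
function buildIndex(rows, column) {
  const index = new Map();
  for (const row of rows) {
    if (!index.has(row[column])) index.set(row[column], []);
    index.get(row[column]).push(row);
  }
  return index;
}

const byTail = buildIndex(edges, "tail_vertex"); // outgoing edges per vertex
const byHead = buildIndex(edges, "head_vertex"); // incoming edges per vertex

// Forward: follow outgoing edges to collect every vertex reachable from start.
function reachableFrom(start) {
  const seen = [];
  const stack = [start];
  while (stack.length > 0) {
    const v = stack.pop();
    for (const e of byTail.get(v) || []) {
      if (!seen.includes(e.head_vertex)) {
        seen.push(e.head_vertex);
        stack.push(e.head_vertex);
      }
    }
  }
  return seen;
}

// Backward: which vertices have an edge with this label pointing at the vertex?
function sourcesOf(vertex, label) {
  return (byHead.get(vertex) || [])
    .filter((e) => e.label === label)
    .map((e) => e.tail_vertex);
}

console.log(reachableFrom("lucy"), sourcesOf("idaho", "born_in"));
```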
Those features give graphs a great deal of flexibility for data modeling, as illustrated in Figure 2-5. The figure shows a few things that would be difficult to express in a traditional relational schema, such as different kinds of regional structures in different countries (France has départements and régions, whereas the US has counties and states), quirks of history such as a country within a country (ignoring for now the intricacies of sovereign states and nations), and varying granularity of data (Lucy’s current residence is specified as a city, whereas her place of birth is specified only at the level of a state).
这些特性赋予图形数据建模以极大的灵活性,如图2-5所示。该图显示了一些传统关系模式难以表达的东西,例如不同国家的不同地区结构(法国有départements和régions,而美国有郡和州),历史上的奇怪现象,如国家内的国家(暂时忽略主权国家和民族的复杂性),以及数据的变化粒度(Lucy的目前居住地指定为城市,而她的出生地仅在州一级指定)。
You could imagine extending the graph to also include many other facts about Lucy and Alain, or other people. For instance, you could use it to indicate any food allergies they have (by introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an allergy), and link the allergens with a set of vertices that show which foods contain which substances. Then you could write a query to find out what is safe for each person to eat. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
你可以想象将该图扩展以包括有关Lucy和Alain或其他人的许多其他事实。例如,您可以使用它来指示他们有哪些食物过敏症(通过为每种过敏原引入一个顶点,并在人和过敏原之间添加一条边来表示过敏症),并使用一组显示哪些食物含有哪些物质的顶点将过敏原连接起来。然后,您可以编写查询以查找每个人可以安全食用的食物。 图表很适合扩展性:随着您向应用程序添加功能,图表可以轻松扩展以适应应用程序数据结构的变化。
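The allergy extension described above might look like the following sketch. Every vertex name, edge label, and helper function here is made up for illustration; the point is only that new kinds of facts are added as more edges, without changing any schema.

```python
# Edges as (tail, label, head) tuples; all names are hypothetical.
edges = [
    ("lucy", "allergic_to", "peanut"),
    ("alain", "allergic_to", "gluten"),
    ("peanut_butter", "contains", "peanut"),
    ("bread", "contains", "gluten"),
    ("apple", "contains", "fructose"),
]

def heads(tail, label):
    """All vertices reached from tail over edges with the given label."""
    return {h for t, l, h in edges if t == tail and l == label}

def safe_foods(person):
    """Foods that contain none of the person's allergens."""
    allergens = heads(person, "allergic_to")
    foods = {t for t, l, _ in edges if l == "contains"}
    return sorted(f for f in foods if not (heads(f, "contains") & allergens))

lucy_safe = safe_foods("lucy")
alain_safe = safe_foods("alain")
```

With this data, lucy_safe excludes peanut_butter and alain_safe excludes bread.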
The Cypher Query Language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database [ 37 ]. (It is named after a character in the movie The Matrix and is not related to ciphers in cryptography [ 38 ].)
Cypher是一种面向属性图的声明式查询语言,为Neo4j图数据库创建[37]。 (它的名字来源于电影《黑客帝国》中的一个角色,并与密码学中的加密术语[38]无关。)
Example 2-3 shows the Cypher query to insert the lefthand portion of Figure 2-5 into a graph database. The rest of the graph can be added similarly and is omitted for readability. Each vertex is given a symbolic name like USA or Idaho, and other parts of the query can use those names to create edges between the vertices, using an arrow notation: (Idaho) -[:WITHIN]-> (USA) creates an edge labeled WITHIN, with Idaho as the tail node and USA as the head node.
示例2-3展示了将图2-5左侧部分插入图数据库的Cypher查询。图的其余部分可以用类似方式添加,为便于阅读此处省略。每个顶点都有一个符号名称,例如USA或Idaho,查询的其他部分可以使用这些名称在顶点之间创建边,使用箭头符号:(Idaho) -[:WITHIN]-> (USA) 创建一条标记为WITHIN的边,其中Idaho为尾节点,USA为头节点。
Example 2-3. A subset of the data in Figure 2-5 , represented as a Cypher query
CREATE
  (NAmerica:Location {name:'North America', type:'continent'}),
  (USA:Location      {name:'United States', type:'country'  }),
  (Idaho:Location    {name:'Idaho',         type:'state'    }),
  (Lucy:Person       {name:'Lucy' }),
  (Idaho) -[:WITHIN]->  (USA)  -[:WITHIN]-> (NAmerica),
  (Lucy)  -[:BORN_IN]-> (Idaho)
When all the vertices and edges of Figure 2-5 are added to the database, we can start asking interesting questions: for example, find the names of all the people who emigrated from the United States to Europe. To be more precise, here we want to find all the vertices that have a BORN_IN edge to a location within the US, and also a LIVES_IN edge to a location within Europe, and return the name property of each of those vertices.
当图2-5的所有顶点和边都添加到数据库后,我们就可以开始提出有趣的问题:例如,查找所有从美国移民到欧洲的人的名字。更准确地说,我们要找到所有这样的顶点:它们既有一条指向美国境内某个位置的BORN_IN边,又有一条指向欧洲境内某个位置的LIVES_IN边,然后返回每个这样的顶点的name属性。
Example 2-4 shows how to express that query in Cypher. The same arrow notation is used in a MATCH clause to find patterns in the graph: (person) -[:BORN_IN]-> () matches any two vertices that are related by an edge labeled BORN_IN. The tail vertex of that edge is bound to the variable person, and the head vertex is left unnamed.
例2-4展示了如何用Cypher表达该查询。在MATCH子句中使用相同的箭头符号来查找图中的模式:(person)- [:BORN_IN] - >()匹配通过标记为BORN_IN的边缘相关的任意两个顶点。该边的尾顶点绑定到变量person,而头顶点则未命名。
Example 2-4. Cypher query to find people who emigrated from the US to Europe
MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name
The query can be read as follows:
该查询可以理解为:
Find any vertex (call it person) that meets both of the following conditions:
寻找满足以下两个条件的任意顶点(称之为person):
-
person has an outgoing BORN_IN edge to some vertex. From that vertex, you can follow a chain of outgoing WITHIN edges until eventually you reach a vertex of type Location, whose name property is equal to "United States".
person有一条指向某个顶点的BORN_IN出边。从该顶点出发,沿着一连串WITHIN出边前进,最终会到达一个类型为Location的顶点,其name属性等于"United States"。
-
That same person vertex also has an outgoing LIVES_IN edge. Following that edge, and then a chain of outgoing WITHIN edges, you eventually reach a vertex of type Location, whose name property is equal to "Europe".
同一个person顶点还有一条LIVES_IN出边。沿着这条边,再沿着一连串WITHIN出边前进,最终会到达一个类型为Location的顶点,其name属性等于"Europe"。
For each such person vertex, return the name property.
对于每个这样的person顶点,返回其name属性。
There are several possible ways of executing the query. The description given here suggests that you start by scanning all the people in the database, examine each person’s birthplace and residence, and return only those people who meet the criteria.
有几种可能的执行查询的方式。这里给出的描述建议您首先扫描数据库中的所有人员,检查每个人的出生地和居住地,只返回符合条件的人员。
But equivalently, you could start with the two Location vertices and work backward. If there is an index on the name property, you can probably efficiently find the two vertices representing the US and Europe. Then you can proceed to find all locations (states, regions, cities, etc.) in the US and Europe respectively by following all incoming WITHIN edges. Finally, you can look for people who can be found through an incoming BORN_IN or LIVES_IN edge at one of the location vertices.
然而同等地,你也可以从两个位置顶点开始向后推导。如果名称属性上有索引,你可能可以有效地找到代表美国和欧洲的两个顶点。然后,你可以通过跟随所有传入的WITHIN边缘,分别找到美国和欧洲的所有位置(州,地区,城市等)。最后,你可以在位置顶点之一通过传入的BORN_IN或LIVES_IN边缘查找可以找到的人。
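The two execution strategies just described can be compared in a small sketch. The edge data (including the European locations and Alain’s edges), the vertex identifiers, and the function names are assumptions for illustration; a real query optimizer would choose between such plans using statistics rather than hand-written code.

```python
from collections import defaultdict

EDGES = [  # (tail, label, head); illustrative data only
    ("idaho", "within", "usa"), ("usa", "within", "namerica"),
    ("london", "within", "england"), ("england", "within", "uk"),
    ("uk", "within", "europe"), ("france", "within", "europe"),
    ("lucy", "born_in", "idaho"), ("lucy", "lives_in", "london"),
    ("alain", "born_in", "france"), ("alain", "lives_in", "london"),
]
PEOPLE = ["lucy", "alain"]

out_idx, in_idx = defaultdict(list), defaultdict(list)
for tail, label, head in EDGES:
    out_idx[(tail, label)].append(head)  # lookup by tail vertex
    in_idx[(head, label)].append(tail)   # lookup by head vertex

def reachable(start, index):
    """start plus every vertex reached by following within edges via index."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in index[(stack.pop(), "within")]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def scan_people():
    """Strategy 1: examine every person and walk outward along outgoing edges."""
    return sorted(
        p for p in PEOPLE
        if any("usa" in reachable(loc, out_idx) for loc in out_idx[(p, "born_in")])
        and any("europe" in reachable(loc, out_idx) for loc in out_idx[(p, "lives_in")])
    )

def start_from_locations():
    """Strategy 2: start at the two locations and work backward along incoming edges."""
    born = {p for loc in reachable("usa", in_idx) for p in in_idx[(loc, "born_in")]}
    lives = {p for loc in reachable("europe", in_idx) for p in in_idx[(loc, "lives_in")]}
    return sorted(born & lives)

plan_a = scan_people()
plan_b = start_from_locations()
```

Both plans return the same people; they differ only in how much of the graph they touch, which depends on how many people and locations the database holds.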
As is typical for a declarative query language, you don’t need to specify such execution details when writing the query: the query optimizer automatically chooses the strategy that is predicted to be the most efficient, so you can get on with writing the rest of your application.
作为声明式查询语言的典型特点,写查询时无需指定执行细节:查询优化器会自动选择预测为最有效的策略,因此您可以继续编写应用程序的其他部分。
Graph Queries in SQL
Example 2-2 suggested that graph data can be represented in a relational database. But if we put graph data in a relational structure, can we also query it using SQL?
示例2-2指出图形数据可以在关系型数据库中表示。但是如果将图形数据放入关系结构中,我们是否也可以使用SQL查询它?
The answer is yes, but with some difficulty. In a relational database, you usually know in advance which joins you need in your query. In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for—that is, the number of joins is not fixed in advance.
答案是肯定的,但需要一些困难。在关系数据库中,你通常已经知道了你查询需要哪些连接。而在图形查询中,你可能需要遍历不同数量的边缘,才能找到你想要的顶点——也就是说,连接的数量在事先是不固定的。
In our example, that happens in the () -[:WITHIN*0..]-> () rule in the Cypher query. A person’s LIVES_IN edge may point at any kind of location: a street, a city, a district, a region, a state, etc. A city may be WITHIN a region, a region WITHIN a state, a state WITHIN a country, etc. The LIVES_IN edge may point directly at the location vertex you’re looking for, or it may be several levels removed in the location hierarchy.
在我们的示例中,这发生在Cypher查询中的()-[:WITHIN*0..]->()规则中。一个人的“居住在”边缘可能指向任何类型的位置:一条街道,一个城市,一个区域,一个地区,一个州等等。一个城市可能在一个地区内,在一个地区内在一个州内,在一个州内在一个国家内等等。 “居住在”边缘可以直接指向您正在查找的位置顶点,也可以在位置层次结构中被移除数个级别。
In Cypher, :WITHIN*0.. expresses that fact very concisely: it means “follow a WITHIN edge, zero or more times.” It is like the * operator in a regular expression.
在Cypher中,:WITHIN*0..非常简洁地表达了这一事实:它的意思是“沿着WITHIN边前进,零次或多次”。这类似于正则表达式中的*运算符。
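As a rough illustration of “zero or more times,” the sketch below walks a chain of WITHIN edges and includes the starting vertex itself (the zero-hop case). It assumes, for simplicity, that each location has at most one enclosing location; the data and the function name within_star are hypothetical.

```python
# child -> parent along WITHIN edges; illustrative data, one parent per location
WITHIN = {"london": "england", "england": "uk", "uk": "europe"}

def within_star(location):
    """Yield the location itself (zero hops), then each enclosing location in turn."""
    while location is not None:
        yield location
        location = WITHIN.get(location)

from_london = list(within_star("london"))  # london, then england, uk, europe
from_europe = list(within_star("europe"))  # only europe itself: the zero-hop match
```

The zero-hop case matters for the example query: a LIVES_IN edge that points directly at the Europe vertex should still match.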
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called recursive common table expressions (the WITH RECURSIVE syntax). Example 2-5 shows the same query (finding the names of people who emigrated from the US to Europe) expressed in SQL using this technique (supported in PostgreSQL, IBM DB2, Oracle, and SQL Server). However, the syntax is very clumsy in comparison to Cypher.
自从1999年的SQL版本开始,可以使用称为“带递归的公共表达式”(WITH RECURSIVE语法)的东西来表达查询中的可变长度遍历路径这个思想。例如,示例2-5展示了通过使用这种技术(在PostgreSQL、IBM DB2、Oracle和SQL Server中支持)表达的查询——找到移民从美国到欧洲的人的姓名。但是,与Cypher相比,语法非常笨拙。
Example 2-5. The same query as Example 2-4 , expressed in SQL using recursive common table expressions
WITH RECURSIVE

  -- in_usa is the set of vertex IDs of all locations within the United States
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
  ),

  -- in_europe is the set of vertex IDs of all locations within Europe
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
        WHERE edges.label = 'within'
  ),

  -- born_in_usa is the set of vertex IDs of all people born in the US
  born_in_usa(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in'
  ),

  -- lives_in_europe is the set of vertex IDs of all people living in Europe
  lives_in_europe(vertex_id) AS (
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in'
  )

SELECT vertices.properties->>'name'
FROM vertices
-- join to find those people who were both born in the US *and* live in Europe
JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
-
First find the vertex whose name property has the value "United States", and make it the first element of the set of vertices in_usa.
首先找到name属性值为"United States"的顶点,并将其作为顶点集in_usa的第一个元素。
-
Follow all incoming within edges from vertices in the set in_usa, and add the vertices at the other end to the same set, until all incoming within edges have been visited.
从集合in_usa中的顶点出发,沿着所有within入边前进,并将另一端的顶点加入同一集合,直到所有within入边都已访问。
-
Do the same starting with the vertex whose name property has the value "Europe", and build up the set of vertices in_europe.
从name属性值为"Europe"的顶点开始执行相同的操作,构建顶点集in_europe。
-
For each of the vertices in the set in_usa, follow incoming born_in edges to find people who were born in some place within the United States.
对于集合in_usa中的每个顶点,沿着born_in入边查找出生在美国境内某地的人。
-
Similarly, for each of the vertices in the set in_europe, follow incoming lives_in edges to find people who live in Europe.
类似地,对于集合in_europe中的每个顶点,沿着lives_in入边查找居住在欧洲的人。
-
Finally, intersect the set of people born in the USA with the set of people living in Europe, by joining them.
最后,通过连接(join)求出在美国出生的人的集合与居住在欧洲的人的集合的交集。
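The steps above can be tried out with Python’s built-in sqlite3 module, which also accepts WITH RECURSIVE. This is a sketch under several assumptions: SQLite stands in for PostgreSQL, the name is kept in a plain text column instead of a json properties column, and the European locations and Alain’s edges are illustrative data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (vertex_id integer PRIMARY KEY, name text);
    CREATE TABLE edges (
        edge_id integer PRIMARY KEY, tail_vertex integer,
        head_vertex integer, label text
    );
""")
conn.executemany("INSERT INTO vertices VALUES (?, ?)", [
    (1, "United States"), (2, "Idaho"), (3, "Europe"), (4, "United Kingdom"),
    (5, "London"), (6, "France"), (7, "Lucy"), (8, "Alain"),
])
conn.executemany(
    "INSERT INTO edges (tail_vertex, head_vertex, label) VALUES (?, ?, ?)", [
        (2, 1, "within"), (4, 3, "within"), (5, 4, "within"), (6, 3, "within"),
        (7, 2, "born_in"), (7, 5, "lives_in"),   # Lucy: Idaho -> London
        (8, 6, "born_in"), (8, 5, "lives_in"),   # Alain: France -> London
    ])

QUERY = """
WITH RECURSIVE
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE name = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
       WHERE edges.label = 'within'
  ),
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE name = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
       WHERE edges.label = 'within'
  ),
  born_in_usa(vertex_id) AS (
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
       WHERE edges.label = 'born_in'
  ),
  lives_in_europe(vertex_id) AS (
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
       WHERE edges.label = 'lives_in'
  )
SELECT vertices.name FROM vertices
  JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id
  JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id
"""

emigrants = [row[0] for row in conn.execute(QUERY)]
```

With this data only Lucy satisfies both conditions; Alain lives in Europe but was not born in the United States.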
If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.
如果同样的查询在一种查询语言中可以写成4行,而在另一种语言中需要29行,那只说明不同的数据模型是为满足不同的用例而设计的。选择适合您应用的数据模型非常重要。
Triple-Stores and SPARQL
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. It is nevertheless worth discussing, because there are various tools and languages for triple-stores that can be valuable additions to your toolbox for building applications.
三元组存储模型基本上等同于属性图模型,只是使用不同的词语来描述相同的思想。然而,它仍然值得讨论,因为有各种用于三元组存储的工具和语言,它们可以成为构建应用程序的工具箱中有价值的补充。
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
在三元组存储中,所有信息都以非常简单的三部分语句(主语,谓语,宾语)的形式存储。例如,在三元组(Jim,likes,bananas)中,Jim是主语,likes是谓语(动词),bananas是宾语。
The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:
一个三元组的主语相当于图中的一个顶点。宾语则是两种情况之一:
-
A value in a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. For example, (lucy, age, 33) is like a vertex lucy with properties {"age":33}.
原始数据类型的值,例如字符串或数字。在这种情况下,三元组的谓语和宾语相当于主语顶点上某个属性的键和值。例如,(lucy, age, 33)就像一个具有属性{"age":33}的顶点lucy。
-
Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in ( lucy , marriedTo , alain ) the subject and object lucy and alain are both vertices, and the predicate marriedTo is the label of the edge that connects them.
图形中的另一个顶点。在这种情况下,谓词是图形中的一条边,主语是尾顶点,宾语是头顶点。例如,在(lucy,marriedTo,alain)中,主语和宾语lucy和alain都是顶点,谓词marriedTo是连接它们的边的标签。
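The correspondence between triples and a property graph can be sketched as a small conversion function. The Ref wrapper used to mark an object as a vertex reference is a hypothetical helper, as is the function name; a real triple-store distinguishes the two cases through its own datatypes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """Marks an object that refers to another vertex (hypothetical helper)."""
    vertex: str

triples = [
    ("lucy", "age", 33),                  # object is a primitive value
    ("lucy", "marriedTo", Ref("alain")),  # object is another vertex
    ("alain", "name", "Alain"),
]

def to_property_graph(triples):
    """Split triples into vertex properties and labeled edges."""
    vertices, edges = {}, []
    for subject, predicate, obj in triples:
        vertices.setdefault(subject, {})
        if isinstance(obj, Ref):
            # predicate is the edge label; subject is the tail, object the head
            vertices.setdefault(obj.vertex, {})
            edges.append((subject, predicate, obj.vertex))
        else:
            # predicate and object become a property key and value
            vertices[subject][predicate] = obj
    return vertices, edges

vertices, edges = to_property_graph(triples)
```

After the conversion, lucy carries the property age and a marriedTo edge points from lucy to alain.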
Example 2-6 shows the same data as in Example 2-3, written as triples in a format called Turtle, a subset of Notation3 (N3) [39].
示例2-6展示了与示例2-3相同的数据,以称为Turtle的格式编写的三元组的形式表示,它是Notation3(N3)的子集。[39]。
Example 2-6. A subset of the data in Figure 2-5 , represented as Turtle triples
@prefix : <urn:example:>.
_:lucy     a       :Person.
_:lucy     :name   "Lucy".
_:lucy     :bornIn _:idaho.
_:idaho    a       :Location.
_:idaho    :name   "Idaho".
_:idaho    :type   "state".
_:idaho    :within _:usa.
_:usa      a       :Location.
_:usa      :name   "United States".
_:usa      :type   "country".
_:usa      :within _:namerica.
_:namerica a       :Location.
_:namerica :name   "North America".
_:namerica :type   "continent".
In this example, vertices of the graph are written as _:someName. The name doesn’t mean anything outside of this file; it exists only because we otherwise wouldn’t know which triples refer to the same vertex. When the predicate represents an edge, the object is a vertex, as in _:idaho :within _:usa. When the predicate is a property, the object is a string literal, as in _:usa :name "United States".
在这个例子中,图形的顶点被写成_:someName。这个名字并没有任何意义,存在的唯一目的是因为我们不知道哪些三元组引用同一个顶点。当谓词表示边时,对象是顶点,例如_:idaho :within _:usa。当谓词是属性时,对象是一个字符串文字,例如_:usa :name "United States"。
It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use semicolons to say multiple things about the same subject. This makes the Turtle format quite nice and readable: see Example 2-7 .
重复地反复讲同一个主题相当乏味,但幸运的是,你可以使用分号来表达关于同一个主题的多个事情。这使得Turtle格式非常好读:见示例2-7。
Example 2-7. A more concise way of writing the data in Example 2-6
@prefix : <urn:example:>.
_:lucy     a :Person;   :name "Lucy";          :bornIn _:idaho.
_:idaho    a :Location; :name "Idaho";         :type "state";     :within _:usa.
_:usa      a :Location; :name "United States"; :type "country";   :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".
The semantic web
If you read more about triple-stores, you may get sucked into a maelstrom of articles written about the semantic web . The triple-store data model is completely independent of the semantic web—for example, Datomic [ 40 ] is a triple-store that does not claim to have anything to do with it. vii But since the two are so closely linked in many people’s minds, we should discuss them briefly.
如果你更多地了解三元存储,你可能会陷入一片关于语义网的文章漩涡中。三元存储数据模型完全独立于语义网,例如,Datomic 是一个三元存储,它并不声称与语义网有任何关系。但由于许多人将两者联系得如此之紧密,我们应该简要讨论一下它们。
The semantic web is fundamentally a simple and reasonable idea: websites already publish information as text and pictures for humans to read, so why don’t they also publish information as machine-readable data for computers to read? The Resource Description Framework (RDF) [ 41 ] was intended as a mechanism for different websites to publish data in a consistent format, allowing data from different websites to be automatically combined into a web of data —a kind of internet-wide “database of everything.”
语义网基本上是一个简单而合理的想法:网站已经发布了文本和图片供人类阅读,那么为什么它们不也将信息发布为机器可读的数据供计算机阅读?资源描述框架(RDF)旨在成为不同网站以一致的格式发布数据的机制,允许来自不同网站的数据自动组合成数据网络 - 种互联网范围内的“万物数据库”。
Unfortunately, the semantic web was overhyped in the early 2000s but so far hasn’t shown any sign of being realized in practice, which has made many people cynical about it. It has also suffered from a dizzying plethora of acronyms, overly complex standards proposals, and hubris.
不幸的是,语义Web在2000年代初被过度夸大,但到目前为止还没有在实践中得到实现的迹象,这使得许多人对其持怀疑态度。它也遭受了令人眼花缭乱的缩写词、过于复杂的标准提案和傲慢的打击。
However, if you look past those failings, there is also a lot of good work that has come out of the semantic web project. Triples can be a good internal data model for applications, even if you have no interest in publishing RDF data on the semantic web.
然而,如果你去掉那些缺点,语义网项目也有许多好的工作成果。即使您不想在语义网上发布RDF数据,三元组可以成为应用程序的良好内部数据模型。
The RDF data model
The Turtle language we used in Example 2-7 is a human-readable format for RDF data. Sometimes RDF is also written in an XML format, which does the same thing much more verbosely—see Example 2-8 . Turtle/N3 is preferable as it is much easier on the eyes, and tools like Apache Jena [ 42 ] can automatically convert between different RDF formats if necessary.
我们在示例2-7中使用的Turtle语言是一种人类可读的RDF数据格式。有时RDF也以XML格式书写,它表达相同的内容,但冗长得多,参见示例2-8。Turtle/N3更可取,因为它更加易读,而且Apache Jena [42] 之类的工具可以在必要时自动在不同的RDF格式之间转换。
Example 2-8. The data of Example 2-7 , expressed using RDF/XML syntax
<rdf:RDF xmlns="urn:example:"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <Location rdf:nodeID="idaho">
    <name>Idaho</name>
    <type>state</type>
    <within>
      <Location rdf:nodeID="usa">
        <name>United States</name>
        <type>country</type>
        <within>
          <Location rdf:nodeID="namerica">
            <name>North America</name>
            <type>continent</type>
          </Location>
        </within>
      </Location>
    </within>
  </Location>

  <Person rdf:nodeID="lucy">
    <name>Lucy</name>
    <bornIn rdf:nodeID="idaho"/>
  </Person>
</rdf:RDF>
RDF has a few quirks due to the fact that it is designed for internet-wide data exchange. The subject, predicate, and object of a triple are often URIs. For example, a predicate might be a URI such as <http://my-company.com/namespace#within> or <http://my-company.com/namespace#lives_in>, rather than just WITHIN or LIVES_IN. The reasoning behind this design is that you should be able to combine your data with someone else’s data, and if they attach a different meaning to the word within or lives_in, you won’t get a conflict because their predicates are actually <http://other.org/foo#within> and <http://other.org/foo#lives_in>.
由于RDF旨在进行互联网范围的数据交换,因此它有一些小问题。一个三元组的主语、谓语和宾语通常是URI,例如,谓语可能是一个URI,例如<http://my-company.com/namespace#within>或<http://my-company.com/namespace#lives_in>,而不仅仅是WITHIN或LIVES_IN。这种设计的原因是您应该能够将您的数据与其他人的数据组合在一起,如果他们将within或lives_in这个词附加到不同的意义,您将不会遇到冲突,因为他们的谓语实际上是<http://other.org/foo#within>和<http://other.org/foo#lives_in>。
The URL <http://my-company.com/namespace> doesn’t necessarily need to resolve to anything; from RDF’s point of view, it is simply a namespace. To avoid potential confusion with http:// URLs, the examples in this section use non-resolvable URIs such as urn:example:within. Fortunately, you can just specify this prefix once at the top of the file, and then forget about it.
URL <http://my-company.com/namespace>从RDF的角度来看,并不一定需要解析任何内容,它只是一个命名空间。为了避免与http://网址潜在的混淆,本节中的示例使用不可解析的URI,如urn:example:within。幸运的是,您只需在文件顶部指定此前缀,然后忘记它即可。
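To see why namespaced predicates avoid conflicts, consider this small sketch of prefix expansion. The function name expand and the dictionary representation of prefix declarations are assumptions for illustration, not part of any RDF library.

```python
def expand(prefixes, term):
    """Expand 'prefix:local' into a full URI using the declared prefixes."""
    prefix, _, local = term.partition(":")
    return prefixes[prefix] + local

# Each dataset declares its own (empty) prefix once, like @prefix in Turtle.
mine = {"": "urn:example:"}
theirs = {"": "http://other.org/foo#"}

my_within = expand(mine, ":within")
their_within = expand(theirs, ":within")
```

Both datasets write :within, but the expanded URIs differ, so combining the two sets of triples never confuses one predicate with the other.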
The SPARQL query language
SPARQL is a query language for triple-stores using the RDF data model [ 43 ]. (It is an acronym for SPARQL Protocol and RDF Query Language , pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar [ 37 ].
SPARQL是使用RDF数据模型的三元存储的查询语言[43]。(它是SPARQL协议和RDF查询语言的缩写,发音为“sparkle”)。它早于Cypher,并且由于Cypher的模式匹配是从SPARQL借鉴的,它们看起来非常相似[37]。
The same query as before—finding people who have moved from the US to Europe—is even more concise in SPARQL than it is in Cypher (see Example 2-9 ).
在SPARQL中,与之前相同的查询——查找已经从美国移居到欧洲的人——甚至比Cypher更加简洁(参见示例2-9)。
Example 2-9. The same query as Example 2-4 , expressed in SPARQL
PREFIX : <urn:example:>

SELECT ?personName WHERE {
  ?person :name ?personName.
  ?person :bornIn  / :within* / :name "United States".
  ?person :livesIn / :within* / :name "Europe".
}
The structure is very similar. The following two expressions are equivalent (variables start with a question mark in SPARQL):
结构非常相似。以下两个表达式是等效的(在SPARQL中,变量以问号开头):
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location)   # Cypher
?person :bornIn / :within* ?location.                   # SPARQL
Because RDF doesn’t distinguish between properties and edges but just uses predicates for both, you can use the same syntax for matching properties. In the following expression, the variable usa is bound to any vertex that has a name property whose value is the string "United States":
因为RDF不区分属性和边缘,而只是使用谓词来表示两者,所以您可以使用相同的语法来匹配属性。在以下表达式中,变量usa绑定到任何具有名称属性以及其值为字符串“美国”的顶点:
(usa {name:'United States'})   # Cypher
?usa :name "United States".    # SPARQL
SPARQL is a nice query language—even if the semantic web never happens, it can be a powerful tool for applications to use internally.
SPARQL是一种很好的查询语言,即使语义网从未实现,它也可以成为应用内部使用的强大工具。
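The property path in Example 2-9 can be read as three steps chained together: follow :bornIn once, follow :within zero or more times, then follow :name. The sketch below evaluates that path over a set of triples in plain Python. The data, the helper names, and the simplification that vertex identifiers and literals are both plain strings are assumptions for illustration; a real SPARQL engine works quite differently.

```python
TRIPLES = {  # (subject, predicate, object); illustrative data
    ("lucy", "name", "Lucy"), ("lucy", "bornIn", "idaho"), ("lucy", "livesIn", "london"),
    ("alain", "name", "Alain"), ("alain", "bornIn", "france"), ("alain", "livesIn", "london"),
    ("idaho", "name", "Idaho"), ("idaho", "within", "usa"),
    ("usa", "name", "United States"),
    ("london", "name", "London"), ("london", "within", "uk"),
    ("uk", "name", "United Kingdom"), ("uk", "within", "europe"),
    ("france", "name", "France"), ("france", "within", "europe"),
    ("europe", "name", "Europe"),
}

def step(nodes, predicate):
    """Follow one predicate from every node in the set."""
    return {o for s, p, o in TRIPLES if p == predicate and s in nodes}

def star(nodes, predicate):
    """Follow a predicate zero or more times (the * in a property path)."""
    seen = set(nodes)
    frontier = set(nodes)
    while frontier:
        frontier = step(frontier, predicate) - seen
        seen |= frontier
    return seen

def matches(person, first, place_name):
    """True if the path  first / within* / name  leads from person to place_name."""
    return place_name in step(star(step({person}, first), "within"), "name")

people = {s for s, p, _ in TRIPLES if p in ("bornIn", "livesIn")}
result = sorted(
    name
    for person in people
    if matches(person, "bornIn", "United States") and matches(person, "livesIn", "Europe")
    for name in step({person}, "name")
)
```

Because properties and edges are both just predicates, the same step function serves for :bornIn, :within, and :name.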
The Foundation: Datalog
Datalog is a much older language than SPARQL or Cypher, having been studied extensively by academics in the 1980s [ 44 , 45 , 46 ]. It is less well known among software engineers, but it is nevertheless important, because it provides the foundation that later query languages build upon.
Datalog比SPARQL或Cypher语言更为古老,已经在20世纪80年代被学者广泛研究过。虽然在软件工程师中不是很知名,但它仍然非常重要,因为后来的查询语言都建立在它的基础上。
In practice, Datalog is used in a few data systems: for example, it is the query language of Datomic [ 40 ], and Cascalog [ 47 ] is a Datalog implementation for querying large datasets in Hadoop. viii
在实践中,Datalog 在一些数据系统中使用:例如,它是 Datomic 的查询语言 [40],Cascalog是一个用于在 Hadoop 中查询大数据集的 Datalog 实现 [47]。
Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object). Example 2-10 shows how to write the data from our example in Datalog.
Datalog的数据模型类似于三元组存储模型,但稍微推广一下。我们将三元组的写法 (主语,谓语,宾语) 转换成谓语(主语,宾语)的形式。例如,示例 2-10 展示了如何在 Datalog 中写入我们示例中的数据。
Example 2-10. A subset of the data in Figure 2-5 , represented as Datalog facts
name(namerica, 'North America').
type(namerica, continent).

name(usa, 'United States').
type(usa, country).
within(usa, namerica).

name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).

name(lucy, 'Lucy').
born_in(lucy, idaho).
Now that we have defined the data, we can write the same query as before, as shown in Example 2-11 . It looks a bit different from the equivalent in Cypher or SPARQL, but don’t let that put you off. Datalog is a subset of Prolog, which you might have seen before if you’ve studied computer science.
现在我们已经定义了数据,我们可以像示例2-11那样编写与之前相同的查询。它看起来与Cypher或SPARQL中的等效语句有些不同,但不要让它吓到你。Datalog是Prolog的一个子集,如果你学过计算机科学,可能已经见过它了。
Example 2-11. The same query as Example 2-4 , expressed in Datalog
within_recursive(Location, Name) :- name(Location, Name).     /* Rule 1 */

within_recursive(Location, Name) :- within(Location, Via),    /* Rule 2 */
                                    within_recursive(Via, Name).

migrated(Name, BornIn, LivingIn) :- name(Person, Name),       /* Rule 3 */
                                    born_in(Person, BornLoc),
                                    within_recursive(BornLoc, BornIn),
                                    lives_in(Person, LivingLoc),
                                    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').
/* Who = 'Lucy'. */
Cypher and SPARQL jump in right away with MATCH or SELECT, but Datalog takes a small step at a time. We define rules that tell the database about new predicates: here, we define two new predicates, within_recursive and migrated. These predicates aren’t triples stored in the database, but instead they are derived from data or from other rules. Rules can refer to other rules, just like functions can call other functions or recursively call themselves. In this way, complex queries can be built up a small piece at a time.
Cypher和SPARQL一上来就直接使用MATCH或SELECT,而Datalog则一次只走一小步。我们定义规则来告诉数据库新谓词的存在:在这里,我们定义了两个新谓词,within_recursive和migrated。这些谓词不是存储在数据库中的三元组,而是从数据或其他规则派生出来的。规则可以引用其他规则,就像函数可以调用其他函数或递归调用自己一样。通过这种方式,可以逐步构建复杂的查询。
In rules, words that start with an uppercase letter are variables, and predicates are matched like in Cypher and SPARQL. For example, name(Location, Name) matches the triple name(namerica, 'North America') with variable bindings Location = namerica and Name = 'North America'.
规则中,以大写字母开头的单词是变量,谓词的匹配方式类似于Cypher和SPARQL。例如,name(Location,Name)与三元组name(namerica,'North America')匹配,绑定变量为Location = namerica和Name ='North America'。
A rule applies if the system can find a match for all predicates on the righthand side of the :- operator. When the rule applies, it’s as though the lefthand side of the :- was added to the database (with variables replaced by the values they matched).
当系统可以在“:-”运算符的右侧找到所有谓词的匹配时,规则适用。当规则适用时,就好像“:-”的左侧已被添加到数据库中(变量被匹配的值替换)。
One possible way of applying the rules is thus:
一种可能的规则应用方式如下:
-
name(namerica, 'North America') exists in the database, so rule 1 applies. It generates within_recursive(namerica, 'North America').
数据库中存在name(namerica, 'North America'),因此规则1适用。它生成within_recursive(namerica, 'North America')。
-
within(usa, namerica) exists in the database and the previous step generated within_recursive(namerica, 'North America'), so rule 2 applies. It generates within_recursive(usa, 'North America').
数据库中存在within(usa, namerica),并且上一步生成了within_recursive(namerica, 'North America'),因此规则2适用。它生成within_recursive(usa, 'North America')。
-
within(idaho, usa) exists in the database and the previous step generated within_recursive(usa, 'North America'), so rule 2 applies. It generates within_recursive(idaho, 'North America').
数据库中存在within(idaho, usa),并且上一步生成了within_recursive(usa, 'North America'),因此规则2适用。它生成within_recursive(idaho, 'North America')。
By repeated application of rules 1 and 2, the within_recursive predicate can tell us all the locations in North America (or any other location name) contained in our database. This process is illustrated in Figure 2-6.
通过反复应用规则1和规则2,within_recursive谓词可以告诉我们所有在我们数据库中包含的北美(或任何其他位置名称)的位置。该过程在图2-6中说明。
Now rule 3 can find people who were born in some location BornIn and live in some location LivingIn. By querying with BornIn = 'United States' and LivingIn = 'Europe', and leaving the person as a variable Who, we ask the Datalog system to find out which values can appear for the variable Who. So, finally we get the same answer as in the earlier Cypher and SPARQL queries.
现在规则3可以查找出生在BornIn某地并住在LivingIn某地的人。通过查询BornIn = 'United States'和LivingIn = 'Europe',并将人作为变量Who,我们要求Datalog系统查找哪些值可以出现在变量Who中。因此,最终我们得到与早期Cypher和SPARQL查询相同的答案。
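The repeated application of rules can be mimicked with plain Python sets. The sketch below is a naive bottom-up evaluation that keeps applying rule 2 until no new facts appear, then applies rule 3. It is not how a real Datalog engine is implemented, and the lives_in fact and the European locations are assumed data that do not appear in Example 2-10.

```python
# Facts as sets of tuples, one set per predicate.
name = {("namerica", "North America"), ("usa", "United States"), ("idaho", "Idaho"),
        ("europe", "Europe"), ("uk", "United Kingdom"), ("london", "London"),
        ("lucy", "Lucy")}
within = {("usa", "namerica"), ("idaho", "usa"), ("uk", "europe"), ("london", "uk")}
born_in = {("lucy", "idaho")}
lives_in = {("lucy", "london")}  # assumed fact, not in Example 2-10

# Rule 1: within_recursive(Location, Name) :- name(Location, Name).
within_recursive = set(name)

# Rule 2, applied repeatedly until no new facts are derived (a fixpoint).
while True:
    derived = {(loc, n)
               for loc, via in within
               for via2, n in within_recursive
               if via == via2}
    if derived <= within_recursive:
        break
    within_recursive |= derived

# Rule 3: join name, born_in, lives_in, and within_recursive on shared variables.
migrated = {(n, born, living)
            for person, n in name
            for p1, born_loc in born_in if p1 == person
            for loc1, born in within_recursive if loc1 == born_loc
            for p2, living_loc in lives_in if p2 == person
            for loc2, living in within_recursive if loc2 == living_loc}

# Query: ?- migrated(Who, 'United States', 'Europe').
who = sorted(n for n, born, living in migrated
             if born == "United States" and living == "Europe")
```

The loop derives within_recursive(idaho, 'North America') in the same order as the steps listed above, and the final query binds Who to Lucy.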
The Datalog approach requires a different kind of thinking to the other query languages discussed in this chapter, but it’s a very powerful approach, because rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
Datalog方法需要与本章中讨论的其他查询语言不同的思维方式,但它是一种非常强大的方法,因为规则可以在不同的查询中组合和重复使用。对于简单的一次性查询来说可能不太方便,但如果你的数据较为复杂,它处理起来更加从容。
Summary
Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We didn’t have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your application’s requirements.
数据模型是一个庞大的主题,在本章节中,我们简要介绍了各种不同的模型。我们没有足够的空间深入了解每个模型的所有细节,但是希望这个概述已经足够激发你对最符合你应用需求的模型的进一步了解的兴趣。
Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:
历史上,数据最初被表示为一棵大树(分层模型),但这并不适用于表示多对多关系,因此发明了关系模型来解决这个问题。最近,开发人员发现一些应用程序也不适用于关系模型。新的非关系型“NoSQL”数据存储已经分为两个主要方向。
-
Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
文档数据库专注于自包含文档的数据使用案例,文档与文档之间的关系较少。
-
Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
图形数据库朝着相反的方向发展,针对潜在的任何东西都可能相互关联的使用案例。
All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One model can be emulated in terms of another model—for example, graph data can be represented in a relational database—but the result is often awkward. That’s why we have different systems for different purposes, not a single one-size-fits-all solution.
所有三种模型(文档,关系和图形)今天都广泛使用,并且每个模型在其各自的域中都很好。一种模型可以用另一种模型来模拟 - 例如,可以用关系数据库表示图形数据 - 但结果通常是笨拙的。这就是为什么我们有不同的系统用于不同的用途,而不是单一的一刀切解决方案。
One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (handled on read).
文档数据库和图形数据库的共同点之一是它们通常不强制实施存储的数据模式,这可以使应用程序更容易适应不断变化的需求。但是,您的应用程序很可能仍然假定数据具有某种结构;只是问题是模式是否显式(在写入时强制执行)或隐含(在读取时处理)。
Each data model comes with its own query language or framework, and we discussed several examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog. We also touched on CSS and XSL/XPath, which aren’t database query languages but have interesting parallels.
每个数据模型都配有其自己的查询语言或框架,我们讨论了几个例子:SQL、MapReduce、MongoDB的聚合管道、Cypher、SPARQL和Datalog。我们也提到了CSS和XSL/XPath,它们不是数据库查询语言,但有一些有趣的相似之处。
Although we have covered a lot of ground, there are still many data models left unmentioned. To give just a few brief examples:
虽然我们已经覆盖了很多领域,但还有许多数据模型未提及。只举几个简要的例子:
-
Researchers working with genome data often need to perform sequence-similarity searches , which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database software like GenBank [ 48 ].
研究基因组数据的人常常需要进行序列相似性搜索,这意味着将一个非常长的字符串(代表DNA分子)与一个大型数据库中类似但不完全相同的字符串进行匹配。这里描述的任何一个数据库都无法处理这种类型的使用,这就是为什么研究者编写了专门的基因组数据库软件,如GenBank [48]。
-
Particle physicists have been doing Big Data–style large-scale data analysis for decades, and projects like the Large Hadron Collider (LHC) now work with hundreds of petabytes! At such a scale custom solutions are required to stop the hardware cost from spiraling out of control [ 49 ].
粒子物理学家们已经进行了大规模数据分析多年,类似于大型强子对撞机等项目现在已经使用数百个拍字节!在这样的规模下,需要定制解决方案来防止硬件成本失控[49]。
-
Full-text search is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject that we won’t cover in great detail in this book, but we’ll touch on search indexes in Chapter 3 and Part III .
全文搜索可以说是经常与数据库一起使用的一种数据模型。信息检索是一个大的专业课题,在本书中我们不会详细介绍,但我们会在第三章和第三部分中涉及搜索索引。
We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when implementing the data models described in this chapter.
我们现在必须将其放在那里。在下一章中,我们将讨论在实施本章中描述的数据模型时发挥作用的一些权衡。
Footnotes
i A term borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit’s output to another one’s input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other troubles.
电子领域常用术语。每个电路的输入和输出都有一定的阻抗(交流电阻)。当你连接一个电路的输出到另一个电路的输入时,如果两个电路的输出和输入阻抗相匹配,连接的功率传输将最大化。阻抗不匹配可能会导致信号反射和其他问题。
ii Literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you’re duplicating values that could be stored in just one place, the schema is not normalized.
关于关系模型的文献区分了几种不同的范式,但这些区别在实践中意义不大。根据经验,如果你重复存储了本可以只存储在一个位置的值,那么该模式就没有规范化。
iii At the time of writing, joins are supported in RethinkDB, not supported in MongoDB, and only supported in predeclared views in CouchDB.
在撰写本文时,RethinkDB支持连接操作,MongoDB不支持,而CouchDB仅支持在预定义视图中进行连接操作。
iv Foreign key constraints allow you to restrict modifications, but such constraints are not required by the relational model. Even with constraints, joins on foreign keys are performed at query time, whereas in CODASYL, the join was effectively done at insert time.
外键约束可以限制修改,但这种约束并非关系模型所必需。即使有约束,在外键上的联接也是在查询时执行的,而在CODASYL中,联接实际上是在插入时执行的。
v Codd’s original description of the relational model [1] actually allowed something quite similar to JSON documents within a relational schema. He called it nonsimple domains. The idea was that a value in a row doesn’t have to just be a primitive datatype like a number or a string, but could also be a nested relation (table), so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL over 30 years later.
Codd最初对关系模型的描述[1]实际上允许在关系模式中使用类似于JSON文档的东西。他称其为非简单域。其思想是,行中的值不一定只是数字或字符串这样的原始数据类型,也可以是嵌套的关系(表),因此可以把任意嵌套的树形结构作为一个值,这与30多年后添加到SQL中的JSON或XML支持非常相似。
vi IMS and CODASYL both used imperative query APIs. Applications typically used COBOL code to iterate over records in the database, one record at a time [2, 16].
IMS和CODASYL都使用命令式查询API。应用程序通常使用COBOL代码在数据库中迭代记录,一次处理一条记录[2,16]。
vii Technically, Datomic uses 5-tuples rather than triples; the two additional fields are metadata for versioning.
技术上来说,Datomic使用的是五元组而不是三元组;另外两个字段是用于版本控制的元数据。
viii Datomic and Cascalog use a Clojure S-expression syntax for Datalog. In the following examples we use a Prolog syntax, which is a little easier to read, but this makes no functional difference.
Datomic 和 Cascalog 使用Clojure的S表达式语法来处理Datalog。在以下示例中,我们使用Prolog语法,使其更易于阅读,但实际上功能并没有差异。
References
[ 1 ] Edgar F. Codd: “ A Relational Model of Data for Large Shared Data Banks ,” Communications of the ACM , volume 13, number 6, pages 377–387, June 1970. doi:10.1145/362384.362685
[1] Edgar F. Codd:“大型共享数据银行的数据关系模型”,ACM通讯,第13卷,第6期,1970年6月,页码377-387。DOI: 10.1145/362384.362685。
[ 2 ] Michael Stonebraker and Joseph M. Hellerstein: “ What Goes Around Comes Around ,” in Readings in Database Systems , 4th edition, MIT Press, pages 2–41, 2005. ISBN: 978-0-262-69314-1
[2] Michael Stonebraker和Joseph M. Hellerstein:"东西转了一圈又回来了",收录于《数据库系统读本》第四版,麻省理工学院出版社,2005年,第2-41页。ISBN: 978-0-262-69314-1。
[ 3 ] Pramod J. Sadalage and Martin Fowler: NoSQL Distilled . Addison-Wesley, August 2012. ISBN: 978-0-321-82662-6
[3] Pramod J. Sadalage 和 Martin Fowler:《NoSQL精粹》。Addison-Wesley出版社,2012年8月。ISBN:978-0-321-82662-6。
[ 4 ] Eric Evans: “ NoSQL: What’s in a Name? ,” blog.sym-link.com , October 30, 2009.
[4] Eric Evans:“NoSQL:名称中有什么?”,blog.sym-link.com,2009年10月30日。
[ 5 ] James Phillips: “ Surprises in Our NoSQL Adoption Survey ,” blog.couchbase.com , February 8, 2012.
[5] 詹姆斯·菲利普斯:“我们的NoSQL采用调查中的惊喜”,blog.couchbase.com,2012年2月8日。
[ 6 ] Michael Wagner: SQL/XML:2006 – Evaluierung der Standardkonformität ausgewählter Datenbanksysteme . Diplomica Verlag, Hamburg, 2010. ISBN: 978-3-836-64609-3
[6] 迈克尔·瓦格纳:SQL / XML:2006年 - 评估所选数据库系统的标准符合性。Diplomica出版社,汉堡,2010年。ISBN:978-3-836-64609-3。
[ 7 ] “ XML Data in SQL Server ,” SQL Server 2012 documentation, technet.microsoft.com , 2013.
[7] “SQL Server 中的 XML 数据”,SQL Server 2012 文档,technet.microsoft.com,2013年。
[ 8 ] “ PostgreSQL 9.3.1 Documentation ,” The PostgreSQL Global Development Group, 2013.
[8] “PostgreSQL 9.3.1 文档”, PostgreSQL全球开发小组,2013年。
[ 9 ] “ The MongoDB 2.4 Manual ,” MongoDB, Inc., 2013.
[9] “MongoDB 2.4手册”,MongoDB,Inc.,2013年。
[ 10 ] “ RethinkDB 1.11 Documentation ,” rethinkdb.com , 2013.
[10] “RethinkDB 1.11 文档”,rethinkdb.com,2013年。
[ 11 ] “ Apache CouchDB 1.6 Documentation ,” docs.couchdb.org , 2014.
[11] “Apache CouchDB 1.6 文档”,docs.couchdb.org,2014年。
[ 12 ] Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “ On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform ,” at ACM International Conference on Management of Data (SIGMOD), June 2013.
[12] Lin Qiao、Kapil Surlaker、Shirshanka Das等:“冲煮新鲜的Espresso:LinkedIn的分布式数据服务平台”,ACM数据管理国际会议(SIGMOD),2013年6月。
[ 13 ] Rick Long, Mark Harrington, Robert Hain, and Geoff Nicholls: IMS Primer . IBM Redbook SG24-5352-00, IBM International Technical Support Organization, January 2000.
[13] Rick Long、Mark Harrington、Robert Hain和Geoff Nicholls:《IMS入门》。IBM红皮书SG24-5352-00,IBM国际技术支持组织,2000年1月。
[ 14 ] Stephen D. Bartlett: “ IBM’s IMS—Myths, Realities, and Opportunities ,” The Clipper Group Navigator, TCG2013015LI, July 2013.
[14] Stephen D. Bartlett: “IBM的IMS——神话、现实和机遇”,The Clipper Group Navigator,TCG2013015LI,2013年7月。
[ 15 ] Sarah Mei: “ Why You Should Never Use MongoDB ,” sarahmei.com , November 11, 2013.
[15] Sarah Mei:“为什么你永远不应该使用MongoDB”,sarahmei.com,2013年11月11日。
[ 16 ] J. S. Knowles and D. M. R. Bell: “The CODASYL Model,” in Databases—Role and Structure: An Advanced Course , edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 19–56, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4
[16] J. S. Knowles 和 D. M. R. Bell: "CODASYL 模型",收录于数据库 - 角色与结构: 高级课程,由 P. M. Stocker、P. M. D. Gray 和 M. P. Atkinson 编辑,第 19 至 56 页,1984 年出版,剑桥大学出版社。ISBN: 978-0-521-25430-4。
[ 17 ] Charles W. Bachman: “ The Programmer as Navigator ,” Communications of the ACM , volume 16, number 11, pages 653–658, November 1973. doi:10.1145/355611.362534
[17] 查尔斯·W·巴赫曼: “程序员作为导航员,” ACM通讯杂志,第16卷,第11号,653-658页,1973年11月。 doi:10.1145/355611.362534
[ 18 ] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “ Architecture of a Database System ,” Foundations and Trends in Databases , volume 1, number 2, pages 141–259, November 2007. doi:10.1561/1900000002
[18] Joseph M. Hellerstein、Michael Stonebraker和James Hamilton:“数据库系统的架构”,《数据库基础和趋势》第1卷第2期,2007年11月,第141-259页。doi:10.1561/1900000002
[ 19 ] Sandeep Parikh and Kelly Stirman: “ Schema Design for Time Series Data in MongoDB ,” blog.mongodb.org , October 30, 2013.
[19] Sandeep Parikh和Kelly Stirman:“MongoDB中的时间序列数据模式设计”,blog.mongodb.org,2013年10月30日。
[ 20 ] Martin Fowler: “ Schemaless Data Structures ,” martinfowler.com , January 7, 2013.
[20] Martin Fowler:“无模式数据结构”,martinfowler.com,2013年1月7日。
[ 21 ] Amr Awadallah: “ Schema-on-Read vs. Schema-on-Write ,” at Berkeley EECS RAD Lab Retreat , Santa Cruz, CA, May 2009.
[21] Amr Awadallah:“读时模式与写时模式”,发表于Berkeley EECS RAD Lab Retreat,加利福尼亚州圣克鲁斯,2009年5月。
[ 22 ] Martin Odersky: “ The Trouble with Types ,” at Strange Loop , September 2013.
[22] Martin Odersky:“类型的麻烦”,发表于Strange Loop大会,2013年9月。
[ 23 ] Conrad Irwin: “ MongoDB—Confessions of a PostgreSQL Lover ,” at HTML5DevConf , October 2013.
[23] Conrad Irwin:“MongoDB:一个PostgreSQL爱好者的忏悔”,发表于HTML5DevConf,2013年10月。
[ 24 ] “ Percona Toolkit Documentation: pt-online-schema-change ,” Percona Ireland Ltd., 2013.
[24] “Percona Toolkit 文档:pt-online-schema-change”,Percona Ireland Ltd.,2013年。
[ 25 ] Rany Keddo, Tobias Bielohlawek, and Tobias Schmidt: “ Large Hadron Migrator ,” SoundCloud, 2013.
[25] Rany Keddo、Tobias Bielohlawek 和 Tobias Schmidt: “大型强子迁移器”,SoundCloud,2013年。
[ 26 ] Shlomi Noach: “ gh-ost: GitHub’s Online Schema Migration Tool for MySQL ,” githubengineering.com , August 1, 2016.
[26] Shlomi Noach:「gh-ost:GitHub 的 MySQL 在线模式迁移工具」,githubengineering.com,2016 年 8 月 1 日。
[ 27 ] James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “ Spanner: Google’s Globally-Distributed Database ,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.
[27] James C. Corbett,Jeffrey Dean,Michael Epstein等人: “Spanner: Google的全球分布式数据库,” 于2012年10月在第10届USENIX操作系统设计和实现研讨会(OSDI)上。
[ 28 ] Donald K. Burleson: “ Reduce I/O with Oracle Cluster Tables ,” dba-oracle.com .
[28] Donald K. Burleson: "使用Oracle Cluster表降低I/O",dba-oracle.com.
[ 29 ] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “ Bigtable: A Distributed Storage System for Structured Data ,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
[29] Fay Chang, Jeffrey Dean, Sanjay Ghemawat等人:「Bigtable:一种用于结构化数据的分布式存储系统」,发表于第七届USENIX操作系统设计与实现研讨会(OSDI),2006年11月。
[ 30 ] Bobbie J. Cochrane and Kathy A. McKnight: “ DB2 JSON Capabilities, Part 1: Introduction to DB2 JSON ,” IBM developerWorks, June 20, 2013.
[30] Bobbie J. Cochrane和Kathy A. McKnight:“DB2 JSON功能,第1部分:DB2 JSON简介”,IBM developerWorks,2013年6月20日。
[ 31 ] Herb Sutter: “ The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software ,” Dr. Dobb’s Journal , volume 30, number 3, pages 202-210, March 2005.
[31] Herb Sutter:“免费午餐已经结束:软件向并发的根本性转变”,《Dr. Dobb’s Journal》,第30卷第3期,第202-210页,2005年3月。
[ 32 ] Joseph M. Hellerstein: “ The Declarative Imperative: Experiences and Conjectures in Distributed Logic ,” Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech report UCB/EECS-2010-90, June 2010.
[32] 约瑟夫·M·黑勒斯坦: “陈述式命令: 分布式逻辑中的经验与推测”,加州大学伯克利分校电气工程与计算机科学系,技术报告UCB/EECS-2010-90,2010年6月。
[ 33 ] Jeffrey Dean and Sanjay Ghemawat: “ MapReduce: Simplified Data Processing on Large Clusters ,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.
[33] 杰弗里·迪安和桑杰·基马瓦特: “MapReduce:在大型集群上简化数据处理”,于2004年12月在第六届USENIX操作系统设计和实现研讨会(OSDI)上发表。
[ 34 ] Craig Kerstiens: “ JavaScript in Your Postgres ,” blog.heroku.com , June 5, 2013.
[34] Craig Kerstiens:「在Postgres中使用JavaScript」,blog.heroku.com,2013年6月5日。
[ 35 ] Nathan Bronson, Zach Amsden, George Cabrera, et al.: “ TAO: Facebook’s Distributed Data Store for the Social Graph ,” at USENIX Annual Technical Conference (USENIX ATC), June 2013.
[35] Nathan Bronson, Zach Amsden, George Cabrera等: “TAO:Facebook的分布式数据存储器用于社交图谱”,收录于 USENIX年度技术会议(USENIX ATC),2013年6月。
[ 36 ] “ Apache TinkerPop3.2.3 Documentation ,” tinkerpop.apache.org , October 2016.
[36] “Apache TinkerPop3.2.3 文档”,tinkerpop.apache.org,2016年10月。
[ 37 ] “ The Neo4j Manual v2.0.0 ,” Neo Technology, 2013.
[37] “Neo4j手册v2.0.0”,Neo Technology,2013年。
[ 38 ] Emil Eifrem: Twitter correspondence , January 3, 2014.
[38] Emil Eifrem:Twitter通信,2014年1月3日。
[ 39 ] David Beckett and Tim Berners-Lee: “ Turtle – Terse RDF Triple Language ,” W3C Team Submission, March 28, 2011.
[39] David Beckett和Tim Berners-Lee:“Turtle – 简洁的RDF三元组语言”,W3C团队提交,2011年3月28日。
[ 40 ] “ Datomic Development Resources ,” Metadata Partners, LLC, 2013.
[40] “Datomic开发资源,” Metadata Partners, LLC, 2013.
[ 41 ] W3C RDF Working Group: “ Resource Description Framework (RDF) ,” w3.org , 10 February 2004.
[41] W3C RDF 工作组: “资源描述框架(RDF),” w3.org,2004年2月10日。
[ 42 ] “ Apache Jena ,” Apache Software Foundation.
[42] “Apache Jena”,Apache软件基金会。
[ 43 ] Steve Harris, Andy Seaborne, and Eric Prud’hommeaux: “ SPARQL 1.1 Query Language ,” W3C Recommendation, March 2013.
[43] Steve Harris、Andy Seaborne和Eric Prud’hommeaux:“SPARQL 1.1查询语言”,W3C推荐标准,2013年3月。
[ 44 ] Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou: “ Datalog and Recursive Query Processing ,” Foundations and Trends in Databases , volume 5, number 2, pages 105–195, November 2013. doi:10.1561/1900000017
[44] Todd J. Green, Shan Shan Huang, Boon Thau Loo和Wenchao Zhou: “Datalog和递归查询处理,”数据库的基础和趋势, 第5卷第2期,第105-195页,2013年11月。 doi:10.1561 / 1900000017
[ 45 ] Stefano Ceri, Georg Gottlob, and Letizia Tanca: “ What You Always Wanted to Know About Datalog (And Never Dared to Ask) ,” IEEE Transactions on Knowledge and Data Engineering , volume 1, number 1, pages 146–166, March 1989. doi:10.1109/69.43410
[45] Stefano Ceri、Georg Gottlob和Letizia Tanca:“关于Datalog你一直想知道(却从不敢问)的一切”,《IEEE知识与数据工程汇刊》,第1卷第1期,第146-166页,1989年3月。doi:10.1109/69.43410
[ 46 ] Serge Abiteboul, Richard Hull, and Victor Vianu: Foundations of Databases . Addison-Wesley, 1995. ISBN: 978-0-201-53771-0, available online at webdam.inria.fr/Alice
[46] Serge Abiteboul,Richard Hull和Victor Vianu:数据库的基础。Addison-Wesley,1995年。 ISBN:978-0-201-53771-0,在webdam.inria.fr/Alice上可在线获取。
[ 47 ] Nathan Marz: “ Cascalog ,” cascalog.org .
[47] Nathan Marz:“Cascalog”,cascalog.org。
[ 48 ] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, et al.: “ GenBank ,” Nucleic Acids Research , volume 36, Database issue, pages D25–D30, December 2007. doi:10.1093/nar/gkm929
[48] Dennis A. Benson、Ilene Karsch-Mizrachi、David J. Lipman等人:“GenBank”,《核酸研究》第36卷数据库专题,2007年12月,页码D25-D30。DOI:10.1093/nar/gkm929。
[ 49 ] Fons Rademakers: “ ROOT for Big Data Analysis ,” at Workshop on the Future of Big Data Management , London, UK, June 2013.
[49] Fons Rademakers:“用于大数据分析的ROOT”,发表于未来大数据管理研讨会,英国伦敦,2013年6月。