BigTable A Distributed Storage System for
Structured Data
Abstract
thousands of
Bigtable is a distributed storage system for
managing structured data that is designed to
scale to a very large size: petabytes of data
across
commodity servers.
Many projects at Google store data in Bigtable,
including web indexing, Google Earth, and
Google Fi- nance. These applications place
very different demands on Bigtable, both in
terms of data size (from URLs to web pages to
satellite imagery) and latency requirements
(from backend bulk processing to real-time
data serving). Despite these varied demands,
Bigtable has successfully provided a flexible,
high-performance solution for all of these
Google products. In this paper we describe
the sim- ple data model provided by Bigtable,
which gives clients dynamic control over data
layout and format, and we de- scribe the
design and implementation of Bigtable.
Bigtable 是一个分布式存储系统,用来管理
结构化数据,被设计为可拓展到一个非常大
的规模:上千台普通机器上 PB 级别的数据。
Google 的一些项目在 Bigtable 中存储数据,
包 括 网 页 索 引 , Google Earth 和 Google
Finance。这些程序有明显不同的对 bigtable
的需求,包括数据大小和延时。Bigtable 成
功为所有 Google 产品提供了一个灵活高可
用的解决方案。在这篇文章,我们描述了这
个 bigtable 提供的简单数据模型,让客户端
可以动态控制数据分布和格式,我们描述了
Bigtable 的设计和实现。
1 Introduction
Over the last two and a half years we have
designed,
implemented, and deployed a
distributed storage system for managing
structured data at Google called Bigtable.
Bigtable is designed to reliably scale to
petabytes of data and thousands of machines.
Bigtable has achieved several goals: wide
applicability, scalability, high performance,
and high availability. Bigtable is used by more
than sixty Google products and projects,
including Google Analytics, Google Finance,
Orkut, Personalized Search, Writely, and
Google Earth. These products use Bigtable for
a variety of demanding workloads, which
range
throughput-oriented
batch-processing jobs
to latency-sensitive
serving of data to end users. The Bigtable
clusters used by these products span a wide
range of configurations, from a handful to
thousands of servers, and store up to several
hundred terabytes of data.
from
在过去的二年半,我们设计、实现并部署了
一个用来管理结构化数据的分布式存储系
统 Bigtable。Bigtable 设计可以拓展到 PB 级
别的数据和上千台机器。Bigtable 实现了以
下几个目标:广泛的适用性,拓展性,高性
能和高可用性。Bigtable 被超过 60 个 Google
产品和项目所使用,包括 Google Analytics、
Google Finance、Orkut、Personalized Search、
Writely 和 Google Earth 。 这 些 产 品 使 用
Bigtable 在不同需求的场景,包括吞吐类型
的批处理任务以及延时敏感的向终端用户
提供数据。这些产品使用 Bigtable 集群包含
大量不同的配置,从数十台到成百上千台服
务器,存储几百 TB 的数据。
In many ways, Bigtable resembles a database:
it shares many implementation strategies
with databases. Paral- lel databases [14] and
instead,
main-memory databases [13] have achieved
scalability and high performance, but Bigtable
provides a different
interface than such
systems. Bigtable does not support a full
it provides
relational data model;
clients with a simple data model
that
supports dynamic control over data layout
and format, and al- lows clients to reason
about the locality properties of the data
represented in the underlying storage. Data is
in- dexed using row and column names that
can be arbitrary strings. Bigtable also treats
data as uninterpreted strings, although clients
often serialize various forms of struc- tured
and semi-structured data into these strings.
Clients can control the locality of their data
through careful choices in their schemas.
Finally, Bigtable schema pa-
rameters let
clients dynamically control whether to serve
data out of memory or from disk.
从许多方面看,Bigtable 依然是一个数据库:
它使用了许多数据库的实现策略。并行数据
库[14]和内存数据库[13]有可拓展性和高性
能,但是 Bigtable 提供了一个不同的接口。
Bigtable 不支持一个完整的关系数据模型,
取而代之,它提供一个简单数据模型的客户
端,支持动态控制数据分布和格式,并让客
户端推断数据分布属性。数据使用任意字符
串的行名和列名来索引。Bigtable 将数据看
作不解析的字符串,客户端经常将不同的结
构化或半结构化数据序列化为这些字符串。
客户端可以通过小心选择他们的 schema 来
控制数据分布。最后,Bigtable schema 参数
让 客 户 端 动 态 控 制 不 管 serve data out of
memory or from disk。
Section 2 describes the data model in more
detail, and Section 3 provides an overview of
the client API. Sec- tion 4 briefly describes the
underlying Google infrastruc- ture on which
Bigtable depends. Section 5 describes the
fundamentals of the Bigtable implementation,
and Sec-
the
tion 6 describes some of
refinements
that we made to improve
Bigtable’s performance. Section 7 provides
measurements of Bigtable’s performance. We
describe several examples of how Bigtable is
used at Google in Section 8, and discuss some
lessons we
and
supporting Bigtable in Section 9. Fi- nally,
Section 10 describes
related work, and
Section 11 presents our conclusions.
designing
learned
in
Section 2 更 具 体 的 描 述 了 数 据 模 型 ,
Section3 提供了一个关于 client API 的概述。
Setion4 简要的描述了Bigtable 依赖的 Google
基础设施。Section5 描述 Bigtable 的实现。
Section6 描述了一些我们改进 Bigtable 性能
的一些方法。Section7 提供了对 Bigtable 性
能的测试。在 Section8 我们介绍了 Google
如何使用 Bigtable 的例子。在 Section9 我们
讨论一些我们在设计和支持 Bigtable 学到的
经 验 。 最 后 ,section10 介 绍 相 关 工 作 。
section11 展示我们的总结。
2 Data Model
A Bigtable is a sparse, distributed, persistent
multidimensional sorted map. The map is
indexed by a row key, column key, and a
timestamp; each value in the map is an
uninterpreted array of bytes.
(row:string, column:string,
string
time:int64) →
We settled on this data model after examining
a variety of potential uses of a Bigtable-like
system. As one con- crete example that drove
some of our design decisions, suppose we
want to keep a copy of a large collection of
web pages and related information that could
be used by many different projects; let us call
this particular
In
Webtable, we would use URLs as row keys,
various aspects of web pages as column
names, and store the contents of the web
the Webtable.
table
pages in the contents: column under the
timestamps when they were fetched, as
illustrated in Figure 1.
一个 Bigtable 是一个单纯的分布式的提供了
多维排序的 MAP,这个 map 使用行 key,列
key 和一个时间戳来索引,map 中的每个值
是一个不解释的字节数组。
(row: string, column:string,
string
time:int64)
->
我们在考察了一些可能会使用 Bigtable 相似
系统后决定这个数据模型。作为了一个具体
的例子,推动我们的一些设计决策,假设我
们想保存可能被许多不同项目使用的大量
集 合 的 网 页 和 相 关 信 息 , 假 设 这 个 表 叫
Webtable。在 Webtable,我们可以使用 URL
作为 row key,网页的不同属性作为 column
key,存储网页内容在 contents 中。
ROWS
The row keys in a table are arbitrary strings
(currently up to 64KB in size, although 10-100
bytes is a typical size for most of our users).
Every read or write of data under a single row
key is atomic (regardless of the number of
different columns being read or written in the
row), a design decision that makes it easier
for clients to reason about
the system’s
behavior
in the presence of concurrent
updates to the same row.
表格的行 key 是可变字符串(当前最多是
64KB,尽管 10 到 100 字节是我们大多数用
户的典型长度)。每个对单个行 key 下的数
据读写都是原子的(不管这行有多少不同的
列被读或写),这让客户端更容易相信系统
在并发更新相同行的行为。
Bigtable maintains data in lexicographic order
by row key. The row range for a table is
dynamically partitioned. Each row range is
called a tablet, which is the unit of dis-
tribution and load balancing. As a result,
reads of short row ranges are efficient and
typically require communi- cation with only a
small number of machines. Clients can exploit
this property by selecting their row keys so
that they get good locality for their data
accesses. For example, in Webtable, pages in
the same domain are grouped together into
contiguous rows by reversing the hostname
components of the URLs. For example, we
store data for maps.google.com/index.html
under the key com.google.maps/index.html.
Storing pages from the same domain near
each other makes some host and domain
analyses more efficient.
Bigtable 使用 row key 的字典序来管理数据。
一个 table 的列的范围是动态分区的。每个
列的范围叫作 tablet,tablet 是分布和负载
均衡的单位。作为结果,对于列范围较小的
读操作是高效的,通常只要与很少数目的机
器交互。客户端可以通过选择他们的 row
key 来利用这个属性,使它们得到对于他们
数 据 访 问 的 好 的 locality 。 举 例 来 说 , 在
Webtable,相同域名下的网页通过对 URL 的
hostname 取反操作,使其分组到连续行。
对相同域名的网页相邻存储让一些 host 和
domain 分析更高效。
Column Families
Column keys are grouped into sets called
column families, which form the basic unit of
access control. All data stored in a column
family is usually of
the same type (we
compress data in the same column family
together). A column family must be created
before data can be stored under any column
key in that family; after a family has been
created, any column key within the family can
be used. It is our intent that the number of
distinct column families in a table be small (in
the hundreds at most), and that families
rarely change during operation. In contrast, a
table may have an unbounded number of
columns.
column keys 分组到称为 column family 的
集合中,column family 是访问控制的基本单
位。所有存储在 column family 的数据经常是
相同类型的(我们压缩相同 column family
的数据到一块)。一个 column family 一定要
在该 family 中的 column 的数据存储之前创
建。在一个 column family 被创建后,任意相
同 family 的 column key 可以被使用。我们希
望不同的 column family 个数比较小(最多在
百个量级)。family 在操作期间尽量少的变更。
相对的,一个 table 可以有无上限的列。
column-family level. In our Webtable example,
these controls allow us to manage several
different types of applications: some that add
new base data, some that read the base data
and create derived column families, and some
that are only allowed to view existing data
(and possibly not even to view all of the
existing families for privacy reasons).
访问控制以及磁盘内存都在 column-family
级别统计。在我们的 Webtable 例子中,这
些控制允许我们管理一些不同类型的程序:
一些添加新的基础数据,一些读取基础数据
创建衍生 column family,一些只允许看到已
经存在数据(可能因为内部原因,不能看到
所有数据)。
A column key is named using the following
syntax: family:qualifier. Column family names
must be print- able, but qualifiers may be
arbitrary strings. An exam- ple column family
for the Webtable is language, which stores
the language in which a web page was written.
We use only one column key in the language
family, and it stores each web page’s language
ID. Another useful col- umn family for this
table is anchor; each column key in this family
represents a single anchor, as shown in Fig-
ure 1. The qualifier is the name of the
referring site; the cell contents is the link text.
一 个 column key 使 用 如 下 格 式 命 名 :
family:qualifier。column family 名一定要是
可打印,但是 qualifiers 可以是任意字符。
Webtable 中 一 个 示 例 column family 是
language,保存一个网页的不同语言版本。
我 们 只 使 用 一 个 column key 在 language
family,它保存每个网页的语言的 ID。另外有
用的 column family 是 anchor,这个 family
下 的 每 个 column key 代 表 一 个 anchor 。
qualifier 是引用站点的命名,这个单元的内
容是连接文字。
Timestamps
in a Bigtable can contain multiple
Each cell
versions of the same data; these versions are
indexed by timestamp. Bigtable timestamps
are 64-bit integers. They can be as- signed by
Bigtable, in which case they represent “real
time”
in microseconds, or be explicitly
assigned by client applications. Applications
that need to avoid collisions must generate
unique timestamps
themselves. Different
versions of a cell are stored in decreasing
timestamp or- der, so that the most recent
versions can be read first.
Bigtable 的每个 cell 可以包含多个版本的相
同 数 据 , 这 些 版 本 使 用 时 间 戳 来 索 引 。
Bigtable 时间戳是 64 位数字。他们可以被
Bigtable 分配,代表微秒为单位的真实时间,
或者由客户端分配。程序要避免冲突,必须
生成不同的时间戳。不同版本的 cell 都按照
时间下降的顺序存储,这样最近版本的数据
可以被每一个读到。
Access control and both disk and memory
account-
the
performed
are
ing
at
To make the management of versioned data
less
two
per-column-family settings that tell Bigtable
onerous,
support
we
to garbage-collect cell versions automatically.
The client can specify either that only the last
n versions of a cell be kept, or that only
new-enough versions be kept (e.g., only keep
values that were written in the last seven
days).
为了让对版本数据管理不太沉重,我们支持
两 种 per-column-family 设 置, 让 Bigtable
自动垃圾回收 cell 的版本。客户端可以指定
只保留最近 n 个版本或者只有足够新的版
本被保留(例如,只保存最近七天写下的版
本)。
In our Webtable example, we set
the
timestamps of the crawled pages stored in
the contents: column to the times at which
these page versions were actually crawled.
The garbage-collection mechanism described
above lets us keep only the most recent three
versions of every page.
在 我 们 的 Webtable 的 例 子 中 , 我 们 在
contents 中设置了网页被抓取的时间戳,垃
圾回收机制让我们只保留最近三天的版本。
3 API
functions
The Bigtable API provides
for
creating and deleting tables and column
families.
for
changing cluster, table, and column family
metadata, such as access control rights.
It also provides
functions
Bigtable API 提 供 了 像 创 建 、 删 除 表 格 和
column family 的功能。它也提供了像修改集
群,表格和 column family 元数据的功能,例
如访问控制权限。
Client applications can write or delete values
in Bigtable,
look up values from individual
rows, or iterate over a subset of the data in a
table. Figure 2 shows C++ code that uses a
RowMutation abstraction to per-
form a
series of updates.
(Irrelevant details were
elided to keep the example short.) The call to
Apply performs an atomic mutation to the
Webtable:
to
www.cnn.com and deletes a different anchor.
anchor
it
adds
one
客户程序可以写或者删除在 Bigtable 的值,
从不同的 rows 中查找值,或者在一个 table
中的数据子集迭代。
row. Clients
Figure 3 shows C++ code that uses a Scanner
abstraction to iterate over all anchors in a
particular
can iterate over
multiple column families, and there are
several mechanisms for limiting the rows,
columns, and timestamps produced by a scan.
For ex- ample, we could restrict the scan
above to only produce anchors whose
columns match
expression
anchor:*.cnn.com, or to only produce anchors
whose timestamps fall within ten days of the
current time.
regular
the
客户端可以迭代多个 column family,并有一
些机制来限制一个扫描产生的行、列和时间
戳。举例来说,我们可以限制扫描只产生
column 匹配 anchor:*.cnn.com 的列,或者只
产生时间戳在十天之内的。
First, Bigtable
Bigtable supports several other features that
allow the user to manipulate data in more
complex ways.
supports
single-row transactions, which can be used to
perform atomic read-modify-write sequences
on data stored under a single row key.
Bigtable does not currently support general
transactions across row keys, al- though it
provides an interface for batching writes
across row keys at
the clients. Second,
Bigtable allows cells to be used as integer
counters. Finally, Bigtable sup- ports the
execution of client-supplied scripts in the ad-
dress spaces of the servers. The scripts are
written in a language developed at Google for
processing data called Sawzall [28]. At the
moment, our Sawzall-based API does not
allow client scripts to write back into Bigtable,
but
it does allow various forms of data
transformation, filtering based on arbitrary
expressions, and summariza- tion via a variety
of operators.
Bigtable 支持一些其它特性:允许用户使用
更复杂的方法管理数据。首先,Bigtable 支
持单行的事务,这可以用来执行原子的读-
修改-写 操作。Bigtable 现在不支持传统的
跨行事务,尽管它在客户端提供了一个修改
跨行数据的批量写接口。第二,Bigtable 允
许 cless 可以作为计数器使用。最后,Bigtable
支持在服务器的地址空间运行客户端提供
的脚本。这个脚本使用 google 开发的用来处
理数据的语言 Sawzall。当前,我们的 Sawzall
为基础的 API 不允许客户脚本将数据写回
Bigtable,但是它支持不同形式的数据变换,
任意表达式的过滤和通过大量操作的聚合。
Bigtable can be used with MapReduce [12], a
frame- work for running large-scale parallel
computations de- veloped at Google. We have
written a set of wrappers that allow a Bigtable
to be used both as an input source and as an
output target for MapReduce jobs.
Bigtable 可以被 MapReduce 使用。我们编写
了 一 些 wrapper , 让 Bigtable 可 以 被 当 作
Input 和 Ouput 被 MapReduce job 使用。
4 Building Blocks
Bigtable is built on several other pieces of
Google in-
frastructure. Bigtable uses the
distributed Google File System (GFS) [17] to
store log and data files. A Bigtable cluster
typically operates
in a shared pool of
machines that run a wide variety of other
distributed
Bigtable
processes often share the same machines
with processes
from other applications.
applications,
and
scheduling
Bigtable de- pends on a cluster management
system for
jobs, managing
resources on shared machines, dealing with
machine failures, and monitoring machine
status.
Bigtable 在 google 其它基础设施之上构建。
Bigtable 使用分布式的 GFS 来存储日志和数
据文件。一个 Bigtable 集群一般运行在一个
运行大量不同的其它分布式程序的机器共
享池,Bigtable 进程经常与其它进程共享机
器。Bigtable 依靠一个集群管理系统来调度
作业,管理资源,处理机器故障以及监控机
器状况。
The Google SSTable file format
is used
internally to store Bigtable data. An SSTable
provides a persistent, ordered immutable
map from keys to values, where both keys and
values are arbitrary byte strings. Operations
are provided to look up the value associated
with a specifiedkey, and to iterate over all
key/value pairs in a specified key range.
Internally, each SSTable contains a sequence
of blocks (typically each block is 64KB in size,
but this is configurable). A block index (stored
at the end of the SSTable) is used to locate
blocks; the index is loaded into memory when
the SSTable is opened. A lookup can be
performed with a single disk seek: we first
find the appropriate block by performing a
binary search in the in-memory index, and
then reading the appropriate block from disk.
Optionally, an SSTable can be com- pletely
mapped into memory, which allows us to
perform lookups and scans without touching
disk.
Google SSTable 文件格式被内部使用来存储
Bigtable 数据。一个 SSTable 提供一个可持久,
排序的不可修改 map,它的 key 和 value 都
是任意字符串,并提供了根据 key 查找特定
value 的 功 能 以 及 迭 代 指 定 范 围 内 的
key/value 对的功能。在内部,每个 SSTable
包含一系列的 blocks(每个 block 一般为 64kb,
这是可以配置的)。一个 block index(存储
在 SSTable 文件尾部)用来定位 block。这个
index 会在 SSTable 被打开时加载到内存。一
个查找操作可以只做一次磁盘 seek 操作:
我们首先通过对内存中的 index 进行二分找
到对 应的 block,然 后从 磁盘读 取相应 的
block。可选的,一个 SSTable 可以完整的映
射到内存中,这允许我们无需操作磁盘的进
行查找和扫描。
Bigtable relies on a highly-available and
persistent distributed lock service called
Chubby [8]. A Chubby service consists of five
active replicas, one of which is elected to be
the master and actively serve requests. The
service is live when a majority of the replicas
are running and can communicate with each
other. Chubby uses the Paxos algorithm [9, 23]
to keep its replicas consistent in the face of
failure. Chubby provides a namespace that
consists of directories and small files. Each
directory or file can be used as a lock, and
reads and writes to a file are atomic. The
Chubby client
library provides consistent
caching of Chubby files. Each Chubby client
main- tains a session with a Chubby service. A
client’s session expires if it is unable to renew
its session lease within the lease expiration
time. When a client’s session expires, it loses
any locks and open handles. Chubby clients
can also register callbacks on Chubby files and
directories for notification of changes or
session expiration.
Bigtable 依赖于一个叫做Chubby 的高可用和
持久存储锁服务。一个 Chubby 服务包含五
个活动副本,其中之一选举成为 master 并
响应请求。Chubby 服务在大多数副本运行
并可以相互通信时就可以提供服务。Chubby
使用 Paxos 算法来保持它的副本在发生故障
是保存一致性。Chubby 提供一个由目录和
小文件组成的命名空间。每个目录或者文件
可以作为 lock 来使用,并原子性的读写文件。
Chubby 客户端库提供了 Chubby 文件的一致
性缓存。每一个 Chubby 客户端维护了一个
与 Chubby 服务的 Session。
Bigtable uses Chubby for a variety of tasks: to
ensure that there is at most one active master
at any time; to store the bootstrap location of
Bigtable data (see Sec- tion 5.1); to discover
tablet servers and finalize tablet server deaths
(see Section 5.2); to store Bigtable schema
information (the column family information
for each ta- ble); and to store access control
lists. If Chubby becomes unavailable for an
extended period of time, Bigtable be- comes
unavailable. We recently measured this effect
in 14 Bigtable clusters spanning 11 Chubby
instances. The average percentage of Bigtable
server hours during which some data stored
in Bigtable was not available due to Chubby
unavailability (caused by either Chubby out-
ages or network issues) was 0.0047%. The
percentage for the single cluster that was
most affected by Chubby unavailability was
0.0326%.
Bigtable 使用 Chubby 做不同的任务:确保任
何最多一个活动的 master;存储 Bitatable
数据的启动位置;发现 tablet 服务器以及确
定 tablet 服务器宕机;存储 Bigtable schema
信息(每个表的 column family 信息);存储
访问控制列表。如果 Chubby 变的超过一个
周期时间的不可用,Bigtable 也变的不可用。
我 们 最 近 在 14 个 Bigtale 集 群 以 及 11 个
Chubby 实 例 上 测 试 了 这 个 影 响 。 因 为
Chubby 不可用造成 Bigtable 不可用的比率是
0.0047%。对单个集群来说,最多被 chubby
不可用影响是 0.0326%。
5 Implementation
The Bigtable implementation has three major
compo- nents: a library that is linked into
every client, one mas- ter server, and many
tablet
be
servers.
Tablet
servers
can
dynamically added (or
from a
cluster to accomodate changes in workloads.
removed)
Bigtable 实现包含三个主要部分:一个链接
进客户端的库,一个 master 服务,一些 tablet
服务。Tablet 服务可以动态增删。
of
tablet
The master is responsible for assigning tablets
to tablet servers, detecting the addition and
expiration
balancing
tablet-server load, and garbage col- lection of
files in GFS. In addition, it handles schema
changes such as table and column family
creations.
servers,
master 负责分配 tablet 到 tablet 机器,检测
tablet 服 务 器 的 添 加 和 过 期 , 平 衡
tablet-server 的负载,以及回收 GFS 中的文
件。另外,它处理像 table 和 column family
创建这样的 schema 变更。
Each tablet server manages a set of tablets
(typically we have somewhere between ten to
a thousand tablets per tablet server). The
tablet server handles read and write requests
to the tablets that it has loaded, and also
splits tablets that have grown too large.
每个 tablet 服务器管理一些 tablets。tablet
服务器处理它加载的 tablet 的读写请求,在
tablets 增长太大时分裂他们。
As with many
single-master distributed
storage sys- tems [17, 21], client data does
not move through the mas-
ter: clients
communicate directly with tablet servers for
reads and writes. Because Bigtable clients do
not rely on the master for tablet location
information, most clients never communicate
with the master. As a result, the mas- ter is
lightly loaded in practice.
与许多单 master 分布式系统一样,客户端
数据不经过 master,客户端直接与 tablet
servers 通信来读写。因为 Bigtable 客户端不
依赖于 master 来获取位置信息,大部分客
户端不与 master 交互。这样,master 就是
较轻负载。
A Bigtable cluster stores a number of tables.
Each ta- ble consists of a set of tablets, and
each tablet contains all data associated with a
row range. Initially, each table consists of just
one tablet. As a table grows,
it is auto-
matically split
into multiple tablets, each
approximately 100-200 MB in size by default.
一 个 Bigtable 集 群 存 储 一 些 tables 。 每 个
table 包含一些 tablet,每个 tablets 包含所有
给定 row range 的数据。最开始,每个表包
含唯一一个 tablet。随着 table 增长,它会自
动分裂为多个 tablet,每个 tablet 的默认大
小为 100~200mb。
5.1 Tablet Location
We use a three-level hierarchy analogous to
that of a B+- tree [10] to store tablet location
information (Figure 4).
我们使用类似于一个 B+-树的三层结构来存
储 tablet 位置信息。
The first level is a file stored in Chubby that
contains the location of the root tablet. The
root tablet contains the location of all tablets
in a special METADATA table. Each METADATA
tablet contains the location of a set of user
tablets. The root tablet is just the first tablet
in the METADATA table, but
treated
specially—it is never split—to ensure that the
tablet location hierarchy has no more than
three levels.
is
第一层是存储在 Chubby 中的文件,包含了
root tablet 的位置信息。root tablet 包含特
殊的 METADATA 表的所有 tablet 位置信息。
每一个 METADATA tablet 包含用户 tablet 的