logo资料库

BigTable A Distributed Storage System for Structured Data.docx

第1页 / 共25页
第2页 / 共25页
第3页 / 共25页
第4页 / 共25页
第5页 / 共25页
第6页 / 共25页
第7页 / 共25页
第8页 / 共25页
资料共25页,剩余部分请下载后查看
BigTable A Distributed Storage System for Structur
BigTable A Distributed Storage System for Structured Data Abstract thousands of Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Fi- nance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the sim- ple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we de- scribe the design and implementation of Bigtable. Bigtable 是一个分布式存储系统,用来管理 结构化数据,被设计为可拓展到一个非常大 的规模:上千台普通机器上 PB 级别的数据。 Google 的一些项目在 Bigtable 中存储数据, 包 括 网 页 索 引 , Google Earth 和 Google Finance。这些程序有明显不同的对 bigtable 的需求,包括数据大小和延时。Bigtable 成 功为所有 Google 产品提供了一个灵活高可 用的解决方案。在这篇文章,我们描述了这 个 bigtable 提供的简单数据模型,让客户端 可以动态控制数据分布和格式,我们描述了 Bigtable 的设计和实现。 1 Introduction Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data. from 在过去的二年半,我们设计、实现并部署了 一个用来管理结构化数据的分布式存储系 统 Bigtable。Bigtable 设计可以拓展到 PB 级 别的数据和上千台机器。Bigtable 实现了以 下几个目标:广泛的适用性,拓展性,高性 能和高可用性。Bigtable 被超过 60 个 Google 产品和项目所使用,包括 Google Analytics、 Google Finance、Orkut、Personalized Search、 Writely 和 Google Earth 。 这 些 产 品 使 用 Bigtable 在不同需求的场景,包括吞吐类型 的批处理任务以及延时敏感的向终端用户 提供数据。这些产品使用 Bigtable 集群包含 大量不同的配置,从数十台到成百上千台服 务器,存储几百 TB 的数据。 In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Paral- lel databases [14] and
instead, main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full it provides relational data model; clients with a simple data model that supports dynamic control over data layout and format, and al- lows clients to reason about the locality properties of the data represented in the underlying storage. Data is in- dexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of struc- tured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema pa- rameters let clients dynamically control whether to serve data out of memory or from disk. 从许多方面看,Bigtable 依然是一个数据库: 它使用了许多数据库的实现策略。并行数据 库[14]和内存数据库[13]有可拓展性和高性 能,但是 Bigtable 提供了一个不同的接口。 Bigtable 不支持一个完整的关系数据模型, 取而代之,它提供一个简单数据模型的客户 端,支持动态控制数据分布和格式,并让客 户端推断数据分布属性。数据使用任意字符 串的行名和列名来索引。Bigtable 将数据看 作不解析的字符串,客户端经常将不同的结 构化或半结构化数据序列化为这些字符串。 客户端可以通过小心选择他们的 schema 来 控制数据分布。最后,Bigtable schema 参数 让 客 户 端 动 态 控 制 不 管 serve data out of memory or from disk。 Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Sec- tion 4 briefly describes the underlying Google infrastruc- ture on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Sec- the tion 6 describes some of refinements that we made to improve Bigtable’s performance. Section 7 provides measurements of Bigtable’s performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we and supporting Bigtable in Section 9. Fi- nally, Section 10 describes related work, and Section 11 presents our conclusions. designing learned in Section 2 更 具 体 的 描 述 了 数 据 模 型 , Section3 提供了一个关于 client API 的概述。 Setion4 简要的描述了Bigtable 依赖的 Google 基础设施。Section5 描述 Bigtable 的实现。 Section6 描述了一些我们改进 Bigtable 性能 的一些方法。Section7 提供了对 Bigtable 性 能的测试。在 Section8 我们介绍了 Google 如何使用 Bigtable 的例子。在 Section9 我们 讨论一些我们在设计和支持 Bigtable 学到的 经 验 。 最 后 ,section10 介 绍 相 关 工 作 。 section11 展示我们的总结。 2 Data Model A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. (row:string, column:string, string time:int64) → We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one con- crete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web the Webtable. table
pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1. 一个 Bigtable 是一个单纯的分布式的提供了 多维排序的 MAP,这个 map 使用行 key,列 key 和一个时间戳来索引,map 中的每个值 是一个不解释的字节数组。 (row: string, column:string, string time:int64) -> 我们在考察了一些可能会使用 Bigtable 相似 系统后决定这个数据模型。作为了一个具体 的例子,推动我们的一些设计决策,假设我 们想保存可能被许多不同项目使用的大量 集 合 的 网 页 和 相 关 信 息 , 假 设 这 个 表 叫 Webtable。在 Webtable,我们可以使用 URL 作为 row key,网页的不同属性作为 column key,存储网页内容在 contents 中。 ROWS The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row. 表格的行 key 是可变字符串(当前最多是 64KB,尽管 10 到 100 字节是我们大多数用 户的典型长度)。每个对单个行 key 下的数 据读写都是原子的(不管这行有多少不同的 列被读或写),这让客户端更容易相信系统 在并发更新相同行的行为。 Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of dis- tribution and load balancing. As a result, reads of short row ranges are efficient and typically require communi- cation with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient. Bigtable 使用 row key 的字典序来管理数据。 一个 table 的列的范围是动态分区的。每个 列的范围叫作 tablet,tablet 是分布和负载 均衡的单位。作为结果,对于列范围较小的 读操作是高效的,通常只要与很少数目的机 器交互。客户端可以通过选择他们的 row key 来利用这个属性,使它们得到对于他们 数 据 访 问 的 好 的 locality 。 举 例 来 说 , 在 Webtable,相同域名下的网页通过对 URL 的 hostname 取反操作,使其分组到连续行。 对相同域名的网页相邻存储让一些 host 和 domain 分析更高效。 Column Families Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families
rarely change during operation. In contrast, a table may have an unbounded number of columns. column keys 分组到称为 column family 的 集合中,column family 是访问控制的基本单 位。所有存储在 column family 的数据经常是 相同类型的(我们压缩相同 column family 的数据到一块)。一个 column family 一定要 在该 family 中的 column 的数据存储之前创 建。在一个 column family 被创建后,任意相 同 family 的 column key 可以被使用。我们希 望不同的 column family 个数比较小(最多在 百个量级)。family 在操作期间尽量少的变更。 相对的,一个 table 可以有无上限的列。 column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons). 访问控制以及磁盘内存都在 column-family 级别统计。在我们的 Webtable 例子中,这 些控制允许我们管理一些不同类型的程序: 一些添加新的基础数据,一些读取基础数据 创建衍生 column family,一些只允许看到已 经存在数据(可能因为内部原因,不能看到 所有数据)。 A column key is named using the following syntax: family:qualifier. Column family names must be print- able, but qualifiers may be arbitrary strings. An exam- ple column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page’s language ID. Another useful col- umn family for this table is anchor; each column key in this family represents a single anchor, as shown in Fig- ure 1. The qualifier is the name of the referring site; the cell contents is the link text. 一 个 column key 使 用 如 下 格 式 命 名 : family:qualifier。column family 名一定要是 可打印,但是 qualifiers 可以是任意字符。 Webtable 中 一 个 示 例 column family 是 language,保存一个网页的不同语言版本。 我 们 只 使 用 一 个 column key 在 language family,它保存每个网页的语言的 ID。另外有 用的 column family 是 anchor,这个 family 下 的 每 个 column key 代 表 一 个 anchor 。 qualifier 是引用站点的命名,这个单元的内 容是连接文字。 Timestamps in a Bigtable can contain multiple Each cell versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be as- signed by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp or- der, so that the most recent versions can be read first. Bigtable 的每个 cell 可以包含多个版本的相 同 数 据 , 这 些 版 本 使 用 时 间 戳 来 索 引 。 Bigtable 时间戳是 64 位数字。他们可以被 Bigtable 分配,代表微秒为单位的真实时间, 或者由客户端分配。程序要避免冲突,必须 生成不同的时间戳。不同版本的 cell 都按照 时间下降的顺序存储,这样最近版本的数据 可以被每一个读到。 Access control and both disk and memory account- the performed are ing at To make the management of versioned data less two per-column-family settings that tell Bigtable onerous, support we
to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days). 为了让对版本数据管理不太沉重,我们支持 两 种 per-column-family 设 置, 让 Bigtable 自动垃圾回收 cell 的版本。客户端可以指定 只保留最近 n 个版本或者只有足够新的版 本被保留(例如,只保存最近七天写下的版 本)。 In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page. 在 我 们 的 Webtable 的 例 子 中 , 我 们 在 contents 中设置了网页被抓取的时间戳,垃 圾回收机制让我们只保留最近三天的版本。 3 API functions The Bigtable API provides for creating and deleting tables and column families. for changing cluster, table, and column family metadata, such as access control rights. It also provides functions Bigtable API 提 供 了 像 创 建 、 删 除 表 格 和 column family 的功能。它也提供了像修改集 群,表格和 column family 元数据的功能,例 如访问控制权限。 Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows C++ code that uses a RowMutation abstraction to per- form a series of updates. (Irrelevant details were elided to keep the example short.) The call to Apply performs an atomic mutation to the Webtable: to www.cnn.com and deletes a different anchor. anchor it adds one 客户程序可以写或者删除在 Bigtable 的值, 从不同的 rows 中查找值,或者在一个 table 中的数据子集迭代。 row. Clients Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For ex- ample, we could restrict the scan above to only produce anchors whose columns match expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time. regular the 客户端可以迭代多个 column family,并有一 些机制来限制一个扫描产生的行、列和时间 戳。举例来说,我们可以限制扫描只产生 column 匹配 anchor:*.cnn.com 的列,或者只 产生时间戳在十天之内的。 First, Bigtable Bigtable supports several other features that allow the user to manipulate data in more complex ways. supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, al- though it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable sup- ports the execution of client-supplied scripts in the ad- dress spaces of the servers. The scripts are written in a language developed at Google for
processing data called Sawzall [28]. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summariza- tion via a variety of operators. Bigtable 支持一些其它特性:允许用户使用 更复杂的方法管理数据。首先,Bigtable 支 持单行的事务,这可以用来执行原子的读- 修改-写 操作。Bigtable 现在不支持传统的 跨行事务,尽管它在客户端提供了一个修改 跨行数据的批量写接口。第二,Bigtable 允 许 cless 可以作为计数器使用。最后,Bigtable 支持在服务器的地址空间运行客户端提供 的脚本。这个脚本使用 google 开发的用来处 理数据的语言 Sawzall。当前,我们的 Sawzall 为基础的 API 不允许客户脚本将数据写回 Bigtable,但是它支持不同形式的数据变换, 任意表达式的过滤和通过大量操作的聚合。 Bigtable can be used with MapReduce [12], a frame- work for running large-scale parallel computations de- veloped at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs. Bigtable 可以被 MapReduce 使用。我们编写 了 一 些 wrapper , 让 Bigtable 可 以 被 当 作 Input 和 Ouput 被 MapReduce job 使用。 4 Building Blocks Bigtable is built on several other pieces of Google in- frastructure. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed Bigtable processes often share the same machines with processes from other applications. applications, and scheduling Bigtable de- pends on a cluster management system for jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status. Bigtable 在 google 其它基础设施之上构建。 Bigtable 使用分布式的 GFS 来存储日志和数 据文件。一个 Bigtable 集群一般运行在一个 运行大量不同的其它分布式程序的机器共 享池,Bigtable 进程经常与其它进程共享机 器。Bigtable 依靠一个集群管理系统来调度 作业,管理资源,处理机器故障以及监控机 器状况。 The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specifiedkey, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be com- pletely mapped into memory, which allows us to perform lookups and scans without touching disk. Google SSTable 文件格式被内部使用来存储 Bigtable 数据。一个 SSTable 提供一个可持久, 排序的不可修改 map,它的 key 和 value 都 是任意字符串,并提供了根据 key 查找特定 value 的 功 能 以 及 迭 代 指 定 范 围 内 的 key/value 对的功能。在内部,每个 SSTable
包含一系列的 blocks(每个 block 一般为 64kb, 这是可以配置的)。一个 block index(存储 在 SSTable 文件尾部)用来定位 block。这个 index 会在 SSTable 被打开时加载到内存。一 个查找操作可以只做一次磁盘 seek 操作: 我们首先通过对内存中的 index 进行二分找 到对 应的 block,然 后从 磁盘读 取相应 的 block。可选的,一个 SSTable 可以完整的映 射到内存中,这允许我们无需操作磁盘的进 行查找和扫描。 Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8]. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm [9, 23] to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client main- tains a session with a Chubby service. A client’s session expires if it is unable to renew its session lease within the lease expiration time. When a client’s session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration. Bigtable 依赖于一个叫做Chubby 的高可用和 持久存储锁服务。一个 Chubby 服务包含五 个活动副本,其中之一选举成为 master 并 响应请求。Chubby 服务在大多数副本运行 并可以相互通信时就可以提供服务。Chubby 使用 Paxos 算法来保持它的副本在发生故障 是保存一致性。Chubby 提供一个由目录和 小文件组成的命名空间。每个目录或者文件 可以作为 lock 来使用,并原子性的读写文件。 Chubby 客户端库提供了 Chubby 文件的一致 性缓存。每一个 Chubby 客户端维护了一个 与 Chubby 服务的 Session。 Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Sec- tion 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each ta- ble); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable be- comes unavailable. We recently measured this effect in 14 Bigtable clusters spanning 11 Chubby instances. The average percentage of Bigtable server hours during which some data stored in Bigtable was not available due to Chubby unavailability (caused by either Chubby out- ages or network issues) was 0.0047%. The percentage for the single cluster that was most affected by Chubby unavailability was 0.0326%. Bigtable 使用 Chubby 做不同的任务:确保任 何最多一个活动的 master;存储 Bitatable 数据的启动位置;发现 tablet 服务器以及确 定 tablet 服务器宕机;存储 Bigtable schema 信息(每个表的 column family 信息);存储 访问控制列表。如果 Chubby 变的超过一个 周期时间的不可用,Bigtable 也变的不可用。 我 们 最 近 在 14 个 Bigtale 集 群 以 及 11 个 Chubby 实 例 上 测 试 了 这 个 影 响 。 因 为 Chubby 不可用造成 Bigtable 不可用的比率是 0.0047%。对单个集群来说,最多被 chubby 不可用影响是 0.0326%。 5 Implementation The Bigtable implementation has three major compo- nents: a library that is linked into every client, one mas- ter server, and many tablet be servers. Tablet servers can
dynamically added (or from a cluster to accomodate changes in workloads. removed) Bigtable 实现包含三个主要部分:一个链接 进客户端的库,一个 master 服务,一些 tablet 服务。Tablet 服务可以动态增删。 of tablet The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration balancing tablet-server load, and garbage col- lection of files in GFS. In addition, it handles schema changes such as table and column family creations. servers, master 负责分配 tablet 到 tablet 机器,检测 tablet 服 务 器 的 添 加 和 过 期 , 平 衡 tablet-server 的负载,以及回收 GFS 中的文 件。另外,它处理像 table 和 column family 创建这样的 schema 变更。 Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large. 每个 tablet 服务器管理一些 tablets。tablet 服务器处理它加载的 tablet 的读写请求,在 tablets 增长太大时分裂他们。 As with many single-master distributed storage sys- tems [17, 21], client data does not move through the mas- ter: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the mas- ter is lightly loaded in practice. 与许多单 master 分布式系统一样,客户端 数据不经过 master,客户端直接与 tablet servers 通信来读写。因为 Bigtable 客户端不 依赖于 master 来获取位置信息,大部分客 户端不与 master 交互。这样,master 就是 较轻负载。 A Bigtable cluster stores a number of tables. Each ta- ble consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is auto- matically split into multiple tablets, each approximately 100-200 MB in size by default. 一 个 Bigtable 集 群 存 储 一 些 tables 。 每 个 table 包含一些 tablet,每个 tablets 包含所有 给定 row range 的数据。最开始,每个表包 含唯一一个 tablet。随着 table 增长,它会自 动分裂为多个 tablet,每个 tablet 的默认大 小为 100~200mb。 5.1 Tablet Location We use a three-level hierarchy analogous to that of a B+- tree [10] to store tablet location information (Figure 4). 我们使用类似于一个 B+-树的三层结构来存 储 tablet 位置信息。 The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but treated specially—it is never split—to ensure that the tablet location hierarchy has no more than three levels. is 第一层是存储在 Chubby 中的文件,包含了 root tablet 的位置信息。root tablet 包含特 殊的 METADATA 表的所有 tablet 位置信息。 每一个 METADATA tablet 包含用户 tablet 的
分享到:
收藏