Rows The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses.
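This locality property can be illustrated with a small sketch (hypothetical, not the Bigtable API): because row keys are kept in lexicographic order, a short row-range scan is a contiguous slice of the key space, and reversing hostnames, as the paper's Webtable does, clusters all pages from one domain under adjacent keys.

```python
import bisect

def scan_range(sorted_keys, start, end):
    """Return all row keys in [start, end) via binary search over the
    lexicographically sorted key space."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]

# Reversed hostnames cluster pages from one site under adjacent row keys,
# so this scan touches one contiguous run (and hence few tablets).
rows = sorted([
    "com.cnn.www/index.html",
    "com.cnn.www/world",
    "com.cnn.money/markets",
    "org.example/about",
])
# '0' is the character after '/', so this bounds a prefix scan.
same_site = scan_range(rows, "com.cnn.www/", "com.cnn.www0")
# same_site -> ["com.cnn.www/index.html", "com.cnn.www/world"]
```

The same prefix-bounding trick is what makes "all pages of one site" a cheap range read rather than a scatter of point lookups.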
Figure 3: Reading from Bigtable

Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days). In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled.

Bigtable supports several other features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in Sawzall, a language developed at Google for processing data.
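The two per-column-family garbage-collection settings can be sketched as follows (illustrative Python, not Google's implementation; the function name and argument shapes are invented):

```python
import time

def gc_versions(versions, max_versions=None, max_age_s=None, now=None):
    """versions: list of (timestamp, value) pairs sorted newest-first,
    matching the decreasing-timestamp storage order described above.
    Keep at most max_versions entries and/or only entries younger than
    max_age_s seconds."""
    now = time.time() if now is None else now
    kept = versions
    if max_age_s is not None:  # "only new-enough versions" setting
        kept = [(ts, v) for ts, v in kept if now - ts <= max_age_s]
    if max_versions is not None:  # "only the last n versions" setting
        kept = kept[:max_versions]
    return kept

cell = [(300.0, "v3"), (200.0, "v2"), (100.0, "v1")]  # newest first
assert gc_versions(cell, max_versions=2) == [(300.0, "v3"), (200.0, "v2")]
assert gc_versions(cell, max_age_s=150, now=300.0) == [(300.0, "v3"), (200.0, "v2")]
```

Because versions are stored newest-first, both policies reduce to trimming a suffix of the list, which is why they are cheap to apply during compactions.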
The garbage-collection mechanism described above lets us keep only the most recent three versions of every page. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable.

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column-family metadata, such as access control rights. Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows C++ code that uses a RowMutation abstraction to perform a series of updates (irrelevant details were elided to keep the example short). The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor. Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*.cnn.com. Bigtable can also be used with MapReduce: we have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.
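A loose Python analogue of the C++ RowMutation pattern may make the atomic single-row update concrete (class and function names are invented for illustration; only the Set/Delete operations come from the paper's Figure 2):

```python
class RowMutation:
    """Buffers a series of updates to one row, applied atomically."""
    def __init__(self, table, row_key):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column))

def apply_mutation(table, mutation):
    """Apply all buffered ops to one row; single-row writes in Bigtable
    are atomic, so a real implementation would commit these together."""
    row = table.setdefault(mutation.row_key, {})
    for op, column, *rest in mutation.ops:
        if op == "set":
            row[column] = rest[0]
        else:
            row.pop(column, None)

webtable = {}
m = RowMutation(webtable, "com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
apply_mutation(webtable, m)
# webtable -> {"com.cnn.www": {"anchor:www.c-span.org": "CNN"}}
```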
4 Building Blocks

Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered, immutable map from keys to values. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index, stored at the end of the SSTable, is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

Bigtable relies on a highly-available and persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serves requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with a Chubby service. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable. We recently measured this effect in 14 Bigtable clusters spanning 11 Chubby instances. The average percentage of Bigtable server hours during which some data stored in Bigtable was not available due to Chubby unavailability (caused by either Chubby outages or network issues) was 0.0047%. The percentage for the single cluster that was most affected by Chubby unavailability was 0.0326%.

5 Implementation

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column-family creations. Each tablet server manages a set of tablets (typically we have somewhere between ten and a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.

As with many single-master distributed storage systems [17, 21], client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.

5.1 Tablet Location

We use a three-level hierarchy to store tablet location information.

Figure 4: Tablet location hierarchy

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially (it is never split) to ensure that the tablet location hierarchy has no more than three levels.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.
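A quick back-of-the-envelope check of the hierarchy's capacity, assuming (as the paper does elsewhere) roughly 1KB of location data per METADATA row and a 128 MB size limit per METADATA tablet:

```python
# Capacity of the three-level location scheme under the stated assumptions.
ROW_BYTES = 1 << 10        # ~1KB of location data per tablet entry
TABLET_BYTES = 128 << 20   # 128 MB limit per METADATA tablet

rows_per_metadata_tablet = TABLET_BYTES // ROW_BYTES  # 2**17 entries

# Level 2: the root tablet indexes this many METADATA tablets.
# Level 3: each METADATA tablet indexes this many user tablets.
addressable_user_tablets = rows_per_metadata_tablet ** 2
assert addressable_user_tablets == 2 ** 34
```

Three levels therefore suffice for 2^34 user tablets, which is why the root tablet can be pinned at the top and never split.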
The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. Cache entries are only discovered upon misses, assuming that METADATA tablets do not move very frequently. We further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.

We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging.

5.2 Tablet Assignment

The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.

Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers. A tablet server stops serving its tablets if it loses its exclusive lock: e.g., due to a network partition that caused the server to lose its Chubby session. (Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. If the file no longer exists, then the tablet server will never be able to serve again, so it kills itself.

To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires. However, as described above, master failures do not change the assignment of tablets to tablet servers.

When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. The master executes the following steps at startup: (1) it grabs a unique master lock in Chubby, which prevents concurrent master instantiations; (2) it scans the servers directory in Chubby to find the live servers; (3) it communicates with every live tablet server to discover what tablets are already assigned to each server; and (4) it scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned. Therefore, before starting this scan (step 4), the master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered during step 3. This addition ensures that the root tablet will be assigned. Because the root tablet contains the names of all METADATA tablets, the master knows about all of them after it has scanned the root tablet.

The set of existing tablets only changes when a table is created or deleted, two existing tablets are merged into one, or an existing tablet is split into two smaller tablets. The master is able to keep track of these changes because it initiates all but the last.
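The startup bookkeeping (steps 2-4 and the root-tablet special case) can be sketched as a pure function; the names and data shapes here are invented for illustration, and step 1 (grabbing the master lock in Chubby) is omitted:

```python
def master_startup(live_server_tablets, metadata_tablets):
    """live_server_tablets: mapping server -> set of tablets it reports
    serving (steps 2-3). metadata_tablets: tablets listed in the METADATA
    table (step 4). Returns the set of unassigned tablets, with the root
    tablet forced in if no live server reported it."""
    assigned = set().union(*live_server_tablets.values()) if live_server_tablets else set()
    unassigned = set()
    if "root" not in assigned:
        # Pre-step-4 special case: the METADATA scan cannot run until
        # the root tablet (and hence the METADATA tablets) are assigned.
        unassigned.add("root")
    unassigned |= {t for t in metadata_tablets if t not in assigned}
    return unassigned

live = {"ts1": {"root", "t1"}, "ts2": {"t2"}}
assert master_startup(live, ["t1", "t2", "t3"]) == {"t3"}
```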
Whenever a tablet server terminates (e.g., because the cluster management system is removing the tablet server's machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly.

Tablet splits are treated specially since they are initiated by a tablet server. The tablet server commits the split by recording information for the new tablet in the METADATA table. When the split has committed, it notifies the master. In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

5.3 Tablet Serving

The persistent state of a tablet is stored in GFS, as illustrated in Figure 5. Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables.

Figure 5: Tablet Representation

To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations [13, 16]. After the write has been committed, its contents are inserted into the memtable.

When a read operation arrives at a tablet server, it is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently. Incoming read and write operations can continue while tablets are split and merged.

5.4 Compactions

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if the server dies. Incoming read and write operations can continue while compactions occur.

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background, which reads the contents of a few SSTables and the memtable and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

6 Refinements

The implementation described in the previous section required a number of refinements to achieve the high performance, availability, and reliability required by our users. This section describes portions of the implementation in more detail in order to highlight these refinements.

Locality groups

Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads. For example, page metadata in Webtable (such as language and checksums) can be in one locality group, and the contents of the page can be in a different group: an application that wants to read the metadata does not need to read through all of the page contents. In addition, some useful tuning parameters can be specified on a per-locality-group basis.
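Stepping back to tablet serving: the merged read view and the major-compaction semantics described earlier can be sketched as follows (illustrative Python, not the real implementation; each store maps a row to a (timestamp, value) pair, and a value of None stands in for a deletion entry):

```python
def merged_read(row, memtable, sstables):
    """Read the newest version of a row across the memtable and SSTables;
    returns None if the row is absent or its newest entry is a delete."""
    candidates = [s[row] for s in [memtable, *sstables] if row in s]
    if not candidates:
        return None
    ts, value = max(candidates)  # highest timestamp wins
    return value

def major_compaction(memtable, sstables):
    """Rewrite everything into one SSTable, dropping both deletion
    entries and the older data they suppress."""
    out = {}
    for store in [memtable, *sstables]:
        for row, (ts, value) in store.items():
            if row not in out or ts > out[row][0]:
                out[row] = (ts, value)
    return {row: tv for row, tv in out.items() if tv[1] is not None}

memtable = {"a": (5, None)}                      # delete of row "a" at t=5
sstables = [{"a": (1, "old"), "b": (2, "x")}]
assert merged_read("a", memtable, sstables) is None
assert major_compaction(memtable, sstables) == {"b": (2, "x")}
```

This also shows why non-major compactions must keep deletion entries: until every older SSTable that might hold "a" has been rewritten, the delete marker is the only thing suppressing the stale value.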
For example, a locality group can be declared to be in-memory. SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server. Once loaded, column families that belong to such locality groups can be read without accessing the disk. This feature is useful for small pieces of data that are accessed frequently: we use it internally for the location column family in the METADATA table.

Caching for read performance

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

Compression

Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block, whose size is controllable via a locality-group-specific tuning parameter. Although this costs some space, it means that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. This two-pass compression scheme does surprisingly well. In one experiment, we stored a large number of documents in a compressed locality group. For the purposes of the experiment, we limited ourselves to one version of each document instead of storing all versions available to us. The scheme achieved a 10-to-1 reduction in space. This is much better than typical Gzip reductions of 3-to-1 or 4-to-1 on HTML pages, because of the way Webtable rows are laid out: all pages from a single host are stored close to each other.

Bloom filters

A read operation has to read from all SSTables that make up the state of a tablet, and if these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. It also means that most lookups for non-existent rows or columns do not need to touch disk.

Commit-log implementation

If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files. In addition, having separate log files per tablet reduces the effectiveness of the group-commit optimization, since groups would tend to be smaller. To fix these issues, we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file [18, 20]. Using one log provides significant performance benefits during normal operation, but it complicates recovery.
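The single-log design can be sketched as follows (a hypothetical class, for illustration only): mutations for many tablets are appended to one physical log, so a single expensive sync makes a whole group of them durable at once.

```python
class CommitLog:
    """Toy model of one shared commit log per tablet server."""
    def __init__(self):
        self.records = []   # stands in for the single physical log file
        self.pending = []   # mutations buffered since the last sync
        self.flushes = 0    # count of (expensive) sync operations

    def append(self, tablet, mutation):
        # Mutations for different tablets are co-mingled in one log.
        self.pending.append((tablet, mutation))

    def group_commit(self):
        """One sync makes every pending mutation durable at once."""
        self.records.extend(self.pending)
        self.pending.clear()
        self.flushes += 1

log = CommitLog()
for i in range(3):
    log.append("tablet-A", f"put-{i}")
    log.append("tablet-B", f"put-{i}")
log.group_commit()   # 6 mutations, 1 sync
```

With per-tablet logs, the same six mutations would land in two files and each group would be half the size; the shared log keeps groups large at the cost of interleaved records, which is exactly what complicates recovery.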
Bigtable: A Distributed Storage System for Structured Data (OSDI '06)