Big Data Engineering, Practices and Research: BigTable

1. Intruduction

HFile is a mimic of Google’s SSTable. Now, it is available in Hadoop HBase-0.20.0. And the previous releases of HBase temporarily use an alternate file format – MapFile[4], which is a common file format in Hadoop IO package. I think HFile should also become a common file format when it becomes mature, and should be moved into the common IO package of Hadoop in the future.

Following words of SSTable are from section 4 of Google’s Bigtable paper.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.[1]

The HFile implements the same features as SSTable, but may provide more or less.

2. File Format

Data Block Size

Whenever we say Block Size, it means the uncompressed size.

The size of each data block is 64KB by default, and is configurable in HFile.Writer. It means the data block will not exceed this size more than one key/value pair. The HFile.Writer starts a new data block to add key/value pairs if the current writing block is equal to or bigger than this size. The 64KB size is same as Google’s [1].

To achieve better performance, we should select different block size. If the average key/value size is very short (e.g. 100 bytes), we should select small blocks (e.g. 16KB) to avoid too many key/value pairs in each block, which will increase the latency of in-block seek, because the seeking operation always finds the key from the first key/value pair in sequence within a block.

Maximum Key Length

The key of each key/value pair is currently up to 64KB in size. Usually, 10-100 bytes is a typical size for most of our applications. Even in the data model of HBase, the key (rowkey+column family:qualifier+timestamp) should not be too long.

Maximum File Size

The trailer, file-info and total data block indexes (optionally, may add meta block indexes) will be in memory when writing and reading of an HFile. So, a larger HFile (with more data blocks) requires more memory. For example, a 1GB uncompressed HFile would have about 15600 (1GB/64KB) data blocks, and correspondingly about 15600 indexes. Suppose the average key size is 64 bytes, then we need about 1.2MB RAM (15600X80) to hold these indexes in memory.

Compression Algorithm

- Compression reduces the number of bytes written to/read from HDFS.

- Compression effectively improves the efficiency of network bandwidth and disk space

- Compression reduces the size of data needed to be read when issuing a read

To be as low friction as necessary, a real-time compression library is preferred. Currently, HFile supports following three algorithms:

(1)NONE (Default, uncompressed, string name=”none”)

(2)GZ (Gzip, string name=”gz”)

Out of the box, HFile ships with only Gzip compression, which is fairly slow.

(3)LZO(Lempel-Ziv-Oberhumer, preferred, string name=”lzo”)

To achieve maximal performance and benefit, you must enable LZO, which is a lossless data compression algorithm that is focused on decompression speed.

Following figures show the format of an HFile.

In above figures, an HFile is separated into multiple segments, from beginning to end, they are:

- Data Block segment

To store key/value pairs, may be compressed.

- Meta Block segment (Optional)

To store user defined large metadata, may be compressed.

- File Info segment

It is a small metadata of the HFile, without compression. User can add user defined small metadata (name/value) here.

- Data Block Index segment

Indexes the data block offset in the HFile. The key of each index is the key of first key/value pair in the block.

- Meta Block Index segment (Optional)

Indexes the meta block offset in the HFile. The key of each index is the user defined unique name of the meta block.

- Trailer

The fix sized metadata. To hold the offset of each segment, etc. To read an HFile, we should always read the Trailer firstly.

The current implementation of HFile does not include Bloom Filter, which should be added in the future.

3. LZO Compression

LZO is now removed from Hadoop or HBase 0.20+ because of GPL restrictions. To enable it, we should install native library firstly as following. [6][7][8][9]

(1) Download LZO: http://www.oberhumer.com/, and build.

# ./configure --build=x86_64-redhat-linux-gnu --enable-shared --disable-asm

# make

# make install

Then the libraries have been installed in: /usr/local/lib

(2) Download the native connector library http://code.google.com/p/hadoop-gpl-compression/, and build.

Copy hadoo-0.20.0-core.jar to ./lib.

# ant compile-native

# ant jar

(3) Copy the native library (build/native/ Linux-amd64-64) and hadoop-gpl-compression-0.1.0-dev.jar to your application’s lib directory. If your application is a MapReduce job, copy them to hadoop’s lib directory. Your application should follow the $HADOOP_HOME/bin/hadoop script to ensure that the native hadoop library is on the library path via the system property -Djava.library.path=. [9]

4. Performance Evaluation

Testbed

− 4 slaves + 1 master

− Machine: 4 CPU cores (2.0G), 2x500GB 7200RPM SATA disks, 8GB RAM.

− Linux: RedHat 5.1 (2.6.18-53.el5), ext3, no RAID, noatime

− 1Gbps network, all nodes under the same switch.

− Hadoop-0.20.0 (1GB heap), lzo-2.0.3

Some MapReduce-based benchmarks are designed to evaluate the performance of operations to HFiles, in parallel.

− Total key/value entries: 30,000,000.

− Key/Value size: 1000 bytes (10 for key, and 990 for value). We have totally 30GB of data.

− Sequential key ranges: 60, i.e. each range have 500,000 entries.

− Use default block size.

− The entry value is a string, each continuous 8 bytes are a filled with a same letter (A~Z). E.g. “BBBBBBBBXXXXXXXXGGGGGGGG……”.

We set mapred.tasktracker.map.tasks.maximum=3 to avoid client side bottleneck.

(1) Write

Each MapTask for each range of key, which writes a separate HFile with 500,000 key/value entries.

(2) Full Scan

Each MapTask scans a separate HFile from beginning to end.

(3) Random Seek a specified key

Each MapTask opens one separate HFile, and selects a random key within that file to seek it. Each MapTask runs 50,000 (1/10 of the entries) random seeks.

(4) Random Short Scan

Each MapTask opens one separate HFile, and selects a random key within that file as a beginning to scan 30 entries. Each MapTask runs 50,000 scans, i.e. scans 50,000*30=1,500,000 entries.

This table shows the average entries which are written/seek/scanned per second, and per node.

In this evaluation case, the compression ratio is about 7:1 for gz(Gzip), and about 4:1 for lzo. Even through the compression ratio is just moderate, the lzo column shows the best performance, especially for writes.

The performance of full scan is much better than SequenceFile, so HFile may provide better performance to MapReduce-based analytical applications.

The random seek in HFiles is slow, especially in none-compressed HFiles. But the above numbers already show 6X~10X better performance than a disk seek (10ms). Following Ganglia charts show us the overhead of load, CPU, and network. The random short scan makes the similar phenomena.

HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs

View more documents from schubertzhang.

References

[1] Google, Bigtable: A Distributed Storage System for Structured Data, http://labs.google.com/papers/bigtable.html

[2] HBase-0.20.0 Documentation, http://hadoop.apache.org/hbase/docs/r0.20.0/

[3] HFile code review and refinement. http://issues.apache.org/jira/browse/HBASE-1818

[4] MapFile API: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html

[5] Parallel LZO: Splittable Compression for Hadoop. http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/

http://blog.chrisgoffinet.com/2009/06/parallel-lzo-splittable-on-hadoop-using-cloudera/

[6] Using LZO in Hadoop and HBase: http://wiki.apache.org/hadoop/UsingLzoCompression

[7] LZO: http://www.oberhumer.com

[8] Hadoop LZO native connector library: http://code.google.com/p/hadoop-gpl-compression/

[9] Hadoop Native Libraries Guide: http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html

New update:
With the comments from the community, we just generated a new performance evaluation report for HBase 0.20.0. Please refer to following document.

HBase 0.20.0 Performance Evaluation

View more documents from schubertzhang.

We have been using HBase for around a year in our development and projects, from 0.17.x to 0.19.x. We and all of the community know the serious Performance/Throughput issue of these releases.

Now, the great news is that hbase-0.20.0 will be released soon. Jonathan Gray from Streamy, Ryan Rawson from StumbleUpon and Jean-Daniel Cryans had done a great job to rewrite many codes to enhance the performance. The two presentations [1][2] provide more details of this release.

Following items are very important for us:
- Insert performance: data generated fast.
- Scan performance: for data analysis by MapReduce.
- Random Access performance.
- The HFile (same as SSTable)
- Less memory and I/O overheads

Bellow is our evaluations on hbase-0.20.0 RC1:

Cluster:
- 5 slaves + 1 master
- Slaves (1-4): 4 CPU cores(2.0G), 800GB SATA disks, 8GB RAM. Slave(5): 8 CPU cores(2.0G) 6 disks with RAID1, 4GB RAM
- 1Gbps network, all nodes under the same switch.
- Hadoop-0.20.0, HBase-0.20.0, Zookeeper-3.2.0

We modified the org.apache.hadoop.hbase.PerformanceEvaluation since the code have following problems:
- Is not match for hadoop-0.20.0.
- The approach to split map is not strict. Need provide correct InputSplit and InputFormat classes.

The evaluation programs use MapReduce to do parallel operations against HBase table.
- Total rows: 5,242,850.
- Row size: 1000 bytes for value, and 10 bytes for rowkey.
- Sequential ranges: 50. (also used to define the total number of MapTasks in each evaluation)
- Each Sequential Range rows: 104,857

The principle is same as the evaluation programs described in Section 7, Performance Evaluation, of the Google Bigtable paper[3], pages 8-10. Since we have only 5 nodes to work clients, we set mapred.tasktracker.map.tasks.maximum=3 to avoid client side bottleneck.

randomWrite (init) and sequentialWrite (init) are evaluations against a new table. Since there is only one RegionServer is accessed at the beginning, the performance is not so good. randomWrite and sequentialWrite are evaluations against a existing table that is already distributed on all 5 nodes.

Compares to the metrics in Google Paper (Figure 6): The write and randomRead performance is still not so good, but this result is much better than any previous HBase release, especially the randomRead. We even got better result than the paper on sequentialRead and scan evaluations. (and we should be aware of that the paper was published in 2006). This result gives us confidence.
- The new HFile should be the major success.
- BlockCache provide more performance to sequentialRead and scan.
- Client side write-buffer accelerates the sequentialWrite, but not so distinct. Since each write operation always writes into commit-log file and memstore.
- randomRead performance is not good enough, maybe bloom filter shall enhance it in the future.
- scan is so fast, MapReduce analysis on HBase table will be efficient.

Looking forward to and researching following features:
- Bloom Filter to accelerate randomRead.
- Bulk-load.

We need do more analysis for this evaluation and read code detail. Here is our PerformanceEvaluation code: http://dl.getdropbox.com/u/24074/code/PerformanceEvaluation.java

References:
[1] Ryan Rawson’s Presentation on NOSQL. http://blog.oskarsson.nu/2009/06/nosql-debrief.html
[2] HBase goes Realtime, http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBasePresentations/attachments/HBase_Goes_Realtime.pdf
[3] Google, Bigtable: A Distributed Storage System for Structured Data http://labs.google.com/papers/bigtable.html

Anty Rao and Schubert Zhang

Big Data Engineering, Practices and Research

Saturday, September 12, 2009

HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs

Tuesday, August 18, 2009

HBase-0.20.0 Performance Evaluation

About Me

Blog Archive

Labels

Search This Blog

Followers

Big Data Engineering, Practices and Research

Saturday, September 12, 2009

HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs

Tuesday, August 18, 2009

HBase-0.20.0 Performance Evaluation

About Me

Blog Archive

Labels

Search This Blog

Subscribe To

Followers