Tuesday, August 18, 2009

HBase-0.20.0 Performance Evaluation

New update:
With the comments from the community, we just generated a new performance evaluation report for HBase 0.20.0. Please refer to following document.
We have been using HBase for around a year in our development and projects, from 0.17.x to 0.19.x. We and all of the community know the serious Performance/Throughput issue of these releases.

Now, the great news is that hbase-0.20.0 will be released soon. Jonathan Gray from Streamy, Ryan Rawson from StumbleUpon and Jean-Daniel Cryans had done a great job to rewrite many codes to enhance the performance. The two presentations [1][2] provide more details of this release.

Following items are very important for us:
- Insert performance: data generated fast.
- Scan performance: for data analysis by MapReduce.
- Random Access performance.
- The HFile (same as SSTable)
- Less memory and I/O overheads

Bellow is our evaluations on hbase-0.20.0 RC1:

Cluster:
- 5 slaves + 1 master
- Slaves (1-4): 4 CPU cores(2.0G), 800GB SATA disks, 8GB RAM. Slave(5): 8 CPU cores(2.0G) 6 disks with RAID1, 4GB RAM
- 1Gbps network, all nodes under the same switch.
- Hadoop-0.20.0, HBase-0.20.0, Zookeeper-3.2.0

We modified the org.apache.hadoop.hbase.PerformanceEvaluation since the code have following problems:
- Is not match for hadoop-0.20.0.
- The approach to split map is not strict. Need provide correct InputSplit and InputFormat classes.

The evaluation programs use MapReduce to do parallel operations against HBase table.
- Total rows: 5,242,850.
- Row size: 1000 bytes for value, and 10 bytes for rowkey.
- Sequential ranges: 50. (also used to define the total number of MapTasks in each evaluation)
- Each Sequential Range rows: 104,857

The principle is same as the evaluation programs described in Section 7, Performance Evaluation, of the Google Bigtable paper[3], pages 8-10. Since we have only 5 nodes to work clients, we set mapred.tasktracker.map.tasks.maximum=3 to avoid client side bottleneck.

randomWrite (init) and sequentialWrite (init) are evaluations against a new table. Since there is only one RegionServer is accessed at the beginning, the performance is not so good. randomWrite and sequentialWrite are evaluations against a existing table that is already distributed on all 5 nodes.

Compares to the metrics in Google Paper (Figure 6): The write and randomRead performance is still not so good, but this result is much better than any previous HBase release, especially the randomRead. We even got better result than the paper on sequentialRead and scan evaluations. (and we should be aware of that the paper was published in 2006). This result gives us confidence.
- The new HFile should be the major success.
- BlockCache provide more performance to sequentialRead and scan.
- Client side write-buffer accelerates the sequentialWrite, but not so distinct. Since each write operation always writes into commit-log file and memstore.
- randomRead performance is not good enough, maybe bloom filter shall enhance it in the future.
- scan is so fast, MapReduce analysis on HBase table will be efficient.

Looking forward to and researching following features:
- Bloom Filter to accelerate randomRead.
- Bulk-load.

We need do more analysis for this evaluation and read code detail. Here is our PerformanceEvaluation code: http://dl.getdropbox.com/u/24074/code/PerformanceEvaluation.java

References:
[1] Ryan Rawson’s Presentation on NOSQL. http://blog.oskarsson.nu/2009/06/nosql-debrief.html
[2] HBase goes Realtime, http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBasePresentations/attachments/HBase_Goes_Realtime.pdf
[3] Google, Bigtable: A Distributed Storage System for Structured Data http://labs.google.com/papers/bigtable.html

Anty Rao and Schubert Zhang

12 comments:

  1. Thanks for posting Anty and Schubert. How many CPUs in your nodes? Looks like hbase is faster than the BigTable paper scanning and doing sequential reads (Am I reading that right)? Somethings up with our writes at the moment. Need to look into it. We seem to have lost speed here since 0.19. Good work!

    ReplyDelete
  2. You are showing ~2ms for random reads per node? Is that right?

    That's about as good as we can get on your hardware. Bloom filters will only help in the case of a miss, not a hit. Besides that, you're already showing 2-4X better performance than a disk seek. Any other improvement will have to come from HDFS optimizations, RPC optimizations, and of course you can always get better performance by loading up with more RAM for the filesystem cache. Try 8GB or 16GB in your nodes and you might get sub-ms on average per node, but remember, you're serving out of memory then and not seeking. Adding more memory (and regionserver heap) should help the numbers across the board.

    The BigTable paper shows 1212 random reads per second on a single node... That's sub-ms for random access, clearly not actually doing disk seeks for most gets.

    ReplyDelete
  3. See HBASE-1771 Anty and Schubert. Sequential/Random writes should go up by factors of 2-4.

    ReplyDelete
  4. @stack:
    Slaves(1-4): 4 CPU cores(2.0G), Slave(5): 8 CPU cores(2.0G). I have just updated the post to add CPU info.
    We will modify conf according to HBASE-1771, and post the new result. It seems great. Thanks stack.

    @J-G:
    Yes, ~2ms per row and per node for random reads, it is a eventually average. It is less than the 10ms of disk seek, should be profit from cache and the HFile implementation. Now, I only assign 2GB heap to each region-server. Maybe RAID0 on multiple disks can help.

    And I just think more about bloom filters, your are right, no help in these evaluations.

    ReplyDelete
  5. After our retest, we found HBASE-1771 does not improve the performance. So, it seems that's about as good as we can get on our hardware.

    Another issue is about random reads.
    I use sequentical-writes to insert 5GB of data in our HBase table from empty, and ~30 regions are generated. Then the random-reads takes about 30 minutes to complete. And then, I run the sequentical-writes again. Thus, another version of each cell are inserted, thus ~60 regions are generated. But, we I ran the random-reads again to this table, it always take long time (more than 2 hours). And when the data in table are inserted by random-writes, the random-reads is also slow.

    ReplyDelete
  6. @Schubert So, your machines resemble those described in the BigTable paper? On the random-read test, now its 4X longer so its average of 8ms a read? My guess is that you are missing the cache more often now. What if you ran a major compaction on the table (Check logs to see it completes). Does time change much?

    ReplyDelete
  7. @stack, yes. we have fix this random-read issue, it is caused by one ineffective node.

    anyone can get the test code from https://issues.apache.org/jira/browse/HBASE-1778

    ReplyDelete
  8. Bradford Stephens to hbase-user Sep 23 (5 days ago)

    A quick performance snapshot: I believe with our cluster of 18 nodes (8
    cores, 8 GB RAM, 2 x 500 GB drives per node), we were inserting rows of
    about 5-10kb at the rate of 180,000 /second. That's on a completely untuned
    cluster. You could see much better performance with proper tweaking and LZO
    compression.

    ReplyDelete
  9. Can't print this presentation from SlideShare, will it be publicly available in any other form?

    ReplyDelete
  10. @Igor Katkov,
    Now, you can download the document from slideshare.

    ReplyDelete
  11. Hi

    I read this post 2 times. It is very useful.

    Pls try to keep posting.

    Let me show other source that may be good for community.

    Source: Performance evaluation forms

    Best regards
    Jonathan.

    ReplyDelete
  12. Hbase provides Bigtable-like capabilities on top of Hadoop and HDFS.

    ReplyDelete