Even though we have removed Cassandra from all our products, we would like to share our work here.
Why did we remove Cassandra from our products? Because:
(1) There are major flaws in Cassandra's implementation, especially in its local storage engine layer, i.e. the SSTable and indexing.
(2) Combining Bigtable and Dynamo is a fundamental mistake. Dynamo's hash-ring architecture is an obsolete technology for scaling, and its consistency and replication policies are also unusable for big-data storage.
Sunday, June 12, 2011
Saturday, July 10, 2010
My comments to "Cassandra at Twitter Today"
Someone said the Twitter blog post "Cassandra at Twitter Today" is a big blow to Cassandra's reputation.
It is being heatedly discussed at http://news.ycombinator.com/item?id=1502756
Here are my comments:
1. Cassandra is very young! In particular, the design and implementation of its local storage and local indexing are immature.
2. Poor read performance is also due to the poor local storage implementation.
3. The local storage, indexing, and persistence structures are not stable. They need to be re-designed and re-implemented. If Twitter moves data to the current Cassandra, they will have to migrate again later for the new local storage, indexing, and persistence structures.
4. There are many good techniques in Cassandra and other open-source projects (such as Hadoop, HBase, etc.), but they are not production-ready. Understand the details of these techniques and implement them in your own projects/products.
Monday, April 19, 2010
Cassandra Insert Throughput
** 0.5.1
Test Cluster:
DELL 2950, 1× Intel Xeon 5310 CPU (4 cores)
5 nodes
1 node: 2GB heap for the Cassandra JVM
4 nodes: 4GB heap for the Cassandra JVM
Commit log and data are stored on the same disks.
25 client threads running across the 5 nodes.
Data Model:
Keyspace Name = “Test”
Column Family Name = “ABC”
CompareWith for Column = LongType
Column Name = Timestamp (LongType), Value = 400 bytes binary
Billions of keys, thousands of columns per key.
Partitioner = dht.RandomPartitioner
MemtableSizeInMB = 64MB
ReplicationFactor = 3
Uses the Thrift client interface:
Client.insert(..)
Consistency Level (write) = 1
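The insert pattern above can be sketched as follows. This is an illustrative Python sketch (the helper name `make_column` and the row-key format are my own, not from the test): a LongType column name is an 8-byte big-endian long, here derived from a timestamp, and each value is a 400-byte binary blob.

```python
import os
import struct
import time

def make_column(timestamp_micros):
    """Build one (name, value) pair as inserted in the test:
    a LongType column name (8-byte big-endian long) and a
    400-byte binary value."""
    name = struct.pack(">q", timestamp_micros)  # big-endian keeps LongType sort order
    value = os.urandom(400)                     # 400 bytes of binary payload
    return name, value

# One row key holds thousands of timestamp-named columns;
# the test spreads billions of such keys across the cluster.
key = "row-0000000001"
name, value = make_column(int(time.time() * 1_000_000))
print(len(name), len(value))  # 8 400
```

Because the comparator is LongType, columns within a row stay sorted by their numeric timestamp, which is what makes time-ordered reads of a row cheap.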
Total inserted 1,076,333,461 columns.
Disk use: 302GB + 283GB + 335GB + 186GB + 276GB = 1,382GB (estimate: 400B × ~1G columns = 400GB, × 3 replicas = 1,200GB)
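As a back-of-envelope check on those numbers (my arithmetic, not part of the original test): the value bytes alone account for about 1,200GB after replication, so the remaining ~180GB is plausibly per-column overhead (column names, timestamps, and SSTable index/filter structures).

```python
# Sanity-check the disk usage: ~1.076 billion 400-byte values, replicated 3x.
columns = 1_076_333_461
raw_gb = columns * 400 / 2**30        # value bytes only, in GiB
replicated_gb = raw_gb * 3            # ReplicationFactor = 3
measured_gb = 302 + 283 + 335 + 186 + 276
print(round(raw_gb), round(replicated_gb), measured_gb)  # 401 1203 1382
```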
While inserting: about 1,000 SSTables on each node; query latency is about 1–3 seconds.
After the cluster has been idle for a long time (compaction completed): about 10 SSTables per node (very big files, e.g. one 144GB SSTable data file); query latency is in milliseconds.
Result: 18,000 columns/second aggregate insert throughput.
** 0.6.0
Only 4 nodes.
JVM GC with a big heap.
Memory and GC are always the bottleneck and a big issue for Java-based infrastructure software!
https://issues.apache.org/jira/browse/CASSANDRA-896 (LinkedBlockingQueue issue, fixed in jdk-6u19)
It seems 0.5.1 performed better; 0.6.0 eats more memory.
Cassandra 0.6.0 insert throughput