Big Data Engineering, Practices and Research: Vertica

Recently, tow famous vendors of analytic DBMS, Vertica and Aster-Data announced their integration with Hadoop. The analytic DBMS and Hadoop, each address distinct but complementary problems for managing large data.

Vertica:

Currently it is a light integration.

ETL, ELT, data cleansing, data mining, etc.
Moving data between Hadoop and Vertica.
InputFormat (InputSplit , VerticaRecord, push down relational map operations by parameterizing the database query).
OutputFormat (to existing or create a new table).
Easy for Hadoop developers to push down Map operations to Vertica databases in parallel by specifying parameterized queries which result in pre-aggregated data for each mapper.
Support Hadoop streaming interface.

Typical usages:
(1) Raw Data->Hadoop(ETL)->Vertical (for fast ad-hoc query, near realtime)
(2) Vertical -> Hadoop(ETL) ->Vertical (for fast ad-hoc query, near realtime)
(3) Vertical -> Hadoop (sophisticated query for analysis or mining)

We can expect to see tighter integration and higher performance.

References
[1] The Scoop on Hadoop and Vertica: http://databasecolumn.vertica.com/2009/09/the_scoop_on_hadoop_and_vertic.html
[2] Using Vertica as a Structured Data Repository for Apache Hadoop: http://www.vertica.com/MapReduce
[3] Cloudera DBInputFormat interface: http://www.cloudera.com/blog/2009/03/06/database-access-with-hadoop/
[4] Managing Big Data with Hadoop and Vertica: http://www.vertica.com/resourcelogin?type=pdf&item=ManagingBigDatawithHadoopandVertica.pdf

AsterData:

AsterData already provide in-database MapReduce.

The new Aster-Hadoop Data Connector, which utilizes Aster’s patent-pending SQL-MapReduce capabilities for two-way, high-speed, data transfer between Apache Hadoop and Aster Data’s massively parallel data warehouse.

ETL processing or data mining, and then pull that data into Aster for interactive queries or ad-hoc analytics on massive data scales.
The Connector utilizes key new SQL-MapReduce functions to provide ultra-fast, two-way data loading between HDFS (Hadoop Distributed File System) and Aster Data’s MPP Database.
Parallel loader.
LoadFromHadoop: Parallel data loading from HDFS to Aster nCluster.
LoadToHadoop: Parallel data loading from Aster nCluster to HDFS.

Key advantages of Aster’s Hadoop Connector include:

High-performance: Fast, parallel data transfer between Hadoop and Aster nCluster.
Ease-of-use: Analysts can now seamlessly invoke a SQL command for ultra-simple import of Hadoop-MapReduce jobs, for deeper data analysis. Aster intelligently and automatically parallelizes the load.
Data Consistency: Aster Data's data integrity and transactional consistency capabilities treat the data load as a 'transaction', ensuring that the data load or export is always consistent and can be carried out while other queries are running in parallel in Aster.
Extensibility: Customers can easily further extend the Connector using SQL-MapReduce, to provide further customization for their specific environment.

The typical usages are similar to Vertica.

References
[1] Aster Data Announces Seamless Connectivity With Hadoop: http://www.nearshorejournal.com/2009/10/aster-data-announces-seamless-connectivity-with-hadoop/
http://www.asterdata.com/news/091001-Aster-Hadoop-connector.php
[2] DBMS2 - MapReduce tidbits http://www.dbms2.com/2009/10/01/mapreduce-tidbits/#more-983
[3] AstaData Blog: Aster Data Seamlessly Connects to Hadoop, http://www.asterdata.com/blog/index.php/2009/10/05/aster-data-seamlessly-connects-to-hadoop/

Another Integration of Analytic DBMS and Hadoop case is HadoopDB project. http://db.cs.yale.edu/hadoopdb/hadoopdb.html

- Hybrid store of row and column!

In our practices, we were aware of the hybrid of row-oriented store and column-oriented store is a realistic choice. I got this inspiration from Bigtable's column-family concept.

Now Vertica 3.5 move from pure columnar store to hybrid. It is called "Column Grouping", which is the major part of the veritica's enhancement in storing and processing columnar data called FlexStore. I think FlexStore means "Flexible Store". Users can define their column group flexibly.

Hybrid is the trend. I like Bigtable's model abstraction, it is simple and flexible.

- Hybrid query of lookup and MapReduce?

It seems it is a contradiction for low-latency lookup and high-latency ad-hoc MapReduce query. But I don't know if it make sense to support both in one data system. But sometimes, it seems needed.

Hive is one of the best practices to provide a easy-to-used MapReduce expression tool, or data warehouse. No real-time lookup. In fact, it is not a easy work to melt MapReduce into SQL, after reading of the DAG abstraction in Hive's paper in VLDB09.

As expected, Dr. Stonebraker's Vertica 3.5 also integrate Hadoop MapReduce now. And HadoopDB is Hive+Hadoop+PostgresDB. Vertica does not integrate MapReduce into SQL now, it is different from Greenplum and AsterData and HadoopDB.

References:
http://www.vertica.com/company/news/vertica-analytic-database-broadens-reach-with-flexstore
http://www.dbms2.com/2009/08/04/flexstore-and-the-rest-of-vertica-35/
http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/
http://db.cs.yale.edu/hadoopdb/hadoopdb.html
http://db.csail.mit.edu/pubs/benchmarks-sigmod09.pdf
http://www.slideshare.net/namit_jain/hive-demo-paper-at-vldb-2009

Big Data Engineering, Practices and Research

Friday, October 2, 2009

The Integration of Analytic DBMS and Hadoop

Monday, August 17, 2009

Hybrid store of row and column! Hybrid query of lookup and MapReduce?

About Me

Blog Archive

Labels

Search This Blog

Followers