Wednesday, January 6, 2010

Jeff Dean and Sanjay Ghemawat's good advice on MapReduce

I'd like to put a copy here, since this paper[1] so closely matches my opinions on the MapReduce model and on practices for implementing large-dataset management and processing.

In the paper, Jeffrey Dean and Sanjay Ghemawat reply to Stonebraker and DeWitt's misconceptions about MapReduce. In fact, these misconceptions are obvious and easy for us to recognize.

It is also a good guide for improving the implementation of Hadoop and other members of its family. I suggest reading it carefully.

Dean and other scientists from Google always give us clear and reasonable explanations of their technologies and practices. But sometimes people from other organizations leave us puzzled.

Besides the five "witchcrafts" that Google revealed in the following papers:
Slide 6: Google Cluster and WorkQueue Cluster Management

The following papers, articles, and keynotes are well worth careful reading:
Jeff Dean keynote at LADIS09 (Designs, Lessons and Advice from Building Large Distributed Systems):
Jeff Dean keynote at WSDM09 (Challenges in Building Large-Scale Information Retrieval Systems):
Jeff Dean Stanford-295-talk (Software Engineering Advice from Building Large-Scale Distributed Systems):
Jeff Dean "Handling Large Datasets at Google":
Jeff Dean "A Behind the Scenes Tour":

And the following so-called GFS-II articles:
Sean Quinlan: GFS: Evolution on Fast-forward (


  1. Big data cannot be handled by one person; it takes a number of people to understand it or to use it in a project. But it can be reduced to something as simple and short as you want with the help of data scientists who are experts in handling the big data of large companies.

  2. OK, thanks for this post. It's quite informative, and I have learned new things.
