Wednesday, January 6, 2010

Jeff Dean and Sanjay Ghemawat's good advice on MapReduce


I'd like to put a copy here, since this paper[1] matches my own opinions so closely on the MapReduce model and on the practices for implementing large-dataset management and processing.

In the paper, Jeffrey Dean and Sanjay Ghemawat reply to Stonebraker and DeWitt's misconceptions about MapReduce. In fact, to us these misconceptions are obvious and easy to recognize.

It is also a good guide for improving the implementation of Hadoop and the other members of its family. I suggest reading it carefully.

Dean and other scientists from Google always give us clear and reasonable explanations of their technologies and practices. But sometimes people from other organizations leave us puzzled.

Google revealed its five "witchcrafts" in the following papers:
GFS: http://labs.google.com/papers/gfs.html
MapReduce: http://labs.google.com/papers/mapreduce.html
Bigtable: http://labs.google.com/papers/bigtable.html
Chubby: http://labs.google.com/papers/chubby.html
Slide 6: Google Cluster and WorkQueue Cluster Management

The following papers, articles, and keynotes are also well worth careful reading:
Jeff Dean's keynote at LADIS09 (Designs, Lessons and Advice from Building Large Distributed Systems): http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Jeff Dean's keynote at WSDM09 (Challenges in Building Large-Scale Information Retrieval Systems): http://research.google.com/people/jeff/WSDM09-keynote.pdf
Jeff Dean Stanford-295-talk (Software Engineering Advice from Building Large-Scale Distributed Systems): http://research.google.com/people/jeff/stanford-295-talk.pdf
Jeff Dean "Handling Large Datasets at Google": http://hepix.caspur.it/storage/hep_pdf/2008/Spring/handling-large-datasets-20080507.pdf
Jeff Dean "A Behind the Scenes Tour": http://www.slideshare.net/rawwell/googleabehindthescenestourjeffdean

And the following so-called GFS-II articles:
Sean Quinlan: GFS: Evolution on Fast-forward (http://queue.acm.org/detail.cfm?id=1594206)