CSC352 MapReduce/Hadoop Class Notes

Outline

  • History
  • Infrastructure
  • Submitting a Job
  • Smith Cluster
  • Example 1: Java
  • Example 2: Python
  • Useful Commands
  • References

History

  • Introduced in 2004
  • MapReduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers. [wikipedia] A minimal sketch of the map/reduce idea appears after this list.
  • 2010: first conference, The First International Workshop on MapReduce and its Applications (MAPREDUCE'10) (http://graal.ens-lyon.fr/mapreduce/). Interesting tidbit: nobody from Google is on the planning committee; it is mostly INRIA.
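
To make this concrete before the Java and Python examples later in these notes, here is a minimal single-process sketch of the map/shuffle/reduce steps, using the canonical word-count example. This is plain Python with no Hadoop involved; the function names and toy input are made up for illustration:

 from collections import defaultdict

 def map_phase(document):
     # Map step: emit a (key, value) pair -- here (word, 1) --
     # for every word in one input record.
     for word in document.split():
         yield (word.lower(), 1)

 def reduce_phase(word, counts):
     # Reduce step: combine all values emitted for one key.
     return (word, sum(counts))

 documents = ["the cat sat", "the cat ran"]

 # Shuffle/group step: the framework groups intermediate pairs by
 # key and routes each group to a reducer; here, one dictionary.
 grouped = defaultdict(list)
 for doc in documents:
     for key, value in map_phase(doc):
         grouped[key].append(value)

 print(sorted(reduce_phase(w, c) for w, c in grouped.items()))
 # [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]

Hadoop distributes exactly these steps across a cluster, adding scheduling, fault tolerance, and distributed I/O on top.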

Siblings

Apache Hadoop Wins Terabyte Sort Benchmark
One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open-source program has won. Yahoo is both the largest user of Hadoop, with 13,000+ nodes running hundreds of thousands of jobs a month, and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.
Their implementation also proved to be highly scalable, and last year it placed first in the Terabyte Sort Benchmark. In order to do so, they used 910 nodes, each with 2 cores (a total of 1,820 cores), and were able to keep the data entirely in memory across the nodes. With this many machines and their open-source MapReduce implementation, they were able to sort one terabyte of data in 209 seconds, in contrast to the previous record of 297 seconds. Not one to be left behind, Google showed that they were still quite ahead, claiming they could do the same with their closed-source implementation in only 68 seconds. The only detail they give is that they used 1,000 nodes, and since the claim is not verifiable or mentioned on the official Sort Benchmark Home Page, it may not be worthwhile to discuss it further.
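
A quick back-of-the-envelope check of the figures quoted above, using only the numbers as reported (a sketch; the throughput figures it prints are rounded):

 records = 10**10        # 10 billion records, per the benchmark spec
 record_size = 100       # bytes per record
 total_bytes = records * record_size
 print(total_bytes)      # 10**12 bytes = 1 terabyte, as expected

 for name, seconds in [("Hadoop", 209), ("Google (claimed)", 68)]:
     # Aggregate sort throughput implied by each reported time.
     print(name, round(total_bytes / seconds / 1e9, 1), "GB/s")
 # Hadoop: ~4.8 GB/s across 1,820 cores; Google's claim implies ~14.7 GB/s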