Difference between revisions of "CSC352 MapReduce/Hadoop Class Notes"

From dftwiki3
Jump to: navigation, search
(History)
(History)
Line 17: Line 17:
 
* MapReduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers. [wikipedia]
 
* MapReduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers. [wikipedia]
 
* 2010 first conference: The First International Workshop on MapReduce and its Applications (MAPREDUCE'10). (http://graal.ens-lyon.fr/mapreduce/) Interesting tidbit: nobody from Google on planning committee.  Mostley INRIA
 
* 2010 first conference: The First International Workshop on MapReduce and its Applications (MAPREDUCE'10). (http://graal.ens-lyon.fr/mapreduce/) Interesting tidbit: nobody from Google on planning committee.  Mostley INRIA
 +
 +
=Sibblings=
 +
* Hadoop, open source.  Wins Terabyte sorting competition in 2008 (http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html)
 +
::Apache Hadoop Wins Terabyte Sort Benchmark<br /> One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop with 13,000+ nodes running hundreds of thousands of jobs a month and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.

Revision as of 12:16, 29 March 2010

Outline

  • History
  • Infrastructure
  • Submitting a Job
  • Smith Cluster
  • Example 1: Java
  • Example 2: Python
  • Useful Commands

References

History

  • Introduced in 2004
  • MapReduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers. [wikipedia]
  • 2010 first conference: The First International Workshop on MapReduce and its Applications (MAPREDUCE'10). (http://graal.ens-lyon.fr/mapreduce/) Interesting tidbit: nobody from Google on planning committee. Mostley INRIA

Sibblings

Apache Hadoop Wins Terabyte Sort Benchmark
One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop with 13,000+ nodes running hundreds of thousands of jobs a month and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.