CSC352 MapReduce/Hadoop Class Notes
 
* '''Phoenix''' <ref name="phoenix">Evaluating MapReduce for Multi-core and Multiprocessor Systems, Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, Computer Systems Laboratory, Stanford University, http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf</ref>

::Another implementation of MapReduce, named Phoenix [2], was created to make optimal use of multi-core and multiprocessor systems.  Written in C++, Phoenix applies the MapReduce programming paradigm to dense integer matrix multiplication, as well as to linear regression and principal component analysis on a matrix (among other applications).  The following graph shows the speedup it achieves over a varying number of cores:
 
<center>[[Image:MapReduceMultiCorePerformance.png]]</center>
  
 
<br />
  
==The Computation: micro view==
  
 
The images in this section are mostly taken from the excellent Yahoo tutorial (Module 4) on MapReduce <ref name="YahooMapReduceTutorial4">Yahoo Tutorial, Module 4, MapReduce Basics http://developer.yahoo.com/hadoop/tutorial/module4.html</ref>
 
<br />
  
==Important Statements To Remember==

Taken from <ref name="hadoopGuide">[http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979 Hadoop, the definitive guide], Tom White, O'Reilly Media, June 2009, ISBN 0596521979. The Web site for the book is http://www.hadoopbook.com/</ref>

* A '''MapReduce job''' is a unit of work submitted by a client.

* A '''job''' contains (see the driver sketch after this list)<br /><br />
** the data<br /><br />
** a MapReduce program<br /><br />
** a configuration<br /><br />

* Hadoop divides a '''job''' into '''tasks''', of which there are two kinds: '''map tasks''' and '''reduce tasks'''.

* There are two types of '''nodes''': a '''jobTracker''' node, which oversees the execution of a '''job''', and '''taskTracker''' nodes, which execute '''tasks'''.

* Hadoop divides the '''input''' into '''splits'''.

* '''Hadoop creates one map task for each split.'''

* A split is '''64 MB''' by default.

* '''Each map task runs the user-defined map function for each record of a split.'''

* Hadoop does its best to run the map task on the node where the split resides, '''but this is not always possible'''.

* The '''sorted''' map outputs are transferred across the network to where the reduce task is running.  These '''sorted''' outputs are '''merged''' and fed to the user-defined '''reduce function'''.

* The '''output''' of the '''reduce task''' is stored in '''HDFS'''.

* When there are many reducers, the map tasks '''partition''' their output into '''partitions'''.  There is '''one''' partition per '''reduce task'''.

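The statements above can be tied together with a small driver sketch.  This is only a hypothetical illustration, not the code used in the labs: it assumes the old org.apache.hadoop.mapred API, a made-up driver class name WordCountDriver, and the MapClass and Reduce classes shown further down this page (in the real WordCount.java the driver and those classes live in the same file).  The JobConf object is the configuration; together with the input and output paths (the data) and the map and reduce classes (the program), it forms the '''job''' submitted to the '''jobTracker'''.

<source lang="java">
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver sketch: bundles the data, the MapReduce program,
// and the configuration into one job and submits it to the jobTracker.
public class WordCountDriver {

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCountDriver.class);   // the configuration
    conf.setJobName("wordcount");

    // The MapReduce program: the map and reduce classes shown further down.
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    // Types of the (key, value) pairs emitted by map and reduce.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // The data: Hadoop derives the input splits from these paths, and the
    // output of the reduce tasks is written to the output path in HDFS.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // One partition per reduce task: here the map output is split two ways.
    conf.setNumReduceTasks(2);

    // Submit the job and wait for completion.
    JobClient.runJob(conf);
  }
}
</source>

With this configuration Hadoop creates one map task per input split and two reduce tasks, each receiving one partition of the sorted map output.
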
==Examples of Data Flows==

Taken from <ref name="hadoopGuide" />

<center>
[[Image:MapReduceDataFlowOneReduce.png]]
</center>
The shaded boxes are nodes.  The dotted arrows show data transfers within a node.  The heavy arrows show transfers across nodes.

<center>
[[Image:MapReduceDataFlowTwoReduces.png]]
</center>
The general case has several reduce tasks.  In this case the outputs of the map tasks are shuffled: each reduce task receives output from many map tasks, and all the values for a given key end up at the same reduce task.
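
Which reduce task receives a given key is decided by a '''partition function''' applied to the map output.  The class below is a hypothetical sketch, not code from the book or the labs; it simply mirrors the behavior of Hadoop's default hash partitioner, written against the old org.apache.hadoop.mapred API used elsewhere on this page.

<source lang="java">
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner for the WordCount (word, count) pairs.  Every map
// task applies it to each (key, value) it emits, so all the values for a
// given word end up in the same partition, i.e. at the same reduce task.
public class WordPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the hash is non-negative, then map it onto
    // one of the numReduceTasks partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
</source>

A custom partitioner like this one would be registered on the job with conf.setPartitionerClass(WordPartitioner.class); when none is given, Hadoop partitions keys by their hash code in exactly this way.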

<center>
[[Image:MapReduceDataFlowNoReduce.png]]
</center>
It is also possible to have 0 reduce tasks, in which case the shuffle is skipped and each map task writes its output directly to HDFS.
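
In the old mapred API, a map-only job would be requested on the job configuration.  The fragment below is hypothetical and simply continues the driver sketch shown earlier (it assumes the same conf object):

<source lang="java">
// Hypothetical fragment, continuing the earlier driver sketch ("conf" is the JobConf).
// With zero reduce tasks there is no sort/shuffle phase and no reduce phase:
// each map task writes its (key, value) output directly to HDFS.
conf.setNumReduceTasks(0);
</source>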

==The Computation: macro view==
 
<center>[[Image:MapReduceDataFlow.png]]</center>
 
<br />
  
 
<center>[[Image:wordcountMapReduceBlockDiagram.png]]</center>
  
 
===The Map Function, simplified===
 
   emit (word, sum)
  
===The Whole Program===
[[Hadoop WordCount.java | WordCount.java]]

===The Map and Reduce Java Blocks===
 
 
<source lang="java">
  /**
   * A mapper class that emits (word, 1) for every word of the input line.
   */
  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
</source>
 
  
 
<br />
 
[[Image:ComputerLogo.png|100px|right]]
 
;Lab Experiment 1
:Jump to the first Hadoop/MapReduce [[Hadoop_Tutorial_1_--_Running_WordCount | Lab #1]]!  Run all the sections up to, but not including, Section 4.
 
</greenbox>
  
 
</pre></code>
 
|}

=Compiling Your Own Version of the WordCount Program=

* This is illustrated and explained in Section 4 of [[Hadoop_Tutorial_1_--_Running_WordCount#Running_Your_Own_Version_of_WordCount.java | Tutorial #1: Compiling your own version of WordCount.java]]

=How Does Hadoop on 6 Nodes Compare to Linux on 1 Node?=

* This is very interesting!

<greenbox>
[[Image:ComputerLogo.png|100px|right]]
;Lab Experiment 2
:Jump to Section 5 of the [[Hadoop_Tutorial_1_--_Running_WordCount | Hadoop Lab #1]] and see how Hadoop compares with basic Linux for Ulysses, and for Ulysses plus 5 other books.
</greenbox>
<br />
<br />

;Question 1
: Comment on the timings you observe for 1 book and for 6 books.

;Question 2
: There are 4 large files in the HDFS, in '''wikipages/block/'''.  Each is approximately 180 MBytes in size.  Run another experiment and compare the execution time of Hadoop on these 4 files (~3/4 GByte) with the execution time of one of the Linux boxes processing the same 4 files with Linux commands.

=Generating Task Timelines=

<br />
<br />
<greenbox>
[[Image:ComputerLogo.png|right|100px]]
;Lab Experiment #3:
: [[Hadoop Tutorial 1.1 -- Generating Task Timelines | Tutorial 1.1]] on generating '''Timelines'''.
</greenbox>
<br />
<br />

=Debugging/Testing Using Counters=

Section 6 of [[Hadoop_Tutorial_1_--_Running_WordCount#Counters | Tutorial #1]] shows how to create counters.  Hadoop counters are special variables whose values are gathered from each task, accumulated by the framework, and reported both during the computation and at its end.  They are useful for counting quantities such as the amount of data processed or the number of tasks executed.
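
For example, the map function of the WordCount program shown earlier could increment counters through the '''Reporter''' object of the old mapred API.  This is only a hypothetical sketch, not the tutorial's code; the group and counter names ("WordCount", "MAP_TASKS", "INPUT_LINES") are made up for illustration.

<source lang="java">
    private boolean firstRecord = true;   // true until this map task sees its first record

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      if (firstRecord) {
        // Incremented once per map task; summed over all tasks, it counts the map tasks.
        reporter.incrCounter("WordCount", "MAP_TASKS", 1);
        firstRecord = false;
      }
      // Incremented once for every input record processed by any map task.
      reporter.incrCounter("WordCount", "INPUT_LINES", 1);

      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
</source>

Hadoop reports the accumulated values of these counters at the end of the job, next to its built-in counters.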

<br />
<br />
<greenbox>
[[Image:ComputerLogo.png|right|100px]]
;Lab Experiment #4:
: [[Hadoop_Tutorial_1_--_Running_WordCount#Counters | Tutorial 1 on Counters]].  Create counters in your Java version of WordCount and count the number of Map tasks and the number of Reduce tasks.
</greenbox>
<br />
<br />

=Running WordCount in Python=

<br />
<br />
<greenbox>
[[Image:ComputerLogo.png|right|100px]]
;Lab Experiment #5:
: [[Hadoop Tutorial 2 -- Running WordCount in Python | Tutorial 2]] on running Python programs with MapReduce/Hadoop.
</greenbox>
<br />
<br />
  
 
=References=
 
<br />
<br />
[[Category:CSC352]][[Category:Class Notes]][[Category:MapReduce]][[Category:Hadoop]]