Difference between revisions of "CSC352 MapReduce/Hadoop Class Notes"

From dftwiki3
  
 
* '''Each map task runs the user-defined map function for each record of a split'''.
 
* Hadoop does its best to run each map task on the node where its split resides, '''but this is not always possible'''.
* The '''sorted''' map outputs are transferred across the network to the node where the reduce task is running. These '''sorted''' outputs are '''merged''' and fed to the user-defined '''reduce function'''.
* The '''output''' of the '''reduce task''' is stored in '''HDFS'''.
* When there are several reducers, each map task '''partitions''' its output. There is '''one''' partition per '''reduce task'''.
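
The data flow described above can be imitated on a single machine with a few lines of Python. This is only a rough sketch and not how Hadoop itself is implemented: the names <code>map_fn</code>, <code>reduce_fn</code>, <code>partition</code>, <code>run_map_task</code>, <code>run_reduce_task</code> and <code>run_job</code> are made up for illustration, the splits are plain Python lists, the "network transfer" is just a list lookup, and the partitioner is a toy hash. The example job is a word count.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_fn(record):
    """User-defined map (hypothetical): emit (word, 1) for each word of a record."""
    for word in record.split():
        yield word, 1


def reduce_fn(key, values):
    """User-defined reduce (hypothetical): sum the counts collected for one key."""
    return key, sum(values)


def partition(key, num_reducers):
    """Toy stand-in for a hash partitioner: the same key always gets the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def run_map_task(split, num_reducers):
    """Run map_fn on each record of a split; return one sorted list per partition."""
    partitions = defaultdict(list)
    for record in split:
        for key, value in map_fn(record):
            partitions[partition(key, num_reducers)].append((key, value))
    return {p: sorted(pairs) for p, pairs in partitions.items()}


def run_reduce_task(sorted_runs):
    """Merge the sorted runs coming from all map tasks, then reduce each key group."""
    merged = heapq.merge(*sorted_runs)
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


def run_job(splits, num_reducers):
    """Simulate the whole flow: map, partition, sort, transfer, merge, reduce."""
    map_outputs = [run_map_task(split, num_reducers) for split in splits]
    results = {}
    for r in range(num_reducers):
        # Simulated transfer: reducer r collects partition r from every map task.
        runs = [output.get(r, []) for output in map_outputs]
        results[r] = run_reduce_task(runs)
    return results


# Two splits (one map task each) and two reduce tasks.
splits = [["the cat sat", "the dog"], ["the cat ran"]]
job_output = run_job(splits, num_reducers=2)
```

Because every map task uses the same <code>partition</code> function, all the pairs for a given key end up at the same reduce task, so each key appears in the output of exactly one reducer.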
=== Examples of Data Flows ===

<center>
[[Image:MapReduceDataFlowOneReduce.png]]
</center>

<center>
[[Image:MapReduceDataFlowTwoReduces.png]]
</center>

<center>
[[Image:MapReduceDataFlowNoReduce.png]]
</center>
  
  

Revision as of 17:55, 31 March 2010

