CSC352 MapReduce/Hadoop Class Notes

* '''Each map task runs the user-defined map function for each record of a split''' (a minimal Java sketch of such map and reduce functions appears right after this list).

* Hadoop does its best to run the map task on the node where the split resides, '''but this is not always the case'''.

* The '''sorted''' map outputs are transferred across the network to where the reduce task is running. These '''sorted''' outputs are '''merged''' and fed to the user-defined '''reduce function'''.

* The '''output''' of the '''reduce task''' is stored in '''HDFS'''.

* When there are many reducers, the map tasks '''partition''' their output into '''partitions'''. There is '''one''' partition per '''reduce task'''.

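The sketch below is not part of the original notes; it is a minimal illustration, assuming the Hadoop 0.20 Java API (<code>org.apache.hadoop.mapreduce</code>) and a word-count job, of where the user-defined '''map''' and '''reduce''' functions fit into this flow. The framework calls <code>map()</code> once per record of a split, sorts and merges the intermediate pairs by key, and calls <code>reduce()</code> once per key on the reducer's node. The class names are illustrative.

<pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each public class would normally live in its own .java file.

// The map function is called once per record (here, one line) of the input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate (key, value) pair
        }
    }
}

// The reduce function receives each key together with all of its values,
// after the sorted map outputs have been merged on the reducer's node.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // the reduce output goes to HDFS
    }
}
</pre>
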
=== Examples of Data Flows ===

<center>
[[Image:MapReduceDataFlowOneReduce.png]]
</center>


<center>
[[Image:MapReduceDataFlowTwoReduces.png]]
</center>


<center>
[[Image:MapReduceDataFlowNoReduce.png]]
</center>

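The three figures above differ only in the number of reduce tasks. The hypothetical driver sketch below (again assuming the Hadoop 0.20 API; the class name <code>WordCountDriver</code> and the command-line arguments for the input and output paths are illustrative, and the mapper and reducer are the ones sketched earlier) selects among these data flows with <code>Job.setNumReduceTasks()</code>. With several reducers, the default <code>HashPartitioner</code> decides which partition each intermediate key goes to.

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: the number of reduce tasks selects which of the
// three data flows pictured above the job will follow.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One reduce task (first figure): all sorted map outputs go to a single reducer.
        job.setNumReduceTasks(1);

        // Two reduce tasks (second figure): each map task partitions its sorted
        // output into two partitions, one per reducer (HashPartitioner by default).
        // job.setNumReduceTasks(2);

        // Zero reduce tasks (third figure): map output is written directly to HDFS.
        // job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</pre>
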