Difference between revisions of "CSC352 Problem of the Day"

From dftwiki3

Latest revision as of 08:20, 27 April 2010

Homework #4, Problem #2

  • Conditions:
    • Processing of wiki pages with Hadoop on 6-PC cluster
    • Same Mapper and same Reducer program to process two different input folders



Number of files   Number of wiki pages   Number of categories   Execution time (seconds)
589               589                    832                    388
1                 117,617                51,120                 30.7
Ratio = 589/1     Ratio = 1/199.7        Ratio = 1/61.4         Ratio = 12.6/1
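
The ratios in the table can be rechecked with a few lines of Python. This is only a sketch: the numbers are copied from the two data rows above, while the run labels and the helper names (`ratio`, `pages_per_second`) are made up for illustration. Note that the wiki-page ratio works out to about 1/199.7.

```python
# Values copied from the two data rows of the table.
# The labels "many_small_files" and "one_large_file" are illustrative names.
runs = {
    "many_small_files": {"files": 589, "pages": 589, "categories": 832, "seconds": 388.0},
    "one_large_file": {"files": 1, "pages": 117_617, "categories": 51_120, "seconds": 30.7},
}


def ratio(column):
    """Ratio of the first run to the second run for one column of the table."""
    return runs["many_small_files"][column] / runs["one_large_file"][column]


def pages_per_second(run_name):
    """Throughput of one run, in wiki pages processed per second."""
    run = runs[run_name]
    return run["pages"] / run["seconds"]


if __name__ == "__main__":
    for column in ("files", "pages", "categories", "seconds"):
        print(f"{column:>10}: ratio = {ratio(column):.4f}")
    for name in runs:
        print(f"{name}: {pages_per_second(name):.1f} pages/second")
```

Looking at pages per second rather than total time makes the gap more striking: the run with 589 small files is slower overall even though it processes about 200 times fewer pages.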



  • Discuss these results
  • If you were to add another column to this table, what quantity would you add?
  • Identify the parties responsible for this surprising difference

I would add a column showing the number of input splits. I think the main culprit is HDFS, together with the fact that a lot of data has to flow through a single Ethernet switch that is old and slow...
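
To see why the number of splits is a useful extra column, here is a rough estimate of how many splits each input folder would produce. It assumes the usual file-based input behavior in Hadoop, where every non-empty file yields at least one split and a large file is cut into pieces of roughly one block each. The file sizes below are hypothetical (the page does not give them), and the 64 MB split size is an assumed default, so treat this as an illustration rather than a measurement.

```python
import math

# Assumed split size; the actual cluster configuration is not given on this page.
ASSUMED_SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB


def estimate_splits(file_sizes_bytes, split_size_bytes=ASSUMED_SPLIT_SIZE):
    """Approximate the number of input splits for a list of file sizes.

    Each non-empty file contributes at least one split; a file larger
    than the split size contributes roughly one split per split_size bytes.
    """
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


# Hypothetical sizes, chosen only to illustrate the effect.
many_small_files = [10 * 1024] * 589      # 589 files of 10 KB each
one_large_file = [200 * 1024 * 1024]      # a single 200 MB file

if __name__ == "__main__":
    print("589 small files:", estimate_splits(many_small_files), "splits")
    print("1 large file:   ", estimate_splits(one_large_file), "splits")
```

Under these assumptions the 589-file folder produces 589 splits, and therefore 589 map tasks, each with its own startup and scheduling cost, while the single large file produces only a handful.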