Difference between revisions of "CSC352 Project 3"

From dftwiki3
(Created page with ' __TOC__ <bluebox> This is the extension of Project #2, which is built on top of the [[http://cs.smith.edu/dftwiki/index.php/Hadoop/MapReduce_Tutorials| H…')
 
 
(20 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
 
__TOC__
 
__TOC__
  
 
<bluebox>
 
<bluebox>
This is the extension of [[CSC352_Project_2 | Project #2]], which is built on top of the [[http://cs.smith.edu/dftwiki/index.php/Hadoop/MapReduce_Tutorials| Hadoop/Mapreduce Tutorials]].  It is due on the last day of Exams, at 4:00 p.m.
+
This is the extension of [[CSC352_Project_2 | Project #2]], which is built on top of the [[Hadoop/MapReduce_Tutorials| Hadoop/Mapreduce Tutorials]].  It is due on the last day of Exams, at 4:00 p.m.
 
</bluebox>
 
</bluebox>
  
 +
<onlysmith>
 
=The Big Picture=
 
=The Big Picture=
 +
{|
 +
|
 
<tanbox>
 
<tanbox>
 
Your project should present your answers to the following three questions:
 
Your project should present your answers to the following three questions:
Line 13: Line 15:
 
* How does this compare to the execution time of the 5 Million pages on an XGrid system?
 
* How does this compare to the execution time of the 5 Million pages on an XGrid system?
 
</tanbox>
 
</tanbox>
 +
|
 +
[[Image:cherriesXparent.gif|right|100px]]
 +
|}
 +
<br />
  
=Assignment (same as for the XGrid)=
+
=Assignment (same as for the XGrid Project)=
  
 
* Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page.
 
* Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page.
Line 21: Line 27:
 
* Measure the execution time of the program
 
* Measure the execution time of the program
 
* Write a summary of the approach as illustrated in the guidelines presented in class (3/9, 3/11).
 
* Write a summary of the approach as illustrated in the guidelines presented in class (3/9, 3/11).
* Submit a pdf with your presentation, graphs, and analysis. Submit your programs, even if they are the same as the files you submitted for previous homework or projects. 
+
* Submit a pdf with your presentation, graphs, and analysis.  
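
The word-count step in the list above can be sketched as follows. This is only an illustration in Python, not a reference solution for the project: the stop-word set <code>STOP_WORDS</code>, the function name <code>top_words</code>, and the tokenization rule (lowercased runs of letters) are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was", "for"}

def top_words(text, n=5):
    """Return the n most frequent non-stop words in text as (word, count) pairs."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)
```

In a MapReduce setting the same function could be applied to the text of each page inside the map step, with the page Id or title as the key.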
  
    submit project3 file1
+
=Project Details=
    submit project3 file2
 
    ...
 
 
 
:'''Note''': You cannot submit directories with the '''submit''' command.  If you want to submit the contents of a whole directory, then proceed as follows:
 
 
 
    cd ''theDirectoryWhereAllTheFilesYouWantToSubmitReside''
 
    tar -czvf  ''yourFirstNameProject3.tgz'' *
 
    submit ''yourFirstNameProject3.tgz''
 
  
=Project Details=
 
 
==Accessing Wiki Pages==
 
==Accessing Wiki Pages==
  
Line 72: Line 69:
 
The list above contains roughly 590 files.  
 
The list above contains roughly 590 files.  
  
Back to the '''wikipages''' directory, there are a few other files along with the 00, 01, ... 99 directories.  They are named '''all_00.xml''', '''all_01.xml''', and so on.  These are '''files''', and not '''directories'''.  The file all_00.xml for example, is a text file that contains all the xml files that are stored in 00/00, 00/01, 00/02, ... 00/99, all 58,000 of them listed one after the other in one long text file.  Similarly for all_01.xml, it contains about 58,000 pages in xml, listed one after the other, in one big text file.
+
Back to the '''wikipages''' directory, there are a few other files along with the 00, 01, ... 99 directories.  They are named '''all_00.xml''', '''all_01.xml''', and so on.  These are '''files''', and not '''directories'''.  The file all_00.xml for example, is a text file that contains all the xml files that are stored in 00/00/, 00/01/, 00/02/, ... 00/99/, all 58,000 files listed one after the other in one long text file.  Because they are in xml, each file is sandwiched between '''&lt;xml&gt;''' and '''&lt;/xml&gt;'''.  Similarly for all_01.xml, which contains about 58,000 pages in xml, listed one after the other, in one big text file.
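
One way to cut such a block file back into individual pages is to scan for the opening and closing tags. The following is a minimal Python sketch, assuming that each page starts on a line containing '''&lt;xml&gt;''' and ends on a line containing '''&lt;/xml&gt;''', as described above. The function name <code>iter_pages</code> is hypothetical, and a Hadoop job would need its own record-splitting logic rather than this local loop.

```python
def iter_pages(lines):
    """Yield each page (the text from <xml> through </xml>) as one string."""
    page = None  # None means we are currently between pages
    for line in lines:
        if page is None and "<xml>" in line:
            page = []  # start collecting a new page
        if page is not None:
            page.append(line)
            if "</xml>" in line:
                yield "".join(page)
                page = None
```

Because the function accepts any iterable of lines, it can be given an open file object directly, so a large block file is read one line at a time rather than loaded whole into memory.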
  
Remember, all of these files are on the '''local disk''' of hadoop6.
+
Remember, all of these files are on the '''local disk''' of hadoop6.  Your MapReduce/Hadoop programs can only work on files stored in HDFS.
  
 
===HDFS===
 
===HDFS===
  
So what's on the HDFS of our cluster?
 
  
Everything wiki related is in the HDFS directory '''wikipages'''.  Not all 5 million pages are there, but we have the contents of 00/00/ (about 590 files) and of 00/01/ (also about 590 files).
+
Everything wiki related is in the HDFS directory '''wikipages'''.  Not all 5 million pages are there, because it is unclear whether hadoop1 through hadoop5 would have enough room on their disks to keep 5 million pages replicated with a factor of 2...
 +
 
 +
We do have, however, the contents of 00/00/ (about 590 files) and of 00/01/ (also about 590 files) in HDFS.  The listing below shows where they are.
 +
 
 +
In addition we have a directory '''wikipages/few/''' with just 4 wiki pages (good for debugging), and we also have one directory called '''wikipages/block''' with 4 of the large blocks of xml: ''all_00.xml'', ''all_01.xml'', ''all_02.xml'', and ''all_03.xml''.
  
In addition we have a directory '''wikipages/few/''' with just 4 pages (good for debugging), and one directory, '''wikipages/block''' with 4 of the large blocks of xml: ''all_00.xml'', ''all_01.xml'', ''all_02.xml'', and ''all_03.xml''.
+
The listing below shows how to access all the files.
  
 
<code><pre>
 
<code><pre>
Line 123: Line 123:
 
</pre></code>
 
</pre></code>
  
:::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Count=10
+
You are free to put additional wiki pages from the local disk of hadoop6 into HDFS, but if you do so, put them in the '''wikipages''' directory, and update the README_dft.txt file in the HDFS wikipages directory with information about what you have added and how to access it.  Thanks!
  
The output will be:
+
===Web Server===
  
  10000
+
Of course, all the pages are still available on XGridMac, as they were for Project 2. It is up to you to figure out if it is worth exploring writing MapReduce programs that would gather the pages from the Web rather than from HDFS.
10050000
 
10070000
 
10140000
 
10200000
 
10230000
 
1030000
 
10320000
 
1040000
 
10430000
 
  
To get the page with Id 1000, for example, then we access the Web server at the same address, but with a different ''request'':
 
  
:::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Id=1000
+
=Submission=
  
The output is:
+
Submit a pdf (and additional files if needed) as follows:
  
<code><pre>
+
  submit project3 project3.pdf
<xml>
 
<title>Hercule Poirot</title>
 
<id>1000</id>
 
<contributors>
 
<contrib>
 
<username>TXiKiBoT</username>
 
<id>3171782</id>
 
  
<length>51946</length></contrib>
+
Submit your programs, even if they are the same as the files you submitted for previous homework or projects.  
</contributors>
 
<categories>
 
<cat>Hercule Poirot</cat>
 
<cat>Fictional private investigators</cat>
 
<cat>Series of books</cat>
 
<cat>Hercule Poirot characters</cat>
 
<cat>Fictional Belgians</cat>
 
</categories>
 
<pagelinks>
 
<page></page>
 
<page>16 July</page>
 
<page>1916</page>
 
<page>1989</page>
 
<page>2011</page>
 
<page>A. E. W. Mason</page>
 
<page>Academy Award</page>
 
...
 
<page>private detective</page>
 
<page>refugee</page>
 
<page>retroactive continuity</page>
 
<page>turnip pocket watch</page>
 
</pagelinks>
 
<text>
 
. Belgium Belgian . occupation = Private Dectective. Former Retired DetectiveFormer Police Police officer officer .
 
... (lots of text removed here...)
 
. Hercule Poirot . uk. Еркюль Пуаро . vi. Hercule Poirot . zh. 赫丘勒·白羅 .
 
</text>
 
</xml>
 
</pre></code>
 
  
In general, the page will have several sections, coded in XML, and always in the same order:
+
    submit project3 file1
* the title, in '''&lt;title&gt;''' tags,
+
    submit project3 file2
* the contributors, in '''&lt;contributors&gt;''' and '''&lt;contrib&gt;''' tags,
+
    ...
* the categories the page belongs to, in '''&lt;categories&gt;''' and '''&lt;cat&gt;''' tags,
 
* the links to other wikipedia pages the page contains, in '''&lt;pagelinks&gt;''' and '''&lt;page&gt;''' tags,
 
* the text of the page, with all the html and wiki tags removed, between '''&lt;text&gt;''' tags.
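
The sections listed above can be pulled out of a page with a few regular expressions. This is a hedged Python sketch rather than the method used in the course: the function name <code>parse_page</code> and the returned dictionary layout are assumptions, and regular expressions are used instead of an XML parser on the assumption that the text section may not always be well-formed XML.

```python
import re

def parse_page(page):
    """Extract the title, categories and text from one page string."""
    title = re.search(r"<title>(.*?)</title>", page, re.S)
    text = re.search(r"<text>(.*?)</text>", page, re.S)
    return {
        "title": title.group(1).strip() if title else None,
        "categories": re.findall(r"<cat>(.*?)</cat>", page, re.S),
        "text": text.group(1).strip() if text else "",
    }
```

The page links in '''&lt;page&gt;''' tags could be collected the same way as the categories if a program needs them.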
 
  
The end of the text section always contains foreign characters.  The text should be encoded in UTF-8, an international character encoding of which ASCII is a subset.
+
:'''Note''': You cannot submit directories with the '''submit''' command.  If you want to submit the contents of a whole directory, then proceed as follows:
  
===CGI Program===
+
    cd ''theDirectoryWhereAllTheFilesYouWantToSubmitReside''
Just for information, the CGI program that processes the request is available [[CSC352 getWikiPageById.cgi | here]].
+
    tar -czvf  ''yourFirstNameProject3.tgz'' *
</onlysmith>
+
    submit project3  ''yourFirstNameProject3.tgz''
  
==Submission==
+
=Extra Credits=
  
Submit a pdf (and additional files if needed) as follows:
+
Extra credit will be given for some work done on AWS.  This could be the whole project or sections of it, or just a comparison on some of the input sets.
 
+
</onlysmith>
  submit project2 project2.pdf
 
  
 
<br />
 
<br />
Line 210: Line 160:
 
<br />
 
<br />
 
<br />
 
<br />
[[Category:CSC352]][[Category:Projects]][[Category:XGrid]]
+
[[Category:CSC352]][[Category:Project]][[Category:MapReduce]][[Category:XGrid]]

Latest revision as of 13:07, 18 November 2010


This is the extension of Project #2, which is built on top of the Hadoop/Mapreduce Tutorials. It is due on the last day of Exams, at 4:00 p.m.


This section is only visible to computers located at Smith College