Difference between revisions of "CSC352 Project 3"
(Created page with ' __TOC__ <bluebox> This is the extension of Project #2, which is built on top of the [[http://cs.smith.edu/dftwiki/index.php/Hadoop/MapReduce_Tutorials| H…') |
|||
(20 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | |||
__TOC__ | __TOC__ | ||
<bluebox> | <bluebox> | ||
− | This is the extension of [[CSC352_Project_2 | Project #2]], which is built on top of the [[ | + | This is the extension of [[CSC352_Project_2 | Project #2]], which is built on top of the [[Hadoop/MapReduce_Tutorials| Hadoop/Mapreduce Tutorials]]. It is due on the last day of Exams, at 4:00 p.m. |
</bluebox> | </bluebox> | ||
+ | <onlysmith> | ||
=The Big Picture= | =The Big Picture= | ||
+ | {| | ||
+ | | | ||
<tanbox> | <tanbox> | ||
Your project should present your answers to the following three questions: | Your project should present your answers to the following three questions: | ||
Line 13: | Line 15: | ||
* How does this compare to the execution time of the 5 Million pages on an XGrid system? | * How does this compare to the execution time of the 5 Million pages on an XGrid system? | ||
</tanbox> | </tanbox> | ||
+ | | | ||
+ | [[Image:cherriesXparent.gif|right|100px]] | ||
+ | |} | ||
+ | <br /> | ||
− | =Assignment (same as for the XGrid)= | + | =Assignment (same as for the XGrid Project)= |
* Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page. | * Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page. | ||
Line 21: | Line 27: | ||
* Measure the execution time of the program | * Measure the execution time of the program | ||
* Write a summary of the approach, following the guidelines presented in class (3/9, 3/11). | * Write a summary of the approach, following the guidelines presented in class (3/9, 3/11). | ||
− | * Submit a pdf with your presentation, graphs, and analysis. | + | * Submit a pdf with your presentation, graphs, and analysis. |
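The per-page processing described in the list above can be sketched as follows. This is only an illustration, not the required implementation: the stop-word list is a tiny hypothetical one, the function names are made up for this example, and the category extraction assumes that categories appear in the page text as wiki links of the form <nowiki>[[Category:Name]]</nowiki>.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real run would use a fuller one.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})

WORD_RE = re.compile(r"[a-z]+")
# Assumption: categories are written as [[Category:Name]] or [[Category:Name|sortkey]].
CATEGORY_RE = re.compile(r"\[\[Category:\s*([^\]|]+)")


def page_categories(text):
    """Return the category names found in the page text, in order of appearance."""
    return [name.strip() for name in CATEGORY_RE.findall(text)]


def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent lowercase words in text, skipping stop words."""
    words = (w for w in WORD_RE.findall(text.lower()) if w not in stop_words)
    return [word for word, _ in Counter(words).most_common(n)]
```

In a MapReduce setting, a function like <code>top_words</code> would typically be called from the map step, once per page.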
− | + | =Project Details= | |
==Accessing Wiki Pages== | ==Accessing Wiki Pages== | ||
Line 72: | Line 69: | ||
The list above contains roughly 590 files. | The list above contains roughly 590 files. | ||
− | Back to the '''wikipages''' directory, there are a few other files along with the 00, 01, ... 99 directories. They are named '''all_00.xml''', '''all_01.xml''', and so on. These are '''files''', and not '''directories'''. The file all_00.xml for example, is a text file that contains all the xml files that are stored in 00/00, 00/01, 00/02, ... 00/99, all 58,000 | + | Back in the '''wikipages''' directory, there are a few other files alongside the 00, 01, ... 99 directories. They are named '''all_00.xml''', '''all_01.xml''', and so on. These are '''files''', not '''directories'''. The file all_00.xml, for example, is a text file that contains all the xml files that are stored in 00/00/, 00/01/, 00/02/, ... 00/99/, all 58,000 files listed one after the other in one long text file. Because they are in xml, each file is sandwiched between '''<xml>''' and '''</xml>'''. The same holds for all_01.xml, which contains about 58,000 pages in xml, listed one after the other, in one big text file. |
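Since each page in a block file such as all_00.xml is sandwiched between '''<xml>''' and '''</xml>''', a program can recover the individual pages by scanning for these two tags. The sketch below shows one way to do it. It assumes that each tag sits alone on its own line, which may not be true of the actual files, and the function name <code>iter_pages</code> is made up for this example.

```python
def iter_pages(lines, start_tag="<xml>", end_tag="</xml>"):
    """Yield the text of each page found between start_tag and end_tag.

    Assumption: each tag appears alone on its own line.
    Lines outside any <xml> ... </xml> pair are ignored.
    """
    page = None  # None means we are currently outside a page
    for line in lines:
        stripped = line.strip()
        if stripped == start_tag:
            page = []  # start collecting a new page
        elif stripped == end_tag and page is not None:
            yield "".join(page)
            page = None
        elif page is not None:
            page.append(line)
```

Because the function accepts any iterable of lines, it can be given an open file object directly, so a large block file is read one line at a time rather than loaded whole into memory.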
− | Remember, all of these files are on the '''local disk''' of hadoop6. | + | Remember, all of these files are on the '''local disk''' of hadoop6. Your MapReduce/Hadoop programs can only work on files stored in HDFS. |
===HDFS=== | ===HDFS=== | ||
− | |||
− | Everything wiki related is in the HDFS directory '''wikipages'''. Not all 5 million pages are there, | + | Everything wiki related is in the HDFS directory '''wikipages'''. Not all 5 million pages are there, because it is unclear whether hadoop1 through hadoop5 would have enough room on their disks to keep 5 million pages replicated with a factor of 2. |
+ | |||
+ | We do have, however, the contents of 00/00/ (about 590 files) and of 00/01/ (also about 590 files) in HDFS. The listing below shows where they are. | ||
+ | |||
+ | In addition, we have a directory '''wikipages/few/''' with just 4 wiki pages (good for debugging), and we also have one directory called '''wikipages/block''' with 4 of the large blocks of xml: ''all_00.xml'', ''all_01.xml'', ''all_02.xml'', and ''all_03.xml''. | ||
− | + | The listing below shows how to access all the files. | |
<code><pre> | <code><pre> | ||
Line 123: | Line 123: | ||
</pre></code> | </pre></code> | ||
− | + | You are free to put additional wiki pages from the local disk of hadoop6 into HDFS, but if you do so, put them in the '''wikipages''' directory, and update the README_dft.txt file in the HDFS wikipages directory with information about what you have added and how to access it. Thanks! |
− | + | ===Web Server=== | |
− | + | Of course, all the pages are still available on XGridMac, as they were for Project 2. It is up to you to figure out if it is worth exploring writing MapReduce programs that would gather the pages from the Web rather than from HDFS. | |
− | + | =Submission= | |
− | + | Submit a pdf (and additional files if needed) as follows: | |
− | + | submit project3 project3.pdf | |
− | + | Submit your programs, even if they are the same as the files you submitted for previous homework or projects. | |
− | + | submit project3 file1 | |
− | + | submit project3 file2 | |
− | + | ... | |
− | + | :'''Note''': You cannot submit directories with the '''submit''' command. If you want to submit the contents of a whole directory, then proceed as follows: | |
− | + | cd ''theDirectoryWhereAllTheFilesYouWantToSubmitReside'' | |
− | + | tar -czvf ''yourFirstNameProject3.tgz'' * | |
− | + | submit project3 ''yourFirstNameProject3.tgz'' | |
− | = | + | =Extra Credits= |
− | + | Extra credit will be given for some work done on AWS. This could be the whole project or sections of it, or just a comparison on some of the input sets. |
− | + | </onlysmith> | |
− | |||
<br /> | <br /> | ||
Line 210: | Line 160: | ||
<br /> | <br /> | ||
<br /> | <br /> | ||
− | [[Category:CSC352]][[Category: | + | [[Category:CSC352]][[Category:Project]][[Category:MapReduce]][[Category:XGrid]] |
Latest revision as of 12:07, 18 November 2010
This is the extension of Project #2, which is built on top of the Hadoop/Mapreduce Tutorials. It is due on the last day of Exams, at 4:00 p.m.