CSC352 Project 3
This is an extension of Project #2, which was built on top of the [Hadoop/MapReduce Tutorials]. It is due on the last day of Exams, at 4:00 p.m.
The Big Picture
Your project should present your answers to the following three questions:
- How should one attempt to process 5 Million Wikipedia pages with MapReduce/Hadoop? What parameters control the execution time, and what are the best estimates of the values they should be set to?
- How long is the processing of 5 Million pages estimated to take under the conditions specified above?
- How does this estimate compare to the execution time for the same 5 Million pages on an XGrid system?
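To make the first question concrete: in a Hadoop 0.20-style driver, the one parameter you set directly is the number of reduce tasks; the number of map tasks follows from the number of input splits (one per small file, or roughly one per HDFS block of a large file). The sketch below is only meant to show where these knobs live; the class name, the value 2, and the absence of explicit mapper/reducer classes are placeholders, not part of the assignment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WikiDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wiki categories");
        job.setJarByClass(WikiDriver.class);

        // The mapper and reducer classes would be set here, e.g.
        // job.setMapperClass(...) and job.setReducerClass(...);
        // without them Hadoop runs identity map/reduce, which is still
        // useful for timing the raw I/O of a given input set.

        // Main tuning knob: the number of reduce tasks. The number of
        // map tasks is not set here; it is determined by the input splits
        // (one per small file, or about one per HDFS block of a big file).
        job.setNumReduceTasks(2);   // placeholder value, to be measured

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. wikipages/few
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}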
Assignment (same as for the XGrid Project)
- Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page.
- Associate with each category the most frequent words that have been associated with it over the N pages processed (a sketch of one possible mapper/reducer pair for these two steps appears at the end of this section).
- Output the result (or a sample of it)
- Measure the execution time of the program
- Write a summary of the approach, following the guidelines presented in class (3/9, 3/11).
- Submit a pdf with your presentation, graphs, and analysis. Submit your programs, even if they are the same as the files you submitted for previous homework or projects.
submit project3 file1
submit project3 file2
...
- Note: You cannot submit directories with the submit command. If you want to submit the contents of a whole directory, then proceed as follows:
cd theDirectoryWhereAllTheFilesYouWantToSubmitReside
tar -czvf yourFirstNameProject3.tgz *
submit project3 yourFirstNameProject3.tgz
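For reference, here is a rough sketch of how the first two bullets could be split into a map step and a reduce step. It assumes an input format that hands map() one whole wiki page at a time (getting such records out of the data is discussed under Project Details); the stop-word list, the regular-expression parsing, and the class names are illustrative placeholders. As a simplification the mapper emits every non-stop word of the page rather than only the page's 5 most frequent ones; that per-page filter, which the assignment asks for, would be added in map().

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WikiCategories {

    // Mapper: for each page, emit (category, word) for every non-stop word
    // of the <text> section, once per category the page belongs to.
    public static class PageMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Set<String> STOP = new HashSet<String>(
                Arrays.asList("the", "a", "of", "and", "in", "to", "is"));
        private static final Pattern CAT = Pattern.compile("<cat>(.*?)</cat>");
        private static final Pattern TXT =
                Pattern.compile("<text>(.*?)</text>", Pattern.DOTALL);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String page = value.toString();

            // collect the page's categories
            List<String> cats = new ArrayList<String>();
            Matcher c = CAT.matcher(page);
            while (c.find()) cats.add(c.group(1));

            Matcher t = TXT.matcher(page);
            if (cats.isEmpty() || !t.find()) return;

            // tokenize the text, skip stop words, emit one pair per category
            for (String w : t.group(1).toLowerCase().split("[^a-z]+")) {
                if (w.length() < 2 || STOP.contains(w)) continue;
                for (String cat : cats) {
                    context.write(new Text(cat), new Text(w));
                }
            }
        }
    }

    // Reducer: for each category, count how often each word arrived and
    // keep the 5 most frequent ones.
    public static class CategoryReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text category, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (Text w : words) {
                String s = w.toString();
                counts.put(s, counts.containsKey(s) ? counts.get(s) + 1 : 1);
            }
            List<Map.Entry<String, Integer>> sorted =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a,
                                   Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue();
                }
            });
            StringBuilder top = new StringBuilder();
            for (int i = 0; i < Math.min(5, sorted.size()); i++) {
                if (i > 0) top.append(' ');
                top.append(sorted.get(i).getKey());
            }
            context.write(category, new Text(top.toString()));
        }
    }
}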
Project Details
Accessing Wiki Pages
Local Disk
The 5 Million Wikipedia pages are on the local disk of hadoop6.
hadoop@hadoop6:~$ cd 352/
hadoop@hadoop6:~/352$ ls
dft wikipages
hadoop@hadoop6:~/352$ cd wikipages/
hadoop@hadoop6:~/352/wikipages$ ls
00 07 14 21 28 35 42 49 56 63 70 77 84 91 98 all_05.xml
01 08 15 22 29 36 43 50 57 64 71 78 85 92 99 all_06.xml
02 09 16 23 30 37 44 51 58 65 72 79 86 93 all_00.xml all_07.xml
03 10 17 24 31 38 45 52 59 66 73 80 87 94 all_01.xml all_08.xml
04 11 18 25 32 39 46 53 60 67 74 81 88 95 all_02.xml
05 12 19 26 33 40 47 54 61 68 75 82 89 96 all_03.xml
06 13 20 27 34 41 48 55 62 69 76 83 90 97 all_04.xml
Each of the 00, 01, 02, ..., 99 directories contains 100 subdirectories, also named 00, 01, 02, ..., 99, and each one of these contains a collection of wiki pages in XML.
hadoop@hadoop6:~/352/wikipages$ cd 00/00
hadoop@hadoop6:~/352/wikipages/00/00$ ls
10000.xml 14500000.xml 19670000.xml 24100000.xml 5240000.xml
10050000.xml 1450000.xml 19680000.xml 24130000.xml 530000.xml
10070000.xml 14660000.xml 19700000.xml 24140000.xml 5310000.xml
10140000.xml 14700000.xml 1970000.xml 24150000.xml 5320000.xml
...
14250000.xml 1950000.xml 23970000.xml 510000.xml 9940000.xml
14260000.xml 19580000.xml 24000000.xml 5200000.xml 9970000.xml
14320000.xml 19590000.xml 240000.xml 520000.xml 9990000.xml
14430000.xml 19620000.xml 24020000.xml 5230000.xml list.txt
The list above contains roughly 590 files.
Going back to the wikipages directory, there are a few other files there along with the 00, 01, ..., 99 directories. They are named all_00.xml, all_01.xml, and so on. These are files, not directories. The file all_00.xml, for example, is a text file that contains all the XML files stored in 00/00/, 00/01/, 00/02/, ..., 00/99/, roughly 58,000 pages listed one after the other in one long text file. Because the pages are in XML, each one is sandwiched between <xml> and </xml> tags. Similarly for all_01.xml, which contains about 58,000 pages in XML, listed one after the other in one big text file.
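To see what those page boundaries look like in practice, here is a small stand-alone sketch (plain Java, not Hadoop code) that carves one all_NN.xml block back into individual pages. It assumes that each closing </xml> tag sits on a line of its own, as in the sample page shown further down; a custom RecordReader feeding a MapReduce job would apply the same logic to its input split.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SplitBlock {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        StringBuilder page = new StringBuilder();
        String line;
        int count = 0;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
            if (line.trim().equals("</xml>")) {   // end of one wiki page
                count++;                          // a page is complete: process page.toString() here
                page.setLength(0);                // start accumulating the next page
            }
        }
        in.close();
        System.out.println(count + " pages found");
    }
}

Run on all_00.xml it should report on the order of 58,000 pages, which makes for a cheap sanity check before handing a block to Hadoop.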
Remember, all of these files are on the local disk of hadoop6. Your MapReduce/Hadoop programs can only work on files stored in HDFS.
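Should you need to move more of the local data into HDFS yourself, the standard HDFS shell commands do it; the paths below are only an example (the blocks are large, so keep an eye on the free space on the datanodes):

hadoop dfs -copyFromLocal ~/352/wikipages/all_05.xml wikipages/block/all_05.xml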
HDFS
Everything wiki-related is in the HDFS directory wikipages. Not all 5 Million pages are there, because it is unclear whether hadoop1 through hadoop5 have enough room on their disks to keep 5 Million pages replicated with a factor of 2...
We do have, however, the contents of 00/00/ (about 590 files) and of 00/01/ (also about 590 files) in HDFS. The listing below shows where they are.
In addition, we have a directory wikipages/few/ with just 4 wiki pages (good for debugging), and we also have one directory called wikipages/block with four of the large all_NN.xml blocks of XML (see the listing below).
The listing below shows how to access all the files.
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2010-03-31 21:59 /user/hadoop/wikipages/00
drwxr-xr-x - hadoop supergroup 0 2010-04-05 16:21 /user/hadoop/wikipages/block
drwxr-xr-x - hadoop supergroup 0 2010-04-12 21:33 /user/hadoop/wikipages/few
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/few
Found 4 items
-rw-r--r-- 2 hadoop supergroup 877 2010-04-12 21:33 /user/hadoop/wikipages/few/25200000.xml
-rw-r--r-- 2 hadoop supergroup 4880 2010-04-12 21:33 /user/hadoop/wikipages/few/25210000.xml
-rw-r--r-- 2 hadoop supergroup 4517 2010-04-12 21:33 /user/hadoop/wikipages/few/25220000.xml
-rw-r--r-- 2 hadoop supergroup 430 2010-04-12 21:33 /user/hadoop/wikipages/few/25240000.xml
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/block
Found 4 items
-rw-r--r-- 2 hadoop supergroup 187789938 2010-04-05 15:58 /user/hadoop/wikipages/block/all_00.xml
-rw-r--r-- 2 hadoop supergroup 192918963 2010-04-05 16:14 /user/hadoop/wikipages/block/all_01.xml
-rw-r--r-- 2 hadoop supergroup 198549500 2010-04-05 16:20 /user/hadoop/wikipages/block/all_03.xml
-rw-r--r-- 2 hadoop supergroup 191317937 2010-04-05 16:21 /user/hadoop/wikipages/block/all_04.xml
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/00
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2010-03-31 21:59 /user/hadoop/wikipages/00/00
drwxr-xr-x - hadoop supergroup 0 2010-03-31 21:59 /user/hadoop/wikipages/00/01
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/00/00
Found 590 items
-rw-r--r-- 2 hadoop supergroup 1147 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10000.xml
-rw-r--r-- 2 hadoop supergroup 2073 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10050000.xml
-rw-r--r-- 2 hadoop supergroup 1326 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10070000.xml
-rw-r--r-- 2 hadoop supergroup 2719 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10140000.xml
...
-rw-r--r-- 2 hadoop supergroup 467 2010-03-31 21:59 /user/hadoop/wikipages/00/00/9940000.xml
-rw-r--r-- 2 hadoop supergroup 3455 2010-03-31 21:59 /user/hadoop/wikipages/00/00/9970000.xml
-rw-r--r-- 2 hadoop supergroup 541 2010-03-31 21:59 /user/hadoop/wikipages/00/00/9990000.xml
The output will be:
10000 10050000 10070000 10140000 10200000 10230000 1030000 10320000 1040000 10430000
To get the page with Id 1000, for example, we access the Web server at the same address, but with a different request:
The output is:
<xml>
<title>Hercule Poirot</title>
<id>1000</id>
<contributors>
<contrib>
<username>TXiKiBoT</username>
<id>3171782</id>
<length>51946</length></contrib>
</contributors>
<categories>
<cat>Hercule Poirot</cat>
<cat>Fictional private investigators</cat>
<cat>Series of books</cat>
<cat>Hercule Poirot characters</cat>
<cat>Fictional Belgians</cat>
</categories>
<pagelinks>
<page></page>
<page>16 July</page>
<page>1916</page>
<page>1989</page>
<page>2011</page>
<page>A. E. W. Mason</page>
<page>Academy Award</page>
...
<page>private detective</page>
<page>refugee</page>
<page>retroactive continuity</page>
<page>turnip pocket watch</page>
</pagelinks>
<text>
. Belgium Belgian . occupation = Private Dectective. Former Retired DetectiveFormer Police Police officer officer .
... (lots of text removed here...)
. Hercule Poirot . uk. Еркюль Пуаро . vi. Hercule Poirot . zh. 赫丘勒·白羅 .
</text>
</xml>
In general, the page will have several sections, coded in XML, and always in the same order:
- the title, in <title> tags,
- the contributors, in <contributors> and <contrib> tags,
- the categories the page belongs to, in <categories> and <cat> tags,
- the links to other Wikipedia pages contained in the page, in <pagelinks> and <page> tags,
- the text of the page, with all the HTML and wiki tags removed, between <text> tags.
The end of the text section always contains foreign characters, as in the example above. The text is encoded in UTF-8, the international character encoding of which ASCII is a subset.
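Since all of these sections follow the same flat tag layout, one small helper is enough to pull any of them out of a page. The sketch below uses regular expressions; the class name and the hard-coded sample page are just for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractor {
    // Returns the contents of every <tag>...</tag> occurrence in the page.
    public static List<String> extract(String page, String tag) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">",
                                    Pattern.DOTALL).matcher(page);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    public static void main(String[] args) {
        String page = "<xml><title>Hercule Poirot</title>"
            + "<categories><cat>Fictional Belgians</cat><cat>Series of books</cat></categories>"
            + "<pagelinks><page>1916</page><page>refugee</page></pagelinks>"
            + "<text>some text</text></xml>";
        System.out.println(extract(page, "title"));   // [Hercule Poirot]
        System.out.println(extract(page, "cat"));     // [Fictional Belgians, Series of books]
        System.out.println(extract(page, "page"));    // [1916, refugee]
    }
}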
CGI Program
Just for information, the CGI program that processes the request is available here.
Submission
Submit a pdf (and additional files if needed) as follows:
submit project3 project3.pdf