CSC352 Project 3
This is an extension of Project #2, which was built on top of the [Hadoop/MapReduce Tutorials]. It is due on the last day of Exams, at 4:00 p.m.
The Big Picture
Your project should present your answers to the following three questions:
- How should one attempt to process 5 Million Wikipedia pages with MapReduce/Hadoop? What parameters control the execution time, and what are the best estimates of the values they should be set to?
- How long is the processing of 5 Million pages estimated to take under the conditions specified above?
- How does this estimate compare to the execution time for the same 5 Million pages on an XGrid system?
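To make the first question concrete: in a Hadoop 0.20-style driver, the one parameter you set directly is the number of reduce tasks; the number of map tasks follows from the number of input splits (one per small file, or roughly one per HDFS block of a large file). The sketch below is only meant to show where these knobs live; the class name, the value 2, and the absence of explicit mapper/reducer classes are placeholders, not part of the assignment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WikiDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wiki categories");
        job.setJarByClass(WikiDriver.class);

        // The mapper and reducer classes would be set here, e.g.
        // job.setMapperClass(...) and job.setReducerClass(...);
        // without them Hadoop runs identity map/reduce, which is still
        // useful for timing the raw I/O of a given input set.

        // Main tuning knob: the number of reduce tasks. The number of
        // map tasks is not set here; it is determined by the input splits
        // (one per small file, or about one per HDFS block of a big file).
        job.setNumReduceTasks(2);   // placeholder value, to be measured

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. wikipages/few
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}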
Assignment (same as for the XGrid Project)
- Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page.
- Associate with each category the most frequent words that have been associated with it over the N pages processed (a sketch of one possible mapper/reducer pair for these two steps appears at the end of this section).
- Output the result (or a sample of it)
- Measure the execution time of the program
- Write a summary of the approach, following the guidelines presented in class (3/9, 3/11).
- Submit a pdf with your presentation, graphs, and analysis. Submit your programs, even if they are the same as the files you submitted for previous homework or projects.
submit project3 file1
submit project3 file2
...
- Note: You cannot submit directories with the submit command. If you want to submit the contents of a whole directory, then proceed as follows:
cd theDirectoryWhereAllTheFilesYouWantToSubmitReside
tar -czvf yourFirstNameProject3.tgz *
submit project3 yourFirstNameProject3.tgz
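For reference, here is a rough sketch of how the first two bullets could be split into a map step and a reduce step. It assumes an input format that hands map() one whole wiki page at a time (getting such records out of the data is discussed under Project Details); the stop-word list, the regular-expression parsing, and the class names are illustrative placeholders. As a simplification the mapper emits every non-stop word of the page rather than only the page's 5 most frequent ones; that per-page filter, which the assignment asks for, would be added in map().

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WikiCategories {

    // Mapper: for each page, emit (category, word) for every non-stop word
    // of the <text> section, once per category the page belongs to.
    public static class PageMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Set<String> STOP = new HashSet<String>(
                Arrays.asList("the", "a", "of", "and", "in", "to", "is"));
        private static final Pattern CAT = Pattern.compile("<cat>(.*?)</cat>");
        private static final Pattern TXT =
                Pattern.compile("<text>(.*?)</text>", Pattern.DOTALL);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String page = value.toString();

            // collect the page's categories
            List<String> cats = new ArrayList<String>();
            Matcher c = CAT.matcher(page);
            while (c.find()) cats.add(c.group(1));

            Matcher t = TXT.matcher(page);
            if (cats.isEmpty() || !t.find()) return;

            // tokenize the text, skip stop words, emit one pair per category
            for (String w : t.group(1).toLowerCase().split("[^a-z]+")) {
                if (w.length() < 2 || STOP.contains(w)) continue;
                for (String cat : cats) {
                    context.write(new Text(cat), new Text(w));
                }
            }
        }
    }

    // Reducer: for each category, count how often each word arrived and
    // keep the 5 most frequent ones.
    public static class CategoryReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text category, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (Text w : words) {
                String s = w.toString();
                counts.put(s, counts.containsKey(s) ? counts.get(s) + 1 : 1);
            }
            List<Map.Entry<String, Integer>> sorted =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a,
                                   Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue();
                }
            });
            StringBuilder top = new StringBuilder();
            for (int i = 0; i < Math.min(5, sorted.size()); i++) {
                if (i > 0) top.append(' ');
                top.append(sorted.get(i).getKey());
            }
            context.write(category, new Text(top.toString()));
        }
    }
}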
Project Details
Accessing Wiki Pages
Local Disk
The 5 Million Wikipedia pages are on the local disk of hadoop6.
hadoop@hadoop6:~$ cd 352/
hadoop@hadoop6:~/352$ ls
dft wikipages
hadoop@hadoop6:~/352$ cd wikipages/
hadoop@hadoop6:~/352/wikipages$ ls
00 07 14 21 28 35 42 49 56 63 70 77 84 91 98 all_05.xml
01 08 15 22 29 36 43 50 57 64 71 78 85 92 99 all_06.xml
02 09 16 23 30 37 44 51 58 65 72 79 86 93 all_00.xml all_07.xml
03 10 17 24 31 38 45 52 59 66 73 80 87 94 all_01.xml all_08.xml
04 11 18 25 32 39 46 53 60 67 74 81 88 95 all_02.xml
05 12 19 26 33 40 47 54 61 68 75 82 89 96 all_03.xml
06 13 20 27 34 41 48 55 62 69 76 83 90 97 all_04.xml
Each of the 00, 01, 02, ..., 99 directories contains 100 subdirectories, also named 00, 01, 02, ..., 99, and each one of these contains a collection of wiki pages in XML.
hadoop@hadoop6:~/352/wikipages$ cd 00/00
hadoop@hadoop6:~/352/wikipages/00/00$ ls
10000.xml 14500000.xml 19670000.xml 24100000.xml 5240000.xml
10050000.xml 1450000.xml 19680000.xml 24130000.xml 530000.xml
10070000.xml 14660000.xml 19700000.xml 24140000.xml 5310000.xml
10140000.xml 14700000.xml 1970000.xml 24150000.xml 5320000.xml
...
14250000.xml 1950000.xml 23970000.xml 510000.xml 9940000.xml
14260000.xml 19580000.xml 24000000.xml 5200000.xml 9970000.xml
14320000.xml 19590000.xml 240000.xml 520000.xml 9990000.xml
14430000.xml 19620000.xml 24020000.xml 5230000.xml list.txt
The list above contains roughly 590 files.
Going back to the wikipages directory, there are a few other files there along with the 00, 01, ..., 99 directories. They are named all_00.xml, all_01.xml, and so on. These are files, not directories. The file all_00.xml, for example, is a text file that contains all the XML files stored in 00/00/, 00/01/, 00/02/, ..., 00/99/, roughly 58,000 pages listed one after the other in one long text file. Because the pages are in XML, each one is sandwiched between <xml> and </xml> tags. Similarly for all_01.xml, which contains about 58,000 pages in XML, listed one after the other in one big text file.
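To see what those page boundaries look like in practice, here is a small stand-alone sketch (plain Java, not Hadoop code) that carves one all_NN.xml block back into individual pages. It assumes that each closing </xml> tag sits on a line of its own, as in the sample page shown further down; a custom RecordReader feeding a MapReduce job would apply the same logic to its input split.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SplitBlock {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        StringBuilder page = new StringBuilder();
        String line;
        int count = 0;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
            if (line.trim().equals("</xml>")) {   // end of one wiki page
                count++;                          // a page is complete: process page.toString() here
                page.setLength(0);                // start accumulating the next page
            }
        }
        in.close();
        System.out.println(count + " pages found");
    }
}

Run on all_00.xml it should report on the order of 58,000 pages, which makes for a cheap sanity check before handing a block to Hadoop.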
Remember, all of these files are on the local disk of hadoop6. Your MapReduce/Hadoop programs can only work on files stored in HDFS.
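Should you need to move more of the local data into HDFS yourself, the standard HDFS shell commands do it; the paths below are only an example (the blocks are large, so keep an eye on the free space on the datanodes):

hadoop dfs -copyFromLocal ~/352/wikipages/all_05.xml wikipages/block/all_05.xml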
HDFS
Everything wiki-related is in the HDFS directory wikipages. Not all 5 Million pages are there, because it is unclear whether hadoop1 through hadoop5 have enough room on their disks to keep 5 Million pages replicated with a factor of 2...
We do have, however, the contents of 00/00/ (about 590 files) and of 00/01/ (also about 590 files) in HDFS. The listing below shows where they are.
In addition, we have a directory wikipages/few/ with just 4 wiki pages (good for debugging), and we also have one directory called wikipages/block with four of the large all_NN.xml blocks of XML (see the listing below).
The listing below shows how to access all the files.
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2010-03-31 21:59 /user/hadoop/wikipages/00
drwxr-xr-x - hadoop supergroup 0 2010-04-05 16:21 /user/hadoop/wikipages/block
drwxr-xr-x - hadoop supergroup 0 2010-04-12 21:33 /user/hadoop/wikipages/few
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/few
Found 4 items
-rw-r--r-- 2 hadoop supergroup 877 2010-04-12 21:33 /user/hadoop/wikipages/few/25200000.xml
-rw-r--r-- 2 hadoop supergroup 4880 2010-04-12 21:33 /user/hadoop/wikipages/few/25210000.xml
-rw-r--r-- 2 hadoop supergroup 4517 2010-04-12 21:33 /user/hadoop/wikipages/few/25220000.xml
-rw-r--r-- 2 hadoop supergroup 430 2010-04-12 21:33 /user/hadoop/wikipages/few/25240000.xml
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/block
Found 4 items
-rw-r--r-- 2 hadoop supergroup 187789938 2010-04-05 15:58 /user/hadoop/wikipages/block/all_00.xml
-rw-r--r-- 2 hadoop supergroup 192918963 2010-04-05 16:14 /user/hadoop/wikipages/block/all_01.xml
-rw-r--r-- 2 hadoop supergroup 198549500 2010-04-05 16:20 /user/hadoop/wikipages/block/all_03.xml
-rw-r--r-- 2 hadoop supergroup 191317937 2010-04-05 16:21 /user/hadoop/wikipages/block/all_04.xml
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/00
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2010-03-31 21:59 /user/hadoop/wikipages/00/00
drwxr-xr-x - hadoop supergroup 0 2010-03-31 21:59 /user/hadoop/wikipages/00/01
hadoop@hadoop6:~/352/wikipages$ hadoop dfs -ls wikipages/00/00
Found 590 items
-rw-r--r-- 2 hadoop supergroup 1147 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10000.xml
-rw-r--r-- 2 hadoop supergroup 2073 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10050000.xml
-rw-r--r-- 2 hadoop supergroup 1326 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10070000.xml
-rw-r--r-- 2 hadoop supergroup 2719 2010-03-31 21:59 /user/hadoop/wikipages/00/00/10140000.xml
...
-rw-r--r-- 2 hadoop supergroup 467 2010-03-31 21:59 /user/hadoop/wikipages/00/00/9940000.xml
-rw-r--r-- 2 hadoop supergroup 3455 2010-03-31 21:59 /user/hadoop/wikipages/00/00/9970000.xml
-rw-r--r-- 2 hadoop supergroup 541 2010-03-31 21:59 /user/hadoop/wikipages/00/00/9990000.xml
The output will be:
10000 10050000 10070000 10140000 10200000 10230000 1030000 10320000 1040000 10430000
To get the page with Id 1000, for example, we access the Web server at the same address, but with a different request:
The output is:
<xml>
<title>Hercule Poirot</title>
<id>1000</id>
<contributors>
<contrib>
<username>TXiKiBoT</username>
<id>3171782</id>
<length>51946</length></contrib>
</contributors>
<categories>
<cat>Hercule Poirot</cat>
<cat>Fictional private investigators</cat>
<cat>Series of books</cat>
<cat>Hercule Poirot characters</cat>
<cat>Fictional Belgians</cat>
</categories>
<pagelinks>
<page></page>
<page>16 July</page>
<page>1916</page>
<page>1989</page>
<page>2011</page>
<page>A. E. W. Mason</page>
<page>Academy Award</page>
...
<page>private detective</page>
<page>refugee</page>
<page>retroactive continuity</page>
<page>turnip pocket watch</page>
</pagelinks>
<text>
. Belgium Belgian . occupation = Private Dectective. Former Retired DetectiveFormer Police Police officer officer .
... (lots of text removed here...)
. Hercule Poirot . uk. Еркюль Пуаро . vi. Hercule Poirot . zh. 赫丘勒·白羅 .
</text>
</xml>
In general, the page will have several sections, coded in XML, and always in the same order:
- the title, in <title> tags,
- the contributors, in <contributors> and <contrib> tags,
- the categories the page belongs to, in <categories> and <cat> tags,
- the links to other Wikipedia pages contained in the page, in <pagelinks> and <page> tags,
- the text of the page, with all the HTML and wiki tags removed, between <text> tags.
The end of the text section always contains foreign characters, as in the example above. The text is encoded in UTF-8, the international character encoding of which ASCII is a subset.
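Since all of these sections follow the same flat tag layout, one small helper is enough to pull any of them out of a page. The sketch below uses regular expressions; the class name and the hard-coded sample page are just for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractor {
    // Returns the contents of every <tag>...</tag> occurrence in the page.
    public static List<String> extract(String page, String tag) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile("<" + tag + ">(.*?)</" + tag + ">",
                                    Pattern.DOTALL).matcher(page);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    public static void main(String[] args) {
        String page = "<xml><title>Hercule Poirot</title>"
            + "<categories><cat>Fictional Belgians</cat><cat>Series of books</cat></categories>"
            + "<pagelinks><page>1916</page><page>refugee</page></pagelinks>"
            + "<text>some text</text></xml>";
        System.out.println(extract(page, "title"));   // [Hercule Poirot]
        System.out.println(extract(page, "cat"));     // [Fictional Belgians, Series of books]
        System.out.println(extract(page, "page"));    // [1916, refugee]
    }
}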
CGI Program
Just for information, the CGI program that processes the request is available here.
Submission
Submit a pdf (and additional files if needed) as follows:
submit project3 project3.pdf