Difference between revisions of "CSC352 Project 2"
(Created page with 'This project is currently under construction... <onlysmith> ==Accessing Wiki Pages== This is a two-step process. First you need to get a number of Page Ids. :::http://xgridma…') |
|||
(15 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
<onlysmith> | <onlysmith> | ||
+ | __TOC__ | ||
+ | |||
+ | <bluebox> | ||
+ | This is the extension of [[CSC352_Homework_3 | Homework #3]], which is built on top of the [[XGrid Tutorial Part 2: Processing Wikipedia Pages | XGrid Lab 2]]. It is due on Thursday, April 8th. | ||
+ | </bluebox> | ||
+ | |||
+ | =Assignment= | ||
+ | |||
+ | * Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page. | ||
+ | * Associate with each category the most frequent words that have been associated with it over the N pages processed. | ||
+ | * Output the result (or a sample of it) | ||
+ | * Measure the execution time of the program | ||
+ | * Write a summary of your work, following the guidelines presented in class (3/9, 3/11). | ||
+ | * For this project, build on top of the homework and concentrate on the formatting of the project: include graphs and an analysis of your results. | ||
+ | * Submit a pdf with your presentation, graphs, and analysis. Submit your programs, even if they are the same as the files you submitted for the homework. | ||
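The first two items of the assignment can be sketched as follows. This is a minimal illustration, not the required solution: the stop-word list is a tiny placeholder, and the names `top_words` and `words_by_category` are hypothetical. It assumes each page has already been reduced to a list of category names and a text string.

```python
import re
from collections import Counter, defaultdict

# Placeholder stop-word list (an assumption; substitute the list used in class).
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "of", "in", "to", "is", "it",
    "for", "on", "as", "by", "with", "that", "was",
})

def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent words of text, ignoring stop words."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return [word for word, _ in counts.most_common(n)]

def words_by_category(pages, n=5):
    """Map each category to a Counter of words.

    pages is an iterable of (categories, text) pairs. A word's count for a
    category is the number of pages in that category for which the word was
    among the page's n most frequent words.
    """
    per_category = defaultdict(Counter)
    for categories, text in pages:
        best = top_words(text, n)
        for category in categories:
            per_category[category].update(best)
    return per_category
```

Calling `words_by_category(pages)["Some category"].most_common(5)` then gives the words most often associated with that category over the pages processed.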
+ | |||
+ | submit project2 file1 | ||
+ | submit project2 file2 | ||
+ | ... | ||
+ | |||
+ | =Project Details= | ||
==Accessing Wiki Pages== | ==Accessing Wiki Pages== | ||
− | This is a two-step process. First | + | This is a two-step process. First we need to get a number of page Ids. For example, if we just want 10 pages, we request the following URL: | ||
− | :::http://xgridmac.dyndns.org/ | + | :::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Count=10 |
The output will be: | The output will be: | ||
Line 21: | Line 42: | ||
10430000 | 10430000 | ||
− | To get the page with Id | + | To get the page with a given Id, for example 1000, we access the Web server at the same address, but with a different ''request'': | ||
− | :::http://xgridmac.dyndns.org/ | + | :::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Id=1000 |
The output is: | The output is: | ||
Line 53: | Line 74: | ||
<page>A. E. W. Mason</page> | <page>A. E. W. Mason</page> | ||
<page>Academy Award</page> | <page>Academy Award</page> | ||
− | + | ... | |
<page>private detective</page> | <page>private detective</page> | ||
<page>refugee</page> | <page>refugee</page> | ||
Line 211: | Line 86: | ||
</text> | </text> | ||
</xml> | </xml> | ||
+ | </pre></code> | ||
+ | In general, the page will have several sections, coded in XML, and always in the same order: | ||
+ | * the title, in '''<title>''' tags, | ||
+ | * the contributor, in '''<contributor>''' tags, | ||
+ | * the categories the page belongs to, in '''<categories>''' and '''<cat>''' tags, | ||
+ | * the links to other wikipedia pages the page contains, in '''<pagelinks>''' and '''<page>''' tags, | ||
+ | * the text of the page, with all the html and wiki tags removed, between '''<text>''' tags. | ||
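The sections listed above can be pulled out of a page with a few regular expressions. The sketch below is one possible approach under stated assumptions: it uses regular expressions rather than an XML parser because the text section may contain characters that are not valid XML, the function names are hypothetical, and `SAMPLE` is a made-up page that only imitates the layout described here.

```python
import re

def _first(tag, source):
    """Return the content of the first <tag>...</tag> pair, or '' if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", source, re.DOTALL)
    return match.group(1).strip() if match else ""

def _all(tag, source):
    """Return the content of every <tag>...</tag> pair, in order."""
    found = re.findall(rf"<{tag}>(.*?)</{tag}>", source, re.DOTALL)
    return [item.strip() for item in found]

def parse_page(source):
    """Split one page returned by the server into its five sections."""
    return {
        "title": _first("title", source),
        "contributor": _first("contributor", source),
        "categories": _all("cat", _first("categories", source)),
        "links": _all("page", _first("pagelinks", source)),
        "text": _first("text", source),
    }

# Made-up page for illustration only; real pages come from the server.
SAMPLE = """<xml>
<title>Example Page</title>
<contributor>Example Contributor</contributor>
<categories>
<cat>Example category</cat>
<cat>Another category</cat>
</categories>
<pagelinks>
<page>A. E. W. Mason</page>
<page>Academy Award</page>
</pagelinks>
<text>
Some example text of the page.
</text>
</xml>"""
```

For instance, `parse_page(SAMPLE)["categories"]` yields the two category names, which is the input needed for the per-category word counts.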
− | + | The end of the text section always contains foreign (non-ASCII) characters. The text should be decoded as UTF-8, a Unicode encoding of which ASCII is a subset. | ||
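A minimal sketch of the two requests and of the UTF-8 decoding follows. The helper names are hypothetical, the 30-second timeout is an arbitrary choice, and `fetch` only works if the server is reachable. Undecodable bytes are replaced rather than raising an error, which is a design choice made for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi"

def ids_url(count):
    """URL of the request that returns count page Ids."""
    return f"{BASE_URL}?{urlencode({'Count': count})}"

def page_url(page_id):
    """URL of the request that returns the page with the given Id."""
    return f"{BASE_URL}?{urlencode({'Id': page_id})}"

def decode_page(raw):
    """Decode the server's bytes as UTF-8, replacing any invalid bytes."""
    return raw.decode("utf-8", errors="replace")

def fetch(url, timeout=30):
    """Download url and return its body as a string (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return decode_page(response.read())
```

With these helpers, `fetch(ids_url(10))` would request 10 page Ids and `fetch(page_url(1000))` the page with Id 1000.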
+ | ===CGI Program=== | ||
+ | Just for information, the CGI program that processes the request is available [[CSC352 getWikiPageById.cgi | here]]. | ||
</onlysmith> | </onlysmith> | ||
+ | |||
+ | ==Submission== | ||
+ | |||
+ | Submit a pdf (and additional files if needed) as follows: | ||
+ | |||
+ | submit project2 project2.pdf | ||
+ | |||
+ | <br /> | ||
+ | <br /> | ||
+ | <br /> | ||
+ | <br /> | ||
+ | <br /> | ||
+ | <br /> | ||
+ | <br /> | ||
+ | [[Category:CSC352]][[Category:Project]][[Category:XGrid]] |
Latest revision as of 13:06, 18 November 2010
This project is currently under construction...
Submission
Submit a pdf (and additional files if needed) as follows:
submit project2 project2.pdf