Difference between revisions of "CSC352 Project 2"

From dftwiki3
 
(14 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
<onlysmith>
 
<onlysmith>
  
 +
__TOC__
 +
 +
<bluebox>
 +
This is the extension of [[CSC352_Homework_3 | Homework #3]], which is built on top of the [[XGrid Tutorial Part 2: Processing Wikipedia Pages | XGrid Lab 2]].  It is due on Thursday, April 8th.
 +
</bluebox>
 +
 +
=Assignment=
 +
 +
* Process N wiki pages, and for each one keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page.
 +
* Associate with each category the most frequent words that have been associated with it over the N pages processed
 +
* Output the result (or a sample of it)
 +
* Measure the execution time of the program
 +
* Write a summary of the project, as illustrated in the guidelines presented in class (3/9, 3/11).
 +
* For this project, build on top of the homework and concentrate on the formatting of the project. Include graphs and an analysis of your results.
 +
* Submit a pdf with your presentation, graphs, and analysis.  Submit your programs, even if they are the same as the files you submitted for the homework.
 +
 +
    submit project2 file1
 +
    submit project2 file2
 +
    ...
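
As a rough illustration of the assignment steps listed above, here is a minimal Python sketch. It is not the required implementation: the stop-word list is a tiny hypothetical one, the names <code>top_words</code> and <code>add_page</code> are made up for this example, and "associating" words with a category is interpreted here as counting in how many pages a word made the top 5 for that category. The two sample pages are invented.

```python
import re
import time
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run needs a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was", "by", "for", "on"}


def top_words(text, n=5):
    """Return the n most frequent non-stop words in text, most frequent first."""
    # [^\W\d_]+ matches runs of letters, including accented (non-ASCII) ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


def add_page(category_words, categories, text, n=5):
    """Credit the page's top-n words to every category the page belongs to."""
    frequent = top_words(text, n)
    for cat in categories:
        category_words[cat].update(frequent)


# category name -> Counter of words that were frequent in pages of that category
category_words = defaultdict(Counter)

start = time.perf_counter()  # measure the execution time of the processing
add_page(category_words, ["Detectives"],
         "Poirot solved the case. Poirot was a detective.")
add_page(category_words, ["Detectives", "Novels"],
         "The novel follows Poirot; the novel was popular.")
elapsed = time.perf_counter() - start

print(category_words["Detectives"].most_common(3))
print(f"processed 2 pages in {elapsed:.6f} s")
```

Keeping one <code>Counter</code> per category makes the final report a matter of calling <code>most_common()</code> on each entry. For the timing graphs, the same <code>perf_counter</code> bracket can be placed around the whole run for different values of N.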
 +
 +
=Project Details=
 
==Accessing Wiki Pages==
 
==Accessing Wiki Pages==
  
This is a two-step process.  First you need to get a number of Page Ids.  For example, if we just want 10 pages, we request the following Url:
+
This is a two-step process.  First we need to get a number of page Ids.  For example, if we just want 10 pages, we request the following URL:
  
:::http://xgridmac.dyndns.org/~thiebaut/getWikiPageById.cgi?Count=10
+
:::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Count=10
  
 
The output will be:
 
The output will be:
Line 21: Line 42:
 
  10430000
 
  10430000
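
A minimal Python sketch of this first step is shown below. It assumes the server returns one numeric Id per line, as the sample output suggests; the helper names (<code>build_url</code>, <code>parse_ids</code>, <code>fetch</code>, <code>get_page_ids</code>) are hypothetical, and the server may no longer be reachable.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Address taken from the text above.
BASE_URL = "http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi"


def build_url(**params):
    """Build a request URL such as ...getWikiPageById.cgi?Count=10."""
    return BASE_URL + "?" + urlencode(params)


def parse_ids(body):
    """Extract page Ids from the response body.

    Assumption: the response is whitespace-separated numeric Ids
    (one per line in the sample output); other tokens are skipped.
    """
    return [int(token) for token in body.split() if token.isdecimal()]


def fetch(url, timeout=30):
    """Download a URL and return its body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def get_page_ids(count):
    """Ask the server for `count` page Ids and return them as integers."""
    return parse_ids(fetch(build_url(Count=count)))


print(build_url(Count=10))
print(parse_ids("10000\n10430000\n"))
```

The same <code>build_url</code> helper covers the second step: <code>build_url(Id=1000)</code> yields the request for a single page.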
  
To get the page with Id 10000, for example, then access the Web server at the same address, but with a different ''request'':
+
To get the page with Id 1000, for example, we access the Web server at the same address, but with a different ''request'':
  
:::http://xgridmac.dyndns.org/~thiebaut/getWikiPageById.cgi?Id=1000
+
:::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Id=1000
  
 
The output is:
 
The output is:
Line 53: Line 74:
 
<page>A. E. W. Mason</page>
 
<page>A. E. W. Mason</page>
 
<page>Academy Award</page>
 
<page>Academy Award</page>
<page>Agatha Christie</page>
+
...
<page>Agatha Christie Hour</page>
 
<page>Agatha Christie's Great Detectives Poirot and Marple</page>
 
<page>Agatha Christie's Poirot</page>
 
<page>Albert Finney</page>
 
<page>Alfred Molina</page>
 
<page>Alibi (1931 film)</page>
 
<page>Alibi (play)</page>
 
<page>Angela Easterling</page>
 
<page>Animaniacs</page>
 
<page>Appointment with Death (film)</page>
 
<page>Arthur Conan Doyle</page>
 
<page>Austin Trevor</page>
 
<page>BBC 7</page>
 
<page>BBC Radio 4</page>
 
<page>Belgium</page>
 
<page>Bernice Summerfield</page>
 
<page>Black Coffee (1931 film)</page>
 
<page>Brussels</page>
 
<page>C. Auguste Dupin</page>
 
<page>Captain Arthur Hastings</page>
 
<page>Cards on the Table</page>
 
<page>Charles Laughton</page>
 
<page>Charlie Chan</page>
 
<page>Charterhouse Square</page>
 
<page>Count Duckula</page>
 
<page>Crime na Pensão Estrelinha</page>
 
<page>Crooked House</page>
 
<page>Curtain (novel)</page>
 
<page>Dave Stone</page>
 
<page>David Suchet</page>
 
<page>Daylight Robbery on the Orient Express</page>
 
<page>Dead Man's Folly</page>
 
<page>Death on the Nile</page>
 
<page>Death on the Nile (1978 film)</page>
 
<page>Demographics of Belgium</page>
 
<page>Detective Conan</page>
 
<page>Detective-Judge Armitage</page>
 
<page>Dudley Jones</page>
 
<page>Eastern Europe</page>
 
<page>Edgar Allan Poe</page>
 
<page>Edmund Wilson</page>
 
<page>Elephants Can Remember</page>
 
<page>Emma Bunton</page>
 
<page>Evil Under the Sun (1982 film)</page>
 
<page>Faye Dunaway</page>
 
<page>Finland</page>
 
<page>Five Little Pigs</page>
 
<page>Florin Court</page>
 
<page>Frank Howel Evans</page>
 
<page>Geronimo Stilton</page>
 
<page>Grey matter</page>
 
<page>HP Brown Sauce</page>
 
<page>Hallowe'en Party</page>
 
<page>Harold Huber</page>
 
<page>Henry Edwards (actor)</page>
 
<page>Hercules</page>
 
<page>Herman José</page>
 
<page>Hugh Laurie</page>
 
<page>ITV</page>
 
<page>Ian Holm</page>
 
<page>Inspector Lestrade</page>
 
<page>Jason Alexander</page>
 
<page>John Cleese</page>
 
<page>John Dickson Carr</page>
 
<page>John Moffat (actor)</page>
 
<page>José Carlos Somoza</page>
 
<page>Kaoru Yachigusa</page>
 
<page>Leslie S. Hiscott</page>
 
<page>London</page>
 
<page>Lord Edgware Dies</page>
 
<page>Lord Edgware Dies (1934 film)</page>
 
<page>Marie Belloc Lowndes</page>
 
<page>Mercury Players</page>
 
<page>Michael Morton (dramatist)</page>
 
<page>Middle East</page>
 
<page>Miss Marple</page>
 
<page>Mouri Kogoro</page>
 
<page>Muppets Tonight</page>
 
<page>Murder By Death</page>
 
<page>Murder on the Orient Express</page>
 
<page>Murder on the Orient Express (1974 film)</page>
 
<page>Murder on the Orient Express (2001 film)</page>
 
<page>Mycroft Holmes</page>
 
<page>NHK</page>
 
<page>New York Times</page>
 
<page>Nick and Nora Charles</page>
 
<page>Ordeal by Innocence</page>
 
<page>Parker Pyne</page>
 
<page>Pauline Moran</page>
 
<page>Peter Serafinowicz</page>
 
<page>Peter Ustinov</page>
 
<page>Plot devices in Agatha Christie's novels</page>
 
<page>Police Officer</page>
 
<page>Police officer</page>
 
<page>Rape of Belgium</page>
 
<page>Rashomon (movie)</page>
 
<page>Robert Barnard</page>
 
<page>Roman Catholic</page>
 
<page>Rosalind Hicks</page>
 
<page>Russian Revolution (1917)</page>
 
<page>Sam Spade</page>
 
<page>Sandhurst</page>
 
<page>Scotland Yard</page>
 
<page>Sherlock Holmes</page>
 
<page>Ship of Fools (Stone novel)</page>
 
<page>Smithfield, London</page>
 
<page>South America</page>
 
<page>Spa, Belgium</page>
 
<page>Spice World (film)</page>
 
<page>Spiceworld (film)</page>
 
<page>Squash (plant)</page>
 
<page>Sven Hjerson</page>
 
<page>The ABC Murders</page>
 
<page>The Alphabet Murders</page>
 
<page>The Athenian Murders</page>
 
<page>The Big Four (novel)</page>
 
<page>The Campbell Playhouse</page>
 
<page>The Goodies (TV series)</page>
 
<page>The Labours of Hercules</page>
 
<page>The Murder of Roger Ackroyd</page>
 
<page>The Mysterious Affair at Styles</page>
 
<page>The Pajamas</page>
 
<page>The Strange Case of the End of Civilization as We Know It</page>
 
<page>Thirteen at Dinner</page>
 
<page>Three Act Tragedy</page>
 
<page>Tony Randall</page>
 
<page>Treaty of Versailles</page>
 
<page>United Kingdom</page>
 
<page>Versailles</page>
 
<page>Walloons</page>
 
<page>Warner Brothers</page>
 
<page>Wilkie Collins</page>
 
<page>World War I</page>
 
<page>Yakko Warner</page>
 
<page>amyl nitrite</page>
 
<page>anime</page>
 
<page>arthritis</page>
 
<page>casus belli</page>
 
<page>charlatan</page>
 
<page>detective</page>
 
<page>fictional character</page>
 
<page>made-for-television</page>
 
<page>manga</page>
 
<page>narrator</page>
 
<page>novel</page>
 
<page>parody</page>
 
 
<page>private detective</page>
 
<page>private detective</page>
 
<page>refugee</page>
 
<page>refugee</page>
Line 211: Line 86:
 
</text>
 
</text>
 
</xml>
 
</xml>
 +
</pre></code>
  
 +
In general, the page will have several sections, coded in XML, and always in the same order:
 +
* the title, in '''&lt;title&gt;''' tags,
 +
* the contributor, in '''&lt;contributor&gt;''' tags,
 +
* the categories the page belongs to, in '''&lt;categories&gt;''' and '''&lt;cat&gt;''' tags,
 +
* the links to other wikipedia pages the page contains, in '''&lt;pagelinks&gt;''' and '''&lt;page&gt;''' tags,
 +
* the text of the page, with all the html and wiki tags removed, between '''&lt;text&gt;''' tags.
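
The sections listed above can be pulled apart with a few regular expressions. The sketch below is one possible approach, not the method used by the CGI program. It uses regular expressions rather than an XML parser on the assumption that the returned page may not be strictly well-formed XML. The function names and the sample page are invented for illustration.

```python
import re


def _first(tag, raw):
    """Return the content of the first <tag>...</tag> pair, or '' if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", raw, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_page(raw):
    """Split a page returned by the server into its five sections."""
    return {
        "title": _first("title", raw),
        "contributor": _first("contributor", raw),
        # <cat> entries are looked up only inside the <categories> section
        "categories": re.findall(r"<cat>(.*?)</cat>",
                                 _first("categories", raw), re.DOTALL),
        # <page> entries are looked up only inside the <pagelinks> section
        "links": re.findall(r"<page>(.*?)</page>",
                            _first("pagelinks", raw), re.DOTALL),
        "text": _first("text", raw),
    }


# Made-up sample in the layout described above (not actual server output).
sample = """<xml>
<title>Sample Page</title>
<contributor>SomeUser</contributor>
<categories>
<cat>Fictional detectives</cat>
</categories>
<pagelinks>
<page>Agatha Christie</page>
<page>Belgium</page>
</pagelinks>
<text>
A short sample text about a detective.
</text>
</xml>"""

page = parse_page(sample)
print(page["title"], page["categories"], page["links"])
```

Only the <code>categories</code> and <code>text</code> entries are needed for the word-frequency part of the project; the other fields are extracted for completeness.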
  
</pre></code>
+
The end of the text section always contains non-ASCII (foreign-language) characters.  The text should be encoded in UTF-8, an encoding of the international Unicode character set, of which ASCII is a subset.
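
A short Python sketch of what this means in practice: plain ASCII text is already valid UTF-8, while accented characters take more than one byte. The helper name <code>decode_text</code> is hypothetical. The choice of <code>errors="replace"</code> is an assumption, made so that a damaged byte sequence does not abort a long run.

```python
def decode_text(raw_bytes):
    """Decode bytes received from the server as UTF-8.

    Invalid byte sequences are replaced with U+FFFD instead of raising,
    so one bad page does not stop the processing of the others.
    """
    return raw_bytes.decode("utf-8", errors="replace")


name = "José"
encoded = name.encode("utf-8")

print(len(name), len(encoded))  # 4 characters, 5 bytes: the accented e takes 2 bytes
print("abc".encode("ascii") == "abc".encode("utf-8"))  # True: ASCII bytes are unchanged
print(decode_text(encoded))
```

When counting words, decode first and then split, so that accented letters are treated as single characters rather than as pairs of bytes.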
  
 +
===CGI Program===
 +
Just for information, the CGI program that processes the request is available [[CSC352 getWikiPageById.cgi | here]].
 
</onlysmith>
 
</onlysmith>
 +
 +
==Submission==
 +
 +
Submit a pdf (and additional files if needed) as follows:
 +
 +
  submit project2 project2.pdf
 +
 +
<br />
 +
<br />
 +
<br />
 +
<br />
 +
<br />
 +
<br />
 +
<br />
 +
[[Category:CSC352]][[Category:Project]][[Category:XGrid]]

Latest revision as of 13:06, 18 November 2010

This project is currently under construction...


This section is only visible to computers located at Smith College

Submission

Submit a pdf (and additional files if needed) as follows:

 submit project2 project2.pdf