CSC352 Homework 3

Revision as of 09:58, 23 March 2010

Programming the XGrid

The class decided on the contents of this homework, and its due date: March 30th. It lays the groundwork for Project #2. Feel free to work in pairs.





Problem Statement

Process N wiki pages, and for each one

  • keep track of the categories contained in the page
  • find the 5 most frequent words (not including stop words) in the page.
  • associate with each category the most frequent words that have been associated with it over the N pages processed
  • output the result (or a sample of it)
  • measure the execution time of the program
  • (Removed --D. Thiebaut 13:15, 10 March 2010 (UTC)) Write a summary of it, as illustrated in the guidelines presented in class (3/9, 3/11).
  • (Addition --D. Thiebaut 13:15, 10 March 2010 (UTC)) For this homework, concentrate on the programming, and leave the formatting, graphing, and analysis part for the project. You should still report a table of measurements, which you can include in the header of the program or in a pdf.
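The word-counting and category steps listed above can be sketched as follows. This is a minimal illustration, not the required design: the names `top_words`, `update_categories`, and the tiny `STOP_WORDS` set are hypothetical, and a real run would load a full stop-word list from a file.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was"}

def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words in text, most frequent first."""
    # Runs of letters only; digits, punctuation and underscores split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return [word for word, _ in counts.most_common(n)]

def update_categories(cat_words, categories, words):
    """Credit every category of a page with each of that page's top words."""
    for cat in categories:
        cat_words[cat].update(words)

# One Counter per category, accumulated over all N pages.
cat_words = defaultdict(Counter)
update_categories(cat_words, ["Fictional Belgians"],
                  top_words("Poirot is a detective. Poirot was Belgian."))
```

After all pages are processed, `cat_words[cat].most_common(k)` gives the words most often associated with a category.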

Details

Wiki Pages

The details of how to obtain the Ids of wiki pages and how to fetch the pages are presented in XGrid Lab 2.

XGrid Submission

You are free to use asynchronous (Lab 1) jobs or batch (Lab 2) jobs to submit work to the XGrid. One might be better than the other, but having the class try several different approaches is valuable for the group as a whole. Note that batch processing is easier, given that Lab 2 already shows one way to process wiki pages.

XGrid Controller

You will use the XgridMac controller for this homework, unless the XGrid in Bass becomes available early enough.

Performance Measure

Two performance measures are obvious candidates: the total execution time, and the number of wiki pages processed per unit of time (per second, or per minute if we have a slow implementation).

Your assignment is to compute the number of pages processed per unit of time as a function of the number of processors. The complication is that the number of processors is not fixed, and we cannot easily control it (although we could always go to FH341 and turn some machines ON or OFF!). Sometimes the number of processors available will be 14, sometimes 16, sometimes 20...
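One possible way to collect the measurements is sketched below. The helper names `timed` and `pages_per_second` are hypothetical, and the sketch assumes you note the number of available processors yourself at the time of each run, since how to query it is not covered here.

```python
import time

def pages_per_second(num_pages, elapsed_seconds):
    """Throughput: pages processed divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_pages / elapsed_seconds

def timed(work, *args):
    """Run work(*args); return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = work(*args)
    return result, time.perf_counter() - start

def measurement_row(num_processors, num_pages, elapsed_seconds):
    """One row of the measurement table, as a tab-separated string."""
    rate = pages_per_second(num_pages, elapsed_seconds)
    return "%d\t%d\t%.2f\t%.2f" % (num_processors, num_pages,
                                   elapsed_seconds, rate)
```

Wall-clock time is used here because the time spent waiting for the grid is part of what is being measured.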

You will also need to figure out how many pages each task should process in one go. In other words, to process 1000 wiki pages, you could generate 100 tasks that can run in parallel, where each task runs on 1 processor and parses 10 different pages. Or you could create 10 tasks processing 100 pages each. Make sure you explain why you picked a particular approach.
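The trade-off between many small tasks and a few large ones can be explored with a small helper that splits a list of page Ids into blocks. This is only a sketch; `make_blocks` is a hypothetical name, and how each block is then turned into an XGrid task is left to your program.

```python
def make_blocks(page_ids, pages_per_task):
    """Split page_ids into consecutive blocks of at most pages_per_task Ids."""
    if pages_per_task < 1:
        raise ValueError("pages_per_task must be at least 1")
    return [page_ids[i:i + pages_per_task]
            for i in range(0, len(page_ids), pages_per_task)]

# 1000 pages as 100 tasks of 10 pages, or as 10 tasks of 100 pages.
ids = list(range(1000))
small_tasks = make_blocks(ids, 10)
large_tasks = make_blocks(ids, 100)
```

Making `pages_per_task` a parameter lets you rerun the same experiment with different block sizes and compare the throughput.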

Submission

Please submit your program(s), including everything needed to make them work (that includes files of stop words!). If you reported your measurements in a pdf, please include it as well!

  submit hw3 yourfile1
  submit hw3 yourfile2
  submit hw3 etc...

Misc. Information

  • Remember that this is the first part of Project 2. You may discover that there is something important that controls the performance of your program, but you may not have time to fully explore/develop a solution in this homework. Make sure you mention it in your report, indicating that this is something that will be useful to incorporate in the project.


Accessing Wiki Pages

This is a two-step process. First we need to get a number of Page Ids. For example, if we just want 10 pages, we request the following Url:

http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Count=10

The output will be:

10000
10050000
10070000
10140000
10200000
10230000
1030000
10320000
1040000
10430000
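The request above and its one-Id-per-line output could be handled as in the following sketch. The function names are hypothetical, and `fetch_text` assumes the server is reachable and answers in UTF-8.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi"

def ids_url(count):
    """Build the Url that requests `count` page Ids."""
    return BASE_URL + "?" + urlencode({"Count": count})

def parse_ids(listing):
    """Turn the server's one-Id-per-line output into a list of ints."""
    return [int(token) for token in listing.split() if token.isdigit()]

def fetch_text(url, timeout=30):
    """Download a Url and decode the body (assumes a UTF-8 response)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

# Typical use (requires network access to the server):
#   page_ids = parse_ids(fetch_text(ids_url(10)))
```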

To get the page with Id 1000, for example, we access the Web server at the same address, but with a different request:

http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Id=1000

The output is:

<xml>
<title>Hercule Poirot</title>
<id>1000</id>
<contributors>
<contrib>
<username>TXiKiBoT</username>
<id>3171782</id>

<length>51946</length></contrib>
</contributors>
<categories>
<cat>Hercule Poirot</cat>
<cat>Fictional private investigators</cat>
<cat>Series of books</cat>
<cat>Hercule Poirot characters</cat>
<cat>Fictional Belgians</cat>
</categories>
<pagelinks>
<page></page>
<page>16 July</page>
<page>1916</page>
<page>1989</page>
<page>2011</page>
<page>A. E. W. Mason</page>
<page>Academy Award</page>
<page>Agatha Christie</page>
<page>Agatha Christie Hour</page>
<page>Agatha Christie's Great Detectives Poirot and Marple</page>
<page>Agatha Christie's Poirot</page>
<page>Albert Finney</page>
<page>Alfred Molina</page>
<page>Alibi (1931 film)</page>
<page>Alibi (play)</page>
<page>Angela Easterling</page>
<page>Animaniacs</page>
<page>Appointment with Death (film)</page>
<page>Arthur Conan Doyle</page>
<page>Austin Trevor</page>
<page>BBC 7</page>
<page>BBC Radio 4</page>
<page>Belgium</page>
<page>Bernice Summerfield</page>
<page>Black Coffee (1931 film)</page>
<page>Brussels</page>
<page>C. Auguste Dupin</page>
<page>Captain Arthur Hastings</page>
<page>Cards on the Table</page>
<page>Charles Laughton</page>
<page>Charlie Chan</page>
<page>Charterhouse Square</page>
<page>Count Duckula</page>
<page>Crime na Pensão Estrelinha</page>
<page>Crooked House</page>
<page>Curtain (novel)</page>
<page>Dave Stone</page>
<page>David Suchet</page>
<page>Daylight Robbery on the Orient Express</page>
<page>Dead Man's Folly</page>
<page>Death on the Nile</page>
<page>Death on the Nile (1978 film)</page>
<page>Demographics of Belgium</page>
<page>Detective Conan</page>
<page>Detective-Judge Armitage</page>
<page>Dudley Jones</page>
<page>Eastern Europe</page>
<page>Edgar Allan Poe</page>
<page>Edmund Wilson</page>
<page>Elephants Can Remember</page>
<page>Emma Bunton</page>
<page>Evil Under the Sun (1982 film)</page>
<page>Faye Dunaway</page>
<page>Finland</page>
<page>Five Little Pigs</page>
<page>Florin Court</page>
<page>Frank Howel Evans</page>
<page>Geronimo Stilton</page>
<page>Grey matter</page>
<page>HP Brown Sauce</page>
<page>Hallowe'en Party</page>
<page>Harold Huber</page>
<page>Henry Edwards (actor)</page>
<page>Hercules</page>
<page>Herman José</page>
<page>Hugh Laurie</page>
<page>ITV</page>
<page>Ian Holm</page>
<page>Inspector Lestrade</page>
<page>Jason Alexander</page>
<page>John Cleese</page>
<page>John Dickson Carr</page>
<page>John Moffat (actor)</page>
<page>José Carlos Somoza</page>
<page>Kaoru Yachigusa</page>
<page>Leslie S. Hiscott</page>
<page>London</page>
<page>Lord Edgware Dies</page>
<page>Lord Edgware Dies (1934 film)</page>
<page>Marie Belloc Lowndes</page>
<page>Mercury Players</page>
<page>Michael Morton (dramatist)</page>
<page>Middle East</page>
<page>Miss Marple</page>
<page>Mouri Kogoro</page>
<page>Muppets Tonight</page>
<page>Murder By Death</page>
<page>Murder on the Orient Express</page>
<page>Murder on the Orient Express (1974 film)</page>
<page>Murder on the Orient Express (2001 film)</page>
<page>Mycroft Holmes</page>
<page>NHK</page>
<page>New York Times</page>
<page>Nick and Nora Charles</page>
<page>Ordeal by Innocence</page>
<page>Parker Pyne</page>
<page>Pauline Moran</page>
<page>Peter Serafinowicz</page>
<page>Peter Ustinov</page>
<page>Plot devices in Agatha Christie's novels</page>
<page>Police Officer</page>
<page>Police officer</page>
<page>Rape of Belgium</page>
<page>Rashomon (movie)</page>
<page>Robert Barnard</page>
<page>Roman Catholic</page>
<page>Rosalind Hicks</page>
<page>Russian Revolution (1917)</page>
<page>Sam Spade</page>
<page>Sandhurst</page>
<page>Scotland Yard</page>
<page>Sherlock Holmes</page>
<page>Ship of Fools (Stone novel)</page>
<page>Smithfield, London</page>
<page>South America</page>
<page>Spa, Belgium</page>
<page>Spice World (film)</page>
<page>Spiceworld (film)</page>
<page>Squash (plant)</page>
<page>Sven Hjerson</page>
<page>The ABC Murders</page>
<page>The Alphabet Murders</page>
<page>The Athenian Murders</page>
<page>The Big Four (novel)</page>
<page>The Campbell Playhouse</page>
<page>The Goodies (TV series)</page>
<page>The Labours of Hercules</page>
<page>The Murder of Roger Ackroyd</page>
<page>The Mysterious Affair at Styles</page>
<page>The Pajamas</page>
<page>The Strange Case of the End of Civilization as We Know It</page>
<page>Thirteen at Dinner</page>
<page>Three Act Tragedy</page>
<page>Tony Randall</page>
<page>Treaty of Versailles</page>
<page>United Kingdom</page>
<page>Versailles</page>
<page>Walloons</page>
<page>Warner Brothers</page>
<page>Wilkie Collins</page>
<page>World War I</page>
<page>Yakko Warner</page>
<page>amyl nitrite</page>
<page>anime</page>
<page>arthritis</page>
<page>casus belli</page>
<page>charlatan</page>
<page>detective</page>
<page>fictional character</page>
<page>made-for-television</page>
<page>manga</page>
<page>narrator</page>
<page>novel</page>
<page>parody</page>
<page>private detective</page>
<page>refugee</page>
<page>retroactive continuity</page>
<page>turnip pocket watch</page>
</pagelinks>
<text>
. Belgium Belgian . occupation = Private Dectective. Former Retired DetectiveFormer Police Police officer officer . 
... (lots of text removed here...)
. Hercule Poirot . uk. Еркюль Пуаро . vi. Hercule Poirot . zh. 赫丘勒·白羅 .
</text>
</xml>

In general, the page will have several sections, coded in XML, and always in the same order:

  • the title, in <title> tags,
  • the contributors, in <contributors> and <contrib> tags,
  • the categories the page belongs to, in <categories> and <cat> tags,
  • the links to other wikipedia pages the page contains, in <pagelinks> and <page> tags,
  • the text of the page, with all the html and wiki tags removed, between <text> tags.
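Because the sections always appear in the same order, the fields can be pulled out with simple pattern matching, as in the sketch below. `parse_page` is a hypothetical helper; regular expressions are used here on the assumption that the page text may not always be well-formed XML, and a real XML parser is an alternative if it is.

```python
import re

def parse_page(xml_text):
    """Extract title, page Id, categories, links and text from one page."""
    def first(tag):
        # First occurrence of <tag>...</tag>; relies on the fixed section
        # order, so the first <id> is the page Id, not a contributor Id.
        match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), xml_text, re.DOTALL)
        return match.group(1).strip() if match else ""

    links = re.findall(r"<page>(.*?)</page>", xml_text, re.DOTALL)
    return {
        "title": first("title"),
        "id": first("id"),
        "categories": re.findall(r"<cat>(.*?)</cat>", xml_text, re.DOTALL),
        "links": [link for link in links if link],  # drop empty <page></page>
        "text": first("text"),
    }
```

The `categories` and `text` entries are the two fields the problem statement needs; the others are returned only for completeness.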

The end of the text section always contains foreign characters. The text should be encoded in UTF-8, an encoding of the international Unicode character set, of which ASCII is a subset.
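In practice this means the bytes received from the server should be decoded explicitly before any word counting. A minimal sketch follows; `decode_page` is a hypothetical name, and replacing undecodable bytes rather than failing is a design choice, not a requirement of the assignment.

```python
def decode_page(raw_bytes):
    """Decode a page body as UTF-8, replacing any undecodable bytes."""
    return raw_bytes.decode("utf-8", errors="replace")

# Pure-ASCII text has the same bytes in ASCII and in UTF-8 ...
same_bytes = "Hercule Poirot".encode("ascii") == "Hercule Poirot".encode("utf-8")

# ... while other characters take several bytes each in UTF-8.
raw = "zh. 赫丘勒·白羅 .".encode("utf-8")
text = decode_page(raw)
```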

CGI Program


This section is only visible to computers located at Smith College