|
|
(12 intermediate revisions by the same user not shown) |
Line 2: |
Line 2: |
| <onlysmith> | | <onlysmith> |
| | | |
| + | __TOC__ |
| + | |
| + | <bluebox> |
| + | This is the extension of [[CSC352_Homework_3 | Homework #3]], which is built on top of the [[XGrid Tutorial Part 2: Processing Wikipedia Pages | XGrid Lab 2]]. It is due on Thursday, April 8th. |
| + | </bluebox> |
| + | |
| + | =Assignment= |
| + | |
| + | * Process N wiki pages. For each one, keep track of the categories contained in the page and find the 5 most frequent words (not including stop words) in the page. |
| + | * Associate with each category the most frequent words that have been associated with it over the N pages processed. |
| + | * Output the result (or a sample of it). |
| + | * Measure the execution time of the program. |
| + | * Write a summary of the project, as illustrated in the guidelines presented in class (3/9, 3/11). |
| + | * For this project, build on top of the homework and concentrate on the formatting of the project; include graphs and an analysis of your results. |
| + | * Submit a PDF with your presentation, graphs, and analysis. Submit your programs, even if they are the same as the files you submitted for the homework. |
| + | |
| + | submit project2 file1 |
| + | submit project2 file2 |
| + | ... |
| + | |
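The per-page word count and the per-category aggregation described in the assignment can be sketched as follows. This is only an illustration in Python 3, not the expected solution: the stop-word list is a tiny hypothetical placeholder, the function names are made up for the example, and each page is assumed to have already been reduced to a list of category names plus its plain text.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run needs a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was",
              "for", "on", "by", "with", "as", "at", "it"}


def top_words(text, n=5):
    """Return the n most frequent non-stop words in text as (word, count) pairs."""
    # [^\W\d_]+ matches runs of letters, including accented (non-ASCII) ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


def aggregate(pages, n=5):
    """Accumulate each page's top-n words under every category of that page.

    pages is an iterable of (categories, text) pairs (an assumed layout).
    Returns a dict mapping category name to a Counter of word counts.
    """
    by_category = defaultdict(Counter)
    for categories, text in pages:
        frequent = dict(top_words(text, n))
        for category in categories:
            # Counter.update adds the counts to those already stored.
            by_category[category].update(frequent)
    return by_category
```

Calling `aggregate(pages)["SomeCategory"].most_common(5)` then gives the words most often associated with that category. Taking a `time.perf_counter()` reading before and after the call is one simple way to measure the execution time.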
| + | =Project Details= |
| ==Accessing Wiki Pages== | | ==Accessing Wiki Pages== |
| | | |
− | This is a two-step process. First you need to get a number of Page Ids. For example, if we just want 10 pages, we request the following Url: | + | This is a two-step process. First we need to get a number of Page Ids. For example, if we just want 10 pages, we request the following URL: |
| | | |
− | :::http://xgridmac.dyndns.org/~thiebaut/getWikiPageById.cgi?Count=10 | + | :::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Count=10 |
| | | |
| The output will be: | | The output will be: |
Line 21: |
Line 42: |
| 10430000 | | 10430000 |
| | | |
− | To get the page with Id 10000, for example, then access the Web server at the same address, but with a different ''request'': | + | To get the page with Id 1000, for example, we access the Web server at the same address, but with a different ''request'': |
| | | |
− | :::http://xgridmac.dyndns.org/~thiebaut/getWikiPageById.cgi?Id=1000 | + | :::http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi?Id=1000 |
| | | |
| The output is: | | The output is: |
Line 53: |
Line 74: |
| <page>A. E. W. Mason</page> | | <page>A. E. W. Mason</page> |
| <page>Academy Award</page> | | <page>Academy Award</page> |
− | <page>Agatha Christie</page>
| + | ... |
− | <page>Agatha Christie Hour</page>
| |
− | <page>Agatha Christie's Great Detectives Poirot and Marple</page>
| |
− | <page>Agatha Christie's Poirot</page>
| |
− | <page>Albert Finney</page>
| |
− | <page>Alfred Molina</page>
| |
− | <page>Alibi (1931 film)</page>
| |
− | <page>Alibi (play)</page>
| |
− | <page>Angela Easterling</page>
| |
− | <page>Animaniacs</page>
| |
− | <page>Appointment with Death (film)</page>
| |
− | <page>Arthur Conan Doyle</page>
| |
− | <page>Austin Trevor</page>
| |
− | <page>BBC 7</page>
| |
− | <page>BBC Radio 4</page>
| |
− | <page>Belgium</page>
| |
− | <page>Bernice Summerfield</page>
| |
− | <page>Black Coffee (1931 film)</page>
| |
− | <page>Brussels</page>
| |
− | <page>C. Auguste Dupin</page>
| |
− | <page>Captain Arthur Hastings</page>
| |
− | <page>Cards on the Table</page>
| |
− | <page>Charles Laughton</page>
| |
− | <page>Charlie Chan</page>
| |
− | <page>Charterhouse Square</page>
| |
− | <page>Count Duckula</page>
| |
− | <page>Crime na Pensão Estrelinha</page>
| |
− | <page>Crooked House</page>
| |
− | <page>Curtain (novel)</page>
| |
− | <page>Dave Stone</page>
| |
− | <page>David Suchet</page>
| |
− | <page>Daylight Robbery on the Orient Express</page>
| |
− | <page>Dead Man's Folly</page>
| |
− | <page>Death on the Nile</page>
| |
− | <page>Death on the Nile (1978 film)</page>
| |
− | <page>Demographics of Belgium</page>
| |
− | <page>Detective Conan</page>
| |
− | <page>Detective-Judge Armitage</page>
| |
− | <page>Dudley Jones</page>
| |
− | <page>Eastern Europe</page>
| |
− | <page>Edgar Allan Poe</page>
| |
− | <page>Edmund Wilson</page>
| |
− | <page>Elephants Can Remember</page>
| |
− | <page>Emma Bunton</page>
| |
− | <page>Evil Under the Sun (1982 film)</page>
| |
− | <page>Faye Dunaway</page>
| |
− | <page>Finland</page>
| |
− | <page>Five Little Pigs</page>
| |
− | <page>Florin Court</page>
| |
− | <page>Frank Howel Evans</page>
| |
− | <page>Geronimo Stilton</page>
| |
− | <page>Grey matter</page>
| |
− | <page>HP Brown Sauce</page>
| |
− | <page>Hallowe'en Party</page>
| |
− | <page>Harold Huber</page>
| |
− | <page>Henry Edwards (actor)</page>
| |
− | <page>Hercules</page>
| |
− | <page>Herman José</page>
| |
− | <page>Hugh Laurie</page>
| |
− | <page>ITV</page>
| |
− | <page>Ian Holm</page>
| |
− | <page>Inspector Lestrade</page>
| |
− | <page>Jason Alexander</page>
| |
− | <page>John Cleese</page>
| |
− | <page>John Dickson Carr</page>
| |
− | <page>John Moffat (actor)</page>
| |
− | <page>José Carlos Somoza</page>
| |
− | <page>Kaoru Yachigusa</page>
| |
− | <page>Leslie S. Hiscott</page>
| |
− | <page>London</page>
| |
− | <page>Lord Edgware Dies</page>
| |
− | <page>Lord Edgware Dies (1934 film)</page>
| |
− | <page>Marie Belloc Lowndes</page>
| |
− | <page>Mercury Players</page>
| |
− | <page>Michael Morton (dramatist)</page>
| |
− | <page>Middle East</page>
| |
− | <page>Miss Marple</page>
| |
− | <page>Mouri Kogoro</page>
| |
− | <page>Muppets Tonight</page>
| |
− | <page>Murder By Death</page>
| |
− | <page>Murder on the Orient Express</page>
| |
− | <page>Murder on the Orient Express (1974 film)</page>
| |
− | <page>Murder on the Orient Express (2001 film)</page>
| |
− | <page>Mycroft Holmes</page>
| |
− | <page>NHK</page>
| |
− | <page>New York Times</page>
| |
− | <page>Nick and Nora Charles</page>
| |
− | <page>Ordeal by Innocence</page>
| |
− | <page>Parker Pyne</page>
| |
− | <page>Pauline Moran</page>
| |
− | <page>Peter Serafinowicz</page>
| |
− | <page>Peter Ustinov</page>
| |
− | <page>Plot devices in Agatha Christie's novels</page>
| |
− | <page>Police Officer</page>
| |
− | <page>Police officer</page>
| |
− | <page>Rape of Belgium</page>
| |
− | <page>Rashomon (movie)</page>
| |
− | <page>Robert Barnard</page>
| |
− | <page>Roman Catholic</page>
| |
− | <page>Rosalind Hicks</page>
| |
− | <page>Russian Revolution (1917)</page>
| |
− | <page>Sam Spade</page>
| |
− | <page>Sandhurst</page>
| |
− | <page>Scotland Yard</page>
| |
− | <page>Sherlock Holmes</page>
| |
− | <page>Ship of Fools (Stone novel)</page>
| |
− | <page>Smithfield, London</page>
| |
− | <page>South America</page>
| |
− | <page>Spa, Belgium</page>
| |
− | <page>Spice World (film)</page>
| |
− | <page>Spiceworld (film)</page>
| |
− | <page>Squash (plant)</page>
| |
− | <page>Sven Hjerson</page>
| |
− | <page>The ABC Murders</page>
| |
− | <page>The Alphabet Murders</page>
| |
− | <page>The Athenian Murders</page>
| |
− | <page>The Big Four (novel)</page>
| |
− | <page>The Campbell Playhouse</page>
| |
− | <page>The Goodies (TV series)</page>
| |
− | <page>The Labours of Hercules</page>
| |
− | <page>The Murder of Roger Ackroyd</page>
| |
− | <page>The Mysterious Affair at Styles</page>
| |
− | <page>The Pajamas</page>
| |
− | <page>The Strange Case of the End of Civilization as We Know It</page>
| |
− | <page>Thirteen at Dinner</page>
| |
− | <page>Three Act Tragedy</page>
| |
− | <page>Tony Randall</page>
| |
− | <page>Treaty of Versailles</page>
| |
− | <page>United Kingdom</page>
| |
− | <page>Versailles</page>
| |
− | <page>Walloons</page>
| |
− | <page>Warner Brothers</page>
| |
− | <page>Wilkie Collins</page>
| |
− | <page>World War I</page>
| |
− | <page>Yakko Warner</page>
| |
− | <page>amyl nitrite</page>
| |
− | <page>anime</page>
| |
− | <page>arthritis</page>
| |
− | <page>casus belli</page>
| |
− | <page>charlatan</page>
| |
− | <page>detective</page>
| |
− | <page>fictional character</page>
| |
− | <page>made-for-television</page>
| |
− | <page>manga</page>
| |
− | <page>narrator</page>
| |
− | <page>novel</page>
| |
− | <page>parody</page>
| |
| <page>private detective</page> | | <page>private detective</page> |
| <page>refugee</page> | | <page>refugee</page> |
Line 219: |
Line 94: |
| * the links to other wikipedia pages the page contains, in '''<pagelinks>''' and '''<page>''' tags, | | * the links to other wikipedia pages the page contains, in '''<pagelinks>''' and '''<page>''' tags, |
| * the text of the page, with all the html and wiki tags removed, between '''<text>''' tags. | | * the text of the page, with all the html and wiki tags removed, between '''<text>''' tags. |
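The tags listed above can be pulled out of the returned text with a few regular expressions. The sketch below is illustrative only: it assumes the tags appear literally and are not nested inside one another, the function names are made up, and `SAMPLE` is a shortened, invented stand-in for a real response.

```python
import re


def extract_links(page_xml):
    """Return the titles found in <page> tags inside the <pagelinks> block."""
    block = re.search(r"<pagelinks>(.*?)</pagelinks>", page_xml, re.DOTALL)
    if block is None:
        return []
    return re.findall(r"<page>(.*?)</page>", block.group(1), re.DOTALL)


def extract_text(page_xml):
    """Return the content between the <text> tags, or "" if there is none."""
    match = re.search(r"<text>(.*?)</text>", page_xml, re.DOTALL)
    return match.group(1).strip() if match else ""


# Invented, shortened sample; a real response contains many more links.
SAMPLE = """<pagelinks>
<page>A. E. W. Mason</page>
<page>Academy Award</page>
</pagelinks>
<text>
Some page text.
</text>"""
```

Regular expressions are used here rather than an XML parser because nothing in the description guarantees that the server output is a well-formed XML document with a single root element.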
| + | |
| + | The end of the text section always contains foreign (non-ASCII) characters. The text should be decoded as UTF-8, an encoding of the international Unicode character set that is backward compatible with ASCII. |
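The two requests and the UTF-8 decoding can be sketched as below, assuming Python 3 and its standard `urllib` module. The helper names are hypothetical, and the sketch makes no claim about how the server behaves beyond the two request forms shown earlier.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://xgridmac.dyndns.org/cgi-bin/getWikiPageById.cgi"


def id_list_url(count):
    """URL for step 1: ask the server for `count` page Ids."""
    return BASE_URL + "?" + urlencode({"Count": count})


def page_url(page_id):
    """URL for step 2: ask the server for the page with the given Id."""
    return BASE_URL + "?" + urlencode({"Id": page_id})


def decode_page(raw):
    """Decode response bytes as UTF-8, replacing any invalid byte sequences."""
    return raw.decode("utf-8", errors="replace")


def fetch(url, timeout=30):
    """Fetch a URL and return its body as a str decoded from UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        return decode_page(response.read())
```

With `errors="replace"`, a malformed byte sequence becomes the Unicode replacement character instead of raising an exception, so one bad page does not stop a run over N pages.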
| + | |
| + | ===CGI Program=== |
| + | Just for information, the CGI program that processes the request is available [[CSC352 getWikiPageById.cgi | here]]. |
| </onlysmith> | | </onlysmith> |
| + | |
| + | ==Submission== |
| + | |
| + | Submit a PDF (and additional files if needed) as follows: |
| + | |
| + | submit project2 project2.pdf |
| + | |
| + | <br /> |
| + | <br /> |
| + | <br /> |
| + | <br /> |
| + | <br /> |
| + | <br /> |
| + | <br /> |
| + | [[Category:CSC352]][[Category:Project]][[Category:XGrid]] |