CSC352 Homework 4 2013

From dftwiki3
Revision as of 14:28, 21 October 2013 by Thiebaut (talk | contribs) (Requirements)
Jump to: navigation, search

--D. Thiebaut (talk) 14:19, 21 October 2013 (EDT)




This homework deals with writing code in C and becoming proficient at processing text using C, which requires some pointer operations. It also contributes some important functionality to our project. The homework is due on the 31st of Oct., at 11:59 p.m. You may work in pair on this homework, or by yourself.




Info for Mac Users


If you want to do some of this homework on your Mac, you should install the command line tools, if you haven't done already so. To install them, go to http://developer.apple.com/downloads/index.action, and sign in for an Apple Id (you should already have one), and download the package. In the pane on the left, search for "command line tools" and choose the package appropriate to your version of OS X.

Computing Image-Access Statistics


First, you should take a look at the information in the Class Wiki] on the image repository and how to gather important files for us.

Of interest to us are the files that contain access statistics for all pages and files in the Mediawiki projects. They are available for free download from dumps.wikipedia.org. Note that wikipedia is just one of the projects supported by Mediawiki, which is the foundation that hosts wikipedia. There are many other projects. This is important because the pages of statistics contain frequency for many projects, and not just wikipedia.

To get a feel for what they contain, download a couple of them.

  • Login to beowulf or beowulf2 (or work on your mac, in which case you may want to install wget )
  • type the following commands (user input in bold face):
 cd
 mkdir temp
 cd temp
  • If you work on your mac, type the following command to download one of the stats file from mediawik:
 curl http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz \
               -o pagecounts-20130101-000000.gz
Note the backslash is to break the line in two. You do not need it. But if you do, do not put any other character besides pressing ENTER after the backslash.
  • If you are working on beowulf/beowulf2, type instead
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
Note: curl is also available on beowulf/2, but I think wget is more universal and you should become familiar with it as well
  • On either Mac or Beowulf, unzip the file:
 gunzip pagecounts-20130101-000000.gz
  • You should have a new file in your directory:
 less pagecounts-20130101-000000 

 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:50\x13+\xD5hG\xD7\xC0\x22\xFF 1 325
 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:75\x89XZX-Cach 1 325
 *.s File:25nishV 1 325
 * Cookie_\x00\x00 3 802
 ...
The file is long. It contains over 6 million lines. Below are some explanations for its format.

Format of the Stats Files

The information below is taken verbatim directly from dumps.wikimedia.org/other/pagecounts-raw/.

Each request of a page, whether for editing or reading, whether a "special page" such as a log of actions generated on the fly, or an article from Wikipedia or one of the other projects, reaches one of our squid caching hosts and the request is sent via udp to a filter which tosses requests from our internal hosts, as well as requests for wikis that aren't among our general projects. This filter writes out the project name, the size of the page requested, and the title of the page requested.

Here are a few sample lines from one file:

    fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
    fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
    fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743
    fr.b Special:Recherche/All_Mixed_Up 1 730
    fr.b Special:Recherche/Andr%C3%A9_Gazut.html 1 737
  

In the above, the first column "fr.b" is the project name. The following abbreviations are used:

wikibooks: ".b" wiktionary: ".d" wikimedia: ".m" wikipedia mobile: ".mw" wikinews: ".n" wikiquote: ".q" wikisource: ".s" wikiversity: ".v" mediawiki: ".w"

Projects without a period and a following character are wikipedia projects. The second column is the title of the page retrieved, the third column is the number of requests, and the fourth column is the size of the content returned.

These are hourly statistics, so in the line

    en Main_Page 242332 4737756101
  

we see that the main page of the English language Wikipedia was requested over 240 thousand times during the specific hour. These are not unique visits. In some directories you will see files which have names starting with "projectcount". These are total views per hour per project, generated by summing up the entries in the pagecount files. The first entry in a line is the project name, the second is the number of non-unique views, and the third is the total number of bytes transferred.

Filtering out the Information


So the lines of interest to us are the lines that have the keyworkd "File:" in them, and start with "en " or "en.mw ". Here is an example of such lines, taken from the same stats file:

en File:%22Almighty_God_as_Created_the_Mind_Free_._._.%22_at_Jefferson_Memorial.jpg 1 12544
en File:%22Andromeda%22_by_Tamara_de_Lempicka.jpg 1 9776
...
en file:wordpress_template_hierarchy.png 1 617
en file:workathomead.jpg 1 601
en file:world_countries_standard_&_poor%27s_ratings.svg 2 1365

I computed (using bash, grep and wc) that there are 228,991 such lines in the pagecount file.

Assignment Problem #1


Your first assignment is to write a program in C that will read the stats file and output only the lines of interest, that is the lines that start with "en " (a space follows "en" ), but not the ones that start with "ten ", and that also contains " File:" or " file:" in them (whichever the case).

Requirements


  • Call your program filter4.c (4 for Homework 4)
  • Your program should get two file names from the command line:
    • the name of the input file (e.g. pagecounts-20130101-000000)
    • the name of the output file that will contain only the the lines corresponding to image files in the English wikipedia (e.g. pagecounts-20130101-000000.filtered)
  • If the user does not provide two file names, your program should display the expected syntax for its use.


Optional and Extra Credits


Make your program output only the 10 lines containing the 10 largest frequency numbers.