CSC352 Homework 4 2013

Revision as of 14:55, 21 October 2013 by Thiebaut (talk | contribs) (Statistics-File Format)

--D. Thiebaut (talk) 14:19, 21 October 2013 (EDT)




This homework deals with writing code in C and becoming proficient at processing text in C, which requires some pointer operations. It also contributes some important functionality to our project. The homework is due on October 31 at 11:59 p.m. You may work in pairs on this homework, or by yourself.



Computing Image-Access Statistics


First, you should take a look at the information in the Class Wiki on the image repository and how to gather important files for us.

Of interest to us are the files that contain access statistics for all pages and files in the Wikimedia projects. They are available for free download from dumps.wikimedia.org. Note that Wikipedia is just one of the projects supported by Wikimedia, which is the foundation that hosts Wikipedia. There are many other projects. This is important because the statistics files contain request counts for many projects, and not just Wikipedia.

To get a feel for what they contain, download a couple of them.

  • Log in to beowulf or beowulf2 (or work on your Mac, in which case you may want to install wget)
  • Type the following commands (user input in bold face):
 cd
 mkdir temp
 cd temp
  • If you work on your Mac, type the following command to download one of the stats files from Wikimedia:
 curl http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz \
               -o pagecounts-20130101-000000.gz
(Note: the backslash only breaks the command across two lines. You can type the whole command on one line without it. If you do use it, press ENTER immediately after the backslash, with no other character in between.)
  • If you are working on beowulf/beowulf2, type instead
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
  • On either Mac or Beowulf, unzip the file:
 gunzip pagecounts-20130101-000000.gz
  • You should now have a new, uncompressed file in your directory. Take a look at it with less:
 less pagecounts-20130101-000000 

 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:50\x13+\xD5hG\xD7\xC0\x22\xFF 1 325
 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:75\x89XZX-Cach 1 325
 *.s File:25nishV 1 325
 * Cookie_\x00\x00 3 802
 ...
The file is long. It contains over 6 million lines. Below are some explanations for its format.
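
Since the homework is about processing this kind of text in C, a first sanity check is to count the lines of the uncompressed file yourself. The function below is only a minimal sketch: the name count_lines is hypothetical, and it assumes the stream was already opened by the caller, for example with fopen("pagecounts-20130101-000000", "r").

```c
#include <assert.h>
#include <stdio.h>

/* Count the newline characters on an already opened stream.
 * count_lines is a hypothetical helper, not part of the assignment.
 * A final line that lacks a trailing '\n' is not counted. */
long count_lines(FILE *fp)
{
    assert(fp != NULL);

    long lines = 0;
    int c;                      /* int, not char, so EOF is detected */

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }
    return lines;
}
```

Reading character by character avoids having to guess a maximum line length, which matters here because some page titles are very long.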

Format of the Stats File

The information below is taken verbatim from dumps.wikimedia.org/other/pagecounts-raw/.

Each request of a page, whether for editing or reading, whether a "special page" such as a log of actions generated on the fly, or an article from Wikipedia or one of the other projects, reaches one of our squid caching hosts and the request is sent via udp to a filter which tosses requests from our internal hosts, as well as requests for wikis that aren't among our general projects. This filter writes out the project name, the size of the page requested, and the title of the page requested.

Here are a few sample lines from one file:

    fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
    fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
    fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743
    fr.b Special:Recherche/All_Mixed_Up 1 730
    fr.b Special:Recherche/Andr%C3%A9_Gazut.html 1 737
  

In the above, the first column "fr.b" is the project name. The following abbreviations are used:

    wikibooks: ".b"
    wiktionary: ".d"
    wikimedia: ".m"
    wikipedia mobile: ".mw"
    wikinews: ".n"
    wikiquote: ".q"
    wikisource: ".s"
    wikiversity: ".v"
    mediawiki: ".w"

Projects without a period and a following character are wikipedia projects. The second column is the title of the page retrieved, the third column is the number of requests, and the fourth column is the size of the content returned.
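
The four-column layout and the suffix table above can be sketched in C as follows. This is an illustration under stated assumptions, not the required solution: the struct page_stat and the functions parse_stat_line and project_family are hypothetical names, the buffer sizes are guesses, and the parser assumes that titles contain no whitespace (spaces appear as underscores in the samples).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of a pagecounts file (hypothetical layout;
 * the buffer sizes are assumptions, not documented limits). */
struct page_stat {
    char project[64];
    char title[2048];
    unsigned long long requests;
    unsigned long long bytes;
};

/* Split "project title requests bytes" into its four fields.
 * Returns 1 on success, 0 if the line does not have four fields. */
int parse_stat_line(const char *line, struct page_stat *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, "%63s %2047s %llu %llu",
                   out->project, out->title, &out->requests, &out->bytes);
    return n == 4;
}

/* Map a project code such as "fr.b" to its project family.
 * Codes without a period are Wikipedia projects. */
const char *project_family(const char *project)
{
    static const struct { const char *suffix; const char *name; } table[] = {
        {".b", "wikibooks"},  {".d", "wiktionary"},  {".m", "wikimedia"},
        {".mw", "wikipedia mobile"}, {".n", "wikinews"}, {".q", "wikiquote"},
        {".s", "wikisource"}, {".v", "wikiversity"}, {".w", "mediawiki"},
    };
    assert(project != NULL);

    const char *dot = strchr(project, '.');
    if (dot == NULL)
        return "wikipedia";
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(dot, table[i].suffix) == 0)
            return table[i].name;
    }
    return "unknown";   /* suffix not in the table above */
}
```

The byte count is stored in an unsigned long long because values such as 4737756101 do not fit in 32 bits.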

These are hourly statistics, so in the line

    en Main_Page 242332 4737756101
  

we see that the main page of the English language Wikipedia was requested over 240 thousand times during the specific hour. These are not unique visits. In some directories you will see files which have names starting with "projectcount". These are total views per hour per project, generated by summing up the entries in the pagecount files. The first entry in a line is the project name, the second is the number of non-unique views, and the third is the total number of bytes transferred.
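
To make the relationship between the two kinds of files concrete, here is a rough sketch of how per-project totals could be accumulated from pagecount lines. Everything in it is an assumption made for illustration: the names project_total and add_line, the MAX_PROJECTS bound, and the linear search, which is simple rather than fast. It is not how Wikimedia generates the projectcount files.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PROJECTS 4096   /* assumed upper bound on distinct projects */

/* Running totals for one project (hypothetical layout). */
struct project_total {
    char project[64];
    unsigned long long views;
    unsigned long long bytes;
};

/* Add one pagecounts line to the totals table.
 * Returns the index of the project's entry, or -1 if the line is
 * malformed or the table is full. */
int add_line(struct project_total *totals, int *count, const char *line)
{
    assert(totals != NULL && count != NULL && line != NULL);

    char project[64];
    unsigned long long views, bytes;

    /* %*s reads and discards the page title (assumed to have no spaces) */
    if (sscanf(line, "%63s %*s %llu %llu", project, &views, &bytes) != 3)
        return -1;

    int i;
    for (i = 0; i < *count; i++) {
        if (strcmp(totals[i].project, project) == 0)
            break;
    }
    if (i == *count) {                  /* first time we see this project */
        if (*count == MAX_PROJECTS)
            return -1;
        strcpy(totals[i].project, project);
        totals[i].views = 0;
        totals[i].bytes = 0;
        (*count)++;
    }
    totals[i].views += views;
    totals[i].bytes += bytes;
    return i;
}
```

Calling add_line once for every line of a pagecounts file would leave one entry per project in the table, holding the project name, the number of non-unique views, and the total bytes transferred.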

Filtering out Unuseful Information