CSC352 Homework 4 2013

Revision as of 14:42, 21 October 2013 by Thiebaut (talk | contribs) (Computing Image-Access Statistics)

--D. Thiebaut (talk) 14:19, 21 October 2013 (EDT)




This homework is about writing code in C and becoming proficient at processing text in C, which requires some pointer operations. It also contributes some important functionality to our project. The homework is due on October 31 at 11:59 p.m. You may work in pairs on this homework, or by yourself.



Computing Image-Access Statistics


First, take a look at the information in the Class Wiki on the image repository and how to gather important files for us.

Of interest to us are the files that contain access statistics for all pages and files in the Wikimedia projects. They are available for free download from dumps.wikimedia.org. Note that Wikipedia is just one of the projects supported by Wikimedia, the foundation that hosts Wikipedia; there are many others. This matters because the statistics files contain access frequencies for many projects, not just Wikipedia.

To get a feel for what they contain, download a couple of them.

  • Log in to beowulf or beowulf2 (or work on your Mac)
  • Type the following commands (the first four lines are user input; the rest is the output of wget):
 cd
 mkdir temp
 cd temp
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 --2013-10-21 14:41:02--  http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 Resolving dumps.wikimedia.org... 208.80.152.185
 Connecting to dumps.wikimedia.org|208.80.152.185|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: 80093452 (76M) [application/x-gzip]
 Saving to: `pagecounts-20130101-000000.gz'

 100%  [=========================...=============================================>] 80,093,452  6.89M/s   in 8.9s    
 
 2013-10-21 14:41:12 (8.59 MB/s) - `pagecounts-20130101-000000.gz' saved [80093452/80093452]