CSC352 Homework 4 2013
--D. Thiebaut (talk) 14:19, 21 October 2013 (EDT)
This homework deals with writing code in C and becoming proficient at processing text in C, which requires some pointer operations. It also contributes some important functionality to our project. The homework is due on October 31 at 11:59 p.m. You may work in pairs on this homework, or by yourself.
Computing Image-Access Statistics
First, take a look at the information in the Class Wiki on the image repository and on how to gather the files that are important to us.
Of interest to us are the files that contain access statistics for all pages and files in the Wikimedia projects. They are available for free download from dumps.wikimedia.org. Note that Wikipedia is just one of the projects supported by Wikimedia, the foundation that hosts Wikipedia; there are many other projects. This matters because the statistics files contain access frequencies for many projects, not just Wikipedia.
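Because the statistics files mix entries from many projects, a program that reads them has to pick out the project of each line before counting anything. The sketch below illustrates one way to do this with pointer operations. It rests on an assumption to verify against a downloaded file: that each line holds four space-separated fields (project code, page title, request count, bytes transferred). The names `PageCount`, `parse_pagecount`, and `is_project` are hypothetical and used only for illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* One parsed line. ASSUMPTION: a line looks like
   "<project> <title> <count> <bytes>", separated by single spaces. */
typedef struct {
    const char *project;      /* points into the caller's buffer */
    const char *title;        /* points into the caller's buffer */
    unsigned long count;      /* number of requests */
    unsigned long long bytes; /* total bytes transferred */
} PageCount;

/* Splits `line` in place by overwriting each separating space with '\0'.
   Returns 1 on success, 0 if the line does not have exactly four fields
   or if a numeric field is not a plain decimal number. */
int parse_pagecount(char *line, PageCount *out) {
    char *fields[4];
    char *p = line;
    char *end;
    int n = 0;

    /* Cut the line at the first newline or carriage return, if any. */
    line[strcspn(line, "\r\n")] = '\0';

    while (n < 4) {
        fields[n++] = p;          /* remember where this field starts */
        p = strchr(p, ' ');       /* find the next separator */
        if (p == NULL) break;     /* no more separators on this line */
        *p++ = '\0';              /* terminate the field, step past it */
    }
    /* Too few fields, or text left over after the fourth field. */
    if (n != 4 || p != NULL) return 0;

    /* Reject empty, signed, or non-numeric count and byte fields. */
    if (!isdigit((unsigned char)fields[2][0])) return 0;
    if (!isdigit((unsigned char)fields[3][0])) return 0;

    out->project = fields[0];
    out->title = fields[1];
    out->count = strtoul(fields[2], &end, 10);
    if (*end != '\0') return 0;
    out->bytes = strtoull(fields[3], &end, 10);
    if (*end != '\0') return 0;
    return 1;
}

/* Returns nonzero if the record belongs to the given project code. */
int is_project(const PageCount *pc, const char *project) {
    return strcmp(pc->project, project) == 0;
}
```

A caller would typically read one line at a time with `fgets` into a local buffer, pass it to `parse_pagecount`, and skip any line for which the function returns 0 or for which `is_project` does not match the project of interest. Because the function stores pointers into the caller's buffer rather than copying, the record is only valid until the buffer is reused for the next line.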
To get a feel for what they contain, download a couple of them.
- Log in to beowulf or beowulf2 (or work on your Mac).
- Type the following commands (the user input is the cd, mkdir, cd, and wget commands; everything after that is output from wget):
    cd
    mkdir temp
    cd temp
    wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
    --2013-10-21 14:41:02--  http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
    Resolving dumps.wikimedia.org... 208.80.152.185
    Connecting to dumps.wikimedia.org|208.80.152.185|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 80093452 (76M) [application/x-gzip]
    Saving to: `pagecounts-20130101-000000.gz'
    100% [=========================...=============================================>] 80,093,452 6.89M/s in 8.9s
    2013-10-21 14:41:12 (8.59 MB/s) - `pagecounts-20130101-000000.gz' saved [80093452/80093452]