Difference between revisions of "CSC352 Homework 4 2013"
(→Computing Image-Access Statistics) |
|||
Line 16: | Line 16: | ||
If you want to do some of this homework on your Mac, you should install the '''command line tools''', if you haven't done already so. To install them, go to [http://developer.apple.com/downloads/index.action http://developer.apple.com/downloads/index.action], and sign in for an Apple Id (you should already have one), and download the package. In the pane on the left, search for "command line tools" and choose the package appropriate to your version of OS X. | If you want to do some of this homework on your Mac, you should install the '''command line tools''', if you haven't done already so. To install them, go to [http://developer.apple.com/downloads/index.action http://developer.apple.com/downloads/index.action], and sign in for an Apple Id (you should already have one), and download the package. In the pane on the left, search for "command line tools" and choose the package appropriate to your version of OS X. | ||
<br /> | <br /> | ||
− | = | + | =Obtaining the Access Statistics for Wikipedia Images= |
<br /> | <br /> | ||
Revision as of 15:22, 21 October 2013
--D. Thiebaut (talk) 14:19, 21 October 2013 (EDT)
This homework deals with writing code in C and becoming proficient at processing text using C, which requires some pointer operations. It also contributes some important functionality to our project. The homework is due on the 31st of Oct., at 11:59 p.m. You may work in pair on this homework, or by yourself.
Contents
Info for Mac Users
If you want to do some of this homework on your Mac, you should install the command line tools, if you haven't done already so. To install them, go to http://developer.apple.com/downloads/index.action, and sign in for an Apple Id (you should already have one), and download the package. In the pane on the left, search for "command line tools" and choose the package appropriate to your version of OS X.
Obtaining the Access Statistics for Wikipedia Images
First, you should take a look at the information in the Class Wiki] on the image repository and how to gather important files for us.
Of interest to us are the files that contain access statistics for all pages and files in the Mediawiki projects. They are available for free download from dumps.wikipedia.org. Note that wikipedia is just one of the projects supported by Mediawiki, which is the foundation that hosts wikipedia. There are many other projects. This is important because the pages of statistics contain frequency for many projects, and not just wikipedia.
To get a feel for what they contain, download a couple of them.
- Login to beowulf or beowulf2 (or work on your mac, in which case you may want to install wget )
- type the following commands (user input in bold face):
cd mkdir temp cd temp
- If you work on your mac, type the following command to download one of the stats file from mediawik:
curl http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz \ -o pagecounts-20130101-000000.gz
- Note the backslash is to break the line in two. You do not need it. But if you do, do not put any other character besides pressing ENTER after the backslash.
- If you are working on beowulf/beowulf2, type instead
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
- Note: curl is also available on beowulf/2, but I think wget is more universal and you should become familiar with it as well
- On either Mac or Beowulf, unzip the file:
gunzip pagecounts-20130101-000000.gz
- You should have a new file in your directory:
less pagecounts-20130101-000000 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:50\x13+\xD5hG\xD7\xC0\x22\xFF 1 325 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:75\x89XZX-Cach 1 325 *.s File:25nishV 1 325 * Cookie_\x00\x00 3 802 ...
- The file is long. It contains over 6 million lines. Below are some explanations for its format.
Format of the Stats Files
The information below is taken verbatim directly from dumps.wikimedia.org/other/pagecounts-raw/.
Each request of a page, whether for editing or reading, whether a "special page" such as a log of actions generated on the fly, or an article from Wikipedia or one of the other projects, reaches one of our squid caching hosts and the request is sent via udp to a filter which tosses requests from our internal hosts, as well as requests for wikis that aren't among our general projects. This filter writes out the project name, the size of the page requested, and the title of the page requested.
Here are a few sample lines from one file:
fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624 fr.b Special:Recherche/Acteurs_et_actrices_N 1 739 fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743 fr.b Special:Recherche/All_Mixed_Up 1 730 fr.b Special:Recherche/Andr%C3%A9_Gazut.html 1 737
In the above, the first column "fr.b" is the project name. The following abbreviations are used:
wikibooks: ".b" wiktionary: ".d" wikimedia: ".m" wikipedia mobile: ".mw" wikinews: ".n" wikiquote: ".q" wikisource: ".s" wikiversity: ".v" mediawiki: ".w"
Projects without a period and a following character are wikipedia projects. The second column is the title of the page retrieved, the third column is the number of requests, and the fourth column is the size of the content returned.
These are hourly statistics, so in the line
en Main_Page 242332 4737756101
we see that the main page of the English language Wikipedia was requested over 240 thousand times during the specific hour. These are not unique visits. In some directories you will see files which have names starting with "projectcount". These are total views per hour per project, generated by summing up the entries in the pagecount files. The first entry in a line is the project name, the second is the number of non-unique views, and the third is the total number of bytes transferred.
Filtering out the Information
So the lines of interest to us are the lines that have the keyworkd "File:" in them, and start with "en " or "en.mw ". Here is an example of such lines, taken from the same stats file:
en File:%22Almighty_God_as_Created_the_Mind_Free_._._.%22_at_Jefferson_Memorial.jpg 1 12544 en File:%22Andromeda%22_by_Tamara_de_Lempicka.jpg 1 9776 ... en file:wordpress_template_hierarchy.png 1 617 en file:workathomead.jpg 1 601 en file:world_countries_standard_&_poor%27s_ratings.svg 2 1365
I computed (using bash, grep and wc) that there are 228,991 such lines in the pagecount file.
Assignment Problem #1
Your first assignment is to write a program in C that will read the stats file and output only the lines of interest, that is the lines
that start with "en " (a space follows "en" ), but not the ones that start with "ten ", and that also contains " File:" or " file:" in them (whichever the case).
Requirements
- Call your program filter4.c (4 for Homework 4)
- Your program should get two file names from the command line:
- the name of the input file (e.g. pagecounts-20130101-000000)
- the name of the output file that will contain only the the lines corresponding to image files in the English wikipedia (e.g. pagecounts-20130101-000000.filtered)
- If the user does not provide two file names, your program should display the expected syntax for its use.
Assignment Problem #2: C+Sort
Study the C program below, and figure out a way for you to apply it to the previous program so that it only outputs the K most frequent images files, not all of them.
Call your program filter4topK.c and make it get K as the last parameter on the command line. If K is omitted it will work the same as filter4.c
// sortArrayOfStruct.c
// D. Thiebaut
// Simple example using insertion sort to
// sort a small array of structs. Insertion sort
// is an O(N^2) algorithm, but quite performant for
// arrays of small dimensions.
//
// to compile and run:
// cc -o sortArray sortArrayOfStruct.c
// ./sortArray
//
#include <stdio.h> // for printf
#include <string.h> // for strcpy
#include <stdlib.h> // for malloc
#define N 5
//--- structures ---
struct Person {
char *name; // name of the person
int age; // age of the person
};
//--- Prototypes ---
void mySort( struct Person *A );
char *myStrcpy( char *name );
//--------------------- M A I N ------------------------
int main() {
struct Person people[N];
int i;
//--- initialize the array ---
people[0].name = myStrcpy( "Alex" );
people[0].age = 31;
people[1].name = myStrcpy( "Zoe" );
people[1].age = 10;
people[2].name = myStrcpy( "Mona" );
people[2].age = 7;
people[3].name = myStrcpy( "Gail" );
people[3].age = 70;
people[4].name = myStrcpy( "Mado" );
people[4].age = 13;
//--- sort the array of people according to age ---
mySort( people );
//--- display the resulting list ---
for ( i = 0; i < N; i++ )
printf( "%s: %d\n", people[i].name, people[i].age );
}
//--------------------------------------------------------
// myStrcpy: combines strcpy and malloc to create a new string
// big enough to contain the chars of the string passed
// as a parameter. Returns a pointer to the new copy.
char *myStrcpy( char *name ) {
char *p = (char *) malloc( (strlen( name ) + 1) * sizeof( char ) );
strcpy( p, name );
return p;
}
//--------------------------------------------------------
// swap: swaps 2 structs in the array of structs. The two
// entries are identified by two indexes.
void swap( struct Person *A, int i, int j ) {
int tempAge = A[i].age;
char *tempName = A[i].name;
A[i].age = A[j].age;
A[i].name = A[j].name;
A[j].age = tempAge;
A[j].name = tempName;
}
//--------------------------------------------------------
// mySort: sorts the array of structs using insertion sort
void mySort( struct Person *A ) {
int i, j;
int temp;
for ( i = 0; i < N-1; i++ ) {
for ( j=i+1; j < N; j++ ) {
temp = i;
if ( A[j].age < A[temp].age )
temp = j;
swap( A, i, temp );
}
}
}
Assignment Problem #3: MPI
Write an MPI Manager/Worker application in C that computes an approximation of Pi using the Monte Carlo method that we saw earlier this semester. Write the code yourself, please, and don't pull it off the Web!
Requirements
Your program should be called mpiMonteCarlo4.c, and should support the following features:
- It should get the number of processes NP from the command line, as the first parameter.
- It should get the number of random samples for the Monte Carlo computation, N, from the second parameter on the command line.
- The manager should divide the number of samples N (say 1,000,000) by the number of processes NP (say 10, or 1 manager plus 9 workers), pass the resulting number ( 100,000) to each worker, compute its own evaluation of Pi for 100,000 samples, gather the results from the other N-1workers, compute the overall result for 1,000,000, and display it on the standard output.
- If the user does not input the correct number of parameters, the program will exit and display the expected syntax.
Submission
Submit your programs on beowulf as follows:
submit hw4 filter4.c submit hw4 filter4topK.c submit hw4 mpiMonteCarlo4.c