Difference between revisions of "CSC352 Homework 4 2013"

From dftwiki3
Jump to: navigation, search
(Format of the Stats Files)
 
Line 6: Line 6:
 
<bluebox>
 
<bluebox>
 
This homework deals with writing code in C and becoming proficient at processing text using C, which requires some pointer operations.  It also contributes some important functionality to our project.
 
This homework deals with writing code in C and becoming proficient at processing text using C, which requires some pointer operations.  It also contributes some important functionality to our project.
The homework is due on the '''29th of Oct., at 11:59 p.m.'''  You may work in pair on this homework, or by yourself.   
+
The homework is due on the '''<strike>29th of Oct.</strike> 31st of Oct.  at 11:59 p.m.'''  You may work in pair on this homework, or by yourself.   
 
</bluebox>
 
</bluebox>
  

Latest revision as of 07:42, 25 October 2013

--D. Thiebaut (talk) 14:19, 21 October 2013 (EDT)




This homework deals with writing code in C and becoming proficient at processing text using C, which requires some pointer operations. It also contributes some important functionality to our project. The homework is due on the 29th of Oct. 31st of Oct. at 11:59 p.m. You may work in pair on this homework, or by yourself.




Info for Mac Users


If you want to do some of this homework on your Mac, you should install the command line tools, if you haven't done already so. To install them, go to http://developer.apple.com/downloads/index.action, and sign in for an Apple Id (you may already have one), and download the package. In the pane on the left, search for "command line tools" and choose the package appropriate to your version of OS X.

Obtaining the Access Statistics for Wikipedia Images


First, you should take a look at the information in the Class Wiki] on the image repository and how to gather important files for us.

Of interest to us are the files that contain access statistics for all pages and files in the Mediawiki projects. They are available for free download from dumps.wikipedia.org. Note that wikipedia is just one of the projects supported by Mediawiki, which is the foundation that hosts wikipedia. There are many other projects. This is important because the pages of statistics contain frequency for many projects, and not just wikipedia.

To get a feel for what they contain, download a couple of them.

  • Login to beowulf or beowulf2 (or work on your mac, in which case you may want to install wget )
  • type the following commands (user input in bold face):
 cd
 mkdir temp
 cd temp
  • If you are working on a mac, type the following command to download one of the stats file from mediawiki:
 curl http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz \
               -o pagecounts-20130101-000000.gz
Note the backslash is to break the line in two. You do not need it. But if you do, do not put any other character besides pressing ENTER after the backslash.
  • If you are working on beowulf/beowulf2, type instead
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
Note: curl is also available on beowulf/2, but I think wget is more universal and you should become familiar with it as well
Wait a few seconds...
  • On either Mac or Beowulf, unzip the file:
 gunzip pagecounts-20130101-000000.gz
  • You should have a new file in your directory:
 less pagecounts-20130101-000000 

 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:50\x13+\xD5hG\xD7\xC0\x22\xFF 1 325
 *.s Especial:P\xC3\xA1ginas_afluentes/Ficheiro:75\x89XZX-Cach 1 325
 *.s File:25nishV 1 325
 * Cookie_\x00\x00 3 802
 ...
The file is long. It contains over 6 million lines. Below are some explanations for its format.

Format of the Stats Files


First you should remember that images in a Mediawiki wiki are tagged as follows: <tt>[[Image:imagename.jpg]]</tt>, as illustrated below where a portion of the main schedule page for CSC352 is shown:

ExampleOfMediawikiImageTag.jpg


Note that Mediawiki allows two different tag names for images, Image, and File. I use the first one extensively, while Mediawiki now favors the second one.

It is these "File:name.extension" lines that we will want to find in the statistics file we just downloaded.

The information below is taken verbatim directly from dumps.wikimedia.org/other/pagecounts-raw/.

Each request of a page, whether for editing or reading, whether a "special page" such as a log of actions generated on the fly, or an article from Wikipedia or one of the other projects, reaches one of our squid caching hosts and the request is sent via udp to a filter which tosses requests from our internal hosts, as well as requests for wikis that aren't among our general projects. This filter writes out the project name, the size of the page requested, and the title of the page requested.

Here are a few sample lines from one file:

    fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
    fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
    fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743
    fr.b Special:Recherche/All_Mixed_Up 1 730
    fr.b Special:Recherche/Andr%C3%A9_Gazut.html 1 737
  

In the above, the first column "fr.b" is the project name. The following abbreviations are used:

wikibooks: ".b" wiktionary: ".d" wikimedia: ".m" wikipedia mobile: ".mw" wikinews: ".n" wikiquote: ".q" wikisource: ".s" wikiversity: ".v" mediawiki: ".w"

Projects without a period and a following character are wikipedia projects. The second column is the title of the page retrieved, the third column is the number of requests, and the fourth column is the size of the content returned.

These are hourly statistics, so in the line

    en Main_Page 242332 4737756101
  

we see that the main page of the English language Wikipedia was requested over 240 thousand times during the specific hour. These are not unique visits. In some directories you will see files which have names starting with "projectcount". These are total views per hour per project, generated by summing up the entries in the pagecount files. The first entry in a line is the project name, the second is the number of non-unique views, and the third is the total number of bytes transferred.

Filtering out the Information


So the lines of interest to us are the lines that have the keyworkd "File:" in them, and start with "en " or "en.mw ". Here is an example of such lines, taken from the same stats file:

en File:%22Almighty_God_as_Created_the_Mind_Free_._._.%22_at_Jefferson_Memorial.jpg 1 12544
en File:%22Andromeda%22_by_Tamara_de_Lempicka.jpg 1 9776
...
en file:wordpress_template_hierarchy.png 1 617
en file:workathomead.jpg 1 601
en file:world_countries_standard_&_poor%27s_ratings.svg 2 1365

I computed (using bash, grep and wc) that there are 228,991 such lines in the pagecount file.

Assignment Problem #1


Your first assignment is to write a program in C that will read the stats file and output only the lines of interest, that is the lines that start with "en " (a space follows "en" ), but not the ones that start with "ten ", and that also contains " File:" or " file:" in them (whichever the case).

Requirements


  • Call your program filter4.c (4 for Homework 4)
  • Your program should get two file names from the command line:
    • the name of the input file (e.g. pagecounts-20130101-000000)
    • the name of the output file that will contain only the the lines corresponding to image files in the English wikipedia (e.g. pagecounts-20130101-000000.filtered)
  • If the user does not provide two file names, your program should display the expected syntax for its use.



Assignment Problem #2: C+Sort


Study the C program below, and figure out a way for you to apply it to the previous program so that it only outputs the K most frequent images files, not all of them.

Be careful and think. Sorting 6 million structs with insertion sort is not a very clever approach...

Call your program filter4topK.c and make it get K as the last parameter on the command line. If K is omitted it will work the same as filter4.c

// sortArrayOfStruct.c
// D. Thiebaut
// Simple example using insertion sort to 
// sort a small array of structs.  Insertion sort
// is an O(N^2) algorithm, but quite performant for
// arrays of small dimensions.
//
// to compile and run:
// cc -o sortArray sortArrayOfStruct.c
// ./sortArray
//
#include <stdio.h>    // for printf
#include <string.h>   // for strcpy
#include <stdlib.h>   // for malloc
#define N 5


//--- structures ---
struct Person {
  char *name;    // name of the person
  int age;       // age of the person
};


//--- Prototypes ---
void mySort( struct Person *A );
char *myStrcpy( char *name );

//--------------------- M A I N ------------------------
int main() {
  struct Person people[N];
  int i;

  //--- initialize the array ---
  people[0].name = myStrcpy( "Alex" );
  people[0].age = 31;

  people[1].name = myStrcpy( "Zoe" );
  people[1].age = 10;

  people[2].name = myStrcpy( "Mona" );
  people[2].age = 7;

  people[3].name = myStrcpy( "Gail" );
  people[3].age = 70;

  people[4].name = myStrcpy( "Mado" );
  people[4].age = 13;

  //--- sort the array of people according to age ---
  mySort( people );

  //--- display the resulting list ---
  for ( i = 0; i < N; i++ ) 
    printf( "%s: %d\n", people[i].name, people[i].age );
}



//--------------------------------------------------------
// myStrcpy: combines strcpy and malloc to create a new string
//          big enough to contain the chars of the string passed
//          as a parameter. Returns a pointer to the new copy.
char *myStrcpy( char *name ) {
  char *p = (char *) malloc( (strlen( name ) + 1) * sizeof( char ) );
  strcpy( p, name );
  return p;
}

//--------------------------------------------------------
// swap: swaps 2 structs in the array of structs.  The two
//       entries are identified by two indexes.
void swap( struct Person *A, int i, int j ) {
  int tempAge = A[i].age;
  char *tempName = A[i].name;

  A[i].age = A[j].age;
  A[i].name = A[j].name;

  A[j].age = tempAge;
  A[j].name = tempName;
}

//--------------------------------------------------------
// mySort: sorts the array of structs using insertion sort
void mySort( struct Person *A ) {
  int i, j;
  int temp;

  for ( i = 0; i < N-1; i++ ) {
    for ( j=i+1; j < N; j++ ) {
      temp = i;
      if ( A[j].age < A[temp].age ) 
	temp = j;
      swap( A, i, temp );
    }
  }
}


Assignment Problem #3: MPI


Write an MPI Manager/Worker application in C that computes an approximation of Pi using the Monte Carlo method that we saw earlier this semester. Write the code yourself, please, and don't pull it off the Web!

Requirements


Your program should be called mpiMonteCarlo4.c, and should support the following features:

  • It should get the number of processes NP from the command line, as the first parameter.
  • It should get the number of random samples for the Monte Carlo computation, N, from the second parameter on the command line.
  • The manager should divide the number of samples N (say 1,000,000) by the number of processes NP (say 10, or 1 manager plus 9 workers), pass the resulting number ( 100,000) to each worker, compute its own evaluation of Pi for 100,000 samples, gather the results from the other N-1workers, compute the overall result for 1,000,000, and display it on the standard output.
  • If the user does not input the correct number of parameters, the program will exit and display the expected syntax.



Submission

Submit your programs on beowulf as follows:

 submit hw4 filter4.c
 submit hw4 filter4topK.c
 submit hw4 mpiMonteCarlo4.c