XGrid Tutorial Part 2: Processing Wikipedia Pages

From dftwiki3

Latest revision as of 22:02, 22 March 2010


This tutorial is intended for running distributed programs on an 8-core MacPro that is set up as an XGrid Controller at Smith College. Most of the steps presented here should work on other Apple grids, except for the specific details of login and host addresses.

Another document details how to access the 88-processor XGrid in the Science Center at Smith College.

This document is the second part of a tutorial on the XGrid and follows the Monte Carlo tutorial. Make sure you go through that tutorial first.


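This is the second of several tutorials on the XGrid at Smith College.

  • Part I: Monte Carlo
  • Part II: Processing Wikipedia Pages
  • Part III: Monte Carlo on Smith's 88-core XGrid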

Setup

The main setup is shown below:

WikiPageServer.png

See the Project 2 page for more information on accessing the server of wikipedia pages.

In summary, any computer can issue HTTP requests to the URL of the wiki-page server, appending ?Count=nnnn to get a list of nnnn Ids, or ?Id=nnnn to get the contents of the page with the given Id.
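The two request forms can be sketched in Python as follows. This is only an illustration: the function name build_request_url and the base URL are assumptions made for this example and are not part of the tutorial's programs. The sketch is written for Python 3.

```python
from urllib.parse import urlencode


def build_request_url(base_url, count=None, page_id=None):
    """Return base_url with either ?Count=n or ?Id=n appended."""
    # Exactly one of the two parameters must be supplied.
    if (count is None) == (page_id is None):
        raise ValueError("pass exactly one of count or page_id")
    params = {"Count": count} if count is not None else {"Id": page_id}
    return base_url + "?" + urlencode(params)


# Hypothetical base URL: substitute the address of the wiki-page server.
BASE = "http://example.org/getWikiPageById.cgi"
print(build_request_url(BASE, count=10))       # request a list of 10 Ids
print(build_request_url(BASE, page_id=10000))  # request the page with Id 10000
```

The resulting string can then be passed to any HTTP client, for example urllib.request.urlopen.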

Goal of this Tutorial

Create a Pipeline

The goal is to create a pipeline of two programs (and possibly other Mac OS X commands) that will retrieve several pages from the wiki-page server and process them. The programs are used in a pipeline fashion, the output of one being fed to the input of the other. A third program, a bash script called pipeline.sh, organizes the pipeline structure.

The figure below illustrates the process.

PipelineXgridWiki.png



Submit a Batch of Jobs to the XGrid

Once the pipeline has been created and tested on the XGrid, we create a batch job. A batch job is a PLIST file containing the programs that need to be sent to the XGrid, the data files required, if any, and the command or commands to be executed.

The figure below illustrates the process. The XGrid controller is "clever" enough to break the batch job into individual processes that are sent to the different agents that are available.

XgridBatchSubmissionPipeline.png

The Basic Elements of the Pipeline

getListOfIds.py
This program is given a number and fetches that many Ids from the wiki-page server.
processIdPage.py
This program receives a list of Ids from the command line or from standard input, and fetches the wiki-pages corresponding to these Ids. There is no limitation on the number of Ids except the amount of buffering offered by the computer.
pipeline.sh
This program is the glue that makes the previous two programs work in a pipeline fashion.

Typical Usage

getListOfIds.py

  • getListOfIds.py receives the number of Ids it should retrieve on the command line:
 ./getListOfIds.py -n 10
10000
10050000
10070000
10140000
10200000
10230000
1030000
10320000
1040000
10430000
(Note: make sure the different programs are made executable with the chmod +x command.)
  • Because the command outputs one Id per line, we can easily "carve" this list with the head and tail Linux commands:
./getListOfIds.py -n 10 | tail -5
10230000
1030000
10320000
1040000
10430000
./getListOfIds.py -n 10 | tail -5 | head -2
10230000
1030000
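For readers more comfortable with Python than with shell filters, the same carving can be sketched with list slices. The helper names head and tail are chosen for this example only; the Ids are the ten shown above.

```python
# The ten Ids printed by ./getListOfIds.py -n 10 in the example above.
ids = ["10000", "10050000", "10070000", "10140000", "10200000",
       "10230000", "1030000", "10320000", "1040000", "10430000"]


def head(lines, n):
    """Keep the first n lines, like `head -n`."""
    return lines[:n]


def tail(lines, n):
    """Keep the last n lines, like `tail -n` (empty list when n is 0)."""
    return lines[-n:] if n > 0 else []


print(tail(ids, 5))           # same Ids as `| tail -5`
print(head(tail(ids, 5), 2))  # same Ids as `| tail -5 | head -2`
```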


processIdPage.py

  • processIdPage.py accepts a list of Ids from the command line or from standard input:
./processIdPage.py 10000 10050000
count:26
Here 26 represents the number of links to other pages that exist in the two wiki pages with Ids 10000 and 10050000.
./getListOfIds.py -n 10 | ./processIdPage.py 
count:152
  • If the list of Ids is stored in a file, processIdPage.py can easily get them as follows:
./getListOfIds.py -n 10 > Ids.txt
cat Ids.txt | ./processIdPage.py
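The "command line or standard input" behavior can be sketched as below. This is not the actual processIdPage.py: the function names are hypothetical, and the link counter assumes that links appear in wiki markup as [[...]], which the tutorial does not specify.

```python
import io
import re

# Assumption: a "link" is a wiki-markup link of the form [[target]] or [[target|label]].
LINK_PATTERN = re.compile(r"\[\[[^\[\]]+\]\]")


def read_ids(argv, stdin):
    """Return Ids from the command-line arguments if any, else one per line from stdin."""
    if argv:
        return list(argv)
    return [line.strip() for line in stdin if line.strip()]


def count_links(wiki_text):
    """Count the [[...]] links found in a page's text."""
    return len(LINK_PATTERN.findall(wiki_text))


# Ids given as arguments take priority over standard input.
print(read_ids(["10000", "10050000"], io.StringIO("")))
# With no arguments, Ids are read line by line (a StringIO stands in for sys.stdin here).
print(read_ids([], io.StringIO("10000\n10050000\n")))
print(count_links("See [[Smith College]] and [[XGrid|the grid]]."))
```

In a real script, sys.argv[1:] and sys.stdin would be passed to read_ids.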



ComputerLogo.png
Lab Experiment #1
Create your own versions of the two Python programs, and repeat the same experiments as above with the Python programs.



Pipeline.sh

The pipeline.sh script is straightforward:

#! /bin/bash 
# pipe that feeds output of getListOfIds.py to processIdPage.py
# User must provide number of wikipedia pages on command line.
# Usage:
#         ./pipeline.sh 100

./getListOfIds.py -n $1 | ./processIdPage.py

It calls the first Python program, getListOfIds.py, and passes it the first parameter on its command line. The output of getListOfIds.py is then fed via a Linux pipe to processIdPage.py. The output of processIdPage.py is passed out by the bash script.

A typical call would be:

 ./pipeline.sh 10
 count:152
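To make explicit what the shell pipe does, here is a rough Python equivalent built on the standard subprocess module. The function name run_pipeline is an assumption for this sketch, and the two demo commands are stand-ins so that the example runs anywhere; on the XGrid machine they would be ["./getListOfIds.py", "-n", "10"] and ["./processIdPage.py"].

```python
import subprocess
import sys


def run_pipeline(first_cmd, second_cmd):
    """Run first_cmd | second_cmd and return the text written by second_cmd."""
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_cmd, stdin=first.stdout,
                              stdout=subprocess.PIPE, text=True)
    first.stdout.close()  # let first_cmd see a closed pipe if second_cmd exits early
    output, _ = second.communicate()
    first.wait()
    return output


# Stand-in commands: the first prints two lines, the second counts the words it receives.
producer = [sys.executable, "-c", "print('10000'); print('10050000')"]
consumer = [sys.executable, "-c",
            "import sys; print('count:%d' % len(sys.stdin.read().split()))"]
print(run_pipeline(producer, consumer), end="")
```

The bash script remains the simpler choice for the XGrid; the sketch only shows that a pipe connects the standard output of one process to the standard input of the next.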




Lab Experiment #2
Create your own version of the bash script, and make sure it works as it should.

XGrid Batch Files

Why a Batch Job?

A batch job can bundle many parallel tasks for the XGrid into one package. In our case, we want to run many copies of the pipeline simultaneously so that we can process many different wiki pages in parallel.

Batch packaging

A batch job is defined by a special file in XML or PList format. Both formats are supported by the Mac OS. An example of a batch job for our application is available here. Don't bother copying it. Just look at it to see its format. Its general format is best illustrated by this image taken from a very good tutorial on www.macresearch.org:




XGridBatchJobFormat.png


Note: The best way to create a batch file is to use a utility available for download from Kellerfarm.com. The utility is a Batch Editor for the XGrid. However, this utility runs only in GUI mode on a Mac. Because we assume in this tutorial that we are accessing an XGrid through a Windows PC, we'll use a different solution. If you are on a Mac, you should check this tutorial out.

A Python Program for creating Batch Jobs

Creating a batch job by hand would be complicated, boring, and time-consuming. Instead we'll use a Python program for this.

The program asks the user for the names of programs and data files that are needed, along with the commands that should be executed, then it generates a text file in PList format with all the information.

The Python program is called makeBatchMulti.py and is available here.

We use it as follows (the user input follows each prompt):

./makeBatchMulti.py pipe.batch


XGrid batch file maker
Creates a batch file for the XGrid system from a collection of 
data files and programs


Please enter your name to identify the batch job: dft

Please enter the names of the different programs needed by the batch job.
Enter them one per line.  Press Enter twice when done.
Program #1 > getListOfIds.py
Program #2 > processIdPage.py
Program #3 > pipeline.sh
Program #4 > 

Please enter the names of the different data files needed. 
No need to list the name of temporary files created by the programs.
Enter them one per line.  Press Enter twice when done. 
Data file #1 > 

Enter the commands that the XGrid should run.  Enter each command on one line.
For each line, enter the name of the program followed by all the arguments. 
Do not use redirection or pipes in the command lines.
Enter an empty line to stop.
Command #1 > ./pipeline.sh 100
Command #2 >

The result is a long file called pipe.batch in this example, with the structure shown below (warning: some of the information has been removed from the file to make it fit the display):

{
   jobSpecification =     {
       applicationIdentifier = "com.apple.xgrid.cli";
       inputFiles =         {
             "pipeline.sh" = { fileData = <2321202f 62696e2f [...] 6167652e 70790a0a>; isExecutable = YES; }; 
             "getListOfIds.py" = { fileData = <2321202f 7573722f [...] 0a202020 200a0a>; isExecutable = YES; }; 
             "processIdPage.py" = { fileData = <2321202f 7573722f [...] 0a0a2020 20200a0a>; isExecutable = YES; }; 
       };
       name = dft;
       schedulerHints =  { 0 = mathgrid5; };
       submissionIdentifier = "dft batch job";
       taskSpecifications = {
       0 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
       };
   };
}
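The fileData entries are simply the bytes of each program written in hexadecimal, in groups of four bytes. A minimal sketch of that encoding follows; the function name hex_plist_data is hypothetical, and this is not claimed to be how makeBatchMulti.py does it.

```python
def hex_plist_data(data):
    """Encode bytes as <xxxxxxxx xxxxxxxx ...>, eight hex digits per group."""
    digits = data.hex()
    groups = [digits[i:i + 8] for i in range(0, len(digits), 8)]
    return "<" + " ".join(groups) + ">"


# The first eight bytes of a script starting with "#! /bin/" give the
# same two groups that open the pipeline.sh entry above.
print(hex_plist_data(b"#! /bin/"))
```

Reading the hexadecimal back (for example with bytes.fromhex after removing the spaces) recovers the original file, which is how the agents get their copy of each program.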



Submitting Batch Jobs to the XGrid

XGridAdminOveriew.png

We are now ready to submit this job to the XGrid. We'll use the xgrid utility to submit the batch job, and the getXGridOutput.py program from Part 1 of our tutorial to gather the output.

 xgrid -job batch pipe.batch | getXGridOutput.py 
 Job 57 stopped: Execution time: 9.000000 seconds
 count:1455
 
 Total execution time: 9.000000 seconds
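If the result needs to be used by another program, the output shown above can be parsed. The sketch below assumes the output always has the shape of this example (a count: line and a "Total execution time" line); the function name parse_xgrid_output is hypothetical.

```python
import re


def parse_xgrid_output(text):
    """Return (count, total_seconds); either value is None if its line is missing."""
    count = re.search(r"^\s*count:(\d+)\s*$", text, re.MULTILINE)
    total = re.search(r"Total execution time:\s*([\d.]+)\s*seconds", text)
    return (int(count.group(1)) if count else None,
            float(total.group(1)) if total else None)


# Sample text copied from the run shown above.
sample = """ Job 57 stopped: Execution time: 9.000000 seconds
 count:1455

 Total execution time: 9.000000 seconds
"""
print(parse_xgrid_output(sample))
```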




Lab Experiment #3
Create your own version of makeBatchMulti.py and create a batch file for our pipeline. Submit the batch file to the XGrid.
How many wiki pages is your pipeline processing?
When you run your batch job on the XGrid, how many processors are involved?



A Batch File for Many Jobs

Observe the task line inside taskSpecifications (the line starting with 0 =) near the end of the batch file:

{
   jobSpecification =     {
      applicationIdentifier = "com.apple.xgrid.cli";
      inputFiles =         {
            "pipeline.sh" = { fileData = <2321202f 62696e2f [...] 6167652e 70790a0a>; isExecutable = YES; }; 
            "getListOfIds.py" = { fileData = <2321202f 7573722f [...] 0a202020 200a0a>; isExecutable = YES; }; 
            "processIdPage.py" = { fileData = <2321202f 7573722f [...] 0a0a2020 20200a0a>; isExecutable = YES; }; 
      };
      name = dft;
      schedulerHints =  { 0 = mathgrid5; };
      submissionIdentifier = "dft batch job";
      taskSpecifications = {
      0 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      };
   };
}

This line instructs the XGrid to run our pipeline once, with 100 as its argument.

If we wanted to run our pipeline multiple times, all we would have to do is replicate this task line several times, giving each copy the next number, as shown below:

{
   jobSpecification =     {
      applicationIdentifier = "com.apple.xgrid.cli";
      inputFiles =         {
            "pipeline.sh" = { fileData = <2321202f 62696e2f [...] 6167652e 70790a0a>; isExecutable = YES; }; 
            "getListOfIds.py" = { fileData = <2321202f 7573722f [...] 0a202020 200a0a>; isExecutable = YES; }; 
            "processIdPage.py" = { fileData = <2321202f 7573722f [...] 0a0a2020 20200a0a>; isExecutable = YES; }; 
      };
      name = dft;
      schedulerHints =  { 0 = mathgrid5; };
      submissionIdentifier = "dft batch job";
      taskSpecifications = {
      0 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      1 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      2 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      3 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      4 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      5 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      6 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      7 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      8 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      9 = { arguments = ( 100 ); command = "./pipeline.sh" ;  };
      };
   };
}
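Rather than copying and renumbering the task lines by hand, they can be generated. The following sketch uses a hypothetical helper, task_lines, and reproduces the format of the lines shown above; the output can be pasted into the taskSpecifications section of the batch file.

```python
def task_lines(n_tasks, argument=100, command="./pipeline.sh"):
    """Return one taskSpecifications line per task, numbered from 0."""
    return ['%d = { arguments = ( %s ); command = "%s" ;  };' % (i, argument, command)
            for i in range(n_tasks)]


for line in task_lines(10):
    print(line)
```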

Observe the XGrid Admin GUI responding to this submission:

XGridBatchJobMultipleCommandsAgents.png

XGridBatchJobMultipleCommandsJobs.png

XGridBatchJobMultipleCommandsOverview.png



Lab Experiment #4
You know what to do!
Modify your batch file and make it run 10 copies of your pipeline. Submit the batch file to the XGrid.
How many wiki pages is your batch file processing?
When you run your batch job on the XGrid, how many processors are involved?
What is the difference between a job and a task?



Lab Experiment #5
Make your batch job run 10 pipelines on 10 different sets of 10 wiki-pages...


Customizing the CGI


One gets the lists of Ids or the wiki pages over the Web by issuing an HTTP request to a URL. The last part of the URL is the name of a CGI (Common Gateway Interface) program on the XgridMac server.

When one has an account on the same server, one can use one's own CGI instead of the one proposed here.

The goal of this section is to illustrate how to implement one's own CGI, and control the flow of information from the server to the agents.

Getting the CGI

  • Log in to XgridMac
  • cd to the Sites directory, which is the Mac equivalent of ~/public_html
  cd 
  cd Sites
  • Get a copy of the default CGI script
  cp  /Library/WebServer/CGI-Executables/getWikiPageById.cgi  352lab.cgi
  • Make it executable by all (including the http server)
  chmod a+x 352lab.cgi
  • Test it. Enter the following address in the URL window of a browser:
  http://xgridmac.dyndns.org/~XXXX/352lab.cgi?Count=10
  • Observe the list of Ids returned by the cgi. Test the fetching of wiki pages:
  http://xgridmac.dyndns.org/~XXXX/352lab.cgi?Id=1000
where XXXX is the user name.
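For reference, chmod a+x adds the execute permission for the owner, the group, and everyone else without touching the other permission bits. The sketch below shows the same operation with Python's os and stat modules; the function name make_executable_by_all is hypothetical, and the demonstration works on a temporary file rather than on the real CGI.

```python
import os
import stat
import tempfile


def make_executable_by_all(path):
    """Add execute permission for user, group and others; return the new mode bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstration on a throwaway file that starts with rw-r--r-- permissions.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "352lab.cgi")
    open(path, "w").close()
    os.chmod(path, 0o644)
    print(oct(make_executable_by_all(path)))
```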



Lab Experiment #6
Create a copy of the CGI in your XgridMac account.


Modifying the CGI

Note that in order to run your own CGI, you need to make Apache aware of it (on Mac OS X). See the boxed note below for details.

The 352lab.cgi CGI program is available here.

  • Edit the CGI with emacs, and turn debugging on by passing True to main() at the very end of the file:
   main( True )
This will set the debug boolean to True, and all the debugging print statements will be activated.
  • Save the file
  • Test the CGI again from the browser, requesting a list of 10 Ids.
  • Ask the browser to show the source of the page it displays:
Count =  10
/Volumes/RAIDSet2/enwikiXml/00/00/
1 10000.xml
10000
2 10050000.xml
10050000
3 10070000.xml
10070000
4 10140000.xml
10140000
5 10200000.xml
10200000
6 10230000.xml
10230000
7 1030000.xml 
1030000
8 10320000.xml
10320000
9 1040000.xml
1040000
10 10430000.xml
10430000
The information printed out shows that the CGI went to the directory /Volumes/RAIDSet2/enwikiXml/00/00/ to find a file called list.txt that contains the list of all the wiki pages stored in /00/00. Here are the first few lines of this file:
10000.xml
10050000.xml
10070000.xml
10140000.xml
10200000.xml
10230000.xml
1030000.xml
10320000.xml
1040000.xml
10430000.xml
10440000.xml
 ...

Observe that all the file names are a number followed by ".xml". The number is the Id of the wikipedia page, as recorded by Wikipedia. The last four digits of the Id determine the directory: a page with Id 123456, for example, would be stored in /Volumes/RAIDSet2/enwikiXml/34/56/ and its name would be 123456.xml.
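The mapping from an Id to its file can be sketched as follows. The function name path_for_id is hypothetical, and the zero-padding of Ids shorter than four digits is an assumption, since the tutorial only shows longer Ids.

```python
ROOT = "/Volumes/RAIDSet2/enwikiXml"


def path_for_id(page_id, root=ROOT):
    """Return the file path of a page: the last four digits of the Id pick the two directories."""
    number = int(page_id)
    digits = str(number).zfill(4)  # assumption: short Ids are padded with zeros
    return "%s/%s/%s/%d.xml" % (root, digits[-4:-2], digits[-2:], number)


print(path_for_id(123456))  # the example from the text, stored under 34/56/
print(path_for_id(10000))   # one of the Ids listed in 00/00/
```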




Lab Experiment #7
Modify the CGI in your Sites directory so that it gives out Ids starting with the files stored in 55/55/ rather than 00/00/. Check that your modification works.



Note
In order to use CGI under Mac OS X, you have to make sure that you give Apache2 the permission to run a CGI from your Sites directory. This is done by setting the user's configuration file in /etc/apache2/users/yourname.conf as shown below, for User Alex:
<Directory "/Users/Alex/Sites/">
Options Indexes MultiViews SymLinksIfOwnerMatch Includes ExecCGI
DirectoryIndex index.html index.cgi
AllowOverride None
Order allow,deny
Allow from all
</Directory>




Note
Remember to return the last line of your CGI to main( False ) for normal operations!
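As an illustration of how a debug flag passed to main() can switch the extra print statements on and off, here is a minimal sketch. It is not the actual 352lab.cgi: the hard-coded Ids and the out parameter are assumptions made so that the example is self-contained.

```python
import sys


def main(debug=False, out=sys.stdout):
    """Write a plain-text list of Ids; with debug on, also show the file each Id came from."""
    ids = ["10000", "10050000"]  # placeholder data; the real CGI reads list.txt
    out.write("Content-Type: text/plain\n\n")  # CGI header, followed by a blank line
    for position, page_id in enumerate(ids, start=1):
        if debug:
            out.write("%d %s.xml\n" % (position, page_id))  # debugging line
        out.write(page_id + "\n")


main(False)  # normal operation: only the Ids are written
```

Calling main( True ) instead interleaves the debugging lines with the Ids, in the same way as the output shown in the Modifying the CGI section.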

Move on to Lab 3

Lab 3 is an introduction to running programs (in this case the Monte Carlo simulation) on Smith's 88-core XGrid.