XGrid Tutorial Part 2: Processing Wikipedia Pages


Revision as of 00:35, 4 March 2010

This tutorial is intended for running distributed programs on an 8-core MacPro that is set up as an XGrid Controller at Smith College. Most of the steps presented here should work on other Apple grids, except for the specific details of login and host addresses.

Another document details how to access the 88-processor XGrid in the Science Center at Smith College.

This document is the second part of a tutorial on the XGrid and follows the Monte Carlo tutorial. Make sure you go through the Monte Carlo tutorial first.

Setup

The main setup is shown below:

WikiPageServer.png

See the Project 2 page for more information on accessing the server of wikipedia pages.

In summary, any computer can issue HTTP requests to the URL of the wiki-page server and append ?Count=nnnn to get a list of nnnn Ids, or ?Id=nnnn to get the contents of the page with the given Id.
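
The two request forms can be sketched in Python as follows. This is only an illustration: the base URL is a placeholder (the real address is given on the Project 2 page), and the helper names count_url and page_url are hypothetical.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the real wiki-page server URL
# from the Project 2 page.
BASE_URL = "http://wiki-server.example/"


def count_url(n, base=BASE_URL):
    """Return the URL that asks the server for a list of n page Ids."""
    return base + "?" + urlencode({"Count": n})


def page_url(page_id, base=BASE_URL):
    """Return the URL that asks the server for the contents of one page."""
    return base + "?" + urlencode({"Id": page_id})


if __name__ == "__main__":
    print(count_url(10))
    print(page_url(10000))
```

Either URL can then be passed to urllib.request.urlopen to perform the actual request.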

Goal of this Tutorial

Create a Pipeline

The goal is to create a pipeline of two programs (and possibly other Mac OS X commands) that will retrieve several pages from the wiki-page server and process them. The programs are used in a pipeline fashion, the output of one being fed to the input of the other. A third program, a bash script called pipeline.sh, organizes the pipeline structure.

The figure below illustrates the process.

PipelineXgridWiki.png

Submit a Batch of Jobs to the XGrid

Once the pipeline has been created and tested on the XGrid, we create a batch job. A batch job is a PList file that bundles the program files to send to the XGrid, any data files they require, and the command or commands to execute.

The figure below illustrates the process. The XGrid controller is "clever" enough to break the batch job into individual processes that are sent to the different agents that are available.

XgridBatchSubmissionPipeline.png

The Basic Elements of the Pipeline

getListOfIds.py
This program is given a number and fetches that many Ids from the wiki-page server.
processIdPage.py
This program receives a list of Ids from the command line or from standard input, and fetches the wiki-pages corresponding to these Ids. There is no limitation on the number of Ids except the amount of buffering offered by the computer.
pipeline.sh
This program is the glue that makes the previous two programs work in a pipeline fashion.

Typical Usage

getListOfIds.py

  • getListOfIds.py receives the number of Ids it should retrieve on the command line:
 ./getListOfIds.py -n 10
10000
10050000
10070000
10140000
10200000
10230000
1030000
10320000
1040000
10430000
(Note: make sure the different programs are made executable with the chmod +x command.)
  • Because the program outputs one Id per line, we can easily "carve" the list with the head and tail Unix commands:
./getListOfIds.py -n 10 | tail -5
10230000
1030000
10320000
1040000
10430000
./getListOfIds.py -n 10 | tail -5 | head -2
10230000
1030000
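
For readers writing their own version, the command-line handling shown above might look like the sketch below. It is not the tutorial's getListOfIds.py: the function names are hypothetical, the server's reply is assumed to contain one Id per line, and the HTTP request is replaced by a stand-in so the example runs offline.

```python
import argparse


def parse_args(argv=None):
    """Parse the -n option giving the number of Ids to retrieve."""
    parser = argparse.ArgumentParser(
        description="Fetch page Ids from the wiki-page server.")
    parser.add_argument("-n", type=int, required=True,
                        help="number of Ids to retrieve")
    return parser.parse_args(argv)


def parse_ids(response_text):
    """Split the server's reply into Ids (assumes one Id per line)."""
    return [line.strip() for line in response_text.splitlines()
            if line.strip()]


def main(argv, fetch):
    """Print one Id per line; fetch(n) must return the reply text."""
    args = parse_args(argv)
    for page_id in parse_ids(fetch(args.n)):
        print(page_id)


if __name__ == "__main__":
    # Stand-in for the real HTTP request (hypothetical data source).
    sample = ["10000", "10050000", "10070000"]
    main(["-n", "2"], lambda n: "\n".join(sample[:n]))
```

Printing one Id per line is what makes the output easy to combine with head, tail, and other line-oriented commands.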


processIdPage.py

  • processIdPage.py accepts a list of Ids from the command line or from standard input:
./processIdPage.py 10000 10050000
count:26
Here 26 represents the number of links to other pages that exist in the two wiki pages with Ids 10000 and 10050000.
./getListOfIds.py -n 10 | ./processIdPage.py 
count:152
  • If the list of Ids is stored in a file, processIdPage.py can easily get them as follows:
cat Ids.txt | ./processIdPage.py
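
One way to accept Ids from either source is sketched below. The function name collect_ids is hypothetical, and the rule "use the command-line arguments if there are any, otherwise read standard input" is an assumption about how the choice is made, not a description of the tutorial's processIdPage.py.

```python
import io


def collect_ids(argv, stdin):
    """Return Ids from the command line if any were given,
    otherwise read whitespace-separated Ids from stdin."""
    if argv:
        return list(argv)
    return stdin.read().split()


if __name__ == "__main__":
    # In a real script these would be sys.argv[1:] and sys.stdin;
    # in-memory streams are used here so the sketch runs anywhere.
    print(collect_ids(["10000", "10050000"], io.StringIO("")))
    print(collect_ids([], io.StringIO("10230000\n1030000\n")))
```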
ComputerLogo.png
Lab Experiment #1
Create your own versions of the two Python programs and of the bash script, and repeat the same experiments as above with the Python programs.

pipeline.sh

The pipeline.sh script is straightforward:

#! /bin/bash
# pipe that feeds the output of getListOfIds.py to processIdPage.py
# User must provide number of wikipedia pages on command line.
# Usage:
#         ./pipeline.sh 100

# "$1" is the first command-line argument (the number of pages)
./getListOfIds.py -n "$1" | ./processIdPage.py

It calls the first Python program, getListOfIds.py, and passes it the first parameter from its own command line. The output of getListOfIds.py is then fed through a Unix pipe to processIdPage.py. Whatever processIdPage.py prints becomes the output of the bash script.

A typical call would be:

 ./pipeline.sh 10
 count:152
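
To make the role of the pipe explicit, here is a rough Python equivalent of what the shell does for the "|" operator, using the standard subprocess module. The function name run_pipeline is hypothetical, and the demonstration uses two small stand-in commands instead of the tutorial's programs so that it runs anywhere.

```python
import subprocess
import sys


def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer | consumer`; return the consumer's output as text."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy of the pipe so the producer is notified
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode()


if __name__ == "__main__":
    # Stand-ins for getListOfIds.py and processIdPage.py.
    produce = [sys.executable, "-c",
               "print('10000'); print('10050000')"]
    consume = [sys.executable, "-c",
               "import sys; print('ids:%d' % len(sys.stdin.read().split()))"]
    print(run_pipeline(produce, consume), end="")
```

With the real programs in the current directory, the call would be run_pipeline(["./getListOfIds.py", "-n", "10"], ["./processIdPage.py"]).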

XGrid Batch Files

Why a Batch Job?

A batch job can bundle many parallel tasks for the XGrid into one package. In our case, we want to run many copies of the pipeline simultaneously so that we can process many different wiki pages in parallel.

Batch packaging

A batch job is defined by a special file in XML or PList format. Both formats are supported by the Mac OS. An example of a batch job for our application is available here. Don't bother copying it. Just look at it to see its format. Its general format is best illustrated by this image taken from a very good tutorial on www.macresearch.org:




XGridBatchJobFormat.png


Note: The best way to create a batch file is to use a utility available for download from Kellerfarm.com. The utility is a Batch Editor for the XGrid. However, this utility runs only in GUI mode on a Mac. Because we assume in this tutorial that we are accessing an XGrid through a Windows PC, we'll use a different solution. If you are on a Mac, you should check this tutorial out.

A Python Program for creating Batch Jobs

Creating a batch job by hand would be tedious, error-prone, and time-consuming. Instead we'll use a Python program for this.

The program asks the user for the names of programs and data files that are needed, along with the commands that should be executed, then it generates a text file in PList format with all the information.
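
The general idea can be sketched with Python's standard plistlib module. This is not makeBatchMulti.py: the function make_batch is hypothetical, and the key names used here (name, inputFiles, fileData, isExecutable, taskSpecifications, command, arguments) are an assumption based on the commonly documented XGrid batch layout. Check them against the example batch file mentioned earlier before relying on them.

```python
import plistlib


def make_batch(job_name, input_files, tasks):
    """Build a batch-job structure.

    input_files: dict mapping file name -> file contents (bytes)
    tasks: list of (command, list_of_arguments) pairs
    Key names are assumed, not taken from makeBatchMulti.py.
    """
    files = {name: {"fileData": data, "isExecutable": "YES"}
             for name, data in input_files.items()}
    task_specs = {str(i): {"command": cmd, "arguments": list(args)}
                  for i, (cmd, args) in enumerate(tasks)}
    return [{"name": job_name,
             "inputFiles": files,
             "taskSpecifications": task_specs}]


if __name__ == "__main__":
    script = b'#! /bin/bash\n./getListOfIds.py -n "$1" | ./processIdPage.py\n'
    batch = make_batch("wikiPipeline",
                       {"pipeline.sh": script},
                       [("pipeline.sh", ["10"]), ("pipeline.sh", ["20"])])
    # plistlib.dumps produces XML by default; bytes become <data> elements.
    print(plistlib.dumps(batch).decode())
```

Each entry of taskSpecifications describes one task, which is what allows the controller to hand the tasks to different agents.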

The Python program is called makeBatchMulti.py and is available here.