CSC352 Homework 5 2013


--D. Thiebaut (talk) 20:06, 4 November 2013 (EST)


This assignment is due on 11/14 at 11:59 p.m.


Assignment


This assignment is in two parts. Both parts involve essentially the same program, but you have to complete both.

The first part is to develop an MPI program on Hadoop0.

The second part is to port this program to AWS.

It is important to debug programs on local machines rather than on AWS; otherwise the money is spent on development time rather than on production time. For us, in this assignment, production means computing the geometry of a large set of images.

On Hadoop0


  • Take the MPI program we studied in class, and available here, and modify it so that it takes two parameters from the command line:
    1. the total number of images to process, N, and
    2. the number M of image names sent by the manager to its workers in one packet.
  • Note that on hadoop0 the program expects the images to be stored in the directory /media/dominique/3TB/mediawiki/images/wikipedia/en
  • Remove the part of the program that stores the geometry in the database. Your program will have the manager send blocks of file names to the workers; the workers will use identify to get the geometry of each file, and will simply drop that information without storing it anywhere. That's okay for this assignment.
  • Set the number of nodes to 8: 1 manager and 7 workers.
  • Figure out how many images N to process with 8 nodes so that the computation of their geometry takes less than a minute, but more than 10 seconds.
  • Run a series of experiments to figure out the size of the packet of file names exchanged between the manager and a worker that yields the shortest real execution time. In other words, run the MPI program on hadoop0 and pass it a value of M equal to, say, 100. This means that the manager will walk the directories of images, grab 100 consecutive images, and pass (MPI_Send) their names in one packet to a worker. It will then grab another 100 images and pass that packet to another worker, etc. Measure the time it takes your program to process N images. Repeat the same experiment for other values of M ranging from 10 to 5,000, for example 10, 50, 100, 250, 500, 1000, 2500, 5000. A sketch of the manager/worker exchange appears right after this list.
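
To make the packet mechanism concrete, here is a minimal, hedged sketch of the manager/worker exchange, not the in-class program itself. The helper nextFileName(), the NAME_LEN constant, the message tags, and the request/reply handshake are all assumptions of this sketch; the directory walk of the in-class program should replace the placeholder helper.

/*
 * sketch.c -- minimal manager/worker packet scheme (sketch only).
 * Usage:  mpirun -np 8 ./sketch N M
 *   N = total number of images to process
 *   M = number of file names per packet
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NAME_LEN 256                    /* max length of one file name */
#define TAG_WORK 1                      /* packet of file names        */
#define TAG_STOP 2                      /* no more work                */

/* hypothetical stand-in for the directory walk: returns the i-th image path */
static void nextFileName( long i, char *buf ) {
    snprintf( buf, NAME_LEN, "/path/to/images/img%06ld.jpg", i );
}

int main( int argc, char *argv[] ) {
    int rank, size, dummy = 0;
    long N, M;
    char *packet;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if ( argc < 3 ) {
        if ( rank == 0 ) fprintf( stderr, "usage: %s N M\n", argv[0] );
        MPI_Finalize();
        return 1;
    }
    N = atol( argv[1] );                /* total number of images */
    M = atol( argv[2] );                /* file names per packet  */
    packet = malloc( M * NAME_LEN );

    if ( rank == 0 ) {
        /* ----- manager: hand out packets of M names on request ----- */
        long sent = 0, count, i;
        int w;
        while ( sent < N ) {
            count = ( N - sent < M ) ? N - sent : M;
            for ( i = 0; i < count; i++ )
                nextFileName( sent + i, packet + i * NAME_LEN );
            /* wait for a worker to ask for work, then send it the packet */
            MPI_Recv( &dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status );
            MPI_Send( packet, count * NAME_LEN, MPI_CHAR, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD );
            sent += count;
        }
        /* tell every worker there is nothing left */
        for ( w = 1; w < size; w++ ) {
            MPI_Recv( &dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status );
            MPI_Send( packet, 0, MPI_CHAR, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD );
        }
    }
    else {
        /* ----- worker: ask for packets, run identify, drop the output ----- */
        long i, count;
        int received;
        char cmd[NAME_LEN + 64];
        while ( 1 ) {
            MPI_Send( &dummy, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );      /* ask for work */
            MPI_Recv( packet, M * NAME_LEN, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status );
            if ( status.MPI_TAG == TAG_STOP ) break;
            MPI_Get_count( &status, MPI_CHAR, &received );
            count = received / NAME_LEN;
            for ( i = 0; i < count; i++ ) {
                snprintf( cmd, sizeof( cmd ), "identify \"%s\" > /dev/null 2>&1",
                          packet + i * NAME_LEN );
                system( cmd );
            }
        }
    }
    free( packet );
    MPI_Finalize();
    return 0;
}

Compile the sketch with mpicc (mpicc -o sketch sketch.c) and launch it with mpirun, giving it your chosen N and M on the command line.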

Once your program runs correctly on hadoop0, port it to an MPI cluster on AWS that you will start with the starcluster utility.


On the AWS Cluster


You will need to modify the program and the starcluster config file to fully port your program to AWS.

Cluster

Modify the config file in the ~/.starcluster directory on your Mac and set the number of nodes to 8. The instance type should be m1.medium.
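
For reference, the relevant lines in the config file might look like the following. This is only a sketch: mycluster stands for whatever your cluster template is actually called, and the rest of the template (key name, AMI, etc.) stays as it was set up in the lab.

[cluster mycluster]
CLUSTER_SIZE = 8
NODE_INSTANCE_TYPE = m1.medium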

Images


A sample (150,000 or so) of the 3 million images has been transferred to an EBS volume in our AWS environment. You need to attach it to your cluster in order for your program to access the files. You should also create a 1-GByte EBS volume for yourself, where you can keep your program files. Follow the directions (slightly modified since we did the lab on AWS) from this section and the section that follows to attach your personal data EBS volume as well as the enwiki volume to your cluster.

Go ahead and follow the tutorial on creating your own data EBS and come back to this point when you're done.

Edit your starcluster config file to add the EBS volume with the 150,000 images, and make sure it is mounted automatically.
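
In the config file this amounts to declaring each volume and then listing the volumes in your existing cluster template section. A sketch, in which vol-xxxxxxxx and vol-yyyyyyyy are placeholders for the actual volume ids (use the id of the enwiki volume given in class and the id of the personal volume you just created), and the mount points match the directories described below:

[volume enwiki]
VOLUME_ID = vol-xxxxxxxx
MOUNT_PATH = /enwiki

[volume data]
VOLUME_ID = vol-yyyyyyyy
MOUNT_PATH = /data

[cluster mycluster]
VOLUMES = data, enwiki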



When you next start the cluster, two directories will appear in the root directory: /data and /enwiki. All nodes will have access to both of them. Approximately 150,000 images have already been uploaded to the directory /data/enwiki/, in three subdirectories: 0, 1, and 2.

To get a sense of where the images are, start your cluster with just 1 node (no need to create a large cluster just to explore the system), and ssh to the master:

starcluster start mycluster
starcluster sshmaster mycluster
ls /enwiki
ls /enwiki/0
ls /enwiki/0/01
etc...


ImageMagick and Identify


Identify is a utility that is part of ImageMagick. Unfortunately, ImageMagick is not installed by default on our AWS clusters; image processing is apparently not something regularly performed by MPI programs. But installing it is easy:

On the master node, type

apt-get update
apt-get install imagemagick

And identify will be installed on the master. Unfortunately, you'll have to install it on all the workers as well. If you stop your cluster without terminating it, the installation will remain until the next time you restart your cluster. If you terminate your cluster, however, you'll have to reinstall ImageMagick the next time you start your cluster.
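
One way to install it on all the workers is to loop over them from the master node, which can ssh to each worker as root. A sketch, assuming the default StarCluster node names node001 through node007:

for node in node001 node002 node003 node004 node005 node006 node007; do
    ssh $node "apt-get update && apt-get -y install imagemagick"
done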


Measurements


Modify the original program so that it receives the same two parameters, N and M, from the command line.

Pick a value of N that will make the execution time not exceed a minute.

Run the program on a cluster of 8 m1.medium nodes, 1 master and 7 workers.
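
As on hadoop0, measure the real execution time of each run. A hedged example of launching and timing one run, assuming Open MPI's mpirun, an executable named processImages, and a machine file named hosts that you create listing the master and node001 through node007 (all three names are placeholders, and 5000 and 100 stand in for your chosen N and M):

time mpirun -np 8 -hostfile hosts ./processImages 5000 100

Record the real time reported for each value of M you try.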







Misc. Information

In case you wanted to have the MPI program store the image geometry in your database, you'd have to follow the process described in this tutorial. However, if you were to create the program mysqlTest.c on your AWS cluster, you'd find that the command mysql_config is not installed on the default AMI used by starcluster to create the MPI cluster.

To install the mysql_config utility, run the following commands on the master node of your cluster as root:

apt-get update
apt-get build-dep python-mysqldb

Edit the constants in mysqlTest.c that define the address of the database server (hadoop0), as well as the credentials of your account on the mysql server.
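
The constants typically sit near the top of mysqlTest.c and look something like the lines below; the exact identifiers come from the tutorial, so treat both the names and the values here as placeholders to be replaced with your own:

#define SERVER    "hadoop0"            /* address of the MySQL server (placeholder) */
#define USER      "yourMysqlAccount"   /* your mysql account (placeholder)          */
#define PASSWORD  "yourMysqlPassword"  /* your mysql password (placeholder)         */
#define DATABASE  "yourDatabase"       /* database holding your tables (placeholder)*/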

You can then compile and run the program:

gcc -o mysqlTest $(mysql_config --cflags) mysqlTest.c $(mysql_config --libs)
./mysqlTest
MySQL Tables in mysql database:
images 
images2  
pics1