CSC352 Homework 5 2013
--D. Thiebaut (talk) 20:06, 4 November 2013 (EST)
Assignment
Run an MPI program on Amazon AWS that finds the geometry of image files. Entering the image geometry in a database will be skipped for this assignment. We are interested in optimizing a master-workers protocol on an MPI cluster of N nodes.
Implementation Details
Program
You can use the program we saw in class and covered in this tutorial. Remove the code that stores information in the MySQL database.
Images
A sample (150,000 or so) of the 3 million images has been transferred to an EBS volume in our AWS environment. You need to attach it to your cluster in order for your program to access the files. You should also create a 1-GByte EBS volume for yourself, where you can keep your program files. Follow the directions (slightly modified since we did the lab on AWS) from this section and the section that follows to attach your personal data EBS volume to your cluster, as well as the enwiki volume.
Go ahead and follow the tutorial on creating your own data EBS and come back to this point when you're done.
Edit your starcluster config file to add the EBS volume holding the 150,000 images, and specify that it should be mounted automatically.
When you next start the cluster, two directories will appear in the root directory: /data and /enwiki. All nodes will have access to both of them. Approximately 150,000 images have already been uploaded to the directory /enwiki/, in three subdirectories, 0, 1, and 2.
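The config change described above might look like the following fragment. This is only a sketch: the volume IDs are placeholders to be replaced with the real ones, the section names data and enwiki are chosen here for illustration, and smallcluster stands for whatever cluster template your own config file defines.

```ini
# Personal 1-GByte volume for program files (placeholder ID)
[volume data]
VOLUME_ID = vol-xxxxxxxx
MOUNT_PATH = /data

# Shared volume holding the sample images (placeholder ID)
[volume enwiki]
VOLUME_ID = vol-yyyyyyyy
MOUNT_PATH = /enwiki

# In your existing cluster template, list both volumes so that
# they are mounted automatically when the cluster starts.
[cluster smallcluster]
VOLUMES = data, enwiki
```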
To get a sense of where the images are, start your cluster with just 1 node (no need to create a large cluster just to explore the system), and ssh to the master:
    starcluster start mycluster
    starcluster sshmaster mycluster
    ls /enwiki
    ls /enwiki/0
    ls /enwiki/0/01
    etc...
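To see how the images are spread over the subdirectories without listing each one by hand, a small helper along these lines could be run on the master. It is a sketch only: count_images is a hypothetical name, and it assumes the images sit somewhere below the top-level subdirectories of /enwiki.

```shell
# count_images ROOT
# Print each top-level subdirectory of ROOT followed by the number of
# regular files found anywhere below it. ROOT defaults to /enwiki.
count_images() {
    root=${1:-/enwiki}
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when ROOT has no subdirectories.
        [ -d "$dir" ] || continue
        # tr strips the padding some versions of wc add to the count.
        n=$(find "$dir" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "${dir%/}" "$n"
    done
}
```

Calling count_images with no argument walks /enwiki; with roughly 150,000 files the walk can take a little while.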
ImageMagick and Identify
Identify is a utility that is part of ImageMagick. Unfortunately, ImageMagick is not installed by default on our clusters; image processing is apparently not something MPI programs regularly do. But installing it is easy:
On the master node, type
    apt-get update
    apt-get install imagemagick
Identify will then be installed on the master. Unfortunately, you also have to install it on all the workers. If you stop your cluster rather than terminate it, the installation will still be there the next time you restart your cluster. If you terminate your cluster, however, you'll have to reinstall ImageMagick the next time you start it.
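One way to repeat the installation on every worker is to loop over the worker hostnames from the master. The sketch below assumes that the workers are named node001, node002, and so on, that root on the master can ssh to them without a password, and that you are logged in as root. The names worker_names and install_on_workers are hypothetical helpers, not part of starcluster.

```shell
# worker_names COUNT
# Print the assumed worker hostnames node001 .. node<COUNT>, one per line.
worker_names() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'node%03d\n' "$i"
        i=$((i + 1))
    done
}

# install_on_workers COUNT
# Install ImageMagick on each worker over ssh, then print the first
# line of "identify -version" to confirm that identify is available.
install_on_workers() {
    for node in $(worker_names "$1"); do
        echo "== $node =="
        ssh "$node" 'apt-get update && apt-get install -y imagemagick && identify -version | head -n 1'
    done
}
```

On a cluster with nine workers, you would call install_on_workers 9 from the master.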
Measurements
Run the program on a cluster of 10
Misc. Information
In case you wanted to have the MPI program store the image geometry in your database, you'd have to follow the process described in this tutorial. However, if you were to create the program mysqlTest.c on your AWS cluster, you'd find that the command mysql_config is not installed on the default AMI used by starcluster to create the MPI cluster.
To install the mysql_config utility, run the following commands on the master node of your cluster as root:
    apt-get update
    apt-get build-dep python-mysqldb
Edit the constants in mysqlTest.c that define the address of the database server (hadoop0), as well as the credentials of your account on the mysql server.
You can then compile and run the program:
    gcc -o mysqlTest $(mysql_config --cflags) mysqlTest.c $(mysql_config --libs)
    ./mysqlTest
    MySQL Tables in mysql database:
    images images2 pics1