CSC352 Homework 5 2013
Revision as of 11:11, 5 November 2013
--D. Thiebaut (talk) 20:06, 4 November 2013 (EST)
Assignment
This assignment is in two parts. Both parts solve the same problem, and you have to do both:
- The first part is to develop an MPI program on Hadoop0.
- The second part is to port this program to AWS.
It is important to debug programs on local machines rather than on AWS; otherwise money is spent on development time rather than production time. For this assignment, production means calculating the geometry of a large set of images.
On Hadoop0
- Take the MPI program we studied in class (available here) and modify it so that it takes two parameters from the command line:
- the total number of images to process, N, and
- the number M of image names sent by the manager to its workers in one packet.
- Note that on hadoop0 the program expects the images to be stored in the directory /media/dominique/3TB/mediawiki/images/wikipedia/en
- Remove the part of the program that stores the geometry in the database. Your program will have the manager send blocks of file names to the workers; the workers will use identify to get the geometry of each file, then simply drop that information without storing it anywhere. That's okay for this assignment.
- Set the number of nodes to 8: 1 manager and 7 workers. This should work well on the 8-core Hadoop0 processor.
- Figure out how many images N to process with 8 nodes so that the computation of their geometry takes less than a minute, but more than 10 seconds.
- Run a series of experiments to find the size of the packet of file names exchanged between the manager and a worker that yields the fastest real execution time. In other words, run the MPI program on hadoop0 and pass it a value of M equal to, say, 10. This means that the manager will walk the directories of images, grab 10 consecutive images, and pass (MPI_Send) their names in one packet to a worker. It will then grab another 10 images and pass that packet to another worker, and so on. Measure the time it takes your program to process N images. Repeat the same experiment for other values of M ranging from 10 to 5,000, for example 10, 50, 100, 250, 500, 1000, 2500, and 5000.
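The two command-line parameters and the packet arithmetic described above can be sketched in plain C, without any MPI calls. This is only an illustration: the function names parse_positive, packet_count, and packet_size are hypothetical and not part of the class program, and the sketch assumes N and M are given as decimal integers.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a strictly positive decimal integer.
   Returns -1 if the text is empty, malformed, out of range, or <= 0. */
long parse_positive(const char *text) {
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' || value <= 0) {
        return -1;
    }
    return value;
}

/* Number of packets needed to cover n image names with m names per packet.
   The last packet may be partial, hence the rounding up. */
long packet_count(long n, long m) {
    return (n + m - 1) / m;
}

/* Number of names in packet `index` (0-based): m for every packet except
   possibly the last one. Returns 0 when the index is out of range. */
long packet_size(long n, long m, long index) {
    long start = index * m;
    long remaining;

    if (index < 0 || start >= n) {
        return 0;
    }
    remaining = n - start;
    return remaining < m ? remaining : m;
}
```

In main, something like parse_positive(argv[1]) for N and parse_positive(argv[2]) for M (assuming that argument order) would reject missing or malformed values before any message is sent. With N = 1001 and M = 250, the manager would send five packets, the last one holding a single name.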
Once your program runs correctly on hadoop0, port it to an MPI cluster on AWS that you will start with the starcluster utility.
On the AWS Cluster
You will need to modify the program and the starcluster config file to fully port your program to AWS.
EBS Volume
You should create a 1-GByte EBS volume for yourself, where you can keep your program files. Files stored in the default directories of the cluster will disappear when you terminate the cluster. To keep files around on AWS, you need to store them in Elastic Block Store (EBS) volumes.
Follow the directions (slightly modified since we did the lab on AWS) from this section and the section that follows to attach your personal data EBS volume as well as the enwiki volume to your cluster.
Cluster Size
Modify the config file in the ~/.starcluster directory on your Mac and set the number of nodes to 8. The instance type should be m1.medium.
Config File
Your config file should look something like this (I have removed all unnecessary comments and anonymized personal information). The areas in magenta are the places where you will need to enter your own data or data provided here (for example, the enwiki volume is fixed).
When you next start the cluster, two directories will appear in the root directory: /data and /enwiki. All nodes will have access to both of them. Approximately 150,000 images have already been uploaded to the directory /enwiki/, in three subdirectories: 0, 1, and 2.
To get a sense of where the images are, start your cluster with just 1 node (no need to create a large cluster just to explore the system), and ssh to the master:
starcluster start mycluster
starcluster sshmaster mycluster
ls /enwiki
ls /enwiki/0
ls /enwiki/0/01
etc...
ImageMagick and Identify
Identify is a utility that is part of ImageMagick. Unfortunately, ImageMagick is not installed by default on our AWS clusters. Apparently, image processing is not something MPI programs regularly do. But installing it is easy:
On the master node, type
apt-get update
apt-get install imagemagick
Identify will then be installed on the master. Unfortunately, you'll have to install it on all the workers as well. If you stop your cluster rather than terminate it, the installation will remain until the next time you restart your cluster. If you terminate your cluster, however, you'll have to reinstall ImageMagick the next time you start your cluster.
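Once identify is available on a node, a worker could obtain an image's geometry along the following lines. This is a minimal sketch, not the class program's method: the function names parse_geometry and image_geometry are hypothetical, the sketch assumes a POSIX system (for popen), assumes that file names contain no single quotes, and asks identify for a "WIDTHxHEIGHT" line through its -format option.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen/pclose */
#include <stdio.h>

/* Hypothetical helper: parse "WIDTHxHEIGHT" (for example "640x480").
   Returns 1 on success and fills width/height, 0 otherwise. */
int parse_geometry(const char *text, int *width, int *height) {
    int w = 0, h = 0;

    if (sscanf(text, "%dx%d", &w, &h) != 2 || w <= 0 || h <= 0) {
        return 0;
    }
    *width = w;
    *height = h;
    return 1;
}

/* Hypothetical helper: run identify on one file and parse the first line
   of its output. Assumes `path` contains no single quotes.
   Returns 1 on success, 0 on any failure. */
int image_geometry(const char *path, int *width, int *height) {
    char command[4096];
    char line[128];
    FILE *pipe;
    int ok = 0;
    int n = snprintf(command, sizeof command,
                     "identify -format '%%wx%%h\\n' '%s' 2>/dev/null", path);

    if (n < 0 || (size_t)n >= sizeof command) {
        return 0;  /* path too long for the command buffer */
    }
    pipe = popen(command, "r");
    if (pipe == NULL) {
        return 0;
    }
    if (fgets(line, sizeof line, pipe) != NULL) {
        ok = parse_geometry(line, width, height);
    }
    pclose(pipe);
    return ok;
}
```

For this assignment the worker would simply discard the width and height after computing them. If image_geometry fails on every file of a worker node, a missing ImageMagick installation on that node is a likely cause.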
Measurements
Perform the same measurements you did on Hadoop0 and measure the execution times (real time) for different values of M.
You may have to pick a different value of N so that the execution time does not exceed a minute.
Plot the execution times as a function of M, and submit a jpg/png/pdf version of the plot via email. Use "CSC352 Homework 5 graph" as the subject of your email message, please!
Submission
Call your program hw5aws.c and submit it from your 352a-xx account on beowulf as follows:
submit hw5 hw5aws.c
If you created additional files (such as shell files), submit them as well.
Optional and Extra Credit
The program we saw in class uses a round-robin approach to feed data to the workers. Modify the program so that it will send packets of file names to a worker that is idle, rather than to the next logical one.
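To see why feeding an idle worker can beat round-robin, the two policies can be compared with a small simulation in plain C. This is only a sketch under stated assumptions: the function names and the MAX_WORKERS limit are hypothetical, each packet is given a made-up processing cost, and no MPI call is involved. In an actual MPI program, one common way for the manager to learn which worker is idle is to receive a message with MPI_ANY_SOURCE; that part is not shown here.

```c
#include <stddef.h>

#define MAX_WORKERS 64  /* arbitrary limit for this sketch */

/* Total elapsed time when packet i always goes to worker i % workers,
   whether or not that worker is still busy. Returns -1.0 on a bad count. */
double makespan_round_robin(const double *cost, size_t packets, int workers) {
    double busy_until[MAX_WORKERS] = {0};
    double finish = 0.0;
    size_t i;

    if (workers < 1 || workers > MAX_WORKERS) {
        return -1.0;
    }
    for (i = 0; i < packets; i++) {
        int w = (int)(i % (size_t)workers);
        busy_until[w] += cost[i];
        if (busy_until[w] > finish) {
            finish = busy_until[w];
        }
    }
    return finish;
}

/* Total elapsed time when each packet goes to the worker that becomes
   idle first. Returns -1.0 on a bad worker count. */
double makespan_idle_first(const double *cost, size_t packets, int workers) {
    double busy_until[MAX_WORKERS] = {0};
    double finish = 0.0;
    size_t i;

    if (workers < 1 || workers > MAX_WORKERS) {
        return -1.0;
    }
    for (i = 0; i < packets; i++) {
        int w = 0;
        int k;
        /* Pick the worker whose current work ends earliest. */
        for (k = 1; k < workers; k++) {
            if (busy_until[k] < busy_until[w]) {
                w = k;
            }
        }
        busy_until[w] += cost[i];
        if (busy_until[w] > finish) {
            finish = busy_until[w];
        }
    }
    return finish;
}
```

With two workers and packet costs of 9, 1, 1, 1, 1, 1 time units, round-robin piles 11 units on the first worker, while the idle-first policy finishes after 9 units because the second worker absorbs all the short packets.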
<tanbox>ALWAYS REMEMBER TO STOP YOUR CLUSTER WHEN YOU ARE NOT USING IT!</tanbox>
Misc. Information
In case you wanted to have the MPI program store the image geometry in your database, you'd have to follow the process described in this tutorial. However, if you were to create the program mysqlTest.c on your AWS cluster, you'd find that the command mysql_config is not installed on the default AMI used by starcluster to create the MPI cluster.
To install the mysql_config utility, run the following commands on the master node of your cluster as root:
apt-get update
apt-get build-dep python-mysqldb
Edit the constants in mysqlTest.c that define the address of the database server (hadoop0), as well as the credentials of your account on the mysql server.
You can then compile and run the program:
gcc -o mysqlTest $(mysql_config --cflags) mysqlTest.c $(mysql_config --libs)
./mysqlTest
MySQL Tables in mysql database:
images images2 pics1