Hadoop Tutorial 1.1 -- Generating Task Timelines

Introduction

In May 2009, Yahoo announced that it could sort a petabyte of data in 16.25 hours and a terabyte of data in 62 seconds using Hadoop, running on 3658 processors in the first case and 1460 in the second [1]. Their report includes convincing diagrams that show the computation unfolding as a timeline of map, shuffle, sort, and reduce tasks, an example of which is shown below.

The graph is generated by a script that parses one of the many logs Hadoop writes while a job is running. The script was written by one of the authors of the Yahoo report cited above. Its original name is job_history_summary.py, and it is available from here. We have renamed it generateTimeLine.py.
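To make the idea of a task timeline concrete, here is a minimal Python 3 sketch of how per-second task counts could be derived once the start and finish time of each task is known. It is not the Yahoo script: the task tuples, the PHASES tuple, and the build_timeline function are hypothetical names with made-up values, and the step that extracts the times from the history log is deliberately left out.

```python
from collections import Counter

# Column order, matching the header printed by the timeline script.
PHASES = ("maps", "shuffle", "merge", "reduce")


def build_timeline(tasks):
    """Return rows (time, maps, shuffle, merge, reduce).

    `tasks` is an iterable of hypothetical (phase, start, end) tuples,
    with times in whole seconds since the job started. A task counts
    as active during every second t with start <= t < end.
    """
    counts = {phase: Counter() for phase in PHASES}
    last = 0
    for phase, start, end in tasks:
        for t in range(start, end):
            counts[phase][t] += 1
        last = max(last, end)
    # Counter returns 0 for seconds in which no task of a phase ran.
    return [(t, *(counts[p][t] for p in PHASES)) for t in range(last)]


# Made-up example: two overlapping map tasks followed by a shuffle.
tasks = [("maps", 0, 4), ("maps", 2, 4), ("shuffle", 4, 6)]
for row in build_timeline(tasks):
    print(*row)
```

Each printed row has the same shape as the table the script produces later in this tutorial: a time stamp followed by the number of tasks active in each phase.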

Generating the Log

First run a MapReduce program. We'll use the WordCount program of Tutorial #1.

Run the WordCount program on an input directory containing one or more text files that hold large documents.

 hadoop jar /home/hadoop/352/dft/wordcount_counters/wordcount.jar org.myorg.WordCount dft6 dft6-output

When the program has finished, look for the most recent log file in the ~/hadoop/hadoop/logs/history directory:

 ls -ltr ~/hadoop/hadoop/logs/history/ | tail -1

 -rwxrwxrwx 1 hadoop hadoop 35578 2010-04-04 14:38 hadoop1_1270135155456_job_201004011119_0014_hadoop_wordcount

This is the file we are interested in.
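If you would rather locate the file from Python than with ls, the following Python 3 sketch does roughly the same thing. The helper name newest_file is an assumption made for illustration; the directory is the one used above. Unlike the ls pipeline, it considers regular files only and ignores subdirectories.

```python
import os


def newest_file(directory):
    """Return the path of the most recently modified regular file in
    `directory`, or None if the directory contains no files."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime, default=None)


if __name__ == "__main__":
    history_dir = os.path.expanduser("~/hadoop/hadoop/logs/history")
    # Only look if the directory exists on this machine.
    if os.path.isdir(history_dir):
        print(newest_file(history_dir))
```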

Check that the script generateTimeLine.py is installed on your system:

 which generateTimeLine.py          (if the command prints a path, you have it!)

If the script hasn't been installed yet, create a file in a directory on your PATH that contains the Yahoo script for parsing the log file, and make it executable. The script is also available here.

Feed the log file above to the script:

 cat hadoop1_1270135155456_job_201004011119_0014_hadoop_wordcount | generateTimeLine.py
 time maps shuffle merge reduce
 0 1 0 0 0
 1 1 0 0 0
 2 1 0 0 0
 3 1 0 0 0
 4 0 1 0 0
 5 0 1 0 0
 ...
 294 1 0 0 0
 295 1 0 0 0
 296 1 0 0 0
 297 1 0 0 0
 298 1 0 0 0
 299 1 0 0 0
 300 1 0 0 0
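If your spreadsheet does not split pasted text into columns on its own, one option is to convert the table to CSV first. The following is a minimal Python 3 sketch, assuming the output is the whitespace-separated table shown above; timeline_to_csv and the sample string are illustrative names and values, not part of the Yahoo script.

```python
import csv
import io


def timeline_to_csv(text):
    """Convert whitespace-separated timeline text to CSV text.

    Blank lines are skipped; every other line is split on runs of
    whitespace and written out as one CSV row.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()
        if fields:
            writer.writerow(fields)
    return out.getvalue()


# A few rows in the same layout as the script's output.
sample = """time maps shuffle merge reduce
0 1 0 0 0
4 0 1 0 0
"""
print(timeline_to_csv(sample), end="")
```

To process real output, replace the sample string with the text read from sys.stdin and pipe the output of generateTimeLine.py into the program.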

Copy and paste the output of the script into your favorite spreadsheet software, split the data into individual columns, and generate an area graph from it.

You should obtain something like this:

References

[1] Owen O'Malley and Arun Murthy, Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 seconds, http://developer.yahoo.net/blogs, May 2009.

