Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop
--D. Thiebaut 12:46, 13 April 2010 (UTC)
This tutorial is the continuation of Tutorial 2, and uses Hadoop Pipes to run a C++ program on the Smith College Hadoop Cluster.
Running the WordCount program in C++
The reference for this is the C++ wordcount presented in the Hadoop Wiki [1].
Hadoop supports submitting compiled C++ code through its Pipes API; the syntax is documented in the Hadoop Pipes API documentation.
The Files
You need 3 files to run the wordCount example:
- a C++ file containing the map and reduce functions,
- a data file containing some text, such as Ulysses, and
- a Makefile to compile the C++ file.
wordcount.cpp
- The wordcount program is shown below
- It contains two classes, one for the map, one for the reduce
- It makes use of several Hadoop helper classes, one of which, StringUtils, contains useful methods for converting between strings and other types.
#include <algorithm>
#include <limits>
#include <string>
#include "stdint.h" // <--- to prevent uint64_t errors!
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
using namespace std;
class WordCountMapper : public HadoopPipes::Mapper {
public:
// constructor: does nothing
WordCountMapper( HadoopPipes::TaskContext& context ) {
}
// map function: receives a line, outputs (word,"1")
// to reducer.
void map( HadoopPipes::MapContext& context ) {
//--- get line of text ---
string line = context.getInputValue();
//--- split it into words ---
vector< string > words =
HadoopUtils::splitString( line, " " );
//--- emit each word tuple (word, "1" ) ---
for ( unsigned int i=0; i < words.size(); i++ ) {
context.emit( words[i], HadoopUtils::toString( 1 ) );
}
}
};
class WordCountReducer : public HadoopPipes::Reducer {
public:
// constructor: does nothing
WordCountReducer(HadoopPipes::TaskContext& context) {
}
// reduce function
void reduce( HadoopPipes::ReduceContext& context ) {
int count = 0;
//--- get all the values emitted for this key, and add them up ---
while ( context.nextValue() ) {
count += HadoopUtils::toInt( context.getInputValue() );
}
//--- emit (word, count) ---
context.emit(context.getInputKey(), HadoopUtils::toString( count ));
}
};
int main(int argc, char *argv[]) {
return HadoopPipes::runTask(HadoopPipes::TemplateFactory<
WordCountMapper,
WordCountReducer >() );
}
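To see what the program actually computes, here is a small worked illustration (the input line is hypothetical, not part of the tutorial's data): the mapper splits each line of text on spaces and emits a (word, "1") pair for every word, Hadoop groups the pairs by key, and the reducer adds up the values for each key.
input line:      to be or not to be
mapper emits:    (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
grouped by key:  (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
reducer emits:   (be,2) (not,1) (or,1) (to,2)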
Makefile
Before you create the Makefile, you need to figure out whether your computer has a 32-bit or a 64-bit processor, and pick the matching library. To find this out, run the following command:
uname -a
To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.
Once you have this information create the Makefile (make sure to spell it with an uppercase M):
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
wordcount: wordcount.cpp
$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
-lhadooputils -lpthread -g -O2 -o $@
Make sure you set the HADOOP_INSTALL directory to the one that is right for your system. It may not be the same as shown above!
- Note 1: Users have reported that in some cases the command above returns errors, and that adding -lssl to the link line gets rid of them. Thanks for the tip! --D. Thiebaut 08:16, 25 February 2011 (EST)
- Note 2: User Kösters reported having to add -lcrypto after the -lssl switch to get the compilation to work, and suggested installing the libssl-dev package (sudo apt-get install libssl-dev).
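For reference, here is a sketch of how the Makefile might look on a 64-bit machine with the workaround from the notes applied. The Linux-amd64-64 platform directory and the need for -lssl/-lcrypto depend on your particular Hadoop build, so treat this as an assumption to verify against your own installation:
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop        # adjust to your installation
PLATFORM = Linux-amd64-64                   # 64-bit library directory
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -lssl -lcrypto -g -O2 -o $@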
Data File
- We'll assume that you have some large text files already in HDFS, in a directory called dft1.
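If the directory is not there yet, a few commands along these lines will create it and upload a book; the local file name ulysses.txt is just an example:
hadoop dfs -mkdir dft1
hadoop dfs -put ulysses.txt dft1/ulysses.txt
hadoop dfs -ls dft1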
Compiling and Running
- You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
sudo apt-get install g++
- Compile the code:
make wordcount
- and fix any errors you're getting.
- Copy the executable file (wordcount) to the bin directory in HDFS:
hadoop dfs -mkdir bin          (Note: it should already exist!)
hadoop dfs -put wordcount bin/wordcount
- Run the program!
hadoop pipes -D hadoop.pipes.java.recordreader=true \
             -D hadoop.pipes.java.recordwriter=true \
             -input dft1 -output dft1-out \
             -program bin/wordcount
- Verify that you have gotten the right output:
hadoop dfs -text dft1-out/part-00000

"Come          1
"Defects,"     1
"I             1
"Information   1
"J"            1
"Plain         2
...
zodiacal       2
zoe)_          1
zones:         1
zoo.           1
zoological     1
zouave's       1
zrads,         2
zrads.         1
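Two practical notes on the pipes command (these reflect general Hadoop behavior rather than anything specific to this cluster): the two -D options tell Hadoop to keep using its Java record reader and writer around your C++ map and reduce code, and the job will abort if the output directory already exists, so remove it before re-running the program:
hadoop dfs -rmr dft1-out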
Computing the Maximum Temperature in NCDC Data-Files
This is taken directly from Tom White's Hadoop: The Definitive Guide [2]. It fixes a bug in the book that prevents the example code given on page 36 from compiling.
The Files
You need 3 files to run the maxTemperature example:
- a C++ file containing the map and reduce functions,
- a data file containing some temperature data, such as that found at the National Climatic Data Center (NCDC), and
- a Makefile to compile the C++ file.
max_temperature.cpp
#include <algorithm>
#include <limits>
#include <string>
#include "stdint.h" // <-- this is missing from the book
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
using namespace std;
class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
}
void map(HadoopPipes::MapContext& context) {
string line = context.getInputValue();
string year = line.substr(15, 4);
string airTemperature = line.substr(87, 5);
string q = line.substr(92, 1);
if (airTemperature != "+9999" &&
(q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
context.emit(year, airTemperature);
}
}
};
class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
MapTemperatureReducer(HadoopPipes::TaskContext& context) {
}
void reduce(HadoopPipes::ReduceContext& context) {
int maxValue = -10000;
while (context.nextValue()) {
maxValue = max(maxValue, HadoopUtils::toInt(context.getInputValue()));
}
context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
}
};
int main(int argc, char *argv[]) {
return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
MapTemperatureReducer>());
}
Makefile
Create a Makefile with the following entries. As before, you need to know whether your computer has a 32-bit or a 64-bit processor and pick the matching library. To find this out, run the following command:
uname -a
To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
max_temperature: max_temperature.cpp
$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
-lhadooputils -lpthread -g -O2 -o $@
Data File
- Create a file called sample.txt which will contain sample temperature data from the NCDC.
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
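The mapper above relies on fixed character offsets in each NCDC record (year at offset 15, temperature at 87, quality flag at 92). As a quick sanity check, this standalone snippet (not part of the tutorial and requiring no Hadoop) extracts those fields from the first sample line:
#include <iostream>
#include <string>
using namespace std;

int main() {
    // first line of sample.txt, split across two literals for readability
    string line = "0067011990999991950051507004+68750+023550FM-12+038299999"
                  "V0203301N00671220001CN9999999N9+00001+99999999999";
    cout << "year:        " << line.substr(15, 4) << endl;  // 1950
    cout << "temperature: " << line.substr(87, 5) << endl;  // +0000
    cout << "quality:     " << line.substr(92, 1) << endl;  // 1
    return 0;
}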
- Put the data file in HDFS:
hadoop dfs -mkdir ncdc
hadoop dfs -put sample.txt ncdc
Compiling and Running
- You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
sudo apt-get install g++
- Compile the code:
make max_temperature
- and fix any errors you're getting.
- Copy the executable file (max_temperature) to a bin directory in HDFS:
hadoop dfs -mkdir bin
hadoop dfs -put max_temperature bin/max_temperature
- Run the program!
hadoop pipes -D hadoop.pipes.java.recordreader=true \
             -D hadoop.pipes.java.recordwriter=true \
             -input ncdc/sample.txt -output ncdc-out \
             -program bin/max_temperature
- Verify that you have gotten the right output:
hadoop dfs -text ncdc-out/part-00000

1949    111
1950    22
- Congrats, you have computed the maximum of 5 recorded temperatures for 2 different years!
Speed Test: Java vs Python vs C++
Data Set #1: 6 books
- The wordcount program in native Java, in Python streaming mode, and in C++ pipes mode is run on 6 books from Project Gutenberg:
-rw-r--r--   2 hadoop supergroup     393995 2010-03-24 06:30 /user/hadoop/dft6/ambroseBierce.txt
-rw-r--r--   2 hadoop supergroup     674762 2010-03-24 06:30 /user/hadoop/dft6/arthurThompson.txt
-rw-r--r--   2 hadoop supergroup     590099 2010-03-24 06:30 /user/hadoop/dft6/conanDoyle.txt
-rw-r--r--   2 hadoop supergroup    1945731 2010-03-24 06:30 /user/hadoop/dft6/encyclopediaBritanica.txt
-rw-r--r--   2 hadoop supergroup     343694 2010-03-24 06:30 /user/hadoop/dft6/sunzi.txt
-rw-r--r--   2 hadoop supergroup    1573044 2010-03-23 21:16 /user/hadoop/dft6/ulysses.txt
- The successful execution of all three versions yields these execution times (real time):
| Method | Real Time (seconds) | Ratio to Java (Java time / method time) |
|---|---|---|
| Java | 19.2 | 1.00 |
| C++ | 40.2 | 0.48 |
| Python | 23.4 | 0.82 |
Data Set #2: 4 180-MB files
- Same idea as above, but with much larger files that will each be divided into several input splits
-rw-r--r--   2 hadoop supergroup  187789938 2010-04-05 15:58 /user/hadoop/wikipages/block/all_00.xml
-rw-r--r--   2 hadoop supergroup  192918963 2010-04-05 16:14 /user/hadoop/wikipages/block/all_01.xml
-rw-r--r--   2 hadoop supergroup  198549500 2010-04-05 16:20 /user/hadoop/wikipages/block/all_03.xml
-rw-r--r--   2 hadoop supergroup  191317937 2010-04-05 16:21 /user/hadoop/wikipages/block/all_04.xml
- Execution times (real time):
| Method | Real Time | Ratio to Java (Java time / method time) |
|---|---|---|
| Java | 2 min 15.7 s | 1.00 |
| C++ | 5 min 26 s | 0.416 |
| Python | 12 min 46.5 s | 0.177 |
MapReduce vs Hadoop
The Terabyte Sort benchmark gives a good estimate of the difference between the two.
In May 2009, Yahoo! reported in its Yahoo! Developer Network Blog sorting 1 TB of data in 62 seconds on 3800 computers.
In November 2008, Google reported on its official blog sorting 1 TB of data with MapReduce in 68 seconds, on 1000 computers.
| Company | Platform | Number of Computers | 1 TB Sort Execution Time |
|---|---|---|---|
| Google | MapReduce | 1000 | 68 sec |
| Yahoo! | Hadoop | 3800 | 62 sec |
So, roughly, given the number of computers and assuming linear speedup, MapReduce comes out about 3.5 to 4 times faster than Hadoop on this benchmark: Google used 3.8 times fewer machines (1000 vs 3800) for nearly the same sort time, and 3800/1000 × 62/68 ≈ 3.5.
References
- ↑ Hadoop Wiki, C/C++ MapReduce Code & build
- ↑ Tom White, Hadoop: The Definitive Guide, O'Reilly Media, June 2009, ISBN 0596521979