Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop


--D. Thiebaut 12:46, 13 April 2010 (UTC)


This tutorial is the continuation of Tutorial 2, and uses Hadoop Pipes to run a C++ program on the Smith College Hadoop Cluster.






Running the WordCount program in C++


The reference for this section is the C++ wordcount example presented in the Hadoop Wiki [1].

Submitting compiled C++ code is supported by Hadoop through the Pipes API; its syntax is documented in the Hadoop API reference.


The Files

You need 3 files to run the wordcount example:

  • a C++ file containing the map and reduce functions,
  • a data file containing some text, such as Ulysses, and
  • a Makefile to compile the C++ file.

wordcount.cpp

  • The wordcount program is shown below.
  • It contains two classes: one for the map, one for the reduce.
  • It makes use of several Hadoop helper classes, one of which, StringUtils, contains useful methods for converting between strings and other types.
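
For reference, the conversion helpers used in the program behave roughly as in this minimal sketch (HadoopUtils::splitString, toString, and toInt are declared in hadoop/StringUtils.hh):

#include <string>
#include <vector>
#include "hadoop/StringUtils.hh"

int demo() {
  // split a line into words on spaces: yields {"hello", "world"}
  std::vector<std::string> words = HadoopUtils::splitString( "hello world", " " );

  // round-trip an int through a string
  std::string one = HadoopUtils::toString( 1 );   // "1"
  int n = HadoopUtils::toInt( one );              // 1

  return n + (int)words.size();                   // 3
}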


#include <algorithm>
#include <limits>
#include <string>
#include <vector>    // <--- needed for vector<string> below

#include "stdint.h"  // <--- to prevent uint64_t errors!
 
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
 
using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );

    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};
 
class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their numbers ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit(context.getInputKey(), HadoopUtils::toString( count ));
  }
};
 
int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}
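
If you want to sanity-check the word-counting logic without a cluster, the same split-and-count idea can be exercised locally with nothing but the standard library. This is a sketch only, not one of the tutorial files (the file name localcount.cpp is arbitrary); note that std::cin >> word splits on any whitespace, whereas splitString above splits on single spaces.

#include <iostream>
#include <map>
#include <string>

int main() {
  // count words on standard input, doing locally what map + reduce do on the cluster
  std::map<std::string, int> counts;
  std::string word;
  while ( std::cin >> word )
    counts[word]++;

  // print (word, count) pairs, tab-separated like the Hadoop output
  std::map<std::string, int>::const_iterator it;
  for ( it = counts.begin(); it != counts.end(); ++it )
    std::cout << it->first << "\t" << it->second << "\n";
  return 0;
}

Compile it with g++ localcount.cpp -o localcount and feed it a book: ./localcount < ulysses.txt.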


Makefile

Before you create the Makefile, you need to figure out whether your computer has a 32-bit processor or a 64-bit processor, and pick the matching library. To find out, run the following command:

  uname -a

To which the OS responds:

  Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux

The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.

Once you have this information, create the Makefile (make sure to spell it with an uppercase M):

CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@

Make sure you set the HADOOP_INSTALL directory to the directory that is right for your system; it may not be the same as shown above! On a 64-bit machine, also set PLATFORM to Linux-amd64-64 and change -m32 to -m64 (or simply drop it).

Note 1: Users have reported that in some cases the command above returns errors, and that adding -lssl to the link line helps get rid of them. Thanks for the tip! --D. Thiebaut 08:16, 25 February 2011 (EST)
Note 2: User Kösters reported having to add -lcrypto after the -lssl switch to get compilation to work, and suggested installing the libssl-dev package (sudo apt-get install libssl-dev).

Data File

  • We'll assume that you have some large text files already in HDFS, in a directory called dft1.

Compiling and Running

  • You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
  sudo apt-get install g++
  • Compile the code:
  make  wordcount
and fix any errors you're getting.
  • Copy the executable file (wordcount) to the bin directory in HDFS:
  hadoop dfs -mkdir bin                    (Note: it may already exist, in which case mkdir complains; this is safe to ignore)
  hadoop dfs -put  wordcount   bin/wordcount

  • Run the program!
  hadoop pipes -D hadoop.pipes.java.recordreader=true \
               -D hadoop.pipes.java.recordwriter=true \
               -input dft1 -output dft1-out \
               -program bin/wordcount
  • Verify that you have gotten the right output:
  hadoop dfs -text dft1-out/part-00000
 
  "Come   1
  "Defects,"      1
  "I      1
  "Information    1
  "J"     1
  "Plain  2
  ...
  zodiacal        2
  zoe)_   1
  zones:  1
  zoo.    1
  zoological      1
  zouave's        1
  zrads,  2
  zrads.  1





Lab Experiment #1
Compare the execution time of the streaming wordcount written in Python with that of the C++ wordcount run through Pipes, on Ulysses (in HDFS in dft1) or on Ulysses plus 5 other books (in HDFS in dft6).





Lab Experiment #2
Modify your C++ program to make it count the number of words containing the string "Buck".
The standard C++ string class is documented in the C++ Standard Library reference, along with examples of the use of some of its functions. A possible starting point is sketched below.
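
A sketch only: reuse the mapper from wordcount.cpp, test each word with std::string::find, and leave the reducer unchanged. The single output key "Buck" is an arbitrary choice so that the reducer produces one grand total.

// inside WordCountMapper::map(), replace the emit loop with:
for ( unsigned int i = 0; i < words.size(); i++ ) {
  if ( words[i].find( "Buck" ) != string::npos ) {       // true if the word contains "Buck"
    context.emit( "Buck", HadoopUtils::toString( 1 ) );  // one shared key: reduce sums to a single count
  }
}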



Computing the Maximum Temperature in NCDC Data-Files


This is taken directly from Tom White's Hadoop: The Definitive Guide [2]. It fixes a bug in the book that prevents the example code given on page 36 from compiling.


The Files

You need 3 files to run the max_temperature example:

  • a C++ file containing the map and reduce functions,
  • a data file containing some temperature data, such as found at the National Climatic Data Center (NCDC), and
  • a Makefile to compile the C++ file.

max_temperature.cpp

#include <algorithm>
#include <limits>
#include <string>

#include  "stdint.h"  // <-- this is missing from the book

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }

  // map function: parses one fixed-width NCDC record and emits
  // (year, temperature) when the reading is present and trustworthy
  void map(HadoopPipes::MapContext& context) {
    string line = context.getInputValue();
    string year = line.substr(15, 4);            // characters 15-18
    string airTemperature = line.substr(87, 5);  // signed, in tenths of a degree Celsius
    string q = line.substr(92, 1);               // quality code

    //--- skip missing readings (+9999) and bad quality codes ---
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function: keeps the largest temperature seen for a given year
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = -10000;  // below any physically possible reading
    while (context.nextValue()) {
      maxValue = max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper, 
                              MapTemperatureReducer>());
}


Makefile

Create a Makefile with the following entries. As with the wordcount example above, you need to know whether your machine has a 32-bit or a 64-bit processor, and set PLATFORM accordingly (see the uname -a discussion in the previous Makefile section).

CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include


max_temperature: max_temperature.cpp 
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@

Data File

  • Create a file called sample.txt which will contain sample temperature data from the NCDC.
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
  • Put the data file in HDFS:
 hadoop dfs -mkdir ncdc  
 hadoop dfs -put sample.txt ncdc
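
To see exactly what the mapper's substr() calls extract, here is a minimal standalone sketch (not one of the tutorial files) applied to the first record of sample.txt above:

#include <iostream>
#include <string>

int main() {
  // first record of sample.txt (one long line, split here for readability)
  std::string line =
    "0067011990999991950051507004+68750+023550FM-12+038299999"
    "V0203301N00671220001CN9999999N9+00001+99999999999";

  std::cout << "year:        " << line.substr(15, 4) << "\n"   // 1950
            << "temperature: " << line.substr(87, 5) << "\n"   // +0000, i.e., 0.0 degrees C
            << "quality:     " << line.substr(92, 1) << "\n";  // 1 (a valid quality code)
  return 0;
}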

Compiling and Running

  • Compile the code (g++ was already installed for the wordcount example):
  make  max_temperature
and fix any errors you're getting.
  • Copy the executable file (max_temperature) to a bin directory in HDFS:
  hadoop dfs -mkdir bin                    (skip this if bin already exists from the wordcount example)
  hadoop dfs -put max_temperature bin/max_temperature

  • Run the program!
  hadoop pipes -D hadoop.pipes.java.recordreader=true \
               -D hadoop.pipes.java.recordwriter=true \
               -input ncdc/sample.txt -output ncdc-out \
               -program bin/max_temperature
  • Verify that you have gotten the right output:
  hadoop dfs -text ncdc-out/part-00000

  1949	111
  1950	22


Congrats, you have computed the maximum of 5 recorded temperatures for 2 different years! (NCDC temperatures are stored in tenths of a degree Celsius, so 111 means 11.1°C and 22 means 2.2°C.)

Speed Test: Java vs Python vs C++

Data Set #1: 6 books

  • The wordcount program in native Java, in Python streaming mode, and in C++ Pipes mode is run on 6 books from Project Gutenberg:
-rw-r--r--   2 hadoop supergroup     393995 2010-03-24 06:30 /user/hadoop/dft6/ambroseBierce.txt
-rw-r--r--   2 hadoop supergroup     674762 2010-03-24 06:30 /user/hadoop/dft6/arthurThompson.txt
-rw-r--r--   2 hadoop supergroup     590099 2010-03-24 06:30 /user/hadoop/dft6/conanDoyle.txt
-rw-r--r--   2 hadoop supergroup    1945731 2010-03-24 06:30 /user/hadoop/dft6/encyclopediaBritanica.txt
-rw-r--r--   2 hadoop supergroup     343694 2010-03-24 06:30 /user/hadoop/dft6/sunzi.txt
-rw-r--r--   2 hadoop supergroup    1573044 2010-03-23 21:16 /user/hadoop/dft6/ulysses.txt


  • The successful execution of all three versions yields these execution times (real time):



  Method    Real Time (seconds)    Ratio to Java (Java time / method time)
  Java      19.2                   1.00
  C++       40.2                   0.48
  Python    23.4                   0.82



Data Set #2: 4 180-MB files

  • Same idea as above, but with much larger files that will be divided into several input splits:
-rw-r--r--   2 hadoop supergroup  187789938 2010-04-05 15:58 /user/hadoop/wikipages/block/all_00.xml
-rw-r--r--   2 hadoop supergroup  192918963 2010-04-05 16:14 /user/hadoop/wikipages/block/all_01.xml
-rw-r--r--   2 hadoop supergroup  198549500 2010-04-05 16:20 /user/hadoop/wikipages/block/all_03.xml
-rw-r--r--   2 hadoop supergroup  191317937 2010-04-05 16:21 /user/hadoop/wikipages/block/all_04.xml
  • Execution times (real time):
  Method    Real Time        Ratio to Java (Java time / method time)
  Java      2 min 15.7 s     1.000
  C++       5 min 26 s       0.416
  Python    12 min 46.5 s    0.177



Google MapReduce vs Hadoop

The Terabyte Sort benchmark gives a good estimate of the difference between the two.

In May 2009, Yahoo! reported in its Yahoo! Developer Network Blog sorting 1 TB of data in 62 seconds on 3800 computers.

In November 2008, Google reported on its blog sorting 1 TB of data with MapReduce in 68 seconds, on 1000 computers.



  Company   Platform    Number of Computers   1 TB Sort Execution Time
  Google    MapReduce   1000                  68 sec
  Yahoo!    Hadoop      3800                  62 sec


So, roughly, given the number of computers and assuming linear speedup, Google's MapReduce comes out about 3.5 times faster than Hadoop: scaling Yahoo!'s run down to 1000 computers would give 3.8 × 62 ≈ 236 seconds, versus Google's 68 seconds (236 / 68 ≈ 3.5).

References

  1. Hadoop Wiki, C/C++ MapReduce Code & Build.
  2. Tom White, Hadoop: The Definitive Guide, O'Reilly Media, June 2009, ISBN 0596521979.