Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop


--D. Thiebaut 12:46, 13 April 2010 (UTC)


This tutorial is the continuation of Tutorial 2, and uses Hadoop Pipes to run a C++ program on the Smith College Hadoop Cluster.






Running the WordCount program in C++


The reference for this section is the C++ wordcount example presented in the Hadoop Wiki [1].

Submitting compiled C++ code is supported by Hadoop through the Pipes API; its syntax is documented in the Hadoop API reference.


The Files

You need 3 files to run the wordcount example:

  • a C++ file containing the map and reduce functions,
  • a data file containing some text, such as Ulysses, and
  • a Makefile to compile the C++ file.

wordcount.cpp

  • The wordcount program is shown below.
  • It contains two classes: one for the map, one for the reduce.
  • It makes use of several Hadoop helper classes, one of which, StringUtils, contains useful methods for converting between strings and other types.
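
For reference, the conversion helpers used in the program behave roughly as in this minimal sketch (HadoopUtils::splitString, toString, and toInt are declared in hadoop/StringUtils.hh):

#include <string>
#include <vector>
#include "hadoop/StringUtils.hh"

int demo() {
  // split a line into words on spaces: yields {"hello", "world"}
  std::vector<std::string> words = HadoopUtils::splitString( "hello world", " " );

  // round-trip an int through a string
  std::string one = HadoopUtils::toString( 1 );   // "1"
  int n = HadoopUtils::toInt( one );              // 1

  return n + (int)words.size();                   // 3
}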


#include <algorithm>
#include <limits>
#include <string>
#include <vector>    // <--- needed for vector<string> below

#include "stdint.h"  // <--- to prevent uint64_t errors!
 
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
 
using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );

    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};
 
class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their numbers ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit(context.getInputKey(), HadoopUtils::toString( count ));
  }
};
 
int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}
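
If you want to sanity-check the word-counting logic without a cluster, the same split-and-count idea can be exercised locally with nothing but the standard library. This is a sketch only, not one of the tutorial files (the file name localcount.cpp is arbitrary); note that std::cin >> word splits on any whitespace, whereas splitString above splits on single spaces.

#include <iostream>
#include <map>
#include <string>

int main() {
  // count words on standard input, doing locally what map + reduce do on the cluster
  std::map<std::string, int> counts;
  std::string word;
  while ( std::cin >> word )
    counts[word]++;

  // print (word, count) pairs, tab-separated like the Hadoop output
  std::map<std::string, int>::const_iterator it;
  for ( it = counts.begin(); it != counts.end(); ++it )
    std::cout << it->first << "\t" << it->second << "\n";
  return 0;
}

Compile it with g++ localcount.cpp -o localcount and feed it a book: ./localcount < ulysses.txt.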


Makefile

Before you create the Makefile, you need to figure out whether your computer has a 32-bit processor or a 64-bit processor, and pick the matching library. To find out, run the following command:

  uname -a

To which the OS responds:

  Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux

The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.

Once you have this information, create the Makefile (make sure to spell it with an uppercase M):

CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@

Make sure you set the HADOOP_INSTALL directory to the directory that is right for your system; it may not be the same as shown above! On a 64-bit machine, also set PLATFORM to Linux-amd64-64 and change -m32 to -m64 (or simply drop it).

Note 1: Users have reported that in some cases the command above returns errors, and that adding -lssl to the link line helps get rid of them. Thanks for the tip! --D. Thiebaut 08:16, 25 February 2011 (EST)
Note 2: User Kösters reported having to add -lcrypto after the -lssl switch to get compilation to work, and suggested installing the libssl-dev package (sudo apt-get install libssl-dev).

Data File

  • We'll assume that you have some large text files already in HDFS, in a directory called dft1.

Compiling and Running

  • You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
  sudo apt-get install g++
  • Compile the code:
  make  wordcount
and fix any errors you're getting.
  • Copy the executable file (wordcount) to the bin directory in HDFS:
  hadoop dfs -mkdir bin                    (Note: it may already exist, in which case mkdir complains; this is safe to ignore)
  hadoop dfs -put  wordcount   bin/wordcount

  • Run the program!
  hadoop pipes -D hadoop.pipes.java.recordreader=true \
               -D hadoop.pipes.java.recordwriter=true \
               -input dft1 -output dft1-out \
               -program bin/wordcount
  • Verify that you have gotten the right output:
  hadoop dfs -text dft1-out/part-00000
 
  "Come   1
  "Defects,"      1
  "I      1
  "Information    1
  "J"     1
  "Plain  2
  ...
  zodiacal        2
  zoe)_   1
  zones:  1
  zoo.    1
  zoological      1
  zouave's        1
  zrads,  2
  zrads.  1





Lab Experiment #1
Compare the execution time of the streaming wordcount written in Python with that of the C++ wordcount run through Pipes, on Ulysses (in HDFS in dft1) or on Ulysses plus 5 other books (in HDFS in dft6).





Lab Experiment #2
Modify your C++ program to make it count the number of words containing the string "Buck".
The standard C++ string class is documented in the C++ Standard Library reference, along with examples of the use of some of its functions. A possible starting point is sketched below.
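
A sketch only: reuse the mapper from wordcount.cpp, test each word with std::string::find, and leave the reducer unchanged. The single output key "Buck" is an arbitrary choice so that the reducer produces one grand total.

// inside WordCountMapper::map(), replace the emit loop with:
for ( unsigned int i = 0; i < words.size(); i++ ) {
  if ( words[i].find( "Buck" ) != string::npos ) {       // true if the word contains "Buck"
    context.emit( "Buck", HadoopUtils::toString( 1 ) );  // one shared key: reduce sums to a single count
  }
}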



Computing the Maximum Temperature in NCDC Data-Files


This is taken directly from Tom White's Hadoop: The Definitive Guide [2]. It fixes a bug in the book that prevents the example code given on page 36 from compiling.


The Files

You need 3 files to run the max_temperature example:

  • a C++ file containing the map and reduce functions,
  • a data file containing some temperature data, such as found at the National Climatic Data Center (NCDC), and
  • a Makefile to compile the C++ file.

max_temperature.cpp

#include <algorithm>
#include <limits>
#include <string>

#include  "stdint.h"  // <-- this is missing from the book

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }

  // map function: parses one fixed-width NCDC record and emits
  // (year, temperature) when the reading is present and trustworthy
  void map(HadoopPipes::MapContext& context) {
    string line = context.getInputValue();
    string year = line.substr(15, 4);            // characters 15-18
    string airTemperature = line.substr(87, 5);  // signed, in tenths of a degree Celsius
    string q = line.substr(92, 1);               // quality code

    //--- skip missing readings (+9999) and bad quality codes ---
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function: keeps the largest temperature seen for a given year
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = -10000;  // below any physically possible reading
    while (context.nextValue()) {
      maxValue = max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper, 
                              MapTemperatureReducer>());
}


Makefile

Create a Makefile with the following entries. As with the wordcount example above, you need to know whether your machine has a 32-bit or a 64-bit processor, and set PLATFORM accordingly (see the uname -a discussion in the previous Makefile section).

CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include


max_temperature: max_temperature.cpp 
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@

Data File

  • Create a file called sample.txt which will contain sample temperature data from the NCDC.
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
  • Put the data file in HDFS:
 hadoop dfs -mkdir ncdc  
 hadoop dfs -put sample.txt ncdc
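
To see exactly what the mapper's substr() calls extract, here is a minimal standalone sketch (not one of the tutorial files) applied to the first record of sample.txt above:

#include <iostream>
#include <string>

int main() {
  // first record of sample.txt (one long line, split here for readability)
  std::string line =
    "0067011990999991950051507004+68750+023550FM-12+038299999"
    "V0203301N00671220001CN9999999N9+00001+99999999999";

  std::cout << "year:        " << line.substr(15, 4) << "\n"   // 1950
            << "temperature: " << line.substr(87, 5) << "\n"   // +0000, i.e., 0.0 degrees C
            << "quality:     " << line.substr(92, 1) << "\n";  // 1 (a valid quality code)
  return 0;
}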

Compiling and Running

  • Compile the code (g++ was already installed for the wordcount example):
  make  max_temperature
and fix any errors you're getting.
  • Copy the executable file (max_temperature) to a bin directory in HDFS:
  hadoop dfs -mkdir bin                    (skip this if bin already exists from the wordcount example)
  hadoop dfs -put max_temperature bin/max_temperature

  • Run the program!
  hadoop pipes -D hadoop.pipes.java.recordreader=true \
               -D hadoop.pipes.java.recordwriter=true \
               -input ncdc/sample.txt -output ncdc-out \
               -program bin/max_temperature
  • Verify that you have gotten the right output:
  hadoop dfs -text ncdc-out/part-00000

  1949	111
  1950	22


Congrats, you have computed the maximum of 5 recorded temperatures for 2 different years! (NCDC temperatures are stored in tenths of a degree Celsius, so 111 means 11.1°C and 22 means 2.2°C.)

Speed Test: Java vs Python vs C++

Data Set #1: 6 books

  • The wordcount program in native Java, in Python streaming mode, and in C++ Pipes mode is run on 6 books from Project Gutenberg:
-rw-r--r--   2 hadoop supergroup     393995 2010-03-24 06:30 /user/hadoop/dft6/ambroseBierce.txt
-rw-r--r--   2 hadoop supergroup     674762 2010-03-24 06:30 /user/hadoop/dft6/arthurThompson.txt
-rw-r--r--   2 hadoop supergroup     590099 2010-03-24 06:30 /user/hadoop/dft6/conanDoyle.txt
-rw-r--r--   2 hadoop supergroup    1945731 2010-03-24 06:30 /user/hadoop/dft6/encyclopediaBritanica.txt
-rw-r--r--   2 hadoop supergroup     343694 2010-03-24 06:30 /user/hadoop/dft6/sunzi.txt
-rw-r--r--   2 hadoop supergroup    1573044 2010-03-23 21:16 /user/hadoop/dft6/ulysses.txt


  • The successful execution of all three versions yields these execution times (real time):



  Method    Real Time (seconds)    Ratio to Java (Java time / method time)
  Java      19.2                   1.00
  C++       40.2                   0.48
  Python    23.4                   0.82



Data Set #2: 4 180-MB files

  • Same idea as above, but with much larger files that will be divided into several input splits:
-rw-r--r--   2 hadoop supergroup  187789938 2010-04-05 15:58 /user/hadoop/wikipages/block/all_00.xml
-rw-r--r--   2 hadoop supergroup  192918963 2010-04-05 16:14 /user/hadoop/wikipages/block/all_01.xml
-rw-r--r--   2 hadoop supergroup  198549500 2010-04-05 16:20 /user/hadoop/wikipages/block/all_03.xml
-rw-r--r--   2 hadoop supergroup  191317937 2010-04-05 16:21 /user/hadoop/wikipages/block/all_04.xml
  • Execution times (real time):
  Method    Real Time        Ratio to Java (Java time / method time)
  Java      2 min 15.7 s     1.000
  C++       5 min 26 s       0.416
  Python    12 min 46.5 s    0.177



Google MapReduce vs Hadoop

The Terabyte Sort benchmark gives a good estimate of the difference between the two.

In May 2009, Yahoo! reported in its Yahoo! Developer Network Blog sorting 1 TB of data in 62 seconds on 3800 computers.

In November 2008, Google reported on its blog sorting 1 TB of data with MapReduce in 68 seconds, on 1000 computers.



  Company   Platform    Number of Computers   1 TB Sort Execution Time
  Google    MapReduce   1000                  68 sec
  Yahoo!    Hadoop      3800                  62 sec


So, roughly, given the number of computers and assuming linear speedup, Google's MapReduce comes out about 3.5 times faster than Hadoop: scaling Yahoo!'s run down to 1000 computers would give 3.8 × 62 ≈ 236 seconds, versus Google's 68 seconds (236 / 68 ≈ 3.5).

References

  1. Hadoop Wiki, C/C++ MapReduce Code & Build.
  2. Tom White, Hadoop: The Definitive Guide, O'Reilly Media, June 2009, ISBN 0596521979.