Revision as of 13:00, 13 January 2010

Map-Reduce/Hadoop

Options for Setup

Xen Live CD

livecd-xen-3.2-0.8.2-amd64.iso works
must have digital cable for video monitor
2 servers and 2 clients
not sure how one can use it besides examples presented

Setting up Hadoop using VmWare

Check that Google has a copy of it for download. It is just a single VM for Hadoop.
Berkeley has a lab that uses this copy: see http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html

Setting Up Hadoop and Eclipse on the Mac

Install Hadoop

No big deal: just install hadoop-0.19.1.tgz, and set a symbolic link named hadoop pointing to the directory holding hadoop-0.19.1

Verify configuration of Hadoop

cd 
cd hadoop/conf
cat hadoop-site.xml

Yields

~/hadoop/conf$: cat hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9101</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/Users/thiebaut/hdfs/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/Users/thiebaut/hdfs/name</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property> 

</configuration>
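The same check can be done programmatically. The sketch below is only an illustration using the JDK's own XML parser, not Hadoop's configuration classes; the class name SiteConfigCheck and its SAMPLE string (an abbreviated copy of the listing above with two of the five properties) are assumptions made for the example. It pulls out each name/value pair and splits the two addresses into the host and port that the Eclipse location settings ask for later.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiteConfigCheck {

    // Abbreviated copy of the hadoop-site.xml listing (assumption: two properties are enough here).
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
        + "<configuration>\n"
        + "  <property>\n"
        + "    <name>fs.default.name</name>\n"
        + "    <value>hdfs://localhost:9100</value>\n"
        + "  </property>\n"
        + "  <property>\n"
        + "    <name>mapred.job.tracker</name>\n"
        + "    <value>localhost:9101</value>\n"
        + "  </property>\n"
        + "</configuration>\n";

    // Collects every <property> element into a name -> value map, in file order.
    static Map<String, String> parseProperties(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Map<String, String> props = new LinkedHashMap<String, String>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = parseProperties(SAMPLE);
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
        // fs.default.name is a URI, so host and port can be read directly from it.
        URI dfs = URI.create(props.get("fs.default.name"));
        System.out.println("DFS Master: " + dfs.getHost() + ", " + dfs.getPort());
        // mapred.job.tracker is a plain host:port pair.
        String[] tracker = props.get("mapred.job.tracker").split(":");
        System.out.println("Map/Reduce Master: " + tracker[0] + ", " + tracker[1]);
    }
}
```

To check the real file instead of the sample, the file's contents could be read into a string and passed to parseProperties.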

Setting up Eclipse for Hadoop

requires Java 1.6
http://v-lad.org/Tutorials/Hadoop/03%20-%20Prerequistes.html
download Eclipse 3.3.2 (Europa) from http://www.eclipse.org/downloads/packages/release/europa/winter
open eclipse and deploy (Mac)
copy the eclipse-plugin from hadoop to the plugin directory of eclipse
start hadoop on the Mac and follow directions from http://v-lad.org/Tutorials/Hadoop page:
- open the cluster: http://v-lad.org/Tutorials/Hadoop/14%20-%20start%20up%20the%20cluster.html

 start-all.sh

Map-Reduce Locations

setup eclipse

http://v-lad.org/Tutorials/Hadoop/17%20-%20set%20up%20hadoop%20location%20in%20the%20eclipse.html

- localhost
- Map/Reduce Master: localhost, 9101
- DFS Master: use M/R Master host, localhost, 9100
- user name: hadoop-thiebaut
- SOCKS proxy: (not checked) host, 1080

DFS Locations

Open DFS Locations
- localhost
  - (2)
    - tmp (1)
      - hadoop-thiebaut (1)
        - mapred (1)
          - system (0)
    - user (1)
      - thiebaut (2)
        - hello.txt
        - readme.txt

make In directory:

hadoop fs -mkdir In

remove Out directory if it exists

hadoop fs -rmr Out

Create a new project with Eclipse

Create a project as explained in http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html

Project

Right-click on the blank space in the Project Explorer window and select New -> Project.. to create a new project.
Select Map/Reduce Project from the list of project types.
Press the Next button.
Project Name: HadoopTest
Use default location
Click on configure hadoop location, browse, and select /Users/thiebaut/hadoop-0.19.1 (or wherever Hadoop is installed)
OK
Finish

Map/Reduce driver class

Right-click on the newly created Hadoop project in the Project Explorer tab and select New -> Other from the context menu.
Go to the Map/Reduce folder, select MapReduceDriver, then press the Next button.
When the MapReduce Driver wizard appears, enter TestDriver in the Name field and press the Finish button. This will create the skeleton code for the MapReduce Driver.
Unfortunately the Hadoop plug-in for Eclipse is slightly out of step with the recent Hadoop API, so we need to edit the driver code a bit.

Find the following two lines in the source code and comment them out:

    conf.setInputPath(new Path("src"));
    conf.setOutputPath(new Path("out"));

Enter the following code immediately after the two lines you just commented out:

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path("In"));
    FileOutputFormat.setOutputPath(conf, new Path("Out"));

After you have changed the code, you will see the new lines marked as incorrect by Eclipse. Click on the error icon for each line and select Eclipse's suggestion to import the missing class. You need to import the following classes: TextInputFormat, TextOutputFormat, FileInputFormat, FileOutputFormat.

After the missing classes are imported you are ready to run the project.

Running the Project

Right-click on the TestDriver class in the Project Explorer tab and select Run As --> Run on Hadoop. This brings up a window for choosing the Hadoop host.
Select localhost as the Hadoop host to run on.
The console should show something like this:

09/12/15 20:15:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. 
                             Applications should  implement Tool for the same.
09/12/15 20:15:31 INFO mapred.FileInputFormat: Total input paths to process : 3
09/12/15 20:15:32 INFO mapred.JobClient: Running job: job_200912152008_0001
09/12/15 20:15:33 INFO mapred.JobClient:  map 0% reduce 0%
09/12/15 20:16:05 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000000_0, Status : FAILED
09/12/15 20:16:19 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000001_0, Status : FAILED

WordCount Example on Eclipse on Mac

New project: make sure to select Map/Reduce project. Call it WordCount

Mapper

In this project, right-click and select New, then Other, then Mapper, and enter the code below:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {

	private final IntWritable one = new IntWritable(1);
	private Text word = new Text();

	// The parameter types must match the generic parameters declared on Mapper above.
	// The untyped signature used in the Hadoop example,
	// map(WritableComparable, Writable, OutputCollector, Reporter),
	// does not implement the interface method, so Eclipse is only happy
	// once this typed version is present.
	public void map(LongWritable key, Text value,
			OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
		// split the lowercased line on whitespace and emit (word, 1) for each token
		String line = value.toString();
		StringTokenizer itr = new StringTokenizer(line.toLowerCase());
		while(itr.hasMoreTokens()) {
			word.set(itr.nextToken());
			output.collect(word, one);
		}
	}
}

Reducer

Similarly, in this project, right click and New, then Other, then Reducer, and enter the code below:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // typed parameters match the generic parameters declared on Reducer above
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

    // add up the counts emitted for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }

    output.collect(key, new IntWritable(sum));
  }
}
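The logic of the mapper and reducer can be traced without a cluster. The following is a plain-Java sketch, not Hadoop code: the class LocalWordCount and the two input lines are hypothetical, chosen only for illustration. It tokenizes lines the same way the mapper does and sums one count per token, using a sorted map to stand in for the grouping Hadoop performs between the map and reduce steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // Map step: lowercase the line and split it on whitespace, like WordCountMapper.
    // Each returned word stands for one emitted (word, 1) pair.
    static List<String> map(String line) {
        List<String> words = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            words.add(itr.nextToken());
        }
        return words;
    }

    // Group + reduce step: add 1 per occurrence, keeping keys in sorted order.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> sums = new TreeMap<String, Integer>();
        for (String line : lines) {
            for (String word : map(line)) {
                Integer previous = sums.get(word);
                sums.put(word, previous == null ? 1 : previous + 1);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, not the files used in these notes.
        List<String> lines = Arrays.asList("Hello Hadoop", "hello world");
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
        // prints:
        // hadoop	1
        // hello	2
        // world	1
    }
}
```

Because the mapper lowercases each line, "Hello" and "hello" are counted as the same word.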

Driver

Similarly, in this project, right click and New, then Other, then Driver, and enter the code below:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(WordCount.class);

    // specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // specify input and output dirs
    //FileInputPath.addInputPath(conf, new Path("input"));
    //FileOutputPath.addOutputPath(conf, new Path("output"));
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // make sure In directory exists in the DFS area
    // make sure Out directory does NOT exist in DFS area
    FileInputFormat.setInputPaths(conf, new Path("In")); 
    FileOutputFormat.setOutputPath(conf, new Path("Out"));

    // specify a mapper
    conf.setMapperClass(WordCountMapper.class);

    // specify a reducer
    conf.setReducerClass(WordCountReducer.class);
    conf.setCombinerClass(WordCountReducer.class);

    client.setConf(conf);
    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Run WordCount Project

In the Project Explorer, right-click on WordCount.java, choose Run As, and pick Run on Hadoop. Select localhost.
IT WILL TAKE A LONG TIME TO RUN!
Output in the console window:

09/12/15 20:47:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. 
               Applications should implement Tool for the same.
09/12/15 20:47:19 INFO mapred.FileInputFormat: Total input paths to process : 3
09/12/15 20:47:20 INFO mapred.JobClient: Running job: job_200912152008_0003
09/12/15 20:47:21 INFO mapred.JobClient:  map 0% reduce 0%
09/12/15 20:48:53 INFO mapred.JobClient:  map 33% reduce 0%
09/12/15 20:48:59 INFO mapred.JobClient:  map 66% reduce 0%
09/12/15 20:49:03 INFO mapred.JobClient:  map 100% reduce 0%
...
09/12/15 20:49:20 INFO mapred.JobClient:     Map input bytes=71
09/12/15 20:49:20 INFO mapred.JobClient:     Combine input records=16
09/12/15 20:49:20 INFO mapred.JobClient:     Map output records=16
09/12/15 20:49:20 INFO mapred.JobClient:     Reduce input records=15

Refresh the DFS area, find the Out folder, and check the part-00000 file:

a	2
hadoop	2
i	2
is	1
kid	1
lemon	1
lemons	1
like	4
on	1
stick	1
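A quick sanity check on this output is to add up the counts and compare the total with the "Map output records" counter in the console log. The sketch below is plain Java with the listing above pasted in as a string; the class name PartFileCheck is hypothetical, and it assumes each line is a word and a count separated by a tab, which is how TextOutputFormat writes key/value pairs by default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartFileCheck {

    // Contents of part-00000 as listed in these notes (assumed tab-separated).
    static final String PART_00000 =
        "a\t2\nhadoop\t2\ni\t2\nis\t1\nkid\t1\nlemon\t1\nlemons\t1\nlike\t4\non\t1\nstick\t1\n";

    // Parses "word<TAB>count" lines into a map, preserving file order.
    static Map<String, Integer> parse(String contents) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : contents.split("\n")) {
            if (line.trim().length() == 0) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t");
            counts.put(fields[0], Integer.parseInt(fields[1].trim()));
        }
        return counts;
    }

    // Sums all counts: this is the number of words the mappers emitted.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int count : counts.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = parse(PART_00000);
        System.out.println("distinct words: " + counts.size()); // prints "distinct words: 10"
        System.out.println("total words: " + total(counts));    // prints "total words: 16"
    }
}
```

The total of 16 agrees with the "Map output records=16" line in the console output above.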

Notes on doing example in Yahoo Tutorial, Module 2

http://developer.yahoo.com/hadoop/tutorial/module2.html

   cd ../hadoop/examples/
   cat > HDFSHelloWorld.java
   mkdir hello_classes
   javac -classpath /Users/thiebaut/hadoop/hadoop-0.19.2-core.jar -d hello_classes HDFSHelloWorld.java 
   emacs HDFSHelloWorld.java -nw
   javac -classpath /Users/thiebaut/hadoop/hadoop-0.19.2-core.jar -d hello_classes HDFSHelloWorld.java 
   ls
   ls hello_classes/
   jar -cvf helloworld.jar -C hello_classes/ .
   ls
   ls -ltr
   find . -name "*" -print
   hadoop jar helloworld.jar HDFSHelloWorld

   Hello, world!

Difference between revisions of "CSC352 Notes 2013"
