CSC352 Notes 2013
Setting Up Hadoop and Eclipse on the Mac
Install Hadoop
No big deal: just unpack hadoop-0.19.1.tgz and create a symbolic link named hadoop that points to the directory holding hadoop-0.19.1.
Verify configuration of Hadoop
cd
cd hadoop/conf
cat hadoop-site.xml
Yields
~/hadoop/conf$: cat hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9100</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9101</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/Users/thiebaut/hdfs/data</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/Users/thiebaut/hdfs/name</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
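To double-check the values without reading the XML by eye, the properties can be listed programmatically. The following is a minimal sketch that uses only the JDK's DOM parser, not Hadoop's own configuration classes; the class name SiteConfigReader and the method readProperties are hypothetical names chosen for this illustration, and the sample string contains just two of the properties from the file above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Hadoop: lists the <property> entries
// of a hadoop-site.xml style document using the JDK's DOM parser.
class SiteConfigReader {

    // Returns property name -> value, in document order.
    static Map<String, String> readProperties(String xmlText) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmlText)));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable site file", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample holding two of the properties from the file above.
        String sample =
              "<?xml version=\"1.0\"?>"
            + "<configuration>"
            + "<property><name>fs.default.name</name>"
            + "<value>hdfs://localhost:9100</value></property>"
            + "<property><name>mapred.job.tracker</name>"
            + "<value>localhost:9101</value></property>"
            + "</configuration>";
        for (Map.Entry<String, String> e : readProperties(sample).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

To check the real file, read the contents of ~/hadoop/conf/hadoop-site.xml into a string and pass it to readProperties instead of the inline sample.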
Setting up Eclipse for Hadoop
- Java 1.6
- http://v-lad.org/Tutorials/Hadoop/03%20-%20Prerequistes.html
- download Eclipse 3.3.2 (Europa) from http://www.eclipse.org/downloads/packages/release/europa/winter
- Use Hadoop 0.19.1
- open eclipse and deploy (Mac)
- uncompress Hadoop 0.19.1
- copy the eclipse-plugin jar from the Hadoop distribution to the plugins directory of Eclipse
- start hadoop on the Mac and follow directions from http://v-lad.org/Tutorials/Hadoop page:
start-all.sh
Map-Reduce Locations
- setup eclipse
- localhost
- Map/Reduce Master: localhost, 9101
- DFS Master: Use M/R Master host, localhost, 9100
- user name: hadoop-thiebaut
- SOCKS proxy: (not checked) host, 1080
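If Eclipse cannot connect to the location, it may help to confirm that something is actually listening on the two ports after start-all.sh has run. The following is a rough sketch using plain java.net sockets; the class PortCheck and the method isListening are hypothetical names, not part of Hadoop or the Eclipse plug-in, and the port numbers are the ones from hadoop-site.xml above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Hadoop or the Eclipse plug-in.
class PortCheck {

    // True if something accepts a TCP connection on host:port within timeoutMs.
    static boolean isListening(String host, int port, int timeoutMs) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // nothing useful to do if close fails
            }
        }
    }

    public static void main(String[] args) {
        // Ports taken from hadoop-site.xml: 9100 (DFS), 9101 (job tracker).
        int[] ports = {9100, 9101};
        for (int port : ports) {
            System.out.println("localhost:" + port + " listening: "
                    + isListening("localhost", port, 1000));
        }
    }
}
```

A successful connection only shows that some process is bound to the port; it does not prove that the process is the NameNode or the JobTracker.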
DFS Locations
- Open DFS Locations
  - localhost
    - (2)
      - tmp (1)
        - hadoop-thiebaut (1)
          - mapred (1)
            - system (0)
      - user (1)
        - thiebaut (2)
          - hello.txt
          - readme.txt
- Create the In directory in HDFS:
hadoop fs -mkdir In
Create a new project with Eclipse
Create a project as explained in http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html
Project
- Right-click on the blank space in the Project Explorer window and select New -> Project.. to create a new project.
- Select Map/Reduce Project from the list of project types.
- Press the Next button.
- Project Name: HadoopTest
- Use default location
- click on configure hadoop location, browse, and select /Users/thiebaut/hadoop-0.19.1 (or whatever it is)
- Ok
- Finish
Map/Reduce driver class
- Right-click on the newly created Hadoop project in the Project Explorer tab and select New -> Other from the context menu.
- Go to the Map/Reduce folder, select MapReduceDriver, then press the Next button.
- When the MapReduce Driver wizard appears, enter TestDriver in the Name field and press the Finish button. This will create the skeleton code for the MapReduce Driver.
- Finish
- Unfortunately the Hadoop plug-in for Eclipse is slightly out of step with the recent Hadoop API, so we need to edit the driver code a bit.
- Find the following two lines in the source code and comment them out:
conf.setInputPath(new Path("src"));
conf.setOutputPath(new Path("out"));
- Enter the following code immediately after the two lines you just commented out:
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("In"));
FileOutputFormat.setOutputPath(conf, new Path("Out"));
- After you have changed the code, you will see the new lines marked as incorrect by Eclipse. Click on the error icon for each line and select Eclipse's suggestion to import the missing class. You need to import the following classes: TextInputFormat, TextOutputFormat, FileInputFormat, FileOutputFormat.
- After the missing classes are imported you are ready to run the project.
Running the Project
- Right-click on the TestDriver class in the Project Explorer tab and select Run As --> Run on Hadoop. This brings up a window for choosing the Hadoop host.
- Select localhost as hadoop host to run on
- You should see something like this:
09/12/15 20:15:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/12/15 20:15:31 INFO mapred.FileInputFormat: Total input paths to process : 3
09/12/15 20:15:32 INFO mapred.JobClient: Running job: job_200912152008_0001
09/12/15 20:15:33 INFO mapred.JobClient: map 0% reduce 0%
09/12/15 20:16:05 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000000_0, Status : FAILED
09/12/15 20:16:19 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000001_0, Status : FAILED
WordCount Example on Eclipse on Mac
- New project: make sure to select Map/Reduce project. Call it WordCount
- In this project, right-click and select New, then Other, then Mapper, and enter the code below:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused for every emitted pair to avoid allocating objects per token.
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    // Emits (token, 1) for each whitespace-separated, lowercased token of the line.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
- Similarly, in this project, right click and New, then Other, then Reducer, and enter the code below:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums all counts received for one word and emits (word, total).
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
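To see what the mapper and reducer above compute without starting a cluster, the same data flow can be walked through in plain Java. This is only an illustrative sketch: it uses no Hadoop classes, the names WordCountFlow, mapLine, and count are hypothetical, and the grouping that Hadoop performs between the map and reduce phases is imitated here with a sorted map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java walk-through of the word-count data flow (no Hadoop involved).
class WordCountFlow {

    // "Map" step: one lowercased token per emitted key, each standing for a count of 1.
    static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            emitted.add(itr.nextToken());
        }
        return emitted;
    }

    // Groups the emitted 1s by word, then sums each group ("reduce" step).
    static Map<String, Integer> count(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            for (String word : mapLine(line)) {
                List<Integer> ones = grouped.get(word);
                if (ones == null) {
                    ones = new ArrayList<Integer>();
                    grouped.put(word, ones);
                }
                ones.add(1);
            }
        }
        Map<String, Integer> totals = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello world", "hello Hadoop");
        System.out.println(count(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```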
- Similarly, in this project, right click and New, then Other, then Driver, and enter code below:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {
    public static void main(String[] args) {
        JobConf conf = new JobConf(WordCount.class);
        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // specify input and output formats and dirs
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // In must already exist in the DFS area; Out must not exist yet
        FileInputFormat.setInputPaths(conf, new Path("In"));
        FileOutputFormat.setOutputPath(conf, new Path("Out"));
        // specify a mapper
        conf.setMapperClass(WordCountMapper.class);
        // specify a reducer, reused as the combiner
        conf.setReducerClass(WordCountReducer.class);
        conf.setCombinerClass(WordCountReducer.class);
        try {
            // submits the job and blocks until it completes
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
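The driver above registers WordCountReducer as the combiner as well as the reducer. That works for word count because addition is associative and commutative: summing partial sums gives the same total as summing every 1 directly. The plain-Java sketch below illustrates this with made-up counts for a single word coming from two map tasks; the class CombinerSketch and its data are hypothetical and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;

// Illustrates why a summing reducer can double as a combiner.
class CombinerSketch {

    // Same operation as the reducer: add up all values for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one word, as emitted by two map tasks.
        List<Integer> task1 = Arrays.asList(1, 1, 1);
        List<Integer> task2 = Arrays.asList(1, 1);

        // Without a combiner the reducer receives all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner each task pre-sums locally; the reducer adds the partial sums.
        int combined = sum(Arrays.asList(sum(task1), sum(task2)));

        System.out.println(direct + " " + combined); // prints "5 5"
    }
}
```

A reducer that computed something non-associative, such as an average, could not be reused as a combiner this way.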
Notes on doing example in Yahoo Tutorial, Module 2
http://developer.yahoo.com/hadoop/tutorial/module2.html
cd ../hadoop/examples/
cat > HDFSHelloWorld.java
mkdir hello_classes
javac -classpath /Users/thiebaut/hadoop/hadoop-0.19.2-core.jar -d hello_classes HDFSHelloWorld.java
emacs HDFSHelloWorld.java -nw
javac -classpath /Users/thiebaut/hadoop/hadoop-0.19.2-core.jar -d hello_classes HDFSHelloWorld.java
ls
ls hello_classes/
jar -cvf helloworld.jar -C hello_classes/ .
ls
ls -ltr
find . -name "*" -print
hadoop jar helloworld.jar HDFSHelloWorld
Hello, world!