Hadoop Tutorial 3.1 -- Using Amazon's WordCount program
--D. Thiebaut 16:00, 18 April 2010 (UTC)
This is Part 1 of the Hadoop on AWS Tutorial. This part deals with streaming the word-count program already hosted on AWS and applying it to the Ulysses.txt text.
Processing Ulysses on Amazon's Elastic MapReduce: Using Amazon's WordCount
Now that you have your data in your S3 storage, we'll run Amazon's own copy of the WordCount program. This is also described in an Amazon tutorial on their developer network.
The program is written in Python and contains only the Map part of the Map-Reduce pair; the Reduce part is a predefined, standard aggregate reducer (a sketch of its effect follows the mapper code below).
The Python code is shown below:
#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            # emit one "LongValueSum:word<TAB>1" line per word found
            for word in pattern.findall(line):
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
The program is stored on S3 at the following location: s3://elasticmapreduce/samples/wordcount/wordSplit
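The predefined aggregate reducer, when every mapper line begins with LongValueSum:, simply adds up the 1's emitted for each word. The short sketch below is not Amazon's actual reducer, only a hypothetical illustration of its effect; it reads mapper output from standard input and prints one total per word:

#!/usr/bin/python
# Illustration only: sums the "LongValueSum:word<TAB>1" lines produced by
# the mapper, mimicking what the predefined aggregate reducer computes.
import sys

counts = {}
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    word = key[len("LongValueSum:"):]          # strip the "LongValueSum:" prefix
    counts[word] = counts.get(word, 0) + int(value)

for word in sorted(counts):
    print word + "\t" + str(counts[word])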
Create a New Job Flow
- First, go to the AWS Management Console and open the Amazon Elastic MapReduce tab.
- Create a New Job Flow.
- Call it WordCount
- Select Run a Sample Application
- Pick Word Count (Streaming). (Streaming refers to the fact that the program is not written in Java; its input and output are piped to and from the Map-Reduce framework through standard input and output. A local dry run of this streaming pipeline is sketched right after these steps.)
- Click Continue
- Specify the parameters as follows:
- Input Location: s3://352-xxxxxx/data/ (replace xxxxxx by the name you selected)
- Output Location: s3://352-xxxxxx/output/ (replace xxxxxx by the name you selected)
- Mapper: keep default provided
- Reducer: keep default provided
- Extra Args: leave it blank
- Click Continue
- Configure EC2 Instances, and keep the defaults (4 instances, Small instance type, your Amazon EC2 Key Pair, no debugging).
- Review, then Create Job Flow
- Close
- Refresh
- You should get a window showing your Job Flow:
- Click on the job flow line, and observe the status window appearing at the bottom of the screen:
- Then, wait...
- wait...
- wait some more... until the Job Flow completes. (In this case it took 2 minutes.)
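While you wait (or before launching the job at all), you can reproduce the streaming contract on your own machine: Hadoop Streaming feeds the input file to the mapper on standard input, sorts the mapper's output by key, and hands the sorted stream to the reducer. The dry run below rests on two assumptions this tutorial does not make for you: the mapper shown earlier is saved as wordSplitter.py, and Ulysses.txt sits in the same directory.

#!/usr/bin/python
# Local dry run of the streaming pipeline: map -> sort -> reduce.
# Assumes (hypothetically) that wordSplitter.py and Ulysses.txt are in the
# current directory; adjust the names to match your own files.
import subprocess

# Stream Ulysses.txt into the mapper over stdin, as Hadoop Streaming would.
with open("Ulysses.txt") as f:
    mapper = subprocess.Popen(["python", "wordSplitter.py"],
                              stdin=f, stdout=subprocess.PIPE)
    map_output = mapper.communicate()[0]

# Hadoop sorts the mapper output by key before the reduce step; summing the
# 1's per key is the same computation as the reducer sketch shown above.
counts = {}
for line in sorted(map_output.splitlines()):
    key, value = line.split("\t")
    counts[key] = counts.get(key, 0) + int(value)

for key in sorted(counts):
    print key[len("LongValueSum:"):] + "\t" + str(counts[key])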
Download the Results
- Once the Job Flow has completed, download the part-0000? file or files from your output folder on S3 to your disk. You can then delete these files from S3 (remember that keeping them there costs money!)
- Once they are on your local machine, observe that the file(s) contain(s) an index of words, each followed by its count (a small script for inspecting them is sketched at the end of this tutorial).
- Congratulations, you have just run your first Hadoop/MapReduce job on Amazon!
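As a final check, the sketch below merges the downloaded part files and prints the twenty most frequent words. The file-name pattern (part-0000*) and the tab-separated word/count format are assumptions based on the output described above; adjust them if your files differ:

#!/usr/bin/python
# Merge the downloaded part files and print the 20 most frequent words.
# Assumes tab-separated "word<TAB>count" lines in files named part-0000*.
import glob

counts = {}
for filename in sorted(glob.glob("part-0000*")):
    for line in open(filename):
        word, count = line.rstrip("\n").split("\t")
        counts[word] = counts.get(word, 0) + int(count)

top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for word, count in top[:20]:
    print word + "\t" + str(count)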