Difference between revisions of "CSC352 Notes 2013"

+
<onlydft>
 +
<br />
 +
<center>
 +
<font color="red">
 +
'''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes'''
 +
</font>
 +
</center>
 +
<br />
 +
__TOC__
 +
<br />
 +
=Resources 2013=
 +
==Rocco's Presentation 10/10/13==
 +
* libguides.smith.edu/content.php?pid=510405
 +
* idea:
 +
** for paper, start getting the thread, collage, packing, parallel image processing. 
 +
** approaches.
 +
** intro: what has been done in the field
 +
* Citation database: Web of Science
 +
* Ref Works & Zotero can help maintain citations
 +
* 5-College catalogs
 +
* Worldcat is the world catalog for books
 +
* Web of Science: can get information on references and also who's publishing in the field or which institutions are publishing in the given area.
 +
* Discover searches other databases.
 +
* Library Guide (Albany), super guide for libraries.
 +
* [http://VideoLectures.net videolectures.net]
  
+
<br />
 +
==Hadoop==
 +
<br />
 +
* [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]]
 +
* [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old)
 +
 
 +
==On-Line==
 +
 
 +
* [https://computing.llnl.gov/tutorials/parallel_comp/ Introduction to Parallel Processing]
 +
* [[Media:RITParallelProgrammingWorkshop.pdf | RIT Parallel Programming Workshop]]
 +
==Papers==
 +
* [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, By Armbrust, Michael and Fox, Armando and Griffith, Rean and Joseph, Anthony D. and Katz, Randy and Konwinski, Andy and Lee, Gunho and Patterson, David and Rabkin, Ariel and Stoica, Ion and Zaharia, Matei.
 +
* [[Media:NIST_Definition_Cloud_Computing_2010.pdf | The NIST Definition of Cloud Computing (Draft)]] (very short paper)
 +
* [[Media:NobodyGotFiredUsingHadoopOnCluster_2012.pdf| Nobody ever got fired for using Hadoop on a cluster]], Rowstron, Antony and Narayanan, Dushyanth and Donnelly, Austin and O'Shea, Greg and Douglas, Andrew
 +
* [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View From Berkeley], 2006, still good! (very long paper)
 +
* [[Media:UpdateOnaViewFromBerkeley2010.pdf | Update on a view from Berkeley]], 2010. (short paper)
 +
* [[Media:GeneralPurposeVsGPU_Comparison_Many_Cores_2010_Caragea.pdf |General-Purpose vs. GPU: Comparisons of Many-Cores on Irregular Workloads]], 2010
 +
* [[Media:ParallelCOmputingWithPatternsAndFrameworks2010b.pdf | Parallel Computing with Patterns and Frameworks]], 2010, ''XRDS''.
 +
* [[Media:ServerVirtualizationArchitectureAndImplementation2009.pdf | Server Virtualization Architecture and Implementation]], xrds, 2009.
 +
* [[Media:XGridHadoopCloser2011.pdf | Processing Wikipedia Dumps: A Case-Study comparing the XGrid and MapReduce Approaches]], D. Thiebaut, Yang Li, Diana Jaunzeikare, Alexandra Cheng, Ellysha Raelen Recto, Gillian Riggs, Xia Ting Zhao, Tonje Stolpestad, and Cam Le T Nguyen, ''in proceedings of 1st Int'l Conf. On Cloud Computing and Services Science'' (CLOSER 2011), Noordwijkerhout, NL, May 2011. ([[Media:XGridHadoopFeb2011.pdf |longer version]])
 +
* [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper).
 +
* [[Media:UnderstandingThroughputOrientedArchitectures2010.pdf | Understanding Throughput-Oriented Architectures]], CACM, 2010.
 +
* [[Media:LearningFromTheSuccessOfMPI2002_WilliamGropp.pdf | Learning from the Success of MPI]], by William D. Gropp, Argonne National Lab, 2002.
 +
<p>
 +
 
 +
==Art ==
 +
* Maggie Lind's [[Media:MaggieLindProposalCSC352.pdf | MaggieLindProposalCSC352.pdf]]
 +
* Fraser?
 +
* Chester?
 +
<br />
 +
==Some good references==
 +
* Sounds of wikipedia: http://listen.hatnote.com/#nowelcomes,en
 +
 
 +
* Exhibition at Somerset House
 +
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
 +
<br />
 +
{|
 +
|
 +
[http://lens.blogs.nytimes.com/2010/03/23/behind-38/?_r=0 Bill Cunningham] of the New York Times.
 +
|
 +
[[Image:BillCunningham.jpg|150px|right]]
 +
|-
 +
|
 +
[http://infosthetics.com/archives/2013/07/phototrails_the_visual_structure_of_millions_of_user-generated_photos.html visual structures of millions of user-generated photos]
 +
|
 +
[[Image:milionsUserGeneratedPhotos.jpg|right|150px]]
 +
|-
 +
|
 +
[[Image:digitalsignagecollection.png|150px]]
 +
|
 +
[http://www.digitalsignageconnection.com/art-museum-creates-interactive-visitor-experience-christie-microtiles-video-walls-959  Cleveland Museum of Art's Collection Wall allows up to 16 people to interact simultaneously with the wall using RFID tags on iPad stations.]
 +
|}
 +
<br />
 +
<br />
 +
*[http://computinged.wordpress.com/2012/11/21/cs2013-ironman-draft-available/ Ironman ACM/IEEE Curriculum] ([http://ai.stanford.edu/users/sahami/CS2013//ironman-draft/cs2013-ironman-v0.8.pdf link to the pdf paper]). The report suggests that parallel and distributed computing should be an integral part of the CS curriculum. Some people (e.g. Danner & Newhall at Swarthmore) go even further and suggest it should be incorporated at all levels of the curriculum.
 +
 
 +
=Misc. Topics=
 +
* Latex
 +
* writing papers
 +
* reading ==> Newsletter
 +
* presentations
 +
* museum visit
 +
* parallel programming
 +
** MPI
 +
** Java threads
 +
** concurrency issues
 +
** where's the data?  Where are the processors?
 +
* Projects
 +
** MPI
 +
** GPU
 +
* Look at recent conference.  Where are the trends? [http://conference.researchbib.com/?action=viewEventDetails&eventid=26507&uid=raf013 APPT 2013 - 2013 International Conference on Advanced Parallel Processing Technology]
 +
 
 +
=XSEDE.ORG=
 +
* registered 8/8/13
 +
* https://portal.xsede.org/
 +
* [http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf My notes from the Berkeley Boot-Camp on-line Workshop], Aug 2013.
 +
 
 +
=Update 2015: Downloading images to Hadoop0=
 +
<br />
 +
* Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx where xxx is 0, 1, 2, ... 9, a, b, c, d, e, f.
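
The sixteen directory names in the bullet above can be generated rather than typed. The following is a minimal sketch, assuming Java 8 or later; the class and method names are made up for illustration, and it only builds the URLs (the rsync call itself is left to the shell).

```java
import java.util.ArrayList;
import java.util.List;

public class MirrorDirs {
    // Base URL copied from the note above; everything else here is illustrative.
    static final String BASE =
        "http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/";

    /** Returns the 16 top-level directory URLs, one per hexadecimal digit. */
    public static List<String> directoryUrls() {
        List<String> urls = new ArrayList<>();
        for (char c : "0123456789abcdef".toCharArray()) {
            urls.add(BASE + c);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print one URL per line so the output can be fed to a download loop.
        for (String url : directoryUrls()) {
            System.out.println(url);
        }
    }
}
```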
 +
<br />
 +
=Downloading All Wikipedia Images=
 +
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
 +
:::''Where are images and uploaded files<br /><br />Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors.<br /><br />Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal)''
 +
 
 +
* [http://wikimedia.wansec.com/other/pagecounts-raw/ Page View Statistics for Wikimedia projects] at wikimedia.wansec.com/other/pagecounts-raw/
 +
* The main information about the dumps and the format is here: [https://wikitech.wikimedia.org/wiki/Dumps/media https://wikitech.wikimedia.org/wiki/Dumps/media]
 +
:::''Tarballs are generated on a server provided by Your.org and made available from that mirror. The rsynced copy of the media itself and an rsynced copy of the above files (image/imagelinks/redirs info) is used as input to createmediatarballs.py to create two series of tarballs per wiki, one containing all locally uploaded media and the other containing all media uploaded to commons and used on the wiki.<br />One series of tarballs (with names looking like, e.g., enwiki-20120430-remote-media-1.tar, enwiki-20120430-remote-media-2.tar, and so on for remote media, and enwiki-20120430-local-media-1.tar, enwiki-20120430-local-media-2.tar and so on for local media), should contain all media for a given project. We bundle up the media into tarballs of 100k files per tarball for convenience of the downloader.<br />''
 +
 
 +
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
 +
** The size of all the media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps.  Total = 2.3 TB.
 +
 enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-3.tar 25.6 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-4.tar 21.5 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-5.tar 20.7 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-6.tar 22.4 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-7.tar 18.2 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-8.tar 24.4 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-10.tar 90.5 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-11.tar 88.2 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-12.tar 88.4 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-13.tar 89.6 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-14.tar 88.6 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-15.tar 91.2 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-16.tar 91.3 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-17.tar 89.4 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-18.tar 90.0 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-19.tar 90.0 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-2.tar 90.5 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-20.tar 90.1 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-21.tar 91.2 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-22.tar 89.3 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-23.tar 91.0 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-3.tar 88.6 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-4.tar 90.0 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-5.tar 90.9 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-6.tar 88.3 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-7.tar 89.6 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-8.tar 90.4 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-9.tar 89.7 GB 12/7/12 12:00:00 AM
 +
 +
* To get them, store list above in a text file (listOfTarArchives.txt) and use wget:
 +
 
 +
 for i in $(awk '{print $1}' listOfTarArchives.txt | grep -v bz2); do
     echo "$i"
     wget "ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i"
 done
 +
 
 +
* Total size should be 2.310 TB.
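
To check the downloaded total against the listing, the sizes can be summed programmatically. The sketch below is a hedged illustration, assuming Java 8 or later and assuming each line has the form "name size unit date time AM/PM" as in the listing above. The class name, the method name, and the choice of 1 TB = 1000 GB are assumptions, not part of the original notes.

```java
import java.util.Arrays;
import java.util.List;

public class TarballSizes {
    /**
     * Sums the sizes (in GB) of the entries in a listing whose lines look like
     * "name size unit date time AM/PM". Entries ending in .bz2 are skipped,
     * mirroring the grep -v bz2 in the wget loop. Only GB and TB are handled,
     * and 1 TB is taken as 1000 GB (an assumption).
     */
    public static double totalGb(List<String> lines) {
        double total = 0.0;
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 3 || f[0].endsWith(".bz2")) {
                continue; // malformed line or compressed duplicate
            }
            double size = Double.parseDouble(f[1]);
            total += f[2].equals("TB") ? size * 1000.0 : size;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three lines copied from the listing; a real run would read
        // every line of listOfTarArchives.txt instead.
        List<String> sample = Arrays.asList(
            "enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM",
            "enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM",
            "enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM");
        System.out.printf("%.1f GB%n", totalGb(sample));
    }
}
```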
 +
 
 +
=Download the page statistics=
 +
 
 +
==Links of Interest==
 +
* http://stats.grok.se/
 +
* http://stats.grok.se/about
 +
* http://dom.as/
 +
* http://dumps.wikimedia.org/other/pagecounts-raw/
 +
* http://dumps.wikimedia.org/other/pagecounts-raw/2013/
 +
* started downloading all files from above link to hadoop0:/media/dominique/3TB/mediawiki/statistics/
 +
* wgetStats.sh
 +
 #!/bin/bash
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
 ...
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000
 +
 
 +
----
 +
=Resources 2010=
 +
____
 +
 
 +
=Map-Reduce/Hadoop=
 +
 
 +
==Options for Setup==
 +
=== Xen Live CD===
 +
 
 +
* livecd-xen-3.2-0.8.2-amd64.iso '''works'''
 +
* must have digital cable for video monitor
 +
* 2 servers and 2 clients
 +
* not sure how one can use it besides examples presented
 +
 
 +
 
 +
===Setting up Hadoop using VmWare===
 +
 
 +
* Check that Google has one copy of it for download.  Just a single VM for hadoop
 +
* Berkeley has a lab that uses this copy: see http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html
 +
 
 +
==Setting Up Hadoop and Eclipse on the Mac==
 +
 
 +
===Install Hadoop===
 +
 
 +
No big deal, just install hadoop-0.19.1.tgz, and set a symbolic link '''hadoop''' pointing to the directory holding hadoop-0.19.1
 +
 
 +
===Verify configuration of Hadoop===
 +
 
 +
 cd
 cd hadoop/conf
 cat hadoop-site.xml
 +
 
 +
Yields
 +
 
 +
<code><pre>
~/hadoop/conf$: cat hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9101</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/Users/thiebaut/hdfs/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/Users/thiebaut/hdfs/name</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>
</pre></code>
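
The two port numbers in this file are the ones the Eclipse plug-in later asks for (DFS Master and Map/Reduce Master), so it can help to read them out of the file rather than by eye. The following is a minimal sketch using only the JDK's XML parser, assuming Java 8 or later; the class and method names are hypothetical, and two of the properties are inlined so the example is self-contained.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiteConfigReader {
    /** Parses <property><name/><value/></property> entries into a map. */
    public static Map<String, String> readProperties(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Two properties copied from the file above.
        String xml = "<?xml version=\"1.0\"?>"
                + "<configuration>"
                + "<property><name>fs.default.name</name>"
                + "<value>hdfs://localhost:9100</value></property>"
                + "<property><name>mapred.job.tracker</name>"
                + "<value>localhost:9101</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(xml);

        // The DFS port is the port component of the fs.default.name URI.
        int dfsPort = URI.create(props.get("fs.default.name")).getPort();

        // mapred.job.tracker is a plain host:port pair.
        String tracker = props.get("mapred.job.tracker");
        int trackerPort = Integer.parseInt(tracker.substring(tracker.lastIndexOf(':') + 1));

        System.out.println("DFS Master port: " + dfsPort);
        System.out.println("Map/Reduce Master port: " + trackerPort);
    }
}
```

With the values above this reports 9100 for the DFS Master and 9101 for the Map/Reduce Master.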
 +
 
 +
==Setting up Eclipse for Hadoop==
 +
 
 +
* requires Java 1.6
* http://v-lad.org/Tutorials/Hadoop/03%20-%20Prerequistes.html
** [[media:hadoopWithEclipse1.pdf | page1.pdf]]
** [[media:hadoopWithEclipse2.pdf | page2.pdf]]
** [[media:hadoopWithEclipse3.pdf | page3.pdf]]
** [[media:hadoopWithEclipse4.pdf | page4.pdf]]
* download Eclipse 3.3.2 (Europa) from http://www.eclipse.org/downloads/packages/release/europa/winter
* Use Hadoop 0.19.1
* open eclipse and deploy (Mac)
* uncompress hadoop 0.19.1
* copy the eclipse-plugin from hadoop to the plugin directory of eclipse
* start hadoop on the Mac and follow directions from http://v-lad.org/Tutorials/Hadoop page:
   start-all.sh
  
 +
===Map-Reduce Locations===
 
* setup eclipse
:  http://v-lad.org/Tutorials/Hadoop/17%20-%20set%20up%20hadoop%20location%20in%20the%20eclipse.html
** localhost
** Map/Reduce Master: localhost, 9101
** DFS Master: use M/R Master host, localhost, 9100 (must match the port in the fs.default.name value in hadoop/conf/hadoop-site.xml, i.e. hdfs://localhost:9100)
** user name: hadoop-thiebaut
** SOCKS proxy: (not checked) host, 1080
 +
 
 +
===DFS Locations===
 
* Open DFS Locations
** localhost
  hadoop fs -mkdir In
* remove Out directory if it exists
  hadoop fs -rmr Out
 +
 
 +
==Create a new project with Eclipse==
 +
 
 +
Create a project as explained in http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html
 +
 
 +
===Project===
 +
* Right-click on the blank space in the Project Explorer window and select New -> Project.. to create a new project.
 +
* Select Map/Reduce Project from the list of project types as shown in the image below.
 +
* Press the Next button.
 +
* Project Name: HadoopTest
 +
* Use default location
 +
* click on configure hadoop location, browse, and select /Users/thiebaut/hadoop-0.19.1 (or whatever it is)
 +
* Ok
 +
* Finish
 +
==Map/Reduce driver class==
 +
* Right-click on the newly created Hadoop project in the Project Explorer tab and select New -> Other from the context menu.
 +
* Go to the Map/Reduce folder, select MapReduceDriver, then press the Next button as shown in the image below.
 +
* When the MapReduce Driver wizard appears, enter TestDriver in the Name field and press the Finish button. This will create the skeleton code for the MapReduce Driver.
 +
* Finish
 +
* Unfortunately the Hadoop plug-in for Eclipse is slightly out of step with the recent Hadoop API, so we need to edit the driver code a bit.
 +
 
 +
:Find the following two lines in the source code and comment them out:

    conf.setInputPath(new Path("src"));
    conf.setOutputPath(new Path("out"));

:Enter the following code immediately after the two lines you just commented out (see image below):

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path("In"));
    FileOutputFormat.setOutputPath(conf, new Path("Out"));
 +
 
 +
* After you have changed the code, you will see the new lines marked as incorrect by Eclipse. Click on the error icon for each line and select Eclipse's suggestion to import the missing class.  You need to import the following classes: TextInputFormat, TextOutputFormat, FileInputFormat, FileOutputFormat.
 +
 
 +
* After the missing classes are imported you are ready to run the project.
 +
 
 +
===Running the Project===
 +
 
 +
* Right-click on the TestDriver class in the Project Explorer tab and select Run As --> Run on Hadoop. This will bring up a window like the one shown below.
 +
* Select localhost as hadoop host to run on
 +
* should see something like this:
 +
 
 +
 09/12/15 20:15:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
                               Applications should implement Tool for the same.
 09/12/15 20:15:31 INFO mapred.FileInputFormat: Total input paths to process : 3
 09/12/15 20:15:32 INFO mapred.JobClient: Running job: job_200912152008_0001
 09/12/15 20:15:33 INFO mapred.JobClient:  map 0% reduce 0%
 09/12/15 20:16:05 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000000_0, Status : FAILED
 09/12/15 20:16:19 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000001_0, Status : FAILED
 +
 
 +
=WordCount Example on Eclipse on Mac=
 +
* New project: make sure to select Map/Reduce project.  Call it '''WordCount'''
 +
==Mapper==
 +
* in this project, right click and '''New''', then '''Other''', then '''Mapper''', enter the code below:
 +
<code><pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  // The parameter types must match the type arguments of Mapper<...> above.
  // The untyped map(WritableComparable, Writable, OutputCollector, Reporter)
  // from the Hadoop example does not implement the interface method, which is
  // why this typed version had to be added for Eclipse to be happy.
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Lower-case the line, split it on whitespace, and emit (word, 1) per token.
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line.toLowerCase());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
</pre></code>
 +
==Reducer==
 +
* Similarly, in this project, right click and '''New''', then '''Other''', then '''Reducer''', and enter the code below:
 +
<code><pre>
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

    // Sum the counts emitted for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }

    output.collect(key, new IntWritable(sum));
  }
}
</pre></code>
 +
==Driver==
 +
* Similarly, in this project, right click and '''New''', then '''Other''', then '''Driver''', and enter code below:
 +
<code><pre>
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  public static void main(String[] args) {
    JobConf conf = new JobConf(WordCount.class);

    // specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // specify input and output formats
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // specify input and output dirs
    // make sure In directory exists in the DFS area
    // make sure Out directory does NOT exist in DFS area
    FileInputFormat.setInputPaths(conf, new Path("In"));
    FileOutputFormat.setOutputPath(conf, new Path("Out"));

    // specify a mapper
    conf.setMapperClass(WordCountMapper.class);

    // specify a reducer; it also serves as the combiner
    conf.setReducerClass(WordCountReducer.class);
    conf.setCombinerClass(WordCountReducer.class);

    // runJob is static and builds its own JobClient from conf
    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
</pre></code>
 +
 
 +
==Run WordCount Project==
 +
 
 +
* In explorer, right-click on '''WordCount.java''' and '''Run as''', and pick '''Run on Hadoop'''.  Select '''localhost'''.
 +
* IT WILL TAKE A LONG TIME TO RUN!
 +
* output in console window:
 09/12/15 20:47:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
                 Applications should implement Tool for the same.
 09/12/15 20:47:19 INFO mapred.FileInputFormat: Total input paths to process : 3
 09/12/15 20:47:20 INFO mapred.JobClient: Running job: job_200912152008_0003
 09/12/15 20:47:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/12/15 20:48:53 INFO mapred.JobClient:  map 33% reduce 0%
 09/12/15 20:48:59 INFO mapred.JobClient:  map 66% reduce 0%
 09/12/15 20:49:03 INFO mapred.JobClient:  map 100% reduce 0%
 ...
 09/12/15 20:49:20 INFO mapred.JobClient:    Map input bytes=71
 09/12/15 20:49:20 INFO mapred.JobClient:    Combine input records=16
 09/12/15 20:49:20 INFO mapred.JobClient:    Map output records=16
 09/12/15 20:49:20 INFO mapred.JobClient:    Reduce input records=15

* Refresh the DFS area, find the '''Out''' folder, and check the '''part-00000''' file:
 a 2
 hadoop 2
 i 2
 is 1
 kid 1
 lemon 1
 lemons 1
 like 4
 on 1
 stick 1
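
To see what the mapper and reducer compute without starting Hadoop, the same logic can be run in plain Java. This is only a sketch under stated assumptions: it needs Java 8 or later, the class name and the input lines are hypothetical (the files actually stored in In are not shown in these notes), and it stands in for the job rather than using the Hadoop API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    /**
     * Plain-Java stand-in for the job: the tokenizer loop plays the mapper
     * (lower-case, split on whitespace, emit (word, 1)) and merge() plays the
     * reducer (sum the 1s per word). A TreeMap keeps the keys sorted, as they
     * appear in part-00000.
     */
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, not the author's actual files.
        List<String> lines = Arrays.asList("I like Hadoop", "I like lemons");
        for (Map.Entry<String, Integer> e : countWords(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the two sample lines this counts hadoop once, i twice, lemons once, and like twice.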
  
 
=Notes on doing example in Yahoo Tutorial, Module 2=

     Hello, world!
 +
 +
</onlydft>

Latest revision as of 10:24, 8 December 2016


...