CSC352 Notes 2013
<onlydft>
<br />
<center>
<font color="red">
'''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes'''
</font>
</center>
<br />
__TOC__
<br />
=Resources 2013=

==Rocco's Presentation 10/10/13==

* libguides.smith.edu/content.php?pid=510405
* Ideas for the paper:
** start by following the thread: collage, packing, parallel image processing.
** approaches.
** intro: what has been done in the field.
* Citation database: Web of Science.
* RefWorks and Zotero can help maintain citations.
* 5-College catalogs.
* WorldCat is the world catalog for books.
* Web of Science: can get information on references, and also on who or which institutions are publishing in a given area.
* Discover searches other databases.
* Library Guide (Albany): a super guide for libraries.
* [http://VideoLectures.net videolectures.net]

<br />

==Hadoop==
<br />
* [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]]
* [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old)

==On-Line==

* [https://computing.llnl.gov/tutorials/parallel_comp/ Introduction to Parallel Processing]
* [[Media:RITParallelProgrammingWorkshop.pdf | RIT Parallel Programming Workshop]]

==Papers==
* [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, by Armbrust, Michael; Fox, Armando; Griffith, Rean; Joseph, Anthony D.; Katz, Randy; Konwinski, Andy; Lee, Gunho; Patterson, David; Rabkin, Ariel; Stoica, Ion; and Zaharia, Matei.
* [[Media:NIST_Definition_Cloud_Computing_2010.pdf | The NIST Definition of Cloud Computing (Draft)]] (very short paper)
* [[Media:NobodyGotFiredUsingHadoopOnCluster_2012.pdf| Nobody ever got fired for using Hadoop on a cluster]], by Rowstron, Antony; Narayanan, Dushyanth; Donnelly, Austin; O'Shea, Greg; and Douglas, Andrew.
* [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View from Berkeley], 2006, still good! (very long paper)
* [[Media:UpdateOnaViewFromBerkeley2010.pdf | Update on a View from Berkeley]], 2010. (short paper)
* [[Media:GeneralPurposeVsGPU_Comparison_Many_Cores_2010_Caragea.pdf |General-Purpose vs. GPU: Comparisons of Many-Cores on Irregular Workloads]], 2010.
* [[Media:ParallelCOmputingWithPatternsAndFrameworks2010b.pdf | Parallel Computing with Patterns and Frameworks]], 2010, ''XRDS''.
* [[Media:ServerVirtualizationArchitectureAndImplementation2009.pdf | Server Virtualization Architecture and Implementation]], ''XRDS'', 2009.
* [[Media:XGridHadoopCloser2011.pdf | Processing Wikipedia Dumps: A Case-Study Comparing the XGrid and MapReduce Approaches]], D. Thiebaut, Yang Li, Diana Jaunzeikare, Alexandra Cheng, Ellysha Raelen Recto, Gillian Riggs, Xia Ting Zhao, Tonje Stolpestad, and Cam Le T Nguyen, ''in Proceedings of the 1st Int'l Conf. on Cloud Computing and Services Science'' (CLOSER 2011), Noordwijkerhout, NL, May 2011. ([[Media:XGridHadoopFeb2011.pdf |longer version]])
* [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper)
* [[Media:UnderstandingThroughputOrientedArchitectures2010.pdf | Understanding Throughput-Oriented Architectures]], CACM, 2010.
* [[Media:LearningFromTheSuccessOfMPI2002_WilliamGropp.pdf | Learning from the Success of MPI]], by William D. Gropp, Argonne National Lab, 2002.

==Art==
* Maggie Lind's [[Media:MaggieLindProposalCSC352.pdf | MaggieLindProposalCSC352.pdf]]
* Fraser?
* Chester?
<br />

==Some good references==
* Sounds of Wikipedia: http://listen.hatnote.com/#nowelcomes,en
* Exhibition at Somerset House
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
<br />
{|
|
[http://lens.blogs.nytimes.com/2010/03/23/behind-38/?_r=0 Bill Cunningham] of the New York Times.
|
[[Image:BillCunningham.jpg|150px|right]]
|-
|
[http://infosthetics.com/archives/2013/07/phototrails_the_visual_structure_of_millions_of_user-generated_photos.html Visual structures of millions of user-generated photos]
|
[[Image:milionsUserGeneratedPhotos.jpg|right|150px]]
|-
|
[[Image:digitalsignagecollection.png|150px]]
|
[http://www.digitalsignageconnection.com/art-museum-creates-interactive-visitor-experience-christie-microtiles-video-walls-959 Cleveland Museum of Art's Collection Wall allows up to 16 people to interact simultaneously with the wall using RFID tags on iPad stations.]
|}
<br />
<br />
* [http://computinged.wordpress.com/2012/11/21/cs2013-ironman-draft-available/ Ironman ACM/IEEE Curriculum] stipulates that parallel and distributed computing must be an integral part of the CS curriculum ([http://ai.stanford.edu/users/sahami/CS2013//ironman-draft/cs2013-ironman-v0.8.pdf link to the pdf report]). Some people (e.g., Danner & Newhall at Swarthmore) go even further and suggest it should be incorporated at all levels of the curriculum.

=Misc. Topics=
* Latex
* writing papers
* reading ==> newsletter
* presentations
* museum visit
* parallel programming
** MPI
** Java threads
** concurrency issues
** where's the data? Where are the processors?
* Projects
** MPI
** GPU
* Look at recent conferences. What are the trends? [http://conference.researchbib.com/?action=viewEventDetails&eventid=26507&uid=raf013 APPT 2013 - 2013 International Conference on Advanced Parallel Processing Technology]

=XSEDE.ORG=
* registered 8/8/13: thiebaut/ToMoKo2#
* https://portal.xsede.org/
* [http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf My notes from the Berkeley Boot-Camp on-line Workshop], Aug 2013.

=Update 2015: Downloading images to Hadoop0=
<br />
* Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx, where xxx is 0, 1, 2, ..., 9, a, b, c, d, e, f.
<br />
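The shard-by-shard fetch described above can be sketched as a loop over the 16 hex directories. This is a dry run only: the `echo` prevents any transfer, the destination directory is a made-up example, and the exact rsync module path on the your.org mirror is an assumption to check against the mirror's documentation.

```shell
#!/bin/sh
# Dry-run sketch: one rsync command per shard directory 0-9, a-f.
# DEST is a hypothetical local target; remove "echo" to actually transfer.
DEST=/media/hadoop0/enwiki-images
for shard in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
    echo rsync -av \
        "rsync://ftpmirror.your.org/wikimedia/images/wikipedia/en/$shard/" \
        "$DEST/$shard/"
done > rsync-commands.txt
cat rsync-commands.txt     # 16 commands, one per shard
```

Writing the commands to a file first makes it easy to review them before running any for real.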
=Downloading All Wikipedia Images=
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
:::''Where are images and uploaded files<br /><br />Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors.<br /><br />Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal)''

* [http://wikimedia.wansec.com/other/pagecounts-raw/ Page View Statistics for Wikimedia projects] at wikimedia.wansec.com/other/pagecounts-raw/
* The main information about the dumps and their format is here: [https://wikitech.wikimedia.org/wiki/Dumps/media https://wikitech.wikimedia.org/wiki/Dumps/media]
:::''Tarballs are generated on a server provided by Your.org and made available from that mirror. The rsynced copy of the media itself and an rsynced copy of the above files (image/imagelinks/redirs info) is used as input to createmediatarballs.py to create two series of tarballs per wiki, one containing all locally uploaded media and the other containing all media uploaded to commons and used on the wiki.<br />One series of tarballs (with names looking like, e.g., enwiki-20120430-remote-media-1.tar, enwiki-20120430-remote-media-2.tar, and so on for remote media, and enwiki-20120430-local-media-1.tar, enwiki-20120430-local-media-2.tar and so on for local media), should contain all media for a given project. We bundle up the media into tarballs of 100k files per tarball for convenience of the downloader.<br />''

** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
** The size of all the media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps. Total = 2.3 TB.
enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-3.tar 25.6 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-4.tar 21.5 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-5.tar 20.7 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-6.tar 22.4 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-7.tar 18.2 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-8.tar 24.4 GB 12/6/12 12:00:00 AM
enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-10.tar 90.5 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-11.tar 88.2 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-12.tar 88.4 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-13.tar 89.6 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-14.tar 88.6 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-15.tar 91.2 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-16.tar 91.3 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-17.tar 89.4 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-18.tar 90.0 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-19.tar 90.0 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-2.tar 90.5 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-20.tar 90.1 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-21.tar 91.2 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-22.tar 89.3 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-23.tar 91.0 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-3.tar 88.6 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-4.tar 90.0 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-5.tar 90.9 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-6.tar 88.3 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-7.tar 89.6 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-8.tar 90.4 GB 12/7/12 12:00:00 AM
enwiki-20121201-remote-media-9.tar 89.7 GB 12/7/12 12:00:00 AM
* To get them, store the list above in a text file ('''listOfTarArchives.txt''') and use wget. Note that the fields are space-separated, so awk is used to pick off the file name:

 for i in `awk '{print $1}' listOfTarArchives.txt | grep -v bz2`; do
       echo $i
       wget ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i
 done

* Total size should be 2.310 TB.
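After the downloads finish, the total can be sanity-checked against the 2.310 TB figure by summing the archive sizes. A sketch: the two stand-in files are created with dd only so the example runs anywhere; on the real download host you would run just the last line in the download directory.

```shell
#!/bin/sh
# Create small stand-in "archives" (1 MiB and 2 MiB) for illustration.
mkdir -p demo-tarballs
dd if=/dev/zero of=demo-tarballs/enwiki-20121201-local-media-9.tar  bs=1024 count=1024 2>/dev/null
dd if=/dev/zero of=demo-tarballs/enwiki-20121201-remote-media-1.tar bs=1024 count=2048 2>/dev/null
# Field 5 of `ls -l` is the file size in bytes; sum it over all .tar files.
ls -l demo-tarballs/*.tar | awk '{ total += $5 } END { print total }'   # prints 3145728
```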

=Download the page statistics=

==Links of Interest==
* http://stats.grok.se/
* http://stats.grok.se/about
* http://dom.as/
* http://dumps.wikimedia.org/other/pagecounts-raw/
* http://dumps.wikimedia.org/other/pagecounts-raw/2013/
* started downloading all files from the above link to hadoop0:/media/dominique/3TB/mediawiki/statistics/
* wgetStats.sh:
 #! /bin/bash
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
 ...
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000
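Once unzipped, each line of a pagecounts-* file has four space-separated fields: project code, page title, hourly view count, and bytes transferred. A small sketch of pulling numbers out of such a file; the sample lines below are made up for illustration, not real dump data.

```shell
#!/bin/sh
# Fabricated sample in the pagecounts format:
#   <project> <page_title> <hourly_view_count> <bytes_transferred>
cat > sample_pagecounts.txt <<'EOF'
en Main_Page 1200 99999
en Hadoop 35 4567
fr Paris 80 1234
EOF
# Sum the hourly views for the English Wikipedia ("en") project:
awk '$1 == "en" { total += $3 } END { print total }' sample_pagecounts.txt   # prints 1235
```

The same one-liner, pointed at a real unzipped pagecounts file, gives the per-project hourly totals that the projectcounts files summarize.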

----

=Resources 2010=

=Map-Reduce/Hadoop=

==Options for Setup==
===Xen Live CD===

* livecd-xen-3.2-0.8.2-amd64.iso '''works'''
* must have a digital cable for the video monitor
* 2 servers and 2 clients
* not sure how one can use it besides the examples presented

===Setting up Hadoop using VMware===

* Check that Google has a copy of it for download: just a single VM for Hadoop.
* Berkeley has a lab that uses this copy: see http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html

==Setting Up Hadoop and Eclipse on the Mac==

===Install Hadoop===

No big deal: just install hadoop-0.19.1.tgz, and set a symbolic link '''hadoop''' pointing to the directory holding hadoop-0.19.1.
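That symlink step can be sketched as follows. The scratch directory is used only so the sketch is safe to run anywhere; a real install would first unpack the tarball (e.g. tar xzf hadoop-0.19.1.tgz) into the install location.

```shell
#!/bin/sh
# Sketch: point a version-free "hadoop" symlink at the versioned directory,
# so scripts and PATH entries survive a later Hadoop upgrade.
mkdir -p scratch/hadoop-0.19.1
ln -sfn hadoop-0.19.1 scratch/hadoop   # "hadoop" now resolves to hadoop-0.19.1
ls -ld scratch/hadoop
```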

===Verify configuration of Hadoop===

  cd 

</configuration>
</pre></code>

==Setting up Eclipse for Hadoop==

* requires Java 1.6
* http://v-lad.org/Tutorials/Hadoop/03%20-%20Prerequistes.html
** [[media:hadoopWithEclipse1.pdf | page1.pdf]]
** [[media:hadoopWithEclipse2.pdf | page2.pdf]]
** [[media:hadoopWithEclipse3.pdf | page3.pdf]]
** [[media:hadoopWithEclipse4.pdf | page4.pdf]]
* download Eclipse 3.3.2 (Europa) from http://www.eclipse.org/downloads/packages/release/europa/winter
* use Hadoop 0.19.1
* open Eclipse and deploy (Mac)
* uncompress hadoop-0.19.1
* copy the Eclipse plugin from the Hadoop distribution to the plugins directory of Eclipse
* start Hadoop on the Mac and follow the directions from the http://v-lad.org/Tutorials/Hadoop page:
   start-all.sh

===Map-Reduce Locations===

* setup Eclipse:
:  http://v-lad.org/Tutorials/Hadoop/17%20-%20set%20up%20hadoop%20location%20in%20the%20eclipse.html
** SOCKS proxy: (not checked) host, 1080

===DFS Locations===

* Open DFS Locations
** localhost

  hadoop fs -mkdir In

* remove Out directory if it exists:

 hadoop fs -rmr Out

==Create a new project with Eclipse==

Create a project as explained in http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html

===Project===

* Right-click on the blank space in the Project Explorer window and select New -> Project... to create a new project.
* Select Map/Reduce Project from the list of project types as shown in the image below.
* After the missing classes are imported, you are ready to run the project.

===Running the Project===

* Right-click on the TestDriver class in the Project Explorer tab and select Run As --> Run on Hadoop. This will bring up a window like the one shown below.
* You should see something like this:

  09/12/15 20:15:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
                              Applications should implement Tool for the same.
  09/12/15 20:15:31 INFO mapred.FileInputFormat: Total input paths to process : 3
  09/12/15 20:15:32 INFO mapred.JobClient: Running job: job_200912152008_0001
  09/12/15 20:16:05 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000000_0, Status : FAILED
  09/12/15 20:16:19 INFO mapred.JobClient: Task Id : attempt_200912152008_0001_m_000001_0, Status : FAILED

=WordCount Example on Eclipse on Mac=
* New project: make sure to select Map/Reduce project. Call it '''WordCount'''.

==Mapper==
* In this project, right-click and select '''New''', then '''Other''', then '''Mapper''', and enter the code below:
<code><pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
      OutputCollector output, Reporter reporter) throws IOException {

    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line.toLowerCase());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }

  // found myself having to add this for Eclipse to be happy...
  // it matches the definition of the map() function better than what the
  // hadoop example does...  Oh well...
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line.toLowerCase());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
</pre></code>
==Reducer==
* Similarly, in this project, right-click and select '''New''', then '''Other''', then '''Reducer''', and enter the code below:
<code><pre>
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {

    int sum = 0;
    while (values.hasNext()) {
      IntWritable value = (IntWritable) values.next();
      sum += value.get(); // accumulate the counts for this word
    }

    output.collect(key, new IntWritable(sum));
  }
}
</pre></code>
==Driver==
* Similarly, in this project, right-click and select '''New''', then '''Other''', then '''Driver''', and enter the code below:
<code><pre>
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(WordCount.class);

    // specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // specify input and output dirs
    //FileInputPath.addInputPath(conf, new Path("input"));
    //FileOutputPath.addOutputPath(conf, new Path("output"));
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // make sure the In directory exists in the DFS area
    // make sure the Out directory does NOT exist in the DFS area
    FileInputFormat.setInputPaths(conf, new Path("In"));
    FileOutputFormat.setOutputPath(conf, new Path("Out"));

    // specify a mapper
    conf.setMapperClass(WordCountMapper.class);

    // specify a reducer; summing is safe to run as a combiner too
    conf.setReducerClass(WordCountReducer.class);
    conf.setCombinerClass(WordCountReducer.class);

    client.setConf(conf);
    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
</pre></code>

==Run WordCount Project==

* In the Project Explorer, right-click on '''WordCount.java''', select '''Run as''', and pick '''Run on Hadoop'''. Select '''localhost'''.
* It will take a long time to run!
* Output in the console window:
 09/12/15 20:47:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
                 Applications should implement Tool for the same.
 09/12/15 20:47:19 INFO mapred.FileInputFormat: Total input paths to process : 3
 09/12/15 20:47:20 INFO mapred.JobClient: Running job: job_200912152008_0003
 09/12/15 20:47:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/12/15 20:48:53 INFO mapred.JobClient:  map 33% reduce 0%
 09/12/15 20:48:59 INFO mapred.JobClient:  map 66% reduce 0%
 09/12/15 20:49:03 INFO mapred.JobClient:  map 100% reduce 0%
 ...
 09/12/15 20:49:20 INFO mapred.JobClient:    Map input bytes=71
 09/12/15 20:49:20 INFO mapred.JobClient:    Combine input records=16
 09/12/15 20:49:20 INFO mapred.JobClient:    Map output records=16
 09/12/15 20:49:20 INFO mapred.JobClient:    Reduce input records=15

* Refresh the DFS area, find the '''Out''' folder, and check the '''part-00000''' file:
 a 2
 hadoop 2
 i 2
 is 1
 kid 1
 lemon 1
 lemons 1
 like 4
 on 1
 stick 1

=Notes on doing example in Yahoo Tutorial, Module 2=

     Hello, world!

</onlydft>

Latest revision as of 10:24, 8 December 2016


...