Difference between revisions of "CSC352 Resources"

From dftwiki3
(Literature)
 
(100 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
<br />
 
<br />
  
=Resources: References &amp; Bibliography=
+
=Resources: References &amp; Bibliography for CSC352=
  
 
   
 
   
 
<!--onlysmith-->
 
<!--onlysmith-->
==Parallel Processing/Good background information==
+
==General Knowledge Papers==
* Asanovic K. ''et al'', [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View from Berkeley], Dec. 2006. ([[media:LandscapeParallelProcessingBerkeley1206.pdf|cached copy]])
+
* Von Neumann J., [[Media:vonNewmannEdvac.pdf | First Draft of a Report on the EDVAC]], Moore School of Electrical Engineering, University of Pennsylvania, June 30, 1945.  (Especially interesting are the first 5 pages)
 +
 
 +
* Rob Weir's [[Rob_Weir's_4_Z-Rules | 4Z Method]] for reviewing papers.
 +
 
 +
==Papers, Articles and University Courses on Parallel &amp; Distributed Processing==
 +
* Parallelism in general
 +
** Asanovic K. ''et al'', [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View from Berkeley], Dec. 2006. ([[media:LandscapeParallelProcessingBerkeley1206.pdf|cached copy]])
 +
* Performance Evaluation
 +
** Lei Hu, and I. Gorton, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9079 Performance Evaluation for Parallel Systems: A Survey], University of NSW, Technical Report UNSW-CSE-TR-9707, October 1997 ([[Media:PerformanceEvalutationForParallelSystems.pdf |cached copy]])
 +
** [http://insidehpc.com/2010/01/05/sun-video-tutorial-optimizing-performance-in-parallel-processing/ Sun Video Tutorial]: Optimizing Performance in Parallel Processing, Jan. 2010 (19 minutes)
 +
** [http://www.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf Amdahl's Law in the Multicore Era], Mark Hill and Michael Marty, IEEE Computer, July 2008, and accompanying [http://www.cs.wisc.edu/multifacet/amdahl/ dynamic graph]. ([[Media:ieeecomputer08_amdahl_multicore.pdf |cached copy]])
 
* Xen
 
* Xen
 
** Mauer, R., [http://www.linuxjournal.com/article/8812 Xen Virtualization and Linux Clustering], [http://www.linuxjournal.com Linux Journal] January 12th, 2006
 
** Mauer, R., [http://www.linuxjournal.com/article/8812 Xen Virtualization and Linux Clustering], [http://www.linuxjournal.com Linux Journal] January 12th, 2006
Line 27: Line 37:
 
** Final Thoughts
 
** Final Thoughts
  
* Xgrid
+
* <u>Threading</u>
 +
** D. Tullsen, S. Eggers, and H. M. Levy, [http://www.cs.washington.edu/research/smt/papers/ISCA95.ps Simultaneous Multithreading: Maximizing On-Chip Parallelism], Proc. ISCA, Santa Margherita Ligure, Italy, 1995 ([[Media:simultaneousMultithreading_isca95.pdf |cached copy]])
 +
 +
* <u>Xgrid</u>
 
** Hughes, B., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.7248&rep=rep1&type=pdf Building Computational Grids with Apple's XGrid Middleware],  ''ACM International Conference Proceeding Series'', Vol. 167, Hobart, Tasmania, Australia, 2006. ([[media:buildingComputationalGrids.pdf|cached copy]])
 
** Hughes, B., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.7248&rep=rep1&type=pdf Building Computational Grids with Apple's XGrid Middleware],  ''ACM International Conference Proceeding Series'', Vol. 167, Hobart, Tasmania, Australia, 2006. ([[media:buildingComputationalGrids.pdf|cached copy]])
 
<!--** Kokaly M., et. al., [http://www.cas.mcmaster.ca/~downd/mgst/files/LPAS%20Paper.pdf MGST: A framework for performance evaluation of Desktop Grids], 2009 IEEE International Symposium on Parallel&Distributed Processing, Rome, Italy ([[Media:MGSTFrameworkPerformanceXGrid.pdf|cached copy]])-->
 
<!--** Kokaly M., et. al., [http://www.cas.mcmaster.ca/~downd/mgst/files/LPAS%20Paper.pdf MGST: A framework for performance evaluation of Desktop Grids], 2009 IEEE International Symposium on Parallel&Distributed Processing, Rome, Italy ([[Media:MGSTFrameworkPerformanceXGrid.pdf|cached copy]])-->
 
** Tsouloupas G, and M. Dikaiakos, [http://grid.ucy.ac.cy/reports/TR-04-5.pdf Characterization of Computational Grid Resources Using Low-Level Benchmarks], Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, Netherlands, 2006 ([[Media:CharacterizationComputationalGridBenchmark.pdf|cached copy]])
 
** Tsouloupas G, and M. Dikaiakos, [http://grid.ucy.ac.cy/reports/TR-04-5.pdf Characterization of Computational Grid Resources Using Low-Level Benchmarks], Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, Netherlands, 2006 ([[Media:CharacterizationComputationalGridBenchmark.pdf|cached copy]])
  
==Python Threads==
+
* <u>MapReduce</u>
<greenbox>
+
** Dean, J., and S. Ghemawat, [http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters], Dec. 2004,  ([[media:MapReduce1204.pdf|cached copy]])
 +
** Dan Gillick, Arlo Faria, John DeNero, [http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf  MapReduce: Distributed Computing for Machine Learning],  Berkeley U., 2006 ([[Media:MapReduceDistributedComputingMachineLearning.pdf|Cached Copy]])
 +
** Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, [http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf Evaluating MapReduce for Multi-core and Multiprocessor Systems], Stanford U., 2007 ([[Media:EvaluatingMapReduceforMulti-coreandMultiprocessorSystems.pdf | Cached Copy]]).  (not for class presentation)
 +
** Jeffrey Dean and Sanjay Ghemawat, [http://mags.acm.org/communications/201001/?folio=72&CFID=81054304&CFTOKEN=91057591 MapReduce: A Flexible Data Processing Tool], CACM, Jan. 2010, Vol 53, No. 1 ([[Media:communicaions201001-MapReduceFlexibleDataProcessingTool.pdf |Cached Copy]])
 +
** U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, [http://reports-archive.adm.cs.cmu.edu/anon/anon/home/ftp/ml2008/CMU-ML-08-117.pdf HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop], December 2008, Technical Report CMU-ML-08-117
 +
 
 +
==Videos: Big Data and Analytics==
 +
 +
{|
 +
| <videoflash>dRrkgvr9V_s</videoflash><br />
 +
| A video by LinkedIn's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the Federal government.  DJ shares his observations on how analytics has changed in recent years, especially as Big Data becomes increasingly common.
 +
|-
 +
| <videoflash>acimvXoKwhc</videoflash><br />
 +
|Roger Magoulas, from O'Reilly Radar, discusses "big data" (10 minutes).
 +
|-
 +
| <videoflash>NmiUsdn7qRk</videoflash><br />
 +
|Jeff Veen: [http://infosthetics.com/archives/2009/04/jeff_veen_talk_designing_for_big_data.html Designing for "Big Data"], April 2009.
 +
|}
 +
 
 +
==Documentation on Python Threads==
 +
 
 
[[Image:smilingPython.png| right| 100px]]
 
[[Image:smilingPython.png| right| 100px]]
 
* [http://python.org/ The main Python reference]
 
* [http://python.org/ The main Python reference]
Line 39: Line 72:
 
* [http://linuxgazette.net/107/pai.html Understanding Threading in Python], Krishna G Pai, Linux Gazette, Oct. 2004
 
* [http://linuxgazette.net/107/pai.html Understanding Threading in Python], Krishna G Pai, Linux Gazette, Oct. 2004
 
* [http://www.python.org/doc/2.3.5/lib/thread-objects.html Thread Objects] from [http://www.python.org Python.Org]
 
* [http://www.python.org/doc/2.3.5/lib/thread-objects.html Thread Objects] from [http://www.python.org Python.Org]
</greenbox>
+
* [http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation Multiprocessing with Python] a presentation by Jesse Noller who wrote the PEP 371
 +
* [http://blip.tv/file/2232410 Video Presentation] on the Python GIL (found by Diana)
 +
 
 +
==Documentation on XGrid==
  
==XGrid==
 
<bluebox>
 
 
[[Image:xgridLogo.png | right|100px]]
 
[[Image:xgridLogo.png | right|100px]]
  
* What's an XGrid system?
+
* Introduction: What's an XGrid system?
 
** [http://developer.apple.com/mac/library/documentation/MacOSXServer/Conceptual/Xgrid_Programming_Guide/Overview/Overview.html#//apple_ref/doc/uid/TP40006246-CH2-SW1 XGrid Overview] from Apple
 
** [http://developer.apple.com/mac/library/documentation/MacOSXServer/Conceptual/Xgrid_Programming_Guide/Overview/Overview.html#//apple_ref/doc/uid/TP40006246-CH2-SW1 XGrid Overview] from Apple
** [http://data.scl.utah.edu/fmi/xsl/stream/details.xsl?-recid=104&a::v=2212a4Eaya A Video] presentation of the XGrid (click on movie reel icon to start).
+
** Videos
 +
*** [http://data.scl.utah.edu/fmi/xsl/stream/details.xsl?-recid=104&a::v=2212a4Eaya A Video] presentation of the XGrid (click on movie reel icon to start).
 +
*** A [http://www.youtube.com/watch?v=qhs1AW_el5c YouTube short video] showing the XGrid running the Mandelbrot demo.
 
** A very good overview of the XGrid from [http://www.macdevcenter.com/pub/a/mac/2005/08/23/xgrid.html?page=1 macdevcenter.com]
 
** A very good overview of the XGrid from [http://www.macdevcenter.com/pub/a/mac/2005/08/23/xgrid.html?page=1 macdevcenter.com]
* [http://tango.csc.smith.edu/classwiki/index.php/Xgrid_Programming Programming Examples, Setup, and References]
+
* [[Tutorials | Programming Examples, Setup, and References]] relating to the XGrid system at Smith College.
 +
* [[XGrid Tutorial Part 1: Monte Carlo | Tutorial #1: Monte Carlo]]
  
 
===General References===
 
===General References===
Line 63: Line 100:
 
===Applications===
 
===Applications===
 
* [http://developer.apple.com/documentation/MacOSXServer/Conceptual/Xgrid_Programming_Guide/Introduction/chapter_1_section_1.html XGrid Programming Guide]
 
* [http://developer.apple.com/documentation/MacOSXServer/Conceptual/Xgrid_Programming_Guide/Introduction/chapter_1_section_1.html XGrid Programming Guide]
* [http://tango.csc.smith.edu/classwiki/images/a/a4/XGrid_An_Introduction_to_R.pdf An Introduction to R]
+
* [http://cs.smith.edu/classwiki/images/a/a4/XGrid_An_Introduction_to_R.pdf An Introduction to R]
 
* [http://unu.novajo.ca/simple/archives/000024.html  POVray] on the XGrid
 
* [http://unu.novajo.ca/simple/archives/000024.html  POVray] on the XGrid
 
* [http://cmgm.stanford.edu/~cparnot/xgrid-stanford/index.html Stanford] Xgrid: One of the largest XGrid systems around.
 
* [http://cmgm.stanford.edu/~cparnot/xgrid-stanford/index.html Stanford] Xgrid: One of the largest XGrid systems around.
Line 69: Line 106:
 
* [http://reference.wolfram.com/mathematica/guide/StandaloneMathematicaKernels.html Using the Mathematica Kernel].
 
* [http://reference.wolfram.com/mathematica/guide/StandaloneMathematicaKernels.html Using the Mathematica Kernel].
  
</bluebox>
+
==Documentation on Cloud Computing, Map-Reduce, &amp; Hadoop==
 
 
==Cloud Computing==
 
 
<blockquote>"''Failure is the defining difference between distributed and local programming''" <br>
 
<blockquote>"''Failure is the defining difference between distributed and local programming''" <br>
 
Ken Arnold, CORBA Designer
 
Ken Arnold, CORBA Designer
 
</blockquote>
 
</blockquote>
<tanbox>
 
 
__NOTOC__
 
__NOTOC__
 
===Literature===
 
===Literature===
* [[Image:hadoopOReilly.jpg | right |100px]] [http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979  Hadoop, the definitive guide], Tim White, O'Reilly Media, June 2009, ISBN 0596521979.  The Web site for the book is http://www.hadoopbook.com/ (with the data used as examples in the book)
+
* [[Media:ApacheChapterOnStreaming.pdf | Apache's chapter on Hadoop Streaming]], Apache.org.
 +
* [http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/ How to Benchmark a Hadoop Cluster], by Tom White, [http://answers.oreilly.com O'Reilly Answers], Oct. 2009.
 +
* [[Image:hadoopOReilly.jpg | right |100px]] [http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979  Hadoop, the definitive guide], Tom White, O'Reilly Media, June 2009, ISBN 0596521979.  The Web site for the book is http://www.hadoopbook.com/ (with the data used as examples in the book)
 +
* Dan Sullivan [http://nexus.realtimepublishers.com/dgcc.php The Definitive Guide to Cloud Computing], IBM, 2010, ''in production'' (but can be downloaded in parts).
 
* Dean, J., and S. Ghemawat, [http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters], Dec. 2004,  ([[media:MapReduce1204.pdf|cached copy]])
 
* Dean, J., and S. Ghemawat, [http://labs.google.com/papers/mapreduce-osdi04.pdf MapReduce: Simplified Data Processing on Large Clusters], Dec. 2004,  ([[media:MapReduce1204.pdf|cached copy]])
*  Czajkowski G., [http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html  Sorting 1 PB with MapReduce], Nov. 2008, ([[media:Sorting1PBWithMapReduce.pdf|cached copy]])
+
*  Czajkowski G., [http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html  Sorting 1 PB with MapReduce], Nov. 2008, ([[media:Sorting1PBWithMapReduce.pdf|cached copy]]) (1 page only).
 
* Armbrust M, ''et al'', [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf Above the Clouds: A Berkeley View of Cloud Computing], Tech Rep. CB/EECS-2009-28, Feb. 2009 ([[media:AboveTheCloudsBerkeley.pdf|cached copy]])
 
* Armbrust M, ''et al'', [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf Above the Clouds: A Berkeley View of Cloud Computing], Tech Rep. CB/EECS-2009-28, Feb. 2009 ([[media:AboveTheCloudsBerkeley.pdf|cached copy]])
 
* Olston C. ''et al.'', [[Media:pigLatinNotSoForeignLanguage.pdf |Pig Latin: A Not-So-Foreign Language for Data Processing]], SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
 
* Olston C. ''et al.'', [[Media:pigLatinNotSoForeignLanguage.pdf |Pig Latin: A Not-So-Foreign Language for Data Processing]], SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
Line 88: Line 125:
 
** [http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part3_gannon_reed.pdf Parallelism and the Cloud], by Gannon and Reed
 
** [http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part3_gannon_reed.pdf Parallelism and the Cloud], by Gannon and Reed
 
** [http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part3_hansen_johnson.pdf Visualization and Data-Intensive Science] by Hansen, Johnson, Pascucci, and Silva.
 
** [http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part3_hansen_johnson.pdf Visualization and Data-Intensive Science] by Hansen, Johnson, Pascucci, and Silva.
 +
* Talbot D., [http://www.technologyreview.in/computing/24284/page1/ Security in the Ether], Technology Review, Jan/Feb 2010. ([[CSC352 Security In the Ether |cached copy]])
 +
* HadoopWiki, [http://wiki.apache.org/hadoop/HowManyMapsAndReduces Partitioning your job into Maps and Reduces], 2009.
 +
* U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, [http://reports-archive.adm.cs.cmu.edu/anon/anon/home/ftp/ml2008/CMU-ML-08-117.pdf HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop], December 2008, Technical Report CMU-ML-08-117
 +
* Matthews, S., & Williams, T., [http://www.biomedcentral.com/1471-2105/11/S1/S15 MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees], BMC Bioinformatics, 11 (Suppl 1), 2010. <font color=magenta>(The authors show that speedups of close to 18 on 32 cores can be reached when processing 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each.)</font>
 +
* Chris K Wensel, [http://www.manamplified.org/archives/2008/11/hadoop-is-about-scalability.html Hadoop Is About Scalability, Not Performance], www.manamplified.org, November 12, 2008.
 +
* Pavlo, Paulson, Rasin, Abadi, DeWitt, Madden, and Stonebraker, [[Media:ComparisonOfApproachesToLargeScaleDataAnalysis.pdf |A Comparison of Approaches to Large-Scale Data Analysis]], SIGMOD '09, June 2009.
 +
 +
* [[Image:mapReduceTaskTimeLine.png|right|150px]]<u>TimeLine Graphs and Performance</u>
 +
**  Owen O'Malley and Arun Murthy, [http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds], http://developer.yahoo.net, May 2009.
 +
** Joseph Gebis, [http://blogs.sun.com/jgebis/entry/understanding_hadoop_task_timelines Understanding Hadoop Task Timelines], http://blogs.sun.com, June 2009. (<font color="magenta">A good description of the ''Task Timelines'' used to quantify Hadoop performance</font>)
 +
** Joseph Gebis, [http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_monitoring_scripts Hadoop Resource Utilization Monitoring -- scripts], http://blogs.sun.com, June 2009.
 +
** Joseph Gebis, [http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_and_performance Hadoop resource utilization and performance analysis], http://blogs.sun.com, June 2009.
 +
** Elias Torres, [http://hadoop-timelines.appspot.com/ Hadoop TimeLines], http://hadoop-timelines.appspot.com, c. 2009.
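The MrsRF entry in the list above reports a speedup of close to 18 on 32 cores. As a rough illustration of what that figure implies, the sketch below computes the parallel efficiency and, assuming the simple Amdahl's Law model (an assumption made here, not a claim of the paper), the serial fraction that would produce such a speedup. The function names are hypothetical and chosen only for this example.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / cores


def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's Law: S = 1 / (f + (1 - f) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def amdahl_serial_fraction(speedup, cores):
    """Invert Amdahl's Law: f = (N / S - 1) / (N - 1).

    Assumes the program fits the Amdahl model, which is a simplification.
    """
    return (cores / speedup - 1.0) / (cores - 1.0)


if __name__ == "__main__":
    # Figures quoted in the MrsRF entry: speedup of about 18 on 32 cores.
    speedup, cores = 18.0, 32
    print("efficiency:", parallel_efficiency(speedup, cores))
    print("implied serial fraction:",
          round(amdahl_serial_fraction(speedup, cores), 4))
```

Under these assumptions, a speedup of 18 on 32 cores corresponds to roughly 56% efficiency, and to a serial fraction of about 2.5% if Amdahl's Law were the only limiting factor. Real MapReduce jobs also pay communication and scheduling costs that this model ignores.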
 +
 +
===Collections of Hadoop Papers and/or Algorithms===
 +
 +
* [http://atbrox.com/2010/02/12/mapreduce-hadoop-algorithms-in-academic-papers-updated/ atbrox.com]
 +
* [http://wiki.apache.org/hadoop/Papers wiki.apache.org/hadoop/Papers]
 +
* [http://portal.acm.org/citation.cfm?id=1739041.1739056 portal.acm.org]
 +
 +
===Presentations===
 +
 +
* Milind Bhandarkar, [[Media:DataIntensiveComputingWithHadoopAtYahoo.pdf | Data-Intensive Computing with Hadoop]], Yahoo, Inc., Sept. 2008.
 +
 +
===Tutorials===
 +
* Tom White, [http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 Running Hadoop MapReduce on Amazon EC2 and S3], Amazon Web Services Articles and Tutorials, 2007.
 +
* Robert Sosinski, [http://www.robertsosinski.com/2008/01/26/starting-amazon-ec2-with-mac-os-x/ Starting Amazon EC2 with Mac OS X], www.robertsosinski.com, 2008.
 +
* [http://developer.amazonwebservices.com/connect/entry.jspa?externalID=848&categoryID=135 Introduction to Java for AWS developers], Amazon Web Services, 2007.
 +
* Aaron Jacob (Evri.com), [http://www.facebook.com/note.php?note_id=79212337002&ref=mf Using Cloudera's Hadoop AMIs to process EBS datasets on EC2], www.facebook.com, 2009. ([[Media:usingClouderaHadoopAMIsToProcessEBSDataSetsOnEC2.pdf| cached copy]])
 +
* Yahoo Developer Network: Module 4: MapReduce Basics http://developer.yahoo.com/hadoop/tutorial/module4.html , a must-read!
 +
* Python and streaming, a [http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/ tutorial] by [http://atbrox.com atbrox.com].
 +
 +
===Installation Tutorials===
 +
 +
* Jochen Leidner and Gary Berosik, [http://arxiv4.library.cornell.edu/pdf/0911.5438v1 Building and Installing a Hadoop/MapReduce Cluster from Commodity Components], [http://arxiv4.library.cornell.edu/pdf/0911.5438v1 library.cornell.edu], 2009. ([[Media:HadoopInstallationOnUbuntuLeidnerBerosik.pdf|cached copy]])
 +
 
===Media Reports===
 
===Media Reports===
 
* Markoff, J., [[media:DelugeOfDataShapesNewEraInComputing.pdf | A Deluge of Data Shapes a New Era in Computing]], ''New York Times'', 12/15/09
 
* Markoff, J., [[media:DelugeOfDataShapesNewEraInComputing.pdf | A Deluge of Data Shapes a New Era in Computing]], ''New York Times'', 12/15/09
 +
 +
===News Feed===
 +
* [http://cloud-computing.alltop.com/ cloud-computing.alltop.com]: aggregated news about the cloud
  
 
===Class Material on the Web===
 
===Class Material on the Web===
Line 107: Line 183:
 
===Software/Web Links===
 
===Software/Web Links===
 
[[Image:HadoopCartoon.png | 100px | right]]
 
[[Image:HadoopCartoon.png | 100px | right]]
 +
*[http://www.hadoopstudio.org/docs/tutorials/nb-tutorial-jobdev-streaming.html Karmasphere Studio] for Hadoop. An interesting IDE worth looking into...
 
*[http://hadoop.apache.org/common/ Apache's Documentation on Hadoop Common]
 
*[http://hadoop.apache.org/common/ Apache's Documentation on Hadoop Common]
 
**[http://hadoop.apache.org/common/docs/current/mapred_tutorial.html The Hadoop Tutorial] from Apache.  A "Must-Do!"
 
**[http://hadoop.apache.org/common/docs/current/mapred_tutorial.html The Hadoop Tutorial] from Apache.  A "Must-Do!"
Line 125: Line 202:
 
*[http://www.cloudera.com/blog/2009/04/20/configuring-eclipse-for-hadoop-development-a-screencast/ Configuring Eclipse for Hadoop] A video from Cloudera on setting up Hadoop... not easy to follow...
 
*[http://www.cloudera.com/blog/2009/04/20/configuring-eclipse-for-hadoop-development-a-screencast/ Configuring Eclipse for Hadoop] A video from Cloudera on setting up Hadoop... not easy to follow...
 
* [https://trac.declarativity.net/browser/hadoop-0.19.1-bfs/src/examples/org/apache/hadoop/examples The source code for the examples] that come with the Hadoop 0.19.1 distribution.  Includes WordCount, WordCountAggregate, WordCountHistogram, PiEstimator, Join, and Grep, among others.
 
* [https://trac.declarativity.net/browser/hadoop-0.19.1-bfs/src/examples/org/apache/hadoop/examples The source code for the examples] that come with the Hadoop 0.19.1 distribution.  Includes WordCount, WordCountAggregate, WordCountHistogram, PiEstimator, Join, and Grep, among others.
   
+
* [http://github.com/datawrangling/spatialanalytics Spatial Analysis of Twitter Data with Hadoop, Pig, & Mechanical Turk], [http://github.com github.com], March 2010.
 +
 
 +
* <u>Generating Hadoop TimeLines</u>
 +
** [http://people.apache.org/~omalley/tera-2009/job_history_summary.py Python script] from apache.org to generate the time line ([[CSC352 ApacheHadoopJobHistorySummary.py | Apache's script to generate Hadoop Timeline ]]).
 +
 
 
===Videos===
 
===Videos===
 
* [http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html Google]'s series of 4 lectures on map-reduce, distributed file-system, and clustering algorithms.
 
* [http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html Google]'s series of 4 lectures on map-reduce, distributed file-system, and clustering algorithms.
 +
* [http://www.youtube.com/watch?v=mVXpvsdeuKU Berkeley] lecture on Map-Reduce (CS 61A Lecture 34)
 
* [http://jez.blip.tv/file/245701/ A video of Tom White], author of O'Reilly's Hadoop guide, on BlipTV. White outlines the suite of projects centered around Hadoop (an open source MapReduce project).
 
* [http://jez.blip.tv/file/245701/ A video of Tom White], author of O'Reilly's Hadoop guide, on BlipTV. White outlines the suite of projects centered around Hadoop (an open source MapReduce project).
 
* [http://www.cloudera.com/hadoop-training-basic Cloudera]'s collection of videos.   
 
* [http://www.cloudera.com/hadoop-training-basic Cloudera]'s collection of videos.   
Line 150: Line 232:
 
** Part V: Application, hands on
 
** Part V: Application, hands on
 
** Uses Amazon as test platform.
 
** Uses Amazon as test platform.
</tanbox>
+
====Visualizations====
 +
* Visualizations of Hadoop Data Transfers, from the U. of Nebraska ([http://www.google.com/search?q=university+of+Nebraska+hadoop+visualization&hl=en&safe=off&tbs=vid:1&tbo=u&ei=oKO4S6GMCoH7lwfq88SXCg&sa=X&oi=video_result_group&ct=title&resnum=1&ved=0CBEQqwQwAA more videos])
 +
 
 +
<br /><br /><center><videoflash>qoBoEzOkeDQ</videoflash></center><br /><br />
 +
 
 +
* Monitoring a Cluster of Computers as a school of fish (U. Nebraska). 
 +
::In this video, researchers at the U. of Nebraska use fish swimming in a tank to display what is going on in a cluster of many computers working together on one large computation.  Each fish (as far as we can tell, given the lack of better information) represents a computer or a program running on a computer.  As the user zooms in on a fish, a blue window pops up with vital information about that system's health.  Fish change color and size to indicate a change in status.  One could imagine that green fish represent computers not doing much work, while orange fish represent computers loaded with work.  It is interesting to see researchers use a school of fish to show the state of a cluster, relying on human beings' ability to recognize visual cues quickly and accurately.  This is certainly better than having the same human beings read tons of log files containing the date and time of many different events occurring in the cluster.
 +
<br /><br /><center><videoflash>LM1j_8sWSEk</videoflash></center><br /><br />
 +
 
 +
* The evolution of Hadoop (Code-Swarm)
 +
<br /><br /><center><videoflash type="vimeo">2513321</videoflash></center><br /><br />
 +
<br />
 +
 
  
 
[[CSC352_Notes | <font color="white">Notes</font>]]
 
[[CSC352_Notes | <font color="white">Notes</font>]]
 +
===Cloud Cluster @ Smith===
 +
* [http://cs.smith.edu/classwiki/index.php/CSC352_Hadoop_Cluster_Howto#Workstation_Setup Cloud Cluster Setup]
 +
* [http://maven.smith.edu/~thiebaut/showhadoopip.php Smith Hadoop Cluster IPs]
 
<!--/onlysmith-->
 
<!--/onlysmith-->
  
Line 167: Line 264:
 
<br />
 
<br />
 
<br />
 
<br />
<center><font size=-2>(c) D. Thiebaut 2009, Dept. Computer Science, Smith College.</font></center>
+
<br />
 +
<br />
 +
[[CSC352 DT's Class Notes|<font color="white">class notes</font>]]
 +
[[Category:CSC352]][[Category:Class]][[Category:Resources]]

Latest revision as of 13:16, 31 July 2010




Resources: References & Bibliography for CSC352

General Knowledge Papers

Papers, Articles and University Courses on Parallel & Distributed Processing

Videos: Big Data and Analytics


A video by LinkedIn's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the Federal government. DJ shares his observations on how analytics has changed in recent years, especially as Big Data becomes increasingly common.

Roger Magoulas, from O'Reilly Radar, discusses "big data" (10 minutes).

Jeff Veen: Designing for "Big Data", April 2009.

Documentation on Python Threads


Documentation on XGrid


General References

Applications

Documentation on Cloud Computing, Map-Reduce, & Hadoop

"Failure is the defining difference between distributed and local programming"

Ken Arnold, CORBA Designer

Literature

Collections of Hadoop Papers and/or Algorithms

Presentations

Tutorials

Installation Tutorials

Media Reports

News Feed

Class Material on the Web

Software/Web Links

The IBM MapReduce Tools for Eclipse Plug-in is a robust plug-in that brings Hadoop support to the Eclipse platform. Features include server configuration, support for launching MapReduce jobs and browsing the distributed file system. This setup assumes that you are running Eclipse (version 3.3 or above) on your computer.

Videos

Visualizations

  • Visualizations of Hadoop Data Transfers, from the U. of Nebraska (more videos)




  • Monitoring a Cluster of Computers as a school of fish (U. Nebraska).
In this video, researchers at the U. of Nebraska use fish swimming in a tank to display what is going on in a cluster of many computers working together on one large computation. Each fish (as far as we can tell, given the lack of better information) represents a computer or a program running on a computer. As the user zooms in on a fish, a blue window pops up with vital information about that system's health. Fish change color and size to indicate a change in status. One could imagine that green fish represent computers not doing much work, while orange fish represent computers loaded with work. It is interesting to see researchers use a school of fish to show the state of a cluster, relying on human beings' ability to recognize visual cues quickly and accurately. This is certainly better than having the same human beings read tons of log files containing the date and time of many different events occurring in the cluster.




  • The evolution of Hadoop (Code-Swarm)






Notes

Cloud Cluster @ Smith















class notes