Difference between revisions of "CSC352 Resources"
Revision as of 17:29, 13 April 2010
Resources: References & Bibliography for CSC352
General Knowledge Papers
- Von Neumann J., First Draft of a Report on the EDVAC, Moore School of Electrical Engineering, University of Pennsylvania, June 30, 1945. (Especially interesting are the first 5 pages)
- Rob Weir's 4Z Method for reviewing papers.
Papers, Articles and University Courses on Parallel & Distributed Processing
- Parallelism in general
- Asanovic K. et al, The Landscape of Parallel Computing Research: A View from Berkeley, Dec. 2006. (cached copy)
- Performance Evaluation
- Lei Hu, and I. Gorton, Performance Evaluation for Parallel Systems: A Survey, University of NSW, Technical Report UNSW-CSE-TR-9707, October 1997 (cached copy)
- Video Tutorial: Optimizing Performance in Parallel Processing, Jan. 2010 (19 minutes)
- Amdahl's Law in the Multicore Era, Mark Hill and Michael Marty, IEEE Computer, July 2008, and accompanying dynamic graph. (cached copy)
- Xen
- Mauer, R., Xen Virtualization and Linux Clustering, Linux Journal January 12th, 2006
- Barham P., et al., Xen and the Art of Virtualization, University of Cambridge Computer Laboratory 15 JJ Thomson Avenue, Cambridge, UK, CB3 0FD
- AMD News
- Hardwidge, B., AMD plans supercomputer with 1,000 GPUs, Jan. 2009, bit-tech.net (or graphics goes to the clouds!)
- Halfacree G., AMD supercomputer tops TOP500 list, November 2009, bit-tech.net (or Intel gets a black eye!)
- Google Code University
- Lecture Notes by Paul Krzyzanowski for a course on Distributed Computing at Rutgers. Quite complete, and covering the basics of parallelism, RPC, synchronization, fault tolerance, security, and distributed file systems.
- The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009. Table of Contents. A superb collection of essays on different topics (Low-res cached copy). The main chapters are:
- Part 1: Earth and Environment
- Part 2: Health and Wellbeing
- Part 3: Scientific Infrastructure
- Part 4: Scholarly Communication
- Final Thoughts
- Threading
- D. Tullsen, S. Eggers, and H. M. Levy, Simultaneous Multithreading: Maximizing On-Chip Parallelism, Proc. ISCA, Santa Margherita Ligure, Italy, 1995 (cached copy)
- Xgrid
- Hughes, B., Building Computational Grids with Apple's XGrid Middleware, ACM International Conference Proceeding Series, Vol. 167, Hobart, Tasmania, Australia, 2006. (cached copy)
- Tsouloupas G, and M. Dikaiakos, Characterization of Computational Grid Resources Using Low-Level Benchmarks, Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, Netherlands, 2006 (cached copy)
- MapReduce
- Dean, J., and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Dec. 2004, (cached copy)
- Dan Gillick, Arlo Faria, John DeNero, MapReduce: Distributed Computing for Machine Learning, Berkeley U., 2006 (Cached Copy)
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems, Stanford U., 2007 (Cached Copy). (not for class presentation)
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, CACM, Jan. 2010, Vol 53, No. 1 (Cached Copy)
- U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop, December 2008, Technical Report CMU-ML-08-117
Videos: Big Data and Analytics
- A video by LinkedIn's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the federal government. DJ shares his observations on how analytics has changed in recent years, especially as Big Data becomes increasingly common.
- Roger Magoulas, from O'Reilly Radar, discusses "big data" (10 minutes).
- Jeff Veen: Designing for "Big Data", April 2009.
Documentation on Python Threads
Documentation on XGrid
- Introduction: What's an XGrid system?
- XGrid Overview from Apple
- Videos
- A Video presentation of the XGrid (click on movie reel icon to start).
- A YouTube short video showing the XGrid running the Mandelbrot demo.
- A very good overview of the XGrid from macdevcenter.com
- Programming Examples, Setup, and References relating to the XGrid system at Smith College.
- Tutorial #1: Monte Carlo
General References
- XGrid Admin and High Performance Computing document (PDF)
- Apple Xgrid
- Apple Xgrid FAQ
- MacDevCenter
- MacResearch
- Stanford Xgrid
- Utah Xgrid
Applications
- XGrid Programming Guide
- An Introduction to R
- POVray on the XGrid
- Stanford Xgrid: One of the largest XGrid systems around.
- Utah Xgrid: Lots of good stuff.
- Using the Mathematica Kernel.
Documentation on Cloud Computing, Map-Reduce, & Hadoop
"Failure is the defining difference between distributed and local programming"
Ken Arnold, CORBA Designer
Literature
- Hadoop: The Definitive Guide, Tom White, O'Reilly Media, June 2009, ISBN 0596521979. The Web site for the book is http://www.hadoopbook.com/ (with the data used as examples in the book)
- Dan Sullivan, The Definitive Guide to Cloud Computing, IBM, 2010, in production (but can be downloaded in parts).
- Dean, J., and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Dec. 2004, (cached copy)
- Czajkowski G., Sorting 1 PB with MapReduce, Nov. 2008, (cached copy) (1 page only).
- Armbrust M., et al., Above the Clouds: A Berkeley View of Cloud Computing, Tech Rep. UCB/EECS-2009-28, Feb. 2009 (cached copy)
- Olston C. et al., Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
- Ghemawat S., H. Gobioff, and S.T. Leung, The Google File System, SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
- The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009. Table of Contents, (Low-res cached copy).
- Multicore Computing and Scientific Discovery, by Larus and Gannon
- Parallelism and the Cloud, by Gannon and Reed
- Visualization and Data-Intensive Science by Hansen, Johnson, Pascucci, and Silva.
- Talbot D., Security in the Ether, Technology Review, Jan/Feb 2010. (cached copy)
- HadoopWiki, Partitioning your job into Maps and Reduces, 2009.
- U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop, December 2008, Technical Report CMU-ML-08-117
- Matthews, S., & Williams, T., MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees, BMC Bioinformatics, 11 (Suppl 1), 2010. (The authors report speedups of close to 18 on 32 cores when processing 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each.)
- Chris K Wensel, Hadoop Is About Scalability, Not Performance, www.manamplified.org, November 12, 2008.
- Pavlo, Paulson, Rasin, Abadi, DeWitt, Madden, and Stonebraker, A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD-09, June 2009.
- TimeLine Graphs and Performance
- Owen O'Malley and Arun Murthy, Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds, http://developer.yahoo.net, May 2009.
- Joseph Gebis, Understanding Hadoop Task Timelines, http://blogs.sun.com, June 2009. (A good description of the Task Timelines used to quantify hadoop performance)
- Joseph Gebis, Hadoop Resource Utilization Monitoring -- scripts, http://blogs.sun.com, June 2009.
- Joseph Gebis, Hadoop resource utilization and performance analysis, http://blogs.sun.com, June 2009.
- Elias Torres, Hadoop TimeLines, http://hadoop-timelines.appspot.com, c. 2009.
Presentations
- Milind Bhandarkar, Data-Intensive Computing with Hadoop, Yahoo, Inc., Sept. 2008.
Tutorials
- Tom White, Running Hadoop MapReduce on Amazon EC2 and S3, Amazon Web Services Articles and Tutorials, 2007.
- Robert Sosinski, Starting Amazon EC2 with Mac OS X, www.robertsosinski.com, 2008.
- Introduction to Java for AWS developers, Amazon Web Services, 2007.
- Aaron Jacob (Evri.com), Using Cloudera's Hadoop AMIs to process EBS datasets on EC2, www.facebook.com, 2009. (cached copy)
- Yahoo Developer Network: Module 4: MapReduce Basics http://developer.yahoo.com/hadoop/tutorial/module4.html , a must-read!
Media Reports
- Markoff, J., A Deluge of Data Shapes a New Era in Computing, New York Times, 12/15/09
News Feed
- cloud-computing.alltop.com: aggregated news about the cloud
Class Material on the Web
- Google's series of 4 lectures on map-reduce, distributed file-system, and clustering algorithms.
- University of Washington: Problem Solving on Large Scale Clusters
- Brandeis University: Distributed Systems Course
- Google: Introduction to Parallel Programming and MapReduce
- U. C. Berkeley: Intro to Parallel Programming and Threading
- California PolyTech: A lab on the NetFlix data set
- New Mexico Tech: syllabus (pdf)
- U. Maryland: Syllabus, and Jimmy Lin's Cloud 9 page.
Software/Web Links
- Apache's Documentation on Hadoop Common
- The Hadoop Tutorial from Apache. A "Must-Do!"
- Hadoop Streaming, i.e. using Hadoop with Python, for example.
- A Yahoo Tutorial on Hadoop. Another "Must-Do!"
- A Hadoop-on-Eclipse tutorial. Written for Windows, but it works on Macs as well. Best way to set up Eclipse! You will need Eclipse 3.3.2 and Hadoop 0.19.1.
- The Hadoop-Book Web site.
- The Hadoop Wiki, the authoritative source on working with Hadoop. Many examples in Java and Python
- Hadoop at Google: A preconfigured single node instance available at Google.
- Writing the WordCount in Python
- Guide for setting up IBM's Eclipse Tools for Hadoop (go to bottom of page)
- The IBM MapReduce Tools for Eclipse Plug-in is a robust plug-in that brings Hadoop support to the Eclipse platform. Features include server configuration, support for launching MapReduce jobs and browsing the distributed file system. This setup assumes that you are running Eclipse (version 3.3 or above) on your computer.
- Guide from Cornell for setting up Hadoop on a Mac.
- Configuring Eclipse for Hadoop: a video from Cloudera on setting up Hadoop (not easy to follow).
- The source code for the examples that come with the Hadoop 0.19.1 distribution. Includes WordCount, WordCountAggregate, WordCountHistogram, PiEstimator, Join, and Grep, among others.
- Generating Hadoop TimeLines
- Python script from apache.org to generate the Hadoop timeline.
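
Several of the links above concern Hadoop Streaming and writing WordCount in Python. As a rough illustration of what such a job computes, here is a minimal local sketch, not taken from any of the linked tutorials. The function names (`map_words`, `reduce_counts`, `word_count`) are hypothetical, and the sort step only stands in for Hadoop's shuffle phase. In an actual streaming job, the mapper and reducer would be separate scripts that read records from standard input and write them to standard output.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_records):
    """Reducer step: sum the counts of consecutive records that share a key.

    Assumes the records arrive sorted by key, which is how a streaming
    reducer normally receives its input.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def word_count(lines):
    """Simulate map -> sort (stand-in for the shuffle) -> reduce locally."""
    return dict(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    sample = ["the cat", "The hat the"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Running the map, sort, and reduce steps in one process like this is a convenient way to check the logic on a small file before submitting anything to a cluster.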
Videos
- Google's series of 4 lectures on map-reduce, distributed file-system, and clustering algorithms.
- Berkeley lecture on Map-Reduce (CS 61A Lecture 34)
- A video of Tom White, author of O'Reilly's Hadoop guide, on BlipTV. White outlines the suite of projects centered around Hadoop (an open-source MapReduce project).
- Cloudera's collection of videos.
- CNBC's report: Inside the Mind of Google. "The best way to watch “Inside the Mind of Google,” Maria Bartiromo’s report on the Internet giant Thursday on CNBC, is to not watch the first quarter of it." (from Neil Genzlinger's 12/02/09 NYT review)
- Short video by consultant at http://www.stratoslearning.com (5 min). Outlines a course on Cloud Computing.
- Part I: cloud fundamentals
- Part II: technology and barriers
- Part III: security
- Part IV: what options? players?
- Part V: Application, hands on
- Uses Amazon as test platform.
Visualizations
- Visualizations of Hadoop Data Transfers, from the U. of Nebraska (more videos)
- Monitoring a Cluster of Computers as a school of fish (U. Nebraska)
- The evolution of Hadoop (Code-Swarm)