Difference between revisions of "CSC352 Notes 2013"

From dftwiki3
Jump to: navigation, search
 
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
<onlydft>
 
<onlydft>
 +
<br />
 +
<center>
 +
<font color="red">
 +
'''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes'''
 +
</font>
 +
</center>
 +
<br />
 +
__TOC__
 +
<br />
 +
=Resources 2013=
 +
==Rocco's Presentation 10/10/13==
 +
* libguides.smith.edu/content.php?pid=510405
 +
* idea:
 +
** for paper, start getting the thread, collage, packing, parallel image processing. 
 +
** approaches.
 +
** intro: what has been done in the field
 +
* Citation database: Web of Science
 +
* Ref Works & Zotero can help maintain citations
 +
* 5-College catalogs
 +
* Worldcat is the world catalog for books
 +
* Web of Science: can get information on references and also who's publishing in the field or which institutions are publishing in the given area.
 +
* Discover searches other databases.
 +
* Library Guide (Albany), super guide for libraries.
 +
* [http://VideoLectures.net videolectures.net]
 +
 +
<br />
 +
==Hadoop==
 +
<br />
 +
* [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]]
 +
* [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old)
  
=Resources 2013=
+
==On-Line==
  
 +
* [https://computing.llnl.gov/tutorials/parallel_comp/ Introduction to Parallel Processing]
 +
* [[Media:RITParallelProgrammingWorkshop.pdf | RIT Parallel Programming Workshop]]
 
==Papers==
 
==Papers==
 
* [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, By Armbrust, Michael and Fox, Armando and Griffith, Rean and Joseph, Anthony D. and Katz, Randy and Konwinski, Andy and Lee, Gunho and Patterson, David and Rabkin, Ariel and Stoica, Ion and Zaharia, Matei.
 
* [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, By Armbrust, Michael and Fox, Armando and Griffith, Rean and Joseph, Anthony D. and Katz, Randy and Konwinski, Andy and Lee, Gunho and Patterson, David and Rabkin, Ariel and Stoica, Ion and Zaharia, Matei.
* [[Media:NIST_Definition_Cloud_Computing_2010.pdf | The NIST Definition of Cloud Computing (Draft)]]
+
* [[Media:NIST_Definition_Cloud_Computing_2010.pdf | The NIST Definition of Cloud Computing (Draft)]] (very short paper)
 
* [[Media:NobodyGotFiredUsingHadoopOnCluster_2012.pdf| Nobody ever got fired for using Hadoop on a cluster]], Rowstron, Antony and Narayanan, Dushyanth and Donnelly, Austin and O'Shea, Greg and Douglas, Andrew
 
* [[Media:NobodyGotFiredUsingHadoopOnCluster_2012.pdf| Nobody ever got fired for using Hadoop on a cluster]], Rowstron, Antony and Narayanan, Dushyanth and Donnelly, Austin and O'Shea, Greg and Douglas, Andrew
* [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View From Berkely], 2006, still good!
+
* [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View From Berkely], 2006, still good! (very long paper)
* [[Media:UpdateOnaViewFromBerkeley2010.pdf | Update on a view from Berkeley]], 2010.
+
* [[Media:UpdateOnaViewFromBerkeley2010.pdf | Update on a view from Berkeley]], 2010. (short paper)
 
* [[Media:GeneralPurposeVsGPU_Comparison_Many_Cores_2010_Caragea.pdf |General-Purpose vs. GPU: Comparisons of Many-Cores on Irregular Workloads]], 2010
 
* [[Media:GeneralPurposeVsGPU_Comparison_Many_Cores_2010_Caragea.pdf |General-Purpose vs. GPU: Comparisons of Many-Cores on Irregular Workloads]], 2010
 
* [[Media:ParallelCOmputingWithPatternsAndFrameworks2010b.pdf | Parallel Computing with Patterns and Frameworks]], 2010, ''XRDS''.
 
* [[Media:ParallelCOmputingWithPatternsAndFrameworks2010b.pdf | Parallel Computing with Patterns and Frameworks]], 2010, ''XRDS''.
Line 14: Line 46:
 
* [[Media:XGridHadoopCloser2011.pdf | Processing Wikipedia Dumps: A Case-Study comparing the XGrid and MapReduce Approaches]], D. Thiebaut, Yang Li, Diana Jaunzeikare, Alexandra Cheng, Ellysha Raelen Recto, Gillian Riggs, Xia Ting Zhao, Tonje Stolpestad, and Cam Le T Nguyen, ''in proceedings of 1st Int'l Conf. On Cloud Computing and Services Science'' (CLOSER 2011), Noordwijkerhout, NL, May 2011. ([[Media:XGridHadoopFeb2011.pdf |longer version]])
 
* [[Media:XGridHadoopCloser2011.pdf | Processing Wikipedia Dumps: A Case-Study comparing the XGrid and MapReduce Approaches]], D. Thiebaut, Yang Li, Diana Jaunzeikare, Alexandra Cheng, Ellysha Raelen Recto, Gillian Riggs, Xia Ting Zhao, Tonje Stolpestad, and Cam Le T Nguyen, ''in proceedings of 1st Int'l Conf. On Cloud Computing and Services Science'' (CLOSER 2011), Noordwijkerhout, NL, May 2011. ([[Media:XGridHadoopFeb2011.pdf |longer version]])
 
* [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper).
 
* [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper).
 +
* [[Media:UnderstandingThroughputOrientedArchitectures2010.pdf | Understanding Throughput-Oriented Architectures]], CACM, 2010.
 +
* [[Media:LearningFromTheSuccessOfMPI2002_WilliamGropp.pdf | Learning from the Success of MPI]], by WIlliam D. Gropp,  Argonne National Lab, 2002.
 
<p>
 
<p>
  
Line 22: Line 56:
 
<br />
 
<br />
 
==Some good references==
 
==Some good references==
 +
* Sounds of wikipedia: http://listen.hatnote.com/#nowelcomes,en
 +
 +
* Exhibition at Somerset House
 
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
 
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
 
<br />
 
<br />
Line 65: Line 102:
 
* [[http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf | My notes from the Berkeley Boot-Camp on-line Workshop]], Aug 2013.
 
* [[http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf | My notes from the Berkeley Boot-Camp on-line Workshop]], Aug 2013.
  
 +
=Update 2015: Downloading images to Hadoop0=
 +
<br />
 +
* Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx where xxx is 0, 1, 2, ... 9, a, b, c, d, e, f.
 +
<br />
 
=Downloading All Wikipedia Images=
 
=Downloading All Wikipedia Images=
 
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
 
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
Line 76: Line 117:
 
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
 
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
 
** The size of all the all the media media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps.  Total = 2.3 TB.
 
** The size of all the all the media media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps.  Total = 2.3 TB.
 +
enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-3.tar 25.6 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-4.tar 21.5 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-5.tar 20.7 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-6.tar 22.4 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-7.tar 18.2 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-8.tar 24.4 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-10.tar 90.5 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-11.tar 88.2 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-12.tar 88.4 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-13.tar 89.6 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-14.tar 88.6 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-15.tar 91.2 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-16.tar 91.3 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-17.tar 89.4 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-18.tar 90.0 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-19.tar 90.0 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-2.tar 90.5 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-20.tar 90.1 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-21.tar 91.2 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-22.tar 89.3 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-23.tar 91.0 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-3.tar 88.6 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-4.tar 90.0 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-5.tar 90.9 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-6.tar 88.3 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-7.tar 89.6 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-8.tar 90.4 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-9.tar 89.7 GB 12/7/12 12:00:00 AM
 +
 +
* To get them, store list above in a text file (listOfTarArchives.txt) and use wget:
 +
 +
for i in `cat listOfTarArchives.txt | cut -f 1 | grep -v bz2`; do
 +
      echo $i
 +
      wget ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i
 +
      done
 +
 +
* Total size should be 2.310 TB.
 +
 +
=Download the page statistics=
 +
 +
==Links of Interest==
 +
* http://stats.grok.se/
 +
* http://stats.grok.se/about
 +
* http://dom.as/
 +
* http://dumps.wikimedia.org/other/pagecounts-raw/
 +
* http://dumps.wikimedia.org/other/pagecounts-raw/2013/
 +
* started downloading all files from above link to hadoop0:/media/dominique/3TB/mediawiki/statistics/
 +
* wgetStats.sh
 +
#! /bin/bash
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
 +
...
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000
  
 
----
 
----

Latest revision as of 10:24, 8 December 2016


...