CSC352 Notes 2013
<onlydft>
<br />
<center>
<font color="red">
'''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes'''
</font>
</center>
<br />
__TOC__
<br />
 
=Resources 2013=
* [http://infosthetics.com/archives/2013/07/phototrails_the_visual_structure_of_millions_of_user-generated_photos.html visual structures of millions of user-generated photos]
==Rocco's Presentation 10/10/13==
* libguides.smith.edu/content.php?pid=510405
* idea:
** for paper, start getting the thread: collage, packing, parallel image processing.
** approaches.
** intro: what has been done in the field
* Citation database: Web of Science
* RefWorks & Zotero can help maintain citations
* 5-College catalogs
* WorldCat is the worldwide catalog of books
* Web of Science: can get information on references, and also on who's publishing in the field or which institutions are publishing in the given area.
* Discover searches other databases.
* Library Guide (Albany), super guide for libraries.
* [http://VideoLectures.net videolectures.net]
 
<br />
==Hadoop==
<br />
* [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]]
* [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old)
 
==On-Line==
* [https://computing.llnl.gov/tutorials/parallel_comp/ Introduction to Parallel Processing]
* [[Media:RITParallelProgrammingWorkshop.pdf | RIT Parallel Programming Workshop]]
==Papers==
* [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, by Armbrust, Michael and Fox, Armando and Griffith, Rean and Joseph, Anthony D. and Katz, Randy and Konwinski, Andy and Lee, Gunho and Patterson, David and Rabkin, Ariel and Stoica, Ion and Zaharia, Matei.
* [[Media:NIST_Definition_Cloud_Computing_2010.pdf | The NIST Definition of Cloud Computing (Draft)]] (very short paper)
* [[Media:NobodyGotFiredUsingHadoopOnCluster_2012.pdf| Nobody ever got fired for using Hadoop on a cluster]], Rowstron, Antony and Narayanan, Dushyanth and Donnelly, Austin and O'Shea, Greg and Douglas, Andrew.
* [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf The Landscape of Parallel Computing Research: A View From Berkeley], 2006, still good! (very long paper)
* [[Media:UpdateOnaViewFromBerkeley2010.pdf | Update on a view from Berkeley]], 2010. (short paper)
* [[Media:GeneralPurposeVsGPU_Comparison_Many_Cores_2010_Caragea.pdf |General-Purpose vs. GPU: Comparisons of Many-Cores on Irregular Workloads]], 2010.
* [[Media:ParallelCOmputingWithPatternsAndFrameworks2010b.pdf | Parallel Computing with Patterns and Frameworks]], 2010, ''XRDS''.
* [[Media:ServerVirtualizationArchitectureAndImplementation2009.pdf | Server Virtualization Architecture and Implementation]], ''XRDS'', 2009.
* [[Media:XGridHadoopCloser2011.pdf | Processing Wikipedia Dumps: A Case-Study comparing the XGrid and MapReduce Approaches]], D. Thiebaut, Yang Li, Diana Jaunzeikare, Alexandra Cheng, Ellysha Raelen Recto, Gillian Riggs, Xia Ting Zhao, Tonje Stolpestad, and Cam Le T Nguyen, ''in proceedings of 1st Int'l Conf. On Cloud Computing and Services Science'' (CLOSER 2011), Noordwijkerhout, NL, May 2011. ([[Media:XGridHadoopFeb2011.pdf |longer version]])
* [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper)
* [[Media:UnderstandingThroughputOrientedArchitectures2010.pdf | Understanding Throughput-Oriented Architectures]], CACM, 2010.
* [[Media:LearningFromTheSuccessOfMPI2002_WilliamGropp.pdf | Learning from the Success of MPI]], by William D. Gropp, Argonne National Lab, 2002.
==Art==
* Maggie Lind's [[Media:MaggieLindProposalCSC352.pdf | MaggieLindProposalCSC352.pdf]]
* Fraser?
* Chester?
<br />
==Some good references==
* Sounds of Wikipedia: http://listen.hatnote.com/#nowelcomes,en
 
* Exhibition at Somerset House
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
<br />
{|
|
[http://lens.blogs.nytimes.com/2010/03/23/behind-38/?_r=0 Bill Cunningham] of the New York Times.
|
[[Image:BillCunningham.jpg|150px|right]]
|-
|
[http://infosthetics.com/archives/2013/07/phototrails_the_visual_structure_of_millions_of_user-generated_photos.html visual structures of millions of user-generated photos]
|
[[Image:milionsUserGeneratedPhotos.jpg|right|150px]]
|-
|
[[Image:digitalsignagecollection.png|150px]]
|
[http://www.digitalsignageconnection.com/art-museum-creates-interactive-visitor-experience-christie-microtiles-video-walls-959 Cleveland Museum of Art's Collection Wall allows up to 16 people to interact simultaneously with the wall using RFID tags on iPad stations.]
|}
<br />
<br />
*[http://computinged.wordpress.com/2012/11/21/cs2013-ironman-draft-available/ Ironman ACM/IEEE Curriculum] ([http://ai.stanford.edu/users/sahami/CS2013//ironman-draft/cs2013-ironman-v0.8.pdf link to the pdf report]). The report suggests that parallel and distributed computing should be an integral part of the CS curriculum. Some people (e.g. Danner & Newhall at Swarthmore) go even further and suggest it should be incorporated at all levels of the curriculum.
 
=Misc. Topics=
* LaTeX
* writing papers
* reading ==> Newsletter
* presentations
* museum visit
* parallel programming
** MPI
** Java threads
** concurrency issues
** where's the data? Where are the processors?
* Projects
** MPI
** GPU
* Look at recent conferences. Where are the trends? [http://conference.researchbib.com/?action=viewEventDetails&eventid=26507&uid=raf013 APPT 2013 - 2013 International Conference on Advanced Parallel Processing Technology]
=XSEDE.ORG=
* registered 8/8/13: thiebaut/ToMoKo2#
* https://portal.xsede.org/
* [http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf My notes from the Berkeley Boot-Camp on-line Workshop], Aug 2013.
 
=Update 2015: Downloading images to Hadoop0=
<br />
* Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx where xxx is 0, 1, 2, ..., 9, a, b, c, d, e, f.
<br />
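The 16 hex-named shards can be fetched in a loop; a minimal sketch, assuming an rsync module on the your.org mirror that parallels the HTTP path above (an assumption; check the mirror's rsync listing) and a made-up destination directory. It prints the commands as a dry run; drop the <code>echo</code> to actually transfer:

```shell
#!/bin/bash
# Dry-run sketch: fetch the 16 hex-named image shards (0-9, a-f).
# DEST is a placeholder; the rsync module path mirrors the HTTP URL
# above but is an assumption, not confirmed from the source.
SHARDS="0 1 2 3 4 5 6 7 8 9 a b c d e f"
DEST=/media/hadoop0/wikipedia-images
for s in $SHARDS; do
    # echo makes this a dry run: the command is printed, not executed
    echo rsync -av "rsync://ftpmirror.your.org/wikimedia/images/wikipedia/en/$s/" "$DEST/$s/"
done
```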
=Downloading All Wikipedia Images=
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
:::''Where are images and uploaded files<br /><br />Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors.<br /><br />Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal)''
 
* [http://wikimedia.wansec.com/other/pagecounts-raw/ Page View Statistics for Wikimedia projects] at wikimedia.wansec.com/other/pagecounts-raw/
* The main information about the dumps and their format is here: [https://wikitech.wikimedia.org/wiki/Dumps/media https://wikitech.wikimedia.org/wiki/Dumps/media]
:::''Tarballs are generated on a server provided by Your.org and made available from that mirror. The rsynced copy of the media itself and an rsynced copy of the above files (image/imagelinks/redirs info) is used as input to createmediatarballs.py to create two series of tarballs per wiki, one containing all locally uploaded media and the other containing all media uploaded to commons and used on the wiki.<br />One series of tarballs (with names looking like, e.g., enwiki-20120430-remote-media-1.tar, enwiki-20120430-remote-media-2.tar, and so on for remote media, and enwiki-20120430-local-media-1.tar, enwiki-20120430-local-media-2.tar and so on for local media), should contain all media for a given project. We bundle up the media into tarballs of 100k files per tarball for convenience of the downloader.<br />''
 
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
** The size of all the media for 20121201 is 172 GB for the local dumps and 2.153 TB for the remote dumps. Total = 2.3 TB.
 enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-3.tar 25.6 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-4.tar 21.5 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-5.tar 20.7 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-6.tar 22.4 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-7.tar 18.2 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-8.tar 24.4 GB 12/6/12 12:00:00 AM
 enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-10.tar 90.5 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-11.tar 88.2 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-12.tar 88.4 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-13.tar 89.6 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-14.tar 88.6 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-15.tar 91.2 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-16.tar 91.3 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-17.tar 89.4 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-18.tar 90.0 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-19.tar 90.0 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-2.tar 90.5 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-20.tar 90.1 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-21.tar 91.2 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-22.tar 89.3 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-23.tar 91.0 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-3.tar 88.6 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-4.tar 90.0 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-5.tar 90.9 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-6.tar 88.3 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-7.tar 89.6 GB 12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-8.tar 90.4 GB 12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-9.tar 89.7 GB 12/7/12 12:00:00 AM
* To get them, store the list above in a text file (listOfTarArchives.txt) and use wget. The fields are separated by spaces, so extract the file name with awk (the original <code>cut -f 1</code> expects tabs and would pass the whole line through), and skip the .bz2 duplicate:

 for i in `cat listOfTarArchives.txt | awk '{print $1}' | grep -v bz2`; do
     echo $i
     wget ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i
 done

* Total size should be 2.310 TB.
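Once the wget loop finishes, the on-disk total can be checked against the 2.310 TB figure; a minimal sketch, where <code>tarball_total</code> is a hypothetical helper and <code>stat -c%s</code> is the GNU coreutils form (BSD stat uses <code>-f%z</code>):

```shell
#!/bin/bash
# Sketch: sum the byte sizes of the downloaded tarballs so the grand
# total can be compared to the advertised 2.310 TB. tarball_total is a
# hypothetical helper, not part of the original notes.
tarball_total() {
    local total=0 f size
    for f in "$@"; do
        [ -f "$f" ] || continue          # skip non-existent matches
        size=$(stat -c%s "$f")           # file size in bytes (GNU stat)
        total=$(( total + size ))
    done
    echo "$total"
}
# usage, after the downloads complete:
# tarball_total enwiki-20121201-*-media-*.tar | awk '{printf "%.3f TB\n", $1/1e12}'
```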
=Download the page statistics=
==Links of Interest==
* http://stats.grok.se/
* http://stats.grok.se/about
* http://dom.as/
* http://dumps.wikimedia.org/other/pagecounts-raw/
* http://dumps.wikimedia.org/other/pagecounts-raw/2013/
* started downloading all files from the above link to hadoop0:/media/dominique/3TB/mediawiki/statistics/
* wgetStats.sh:

 #! /bin/bash
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
 ...
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000
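Rather than hard-coding 744 hourly wget lines, the January 2013 URLs can be generated in a loop; a minimal sketch (<code>pagecountUrls.txt</code> is a made-up file name). Note from the listing above that some dumps carry timestamps offset by a second or two (e.g. <code>-020001</code>), so a generated list can miss a few files and a robust script should fall back to the directory index:

```shell
#!/bin/bash
# Sketch: generate the hourly pagecounts URLs for January 2013 instead
# of listing each wget command by hand. seq -w zero-pads days (01-31)
# and hours (00-23); 31 days x 24 hours = 744 URLs.
BASE=http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01
for day in $(seq -w 1 31); do
    for hour in $(seq -w 0 23); do
        echo "$BASE/pagecounts-201301${day}-${hour}0000.gz"
    done
done > pagecountUrls.txt
# wget -i pagecountUrls.txt    # uncomment to actually download
```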
----
=Resources 2010=
  
 
=Map-Reduce/Hadoop=

     Hello, world!

</onlydft>

Latest revision as of 09:24, 8 December 2016