Difference between revisions of "CSC352 Notes 2013"
(16 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
<onlydft> | <onlydft> | ||
+ | <br /> | ||
+ | <center> | ||
+ | <font color="red"> | ||
+ | '''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes''' | ||
+ | </font> | ||
+ | </center> | ||
+ | <br /> | ||
+ | __TOC__ | ||
+ | <br /> | ||
+ | =Resources 2013= | ||
+ | ==Rocco's Presentation 10/10/13== | ||
+ | * libguides.smith.edu/content.php?pid=510405 | ||
+ | * idea: | ||
+ | ** for paper, start getting the thread, collage, packing, parallel image processing. | ||
+ | ** approaches. | ||
+ | ** intro: what has been done in the field | ||
+ | * Citation database: Web of Science | ||
+ | * RefWorks & Zotero can help maintain citations | ||
+ | * 5-College catalogs | ||
+ | * Worldcat is the world catalog for books | ||
+ | * Web of Science: can get information on references and also who's publishing in the field or which institutions are publishing in the given area. | ||
+ | * Discover searches other databases. | ||
+ | * Library Guide (Albany), super guide for libraries. | ||
+ | * [http://VideoLectures.net videolectures.net] | ||
+ | |||
+ | <br /> | ||
+ | ==Hadoop== | ||
+ | <br /> | ||
+ | * [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]] | ||
+ | * [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old) | ||
− | = | + | ==On-Line== |
+ | * [https://computing.llnl.gov/tutorials/parallel_comp/ Introduction to Parallel Processing] | ||
+ | * [[Media:RITParallelProgrammingWorkshop.pdf | RIT Parallel Programming Workshop]] | ||
==Papers== | ==Papers== | ||
* [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, By Armbrust, Michael and Fox, Armando and Griffith, Rean and Joseph, Anthony D. and Katz, Randy and Konwinski, Andy and Lee, Gunho and Patterson, David and Rabkin, Ariel and Stoica, Ion and Zaharia, Matei. | * [[Media:AViewOfCloudComputing_CACM_Apr2010.pdf| A View of Cloud Computing]], 2010, By Armbrust, Michael and Fox, Armando and Griffith, Rean and Joseph, Anthony D. and Katz, Randy and Konwinski, Andy and Lee, Gunho and Patterson, David and Rabkin, Ariel and Stoica, Ion and Zaharia, Matei. | ||
Line 15: | Line 47: | ||
* [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper). | * [[Media:BeyondHadoop_CACM_Mone_2013.pdf | Beyond Hadoop]], Gregory Mone, CACM, 2013. (short paper). | ||
* [[Media:UnderstandingThroughputOrientedArchitectures2010.pdf | Understanding Throughput-Oriented Architectures]], CACM, 2010. | * [[Media:UnderstandingThroughputOrientedArchitectures2010.pdf | Understanding Throughput-Oriented Architectures]], CACM, 2010. | ||
+ | * [[Media:LearningFromTheSuccessOfMPI2002_WilliamGropp.pdf | Learning from the Success of MPI]], by William D. Gropp, Argonne National Lab, 2002. | ||
<p> | <p> | ||
Line 23: | Line 56: | ||
<br /> | <br /> | ||
==Some good references== | ==Some good references== | ||
+ | * Sounds of wikipedia: http://listen.hatnote.com/#nowelcomes,en | ||
+ | |||
+ | * Exhibition at Somerset House | ||
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center> | <center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center> | ||
<br /> | <br /> | ||
Line 66: | Line 102: | ||
* [[http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf | My notes from the Berkeley Boot-Camp on-line Workshop]], Aug 2013. | * [[http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf | My notes from the Berkeley Boot-Camp on-line Workshop]], Aug 2013. | ||
+ | =Update 2015: Downloading images to Hadoop0= | ||
+ | <br /> | ||
+ | * Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx, where xxx is one of the 16 hexadecimal digits 0, 1, 2, ... 9, a, b, c, d, e, f. | ||
+ | <br /> | ||
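The 16 subdirectories named above can be enumerated with a short loop. This is only a sketch: it prints the URLs and does not transfer anything, because the notes do not give the exact rsync invocation. The function name `shard_urls` is made up for this illustration.

```shell
#!/bin/bash
# Base URL copied from the note above.
BASE_URL="http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en"

# Hypothetical helper: print one URL per image subdirectory (0-9, a-f).
shard_urls() {
    local shard
    for shard in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
        echo "$BASE_URL/$shard/"
    done
}

# Dry run: list the 16 locations that would have to be mirrored.
shard_urls
```

Each printed URL can then be handed to whatever transfer command is used on hadoop0, one subdirectory at a time, so that an interrupted transfer only has to be restarted for the subdirectory that failed.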
=Downloading All Wikipedia Images= | =Downloading All Wikipedia Images= | ||
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia] | * From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia] | ||
Line 77: | Line 117: | ||
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/] | ** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/] | ||
** The size of all the media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps. Total = 2.3 TB. | ** The size of all the media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps. Total = 2.3 TB. | ||
+ | enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-3.tar 25.6 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-4.tar 21.5 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-5.tar 20.7 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-6.tar 22.4 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-7.tar 18.2 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-8.tar 24.4 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-10.tar 90.5 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-11.tar 88.2 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-12.tar 88.4 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-13.tar 89.6 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-14.tar 88.6 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-15.tar 91.2 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-16.tar 91.3 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-17.tar 89.4 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-18.tar 90.0 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-19.tar 90.0 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-2.tar 90.5 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-20.tar 90.1 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-21.tar 91.2 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-22.tar 89.3 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-23.tar 91.0 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-3.tar 88.6 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-4.tar 90.0 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-5.tar 90.9 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-6.tar 88.3 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-7.tar 89.6 GB 12/6/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-8.tar 90.4 GB 12/7/12 12:00:00 AM | ||
+ | enwiki-20121201-remote-media-9.tar 89.7 GB 12/7/12 12:00:00 AM | ||
+ | |||
+ | * To get them, store the list above in a text file (listOfTarArchives.txt) and use wget: | ||
+ | |||
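+ | # Take the first column (the file name) of each line; skip the .bz2 duplicate. | ||
+ | # awk splits on spaces or tabs, so either separator in the list works. | ||
+ | for i in $(awk '{print $1}' listOfTarArchives.txt | grep -v bz2); do | ||
+ | echo "$i" | ||
+ | wget "ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i" | ||
+ | done | ||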
+ | |||
+ | * Total size should be 2.310 TB. | ||
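One way to check the expected total before downloading is to add up the size column of listOfTarArchives.txt. The sketch below assumes each line has the form `<name> <size> GB <date> <time> AM`, as in the listing above. The function name `sum_gb` is made up for this illustration, and the .bz2 entry is skipped because it duplicates remote-media-24.

```shell
#!/bin/bash
# Hypothetical helper: sum the size column (in GB) of a listing file.
#   $1: listing file, one archive per line: <name> <size> GB <date> <time> AM
#   $2: substring selecting rows, e.g. "local" or "remote" ("" selects all)
sum_gb() {
    grep -v 'bz2' "$1" | grep -- "$2" \
        | awk '$3 == "GB" { total += $2 } END { printf "%.1f\n", total }'
}

# Only runs when the listing file from the notes is present.
if [ -f listOfTarArchives.txt ]; then
    echo "local:  $(sum_gb listOfTarArchives.txt local) GB"
    echo "remote: $(sum_gb listOfTarArchives.txt remote) GB"
    echo "all:    $(sum_gb listOfTarArchives.txt "") GB"
fi
```

After the download, the same figures can be compared with the sizes reported by `du` on the target directory.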
+ | |||
+ | =Download the page statistics= | ||
+ | |||
+ | ==Links of Interest== | ||
+ | * http://stats.grok.se/ | ||
+ | * http://stats.grok.se/about | ||
+ | * http://dom.as/ | ||
+ | * http://dumps.wikimedia.org/other/pagecounts-raw/ | ||
+ | * http://dumps.wikimedia.org/other/pagecounts-raw/2013/ | ||
+ | * Started downloading all files from the above link to hadoop0:/media/dominique/3TB/mediawiki/statistics/ | ||
+ | * wgetStats.sh | ||
+ | #! /bin/bash | ||
+ | wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz | ||
+ | wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz | ||
+ | wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz | ||
+ | ... | ||
+ | wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000 | ||
+ | wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000 | ||
+ | wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000 | ||
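The file names in wgetStats.sh cannot be generated from the date alone, since the seconds field varies (000000, 010000, 020001). A possible way to build the script is to derive the wget lines from a saved copy of the directory index. The sketch below assumes the index is an HTML page whose links are plain `href="<file name>"` attributes. The function name `make_wget_lines` and the saved file `index.html` are made up for this illustration.

```shell
#!/bin/bash
# Month directory copied from the wget lines above.
MONTH_URL="http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01"

# Hypothetical helper: turn a saved index page into wget commands.
#   $1: path to a saved copy of the HTML index for $MONTH_URL
make_wget_lines() {
    grep -o 'href="[^"]*"' "$1" \
        | sed -e 's/^href="//' -e 's/"$//' \
        | grep -E '^(pagecounts|projectcounts)-[0-9]{8}-[0-9]{6}(\.gz)?$' \
        | sort -u \
        | while read -r name; do
              echo "wget $MONTH_URL/$name"
          done
}

# Only runs when a saved index page is present (assumed name: index.html).
if [ -f index.html ]; then
    echo '#! /bin/bash' > wgetStats.sh
    make_wget_lines index.html >> wgetStats.sh
fi
```

Links that do not look like pagecounts or projectcounts files (parent directory, checksum files) are filtered out by the `grep -E` pattern.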
---- | ---- |