CSC352 Notes 2013
<onlydft>
<br />
<center>
<font color="red">
'''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes'''
</font>
</center>
<br />
__TOC__
<br />
=Resources 2013=
==Rocco's Presentation 10/10/13==
* libguides.smith.edu/content.php?pid=510405
* Ideas for the paper:
** start by pulling on the thread: collage, packing, parallel image processing.
** survey the different approaches.
** intro: what has been done in the field.
* Citation database: Web of Science
* RefWorks and Zotero can help maintain citations.
* 5-College catalogs
* WorldCat is the world catalog for books.
* Web of Science: provides information on references, and also on who is publishing in the field and which institutions are publishing in a given area.
* Discover searches other databases.
* Library Guide (Albany): a super guide for libraries.
* [http://VideoLectures.net videolectures.net]

<br />
==Hadoop==
<br />
* [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]]
* [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old)
==On-Line==
<br />
==Some good references==
* Sounds of Wikipedia: http://listen.hatnote.com/#nowelcomes,en

* Exhibition at Somerset House:
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
<br />
* [http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf My notes from the Berkeley Boot-Camp on-line Workshop], Aug 2013.
=Update 2015: Downloading images to Hadoop0=
<br />
* Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx where xxx is 0, 1, 2, ... 9, a, b, c, d, e, f.
<br />
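The bucket loop above can be sketched as follows. This is only a sketch: the <code>rsync://</code> module path is an assumption modeled on the HTTP URL (the mirror's actual rsync module name should be confirmed first), and the local target directory is hypothetical, so the commands are echoed for review rather than executed.

```shell
#!/bin/bash
# Sketch: one rsync invocation per hex-named image bucket (0-9, a-f).
# The rsync:// module path is an ASSUMPTION modeled on the HTTP URL above;
# commands are collected and echoed rather than run, so the list can be
# reviewed (or piped to sh) once the real module name is confirmed.
dest=/media/dominique/3TB/mediawiki/images/en    # hypothetical local target
cmds=$(for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
    echo "rsync -av --partial rsync://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/$d/ $dest/$d/"
done)
echo "$cmds"
```

The <code>--partial</code> flag keeps partially transferred files so interrupted multi-gigabyte downloads can resume.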
=Downloading All Wikipedia Images=
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
** The size of all the media for 20121201 is 172 GB for the local dumps and 2.153 TB for the remote dumps, for a total of about 2.3 TB.
 enwiki-20121201-local-media-2.tar        22.5 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-3.tar        25.6 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-4.tar        21.5 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-5.tar        20.7 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-6.tar        22.4 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-7.tar        18.2 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-8.tar        24.4 GB  12/6/12 12:00:00 AM
 enwiki-20121201-local-media-9.tar         1.3 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-1.tar       89.9 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-10.tar      90.5 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-11.tar      88.2 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-12.tar      88.4 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-13.tar      89.6 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-14.tar      88.6 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-15.tar      91.2 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-16.tar      91.3 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-17.tar      89.4 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-18.tar      90.0 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-19.tar      90.0 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-2.tar       90.5 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-20.tar      90.1 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-21.tar      91.2 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-22.tar      89.3 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-23.tar      91.0 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-24.tar      44.3 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-24.tar.bz2  42.6 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-3.tar       88.6 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-4.tar       90.0 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-5.tar       90.9 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-6.tar       88.3 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-7.tar       89.6 GB  12/6/12 12:00:00 AM
 enwiki-20121201-remote-media-8.tar       90.4 GB  12/7/12 12:00:00 AM
 enwiki-20121201-remote-media-9.tar       89.7 GB  12/7/12 12:00:00 AM

* To get them, store the list above in a text file (listOfTarArchives.txt) and use wget. Note that the list is space-separated, so the first column is extracted with awk rather than cut (which defaults to tab delimiters); the grep skips the redundant .bz2 copy of remote-media-24:

 for i in `awk '{print $1}' listOfTarArchives.txt | grep -v bz2`; do
    echo $i
    wget ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i
 done

* The total size should be 2.310 TB.

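As a sanity check against the quoted totals, the GB column of the listing can be summed with awk. A minimal sketch, shown here on a three-line sample embedded in a heredoc; the same one-liner works on the full listOfTarArchives.txt:

```shell
#!/bin/bash
# Sketch: sum the size column of the archive listing, skipping the
# duplicate .bz2 copy. Applied here to a three-line sample; point the
# same awk at listOfTarArchives.txt to check the full 2.3 TB total.
total=$(awk '$3 == "GB" && $1 !~ /bz2$/ { s += $2 } END { print s }' <<'EOF'
enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM
enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM
EOF
)
echo "$total GB"    # sums only the .tar entries
```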
=Download the page statistics=

==Links of Interest==
* http://stats.grok.se/
* http://stats.grok.se/about
* http://dom.as/
* http://dumps.wikimedia.org/other/pagecounts-raw/
* http://dumps.wikimedia.org/other/pagecounts-raw/2013/
* Started downloading all the files from the link above to hadoop0:/media/dominique/3TB/mediawiki/statistics/
* wgetStats.sh
 #! /bin/bash
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
 ...
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
 wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000
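Rather than spelling out one wget line per hour, the URL list for a month can be generated with two loops. A sketch, assuming the common <code>-HH0000</code> timestamp form: some real dump files have odd second offsets (e.g. <code>-020001</code> above), so for a production run, matching the filenames from the index page is more robust.

```shell
#!/bin/bash
# Sketch: generate the 744 hourly pagecounts URLs for 2013-01.
# ASSUMES every file ends in -HH0000.gz; the actual dumps occasionally
# drift by a few seconds (e.g. -020001), which this pattern misses.
base=http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01
urls=""
for day in $(seq -w 1 31); do
    for hour in $(seq -w 0 23); do
        urls+="$base/pagecounts-201301${day}-${hour}0000.gz"$'\n'
    done
done
printf '%s' "$urls" > statsUrls.txt    # then: wget -i statsUrls.txt
```

Feeding the file to <code>wget -i</code> replaces the hand-written wgetStats.sh with a single invocation.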
----