Difference between revisions of "CSC352 Notes 2013"

From dftwiki3
 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
<onlydft>
 
<onlydft>
 +
<br />
 +
<center>
 +
<font color="red">
 +
'''See [[CSC352 2017 DT's Notes| this page]] for 2017 updated notes'''
 +
</font>
 +
</center>
 +
<br />
 +
__TOC__
 +
<br />
 +
=Resources 2013=
 +
==Rocco's Presentation 10/10/13==
 +
* libguides.smith.edu/content.php?pid=510405
 +
* idea:
 +
** for paper, start getting the thread, collage, packing, parallel image processing. 
 +
** approaches.
 +
** intro: what has been done in the field
 +
* Citation database: Web of Science
 +
* RefWorks & Zotero can help maintain citations
 +
* 5-College catalogs
 +
* WorldCat is the world catalog for books
 +
* Web of Science: gives information on references, and shows who is publishing in the field and which institutions are publishing in a given area.
 +
* Discover searches other databases.
 +
* Library Guide (Albany), super guide for libraries.
 +
* [http://VideoLectures.net videolectures.net]
 +
 +
<br />
 +
==Hadoop==
 +
<br />
 +
* [[CSC352_MapReduce/Hadoop_Class_Notes | DT's Class notes on Hadoop/MapReduce]]
 +
* [http://www.umiacs.umd.edu/~jimmylin/cloud-2008-Fall/index.html Cloud Computing notes from UMD] (2008, old)
  
=Resources 2013=
 
 
==On-Line==
 
==On-Line==
  
Line 27: Line 56:
 
<br />
 
<br />
 
==Some good references==
 
==Some good references==
 +
* Sounds of wikipedia: http://listen.hatnote.com/#nowelcomes,en
 +
 +
* Exhibition at Somerset House
 
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
 
<center>[[Image:The_Exhibition_Room_at_Somerset_House_by_Thomas_Rowlandson_and_Augustus_Pugin._1800.jpg|500px]]</center>
 
<br />
 
<br />
Line 70: Line 102:
 
* [http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf My notes from the Berkeley Boot-Camp on-line Workshop], Aug 2013.
 
* [http://cs.smith.edu/dftwiki/images/BerkeleyBootCampAug2013_DFTNotes.pdf My notes from the Berkeley Boot-Camp on-line Workshop], Aug 2013.
  
 +
=Update 2015: Downloading images to Hadoop0=
 +
<br />
 +
* Rsync from http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en/xxx where xxx is 0, 1, 2, ... 9, a, b, c, d, e, f.
 +
<br />
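The sixteen subdirectories named above can be enumerated with a short loop. This is a minimal sketch that only prints the directory URLs; the rsync invocation itself is left out because these notes do not give the mirror's rsync module name or the destination path on hadoop0. <code>BASE</code> and <code>image_dirs</code> are names introduced here for illustration.

```shell
#!/bin/bash
# Base URL exactly as given in the notes above
BASE="http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/en"

# Hypothetical helper: print one URL per subdirectory 0-9, a-f
image_dirs() {
    for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
        echo "$BASE/$d"
    done
}

image_dirs
```

Each printed line is one directory to mirror, so the output can be fed to whatever transfer command is actually used.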
 
=Downloading All Wikipedia Images=
 
=Downloading All Wikipedia Images=
 
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
 
* From [http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia]
Line 81: Line 117:
 
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
 
** Dumps are here: [ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/ ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/]
 
** The size of all the media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps.  Total = 2.3 TB.
 
** The size of all the media for 20121201 is 172 GB for the local dumps, and 2.153 TB for the remote dumps.  Total = 2.3 TB.
 +
enwiki-20121201-local-media-2.tar 22.5 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-3.tar 25.6 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-4.tar 21.5 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-5.tar 20.7 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-6.tar 22.4 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-7.tar 18.2 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-8.tar 24.4 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-local-media-9.tar 1.3 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-1.tar 89.9 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-10.tar 90.5 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-11.tar 88.2 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-12.tar 88.4 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-13.tar 89.6 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-14.tar 88.6 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-15.tar 91.2 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-16.tar 91.3 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-17.tar 89.4 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-18.tar 90.0 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-19.tar 90.0 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-2.tar 90.5 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-20.tar 90.1 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-21.tar 91.2 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-22.tar 89.3 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-23.tar 91.0 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-24.tar 44.3 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-24.tar.bz2 42.6 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-3.tar 88.6 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-4.tar 90.0 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-5.tar 90.9 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-6.tar 88.3 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-7.tar 89.6 GB 12/6/12 12:00:00 AM
 +
enwiki-20121201-remote-media-8.tar 90.4 GB 12/7/12 12:00:00 AM
 +
enwiki-20121201-remote-media-9.tar 89.7 GB 12/7/12 12:00:00 AM
 +
 +
* To get them, store the list above in a text file (listOfTarArchives.txt) and use wget:
 +
 +
# Fetch every .tar archive named in the list, skipping the .bz2 copy
 +
for i in $(awk '!/bz2/ {print $1}' listOfTarArchives.txt); do
 +
      echo "$i"
 +
      wget "ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i"
 +
done
 +
 +
* Total size should be 2.310 TB.
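The expected total can be cross-checked against the list file before downloading. This is a rough sketch, assuming every line of listOfTarArchives.txt has the form <code>name size GB date time AM</code> as in the listing above; <code>total_gb</code> is a hypothetical helper name, and any entry whose size is not in GB would be miscounted.

```shell
#!/bin/bash
# Hypothetical helper: sum the size column (second field, in GB) of the
# archive list, skipping entries whose name contains "bz2".
total_gb() {
    awk '$1 !~ /bz2/ { total += $2 } END { printf "%.1f\n", total }' "$1"
}

# Report the total only if the list file is present
if [ -f listOfTarArchives.txt ]; then
    echo "Total: $(total_gb listOfTarArchives.txt) GB"
fi
```

The same filter (no .bz2 entries) is used as in the download loop, so the sum covers the archives that loop would fetch.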
 +
 +
=Download the page statistics=
 +
 +
==Links of Interest==
 +
* http://stats.grok.se/
 +
* http://stats.grok.se/about
 +
* http://dom.as/
 +
* http://dumps.wikimedia.org/other/pagecounts-raw/
 +
* http://dumps.wikimedia.org/other/pagecounts-raw/2013/
 +
* started downloading all files from above link to hadoop0:/media/dominique/3TB/mediawiki/statistics/
 +
* wgetStats.sh
 +
#! /bin/bash
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
 +
...
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
 +
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000
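Because the hourly file names do not always end in 0000 (see pagecounts-20130101-020001.gz above), they cannot be generated reliably from the date alone. One alternative to listing every file is to let wget walk each monthly directory and keep only the matching names. The sketch below is a dry run that prints the commands instead of running them; it assumes the server exposes a plain directory index for each month, and <code>ROOT</code> and <code>month_url</code> are names introduced here for illustration.

```shell
#!/bin/bash
# Root of the hourly page-count dumps, as listed above
ROOT="http://dumps.wikimedia.org/other/pagecounts-raw"

# Hypothetical helper: URL of the directory holding one month of dumps
month_url() {
    printf '%s/%s/%s-%s/\n' "$ROOT" "$1" "$1" "$2"
}

# Dry run: print one recursive wget command per month of 2013.
# -r recurse, -np never ascend to the parent directory, -nd do not
# recreate the remote directory tree, -A keep only matching file names.
for m in 01 02 03 04 05 06 07 08 09 10 11 12; do
    echo "wget -r -np -nd -A 'pagecounts-*.gz,projectcounts-*' $(month_url 2013 "$m")"
done
```

Piping the output to bash from inside the target directory would run the downloads; printing first makes it easy to check the URLs.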
  
 
----
 
----

Latest revision as of 10:24, 8 December 2016


...