CSC352 2017 DT's Notes

Revision as of 09:24, 8 December 2016 by Thiebaut (talk | contribs)

--D. Thiebaut (talk) 13:25, 14 November 2016 (EST)


<onlydft>

2013 Notes

Public & Private Class Notes


TOC:

1 Resources 2013
1.1 Rocco's Presentation 10/10/13
1.2 Hadoop
1.3 On-Line
1.4 Papers
1.5 Art
1.6 Some good references
2 Misc. Topics
3 XSEDE.ORG
4 Update 2015: Downloading images to Hadoop0
5 Downloading All Wikipedia Images
6 Download the page statistics
6.1 Links of Interest
7 Resources 2010
8 Map-Reduce/Hadoop
8.1 Options for Setup
8.1.1 Xen Live CD
8.1.2 Setting up Hadoop using VmWare
8.2 Setting Up Hadoop and Eclipse on the Mac
8.2.1 Install Hadoop
8.2.2 Verify configuration of Hadoop
8.3 Setting up Eclipse for Hadoop
8.3.1 Map-Reduce Locations
8.3.2 DFS Locations
8.4 Create a new project with Eclipse
8.4.1 Project
8.5 Map/Reduce driver class
8.5.1 Running the Project
9 WordCount Example on Eclipse on Mac
9.1 Mapper
9.2 Reducer
9.3 Driver
9.4 Run WordCount Project
10 Notes on doing example in Yahoo Tutorial, Module 2



Threads

  • good example with multiple ping processes: [1]
  • multi-core not used by Python threads: in CPython the global interpreter lock lets only one thread run Python bytecode at a time [2]
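The two points above fit together: a thread that waits on an external process is blocked most of the time, so threads are still useful for this kind of work even when Python bytecode runs on one core at a time. The following is a minimal sketch of that pattern, not the linked example [1]. The names `run_command` and `ping_all` are hypothetical, and a child Python process that echoes the host name stands in for a real `ping` so the sketch runs without a network.

```python
import subprocess
import sys
import threading


def run_command(host, results, lock):
    """Run one child process for `host` and record its output."""
    # Assumption: a child Python process stands in for the real `ping` command.
    cmd = [sys.executable, "-c",
           "import sys; print('reply from ' + sys.argv[1])", host]
    out = subprocess.check_output(cmd).decode().strip()
    with lock:                      # the results dict is shared between threads
        results[host] = out


def ping_all(hosts):
    """Start one thread per host; each thread waits on its own child process."""
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=run_command, args=(h, results, lock))
               for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for host, reply in sorted(ping_all(["a.example", "b.example"]).items()):
        print(host, "->", reply)
```

For CPU-bound work the same structure would not speed things up under CPython; separate processes would be needed for that.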

Programs

Setting up documents and swish-e

(Note: there are two other alternatives: Sphinx and zend-lucene. Sphinx requires the data to be in XML form or in a MySQL database.)

   cd 
   cd Site/swish-e
   php swishe.php search=love
  ...
 <br>
 <br>rank:   20
 <br>score:  809
 <br>url:    http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Polyamory/Keys2LovingUnity.html
 <br>link:   <a href="http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Polyamory/Keys2LovingUnity.html">link</a>
 <br>file:   Keys2LovingUnity.html
 <br>offset: 47813
 <br>

Here, delay is the number of tenths of a second to wait. It is an upper bound: the true delay is random, between 0.1 sec and delay times 0.1 sec.
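The delay rule in the note above can be sketched in Python. The function name `random_delay` and the choice of a uniform distribution are assumptions for illustration; the notes do not show how the original script draws the delay.

```python
import random
import time


def random_delay(delay):
    """Return a wait time in seconds for `delay` given in tenths of a second.

    Assumption: the wait is drawn uniformly between 0.1 s and delay * 0.1 s,
    so `delay` acts as an upper bound, not as the exact wait.
    """
    if delay < 1:
        raise ValueError("delay must be an integer >= 1")
    return random.uniform(0.1, delay / 10.0)


if __name__ == "__main__":
    wait = random_delay(5)          # somewhere between 0.1 s and 0.5 s
    time.sleep(wait)
    print("waited %.2f s" % wait)
```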

Project

Project 1
Threading in Python: given two lists of keywords, List1 and List2, retrieve docs from a site (xgridmac.dyndns.org, yahoo, google) that respond/match List1. Filter the docs received and keep only those that contain most of the words in List2.
Project 2
XGrid: process a gzip xml dump of wikipedia and break it up into individual pages (9 million or so of them)!
Project 3
Map-Reduce: process wikipedia pages and create an index of words and their associated categories
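As a rough illustration of Project 1, the sketch below retrieves the documents that match List1 and keeps those containing most of the words in List2, with one thread per document. Everything named here is hypothetical: `FAKE_SITE`, `search` and `fetch` stand in for real queries to the site, and "most" is taken to mean more than half of List2, which the project description does not specify.

```python
import threading

# Hypothetical stand-in for the remote site: document id -> text.
FAKE_SITE = {
    "doc1": "love and unity are keys to a loving life",
    "doc2": "parallel programming with threads and love",
    "doc3": "hadoop map reduce tutorial",
}


def search(keyword):
    """Placeholder for a site query: ids of documents containing keyword."""
    return [doc for doc, text in FAKE_SITE.items() if keyword in text.split()]


def fetch(doc):
    """Placeholder for an HTTP request that downloads one document."""
    return FAKE_SITE[doc]


def matches_most(text, list2, threshold=0.5):
    """True if more than `threshold` of the words in list2 occur in text."""
    words = set(text.lower().split())
    hits = sum(1 for w in list2 if w.lower() in words)
    return hits > threshold * len(list2)


def worker(doc, list2, kept, lock):
    """Download one document and keep its id if it passes the List2 filter."""
    if matches_most(fetch(doc), list2):
        with lock:                  # `kept` is shared between threads
            kept.append(doc)


def filter_docs(list1, list2):
    """Retrieve documents matching List1, keep those matching most of List2."""
    docs = sorted({doc for kw in list1 for doc in search(kw)})
    kept = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(d, list2, kept, lock))
               for d in docs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(kept)


if __name__ == "__main__":
    print(filter_docs(["love"], ["threads", "parallel", "hadoop"]))
```

With a real site, `fetch` would block on the network, which is where running the downloads in separate threads pays off.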

Papers

Notes on a View from Berkeley paper


2017

Ideas

  • LaTeX still important
  • Last 2 weeks: presentations, 10 minutes each, on a subject that we didn't cover. Need a 10-minute presentation plus a 2-page prospectus.
    • GPU
    • Deep learning
    • Top 500. Why, what, what do we learn, lessons?
    • CUDA
    • OpenMP
    • Debugging parallel programs
    • TensorFlow
    • Vampir: trace analyzer tool
    • TotalView: debugger
  • C/C++ tutorial. Still good
  • Optimization options in C: -O, -O2, -O3

Resources

Spark

  • Apache Spark: A Unified Engine for Big Data Processing, Matei Zaharia et al. (pdf)





<onlydft>