Difference between revisions of "CSC352 Syllabus 2013"

From dftwiki3
(Created page with "--~~~~ ---- right |300px __TOC__ <center> Main Page | Syllabus | [[CSC352 Class Page 2013 | Schedule...")
 
(Instructor)
 
(17 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
--[[User:Thiebaut|D. Thiebaut]] ([[User talk:Thiebaut|talk]]) 10:55, 9 August 2013 (EDT)
 
--[[User:Thiebaut|D. Thiebaut]] ([[User talk:Thiebaut|talk]]) 10:55, 9 August 2013 (EDT)
 
----
 
----
[[Image:couldComputing.png | right |300px]]
+
{|
 +
|
 
__TOC__
 
__TOC__
 +
|
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 +
| align="right" |
 +
[[Image:CloudComputingCartoon2.jpg | right |400px]]
 +
|}
 +
<br />
 
<center>[[CSC352 2013 | Main Page]] | [[CSC352 Syllabus 2013 | Syllabus]] | [[CSC352 Class Page 2013 | Schedule]] |
 
<center>[[CSC352 2013 | Main Page]] | [[CSC352 Syllabus 2013 | Syllabus]] | [[CSC352 Class Page 2013 | Schedule]] |
[[CSC352 Resources 2013 | Links &amp; Resources]]</center><br />
+
[[CSC352_Class_Page_2013#Links_and_Resources | Links &amp; Resources]]</center><br />
 
<br />
 
<br />
  
=Prof=
+
=Instructor=
'''Dominique Thiébaut''' ([mailto:dthiebaut@smith.edu dthiebaut at smith.edu]) <br />
+
Prof. '''Dominique Thiébaut''' ([mailto:dthiebaut@smith.edu dthiebaut at smith.edu]) <br />
 
Dept. Computer Science <br />
 
Dept. Computer Science <br />
 
Ford Hall 356<br />
 
Ford Hall 356<br />
 
Telephone: 3854<br />
 
Telephone: 3854<br />
Office hours TBA and by appointments
+
Office hours: '''Wed 1-4 p.m.''' and by appointment
 +
<br />
  
 
=Introduction=
 
=Introduction=
  
Parallel and Distributed Processing (formerly Parallel Processing) is a seminar mixing theory and programming that explores the issues facing today's programmers who need to process data that either exists in large volume or is distributed over a network (local or the Internet).
+
Parallel and Distributed Processing (formerly Parallel Processing) is a seminar mixing '''theory''', '''programming''' and '''research'''.  It explores the issues facing today's programmers who need to process data that either exists in large volume or is distributed over a network (local or the Internet).
 +
 
 +
The course this semester focuses on a research problem and on formulating a solution, or an ''approach'', for it.  The problem is to take a large collection of images (possibly several million) and to create a ''visualization'' of these images that highlights some pattern or property attached to them.  This property could be a quantity linked to the popularity of the images as viewed by visitors, the frequency with which they are used, the date when they were introduced into the collection, the date an image was last viewed, etc.
  
The course this semester centers on a problem and on formulating a solution for it.  The problem is to take a large collection of images (possibly several million images) and create a collage of them where images are scaled according to some quantity (popularity, frequency of use, date of posting, date of last viewing, etc.).  The goal is to understand the various tasks required to process such a large amount of data, investigate various parallel computing resources, and test different approaches to solve this problem in an acceptable time (minutes or hours rather than days or years of computation!).
+
The goal of the seminar is to understand how to devise solutions for the various tasks required to process, catalog, sort, and display such a large amount of data.  In the process we will investigate various parallel computing tools, and test different approaches to solve this problem in an acceptable time (minutes or hours rather than days or years of computation!).
  
 
The class mixes lectures, the reading and presentation of research papers, and programming assignments/projects.
 
The class mixes lectures, the reading and presentation of research papers, and programming assignments/projects.
  
We discuss different levels of parallelism, uncovering it in processors and in operating systems, and sink our teeth into ''multi-threading'', which we explore with Java.   We learn about various theoretical approaches to safely share data (e.g. ''semaphores''), and the problems one can expect when not adopting a safe solution (''deadlocks'').
+
The topics planned for the semester include (but are not limited to):
 +
* An exploration of how artists have displayed collections of images in the art world, with one or more visits to the SCMA.
 +
* Using LaTeX to write scientific research papers.
 +
* Investigating the various types of parallel computers and parallel ''architectures''.
 +
* Learning about the different programming patterns for parallel programs.
 +
* Exploring ''data-sharing'' programming with '''Java threads''', and how to avoid data inconsistency and ''deadlocks'' with ''mutexes'', ''locks'' and ''semaphores''.
 +
* Exploring the ''message-passing'' paradigm with '''MPI'''.  There will be a quick introduction to '''C''' before using it with MPI.
 +
* Exploring the world of ''cloud computing'' with the ''Map-Reduce'' approach to process large amounts of data on large clusters of servers.   The infrastructure used will be ''Hadoop'', and the programming in ''Java''.
  
The final paradigm we visit is the processing of data in parallel on a grand scale and we learn about Google's ''Map-Reduce'' solution for processing large amount of textual data.  We study  ''Hadoop'', the open-source version of Map-Reduce on a local cluster of computers.   
+
A group project will cap the semester.  The goal of the project will be to address one of the many tasks discovered during the semester for visualizing the collection of images, devise a parallel approach for it, compare its performance to the current state of the art, and report the results in a research paper.
 
+
<br />
The goal of the semester's work is a project that includes a parallel program solving one of the problems associated with processing the collection of images, and the writing of a scientific paper describing the semester-long research using the ''LaTeX'' document formatting language.
+
=Newsletter=
 
 
=Class Notes=
 
 
 
Everybody will be responsible for transcribing the notes for the class and posting them on the wiki, in a rotation pattern (roughly once a month for each person in the class).
 
  
 +
Everybody will be responsible for generating a 2-page newsletter every other week. 
 +
<br />
 
=Homework assignments/Projects=
 
=Homework assignments/Projects=
There will be homework assignments and 3 projects.  The homework assignments will be used to create various solutions that will be included in the projects. 
 
  
There will be 3 projects, roughly one month apart, and capping the material covered in each section.  More details will be available as we go along.  The current project ideas are the following:
+
There will be homework assignments and a project.  The homework assignments will contribute to the advancement of the overall project.
;Project 1:
+
<br />
:Threading in Python: given two lists of keywords, List1 and List2, retrieve docs from a site (xgridmac.dyndns.org, yahoo, google) that respond/match List1.  Filter the docs received and keep only those that contain most of the words in List2.
+
=Piazza=
  
;Project 2:
+
On an experimental basis, we will use Piazza for on-line discussion of issues related to the class material.
:XGrid: process a gzip xml dump of wikipedia and break it up into individual pages (9 million or so of them)!
+
The system is geared toward getting you help quickly and efficiently from classmates and your instructor.   When a question is about an assignment, a software bug, or something the whole class could benefit from knowing about, you are encouraged to post it on Piazza.
 
 
;Project 3:
 
:Map-Reduce: process wikipedia pages and create an index of words and their associated categories
 
 
 
;Project 4:
 
:Setup of Cloud Cluster. Self-scheduled, lasting until Spring break. Teams of two students will setup a PC with Ubuntu and Hadoop and contribute to documentation ([http://cs.smith.edu/classwiki/index.php/CSC352_Hadoop_Cluster_Howto#Workstation_Setup Wiki Setup Page])
 
  
 +
Find our class page at: [https://piazza.com/smith/fall2013/csc352/home https://piazza.com/smith/fall2013/csc352/home], and its user guide [http://www.piazza.com/pdfs/piazza_product_introduction.pdf  here].
 +
<br />
 
=Smith Cloud=
 
=Smith Cloud=
  
6 PCs recovered from Burton Basement are waiting to be reincarnated as a networked cluster of Ubuntu machines running the Hadoop software.  Once initialized and connected together they will form Smith's first cloud computing platform.  One of the required projects for the class is for students to pair up in teams and each set up one of the computers, documenting the process in the class [http://cs.smith.edu/classwiki wiki].
+
We will use different computer clusters available on campus.  More information will be released as the course progresses.
 
+
<br />
 
=Presentations=
 
=Presentations=
  
We'll read, present and discuss papers during the semester.  Most papers are already posted on the [[CSC352 Resources | Links &amp; Resources]] page.  More information will be available as we proceed through the semester.
+
We'll read, present and discuss papers during the semester.  Papers will be posted on the [[CSC352 Resources 2013| Links &amp; Resources]] page.  More information will be available as we proceed through the semester.
  
 
<!--For the presenters, the following [http://www.cs.swarthmore.edu/~newhall/presentation.html page] from Prof. Tia Newall of Swarthmore College for good advice on preparing a presentation.-->
 
<!--For the presenters, the following [http://www.cs.swarthmore.edu/~newhall/presentation.html page] from Prof. Tia Newall of Swarthmore College for good advice on preparing a presentation.-->
  
Whenever a paper is scheduled for presentation or discussion, everybody not presenting the paper is responsible for handing out at the beginning of the class a one-page (possibly two pages) with a summary of the paper, in 3 parts:
+
Whenever a paper is scheduled for presentation or discussion, everybody not presenting it will be responsible for handing out, at the beginning of class, a one-page summary of the paper formatted in LaTeX.
* a one-sentence summary of the paper
+
<br />
* a one-paragraph summary of the paper
 
* a half-page summary of the paper.
 
 
 
 
=Prerequisites=
 
=Prerequisites=
  
Algorithms CSC252, or permission of the instructor.  A good knowledge of C and Java is important.
+
Algorithms CSC252, or permission of the instructor.  A good knowledge of Java is important.
 
+
<br />
 
=Schedule=
 
=Schedule=
  
The class meets twice a week, on Tuesdays and Thursdays, 10:30 am - 11:50 am, in Ford Hall 342.
+
The class meets twice a week, on Tuesdays and Thursdays, 1:00-2:50 p.m., in '''Ford Hall 345'''.
 
+
<br />
 
=Textbook=
 
=Textbook=
  
 
There are no textbooks for this course.  The Web has a rich collection of documents we'll be using and which are catalogued on the [[CSC352 Resources | Links &amp; Resources]] page.
 
There are no textbooks for this course.  The Web has a rich collection of documents we'll be using and which are catalogued on the [[CSC352 Resources | Links &amp; Resources]] page.
 
+
<br />
 
=Other Sources of Material=
 
=Other Sources of Material=
  
 
The science library has a good collection of books on parallel processing and algorithms that you might find useful for supplementing the material presented and covered in class.  "Parallel algorithm", "Parallel Programming," or "Grid Computing" are good keywords to start a search on.
 
The science library has a good collection of books on parallel processing and algorithms that you might find useful for supplementing the material presented and covered in class.  "Parallel algorithm", "Parallel Programming," or "Grid Computing" are good keywords to start a search on.
 
+
<br />
 
=Lateness Policy=
 
=Lateness Policy=
  
No late assignment/paper summary/project will be accepted (except in case of documented illness or personal difficulties).
+
No late assignment/paper summary/project will be accepted (except in case of ''documented'' illness or personal difficulties).
 
Do your work on time!
 
Do your work on time!
  
 
<font color="red">You can, however, drop any one homework assignment and any one reading assignment without penalty.</font>  If you do not drop any assignment and do not drop any assigned reading, I will remove the ones with the lowest grade automatically.
 
<font color="red">You can, however, drop any one homework assignment and any one reading assignment without penalty.</font>  If you do not drop any assignment and do not drop any assigned reading, I will remove the ones with the lowest grade automatically.
 
+
<br />
 
=Grading=
 
=Grading=
  
You can pick between 3 options for the final grade.  If you do not make your choice known <font color="red">before the last day of class</font>, Option 1, the original grading option, will be used.
 
 
==Option 1==
 
  
 
{|
 
{|
Line 96: Line 100:
 
Class participation (summaries, class notes, discussion) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
 
Class participation (summaries, class notes, discussion) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
 
Homework <br />
 
Homework <br />
Projects (equal weight for all 3)<br />
+
Project<br />
 
Paper presentations &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
 
Paper presentations &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
 
|
 
|
Line 105: Line 109:
 
|}
 
|}
  
==Option 2==
+
<br />
{|
 
|
 
Class participation (summaries, class notes, discussion) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
 
Homework <br />
 
Projects (equal weight for all 3)<br />
 
Paper presentations &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
 
|
 
15% <br />
 
20% <br />
 
50% <br />
 
15%
 
|}
 
 
 
==Option 3 ==
 
Option 3 is the same as Option 1 but with more weight for Project 3.
 
 
 
{|
 
|
 
Class participation (summaries, class notes, discussion) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br />
 
Homework <br />
 
Project 1 <br />
 
Project 2 <br />
 
Project 3 <br />
 
Paper presentations &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
 
|
 
10% <br />
 
15% <br />
 
10% <br />
 
10% <br />
 
40% <br />
 
15%
 
|}
 
 
 
 
=Teaching Assistants=
 
=Teaching Assistants=
  
 
No TA for this class.
 
No TA for this class.
 
+
<br />
 +
<br />
 +
=Software=
 +
Below is a non-exhaustive list of software packages we'll use in the class.  You may want to investigate installing them on your computer. 
 +
* Java and [http://www.eclipse.org/ Eclipse].  All serious programmers should know how to use Eclipse, and should have it installed on their computer.  One advantage of having Eclipse is that it supports Processing with very little additional effort (see [[Tutorials#Processing_and_Eclipse | these tutorials]] for examples of how to set this up).
 +
* Latex for writing scientific papers.  [http://texstudio.sourceforge.net/ TexStudio] is a good visual editor, but there is also a nice on-line editor at [https://www.sharelatex.com/ sharelatex.com] that does not require any installation and works well.
 +
* [http://matplotlib.org/users/intro.html MatPlotLib] for processing data and generating graphs.
 +
* [http://www.mpich.org/ MPI], the Message-Passing Interface platform for parallel programs.  It is installed on beowulf and beowulf2, but you may like to also have it on your computer, although it is not necessary.
 
<br />
 
<br />
 
<br />
 
<br />

Latest revision as of 12:46, 2 September 2013

--D. Thiebaut (talk) 10:55, 9 August 2013 (EDT)


                             

CloudComputingCartoon2.jpg


Main Page | Syllabus | Schedule | Links & Resources


Instructor

Prof. Dominique Thiébaut (dthiebaut at smith.edu)
Dept. Computer Science
Ford Hall 356
Telephone: 3854
Office hours: Wed 1-4 p.m. and by appointment

Introduction

Parallel and Distributed Processing (formerly Parallel Processing) is a seminar mixing theory, programming and research. It explores the issues facing today's programmers who need to process data that either exists in large volume or is distributed over a network (local or the Internet).

The course this semester focuses on a research problem and on formulating a solution, or an approach, for it. The problem is to take a large collection of images (possibly several million) and to create a visualization of these images that highlights some pattern or property attached to them. This property could be a quantity linked to the popularity of the images as viewed by visitors, the frequency with which they are used, the date when they were introduced into the collection, the date an image was last viewed, etc.

The goal of the seminar is to understand how to devise solutions for the various tasks required to process, catalog, sort, and display such a large amount of data. In the process we will investigate various parallel computing tools, and test different approaches to solve this problem in an acceptable time (minutes or hours rather than days or years of computation!).

The class mixes lectures, the reading and presentation of research papers, and programming assignments/projects.

The topics planned for the semester include (but are not limited to):

  • An exploration of how artists have displayed collections of images in the art world, with one or more visits to the SCMA.
  • Using LaTeX to write scientific research papers.
  • Investigating the various types of parallel computers and parallel architectures.
  • Learning about the different programming patterns for parallel programs.
  • Exploring data-sharing programming with Java threads, and how to avoid data inconsistency and deadlocks with mutexes, locks and semaphores.
  • Exploring the message-passing paradigm with MPI. There will be a quick introduction to C before using it with MPI.
  • Exploring the world of cloud computing with the Map-Reduce approach to process large amounts of data on large clusters of servers. The infrastructure used will be Hadoop, and the programming in Java.
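
As an illustration of the data-sharing topic in the list above, here is a minimal sketch of several Java threads updating one shared variable under a lock. It is not taken from the course material: the class name SharedCounterDemo, the method countWithLock, and the thread and iteration counts are all hypothetical, chosen only for the example.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example class: several threads increment one shared counter.
public class SharedCounterDemo {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static long counter = 0;

    // Starts numThreads threads that each add 1 to the shared counter
    // incrementsPerThread times, waits for all of them, and returns the total.
    public static long countWithLock(int numThreads, int incrementsPerThread) {
        counter = 0;
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    LOCK.lock();          // only one thread at a time past this point
                    try {
                        counter++;        // read-modify-write on shared data
                    } finally {
                        LOCK.unlock();    // always release, even if an exception occurs
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();            // wait for every worker to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countWithLock(4, 100_000)); // prints 400000
    }
}
```

Without the lock, two threads can read the same old value of the counter and both write back the same new value, so the final total may come out lower than expected. That is the kind of data inconsistency the topic refers to.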

A group project will cap the semester. The goal of the project will be to address one of the many tasks discovered during the semester for visualizing the collection of images, devise a parallel approach for it, compare its performance to the current state of the art, and report the results in a research paper.
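
To give a feel for what a parallel approach to one such task might look like, here is a small map/reduce-style sketch in plain Java. It does not use the Hadoop API, and everything in it is an assumption made for illustration: the class ViewCountSketch, the method countViews, and the idea of a log holding one image name per view.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical example class: counts how many times each image was viewed.
public class ViewCountSketch {
    // "Map" step: each log entry is keyed by its image name.
    // "Reduce" step: entries with the same key are counted.
    // parallelStream() lets the runtime split the list across worker threads.
    public static Map<String, Long> countViews(List<String> viewLog) {
        return viewLog.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up view log: one entry per view of an image.
        List<String> viewLog = Arrays.asList(
                "img1.jpg", "img2.jpg", "img1.jpg", "img3.jpg", "img1.jpg");
        System.out.println(countViews(viewLog));
    }
}
```

The resulting counts could then drive a visualization, for instance by scaling each image according to its popularity. On a real collection of several million images the same two-step structure would run on a cluster rather than inside a single program.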

Newsletter

Everybody will be responsible for generating a 2-page newsletter every other week.

Homework assignments/Projects

There will be homework assignments and a project. The homework assignments will contribute to the advancement of the overall project.

Piazza

On an experimental basis, we will use Piazza for on-line discussion of issues related to the class material. The system is geared toward getting you help quickly and efficiently from classmates and your instructor. When a question is about an assignment, a software bug, or something the whole class could benefit from knowing about, you are encouraged to post it on Piazza.

Find our class page at https://piazza.com/smith/fall2013/csc352/home, and its user guide at http://www.piazza.com/pdfs/piazza_product_introduction.pdf.

Smith Cloud

We will use different computer clusters available on campus. More information will be released as the course progresses.

Presentations

We'll read, present and discuss papers during the semester. Papers will be posted on the Links & Resources page. More information will be available as we proceed through the semester.


Whenever a paper is scheduled for presentation or discussion, everybody not presenting it will be responsible for handing out, at the beginning of class, a one-page summary of the paper formatted in LaTeX.

Prerequisites

Algorithms CSC252, or permission of the instructor. A good knowledge of Java is important.

Schedule

The class meets twice a week, on Tuesdays and Thursdays, 1:00-2:50 p.m., in Ford Hall 345.

Textbook

There are no textbooks for this course. The Web has a rich collection of documents we'll be using and which are catalogued on the Links & Resources page.

Other Sources of Material

The science library has a good collection of books on parallel processing and algorithms that you might find useful for supplementing the material presented and covered in class. "Parallel Algorithms," "Parallel Programming," and "Grid Computing" are good keywords to start a search with.

Lateness Policy

No late assignment/paper summary/project will be accepted (except in case of documented illness or personal difficulties). Do your work on time!

You can, however, drop any one homework assignment and any one reading assignment without penalty. If you do not drop any assignment and do not drop any assigned reading, I will remove the ones with the lowest grade automatically.

Grading

Class participation (summaries, class notes, discussion): 10%
Homework: 15%
Project: 60%
Paper presentations: 15%


Teaching Assistants

No TA for this class.

Software

Below is a non-exhaustive list of software packages we'll use in the class. You may want to investigate installing them on your computer.

  • Java and Eclipse. All serious programmers should know how to use Eclipse, and should have it installed on their computer. One advantage of having Eclipse is that it supports Processing with very little additional effort (see these tutorials for examples of how to set this up).
  • LaTeX for writing scientific papers. TexStudio is a good visual editor, but there is also a nice on-line editor at sharelatex.com that does not require any installation and works well.
  • MatPlotLib for processing data and generating graphs.
  • MPI, the Message-Passing Interface platform for parallel programs. It is installed on beowulf and beowulf2, but you may like to also have it on your computer, although it is not necessary.