CSC352 Homework 2

This assignment is due on 2/23/10 at midnight (Tuesday 24:00/Wednesday 00:00). You can work individually or in pairs.




Problem #1

Take the multiprocessing version of the N-Queens program we saw in class and fix it. As we discussed, it is currently inefficient: it does not need the foundOne boolean, and it has to wait for all the child processes to finish before it can finish executing. Instead, change the code so that all the processes are "told" to stop when one solution is found. This way the parent process will not have to wait around.
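Here is one possible shape for such a fix, a minimal sketch only: a shared multiprocessing.Event (one workable mechanism among several; p.terminate() is another) lets the first process that finds a solution tell its siblings to stop. The loop body is a stand-in for the real N-Queens search, which you will replace with your own logic:

 # Sketch only: a shared Event tells all child processes to stop as soon
 # as one of them finds a solution.  The loop body is a stand-in for the
 # real N-Queens search.
 import multiprocessing, random, time
 
 def worker( n, col, stopEvent, solutions ):
     for step in xrange( 100000 ):
         if stopEvent.is_set():              # somebody else found a solution
             return
         time.sleep( 0.001 )                 # stand-in for real search work
         if random.random() < 0.001:         # stand-in for "solution found"
             solutions.put( (col, step) )    # report the solution...
             stopEvent.set()                 # ...and tell the siblings to stop
             return
 
 if __name__ == '__main__':
     n = 10
     stopEvent = multiprocessing.Event()
     solutions = multiprocessing.Queue()
     procs = [ multiprocessing.Process( target=worker,
                                        args=(n, col, stopEvent, solutions) )
               for col in range( n ) ]
     for p in procs: p.start()
     print solutions.get()                   # blocks until the first solution
     stopEvent.set()                         # in case the finder did not set it
     for p in procs: p.join()                # children exit quickly now

The parent blocks on the queue only until the first solution arrives, and the join() calls return quickly because every child checks the event and exits.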

Note that the multiprocessing module requires Python 2.6 or later, so make sure the machine you are using has Python 2.6 installed. Beowulf does, and so do most of the Macs in Ford Hall. To get the version of Python, just type python -V (or python --version) on the command line.

Once your program works, measure its best of four execution times and plot them for N=10 to 25. In other words, run it 4 times for N=10 and keep the best time, 4 times for N=11 and keep the best time, etc. Plot the best times obtained.
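A minimal timing harness, assuming your solver is wrapped in a function solve(n) (a hypothetical name, not part of the original program):

 # Sketch: best-of-4 wall-clock times for N = 10 to 25.
 # solve(n) is a hypothetical stand-in for your N-Queens driver.
 import time
 
 def bestOfFour( solve, n, runs=4 ):
     best = None
     for run in range( runs ):
         start = time.time()
         solve( n )
         elapsed = time.time() - start
         if best is None or elapsed < best:
             best = elapsed
     return best
 
 # for n in range( 10, 26 ):
 #     print n, bestOfFour( solve, n )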

You may find the following page useful for generating the list of best times.

Compare these execution times to those you obtained for your improved threaded versions.

Put the listing and your comments in a PDF file and submit it in your CSC352 account as follows:

   submit hw2 hw2a.pdf


Problem #2

This problem will get you started on Project #1 for this class.

The main idea is that every so often you may find yourself in a situation where, given some information, you would like to have at hand the most pertinent Web pages dealing with it. I don't mean that you simply go to Google and search for some keywords of interest. An example will illustrate my point.

Recently the blogs and news outlets started talking about Google Buzz, an attempt by Google to enter the world of social networking. Wouldn't it be nice if we had a system that would

  1. check the most frequent words appearing in some technology-related blogs,
  2. automatically collect the Web pages in Google that deal mostly with these concepts, and
  3. make a list of them and display them as the most relevant pages dealing with the new hot topic of the day?


That's what we are going to work on.

But we'll start small.

First, a note about Google and Yahoo. The problem is that if we start writing and debugging a program that probes Google or Yahoo constantly, they will soon discover the non-human nature of the probe and block the IP address the requests come from. We could get a special key from Google that allows programs to access their site, but the free version of the key allows only very limited access. So, instead, I have created a mini repository of about 60,000 text files taken from public Web sites, indexed them all with swish-e, and put them on one of my computers.


This section is only visible to computers located at Smith College

The system I have set up does not provide a Web page with a form for searching. Everything is done from the URL bar: you type the URL shown above in the address box of your browser, put the word or words of interest at the end of the URL, after the question mark, and press Enter. And voila! Swish-e will return a list of the 20 files it has indexed that best match the search terms (this number can be changed if we want).
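For instance, assuming the search is served by the same swishe.php script used in the Python example further down (the exact URL above is only visible on campus), a search for love would look something like this in the address bar:

 http://xgridmac.dyndns.org/~thiebaut/swish-e/swishe.php?search=love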

Try it! Make sure you check different search words, too.

The goal of this assignment is for you to write a multithreaded or multiprocessing Python program (pick whichever you want to play with) that will

  1. ask the user for a keyword or keywords,
  2. request the keyword(s) from the URL above,
  3. get back the list of 20 (or fewer) links,
  4. fetch all 20 (or fewer) files from the site (hence the parallelism), and
  5. show the user the first occurrence of the keyword in each file, with a few words before and a few words after, to create a context.

For example, assume that the user, inspired by Valentine's Day, requests a search for "love". Here is what your program will print:


  search term(s): love
  --------------------------------------------
  ...
  LYSANDER | | in love with Hermia.
  DEMETRIUS | PHILOSTRATE master of the
  ...

Python Solutions

Fetching a URL in Python

 import urllib, urllib2
 
 MAINURL = "http://xgridmac.dyndns.org/~thiebaut/swish-e/swishe.php"
 VERSION = """Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11)
  Gecko/20071127 Firefox/2.0.0.11"""
 HEADERS = { "User-Agent": VERSION }
 
 # ask the swish-e server to search its index for the keyword(s)
 url = MAINURL
 keyword = raw_input( "Search? " )
 args = { "search": keyword }
 req = urllib2.Request( url, urllib.urlencode( args ), HEADERS )
 f = urllib2.urlopen( req )
 print f.read()
 
 # fetch one of the indexed text files directly (no search arguments)
 url = "http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/poetry/sonnets.txt"
 args = {}
 req = urllib2.Request( url, urllib.urlencode( args ), HEADERS )
 f = urllib2.urlopen( req )
 print f.read()

A lot of work done with just a few lines of Python! The program above probes the Web server, asks it to search the swish-e database for the keyword love, gets back the list of 20 files, and prints the whole list on the screen.
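The fetches above happen one after the other. For the parallel part of the assignment, here is one possible shape, a sketch only, using threads; it assumes a urls list such as the one produced by the parsing step in the next section:

 # Sketch: fetch several of the returned files in parallel, one thread each.
 import threading, urllib2
 
 def fetch( url, results, index ):
     # each thread stores the text of its page in its own slot of results
     try:
         results[ index ] = urllib2.urlopen( url ).read()
     except urllib2.URLError:
         results[ index ] = None
 
 urls = [ "http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/poetry/sonnets.txt",
          "http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Gaia/gaia4.8.txt" ]
 results = [ None ] * len( urls )
 threads = [ threading.Thread( target=fetch, args=(url, results, i) )
             for i, url in enumerate( urls ) ]
 for t in threads: t.start()
 for t in threads: t.join()      # wait until every fetch has completed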

Parsing the resulting list

The information sent back by the server will be of the form:


 <?xml version="1.0" encoding="ISO-8859-1"?>
 <br>rank:  1
 <br>score:  1000
 <br>url:    http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/poetry/sonnets.txt
 <br>link:  <a href="http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/poetry/sonnets.txt">link</a>
 <br>file:  sonnets.txt
 <br>offset: 95662
 <br>
 <br>rank:  2
 <br>score:  1000
 <br>url:    http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Gaia/gaia4.8.txt
 <br>link:  <a href="http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Gaia/gaia4.8.txt">link</a>
 <br>file:  gaia4.8.txt
 <br>offset: 89988
 <br>
 ...

Here's a simple way to extract the URL from each line that starts with "<br>url: http://...":

 # assume lines is a list of all the lines received from the server;
 # collect every URL so that all the files can be fetched afterward
 urls = []
 for line in lines:
     if line.find( "<br>url:" ) != -1:
         urls.append( line.split( "url:" )[1].strip() )

Assignment

Write your program (called hw2b.py) so that you can feed it the search term(s) from the command line:


        python hw2b.py  love

        fetching url: http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/poetry/sonnets.txt
        Or who is he so fond will be the tomb Of
        his self-love, to stop posterity? Thou
        art thy mother's glass, and

        fetching url: http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Gaia/gaia4.8.txt
        our disease, let us now turn to the
        inner structure of love and its intimate
        connection with space. As

        ...
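A minimal way to pick up the search term(s) from the command line, matching the usage shown above:

 # Sketch: read the search term(s) passed to hw2b.py on the command line.
 import sys
 
 if len( sys.argv ) < 2:
     print "usage: python hw2b.py keyword [keyword ...]"
     sys.exit( 1 )
 keyword = " ".join( sys.argv[1:] )   # e.g. "love"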

And make the program output some context for the search term(s), i.e. a few words before and a few words after, as sketched below.
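One possible sketch of such a context extractor; the function name context and the window of 5 words on each side are my own choices, not part of the assignment:

 # Sketch: first occurrence of the keyword with a few words on either side.
 def context( text, keyword, width=5 ):
     words = text.split()
     for i, word in enumerate( words ):
         if keyword.lower() in word.lower():
             return " ".join( words[ max(0, i-width) : i+width+1 ] )
     return None
 
 # print context( pageText, "love" )   # pageText: the text of a fetched file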

Submission

 submit hw2 hw2b.py