CSC352 Homework 2

From dftwiki3
Revision as of 10:17, 16 February 2010 by Thiebaut (talk | contribs) (Python Solutions)
Jump to: navigation, search

This assignment is due on 2/18/10 at midnight (Thursday 24:00/Friday 00:00). You can work individually or in pairs.




Problem #1

Take the multiprocessing version of the N-Queens program we saw in class and fix it. It is currently inefficient, as we saw in class. In particular, it does not need the foundOne boolean, and it has to wait for all the child processes to finish before it can finish executing. Instead, change the code so that all the processes are "told" to stop when one solution is found. This way the parent process will not have to wait around.

Once your program works, measure its best of four execution times and plot it for N=10 to 25. In other words, run it 4 times for N=10, and keep the best time. 4 times for N=11 and keep the best time, etc. Plot the best times obtained.

You may find the following page useful for generating the list of best times.

Compare these execution times to those you obtained for your improved threaded versions.

Put the listing and your comments in a pdf and submit it in your CSC352 account as follows:

   submit hw2 hw2a.pdf


Problem #2

This problem will get you started on Project #1 for this class.

The main idea is that every so often, you might be in a situation where given some information, you would like to have available the most pertinent Web pages dealing with that information. I don't mean that you just go to Google and search for some keywords of interest. An example will illustrate my point.

Recently the blogs and news outlets started talking about Google Buzz, an attempt by Google to enter the world of social networking. Wouldn't it be nice if we had a system that would

  1. check the most frequent words appearing in some technology-related blogs,
  2. collect automatically the Web pages in Google that talk mostly of these concepts, and
  3. make a list of them and display them as most relevant pages dealing with the new hot topic of the day?


That's what we are going to work on.
But we'll start small.
First, a note about Google and Yahoo. The problem is that if we start writing/debugging a program that probes Google or Yahoo constantly, they will soon discover the non-human nature of the probe and will block the IP address of where the requests come from. We could get a special key from Google to allow us to run programs that will access their site, but the free version of the key allows for very limited access. So, instead, I have created a mini repository of text files, about 60,000 of them, taken from public Web sites, and I have indexed all of them with swish-e, and put them on one of my computers.


This section is only visible to computers located at Smith College

The system I have setup does not provide a Web page with a form to allow you to search. All is done from the URL bar. You write the URL shown above in the address box of your browser, then you put the word or words of interest at the end of the URL, after the question mark, and then you press Enter. And voila! Swish-e will return to you a list of 20 (this number can be changed if we want) files that it has indexed that best match the search terms.

Try it! Make sure you check different search words, too.

The goal of this assignment is for you to write a multithreaded or multiprocessing Python program (pick what you want to play with) that will

  1. ask the user for a keyword or keywords,
  2. will request the keyword(s) from the URL above,
  3. will get back the list of 20 (or fewer) links, and
  4. will fetch all 20 (or fewer) files from the site (hence the parallelism), and
  5. show the user the first occurence of the keyword in each file, with a few words before and a few words after, to create a context.

For example, assuming that the user, inspired by Valentine's Day, requests a search for "love". Here is what your program will print:


This section is only visible to computers located at Smith College

Python Solutions

Fetching a URL in Python

.

import urllib, urllib2

MAINURL = "http://xgridmac.dyndns.org/~thiebaut/swish-e/swishe.php"
VERSION = """Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11)
 Gecko/20071127 Firefox/2.0.0.11"""
HEADERS       = {"User-Agent"  : VERSION }

url = MAINURL
keyword = raw_input( "Search? " )
args = { "search" : keyword }
# print "args = ", str( args )
req = urllib2.Request( url,  urllib.urlencode( args ), HEADERS )
f = urllib2.urlopen( req )
print f.read()

.

A lot of work done with just a few lines of Python! The program above will probe the Web server, ask it to search the swish-e database for the keyword love, will get back the list of 20 files returned by the Web server, and will print the whole list on the screen.

Parsing the resulting list

The information sent back by the server will be of the form:


This section is only visible to computers located at Smith College

Here's a simple way for extracting the url from the line that starts "<br>url: http://..."

.

# assume lines is a list of all the lines received
for line in lines:
    if line.find( "<br>url:" ) != -1:
          url =  line.split( "url:" )[1].strip()
          
.

Assignment

Write your program (called hw2b.py) so that you can feed it the search term(s) from the command line:


This section is only visible to computers located at Smith College

And make the program output some context for the search term, i.e. a few words before and after the search term or search terms.

Submission

 submit hw2 hw2b.py