CSC352 Homework 2
Revision as of 19:38, 11 February 2010
This assignment is due on 2/18/10.
Problem #1
Take the multiprocessing version of the N-Queens program we saw in class and fix it: as we discussed, it is currently inefficient. In particular, it does not need the foundOne boolean, and the parent has to wait for all the child processes to finish before it can exit. Instead, change the code so that all the processes are "told" to stop as soon as one solution is found. This way the parent process will not have to wait around.
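One way to signal every worker to stop is a shared multiprocessing.Event. The sketch below is only an illustration of that idea, not the required solution: the worker body is a placeholder (a real worker would search its subtree of the N-Queens space), and all the names are invented.

```python
import multiprocessing

def worker(n, first_col, found):
    """Placeholder worker: search the subtree with queen 0 in first_col.
    Here we simply pretend a solution was found in this subtree."""
    if not found.is_set():
        # ... exhaustive search of this subtree would go here ...
        found.set()          # announce to everyone that a solution exists

def solve(n):
    found = multiprocessing.Event()              # shared "stop" flag
    procs = [multiprocessing.Process(target=worker, args=(n, c, found))
             for c in range(n)]
    for p in procs:
        p.start()
    found.wait()             # parent resumes as soon as any worker succeeds
    for p in procs:
        p.terminate()        # tell the remaining workers to stop
        p.join()
    return found.is_set()

if __name__ == "__main__":
    print(solve(8))
```

Note that the parent blocks only until the first solution is announced, then reaps all children immediately, which is exactly the behavior the problem asks for.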
Once your program works, measure its best-of-four execution times for N = 10 to 25 and plot them. In other words, run it 4 times for N=10 and keep the best time, 4 times for N=11 and keep the best time, and so on. Plot the best times obtained.
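The best-of-four measurement is easy to wrap in a small helper; in the commented usage line, `solve_nqueens` is a hypothetical name standing in for your own solver.

```python
import time

def best_of_four(func, *args):
    """Run func four times with the given arguments and return the
    fastest wall-clock time observed, in seconds."""
    times = []
    for _ in range(4):
        start = time.time()
        func(*args)
        times.append(time.time() - start)
    return min(times)

# Hypothetical use with a solver named solve_nqueens:
# best = {n: best_of_four(solve_nqueens, n) for n in range(10, 26)}
```

Taking the minimum rather than the average is deliberate: the fastest run is the one least perturbed by other activity on the machine.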
You may find the following page useful for generating the list of best times.
Compare these execution times to those you obtained for your improved threaded versions.
Put the listing and your comments in a PDF and submit it in your CSC352 account as follows:
submit hw2 hw2a.pdf
Problem #2
This problem will get you started on Project #1 for this class.
The main idea is that every so often you might be in a situation where, given some information, you would like to have at hand the most pertinent Web pages dealing with it. I don't mean simply going to Google and searching for some keywords of interest. An example will illustrate the point.
Recently the blogs and news outlets started talking about Google Buzz, an attempt by Google to enter the world of social networking. Wouldn't it be nice if we had a system that would
- find the most frequent words appearing in some technology-related blogs,
- automatically collect the Web pages in Google that deal mostly with these concepts, and
- make a list of them?
That's what we are going to work on.
But we'll start small.
First, Google. Or Yahoo. The problem is that if we start writing and debugging a program that probes Google or Yahoo constantly, they will soon detect the non-human nature of the requests and block the IP address they come from. We could get a special key from Google allowing our programs to access their site, but the free version of the key permits only very limited access. So instead, I have created a mini repository of text files, about 60,000 of them, taken from public Web sites; I have indexed all of them with swish-e and put them on one of my computers.
There is no Web page with a search form. Instead, you put the word or words of interest at the end of the URL, plug the URL into your browser, and voilà: swish-e returns a list of the 20 indexed files (this number can be changed if we want) that best match the search terms.
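Appending the search words to the URL is a one-liner in Python once the words are URL-encoded. In the sketch below, `BASE` is a made-up placeholder; substitute the actual swish-e URL given for the assignment.

```python
from urllib.parse import quote_plus

# BASE is hypothetical: use the actual swish-e search URL from the assignment.
BASE = "http://example.edu/cgi-bin/swish.cgi?search="

def make_query_url(keywords):
    """Join the keyword(s) with spaces, URL-encode them, and append
    them to the end of the base search URL."""
    return BASE + quote_plus(" ".join(keywords))

print(make_query_url(["love"]))
print(make_query_url(["google", "buzz"]))
```

quote_plus takes care of spaces and special characters, so multi-word searches are safe to paste into a browser as well.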
Try it! Make sure you check different search words, too.
The goal of this assignment is for you to write a multithreaded or multiprocessing Python program that asks the user for one or more keywords, sends the keyword(s) to the URL above, gets back the list of 20 (or fewer) links, fetches all 20 (or fewer) files from the site (hence the parallelism), and displays to the user the first occurrence of the keyword in each file, with a few words before and a few words after.
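A possible skeleton for the fetch-and-display part is sketched below, assuming one thread per file. The function names are invented, and only the context extraction is meant as-is; the fetching and output format are yours to design.

```python
import threading
from urllib.request import urlopen   # fetching is sketched, not exercised here

def first_occurrence_context(text, keyword, n_words=5):
    """Return the first occurrence of keyword with n_words of context
    on each side, or None if the keyword does not appear."""
    words = text.split()
    for i, w in enumerate(words):
        if keyword.lower() in w.lower():
            lo = max(0, i - n_words)
            return " ".join(words[lo:i + n_words + 1])
    return None

def fetch_and_report(url, keyword):
    """One thread per file: fetch it and print the keyword in context."""
    text = urlopen(url).read().decode("latin-1", errors="replace")
    print("-" * 44)
    print("fetching url:", url)
    print(first_occurrence_context(text, keyword))

def search(urls, keyword):
    """Fetch every file in parallel, one thread each."""
    threads = [threading.Thread(target=fetch_and_report, args=(u, keyword))
               for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Matching with `keyword.lower() in w.lower()` means "self-love," counts as a hit for "love", which is consistent with the sample output below.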
For example, assume that the user, inspired by Valentine's Day, requests a search for "love". Here is what your program would print:
search term(s): love
--------------------------------------------
fetching url: http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/poetry/sonnets.txt
Or who is he so fond will be the tomb Of
his self-love, to stop posterity? Thou
art thy mother's glass, and
--------------------------------------------
fetching url: http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Gaia/gaia4.8.txt
our disease, let us now turn to the
inner structure of love and its intimate
connection with space. As
--------------------------------------------
fetching url: http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Religious_357/Gaia/gaia5.8.txt
our disease, let us now turn to the
inner structure of love and its intimate
connection with space. As
--------------------------------------------
fetching url: http://xgridmac.dyndns.org/~thiebaut/www_etext_org/Fiction_315/Shakespeare/comedies/midsummersnightsdream.txt
of Athens. EGEUS father to Hermia.
LYSANDER | | in love with Hermia.
DEMETRIUS | PHILOSTRATE master of the