CSC352 Project 1
Latest revision as of 23:09, 15 February 2010
This project is tentatively due on March 2nd. You can work individually or in pairs. If you work in pairs, you must spend the majority of the time working together on the programming and analysis.
Introduction
This project is an extension of Problem 2 of Homework #2, but with a few differences.
- You need to implement both the multiprocessing and threaded versions of the program (you can use the solution program(s) that will be made available as soon as Homework 2 is graded). Note that the multiprocessing module was introduced in Python 2.6, so it will not work with earlier versions. Python 2.6 should be available on most Macs in Ford Hall.
- The program will analyze each document received and will remove all the stop words, then compute the frequency of occurrence of each word and rank the words by frequency, from most frequent to least frequent.
- The program will assign a score to the document that is the inverse of the ranking of the search term in the list of ordered words. For example, if you search for "match" and the list of most frequent words is [ "document", "perform", "english", "text", "match", "score",...], then the score of the document will be 1/5, since "match" is in fifth position in the list.
- The program will sort the documents by decreasing scores.
- Your program will output the documents in a way similar to that of Homework 2 (with a context around the search term), and will indicate the score of each document.
- You are free to decide how to format the output. It should show at least:
  - the search term
  - the URL of each document
  - the score of each document
  - the context for the search term in each document
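The ranking and scoring steps above can be sketched as follows. This is only an illustration, not the solution program: the function names, the abbreviated `STOP_WORDS` set, the alphabetical tie-breaking between equally frequent words, and the score of 0.0 for a document that does not contain the search term are all assumptions that the assignment does not specify.

```python
import re

# Hypothetical, abbreviated stop-word list; a real run would load a full list.
STOP_WORDS = set(["a", "an", "and", "the", "of", "to", "in", "is", "for", "on"])


def rank_words(text, stop_words=STOP_WORDS):
    """Return the non-stop words of text, most frequent first."""
    counts = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in stop_words:
            counts[word] = counts.get(word, 0) + 1
    # Decreasing count; ties broken alphabetically (an assumption) so the
    # ranking is deterministic.
    return sorted(counts, key=lambda w: (-counts[w], w))


def score_document(text, term, stop_words=STOP_WORDS):
    """Score is 1/rank of term in the ranking; 0.0 if absent (an assumption)."""
    ranking = rank_words(text, stop_words)
    term = term.lower()
    if term not in ranking:
        return 0.0
    return 1.0 / (ranking.index(term) + 1)


def sort_documents(docs, term):
    """docs maps url -> text; return (url, score) pairs by decreasing score."""
    scored = [(url, score_document(text, term)) for url, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

With a document whose five most frequent non-stop words are "document", "perform", "english", "text", and "match", in that order, `score_document(text, "match")` evaluates to 1/5, matching the example in the list above.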
Performance Analysis
- Run several experiments that measure the average and maximum number of searches performed per unit of time for each of your two implementations.
- Make sure that whatever you ask the multiprocessing version to do, you ask the threaded version to do as well.
- Use scripts to help you run the experiments and gather the data.
- Report the results, and make sure you specify the conditions of your experiments (the type of computer you were using, the number of cores, the speed of the processor, the time of day, etc.).
- If possible (not required, but recommended), run your experiments on different architectures (a single-core and a dual-core machine, for example).
- Comment on the results, and explain, as best you can, why one outperforms the other, or why both methods perform similarly.
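One possible shape for a measurement script is sketched below. Everything in it is a placeholder: `search` is a stand-in CPU-bound task rather than a real document search, and the names `run_threaded`, `run_multiprocess`, `measure`, and `describe_machine`, as well as the default of two worker processes, are assumptions made for illustration. The point is only that both versions are given exactly the same list of work and are timed the same way.

```python
import multiprocessing
import platform
import threading
import time


def search(doc_id):
    """Stand-in for one search; a real run would fetch and score a document."""
    return sum(i * i for i in range(20000)) + doc_id


def run_threaded(doc_ids):
    """Run one thread per search and collect the results in order."""
    results = [None] * len(doc_ids)

    def worker(index, doc_id):
        results[index] = search(doc_id)

    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(doc_ids)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def run_multiprocess(doc_ids, workers=2):
    """Run the same searches in a pool of worker processes."""
    pool = multiprocessing.Pool(workers)
    try:
        return pool.map(search, doc_ids)
    finally:
        pool.close()
        pool.join()


def measure(run, doc_ids, trials=3):
    """Return (average, maximum) searches per second over several trials."""
    rates = []
    for _ in range(trials):
        start = time.time()
        run(doc_ids)
        elapsed = max(time.time() - start, 1e-9)  # guard against a zero reading
        rates.append(len(doc_ids) / elapsed)
    return sum(rates) / len(rates), max(rates)


def describe_machine():
    """Collect a few of the experimental conditions to include in the report."""
    return {"platform": platform.platform(),
            "processor": platform.processor(),
            "cores": multiprocessing.cpu_count()}
```

A driver would call `measure(run_threaded, ids)` and `measure(run_multiprocess, ids)` on the same `ids` list, inside an `if __name__ == "__main__":` guard so that worker processes can import the module safely, and record the output of `describe_machine()` alongside the rates.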
Submission
Store your comments, scripts, and analysis in a PDF, and submit it as follows:
submit project1 project1.pdf
Store your Python programs in two files called proj1Thread.py and proj1MultiProc.py, and submit them as follows:
submit project1 proj1Thread.py
submit project1 proj1MultiProc.py
Additional Information
- Word Frequency using Python
- Use Python to Detect the Most Frequent Words in a File
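- List of English Stop Words (http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/)
- What is a Stop Word? (http://en.wikipedia.org/wiki/Stop_words)
- How to get CPU information on a Linux Machine (http://www.linuxquestions.org/questions/programming-9/how-to-get-cpu-information-on-linux-machine-358037/)
- CPU Id, a utility to find the CPU info on Windows machines (http://www.cpuid.com/cpuz.php)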