Difference between revisions of "CSC352 Project 1"

From dftwiki3

Revision as of 09:16, 13 February 2010

The due date for this project is TBA. You may work individually or in pairs. If you work in pairs, you must spend the majority of the time working together on the programming and analysis.



Under Construction! Not released officially!

Introduction

This project is an extension of Problem 2 of Homework #2, but with a few differences.

  1. You need to implement both the multiprocessing and threaded versions of the program (you can use the solution programs that will be made available as soon as Homework 2 is graded).
  2. The program will analyze each document received and will remove all the stop words (see http://en.wikipedia.org/wiki/Stop_words), then compute the frequency of occurrence of each word and rank the words by frequency, from most frequent to least frequent.
  3. The program will assign a score to the document that is the inverse of the ranking of the search term in the list of ordered words. For example, if you search for "match" and the list of most frequent words is [ "document", "perform", "english", "text", "match", "score",...], then the score of the document will be 1/5, since "match" is in fifth position in the list.
  4. The program will sort the documents by decreasing scores.
  5. Your program will output the documents in a way similar to that of Homework 2 (with a context around the search term), and will indicate the score of each document.
  6. You are free to decide how to format the output. It should show at least:
    • the search term
    • the URL of each document
    • the score of each document
    • the context for the search term in each document
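
Steps 2 through 4 above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Homework 2 solution: the stop-word list is a tiny placeholder, ties in frequency are broken alphabetically, a document that does not contain the search term gets a score of 0, and the function names (ranked_words, score, rank_documents) are hypothetical. The scoring is spread over a thread pool from the standard concurrent.futures module as one possible threaded arrangement.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Placeholder stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}


def ranked_words(text):
    """Return the non-stop words of text, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Decreasing count; alphabetical tie-break is an assumption,
    # chosen only to make the ranking deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered]


def score(text, term):
    """Return 1/rank of term, where rank 1 is the most frequent word.

    Returning 0.0 when the term is absent is an assumption; the
    project description does not cover that case.
    """
    ranking = ranked_words(text)
    term = term.lower()
    if term not in ranking:
        return 0.0
    return 1.0 / (ranking.index(term) + 1)


def rank_documents(docs, term, max_workers=4):
    """Score each document in a thread pool and sort by decreasing score.

    docs maps a URL to the text of the document at that URL.
    Returns a list of (url, score) pairs.
    """
    urls = list(docs)
    texts = [docs[url] for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score, texts, [term] * len(urls)))
    return sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
```

For a multiprocessing variant, one option is to replace ThreadPoolExecutor with concurrent.futures.ProcessPoolExecutor and call rank_documents from under an if __name__ == "__main__": guard. The score function is defined at module level so that it can be sent to worker processes.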

Additional Information