CSC111 Homework 8 2011
--D. Thiebaut 20:23, 8 November 2011 (EST)
This assignment is due on 11/15/11, at midnight. It will be easiest for you to work on this assignment on beowulf, and not on your laptop/desktop.
Problem: Working with Files
- Check this page out: http://www.nytimes.com/ref/movies/1000best.html
- (If this page is not available, you can see a cached copy of it on beowulf: http://maven.smith.edu/~111a/1000best.html )
- It contains, according to the New York Times, the 1000 best movies ever made.
- The whole Web page is available for you to download to your 111a-xx account. All you need to do is log in to your 111a-xx account, and type the following command at the prompt:
getcopy 1000best.html
- Check the contents of your directory with ls: you should see the new file. It is fairly large, and contains roughly 320,000 characters.
Your assignment
- Write a program called hw8.py that will
- ask the user for the year she was born in (or some other year),
- open the html file, read it, and output all the movies that came out in that year, sorted in alphabetical order.
- store this list to a text file, called movies_nnnn.txt, where nnnn will be the year selected by the user.
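The file-handling steps above can be sketched as follows. This is only a minimal illustration, not the expected solution: the helper names save_movies and read_whole_file are hypothetical, the three titles are hard-coded sample data, and the demo writes to a temporary directory rather than your account. In hw8.py the year would come from input() and the titles from the html file.

```python
import os
import tempfile

def read_whole_file(file_name):
    """Return the full contents of a text file as one string."""
    with open(file_name, "r") as in_file:
        return in_file.read()

def save_movies(titles, year, directory="."):
    """Write the titles, sorted alphabetically, one per line,
    to movies_<year>.txt and return the name of the file."""
    file_name = os.path.join(directory, "movies_" + str(year) + ".txt")
    with open(file_name, "w") as out_file:
        for title in sorted(titles):
            out_file.write(title + "\n")
    return file_name

# demo with made-up input, written to a temporary directory (an assumption
# made here so that the example does not touch your working directory)
demo_dir = tempfile.mkdtemp()
saved = save_movies(["King Kong", "Duck Soup", "Cavalcade"], 1933, demo_dir)
print("Saving movies to file", os.path.basename(saved))
print(read_whole_file(saved))
```

Note that sorted() compares strings character by character, so a title starting with "The" is listed under T, as in the example session.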
Example
- The user types the command and then the year (1933); everything else is printed by the program:
python3.2 hw8.py
Selecting movies for which year? 1933
Movies that came out in 1933:
Cavalcade
Dinner at Eight
Duck Soup
King Kong
Little Women
State Fair
The Private Life of Henry VIII
Zero for Conduct
Saving movies to file movies_1933.txt
- At the end of the program a new file will be in the current directory, and its name will be movies_1933.txt
Helpful Hints
HTML format
- You will find out by looking at the raw html code of the page that it is filled with html tags. The list of movies is embedded in this markup and is not easy to find. Here is the beginning of the list, as it appears in the file:
<td><a href="http://movies.nytimes.com/movie/358/A-Nous-la-Liberte/overview">A Nous la Liberte (1932)</a></td>
<td class="smartlink"><a href="#" bluekey="Up4yHhM70NjGxfSdSLf9%22oMVD7MemyJIdFp322ed7EZAz4JqryOohTYaqwjSxfhSx5w8Mb"></a></td>
</tr>
<tr>
<td><a href="http://movies.nytimes.com/movie/265451/About-Schmidt/overview">About Schmidt (2002)</a></td>
<td class="smartlink"><a href="#" bluekey="Up4yHhM70NjGxfSdSLf9%22oMVD7MemyJIsQC3SYeixIJyRSZthWCkF5hvd6PUcWfJl5w46Tl"></a></td>
</tr>
- Your code will have to find the movies by locking onto the different tags, and extracting the movie title from the surrounding html code.
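As an illustration of locking onto tags, here is one possible way to pull the title and the year out of a single line of the list. It is a sketch under assumptions: the function name split_title_and_year is hypothetical, and the code assumes that every movie line ends its link with </a> and puts the year in parentheses right before it, as in the two lines shown above. It has not been checked against the whole file.

```python
def split_title_and_year(html_line):
    """Return (title, year) for a movie line, or None if the line
    does not look like one. Both values are strings."""
    end = html_line.find("</a>")
    if end == -1:
        return None                       # no link on this line
    # last '">' located before "</a>": the text of the link starts after it
    start = html_line.rfind('">', 0, end)
    if start == -1:
        return None
    label = html_line[start + 2:end]      # e.g. "A Nous la Liberte (1932)"
    paren = label.rfind("(")
    if paren == -1 or not label.endswith(")"):
        return None                       # empty link, or no "(year)" part
    title = label[:paren].strip()
    year = label[paren + 1:-1]
    return title, year

movie_line = '<td><a href="http://movies.nytimes.com/movie/358/A-Nous-la-Liberte/overview">A Nous la Liberte (1932)</a></td>'
other_line = '<td class="smartlink"><a href="#" bluekey="Up4yHhM70Nj"></a></td>'  # shortened key, made up for the demo

print(split_title_and_year(movie_line))
print(split_title_and_year(other_line))
```

The "smartlink" lines contain a link with no text, so the function returns None for them and they can be skipped.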
The String find() method
- The find() method is described fully in the Python documentation: http://docs.python.org/py3k/library/stdtypes.html#string-methods
- The find() method accepts a second argument, which is optional. This argument is the location in the string where the searching must start. This is useful if you want to start searching not from the beginning of the string, but from a different place.
- Example
text = """
age: 35 value: 345,
age: 77 value: 1,
age: 23 value: 16,
age: 3 value: -1"""
# display all the ages
start = 0
while True: # we create an endless loop
    index = text.find( "age:", start )
    # found a new "age:" string?
    if index == -1:
        # not found
        break
    # we advance start to just past where we found the "age:" string
    # and we take a slice of 4 characters that will contain the age.
    start = index + len( "age:" )
    print( text[ start: start+4 ] )
- Figure out how the code works. It should help you solve this homework...
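A variation on the same idea, offered as a sketch rather than as the intended solution: instead of slicing a fixed 4 characters, a second call to find() can locate a closing marker, so the slice adapts to the length of the item. The function name values_between is hypothetical; the text is the one from the example above.

```python
text = """
age: 35 value: 345,
age: 77 value: 1,
age: 23 value: 16,
age: 3 value: -1"""

def values_between(text, opening, closing):
    """Collect every piece of text found between an opening marker
    and the next closing marker."""
    results = []
    start = 0
    while True:
        index = text.find(opening, start)
        if index == -1:
            break                           # no more opening markers
        begin = index + len(opening)
        end = text.find(closing, begin)     # search only past the opening marker
        if end == -1:
            end = len(text)                 # last item: no closing marker
        results.append(text[begin:end].strip())
        start = end                         # resume the search after this item
    return results

print(values_between(text, "value:", ","))
```

With this text the function returns ['345', '1', '16', '-1'].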