Latest revision as of 17:10, 23 November 2011

--D. Thiebaut 12:30, 22 November 2011 (EST)

This assignment is optional. If you decide to do it, you can work on it and submit it any time until the last day of class. I will use the higher of your two grades, Homework 8 and this one, as your grade for Homework 8.

This homework has to be done individually, not in pair programming mode.

Its due date is 4:00 p.m. on the last day of class this semester.

Tricky NYT

The setup is the same as for Homework 8.
However, the NYT, knowing that people have a tendency to lift their Web pages and process them to extract information from them, has decided to regularly change the HTML tags they use to sandwich the movie titles. So, for example, one month the movie titles may appear in the HTML page as follows:

...
<td><a href="http://movies.nytimes.com/movie/358/A-Nous-la-Liberte/overview">A Nous la Liberte (1932)</a></td>
...

and the next month the information may be coded as follows:

...
<td><div number="358" section="title">A Nous la Liberte (1932)</div></td>
...

Furthermore, new movies are added to the list regularly, so we can't just create a simple text file with the 1000 titles and use it as the source of information, as the list evolves with time.
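To make the problem concrete, here is a minimal sketch showing why a program that hard-codes the sandwiching strings breaks when the tags change. The helper name `title_between` is hypothetical (it is not part of Homework 8), and the two sample lines are the ones shown above.

```python
def title_between(line, before, after):
    """Return the text between the markers before and after, or None if either is missing."""
    start = line.find(before)
    if start == -1:
        return None
    start += len(before)
    end = line.find(after, start)
    if end == -1:
        return None
    return line[start:end]


# The two sample lines from the text above.
month1 = '<td><a href="http://movies.nytimes.com/movie/358/A-Nous-la-Liberte/overview">A Nous la Liberte (1932)</a></td>'
month2 = '<td><div number="358" section="title">A Nous la Liberte (1932)</div></td>'

# Markers that fit the first format find nothing in the second one.
print(title_between(month1, '/overview">', '</a>'))        # A Nous la Liberte (1932)
print(title_between(month2, '/overview">', '</a>'))        # None
print(title_between(month2, 'section="title">', '</div>'))  # A Nous la Liberte (1932)
```

The same function works for both months, but only if it is given the right pair of markers, which is why the markers cannot be fixed in advance.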

Your Assignment

Your assignment is the same as for Homework 8. You have a file containing all the titles, coded in HTML, and you ask the user for a year, then print all the titles for that year.
Your program cannot crash. If the input file does not exist, your program will just stop after telling the user that the file is missing. If the user enters for the year "two thousand," "2oo1," "20001," or just nothing, your program should keep on asking the user for a year until the user gets it right. Then the program displays the titles for that year.
New: your program will figure out what HTML code sandwiches the titles, and will use these strings to extract all the titles. The trick for this is to assume that a particular movie is so good that it will always be in the top 1000 movies. Then your program can look for that movie, find the sandwiching tags, then use these to extract all the movies.

We'll assume that this best movie of all time is Cat on a Hot Tin Roof (1958) (but you can write your program using another one as the title of the best movie of all time).
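The requirements above could be sketched as follows. This is one possible approach under stated assumptions, not the expected solution: every function name here is hypothetical, the sketch uses the `re` module instead of plain string methods, and it assumes each title sits directly inside a single tag and ends with a four-digit year in parentheses, as in the two samples shown earlier.

```python
import re

KNOWN_TITLE = "Cat on a Hot Tin Roof (1958)"  # the always-present title named in the assignment


def load_html(path):
    """Return the file's contents, or None if it cannot be opened."""
    try:
        with open(path, "r") as handle:
            return handle.read()
    except OSError:
        return None


def parse_year(text):
    """Return text as an int if it is exactly four digits, else None."""
    text = text.strip()
    if len(text) == 4 and all(ch in "0123456789" for ch in text):
        return int(text)
    return None


def ask_year():
    """Keep prompting until the user types a four-digit year."""
    year = parse_year(input("Year? "))
    while year is None:
        year = parse_year(input("Please enter a 4-digit year: "))
    return year


def find_tag_name(html, known_title=KNOWN_TITLE):
    """Return the name of the tag that opens just before known_title, or None."""
    pos = html.find(known_title)
    if pos == -1:
        return None
    open_start = html.rfind("<", 0, pos)
    if open_start == -1:
        return None
    match = re.match(r"<([A-Za-z][A-Za-z0-9]*)", html[open_start:])
    return match.group(1) if match else None


def extract_titles(html, tag_name):
    """Return every 'Title (YYYY)' string wrapped in <tag_name ...>...</tag_name>."""
    tag = re.escape(tag_name)
    pattern = r"<%s\b[^>]*>\s*([^<]*?\(\d{4}\))\s*</%s>" % (tag, tag)
    return re.findall(pattern, html)


def titles_for_year(titles, year):
    """Keep only the titles that end with '(year)'."""
    suffix = "(%d)" % year
    return [title for title in titles if title.endswith(suffix)]


def main():
    html = load_html("1000best.html")
    if html is None:
        print("The file 1000best.html is missing.")
        return
    tag_name = find_tag_name(html)
    if tag_name is None:
        print("Could not find", KNOWN_TITLE, "in the file.")
        return
    year = ask_year()
    for title in titles_for_year(extract_titles(html, tag_name), year):
        print(title)
```

Calling `main()` runs the whole sequence. Matching on the tag name rather than on the full opening tag is a deliberate choice: the opening tag around the known title contains movie-specific attributes (the movie number, the URL), so it cannot be reused verbatim for the other titles.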

Testing

You may use the file 1000best2.html to test your program. It contains different tags around the movie titles, as compared to the original 1000best.html file.

You can get it into your beowulf account as follows:

  getcopy 1000best2.html

Once you have it in your account, you can rename it as follows:

  mv 1000best2.html 1000best.html

After this command, the file 1000best.html will contain the new version of the file, and your program should work as before.

You can also get the file on the Web, and copy/paste its source: http://cs.smith.edu/~111a/1000best2.html

Submission

Name your program makeup8.py, and submit it as follows:

 rsubmit hw8 makeup8.py

...

@@ Line 5: / Line 5: @@
 This homework has to be done individually, '''not in pair programming mode'''.
+Its due date is ''' 4:00 p.m. on the last day of class''' this semester.
 </bluebox>
@@ Line 16: / Line 18: @@
 <td><a href="http://movies.nytimes.com/movie/358/A-Nous-la-Liberte/overview">A Nous la Liberte (1932)</a></td>
 ...
-</tr>
 </pre></code>
@@ Line 22: / Line 23: @@
 <code><pre>
 ...
-<td><div number="358" section="title">A Nous la Liberte (1932)</div>
+<td><div number="358" section="title">A Nous la Liberte (1932)</div></td>
 ...
 </pre></code>
+* Furthermore, new movies are added to the list regularly, so we can't just create a simple text file with the 1000 titles and use it as the source of information, as the list evolves with time.
+=Your Assignment=
+* Your assignment is the  same as for Homework 8.  You have a file containing all the titles, coded in HTML, and you ask the user for a year, then print all the titles for that year.
+* Your program cannot crash.   If the input file does not exist, your program will just stop after telling the user that the file is missing.  If  the user enters for the year "two thousand," "2oo1," "20001," or just nothing, your program should ''keep on asking the user for a year'' until the user gets it right.  Then the program displays the titles for that year.
+* '''New''': your program will figure out what HTML code sandwiches the titles, and will use these strings to extract all the titles.  The trick for this is to assume that a particular movie is ''so good'' that it will always be in the top 1000 movies.   Then your program can look for that movie, find the sandwiching tags, then use these to extract all the movies.
+:We'll assume that this best movie of all time is ''Cat on a Hot Tin Roof (1958)'' (but you can write your program using another one as the title of the best movie of all time).
+=Testing=
+* You may use the file '''1000best2.html''' to test your program.  It contains different tags around the movie titles, as compared to the original '''1000best.html''' file.
+* You can get it into your beowulf account as follows:
+   getcopy 1000best2.html
+* Once you have it in your account, you can rename it as follows:
+   mv 1000best2.html 1000best.html
+:After this command, the file 1000best.html will contain the new version of the file, and your program should work as before.
+* You can also get the file on the Web, and copy/paste its source: http://cs.smith.edu/~111a/1000best2.html
+=Submission=
+* Name your program '''makeup8.py''', and submit it as follows:
+  rsubmit hw8 makeup8.py
+<onlydft>
+=Program to replace tags=
+<source lang="python">
+target="""<td><a href="http://movies.nytimes.com/movie/8605/Cat-on-a-Hot-Tin-Roof/overview">
+          Cat on a Hot Tin Roof (1958)</a></td>"""
+lines = open( "1000best.html", "r" ).readlines()
+print( len( lines ), "line(s) read" )
+newLines = []
+for i,line in enumerate( lines ):
+    #print( i, "\b\b\b\b\b" )
+    #if line.find( "Tin Roof" ) != -1:
+    #        print( line )
+    #        break
+    start = 0
+    index=line.find( "/overview\">", start )
+    if index== -1:
+        newLines.append( line )
+    else:
+        indexAHref = line.rfind( "<a href", 0, index )
+        index2     = line.find( "</a>", index )
+        if indexAHref != -1 and index2 != -1:
+            movieLine = line[indexAHref:index2+len( "</a>" )]
+            words = movieLine.split( '/' )
+            number = words[4]
+            title = words[5].replace( '-', ' ' )
+            realTitle = movieLine.split( ">" )[1].split( "<" )[0]
+            line = line[:indexAHref] + '<div num="%s" title="%s">' % ( number, title ) \
+                   + realTitle + "</div>" + line[index2+len("</a>"):]
+            """
+            line = line[:indexAHref] \
+                   + '<CSC111SuperTag movieNumber="%s"><title="%s"><hereItComes>' % ( number, title ) \
+                   + realTitle + "</hereItComes></title></CSC111SuperTag>" + line[index2+len("</a>"):]
+            """
+            print( line )
+            newLines.append( line )
+            print( newLines[-1] )
+        else:
+            newLines.append( line )
+# readlines() keeps each line's trailing newline, so join with "" to avoid doubling blank lines
+open( "1000best3.html", "w" ).write( ''.join( newLines ) )
+</source>
+</onlydft>
+<br />
+<br />
+<br />
+<br />
+<br />
+<br />
+<br />
+[[Category:CSC111]][[Category:Homework]][[Category:Python]]

Difference between revisions of "CSC111 Make-up Homework 8 2011"
