CSC111 Lab 12 2010
Revision as of 22:08, 21 April 2010
--D. Thiebaut 02:08, 22 April 2010 (UTC)
Dictionaries
Dictionaries are data structures in Python that have the following properties:
- they keep track of pairs of elements. The first one is the key, the second the value
- all the keys are unique
- dictionaries allow fast searching, insertion and retrieval of information.
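The three properties above can be seen in a few lines. This is only a small sketch; the dictionary name `stock` and its contents are made up for illustration, and the code is written so that it should run under both Python 2 and Python 3.

```python
# Illustrative sketch only: "stock" and its contents are hypothetical.
stock = {"apple": 30, "pear": 10}    # each entry pairs a key with a value

# Keys are unique: assigning to an existing key replaces its value
# instead of creating a second "apple" entry.
stock["apple"] = 45
number_of_keys = len(stock)          # still 2 keys

# Insertion, membership tests and retrieval all go through the key.
stock["kiwi"] = 7                    # insertion of a new pair
has_pear = "pear" in stock           # membership test
pears = stock["pear"]                # retrieval

print(number_of_keys)
print(has_pear)
print(pears)
```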
Playing with Dictionaries
- Use Python in interactive mode and enter the different Python statements shown below.
- Observe the output, and make sense of how dictionaries work in Python
>>>
>>> # create an empty dictionary
>>> D = {}
>>> D
>>> # create a dictionary with a few key:value pairs
>>> D = { "apple":30, "pear":10, "banana":5 }
>>> D
>>> # inspect some of the contents
>>> D[ 'pear' ]
>>> D[ 'apple' ]
>>> # we are getting 25 more bananas...
>>> D[ 'banana' ] = D[ 'banana' ] + 25
>>> D[ 'banana' ]
>>> D
>>> # we're getting a new shipment of pineapples... 100 of them
...
>>> D[ 'pineapple' ] = 100
>>> D
>>> # we want the name of the fruits (keys) we carry...
>>> D.keys()
>>> for fruit in D.keys():
...     print fruit
...
>>>
>>> # we want to print the full inventory
...
>>> for key in D.keys():
...     print D[ key ], "units of", key
...
- Now that you better understand how dictionaries work, try to figure out how to answer the following questions in Python (use the interactive mode):
- Question 1
- How many bananas do we have?
- Question 2
- We sell half of the bananas. Remove half of the bananas from your inventory.
- Question 3
- Print the fruits for which we have more than 50 units.
Problem #2
- In Il caso di Google, gli intrusi sembrava avere informazioni precise circa i nomi di software per sviluppatori di Gaia la.
What language is the sentence above written in?
To find out, let's write a Python program that will tell us!
Introduction
First, go over the Wikipedia page on letter frequency. Read over the information quickly, but make sure you understand what this is all about.
Python to the rescue
As you will have now guessed, one way to figure out what language a text is written in is to measure the frequency of occurrence of each letter in the text and to compare it to the frequency of letters appearing in known texts written in different languages.
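The measuring step might be sketched as follows. This is one possible approach, not the lab's official solution: the function name `rank_letters` and the sample string are assumptions made for illustration, and the sketch only handles plain ASCII letters reliably (accented characters need extra care, especially under Python 2).

```python
def rank_letters(text):
    """Return the letters found in text, most frequent first.
    (Hypothetical helper, not part of the lab's own code.)"""
    counts = {}
    for c in text.lower():
        if c.isalpha():                      # ignore spaces, digits, punctuation
            counts[c] = counts.get(c, 0) + 1
    # Sort by decreasing count; ties are broken alphabetically so that
    # the result is the same on every run.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return "".join(ordered)

sample = "banana bread"                      # made-up sample text
print(rank_letters(sample))
```

The string returned by such a function has the same shape as the published rankings, so the two can be compared letter by letter.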
Below are various Python pieces that will come in handy for doing this:
Dictionaries
Of course! Dictionaries are the important ingredient of the solution. Here's a way to use dictionaries with characters, illustrated in an interactive Python shell.
>>> letters = {}
>>> letters[ 'a' ] = 0
>>> letters[ 'b' ] = 1
>>> letters[ 'c' ] = 0
>>> letters
{'a': 0, 'c': 0, 'b': 1}
>>> letters.has_key( 'a' )
True
>>> letters.has_key( 'z' )
False
>>> letters[ 'a' ] = letters[ 'a' ] + 10
>>> letters
{'a': 10, 'c': 0, 'b': 1}
>>>
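The session above uses `has_key()`, which exists in Python 2 but was removed in Python 3. The `in` operator answers the same question in both versions. The sketch below, which uses a made-up string `"abba"`, shows it being used to count characters.

```python
letters = {'a': 0, 'b': 1, 'c': 0}

# "key in dictionary" answers the same question as has_key()
a_known = 'a' in letters
z_known = 'z' in letters

# Count characters, creating an entry the first time a character is seen
for c in "abba":
    if c in letters:
        letters[c] = letters[c] + 1
    else:
        letters[c] = 1

print(a_known)
print(z_known)
print(letters['a'])
print(letters['b'])
```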
Reading a text file into a string
# readFile1.py
# D. Thiebaut
def getText( filename ):
    file = open( filename, "r" )
    text = file.read()
    file.close()
    return text

def main():
    filename = raw_input( "filename? " )
    text = getText( filename )
    print text

main()
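As an alternative, the same function could be written with a `with` statement (available from Python 2.6 on), which closes the file automatically. The sketch below is only a variation on `getText`; the throw-away file name `sample.txt` and its contents are assumptions so that the example can be run anywhere.

```python
import os
import tempfile

def getText(filename):
    # The with statement closes the file even if read() raises an error.
    with open(filename, "r") as f:
        return f.read()

# Write a small throw-away file (hypothetical name and contents)
# so that the sketch does not depend on an existing file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("Hello, dictionaries!\n")

text = getText(path)
print(text)
```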
Processing characters of a String
# bigEs.py
# swap all 'e' characters for 'E'...
sentence = "The quick red fox jumped over the lazy brown sleeping dog"
newS = ''
for c in sentence:
    if c == 'e':
        c = 'E'
    newS = newS + c
print newS
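The same character-by-character loop can prepare a text for counting. The sketch below is one possible way to do it, not a required step of the lab: it lower-cases every character and keeps only the letters, so that 'T' and 't' are counted together and spaces are ignored. The variable name `lettersOnly` is made up for this example.

```python
sentence = "The quick red fox jumped over the lazy brown sleeping dog"

lettersOnly = ''
for c in sentence:
    c = c.lower()            # 'T' and 't' should count as the same letter
    if c.isalpha():          # skip spaces, digits and punctuation
        lettersOnly = lettersOnly + c

print(lettersOnly)
```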
Programming Time!
- Write a program that will help you identify the language of the mystery text, which is the first sentence of a longer text (the longer the text, the more meaningful the ranking of letters).
- The mystery text can be obtained as follows, from the Linux prompt in your beowulf account:
getcopy secret.txt
Known Character Frequencies for different languages
Letters ranked from most frequent to least frequent, taken from http://letterfrequency.org:
UK English Language Letter Frequency:
e t a o i n s r h l d c u m f p g w y b v k x j q z
Spanish Language Letter Frequency:
e a o s r n i d l c t u m p b g y í v q ó h f z j é á ñ x ú ü w k
German Language Letter Frequency:
e n i s r a t d h u l c g m o b w f k z v ü p ä ß j ö y q x
French Language Letter Frequency:
e s a i t n r u l o d c m p é v q f b g h j à x è y ê z ç ô ù â û î œ w k ï ë ü æ ñ
Italian Language Letter Frequency:
e a i o n l r t s c d u p m v g h f b q z ò à ù ì é è ó y k w x j ô
Dutch Language Letter Frequency:
e n a t i r o d s l g h v k m u b p w j c z f x y (ë é ó) q
Greek Language Letter Frequency:
α ο ι ε τ σ ν η υ ρ π κ μ λ ω δ γ χ θ φ β ξ ζ ψ
Russian Language Letter Frequency:
o e a и н т с в л р к д м п у ë я г б з ч й х ж ш ю ц щ e ф (ъ ы ь)
Turkish Language Letter Frequency:
a e i n r l ı d k m u y t s b o ü ş z g ç h ğ v c ö p f j w x q
Polish Language Letter Frequency:
i a e o z n s c r w y ł d k m t p u j l g ę b ą h ż ś ó ć ń f ź v q x
Esperanto Language Letter Frequency:
a i e o n l s r t k j u d m p v g f b c ĝ ĉ ŭ z ŝ h ĵ ĥ w y x q
Swedish Language Letter Frequency:
e a n t r s l i d o m g k v ä h f u p å ö b c j y x w z é q (à è)
To save time, these strings have been transformed into Python code that you can paste directly into your program:
languageStrings = [('Spanish', 'eaosrnidlc'), ('German', 'enisratdhu'), ('French', 'esaitnrulo'), ('Italian', 'eaionlrtsc'), ('Dutch', 'enatirodsl'), ('Turkish', 'aeinrl\xc4\xb1dk'), ('Polish', 'iaeoznscrw'), ('Swedish', 'eantrslido')]
- Important Note
- When you write a string that contains non-ASCII characters (such as accented letters) in a Python program, you need to tell the Python interpreter that it should expect such characters. This is done by including the following line at the top of your program, aligned against the left margin.
- # -*- coding: iso-8859-15 -*-
Each pair in the languageStrings list contains two strings: the first one is the name of a language, the second holds the 10 most frequent characters found in texts written in that language, most frequent first.
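A list of pairs like this can be walked with a `for` loop that unpacks each pair into two variables. The sketch below copies three of the pairs so that it runs on its own; the dictionary name `topLetters` is an assumption made for illustration.

```python
# Three of the pairs from languageStrings, copied here so the sketch
# can be run by itself.
languageStrings = [('Spanish', 'eaosrnidlc'), ('German', 'enisratdhu'),
                   ('French', 'esaitnrulo')]

topLetters = {}              # hypothetical name: language -> its 3 top letters
for name, ranking in languageStrings:
    # name is the language, ranking its 10 most frequent letters
    print(name + ": " + ranking)
    topLetters[name] = ranking[0:3]
```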
Open Problem
- Challenge of the Day
- Can you figure out, once you have the ranking of characters for your mystery text, how to have Python compare it for you to the known frequency rankings, such as the ones above, and output the most likely language?
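One way to approach the challenge is sketched below; it is a suggestion under stated assumptions, not the expected answer. The scoring rule (add up how far each reference letter sits from its position in the mystery ranking, with a fixed penalty for missing letters) is just one of many possible choices. The function names `distance` and `guess_language` and the `mystery` ranking are made up for illustration, and the Turkish entry is left out so that the code stays pure ASCII.

```python
languageStrings = [('Spanish', 'eaosrnidlc'), ('German', 'enisratdhu'),
                   ('French', 'esaitnrulo'), ('Italian', 'eaionlrtsc'),
                   ('Dutch', 'enatirodsl'), ('Polish', 'iaeoznscrw'),
                   ('Swedish', 'eantrslido')]

def distance(ranking, reference):
    """Hypothetical scoring rule: sum, over the letters of reference,
    of how far each letter sits from its position in ranking."""
    total = 0
    for position, letter in enumerate(reference):
        found = ranking.find(letter)
        if found == -1:
            total += len(reference)      # penalty: letter missing from ranking
        else:
            total += abs(found - position)
    return total

def guess_language(ranking, candidates):
    """Return the name of the candidate with the smallest distance."""
    best_name = None
    best_score = None
    for name, reference in candidates:
        score = distance(ranking, reference)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name

mystery = 'eiaonlrtscdupm'               # made-up ranking, for illustration
print(guess_language(mystery, languageStrings))
```

A smaller distance means a closer match. With short texts several languages can end up with similar scores, so the result should be read as a guess rather than a certainty.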
Answers
The text was the Italian translation of a sentence from the 4/19/10 New York Times article "Cyberattack on Google Said to Hit Password System" (www.nytimes.com/2010/04/20/technology/20google.html)