CSC111 Lab 12 2010
This lab deals with dictionaries and natural language processing.
Dictionaries
Dictionaries are data structures in Python that have the following properties:
- they keep track of pairs of elements: the first element of each pair is the key, the second is the value
- all the keys are unique
- dictionaries allow fast searching, insertion and retrieval of information.
Playing with Dictionaries
- Use Python in interactive mode and enter the different Python statements shown below.
- Observe the output, and make sense of how dictionaries work in Python
>>>
>>> # create an empty dictionary
>>> D = {}
>>> D
>>> # create a dictionary with a few key:value pairs
>>> D = { "apple":30, "pear":10, "banana":5 }
>>> D
>>> # inspect some of the contents
>>> D[ 'pear' ]
>>> D[ 'apple' ]
>>> # we are getting 25 more bananas...
>>> D[ 'banana' ] = D[ 'banana' ] + 25
>>> D[ 'banana' ]
>>> D
>>> # we're getting a new shipment of pineapples... 100 of them
...
>>> D[ 'pineapple' ] = 100
>>> D
>>> # we want the name of the fruits (keys) we carry...
>>> D.keys()
>>> for fruit in D.keys():
...     print fruit
...
>>>
>>> # we want to print the full inventory
...
>>> for key in D.keys():
...     print D[ key ], "units of", key
...
- Now that you better understand how dictionaries work, try to figure out how to answer the following questions in Python (use the interactive mode):
- Question 1
- How many bananas do we have?
- Question 2
- We sell half of the bananas. Remove half of the bananas from your inventory.
- Question 3
- Print the fruits for which we have more than 50 units.
Problem #2
- Smith College Museum of Art (SCMA) jest odbiorcą jednym z najbardziej głośnych przez
wybitny przywódca ashcan School of American malarstwo realistyczne.
What language is the sentence above written in?
To find out, let's write a Python program that will tell us!
Introduction
First, skim the Wikipedia page on letter frequency. You do not need every detail, but make sure you understand its main point.
Python to the rescue
As you will have now guessed, one way to figure out what language a text is written in is to measure the frequency of occurrence of each letter in the text and to compare it to the frequency of letters appearing in texts written in different languages.
Below are various Python pieces that will come in handy for doing this:
Dictionaries
Of course! Dictionaries are the important ingredient of the solution. Here's a way to use dictionaries with characters, illustrated in an interactive Python shell.
>>> letters = {}
>>> letters[ 'a' ] = 0
>>> letters[ 'b' ] = 1
>>> letters[ 'c' ] = 0
>>> letters
{'a': 0, 'c': 0, 'b': 1}
>>> letters.has_key( 'a' )
True
>>> letters.has_key( 'z' )
False
>>> letters[ 'a' ] = letters[ 'a' ] + 10
>>> letters
{'a': 10, 'c': 0, 'b': 1}
>>>
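Note that has_key() exists only in Python 2. As a minimal sketch, the same tests can be written with the in operator and the get() method, which work in both Python 2 and Python 3. The dictionary contents below simply repeat the session above.

```python
letters = {'a': 10, 'b': 1, 'c': 0}

# "in" tests whether a key is present, like has_key() in the session above
print('a' in letters)
print('z' in letters)

# get() returns a default value instead of raising KeyError for a missing key
print(letters.get('z', 0))

# a common idiom: add 1 to a counter whether or not the key exists yet
letters['z'] = letters.get('z', 0) + 1
print(letters['z'])
```

The last idiom is convenient when counting, because it avoids having to initialize every possible key to 0 first.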
Reading a text file into a string
# readFile1.py
# D. Thiebaut
def getText( filename ):
    # read the whole file and return its contents as one string
    file = open( filename, "r" )
    text = file.read()
    file.close()
    return text

def main():
    filename = raw_input( "filename? " )
    text = getText( filename )
    print text

main()
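As an optional variant, the function that reads the file can be written with the with statement, so that the file is closed automatically. This is only a sketch of an alternative to getText above, not a required change; it runs under Python 2.6 or later and under Python 3.

```python
def getText(filename):
    # the with statement closes the file when the block ends,
    # even if read() raises an error
    with open(filename, "r") as f:
        return f.read()
```

It is called exactly like the original, for example text = getText( "secret.txt" ).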
Processing characters of a String
# bigEs.py
# swap all 'e' characters for 'E'...
sentence = "The quick red fox jumped over the lazy brown sleeping dog"
newS = ''
for c in sentence:
    if c == 'e':
        c = 'E'
    newS = newS + c   # append every character, swapped or not
print newS
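The character loop and the dictionary idea can be combined to count letters. The sketch below is one possible way to do it, not the required solution: it counts the letters of the same sample sentence and ranks them from most to least frequent. The names counts and ranking are chosen for illustration, and ties are broken alphabetically, which is an arbitrary choice made only so the result is reproducible. It runs under both Python 2 and Python 3.

```python
text = "The quick red fox jumped over the lazy brown sleeping dog"

counts = {}
for c in text.lower():
    if c.isalpha():                       # skip spaces, digits, punctuation
        counts[c] = counts.get(c, 0) + 1  # start at 0 for a new letter

# letters sorted from most frequent to least frequent;
# the letter itself is the tie-breaker for equal counts
ranking = sorted(counts, key=lambda c: (-counts[c], c))
print(" ".join(ranking))
```

The resulting list has the same shape as the frequency charts given later in this lab, which makes it easy to compare against them.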
Programming Time!
- Write a program that will help you identify the language of the mystery text.
- The mystery text can be obtained as follows, from the Linux prompt in your beowulf account:
getcopy secret.txt
Known Character Frequencies for different languages
Letters ranked from most frequent to least frequent, taken from http://letterfrequency.org:
UK English Language Letter Frequency:
e t a o i n s r h l d c u m f p g w y b v k x j q z
Spanish Language Letter Frequency:
e a o s r n i d l c t u m p b g y í v q ó h f z j é á ñ x ú ü w k
German Language Letter Frequency:
e n i s r a t d h u l c g m o b w f k z v ü p ä ß j ö y q x
French Language Letter Frequency:
e s a i t n r u l o d c m p é v q f b g h j à x è y ê z ç ô ù â û î œ w k ï ë ü æ ñ
Italian Language Letter Frequency:
e a i o n l r t s c d u p m v g h f b q z ò à ù ì é è ó y k w x j ô
Dutch Language Letter Frequency:
e n a t i r o d s l g h v k m u b p w j c z f x y (ë é ó) q
Greek Language Letter Frequency:
α ο ι ε τ σ ν η υ ρ π κ μ λ ω δ γ χ θ φ β ξ ζ ψ
Russian Language Letter Frequency:
o e a и н т с в л р к д м п у ë я г б з ч й х ж ш ю ц щ e ф (ъ ы ь)
Turkish Language Letter Frequency:
a e i n r l ı d k m u y t s b o ü ş z g ç h ğ v c ö p f j w x q
Polish Language Letter Frequency:
i a e o z n s c r w y ł d k m t p u j l g ę b ą h ż ś ó ć ń f ź v q x
Esperanto Language Letter Frequency:
a i e o n l s r t k j u d m p v g f b c ĝ ĉ ŭ z ŝ h ĵ ĥ w y x q
Swedish Language Letter Frequency:
e a n t r s l i d o m g k v ä h f u p å ö b c j y x w z é q (à è)
Open Problem
- Challenge of the Day
- Once you have the ranking of characters for your mystery text, can you figure out how to have Python compare it to the known frequency charts, such as the ones above, and output the most likely language?
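One possible approach, offered only as a sketch, is to measure how far each letter of the mystery ranking sits from its position in a chart, and to pick the chart with the smallest total. Several things below are assumptions made for illustration: only the ten most frequent letters of four of the charts above are used (which keeps the code plain ASCII), a letter missing from a chart costs a fixed penalty equal to the chart length, and the names CHARTS, distance and guess_language are hypothetical. Other scoring schemes are equally valid.

```python
# Top ten letters of four charts, copied from the rankings listed above.
CHARTS = {
    "English": "etaoinsrhl",
    "Spanish": "eaosrnidlc",
    "German":  "enisratdhu",
    "Polish":  "iaeoznscrw",
}

def distance(ranking, chart):
    """Sum, over the letters of ranking, of how far each letter sits from
    its position in chart; a letter missing from chart costs len(chart)."""
    total = 0
    for position, letter in enumerate(ranking):
        if letter in chart:
            total += abs(position - chart.index(letter))
        else:
            total += len(chart)
    return total

def guess_language(ranking):
    """Return the name of the chart with the smallest distance to ranking."""
    best = None
    for language in sorted(CHARTS):   # sorted, so ties resolve predictably
        d = distance(ranking, CHARTS[language])
        if best is None or d < best[0]:
            best = (d, language)
    return best[1]

# a made-up ranking, standing in for one computed from a real text
print(guess_language("etoainshrd"))
```

A short text gives a noisy ranking, so the smallest distance is a hint rather than a proof; printing all the distances, not just the winner, shows how close the runner-up is.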
Answers
The text was "Smith College Museum of Art (SCMA) is the recipient of one of the most celebrated works by a preeminent leader of the Ashcan School of American realist painting." It was translated into Polish by Google.