Difference between revisions of "CSC111 Lab 12 2010"


Revision as of 14:30, 21 April 2010




This lab deals with dictionaries and natural language processing.
Тази лаборатория се занимава с речници и обработка на естествен език
Das Labor beschäftigt sich mit Wörterbüchern und Verarbeitung natürlicher Sprache.
Ce laboratoire traite de dictionnaires, et le traitement du langage naturel.
辞書をこのラボはお得な情報、自然言語処理。
maabara hii inahusika na Mkwawa, na usindikaji lugha ya asili.
該實驗室處理詞典,和自然語言處理。




Dictionaries

Dictionaries are data structures in Python that have the following properties:

  • they keep track of pairs of elements: the first element of a pair is the key, the second is the value
  • all the keys are unique
  • dictionaries allow fast searching, insertion and retrieval of information.

Playing with Dictionaries

  • Use Python in interactive mode and enter the different Python statements shown below.
  • Observe the output, and make sense of how dictionaries work in Python
>>> 
>>> # create an empty dictionary
>>> D = {}
>>> D


>>> # create a dictionary with a few key:value pairs
>>> D = { "apple":30, "pear":10, "banana":5 }
>>> D


>>> # inspect some of the contents
>>> D[ 'pear' ]


>>> D[ 'apple' ]


>>> # we are getting 25 more bananas...
>>> D[ 'banana' ] = D[ 'banana' ] + 25
>>> D[ 'banana' ]


>>> D



>>> # we're getting a new shipment of pineapples... 100 of them
... 
>>> D[ 'pineapple' ] = 100
>>> D



>>> # we want the name of the fruits (keys) we carry...
>>> D.keys()


>>> for fruit in D.keys(): 
...     print fruit
... 






>>> 
>>> # we want to print the full inventory
... 
>>> for key in D.keys():
...     print D[ key ], "units of", key
... 



  • Now that you better understand how dictionaries work, try to figure out how to answer the following questions in Python (use the interactive mode):
Question 1
How many bananas do we have?
Question 2
We sell half of the bananas. Remove half of the bananas from your inventory.
Question 3
Print the fruits for which we have more than 50 units.

Problem #2

Smith College Museum of Art (SCMA) jest odbiorcą jednym z najbardziej głośnych przez
wybitny przywódca ashcan School of American malarstwo realistyczne.

What language is the sentence above written in?

To find out, let's write a Python program that will tell us!

Introduction

First, go over the Wikipedia page on letter frequency. Read it quickly, but make sure you understand its main message.

Python to the rescue

As you may have guessed by now, one way to figure out what language a text is written in is to measure how often each letter occurs in the text, and to compare those frequencies to the known letter frequencies of texts written in different languages.

Below are various Python pieces that will come in handy for doing this:

Dictionaries

Of course! Dictionaries are a key ingredient of the solution. Here's a way to use dictionaries with characters, illustrated in an interactive Python shell.


>>> letters = {}
>>> letters[ 'a' ] = 0
>>> letters[ 'b' ] = 1
>>> letters[ 'c' ] = 0
>>> letters
{'a': 0, 'c': 0, 'b': 1}
>>> 'a' in letters
True
>>> 'z' in letters
False
>>> letters[ 'a' ] = letters[ 'a' ] + 10
>>> letters
{'a': 10, 'c': 0, 'b': 1}
>>> 

Reading a text file into a string

# readFile1.py
# D. Thiebaut

def getText( filename ):
    file = open( filename, "r" )
    text = file.read()
    file.close()
    return text
    
def main():
    filename = raw_input( "filename?  " )
    text = getText( filename )
    
    print text

main()

Processing characters of a String

# bigEs.py
# swap all 'e' characters for 'E'...

sentence = "The quick red fox jumped over the lazy brown sleeping dog"

newS = ''
for c in sentence:
    if c == 'e':
        c = 'E'
    newS = newS + c

print newS
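
The dictionary and character-loop pieces above can be combined to count letters. The following is only a minimal sketch, not the lab's official solution: the function names count_letters and rank_letters are made up for this illustration, and the choice to ignore case and to skip anything that is not a letter is an assumption. It sticks to constructs that should behave the same way in Python 2 and Python 3 for plain ASCII text; accented letters may need extra care under Python 2.

```python
# letterCount.py
# A sketch: count the letters of a string with a dictionary, then rank them.
# count_letters and rank_letters are hypothetical helper names.

def count_letters(text):
    """Return a dictionary mapping each letter in text to its number of occurrences."""
    counts = {}
    for c in text.lower():               # assumption: 'E' and 'e' count as the same letter
        if not c.isalpha():
            continue                     # skip spaces, digits and punctuation
        if c in counts:
            counts[c] = counts[c] + 1    # letter seen before: one more occurrence
        else:
            counts[c] = 1                # first time we see this letter
    return counts

def rank_letters(counts):
    """Return the letters in counts, most frequent first (ties in alphabetical order)."""
    letters = sorted(counts.keys())      # alphabetical order makes ties predictable
    letters.sort(key=lambda c: counts[c], reverse=True)   # stable sort keeps tie order
    return letters

sentence = "The quick red fox jumped over the lazy brown sleeping dog"
counts = count_letters(sentence)
ranking = rank_letters(counts)
print(" ".join(ranking))
```

Replacing sentence with the string returned by getText() would produce the ranking for a whole file.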

Programming Time!

  • Write a program that will help you identify the language of the mystery text.
  • The mystery text can be obtained as follows, from the Linux prompt in your beowulf account:
  getcopy secret.txt

Known Character Frequencies for different languages

Letters ranked from most frequent to least frequent, taken from http://letterfrequency.org:

UK English Language Letter Frequency:
e t a o i n s r h l d c u m f p g w y b v k x j q z

Spanish Language Letter Frequency:
e a o s r n i d l c t u m p b g y í v q ó h f z j é á ñ x ú ü w k

German Language Letter Frequency:
e n i s r a t d h u l c g m o b w f k z v ü p ä ß j ö y q x

French Language Letter Frequency:
e s a i t n r u l o d c m p é v q f b g h j à x è y ê z ç ô ù â û î œ w k ï ë ü æ ñ

Italian Language Letter Frequency:
e a i o n l r t s c d u p m v g h f b q z ò à ù ì é è ó y k w x j ô

Dutch Language Letter Frequency:
e n a t i r o d s l g h v k m u b p w j c z f x y (ë é ó) q

Greek Language Letter Frequency:
α ο ι ε τ σ ν η υ ρ π κ μ λ ω δ γ χ θ φ β ξ ζ ψ

Russian Language Letter Frequency:
о е а и н т с в л р к д м п у ё я г б з ч й х ж ш ю ц щ э ф (ъ ы ь)

Turkish Language Letter Frequency:
a e i n r l ı d k m u y t s b o ü ş z g ç h ğ v c ö p f j w x q

Polish Language Letter Frequency:
i a e o z n s c r w y ł d k m t p u j l g ę b ą h ż ś ó ć ń f ź v q x

Esperanto Language Letter Frequency:
a i e o n l s r t k j u d m p v g f b c ĝ ĉ ŭ z ŝ h ĵ ĥ w y x q

Swedish Language Letter Frequency:
e a n t r s l i d o m g k v ä h f u p å ö b c j y x w z é q (à è)

Open Problem

Challenge of the Day
Once you have the ranking of characters for your mystery text, can you figure out how to have Python compare it to the known frequency charts, such as the ones above, and output the most likely language?
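
One possible way to approach the challenge, offered as a hedged sketch rather than the expected answer, is to measure how far each letter's position in the mystery ranking is from its position in a known chart, and to pick the language with the smallest total. Everything named here (KNOWN_RANKINGS, ranking_distance, guess_language) is hypothetical. Using only the ten most frequent letters of six of the charts, and charging a fixed penalty for a missing letter, are assumptions made to keep the example short.

```python
# guessLanguage.py
# A sketch: compare a letter ranking against known frequency charts.
# All names below are hypothetical; the scoring rule is one choice among many.

# Ten most frequent letters of a few of the charts listed above.
KNOWN_RANKINGS = {
    "English": "etaoinsrhl",
    "Spanish": "eaosrnidlc",
    "German":  "enisratdhu",
    "French":  "esaitnrulo",
    "Italian": "eaionlrtsc",
    "Polish":  "iaeoznscrw",
}

def ranking_distance(ranking, reference):
    """Add up, for each letter of reference, the gap between its position in
    reference and its position in ranking. A letter absent from ranking
    costs len(reference)."""
    total = 0
    for position, letter in enumerate(reference):
        if letter in ranking:
            total += abs(ranking.index(letter) - position)
        else:
            total += len(reference)      # assumed penalty for a missing letter
    return total

def guess_language(ranking, known=KNOWN_RANKINGS):
    """Return the language whose chart is closest to ranking."""
    best_language = None
    best_distance = None
    for language in sorted(known):       # sorted, so ties are broken alphabetically
        distance = ranking_distance(ranking, known[language])
        if best_distance is None or distance < best_distance:
            best_language = language
            best_distance = distance
    return best_language

# ranking may be a list of letters or a string, most frequent letter first
print(guess_language(list("iaeoznscrw")))
```

A short text gives a noisy ranking, so the guess is more trustworthy on longer texts, and comparing more than ten letters would bring the accented characters into play.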

Answers

The text was "Smith College Museum of Art (SCMA) is the recipient of one of the most celebrated works by a preeminent leader of the Ashcan School of American realist painting." It was translated into Polish by Google.