CSC111 Lab 12 2010

From dftwiki3
Revision as of 20:41, 21 April 2010 by Thiebaut (talk | contribs) (Known Character Frequencies for different languages)




This lab deals with dictionaries, and natural language processing.
Тази лаборатория се занимава с речници и обработка на естествен език
Das Labor beschäftigt sich mit Wörterbüchern und Verarbeitung natürlicher Sprache.
Ce laboratoire traite de dictionnaires, et le traitement du langage naturel.
辞書をこのラボはお得な情報、自然言語処理。
maabara hii inahusika na Mkwawa, na usindikaji lugha ya asili.
該實驗室處理詞典,和自然語言處理。




Dictionaries

Dictionaries are data structures in Python that have the following properties:

  • they keep track of pairs of elements: the first is the key, the second is the value
  • all the keys are unique
  • dictionaries allow fast searching, insertion and retrieval of information.
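
The properties above can be illustrated with a minimal sketch. The `stock` dictionary and its contents are hypothetical and used only for illustration; the statements are written so that they should run in both Python 2 and Python 3.

```python
# Hypothetical inventory used only for illustration.
stock = {"apple": 30, "pear": 10}

# Keys are unique: assigning to an existing key replaces its value
# instead of creating a second "apple" entry.
stock["apple"] = 45

# Assigning to a key that is not present inserts a new pair.
stock["kiwi"] = 7

# Membership tests and lookups go straight to the key.
has_pear = "pear" in stock
has_plum = "plum" in stock

print(len(stock))        # number of key:value pairs
print(stock["apple"])    # value currently stored under "apple"
```

Because each key appears only once, the dictionary ends up with three pairs even though "apple" was assigned twice.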

Playing with Dictionaries

  • Use Python in interactive mode and enter the different Python statements shown below.
  • Observe the output, and make sense of how dictionaries work in Python
>>> 
>>> # create an empty dictionary
>>> D = {}
>>> D


>>> # create a dictionary with a few key:value pairs
>>> D = { "apple":30, "pear":10, "banana":5 }
>>> D


>>> # inspect some of the contents
>>> D[ 'pear' ]


>>> D[ 'apple' ]


>>> # we are getting 25 more bananas...
>>> D[ 'banana' ] = D[ 'banana' ] + 25
>>> D[ 'banana' ]


>>> D



>>> # we're getting a new shipment of pineapples... 100 of them
... 
>>> D[ 'pineapple' ] = 100
>>> D



>>> # we want the name of the fruits (keys) we carry...
>>> D.keys()


>>> for fruit in D.keys(): 
...     print fruit
... 






>>> 
>>> # we want to print the full inventory
... 
>>> for key in D.keys():
...     print D[ key ], "units of", key
... 



  • Now that you better understand how dictionaries work, try to answer the following questions in Python (use the interactive mode):
Question 1
How many bananas do we have?
Question 2
We sell half of the bananas. Remove half of the bananas from your inventory.
Question 3
Print the fruits for which we have more than 50 units.

Problem #2




In Il caso di Google, gli intrusi sembrava avere informazioni precise circa i nomi di software per sviluppatori di Gaia la.




What language is the sentence above written in?

To find out, let's write a Python program that will tell us!

Introduction

First, go over the Wikipedia page on letter frequency. Read over the information quickly, but make sure you understand what this is all about.

Python to the rescue

As you will have now guessed, one way to figure out what language a text is written in is to measure the frequency of occurrence of each letter in the text and to compare it to the frequency of letters appearing in known texts written in different languages.
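
The idea of measuring letter frequencies can be sketched as follows. This is only one possible approach, not the lab's official solution: the function names `count_letters` and `rank_letters` are made up for illustration, and the sample sentence is the one used in bigEs.py further down this page.

```python
def count_letters(text):
    """Return a dictionary mapping each letter of text to its count."""
    counts = {}
    for c in text.lower():
        if c.isalpha():                      # skip spaces, digits, punctuation
            counts[c] = counts.get(c, 0) + 1
    return counts

def rank_letters(counts):
    """Return the letters ordered from most to least frequent."""
    # Sort by decreasing count; ties are broken alphabetically so the
    # result is the same on every run.
    pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return "".join([letter for letter, count in pairs])

sample = "The quick red fox jumped over the lazy brown sleeping dog"
ranking = rank_letters(count_letters(sample))
print(ranking[:10])      # the 10 most frequent letters of the sample
```

A single sentence is far too short to give a reliable ranking; the sketch only shows the mechanics of counting with a dictionary and sorting by count.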

Below are various Python pieces that will come in handy for doing this:

Dictionaries

Of course! Dictionaries are the important ingredient of the solution. Here's a way to use dictionaries with characters, illustrated in an interactive Python shell.


>>> letters = {}
>>> letters[ 'a' ] = 0
>>> letters[ 'b' ] = 1
>>> letters[ 'c' ] = 0
>>> letters
{'a': 0, 'c': 0, 'b': 1}
>>> 'a' in letters
True
>>> 'z' in letters
False
>>> letters[ 'a' ] = letters[ 'a' ] + 10
>>> letters
{'a': 10, 'c': 0, 'b': 1}
>>> 

Reading a text file into a string

# readFile1.py
# D. Thiebaut

def getText( filename ):
    # read the whole file into one string; "with" closes the file for us
    with open( filename, "r" ) as f:
        text = f.read()
    return text
    
def main():
    filename = raw_input( "filename?  " )
    text = getText( filename )
    
    print text

main()

Processing characters of a String

# bigEs.py
# swap all 'e' characters for 'E'...

sentence = "The quick red fox jumped over the lazy brown sleeping dog"

newS = ''
for c in sentence:
    if c == 'e':
        c = 'E'
    newS = newS + c

print newS

Programming Time!

  • Write a program that will help you identify the language of the mystery text, which is the first sentence of a longer text (the longer the text, the more meaningful the ranking of letters).
  • The mystery text can be obtained as follows, from the Linux prompt in your beowulf account:
  getcopy secret.txt

Known Character Frequencies for different languages

Letters ranked from most frequent to least frequent, taken from http://letterfrequency.org:

UK English Language Letter Frequency:
e t a o i n s r h l d c u m f p g w y b v k x j q z

Spanish Language Letter Frequency:
e a o s r n i d l c t u m p b g y í v q ó h f z j é á ñ x ú ü w k

German Language Letter Frequency:
e n i s r a t d h u l c g m o b w f k z v ü p ä ß j ö y q x

French Language Letter Frequency:
e s a i t n r u l o d c m p é v q f b g h j à x è y ê z ç ô ù â û î œ w k ï ë ü æ ñ

Italian Language Letter Frequency:
e a i o n l r t s c d u p m v g h f b q z ò à ù ì é è ó y k w x j ô

Dutch Language Letter Frequency:
e n a t i r o d s l g h v k m u b p w j c z f x y (ë é ó) q

Greek Language Letter Frequency:
α ο ι ε τ σ ν η υ ρ π κ μ λ ω δ γ χ θ φ β ξ ζ ψ

Russian Language Letter Frequency:
o e a и н т с в л р к д м п у ë я г б з ч й х ж ш ю ц щ e ф (ъ ы ь)

Turkish Language Letter Frequency:
a e i n r l ı d k m u y t s b o ü ş z g ç h ğ v c ö p f j w x q

Polish Language Letter Frequency:
i a e o z n s c r w y ł d k m t p u j l g ę b ą h ż ś ó ć ń f ź v q x

Esperanto Language Letter Frequency:
a i e o n l s r t k j u d m p v g f b c ĝ ĉ ŭ z ŝ h ĵ ĥ w y x q

Swedish Language Letter Frequency:
e a n t r s l i d o m g k v ä h f u p å ö b c j y x w z é q (à è)

To save time, some of these strings have been transformed into Python code that you can paste directly into your program:

languageStrings = [('Spanish', 'eaosrnidlc'), ('German', 'enisratdhu'), ('French', 'esaitnrulo'), 
                   ('Italian', 'eaionlrtsc'), ('Dutch', 'enatirodsl'), ('Turkish', 'aeinrl\xc4\xb1dk'), 
                   ('Polish', 'iaeoznscrw'), ('Swedish', 'eantrslido')]


Important Note
When you write a string that contains non-ASCII characters in a Python program, you need to tell the Python interpreter to expect such characters. This is done by including the following line as the first or second line of your program, aligned against the left margin.
# -*- coding: iso-8859-15 -*-


Each pair in the list contains two strings: the first is the name of a language, the second holds the 10 most frequent characters found in texts written in that language.
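
One detail worth noticing in the Turkish entry: `\xc4\xb1` is the two-byte UTF-8 encoding of the dotless ı, so that string is 10 bytes long but holds only 9 letters. The sketch below, which assumes the text is UTF-8 encoded and uses made-up variable names, shows how decoding the bytes first lets each letter be counted once.

```python
# The Turkish entry from languageStrings, written as a byte string.
turkish_bytes = b'aeinrl\xc4\xb1dk'

# Decoding turns the raw bytes into a string of characters.
# Assumption: the bytes are UTF-8 encoded.
turkish_letters = turkish_bytes.decode('utf-8')

print(len(turkish_bytes))      # number of bytes
print(len(turkish_letters))    # number of characters
```

If your mystery text contains accented letters, counting bytes instead of characters would split such letters into two meaningless halves, so decoding before counting is probably the safer route.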

Open Problem

Challenge of the Day
Can you figure out, once you have the ranking of characters for your mystery text, how to have Python compare it for you to the known frequency charts, such as those above, and output the most likely language?
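
One possible way to approach the challenge is sketched below. It is not the lab's official answer: the scoring rule (sum of the differences in rank for each letter), the function names `distance` and `guess_language`, and the sample `mystery_ranking` are all assumptions made for illustration. Only a subset of languageStrings is used to keep the example short.

```python
# A subset of the languageStrings list given earlier in this lab.
languageStrings = [('Spanish', 'eaosrnidlc'), ('German', 'enisratdhu'),
                   ('French', 'esaitnrulo'), ('Italian', 'eaionlrtsc')]

def distance(ranking, reference):
    """Sum of rank differences between two rankings (smaller is closer).

    This scoring rule is an assumption made for illustration.
    """
    total = 0
    for position, letter in enumerate(reference):
        if letter in ranking:
            total += abs(ranking.index(letter) - position)
        else:
            # Penalize reference letters missing from the text's ranking.
            total += len(reference)
    return total

def guess_language(ranking, references):
    """Return the language name whose reference ranking is closest."""
    best_name = None
    best_score = None
    for name, reference in references:
        score = distance(ranking, reference)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical ranking computed from some mystery text.
mystery_ranking = 'enirstadhu'
print(guess_language(mystery_ranking, languageStrings))
```

Other scoring rules are possible, for example counting how many of the top 10 letters two rankings share; comparing a few of them on texts of known language is a good way to see which one discriminates best.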

Answers

The text was "Smith College Museum of Art (SCMA) is the recipient of one of the most celebrated works by a preeminent leader of the Ashcan School of American realist painting." and was translated into Polish by Google.