Friday, April 13, 2012

Language import tidbits: working with unicode in python

Okay, so a bit of exploration of Python's built-in unicode support and the unicodedata module.  The unicode() built-in is available automatically, while unicodedata isn't and has to be imported.  The difference here?

The built-in unicode() function returns a unicode object from, say, a string, although this can also be done using a literal prefix such as:

>> b = u'A'
 
where b is a unicode object of the upper case Latin letter 'A'.

This unicode object can in turn be passed to any number of unicodedata module functions.  For instance,

first importing the unicodedata module

>>import unicodedata

then passing the unicode object exampled above

>>unicodedata.name(b)

returns 'LATIN CAPITAL LETTER A'
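Pulling those pieces together, here is a minimal Python 2 sketch (the same calls as above, just as one runnable block; the comparison at the end is my own addition):

import unicodedata

b = u'A'                     # a unicode literal
c = unicode('A')             # the same object via the unicode() built-in
print(unicodedata.name(b))   # LATIN CAPITAL LETTER A
print(b == c)                # True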

One can likewise get a character back from its Unicode name by using the correct Unicode character naming convention.  Example:

 >>a = unicodedata.lookup('MODIFIER LETTER RHOTIC HOOK')

returns u'\u02de'

which is the Unicode code point reference.  Note that if you tried to call .decode() on this object you'd get an error in Python's IDLE GUI stating, in effect, that this character can't be translated into a recognized ASCII character (in Python 2, decoding a unicode object first attempts an implicit ASCII encode, which fails for non-ASCII characters).  If you want to see its rendered typeset form you can simply call print() on the object.  Thus,

>>print(unicodedata.lookup('MODIFIER LETTER RHOTIC HOOK'))

should provide a character rendering of the letter indicated above

which is '˞'
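Here is a small sketch of that behavior in one place, with the ASCII complaint triggered explicitly via encode('ascii') (the implicit conversion attempted by decode() fails the same way):

import unicodedata

a = unicodedata.lookup('MODIFIER LETTER RHOTIC HOOK')
print(repr(a))                    # u'\u02de', the code point reference
print(a)                          # the rendered glyph, font permitting

try:
    a.encode('ascii')
except UnicodeEncodeError as e:
    print(e)                      # ...can't encode character u'\u02de'...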

Unicode maintains a database table of such objects, appearing in semicolon-separated form here.  I'd also mention one can find code block group information (e.g., Latin, Cyrillic, and so forth) here.  Basically this provides an index between the character set blocks and the individual characters themselves.

In Python the translated convention for a Unicode code point such as U+02DE is, as indicated above, u'\u02de'.
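Going the other direction, from a character back to its U+ label, is a one-liner; the %04X formatting here is just my own convention for writing it out:

a = u'\u02de'
print('U+%04X' % ord(a))   # U+02DE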

For those interested further, I would suggest reading at unicode.org.

And for Python programmers, while I would suggest Python's reference documentation on the unicodedata module, it appears some aspects of the coding have changed relative to the docs that I've read here...which is to say, some of the information furnished above may be more current as far as implementation goes.

If you are working with partial hexadecimal string forms, another conversion method is as follows, using this example:

a = '0041'
a = '0x'+a
 
b = eval('unichr(' + a + ')')

Typically the Unicode database represents hexadecimal values in an abridged form relative to Python; the full form in Python is given with a 0x prefix,

but since we have a string form of this hexadecimal number, I opted to use Python's eval() function, which evaluates string expressions (e.g., the string '1 + 1' evaluated would yield 2).  We then use the Python function unichr() to convert the hexadecimal code point to its unicode character form.
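If you would rather avoid eval(), a simpler sketch is to parse the hex string with int(a, 16) and pass the resulting integer straight to unichr():

a = '0041'
code_point = int(a, 16)    # 65
b = unichr(code_point)     # u'A'
print(repr(b))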

Finally, if you have difficulties rendering the character in your particular GUI, you can further convert it to UTF-8 (which works for GTK+ 3 and Pango renderers) by doing the following, using the above example:

b = b.encode('utf-8')
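For example (a quick sketch continuing the rhotic hook character), the result is an ordinary Python 2 byte string:

import unicodedata

b = unicodedata.lookup('MODIFIER LETTER RHOTIC HOOK')
b = b.encode('utf-8')
print(repr(b))    # '\xcb\x9e'
print(type(b))    # <type 'str'>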
 

So far I have written something of a basic importer using the Unicode database tables, reading the semicolon-separated values here.  The links between the code blocks table and the data tables are off in my case, though, mostly because the min/max block ranges can include values that don't actually exist in the data table.  My present algorithm scans from the min or max searching for the nearest corresponding high/low in the block range, and I'm having problems with this: I convert from hexadecimal to decimal, increment/decrement, convert back to hex, and truncate the hex form to match the table key form, but something is still missing somewhere.  Will probably go to another solution in the next day or so.
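For what it's worth, here is a sketch of one way I might sidestep the nearest-match scanning: since the blocks table gives min/max ranges, converting both ends to integers and testing membership handles code points that never appear in the data table at all.  The Blocks.txt file name and its '0000..007F; Basic Latin' line format are my assumptions about the table in question:

def load_blocks(path):
    # Parse block ranges into (low, high, name) tuples of integers.
    blocks = []
    for line in open(path):
        line = line.split('#')[0].strip()   # drop comments and blank lines
        if not line:
            continue
        rng, name = line.split(';')
        lo, hi = rng.split('..')
        blocks.append((int(lo, 16), int(hi, 16), name.strip()))
    return blocks

def block_of(blocks, char):
    # Membership test against the ranges; gaps in the data table don't matter.
    cp = ord(char)
    for lo, hi, name in blocks:
        if lo <= cp <= hi:
            return name
    return None

blocks = load_blocks('Blocks.txt')
print(block_of(blocks, u'\u02de'))          # Spacing Modifier Letters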



