Angry Bits

Words on bytes and bits

Unicode in Python: Common Pitfalls

Most people don't get Unicode right. It is not their fault: I believe the biggest problem is that people underestimate the complexity of Unicode.

I've been working on text manipulation tasks in Python for a long time. Over the last ten years I've collected thousands of mistakes caused by misinterpreting Unicode and its implementation in Python.

In order to help beginners avoid a few of them and understand most of them, I've decided to focus this post on the common pitfalls you can hit with Python 2.x and Unicode. Python 3.x handles Unicode in a very different way; it has its own issues, but they are usually different from the ones you'll fight against in Python 2.x.

str vs unicode

Python has two string types: str and unicode. str is meant for binary data, unicode for text data. Since most string operations work on str, and since the literal syntax for byte strings is simpler than the one for unicode strings, you'll end up with lots of text data handled by the str type. This is not what you really want. Python tries to help you with automatic coercion between the two types: for example, if you concatenate a str and a unicode string, the result is a unicode string.
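
A quick check in a Python 2 interactive session shows the coercion at work:

>>> type('foo' + u'bar')
<type 'unicode'>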

How does Python convert a str to unicode? It decodes the byte string using a default encoding. You can check your default encoding by calling sys.getdefaultencoding(). It is normally ascii, which is almost always the wrong choice when you work with non-English languages.
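
You can see it for yourself in an interactive session (on a stock CPython 2.x build):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'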

Whenever Python does this automatic decoding it might raise a UnicodeDecodeError exception. This happens when the byte string is not a valid sequence in the default encoding:

>>> 'à' + u'foo'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
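
The fix is to decode the byte string explicitly, stating the encoding it actually uses (here I'm assuming the source is UTF-8):

>>> 'à'.decode('utf8') + u'foo'
u'\xe0foo'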

Normalization

A character in Unicode can be represented in multiple ways. For instance, the character é encoded in UTF-8 can be either the byte string '\xc3\xa9' (the precomposed character) or 'e\xcc\x81' (an e followed by a combining acute accent). This means you might find the same text represented in two different ways, e.g. when your strings come from different sources on the Web. Why is this important for us? Look at this example:

>>> a = '\xc3\xa9'.decode('utf8')
>>> b = 'e\xcc\x81'.decode('utf8')
>>> print a
é
>>> print b
é
>>> a == b
False

This is where the unicodedata module comes in handy, particularly its normalize function, which converts a string to a normal form. You can find the supported forms in the docs:

Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

Let's do it:

>>> a = '\xc3\xa9'.decode('utf8')
>>> b = 'e\xcc\x81'.decode('utf8')
>>> print a
é
>>> print b
é
>>> from unicodedata import normalize
>>> normalize('NFC', a) == normalize('NFC', b)
True

Et voilà: very simple, but very subtle.
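
Which form you pick matters less than picking one and sticking to it. Continuing the session above, you can see that NFC composes characters while NFD decomposes them:

>>> len(normalize('NFC', b)), len(normalize('NFD', b))
(1, 2)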

Newlines

A few years ago I was working with TSV files. The format was weakly defined: we basically had no escape characters, because the strings contained in each field were not supposed to contain tabs or line feeds. Some of the fields contained ASCII strings, others UTF-8 strings. My Python scripts did some text manipulation jobs, and to read the text from the files I was using the nice codecs module. This was my function to read the input:

import codecs

def read_tsv(filename):
    with codecs.open(filename, encoding='utf8') as fin:
        for line in fin:
            line = line[:-1]
            yield line.split('\t')

Simple, isn't it? ASCII is compatible with UTF-8, so there were basically no problems due to the encoding mix: I could assume everything was UTF-8 compliant.

This worked most of the time. One day it failed. It took a while to understand that, for some reason, one of the lines contained fewer fields than the script expected. Checking the bad line with an editor, I could see the right number of fields.

That day I learned that the Unicode standard defines multiple ways to represent newlines...

Check this code:

>>> a = u'hello\N{LINE SEPARATOR}world'
>>> len(a.split('\n'))
1
>>> len(a.splitlines())
2

As you can see, we've got a text string without line feeds that still breaks into two lines with the splitlines string method. That is exactly what the codecs reader was doing under the hood: it splits lines at every Unicode line boundary, LINE SEPARATOR included, not just at '\n'. Our weakly defined TSV format was right and my code was wrong: fields could not contain tabs ('\t') or line feeds ('\n'), and a LINE SEPARATOR inside a field was perfectly legal, but my lazy read_tsv function did not account for that. Even though it was a hard issue to spot, it was easy to fix:

from unicodedata import normalize

def read_tsv(filename):
    # binary mode: records are delimited by '\n' bytes, fields by '\t' bytes
    with open(filename, 'rb') as fin:
        for line in fin:
            line = line[:-1]  # strip the trailing line feed
            yield [normalize('NFC', x.decode('utf8')) for x in line.split('\t')]

The new function actually solves two issues:

  1. It reads the file as a binary file, because IT IS a binary file: each record is delimited by a line feed character and each field by a tab character (see the sketch below). Incidentally, all of the fields happen to contain text.
  2. It decodes every field and normalizes it to the same canonical form.

Point 2 was specific to my case.
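
As for point 1, here is a small sketch showing why binary mode matters; demo.tsv is just a throwaway example file, and the line counts are what you should get with Python 2.7, where the codecs reader splits at every Unicode line boundary, LINE SEPARATOR included:

>>> import codecs
>>> with codecs.open('demo.tsv', 'w', encoding='utf8') as fout:
...     fout.write(u'one\tfield with\u2028a separator\ntwo\tplain\n')
...
>>> sum(1 for line in open('demo.tsv', 'rb'))
2
>>> sum(1 for line in codecs.open('demo.tsv', encoding='utf8'))
3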

Unknown encodings, wrong encodings and other tips

Python's encode and decode methods raise errors when the passed encoding is not compatible with the source string. You can pass a second argument, errors, to both of these methods to change this behavior: besides the default 'strict', it accepts values like 'ignore' and 'replace'. This might be useful in situations where you are handling strings with mixed encodings or output generated by buggy software.
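
For example, decoding non-ASCII bytes with the ascii codec normally raises an error, but the errors argument lets you ignore or replace the offending bytes:

>>> '\xc3\xa9'.decode('ascii', 'ignore')
u''
>>> '\xc3\xa9'.decode('ascii', 'replace')
u'\ufffd\ufffd'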

Another common task is guessing the character encoding of a text file (e.g. web pages). chardet is an excellent library that can help you. If your problem is specific to web pages, Beautiful Soup is the library for you.
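
For instance, chardet exposes a detect function that takes raw bytes and returns its best guess; page.html and the confidence value below are purely illustrative:

>>> import chardet
>>> chardet.detect(open('page.html', 'rb').read())
{'confidence': 0.99, 'encoding': 'utf-8'}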

Be aware of Unicode also when you are writing regular expressions. For instance, if you want to extract all the words in a text, do not do this:

words = re.findall('[a-zA-Z]+', text, re.U)

The right version is the following:

words = re.findall('[^\W\d]+', text, re.U)
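
To see the difference, try both patterns on a non-English string in an interactive Python 2 session:

>>> import re
>>> text = u'caf\xe9 ol\xe9'
>>> re.findall('[a-zA-Z]+', text, re.U)
[u'caf', u'ol']
>>> re.findall('[^\W\d]+', text, re.U)
[u'caf\xe9', u'ol\xe9']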

But you probably want to check out the non-standard regex module (see issue 12729).

The iconv shell command is useful for converting file encodings:

$ iconv -f latin1 -t utf-8 input.txt > output.txt

Conclusions

Unicode can bring lots of confusion and subtle bugs. To avoid them, here are my simple recommendations:

  • Do not mix too many encodings; UTF-8 is fine for most cases, so stick with it.
  • Decode text data as soon as you read it, but not too early (see the TSV story above).
  • Normalize Unicode strings as soon as you can, if you need to.
  • Never mix str and unicode types.
  • Use print to write to files only if they were opened with the codecs module or wrapped by a codecs stream writer (see the sketch below).
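
A minimal sketch of that last point, assuming you want a UTF-8 output file (out.txt is just an example path):

>>> import codecs
>>> out = codecs.open('out.txt', 'w', encoding='utf8')
>>> print >> out, u'caf\xe9'
>>> out.close()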
