Angry Bits

Words on bytes and bits

Quick and dirty language detection

A few months ago, while developing a product demo, we wanted to try a new feature that required language detection. I couldn't find a pluggable one, so we decided to quickly build our own. This might sound complicated to many of you, but writing a simple language detector with average accuracy is actually not that hard.

Where to start?

First, you need to find a document collection with samples in all the languages you are interested in. There are many sources available around the web; we chose Wikipedia as it's one of the most complete: tons of documents in many languages.

Second, you need to choose a metric to evaluate your language classifier. Standard metrics for classifiers are accuracy, precision and recall. We will focus only on the last two.
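As a quick refresher: precision for a language is the fraction of texts we labelled with that language that really are in it, tp / (tp + fp), while recall is the fraction of texts in that language that we actually caught, tp / (tp + fn). A minimal sketch (the counts below are invented for illustration, not from our dataset):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall (as percentages) from raw counts."""
    precision = tp / float(tp + fp) * 100.0 if tp + fp else 0.0
    recall = tp / float(tp + fn) * 100.0 if tp + fn else 0.0
    return precision, recall

# e.g. 8 correct "en" guesses, 2 wrong "en" guesses, 2 "en" texts missed
print(precision_recall(8, 2, 2))  # (80.0, 80.0)
```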

Let's first write a program that evaluates our classifier:

from collections import defaultdict


class LanguageDetector(object):
    """Base class for a language detector

    """
    def train(self, samples):
        raise NotImplementedError

    def detect(self, text):
        raise NotImplementedError

    def eval(self, samples):
        """Evaluate the model against a labelled set"""

        tp = defaultdict(int)
        fn = defaultdict(int)
        fp = defaultdict(int)
        languages = set()

        mistakes = []
        for label, text in samples:
            languages.add(label)
            lang_code, _ = self.detect(text)

            if lang_code == label:
                tp[label] += 1
            else:
                mistakes.append((text, label, lang_code))
                fn[label] += 1
                fp[lang_code] += 1

        precision = {}
        recall = {}
        for lang in languages:
            if tp[lang] + fp[lang] == 0:
                precision[lang] = 0.0
            else:
                precision[lang] = tp[lang] / float(tp[lang] + fp[lang]) * 100.0

            if tp[lang] + fn[lang] == 0:
                recall[lang] = 0.0
            else:
                recall[lang] = tp[lang] / float(tp[lang] + fn[lang]) * 100.0

        return precision, recall, mistakes

Our eval method returns precision/recall for each language, plus a list of mistakes that will be useful for improving the classifier.
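One handy way to use that mistakes list is to count the most frequent confusions (a sketch; the sample mistakes below are invented, shaped like the tuples eval produces):

```python
from collections import Counter

def top_confusions(mistakes, n=5):
    """Count (true language, predicted language) pairs from eval()'s mistakes."""
    pairs = Counter((label, predicted) for _, label, predicted in mistakes)
    return pairs.most_common(n)

mistakes = [("ola", "pt", "es"), ("oi", "pt", "es"), ("ciao", "it", "es")]
print(top_confusions(mistakes))  # [(('pt', 'es'), 2), (('it', 'es'), 1)]
```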

Baseline

To start experimenting we need to write a quick baseline. The baseline is the first and most obvious classifier we can think of. In my case I'm not smart enough, so I'll go with the following code:

import random


class RandomLanguageDetector(LanguageDetector):
    """Simple random classifier.

    """
    def train(self, samples):
        model = set()
        for label, _ in samples:
            model.add(label)
        model.add('xx')  # placeholder code for "unknown language"
        self._model = list(model)

    def detect(self, text):
        return random.choice(self._model), 1.0
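For reference, both train and eval expect labelled samples as (language code, text) pairs; a toy set (invented here, not taken from Wikipedia) might look like:

```python
# Each sample pairs a language code with a text in that language.
samples = [
    ("en", "the quick brown fox"),
    ("it", "la volpe veloce"),
    ("fr", "le renard rapide"),
]
print(len(samples))  # 3
```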

Yeah, random classifiers are generally the easiest to write :-) Let's test our evaluation:

-----   -----   -----
Lang    Prec    Rec
-----   -----   -----
de      16.11   13.82
en      16.94   14.88
es      15.85   13.66
fr      16.33   14.08
it      16.15   13.86
pt      16.36   13.50

Our baseline is (as expected) quite crappy; now we have to find a better way to detect languages. I've decided to restrict our languages to some common European ones, but the workflow described here can be applied to any language with just a few changes.

Features

In order to classify a sample we need to extract some information that will help us with the task. Basically, we want to extract some features related to the variable we want to guess. In our case the variable is the language.

A simple feature is the distribution of letters. Think about English and French, for instance: English hardly uses accented letters, while French uses them a lot.
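Counting letters with collections.Counter makes the difference visible even on tiny, made-up snippets:

```python
from collections import Counter

english = Counter("the cat sat on the mat")
french = Counter("le chat est assis sur le tapis, à côté de la fenêtre")

# Accented letters show up only in the French counts.
accented = "àâéèêîôûç"
print(sum(english[c] for c in accented))  # 0
print(sum(french[c] for c in accented) > 0)  # True
```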

Classification

Once we know the letter distribution of the sample to classify, we still need a way to classify it. We can represent letter distributions as vectors, and vectors can be compared with each other in multiple ways. Given a vector u, we can find the closest vector from a set L. Our approach is simple: u is the letter-distribution vector of the sample text, and L is the set of letter distributions of the languages.

Now we just have to choose which similarity metric to use to compare the vectors. A very common vector similarity, used and abused in Information Retrieval, is the Cosine Similarity. We are going to use this metric in our example. Feel free to try different metrics.
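For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. Over sparse vectors stored as dicts (the representation we'll use below), a minimal version looks like:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Parallel vectors score 1.0, orthogonal ones 0.0:
print(cosine_similarity({'a': 1.0}, {'a': 2.0}))  # 1.0
print(cosine_similarity({'a': 1.0}, {'b': 1.0}))  # 0.0
```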

Let's do it!

So, now that we've collected all the pieces of the puzzle, let's go on:

  1. Compute the letter distribution of each language
  2. Do the same for the input text
  3. Compute the cosine similarity between the input vector and the others

import math
from operator import itemgetter


class CosineLanguageDetector(LanguageDetector):
    """Cosine similarity based language classifier that uses single chars as features

    """
    def _preprocess(self, text):
        return text

    def _extract_features(self, text):
        return list(self._preprocess(text))

    def _normalize_vector(self, v):
        norm = math.sqrt(sum(x * x for x in v.values()))
        if norm == 0:
            return  # empty vector, nothing to normalize
        for k in v:
            v[k] /= norm

    def train(self, samples):
        extract_features = self._extract_features

        model = defaultdict(lambda: defaultdict(float))
        for label, text in samples:
            features = extract_features(text)
            for f in features:
                model[label][f] += 1

        for v in model.values():
            self._normalize_vector(v)

        self._model = dict(model)

    def detect(self, text):
        features = self._extract_features(text)
        u = defaultdict(float)
        for f in features:
            u[f] += 1
        self._normalize_vector(u)

        r = []
        for l, v in self._model.items():
            score = 0.0
            for f in u:
                score += u[f] * v.get(f, 0.0)
            r.append((l, score))
        return max(r, key=itemgetter(1))

The code is quite straightforward. We've overridden the train method to compute the letter distributions of the languages in the training set. Each vector item represents a letter and contains the number of times the letter appears in the training samples. We normalize the vectors in order to simplify the detect function.

Let's check how this classifier works against our metrics:

-----   -----   -----
Lang    Prec    Rec
-----   -----   -----
de      64.63   71.58
en      60.89   59.90
es      60.63   47.68
fr      66.45   54.66
it      56.10   70.40
pt      63.15   65.98

Very well, our basic classifier is already better than the random one! Precision is around 60 percent for each language: if our classifier says that an input text is, for example, English, it is right about six times out of ten. At least, this is the expected behaviour when classifying text similar in form to the data in our dataset; note also that these figures assume we never have to classify unknown languages.

Improvements

There are several things to do to improve the classifier:

  • preprocessing
  • try different features
  • postprocessing

Adding a preprocessing step will help us normalize the text. For instance, we may assume that case is not relevant for language detection and lowercase the input text. Another useful preprocessing option is to filter out punctuation.
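A possible _preprocess override along these lines, sketched here as a standalone function (stripping digits as well is my own addition):

```python
import string

def preprocess(text):
    """Lowercase the text and strip ASCII punctuation and digits."""
    keep = (ch for ch in text.lower()
            if ch not in string.punctuation and not ch.isdigit())
    return ''.join(keep)

print(repr(preprocess("Hello, World 42!")))  # 'hello world '
```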

An example of postprocessing would be discarding a classification when our confidence is low: if the similarity score of the nearest language doesn't cross a certain threshold, we can return an "unknown" result. This increases precision at the cost of recall.
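As a sketch, this thresholding can be wrapped around any detector; the 0.5 threshold and the 'xx' code for "unknown" are arbitrary choices here:

```python
def detect_or_unknown(detector, text, threshold=0.5):
    """Return the detected language, or 'xx' when the confidence is too low."""
    lang, score = detector.detect(text)
    if score < threshold:
        return 'xx', score
    return lang, score
```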

Single letters may not be that informative for language detection; using bigrams (subsequences of two consecutive letters of the input) might be more helpful. In Python, extracting bigrams is as simple as:

bigrams = [text[i:i+2] for i in range(len(text) - 1)]
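For example (plugging this into the detector just means overriding _extract_features in a subclass to return bigrams instead of single letters):

```python
def bigrams(text):
    """Extract overlapping character bigrams from a string."""
    return [text[i:i+2] for i in range(len(text) - 1)]

print(bigrams("ciao"))  # ['ci', 'ia', 'ao']
```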

As soon as we make the change, we just have to run our evaluation again and check how it goes:

-----   -----   -----
Lang    Prec    Rec
-----   -----   -----
de      83.58   89.16
en      76.72   83.92
es      87.91   69.66
fr      79.36   80.60
it      83.60   84.20
pt      80.28   82.22

Wow! Big improvement! Features are important indeed :-)

Conclusions

The source code of the language detector written in this post can be found on GitHub. Feel free to fork it, experiment with it, etc.

If you try the classifier we've just built on your friends' chat messages, you will see that it won't work very well. The reason is that our training set doesn't really reflect everyday informal language. Chat messages, tweets and forum comments are usually short texts full of slang and abbreviations. Context is important: your training set should always reflect the properties of the target text.

Be aware of that ;-)
