How Google converted translation into a problem of vector space mathematics
An article about how Google Translate works from Technology Review. Excerpt:
The new approach is relatively straightforward. It relies on the notion that every language must describe a similar set of ideas, so the words that do this must also be similar. For example, most languages will have words for common animals such as cat, dog, cow and so on. And these words are probably used in the same way in sentences such as “a cat is an animal that is smaller than a dog.”
The same is true of numbers. The image above shows the vector representations of the numbers one to five in English and Spanish and demonstrates how similar they are.
This is an important clue. The new trick is to represent an entire language using the relationship between its words. The set of all the relationships, the so-called “language space”, can be thought of as a set of vectors that each point from one word to another. And in recent years, linguists have discovered that it is possible to handle these vectors mathematically. For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
It turns out that different languages share many similarities in this vector space. That means the process of converting one language into another is equivalent to finding the transformation that converts one vector space into the other.
This turns the problem of translation from one of linguistics into one of mathematics. So the problem for the Google team is to find a way of accurately mapping one vector space onto the other. For this they use a small bilingual dictionary compiled by human experts–comparing the same corpus of words in two different languages gives them a ready-made linear transformation that does the trick.
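To make the "linear transformation from a small dictionary" idea concrete, here is a minimal sketch in Python with NumPy. It is not taken from the article or from Google's code: the 3-dimensional word vectors are invented, the Spanish space is simply assumed to be a rotated copy of the English one, and the function names `learn_mapping` and `translate` are hypothetical. The mapping is fitted by ordinary least squares on four dictionary pairs and then tried on a word the dictionary does not contain.

```python
import numpy as np

# Invented toy embeddings; real ones would be learned from large text corpora.
english = {
    "one":   np.array([1.0, 0.2, 0.0]),
    "two":   np.array([0.8, 0.6, 0.1]),
    "three": np.array([0.3, 1.0, 0.2]),
    "four":  np.array([0.0, 0.7, 0.9]),
    "five":  np.array([0.2, 0.1, 1.0]),
}

# Assumption for this toy: the Spanish space is the English space rotated
# by 0.5 radians about the third axis.
c, s = np.cos(0.5), np.sin(0.5)
R = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])
pairs = [("one", "uno"), ("two", "dos"), ("three", "tres"),
         ("four", "cuatro"), ("five", "cinco")]
spanish = {es: english[en] @ R for en, es in pairs}


def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares matrix W such that src_vecs @ W is close to tgt_vecs."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def translate(word, src_emb, tgt_emb, W):
    """Project a source word into the target space and return the
    target word whose vector has the highest cosine similarity."""
    projected = src_emb[word] @ W
    best_word, best_sim = None, -np.inf
    for candidate, vec in tgt_emb.items():
        sim = projected @ vec / (np.linalg.norm(projected) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = candidate, sim
    return best_word


# The "small bilingual dictionary": only the first four pairs are used for fitting.
dictionary = pairs[:4]
X = np.stack([english[en] for en, _ in dictionary])
Z = np.stack([spanish[es] for _, es in dictionary])
W = learn_mapping(X, Z)

# "five" was held out of the dictionary, so this is a genuine lookup.
guess = translate("five", english, spanish, W)
```

In this toy the two spaces are related by an exact linear map, so the held-out word lands right on its translation. With real embeddings the fit would only be approximate, and the nearest neighbour would be a best guess rather than a guaranteed match.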
This seems like a fairly good technique for more isolating languages, where the whitespace-delimited words of one language can mostly be mapped directly onto the words of another.
I wonder how well it works for heavily agglutinative or polysynthetic languages, though. Since a single morpheme in these languages often corresponds to a separate word in others, I guess you’d first need to parse the words into morphemes and then map those morphemes into the shared vector space, which seems like it would be a bit harder.
At any rate, another entry for linguistics jobs.