11.07.17
Mind Blown By Machine Translation
I have been studying machine learning lately, and have come across three recent research findings in machine translation which have each blown my mind:
- Computers can learn the meanings of words.
- Computers can make pretty good bilingual dictionaries given only a large monolingual collection of text (also known as “a corpus”) in each of the two languages.
- Computers can make sort-of good sentence-level translations given a bilingual dictionary made by #2.
Learning the meanings of words
Imagine that you could create a high-dimensional coordinate space representing different aspects of a word. For example, imagine that you have one axis which represents “maleness”, one axis which represents “authority”, and one axis which represents “tradition”. If the maximum value is 1 and the minimum 0, then the word “king” would thus have coordinates of (1, 1, 1), while the word “queen” would have coordinates (0, 1, 1), “duke” would maybe be (1, .7, 1), and “president” would maybe be (1, 1, .6). (If Hillary Clinton had been elected U.S. president, then maybe the maleness score would drop to something around .8).
You can see that, in this coordinate space, to go from “woman” to “man”, or “duchess” to “duke”, you’d need to increase the “maleness” value from 0 to 1.
It turns out that it is relatively easy now to get computers to create coordinate spaces which have hundreds of axes and work in exactly that way. It isn’t always clear what the computer-generated axes represent; they aren’t usually as clear as “maleness”. However, the relative positioning still works: if you need to add .2 to the coordinate at axis 398, .7 to the one at axis 224, and .6 to the one at axis 401 in order to go from “queen” to “king”, then adding that same offset (aka vector) to the coordinates for “woman” will land you at a spot whose closest word is probably “man”. Similarly, the offset which takes you from “Italy” to “Rome” will also take you from “France” to “Paris”, and the offset which takes you from “Japan” to “sushi” also takes you from “Germany” to “bratwurst”!
A function which maps words to coordinate spaces is called, in the machine learning jargon, “a word embedding”. Because machine learning depends on randomness, different programs (and even different runs of the same program) will come up with different word embeddings. However, when done right, all word embeddings have this property that the offsets of related words can be used to find other, similarly related word pairs.
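If you want to play with the offset trick yourself, a minimal sketch with gensim looks something like this. (The file name is the standard one for the publicly available Google News word2vec vectors; any word2vec-format file should work.)

```python
# Minimal sketch: explore word-vector offsets with gensim.
# Assumes you have downloaded a pre-trained word2vec file such as
# GoogleNews-vectors-negative300.bin (the standard Google News vectors).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "king" - "queen" + "woman" should land closest to "man"
print(vectors.most_similar(positive=["king", "woman"], negative=["queen"], topn=3))

# the offset from "Italy" to "Rome", applied to "France", should land near "Paris"
print(vectors.most_similar(positive=["Rome", "France"], negative=["Italy"], topn=3))
```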
IMHO, it is pretty amazing that computers can learn to encode some information about fundamental properties of the world as humans interpret it. I remember, many years ago, my psycholinguistics professor telling us that there was no way to define the meaning of the word “meaning”. I now think that there is a way to define the meaning of a word: it’s the coordinate address in an embedding space.
As I mentioned before, it’s surprisingly easy to make good word embeddings. It does take a lot of computation time and large corpuses, but it’s algorithmically simple:
- Take a sentence fragment of a fixed length (say 11) and have that be your “good” sentence.
- Replace the middle word with some other random word, and that’s your “bad” sentence.
- Make a model which has a word embedding leading into a discriminator.
- Train your model to learn to tell “good” sentences from “bad” sentences.
- Throw away the discriminator, and keep the word embedding.
In training, the computer program iteratively changes the word embedding to make it easier for the discriminator to tell if the sentence is “good” or “bad”. If the discriminator learns that “blue shirt” appears in good sentences, that “red shirt” appears in good sentences, but “sleepy shirt” does not appear in good sentences, then the program will move “blue” and “red” closer together and farther from “sleepy” in the word embedding.
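Here is a toy PyTorch sketch of that recipe, just to make it concrete. The vocabulary size, layer sizes, and the batches of word-index windows are all made up for illustration; a real run would use a big corpus and a lot of training time, but the shape of the thing is the same:

```python
# Toy sketch of the "good window vs. corrupted window" recipe, in PyTorch.
# Sizes are invented; `good` batches would come from windows of a real corpus.
import torch
import torch.nn as nn

vocab_size, embed_dim, window = 50_000, 300, 11

embedding = nn.Embedding(vocab_size, embed_dim)      # the part we keep
discriminator = nn.Sequential(                       # the part we throw away
    nn.Linear(window * embed_dim, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
opt = torch.optim.Adam(list(embedding.parameters()) + list(discriminator.parameters()))

def score(batch):                                    # batch: (B, 11) word indices
    vecs = embedding(batch).view(batch.size(0), -1)  # concatenate the 11 word vectors
    return discriminator(vecs).squeeze(1)

def training_step(good):                             # good: (B, 11) windows from the corpus
    bad = good.clone()
    bad[:, window // 2] = torch.randint(vocab_size, (good.size(0),))  # corrupt the middle word
    # ranking loss: a good window should score at least 1 higher than its corrupted twin
    loss = torch.clamp(1 - score(good) + score(bad), min=0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

At the end you throw away `discriminator` and keep `embedding.weight`: that table of coordinates is the word embedding.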
Christopher Olah has a good blog post which is more technical (but which also covers some additional topics).
Computers can make bilingual dictionaries with monolingual corpuses
A recent paper showed how to make pretty decent bilingual dictionaries given only monolingual corpuses. For example, if you have a bunch of English-only text and a bunch of French-only text, you can make a pretty good English<->French dictionary. How is this possible?!?
It is possible because:
- words in different languages with the same meaning will land at (about) the same spot in the embedding space, and
- the “shape” of the cloud of words in each language is pretty much the same.
These blew my mind. I also had the immediate thought that “Chomsky was right! Humans do have innate rules built into their brains!” Upon further reflection, though, #1 and #2 make sense, and maybe don’t imply that Chomsky was right.
For #1, if the axes of the word embedding coordinate space encode meaning, then it would make sense that words in different languages would land at the same spot. “King” should score high on male/authority/tradition in Japanese just as much as in English. (Yes, there could be some cultural differences: Japanese makes a distinction between green and blue in a different place in the colour spectrum than English does. But mostly it should work.)
For #2, language represents what is important, and because we share physiology, what is important to us is going to be very similar. Humans care a lot about the gender of animals (especially human animals), so I’d expect there to be a lot of words in the sector of the coordinate space having to do with gender and animals. However, I don’t think humans really care about the colour or anger of intellectual pursuits, so the sector where you’d look for colourless green ideas sleeping furiously ought to be empty in pretty much every language.
The way the researchers found to map one word embedding to another (i.e. how they mapped the embedding one program found for French onto the one they found for English) was to make the computer fight with itself. One piece acted like a detective and tried to tell which language a word came from (was it French or English?) based on its coordinates, and one piece tried to disguise which language it was by changing the coordinates (in a way which preserved the relational integrity). If the detective piece saw a high value at an axis where English words didn’t have high values, then it would know the word was French. The disguiser then learned to change the French coordinate space so that it would look more like the English coordinate space.
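Here is a heavily simplified PyTorch sketch of that fight, not the authors’ actual code. `fr_batch` and `en_batch` are assumed to be batches of pre-trained 300-dimensional French and English word vectors; the real method also keeps the mapping close to orthogonal (so relative positions are preserved) and has other refinements left out here:

```python
# Toy sketch of the adversarial alignment idea (not the paper's actual code).
# fr_batch / en_batch: float tensors of pre-trained word vectors, shape (batch, 300).
import torch
import torch.nn as nn

dim = 300
disguiser = nn.Linear(dim, dim, bias=False)   # maps French vectors into the English space
detective = nn.Sequential(                    # guesses which language a vector came from
    nn.Linear(dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1), nn.Sigmoid(),
)
d_opt = torch.optim.SGD(detective.parameters(), lr=0.1)
g_opt = torch.optim.SGD(disguiser.parameters(), lr=0.1)
bce = nn.BCELoss()

def training_step(fr_batch, en_batch):
    # 1) train the detective: mapped French = 0, real English = 1
    d_opt.zero_grad()
    mapped = disguiser(fr_batch).detach()
    guesses = detective(torch.cat([mapped, en_batch])).squeeze(1)
    labels = torch.cat([torch.zeros(len(fr_batch)), torch.ones(len(en_batch))])
    bce(guesses, labels).backward()
    d_opt.step()

    # 2) train the disguiser: make mapped French vectors look English to the detective
    g_opt.zero_grad()
    guesses = detective(disguiser(fr_batch)).squeeze(1)
    bce(guesses, torch.ones(len(fr_batch))).backward()
    g_opt.step()
    # (the real method also nudges the mapping back toward an orthogonal matrix here)
```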
They then refined their results with the Procrustes algorithm to warp the shape of the embedding spaces to match. They picked some high-occurrence words as representative points (since high-occurrence words like “person” and “hand” are more likely to have the same meaning in different languages), and used those words and their translations to figure out how to bend/fold/spindle/mutilate the coordinate spaces until they matched.
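The Procrustes step itself is tiny to write down. Here is a numpy sketch, assuming `X` holds the vectors of those high-occurrence anchor words (one per row) and `Y` holds the vectors of their proposed translations:

```python
# Minimal numpy sketch of the Procrustes refinement step.
# X: (n, 300) source-language vectors for the anchor words
# Y: (n, 300) target-language vectors for their proposed translations
import numpy as np

def procrustes(X, Y):
    """Best rotation W (an orthogonal matrix) such that X @ W lines up with Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# afterwards, map the whole source embedding with:  mapped = source_vectors @ W
```

Because W is constrained to be a rotation, the distances between words inside each cloud are preserved while the two clouds get lined up with each other.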
Computers can translate sentences given a dictionary
The same research group which showed how to make dictionaries (above) extended that work to machine translation with only monolingual corpuses. (In other words, no pre-existing hints of any kind as to what words or sentences in one language corresponded to words or sentences in the other language.) They did this by training two different models. For the first model, they took a good sentence in language A and messed it up, and trained the model to fix it. Then once they had that, they fed a sentence in language B into a B->A dictionary (which they had created as described above) to get a crappy translation, then fed it into the fixer-upper model. The fixed up translation wasn’t bad. It wasn’t great, especially compared to a human translator, but it was pretty freakin’ amazing given that there was never any sort of bilingual resource.
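Here is a toy Python sketch of two of the ingredients, hugely simplified: the word-by-word first pass through the induced dictionary, and the kind of “messing up” the fixer-upper (denoising) model is trained to undo. The dictionary and sentence are invented examples, and the sequence-to-sequence fixer-upper model itself is not shown:

```python
# Toy sketch: word-by-word dictionary translation plus the kind of noise the
# "fixer-upper" (denoising) model learns to undo. Dictionary and sentence are
# invented; the seq2seq fixer-upper model itself is not shown.
import random

def word_by_word(sentence, dictionary):
    """First-pass "crappy" translation: look each word up in the induced dictionary."""
    return [dictionary.get(w, w) for w in sentence]

def noisy(words, drop_prob=0.1, shuffle_window=3):
    """Mess a sentence up: drop some words and locally shuffle the rest."""
    kept = [w for w in words if random.random() > drop_prob]
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]

fr_en = {"le": "the", "chat": "cat", "est": "is", "sur": "on", "la": "the", "table": "table"}
rough = word_by_word("le chat est sur la table".split(), fr_en)
print(rough)         # the rough translation that gets handed to the fixer-upper model
print(noisy(rough))  # the kind of corrupted input the fixer-upper is trained to repair
```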
When I read The Hitchhiker’s Guide to the Galaxy, I scoffed at the Babel fish. It seemed completely outlandish to me. Now, it seems totally within the realm of possibility. Wow.
ducky said,
November 16, 2017 at 1:36 pm
I just downloaded a publicly-available dataset of word embeddings which was trained on a Google News corpus of 100 billion (yes, billion with a b) words, using publicly available code. This dataset has 3M words in it, shoehorned into 300 dimensions. No big deal.
My head is exploding at how I have so much space and so much compute power that downloading it is no big deal AND how this amazing wealth is available to me for free.