Fine-tuning GloVe representations
By Nick Dingwall
TL;DR: We developed a new method for fine-tuning GloVe representations to specialized domains, and we open-sourced the code, which also includes a fast, vectorized Python implementation of GloVe itself.
Roam is always trying to improve the performance of our sequence labelling models for extracting complicated concepts from clinical text. Recently, recurrent neural networks (RNNs) have outperformed more traditional models like conditional random fields (CRFs) on a variety of tasks, but whenever we've applied them to clinical text, they've fallen woefully short of our CRFs.
So we did what we always do in these situations: we organized a hackathon to get more minds on the problem.
Typically, for RNN modeling, the first step is to build an embedding matrix that represents words as vectors, and there is a lot of evidence that it pays to use pretrained vectors rather than initializing them randomly. At our hackathon, this is as far as I got: as xkcd would say, I nerd-sniped myself.
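To make that first step concrete, here is a minimal sketch of building an embedding matrix from a pretrained lookup, falling back to small random vectors for out-of-vocabulary words. The function name and arguments are hypothetical, invented for this illustration:

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Hypothetical helper: one row per vocabulary item, taken from the
    pretrained lookup where available, small random values otherwise."""
    rng = np.random.default_rng(seed)
    E = np.empty((len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            E[i] = pretrained[word]
        else:
            # OOV words get small random vectors
            E[i] = rng.uniform(-0.5, 0.5, dim) / dim
    return E
```

The OOV rows are the problem: random vectors carry no information, which is exactly what fine-tuning on in-domain text is meant to fix.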
I was worried that publicly released pretrained vectors – trained on massive datasets like Common Crawl – wouldn't generalize well to our specialized medical texts, which include nonstandard vocabulary and usages.
Ideally, we'd 'fine-tune' Stanford's release of their GloVe word vectors on our much smaller medical datasets. That is, we'd encourage our learned representations to stay close to Stanford's unless there was strong evidence that a word's usage was different. In this way, new tokens would receive embeddings compatible with Stanford's: e.g. an obscure disease that doesn't have a Stanford embedding would receive one similar to that of a common disease that does.
There are lots of ways we could implement this 'fine tuning' intuition. As a group we've been thinking about retrofitting – even publishing a paper about it (check out the blog post) – and that seemed like a natural solution. Chris Potts, our Co-Chief Scientist, showed me how easy it would be to add such a retrofitting term to the GloVe objective function and we quickly agreed that the name of the resulting model should be Mittens (because it's GloVe with a warm start).
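In the paper's notation, the resulting objective looks roughly like this: the standard GloVe loss plus a squared-distance penalty pulling each learned embedding toward its pretrained counterpart, where $R$ is the set of words that have pretrained vectors $r_i$, $\hat{w}_i = w_i + \tilde{w}_i$ is the learned embedding, and $\mu \geq 0$ controls the strength of the pull (with $\mu = 0$ recovering vanilla GloVe):

```latex
J_{\text{Mittens}} \;=\;
  \sum_{i,j \,:\, X_{ij} > 0} f(X_{ij})\,
    \bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
  \;+\; \mu \sum_{i \in R} \bigl\lVert \hat{w}_i - r_i \bigr\rVert^{2}
```

Words without a pretrained vector simply sit outside $R$: their embeddings are free to be shaped entirely by the in-domain co-occurrence statistics.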
We usually work in Python, so I looked for Python implementations to modify. But they were all painfully slow because they looped in pure Python over the nonzero entries of the co-occurrence matrix. I convinced myself that there was a trick that would allow us to vectorize the algorithm, and thus began the second round of self-nerd-sniping. While everyone else worked on clever RNN variants as part of the hackathon, I grabbed a whiteboard and started working out derivatives. Before long, the hackathon drew to a close and everybody reported on their findings. I didn't notice because I was still filling up the whiteboard and trying to get my implementation to agree with the theory.
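To give a flavor of the vectorization trick, here is an illustrative NumPy sketch (not our shipped implementation): instead of looping over nonzero cells, compute predictions for every (i, j) pair with one matrix product, mask by the GloVe weighting function, and the loss and all four gradients fall out of dense matrix operations:

```python
import numpy as np

def glove_loss_and_grads(W, C, bw, bc, X, xmax=100, alpha=0.75):
    """Illustrative vectorized GloVe loss and gradients.
    W, C: word and context embeddings (V x d); bw, bc: biases (V,);
    X: co-occurrence counts (V x V)."""
    # GloVe weighting f(x) = min((x / xmax)^alpha, 1), zero where x == 0
    weights = np.minimum((X / xmax) ** alpha, 1.0)
    weights[X == 0] = 0.0
    # log X, made safe for zeros (those cells are zeroed by the weights)
    logX = np.log(np.where(X > 0, X, 1.0))
    # predictions for every (i, j) pair at once, via broadcasting
    diffs = W @ C.T + bw[:, None] + bc[None, :] - logX
    # the 0.5 factor just simplifies the gradient expressions below
    loss = 0.5 * np.sum(weights * diffs ** 2)
    wd = weights * diffs            # term shared by every gradient
    grad_W = wd @ C                 # d loss / d W
    grad_C = wd.T @ W               # d loss / d C
    grad_bw = wd.sum(axis=1)        # d loss / d bw
    grad_bc = wd.sum(axis=0)        # d loss / d bc
    return loss, (grad_W, grad_C, grad_bw, grad_bc)
```

Note the trade-off mentioned below: this computes all |V|² cells, so it scales with the size of the matrix rather than with the number of nonzero entries, which is a win exactly when the matrix isn't too sparse or when the hardware (a GPU) is built for dense algebra.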
Once I finally got it working it was an order of magnitude faster than other implementations. Chris and I translated it into TensorFlow, which opened the door to GPUs and another order of magnitude of improvement in speed. Chris has loads of experience helping his grad students at Stanford to turn ideas into papers, and so could quickly design a set of experiments to evaluate whether these retrofitted embeddings improved performance on downstream tasks. To our delight, when we ran them, Mittens outperformed both publicly released embeddings and traditional GloVe embeddings learned directly on the target datasets.
Having agreed on the name of the model, we devoted ourselves to arguing about the names of various parameters while we wrote a paper. That paper was accepted to NAACL and so last month, Chris and I went to New Orleans to present it.
*Me presenting the poster. Had I known that I was talking to Hinrich Schütze, a pioneer in representation learning (among other things), I would have been considerably more nervous!*
We open-sourced the Mittens code on GitHub, including a fast implementation of GloVe (in both NumPy and TensorFlow). The results of some speed tests we ran for different vocabulary sizes (with 90%-sparse co-occurrence matrices) are below. Notice that, on a GPU, our TensorFlow code is as fast as the official C release. On denser matrices, ours overtakes it, since our implementation scales with the dimensions of the co-occurrence matrix while the official release scales with the number of nonzero entries.
| | 5K (CPU) | 10K (CPU) | 20K (CPU) | 5K (GPU) | 10K (GPU) | 20K (GPU) |
| --- | --- | --- | --- | --- | --- | --- |
| Official GloVe (in C) | 0.66 | 1.24 | 3.50 | − | − | − |
If you want to try it out yourself, you can install it with

```
pip install mittens
```
With Mittens, we found a way to infuse domain-specific information into our word representations, and we think this is an important step toward making RNNs and related models competitive for our problems. However, it should be said that they are still no match for the hand-built feature functions that power our CRFs.