## Retrofitting distributional embeddings to knowledge graphs with functional relations

```
__author__ = "Ben Lengerich, Andrew L. Maas, and Christopher Potts"
```

**TL;DR** We've developed a general framework for retrofitting existing distributed representations of words and concepts using the relational information in complex knowledge graphs. This post describes the framework in informal terms and seeks to convey visually how the retrofitting step reshapes an existing space of representations to more directly encode complex concepts. (In this new paper, we provide a more comprehensive description of the framework.)

## Learning distributed representations

One of the most powerful ideas in natural language processing is that we can represent words and phrases using dense vectors learned from co-occurrence patterns in text. The core insight traces to early research in structural linguistics, and its most famous early computational manifestation is Latent Semantic Analysis. More recently, deep learning has led to a resurgence of interest in this area, because deep models can both produce and consume these representations to good effect, not just for linguistic units, but for all kinds of concepts.

If you're new to these ideas, we recommend starting with this overview paper by Turney and Pantel. The GloVe paper is also an outstanding illustration of how to carefully motivate and define a specific objective function for representation learning. And this course has lots of hands-on exercises that can help you get used to these ideas.

## The distributional hypothesis, and its limits

The most striking thing about all this is that it works at all. Research in linguistics and lexicography might lead us to expect words to be represented discretely and symbolically, and research in psychology leads us to expect that humans learn words via grounding in social situations. From this perspective, trying to represent words with vectors learned from co-occurrence patterns in text seems sure to miss the mark.

And yet these representations have proven themselves in many settings, and one might even argue that they make good on a common intuition among linguists: that words tend to be incredibly complex and related to each other in all sorts of subtle ways. It's how you are able to understand *The stock market deteriorated* even if you've never heard *deteriorate* applied to the stock market – your mental representations for *stock market* and *deteriorate* click together in a new but predictable way.

That said, purely distributional information can be limiting. The resulting representations tend to excel at capturing notions of similarity, in meaning and in form, including subtle connotational information that naturally arises from, and reflects, the way people frame ideas. It's less clear that we can learn complex semantic relations from distributional data alone. Even entailment (in the sense that *puppy* entails *mammal*), which is closely related to similarity, can be a challenge to learn with these models because entailment is asymmetric and doesn't align well with usage frequency. Such relations leave us longing for some kind of structured resource we can rely on to make up for the limits of unstructured text.

## Knowledge graphs, and their limits

Knowledge graphs are paradigm cases of structured resources that can help make up for the deficiencies of co-occurrence patterns. Struggling to learn entailment? WordNet can probably help, since it's basically a massive network of entailment relations between words. Need a rich network of medical concepts? You can try to infer one from texts alone, but the massive collection of inter-connected UMLS graphs can surely jumpstart this process for you.

That said, knowledge graphs are expensive, time-consuming, and conceptually challenging to create. As a result, for any domain, their coverage is likely to be only partial. In contrast, text is typically inexpensive to collect, and the processing needed for learning distributed representations is often very lightweight.

This naturally raises the question of how best to combine representations learned from text with whatever structured information is available in the form of knowledge graphs. Faruqui et al. (2015) provide an elegant first answer to this question.

## Retrofitting

Faruqui et al.'s model begins with distributed representations. These are presumably learned or extracted from text, but, actually, they can come from anywhere. What's more, you can use whatever model you like to learn them: TF-IDF, Latent Semantic Analysis, Positive PMI or its close cousin word2vec, convolutional neural nets, GloVe, and others.

These representations are then *retrofit* with information from a knowledge graph, using an objective function that balances (1) retaining the structure of the original representations against (2) moving closer to the representations of neighbors in the knowledge graph:

$$ \sum_{i \in \mathcal{V}} \alpha_{i} \|\mathbf{q}_{i} - \hat{\mathbf{q}_{i}}\|^{2} + \sum_{(i,j) \in \mathcal{E}} \beta_{ij} \|\mathbf{q}_{i} - \mathbf{q}_{j}\|^{2} $$

Here, $\hat{\mathbf{q}_{i}}$ is the representation learned purely from text, $\mathbf{q}_{i}$ is its retrofitted counterpart, and $\|\mathbf{a}-\mathbf{b}\|^{2}$ is the squared Euclidean distance between vectors $\mathbf{a}$ and $\mathbf{b}$. The weights $\alpha_{i}$ and $\beta_{ij}$ determine the relative importance of staying faithful to the original vector and moving to become more like the neighboring nodes.
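To make the procedure concrete, here is a minimal sketch of the iterative updates this objective admits. The function and variable names are ours, and for simplicity we assume uniform weights $\alpha_{i}$ and $\beta_{ij}$ rather than the degree-based weighting one might use in practice:

```python
import numpy as np

def retrofit(Q_hat, neighbors, alpha=1.0, beta=1.0, n_iters=10):
    """Iteratively retrofit embeddings to a similarity graph.

    Q_hat     : (n, d) array of original, purely distributional vectors.
    neighbors : dict mapping node index -> list of neighbor indices.

    Holding all other vectors fixed, the minimizer of the objective for
    node i has a closed form, which we apply in repeated sweeps:
        q_i = (alpha * q_hat_i + beta * sum_j q_j) / (alpha + beta * deg_i)
    """
    Q = Q_hat.copy()
    for _ in range(n_iters):
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue  # isolated nodes keep their original vectors
            Q[i] = (alpha * Q_hat[i] + beta * Q[nbrs].sum(axis=0)) / (
                alpha + beta * len(nbrs)
            )
    return Q
```

Each sweep pulls connected nodes toward one another, while the $\alpha$ term anchors every vector near its original position.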

The authors created distributed representations using a variety of different models and retrofit them to WordNet, FrameNet, and the Penn Paraphrase Database, showing that the resulting representations are effective on a wide range of standard representation-learning tasks. This set-up matches a common practical use case, and the paper won the best paper award at NAACL 2015.

We of course love this paper. At Roam, we've invested heavily in building huge knowledge graphs, and we have huge amounts of text from a wide variety of domains in healthcare. The retrofitting idea holds out the promise of combining these data sets to gain better representations of healthcare concepts, which we can then use to improve the feature representations we use for NLP and to find previously overlooked relationships between medical concepts, among other things.

The sticking point for us is that Faruqui et al.'s model builds in the assumption that a graph edge between two nodes $A$ and $B$ implies that $A$ and $B$ are *similar* and should have a small pairwise distance. This is reasonable for the graphs they study, in which the semantics of the edge relationships is closely aligned with similarity. In contrast, the Roam Core Public Health Graph has about 250 edge-types with extremely diverse meanings, many of which do not imply similarity and even reach across domains (*treats*, *performed*). While Faruqui et al.'s retrofitting step is often still somewhat useful in such graphs, the situation clearly calls for a generalization of their method to allow for more than just similarity. Pursuing this goal led us to *Functional Retrofitting*.

## The Functional Retrofitting framework

The Functional Retrofitting framework extends Faruqui et al.'s by explicitly modeling all edge relations as functions and by including a separate term for edges $(i, j, r)$ that are not in the graph:

$$ \sum_{i \in \mathcal{V}} \alpha_{i} \|\mathbf{q}_{i} - \hat{\mathbf{q}_{i}}\|^{2} + \sum_{(i,j,r) \in \mathcal{E}} \beta_{ijr} f_{r}(q_{i}, q_{j}) - \sum_{(i,j,r) \in \mathcal{E}^{-}} \beta_{ijr} f_{r}(q_{i}, q_{j}) + \lambda \sum_{r}\rho(f_{r}) $$

By optimizing this loss function, we simultaneously learn edge semantics and update our entity representations. The set $\mathcal{E}^{-}$ is a sample of edges not in the graph, used to ensure precision of the learned edge semantics. The final term is a regularization penalty $\rho$ on the relation functions, with strength set by $\lambda$.
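As a sketch of how the pieces fit together, the loss can be evaluated like this. The interface and names (`penalty`, `reg`, etc.) are our own illustrative choices, not the paper's API, and the per-relation functions are passed in generically:

```python
import numpy as np

def functional_loss(Q, Q_hat, pos_edges, neg_edges, penalty, reg,
                    alpha=1.0, beta=1.0, lam=0.01):
    """Evaluate the functional-retrofitting objective.

    Q, Q_hat  : (n, d) arrays of current and original embeddings.
    pos_edges : (i, j, r) triples present in the graph.
    neg_edges : sampled (i, j, r) triples absent from the graph.
    penalty   : dict r -> f_r(q_i, q_j), a per-relation penalty function.
    reg       : dict r -> scalar value of rho(f_r) for that relation.
    """
    loss = alpha * float(np.sum((Q - Q_hat) ** 2))  # stay near the originals
    loss += beta * sum(penalty[r](Q[i], Q[j]) for i, j, r in pos_edges)
    loss -= beta * sum(penalty[r](Q[i], Q[j]) for i, j, r in neg_edges)
    loss += lam * sum(reg.values())                 # regularize the relations
    return loss
```

Minimizing this drives observed edges toward low penalty and sampled negative edges toward high penalty, while the first term keeps the vectors close to their distributional starting points.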

The form of these edge functions $f_{r}(q_{i}, q_{j})$ is largely open to the user and could even be selected by hand. Experimentally, we've had the most success to date with simple linear functions:

$$f_{r}(q_{i}, q_{j}) = \|\mathbf{A}_{r}\mathbf{q}_{j} + \mathbf{b}_{r} - \mathbf{q}_{i}\|^{2}$$

Faruqui et al.'s objective is the special case of this in which $\mathbf{A}_{r}$ is the identity matrix and $\mathbf{b}_{r}$ is the zero vector.
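In code, this linear penalty is a one-liner (the helper name is ours):

```python
import numpy as np

def linear_penalty(q_i, q_j, A, b):
    """Linear relation penalty: ||A @ q_j + b - q_i||^2."""
    return float(np.sum((A @ q_j + b - q_i) ** 2))
```

With `A = np.eye(d)` and `b = np.zeros(d)`, this reduces to the squared distance $\|\mathbf{q}_{j} - \mathbf{q}_{i}\|^{2}$, recovering the similarity-only penalty.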

Neural versions like the following are also showing promise and might prove crucial for particularly complex graphs:

$$f_{r}(q_{i}, q_{j}) = \sigma(\mathbf{q}^{\top}_{i}\mathbf{A}_{r}\mathbf{q}_{j})$$

Here, $\sigma$ is a non-linear activation function such as $\tanh$.
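A minimal sketch of this bilinear form, again with our own helper name and $\sigma = \tanh$:

```python
import numpy as np

def bilinear_penalty(q_i, q_j, A):
    """Neural-style penalty: tanh applied to the bilinear form q_i^T A q_j."""
    return float(np.tanh(q_i @ A @ q_j))
```

Because $\mathbf{A}_{r}$ is learned per relation, this form can capture interactions between embedding dimensions that a purely linear map cannot.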

Our paper reviews all the technical details of these instantiations of the framework and identifies a range of additional functions one might use for $f_{r}(q_{i}, q_{j})$. The associated code provides implementations of the models we evaluate in the paper.

## Knowledge discovery in health knowledge graphs

For our paper, we conducted experiments on WordNet and FrameNet to facilitate comparisons with Faruqui et al. Our focus at Roam, though, is on applying these models in the domain of healthcare:

- We learn initial representations for Snomed concepts and then retrofit them to the Snomed graph.
- We learn initial representations of drugs and diseases from clinical texts and retrofit them to a large Drug–Disease subgraph of the Roam Core Health Knowledge Graph.

The paper reviews the details of these experimental datasets and results. Suffice it to say that a linear instantiation of our framework proves to be really successful, especially with the Drug–Disease graph, where it not only consistently outperforms all its competitors, but it also leads to some solid improvements to our graphs by identifying a number of valid new *treats* edges between drugs and diseases.

The experimental results help prove the value of the framework, but they don't tell us much about *why* it helps. Answering such why questions is always challenging for representations this big and complex, but the following interactive diagrams begin to reveal the answer. The first depicts the representations for a linear instantiation of our framework, and the second allows us to compare those with representations learned with Faruqui et al.'s version.

*[Interactive diagram: Functional Retrofitting with linear relations]*

*[Interactive diagram: Faruqui et al. (identity retrofitting)]*

Both of these diagrams use t-SNE to visualize the high-dimensional representations in two dimensions (for help in interpreting these diagrams, see this elegant post).
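For readers who want to produce this kind of view themselves, here is a minimal sketch using scikit-learn's t-SNE; the array `Q` below is random stand-in data for a set of retrofitted concept embeddings, and all names are ours:

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
Q = rng.normal(size=(50, 16))  # stand-in for an (n, d) array of embeddings

# Project to 2-D; perplexity must be smaller than the number of points.
xy = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(Q)
# xy is an (n, 2) array suitable for a scatter plot colored by entity type
```

Coloring the resulting points by entity type (drug vs. disease, in our case) is what makes the separation between regions visible.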

It's striking how the Faruqui et al. embedding space mixes up drugs and diseases, whereas our model separates them into distinct regions. We think this is a key reason why the linear model is so effective – by allowing graph relations to span long distances, the linear retrofitting model retains the structure needed to model multiple types of entities.

## Looking ahead

It feels like we've just begun to realize the promise of the retrofitting idea. Functional Retrofitting opens up a vast new space of models, even allowing us to blend expert insights with general-purpose machine learning objectives. The representations we learn can be used as rich feature representations for a wide variety of machine learning models. We've so far seen the best results with a simple linear instantiation of our model, but we're now on the look-out for subgraphs of our health knowledge graphs that might justify more complex instantiations.