New paper: Effective feature representation for clinical text concept extraction
By Chris Potts
TL;DR We posted a new paper to arXiv, based on the work Yifeng Tao did with us during his summer internship. The paper motivates a novel deep learning architecture for combining dense feature representations derived from unlabeled text with sparse feature representations derived from medical ontologies. The paper shows that the architecture is effective for the kinds of small, complex datasets that are typical of NLP in healthcare.
How can we make the best use of small text datasets?
This question is always on our minds at Roam. Our NLP use-cases tend to be complex and specialized, which means that we have to call on our clinical annotation team to label new examples. The fewer examples they have to label, the more effective Roam can be.
There are many aspects to this problem, including creating efficient tools and workflows for the clinical annotation team, developing robust infrastructure for model optimization, and identifying model architectures that are especially well-suited to our datasets.
Our newest research paper, a Roam/Stanford/CMU collaboration, is devoted to this model architecture question:
Effective feature representation for clinical text concept extraction
Yifeng Tao, Bruno Godefroy, Guillaume Genthial, and Christopher Potts
This paper is the result of work Yifeng Tao (CMU) did during his 2018 summer internship at Roam, in collaboration with Bruno, Guillaume, and me.
The guiding hypothesis of the paper is that there is useful information both in the sort of dense unsupervised word representations that can be learned from unlabeled text (we use ELMo) and in the sparse feature representations that can easily be derived from the many publicly available medical ontologies.
Finding a model that can make good use of these very different kinds of information is a challenge. We propose a model that combines a recurrent neural network with a conditional random field (CRF), with some special additional representational layers designed to synthesize the two kinds of feature representation.
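To give a flavor of what such a synthesis layer might look like, here is a minimal numpy sketch. It is not the paper's actual architecture (which sits inside an RNN–CRF tagger); all dimensions, weight names, and the choice of a tanh projection are illustrative assumptions. The core idea shown is simply that dense text-derived vectors and sparse ontology-derived indicator features can each be projected into a shared space and fused per token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (the real dense vectors, e.g. ELMo, are much larger).
n_tokens = 4
dense_dim = 8    # stand-in for dense embeddings learned from unlabeled text
sparse_dim = 6   # stand-in for binary features looked up in medical ontologies
out_dim = 5      # size of the fused per-token representation

dense = rng.normal(size=(n_tokens, dense_dim))                      # dense view
sparse = (rng.random((n_tokens, sparse_dim)) > 0.7).astype(float)   # sparse view

# Hypothetical synthesis layer: project each view to a shared space, then
# combine with a nonlinearity to get one fused representation per token.
W_dense = rng.normal(size=(dense_dim, out_dim))
W_sparse = rng.normal(size=(sparse_dim, out_dim))
fused = np.tanh(dense @ W_dense + sparse @ W_sparse)

print(fused.shape)  # one fused vector per token: (4, 5)
```

In the paper's setting, a representation like `fused` would then feed the recurrent layers of the sequence labeler, so the tagger can draw on whichever view is more informative for a given token.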
We show that this model yields superior performance on a number of very different sequence labeling datasets: two clinical datasets that Roam developed, a dataset of tweets about adverse drug reactions, a dataset of scientific abstracts, and a new dataset of drug–disease relationships that we developed and are releasing with the paper.
The paper also includes a number of analyses aimed at identifying how the model makes use of the dense and sparse features it contains. These analyses show clearly that the model learns to pay more attention to the most reliable features, where this notion of reliability depends on a number of factors, including text length and problem type.
The analyses also show that the combined model can get traction on very small categories. This is an especially welcome outcome, since the small categories in healthcare are often those that correspond to rare but vitally important events.