Confronting the challenges of modeling Social Determinants of Health
__author__ = 'Aneiss Ghodsi'
TL;DR Social Determinants of Health (SDoH) present an enormous opportunity for NLP in healthcare, but they pose a number of modeling challenges as well. We describe a combination of strategies we used on a recent project to overcome these challenges.
SDoH: An opportunity for deep language analysis
There is growing awareness that Social Determinants of Health (e.g., socioeconomic status, education, housing) play a substantial role in shaping people's health and well-being. These factors are only weakly and indirectly captured in structured data sources in healthcare, but they are richly described in clinical notes, social worker reports, patient surveys, and other textual sources. Thus, SDoH present an enormous opportunity for natural language processing in healthcare.
Of course, SDoH present significant challenges as well. The three big ones that we see:
- This is sensitive information, so we are likely to be restricted to small, tightly regulated data sets.
- The factors themselves are subtle and very complex to identify in text. For instance, a clinical text is unlikely to state directly that a patient is lonely; rather, this will be alluded to with statements about who visits the patient and how the patient talks about friends and family.
- As we said above, the information is distributed across different sources that are likely to use different registers and styles to talk about SDoH.
We have encountered arguments that challenge 2 alone is so severe that present-day NLP just isn't up to the task of modeling SDoH with sufficient nuance and accuracy. Our perspective is much more optimistic, for the reasons we discuss just below, but there is no denying that SDoH projects will stress-test every aspect of an NLP pipeline – the annotation guidelines will be complex, annotating will be challenging, and the deployed NLP model will need to be attuned to a great deal of linguistic complexity.
And of course the above challenges interact with each other to make the problem even harder: for subtle problems that span different sources, we would like to have lots of data, but we're just not going to get it in this space!
Confronting the challenges
This is exactly the situation we found ourselves in for a recent Roam client project. We received three small data sets: social worker notes, transcriptions of social worker phone calls, and EHRs. Our task was to create models and interfaces that would help the client identify at-risk patients and begin to develop interventions that would lead to better outcomes for them.
With SDoH, there is just so much potential to improve people's lives! Even a seemingly small thing – for example, finding patients for whom transportation is an issue – can lead to relatively inexpensive interventions (covering Uber and Lyft rides; sending mobile clinics to specific geographic areas) that have a large impact on health outcomes and ultimately save lives.
Once the data arrived, we got to work as usual. I'll gloss over the complexities that come with transferring, storing, and working with sensitive health data like this. Suffice it to say that security is always top of mind for us, and it shapes all our internal tools and processes. Here, though, let's skip directly to the all-important task of designing an annotation scheme. For us, this is a data-driven process that is led by in-house clinical experts, and it involves balancing project goals with the realities of what the data will support. Our general aim is a labeling scheme that gives sufficient coverage of relevant concepts and aligns with the broader goals of the project. For the current project, this process led us to a set of concepts that we knew would vary in frequency across our data sets, but we decided it was worth annotating and modeling all of them to try to serve those broader goals.
Predictably, the twin challenges of small data sets and diverse data sets emerged as soon as we started the modeling effort. One of the datasets had reasonable coverage for the most prevalent concepts but very few instances of the rarer categories. The other two were just sparse for all the concepts – so sparse that we could basically rule out a standard train/assessment set-up for them, since training was likely to be poor and assessment guaranteed to be noisy.
Our standard model is an LSTM-CRF with very rich pretraining and other kinds of featurization. In a recent publication, we showed that these models can get traction even on very small datasets with sparse categories. And they did do okay, all things considered, but this wasn't going to be enough. We had to get creative about how we used the data sets. And get creative we did!
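To make that concrete, here is a minimal sketch of a BiLSTM-CRF tagger of the general kind we're describing, written in PyTorch with the pytorch-crf package. It leaves out the rich pretraining and extra featurization that do most of the heavy lifting in our production models, and the names and dimensions are illustrative rather than our actual settings:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTMCRFTagger(nn.Module):
    """A pared-down BiLSTM-CRF sequence tagger; all names and sizes are illustrative."""

    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, token_ids, tags=None, mask=None):
        x = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        x, _ = self.lstm(x)             # (batch, seq_len, hidden_dim)
        scores = self.emissions(x)      # per-token label scores
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequences under the CRF.
            return -self.crf(scores, tags, mask=mask, reduction='mean')
        # Inference: Viterbi decoding of the best tag sequence for each sentence.
        return self.crf.decode(scores, mask=mask)
```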
Because we were applying basically the same label set to all of the datasets, we could initialize a model for one dataset with weights trained on another, and then do some fine-tuning. This resulted in some substantial improvements. Ideally, we would have supplemented this with unsupervised pretraining of the input representations on unlabeled examples from these domains, but we just didn't have enough data to make that work (challenge 1 again).
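Here's a rough sketch of how that initialization and fine-tuning can be wired up, reusing the BiLSTMCRFTagger sketch above. The helper name, data loader, and hyperparameters are placeholders for illustration, not our exact training set-up:

```python
import copy

import torch


def finetune_from(source_model, target_batches, lr=1e-4, epochs=3):
    """Initialize a tagger for a sparse dataset from one trained on a richer
    dataset with the same label set, then fine-tune it on the sparse data.
    `source_model` is a trained BiLSTMCRFTagger; `target_batches` yields
    (token_ids, tags, mask) tensors -- both are stand-ins for illustration."""
    model = copy.deepcopy(source_model)                      # start from the source weights
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # gentler LR than training from scratch
    for _ in range(epochs):
        for token_ids, tags, mask in target_batches:
            loss = model(token_ids, tags=tags, mask=mask)    # CRF negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```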
Oversampling or weighted losses are commonly employed to redress imbalanced datasets; both keep underrepresented labels from being drowned out and poorly learned by the model. When we are predicting document- or sentence-level labels, this kind of balancing is straightforward. However, we are often predicting labels for individual words and phrases, which pretty much rules out achieving a true balance: when you oversample sentences containing a sparse label, you also oversample the common categories that come along for the ride. Nonetheless, we found that oversampling these sparse labels yielded notable improvements for many of the lower-prevalence labels.
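In practice this can be as simple as the following sketch (the sentence representation and duplication factor are made up for the example): duplicate any sentence containing a sparsely labeled token, accepting that the common labels in those sentences get duplicated too.

```python
import random


def oversample_sparse_labels(sentences, sparse_labels, factor=5, seed=0):
    """Duplicate sentences that contain at least one sparsely labeled token.
    Each sentence is a (tokens, tags) pair; `sparse_labels` is the set of
    tags we want to see more often during training. This only approximately
    rebalances the token-level label distribution, because common labels
    ride along with every duplicated sentence."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for tokens, tags in sentences:
        if any(tag in sparse_labels for tag in tags):
            augmented.extend([(tokens, tags)] * (factor - 1))
    rng.shuffle(augmented)
    return augmented
```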
This is perhaps our boldest gambit. Most techniques for synthesizing examples require large amounts of unlabeled data. We were already making exhaustive use of the datasets available to us, so those techniques were out. But we hypothesized that something simple – something hacky on the face of it – might pay off: we randomly replaced spans of a given label with other spans of the same label, thus introducing positive example spans with new contexts to our training data. There are a lot of implicit assumptions here relating to the exchangeability of syntactic and semantic context, so we would not expect it to work for all problems. Unsurprisingly, we saw mixed results. Certain labels improved, while others got worse. This is a great outcome for us, because, in our final model, we can selectively target labels for which the technique is effective.
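Here is the gist of that span-swapping trick as a sketch. The BIO-style tags and all of the names are assumptions for illustration, not the exact production code: we collect every span carrying the target label and then splice randomly chosen donor spans into other sentences' contexts.

```python
import random


def swap_spans(sentences, target_label, n_new=100, seed=0):
    """Synthesize training sentences by swapping spans of `target_label`
    between sentences. Each sentence is a (tokens, tags) pair with BIO tags.
    The implicit bet is that spans of the same label are roughly
    exchangeable in context -- which is why this helps for some labels
    and hurts for others."""
    rng = random.Random(seed)
    begin, inside = f"B-{target_label}", f"I-{target_label}"

    # Collect every (sentence index, start, end, span tokens) for the label.
    spans = []
    for i, (tokens, tags) in enumerate(sentences):
        start = None
        for j, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final span
            if start is not None and tag != inside:
                spans.append((i, start, j, tokens[start:j]))
                start = None
            if tag == begin:
                start = j

    # Splice randomly chosen donor spans into other sentences' contexts.
    synthesized = []
    for _ in range(n_new):
        (i, s, e, _), (_, _, _, donor) = rng.sample(spans, 2)
        tokens, tags = sentences[i]
        new_tokens = tokens[:s] + donor + tokens[e:]
        new_tags = (tags[:s]
                    + [begin] + [inside] * (len(donor) - 1)
                    + tags[e:])
        synthesized.append((new_tokens, new_tags))
    return synthesized
```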
Specialized training infrastructure, but a clear win
Our production models combined all three of these techniques. After identifying which labels benefited from the synthesis of new examples, we used synthesis to balance those labels and oversampling to balance the others. And we made extensive use of initializing and fine-tuning to extract as much value out of our annotations as we could while still specializing our models to the different data sources available. The resulting procedures might offend the purists, and they do impose new burdens on our model training infrastructure, but the costs are clearly worth it – they helped improve our models to the point where they could be effective tools in designing and optimizing interventions to improve outcomes for the most at-risk patients.