Roam @ AIDS2018: Machine Learning to Predict HIV Outcomes
__author__ = "John Semerdjian"
TL;DR Roam recently collaborated with ViiV Healthcare to explore models predicting HIV health outcomes, and presented the project's core methods and results at the AIDS2018 conference.
Supervised machine learning to predict HIV outcomes using electronic health record and insurance claims data
As the lead of the Clinical Data Science team at Roam Analytics, I have the privilege of working on fascinating machine learning problems that have the potential to impact patient lives. One such project was a collaboration with ViiV Healthcare, a pharmaceutical company specializing in HIV treatments. Using longitudinal EHR data, Roam built a custom UI for ViiV Healthcare to explore models predicting common HIV health outcomes. Our project found that structured data alone isn't enough to predict HIV outcomes. I also presented our methods and results at the AIDS2018 conference in Amsterdam, the largest conference of its kind. It was fascinating to see so much of the clinical research at the conference as well; our paper was one of only a few that applied machine learning methods to medical record datasets.
Having presented our work at the AIDS2018 conference, we now want to elaborate on some of the technical details of how we applied machine learning to this task. For this project we had a mix of anonymized structured and unstructured EHR data aggregated from many sources across the United States. The structured data contained common patient demographic variables as well as labs, prescriptions, diagnoses, and more. The anonymized provider notes associated with each visit contained a wealth of information and were partially structured as excerpts and phrases documented at the point of care. The phrases were extracted via a large custom lexicon developed by our data vendor and, as a result, we did not know the precise sequence of the phrases. Each phrase also carried associated properties such as "sentiment" (e.g., "negative"), "section" (e.g., "patient history"), and "location" (e.g., "knee"). We preprocessed these terms using a combination of stemming and lemmatization, followed by further phrase extraction using Roam's internal ontologies and public ones (e.g., UMLS). This enhanced preprocessing of the phrases resulted in more meaningful "term features" to use as input to our models.
We additionally repurposed these term features to create a Vector Space Model (VSM). Our assumption was that, if properly visualized, the term embeddings could provide greater context about highly predictive phrases. To accomplish this, we aggregated all notes for a given anonymized patient over a period of time and generated an embedding of each phrase using both GloVe and word2vec, as follows. For GloVe, we started with a patient-phrase count matrix, generated a phrase co-occurrence matrix from it, and applied GloVe to that matrix. For word2vec, we followed the basic algorithmic adjustments described in Choi et al. (2016) to learn embeddings on the longitudinal phrase data: we partitioned the data into multiple intervals, removed duplicate phrases, and randomly shuffled the phrases within each interval. Each partition was then treated as a single sentence and fed through the word2vec algorithm. There was a surprising amount of congruity between the word2vec and GloVe embeddings, but in the end we opted for the word2vec vectors because they were relatively faster to train.
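The two data-preparation steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline we ran: the function names, the `(patient_id, day, phrase)` event format, the 30-day interval, and the choice to weight co-occurrence by the product of per-patient counts are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_sentences(events, interval_days=30, seed=0):
    """Turn (patient_id, day, phrase) events into shuffled pseudo-sentences.

    Assumption: one sentence per (patient, time interval) bucket. Within a
    bucket, duplicate phrases are dropped and the rest are randomly shuffled,
    since the true order of phrases in the notes is unknown.
    """
    rng = random.Random(seed)
    buckets = defaultdict(set)  # a set drops duplicate phrases per bucket
    for patient_id, day, phrase in events:
        buckets[(patient_id, day // interval_days)].add(phrase)

    sentences = []
    for key in sorted(buckets):
        phrases = sorted(buckets[key])  # sort first so the shuffle is reproducible
        rng.shuffle(phrases)
        sentences.append(phrases)
    return sentences


def cooccurrence(patient_phrase_counts):
    """Build phrase co-occurrence counts from a patient-phrase count table.

    Assumption: two phrases co-occur when they appear for the same patient,
    weighted by the product of their counts (the off-diagonal of X^T X for
    a patient-by-phrase count matrix X).
    """
    cooc = defaultdict(int)
    for counts in patient_phrase_counts.values():
        for a, b in combinations(sorted(counts), 2):
            cooc[(a, b)] += counts[a] * counts[b]
    return dict(cooc)


if __name__ == "__main__":
    # Toy, made-up events purely for illustration.
    events = [
        ("p1", 3, "fatigue"),
        ("p1", 10, "fatigue"),
        ("p1", 12, "viral load"),
        ("p1", 45, "rash"),
        ("p2", 5, "viral load"),
    ]
    for sentence in build_sentences(events):
        print(sentence)
```

The resulting list of token lists is the shape of input that common word2vec implementations expect (for example, gensim's `Word2Vec` accepts an iterable of token lists), while the co-occurrence counts are the kind of input a GloVe trainer consumes.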
To visualize the embeddings, we experimented with both t-SNE and Uniform Manifold Approximation and Projection (UMAP). t-SNE is a very popular algorithm but was too slow for our purposes, so we opted for UMAP, which served as a drop-in replacement for t-SNE in our visualization pipeline.
To better understand the relationship between HIV outcomes and phrase embeddings, we analyzed the proportion of times a clinical phrase in our classifier was associated with a positive or negative HIV health outcome. The color (and color intensity) of each phrase in the 3D embedding space was determined by binning these proportions in intervals of 0.2. When visualized in our embedding space, this simple approach illustrates the clear relationship between HIV outcomes and clinical phrases, going beyond inspection of the classifier's coefficients, and gives stakeholders an intuitive and interactive tool for visualizing key insights from these analyses. Further exploration using this tool can lead to hypothesis generation for future studies.
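The binning step can be illustrated with a short Python sketch. The function name, the five-color palette, and the example counts below are hypothetical and are not taken from our UI. The sketch only shows how a per-phrase proportion maps to one of five bins of width 0.2.

```python
def outcome_bin(n_positive, n_total, n_bins=5):
    """Return the bin index for the proportion n_positive / n_total.

    With n_bins=5 each bin spans 0.2: [0, 0.2), [0.2, 0.4), ..., [0.8, 1.0].
    Integer arithmetic avoids floating-point edge cases at bin boundaries.
    """
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("need 0 <= n_positive <= n_total and n_total > 0")
    # A proportion of exactly 1.0 is folded into the last bin.
    return min(n_positive * n_bins // n_total, n_bins - 1)


# Hypothetical diverging palette: strong negative -> neutral -> strong positive.
PALETTE = ["#b2182b", "#ef8a62", "#dddddd", "#67a9cf", "#2166ac"]

if __name__ == "__main__":
    # Made-up (positive-outcome count, total count) pairs per phrase.
    phrase_counts = {
        "phrase_a": (1, 10),
        "phrase_b": (6, 10),
        "phrase_c": (10, 10),
    }
    for phrase, (n_pos, n_tot) in phrase_counts.items():
        print(phrase, PALETTE[outcome_bin(n_pos, n_tot)])
```

Working from raw counts rather than precomputed proportions keeps boundary values such as 0.6 in the intended bin.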
You can check out the full UI below:
It was a pleasure to work with our partners at ViiV Healthcare. They provided us with valuable HIV domain expertise throughout the course of the project. We believe combining the clinical proficiency of domain experts with the careful application of machine learning techniques can enable stakeholders like ViiV Healthcare to develop new hypotheses about the drivers of HIV health outcomes. These hypotheses, if vetted by future experiments, have the potential to alter the delivery of health care for HIV patients.
[For more details about the specific HIV health outcomes and model performance, check out our conference poster and abstract entitled Supervised machine learning to predict HIV outcomes using electronic health record and insurance claims data.]
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37: 141–188.
Pennington, Jeffrey; Richard Socher; and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543.
Mikolov, Tomas; Kai Chen; Greg S. Corrado; and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations.
Choi, Youngduck; Chill Yi-I Chiu; and David Sontag. 2016. Learning low-dimensional representations of medical concepts. AMIA Joint Summits on Translational Science, 41–50.
McInnes, Leland and John Healy. 2018. UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software 3(29): 861.