COVID-19 Annotation and Modeling: Key Challenges and Lessons Learned
By Melissa Medvedev, Laura White, and Mark Graves
TL;DR The body of literature related to the novel coronavirus (COVID-19) is growing rapidly, presenting an unprecedented opportunity for NLP to generate new insights. Using the COVID-19 Open Research Dataset, we explored factors associated with infection and clinical severity. We describe strategies used to overcome key challenges in the annotation and modeling process.
Overview of the COVID-19 pandemic
The novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was identified from a cluster of patients with pneumonia of unknown etiology who were linked to a seafood market in Wuhan, China in December 2019.1 The first cases in North America, Europe, and Australia were identified in January 2020.2,3 The World Health Organization declared COVID-19 a pandemic on March 11, 2020.4,5 Since that time, COVID-19 has spread to 210 countries and territories across six continents, resulting in a total of 5,808,946 cases and 360,776 deaths as of May 28, 2020.3
Clinical manifestations of COVID-19 in adults and children
SARS-CoV-2 infection has been demonstrated to cause a wide range of clinical manifestations, ranging from asymptomatic to life-threatening.6 An analysis of the first 425 cases of COVID-19 pneumonia in Wuhan found that the median age was 59 years.7 An analysis of 1099 Chinese cases in January reported that the median age was 47 years and nearly a quarter of patients had one or more comorbidities (e.g., hypertension).8 Cases were defined as severe if pneumonia was present at the time of admission. Among patients with severe disease, median age was 7 years older and prevalence of comorbidities was two-fold higher, relative to non-severe cases. The cumulative risk of the composite endpoint [intensive care unit (ICU) admission, mechanical ventilation, or death] was 3.6% overall and 20.6% among patients with severe disease.8
It is now recognized that SARS-CoV-2 triggers a cascade of immune dysregulation in a subset of patients, leading to an inflammatory cytokine storm that is closely related to the development of acute respiratory distress syndrome (ARDS) and multiple-organ failure.9 Most patients who develop critical illness have only mild symptoms in the early stages of disease; however, their condition rapidly deteriorates in the later stages or during the process of recovery.9
In recent months, several countries have experienced a rising number of cases of a pediatric inflammatory multi-system syndrome, which typically appears several weeks after SARS-CoV-2 infection. Preliminary reports indicate that the average age of affected children ranges from 7.5 years in Bergamo, Italy, to 13.5 years in London, United Kingdom.10,11 In Bergamo province, 10 children were diagnosed with the syndrome after the SARS-CoV-2 epidemic began, representing a monthly incidence which is 30-fold greater than that of the previous 5 years.10 Similar outbreaks are to be expected in other countries heavily affected by the pandemic.
The rapid evolution of knowledge around the clinical manifestations, pathophysiology, and underlying factors that predispose some individuals to develop severe disease illustrates the importance of expeditiously disseminating research findings. The corpus of scientific literature related to COVID-19 is vast and continues to grow exponentially. There exists a compelling need to develop tools which enable more efficient categorization, assessment, and assimilation of findings from research studies across the globe. Our objective was to apply NLP techniques to explore risk factors and underlying comorbidities that may increase susceptibility to SARS-CoV-2 infection and/or predispose some individuals to develop severe disease, using the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a publicly available dataset that includes scientific literature regarding SARS-CoV-2 and related coronaviruses.
In our modeling effort, clinically trained annotators characterize a selection of abstracts from the literature. The annotators identify spans of text that have clinical relevance and classify each span with a label describing the type of relevance. We initially endeavored to capture a nuanced view of clinical statements by distinguishing a condition that is a risk factor or underlying comorbidity from one that is a manifestation of the disease under investigation. After exploring the data, it became clear that the context around conditions sometimes lacked the level of detail necessary to make this distinction. We adapted the annotation scheme accordingly, settling on the span-level concepts POPULATION, GEOGRAPHY, CLINICAL_FINDINGS, and CONDITION. We also label each abstract as a whole to denote whether it is a clinical study and to add other identifiers, such as age.
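As a concrete illustration of what span-level annotation produces, here is a minimal sketch of one way such annotations might be represented as labeled character offsets into an abstract. The `Span` class and `make_span` helper are illustrative, not our internal schema.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """A labeled span of text within an abstract (illustrative schema)."""
    label: str  # e.g. "POPULATION", "GEOGRAPHY", "CONDITION"
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive

text = ("We enrolled critically ill patients with SARS-CoV-2 pneumonia "
        "who were admitted to the ICU in Guangdong Province.")

def make_span(label, substring):
    """Locate a substring in the abstract and record it as a labeled span."""
    start = text.index(substring)
    return Span(label, start, start + len(substring))

annotations = [
    make_span("POPULATION", "critically ill patients"),
    make_span("CONDITION", "SARS-CoV-2 pneumonia"),
    make_span("GEOGRAPHY", "Guangdong Province"),
]

for span in annotations:
    print(span.label, "->", text[span.start:span.end])
```

Storing offsets rather than raw strings preserves the span's position in context, which matters when the same phrase appears more than once in an abstract.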
We use the annotations to build NLP models that predict those features in the literature and compare the model output against a held-out set of annotated abstracts. We use this test set to evaluate the model and annotation scheme, and to iterate based on formal metrics and manual inspection of discrepancies. At this stage, we generally care equally about precision (reducing extraneous matches) and recall (extracting relevant text), so we use the F1 score (their harmonic mean) as our primary evaluation metric. A score of 1 indicates perfect precision and recall, and models with higher scores are more likely to address clinical questions. We also note any significant discrepancies between precision and recall; instances where a low number of examples indicates we need additional annotations; and places where span boundaries affect performance. Finally, we evaluate the model against a second, unseen test set, preferably drawn from more recent abstracts, to assess its performance in the setting of rapidly emerging literature.
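For readers less familiar with the metric, F1 can be computed directly from span-level counts of true positives, false positives, and false negatives; the example counts below are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1: the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)  # fraction of extracted spans that are correct
    recall = true_positives / (true_positives + false_negatives)     # fraction of true spans that were extracted
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 80 correct spans, 20 spurious, 10 missed
print(round(f1_score(80, 20, 10), 3))  # 0.842
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot achieve a high F1 by trading recall away for precision, or vice versa.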
Challenge 1: How can we efficiently identify the most relevant publications for our project?
The CORD-19 dataset contains >140,000 articles covering a wide range of topic areas, and we need an efficient way to identify those most relevant to our project. We first developed a topic model to organize literature in the dataset, as described in an earlier blog post (more recent data available here). By filtering the dataset, the topic model streamlined identification of relevant abstracts and articles, thereby allowing us to focus annotation efforts on those most relevant to our project. We further narrowed the dataset by excluding pre-prints, as these have not yet been peer-reviewed and thus lacked a consistent level of quality for model training.
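Our topic model itself is described in the earlier post linked above; as a sketch of the general filtering approach, the snippet below fits a small Latent Dirichlet Allocation model with scikit-learn and keeps documents with a dominant topic. The toy abstracts, topic count, and threshold are all placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder abstracts; in practice these would come from CORD-19 metadata.
abstracts = [
    "Clinical characteristics of hospitalized patients with COVID-19 pneumonia.",
    "Genome sequencing and phylogenetic analysis of SARS-CoV-2 isolates.",
    "Risk factors for severe disease among adults with confirmed infection.",
    "Structural analysis of the coronavirus spike protein receptor binding domain.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)

# Fit a small LDA model; n_components=2 is a placeholder choice.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is a per-document topic mixture summing to 1

# Keep abstracts whose dominant topic weight exceeds a threshold.
relevant = [a for a, weights in zip(abstracts, doc_topics) if weights.max() > 0.5]
```

In practice, the topics judged relevant would be chosen by inspecting each topic's top-weighted terms, and filtering would be applied to the full corpus rather than a handful of examples.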
Challenge 2: What is the best way to address overlapping concepts in the annotation process?
Due to the nature of clinical inquiries, text relevant to two or more concepts sometimes overlaps. A medical condition may be part of a clinical findings statement, or a geographical location may be part of the population description; for example: "We enrolled critically ill patients with SARS-CoV-2 pneumonia who were admitted to the ICU in Guangdong Province." After a deep dive into the training data, it became evident that overlapping concepts were the norm rather than the exception. In addition, nearly half of the training dataset was labeled as NON_CLINICAL_STUDY. To ensure all relevant information was included, concepts were labeled with overlapping annotations. Concepts with a high likelihood of overlapping were modeled individually. Further, we found that many NON_CLINICAL_STUDY abstracts mentioned clinical conditions. To address this issue, NON_CLINICAL_STUDY abstracts were labeled at the span level for CONDITION. After making these adjustments, F1 scores improved from 45% to 89% for CONDITION and from 60% to 93% for GEOGRAPHY.
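One simple way to allow concepts to overlap (a sketch of the general technique, not our production pipeline) is to tag each concept with its own independent BIO sequence, so a single token can belong to more than one concept at once:

```python
# Per-concept BIO tagging: each concept gets its own tag sequence over the
# same tokens, so "Guangdong Province" can be GEOGRAPHY while also sitting
# inside a larger POPULATION span (illustrative example).
tokens = ["patients", "admitted", "to", "the", "ICU", "in", "Guangdong", "Province"]

tags = {
    "POPULATION": ["B", "I", "I", "I", "I", "I", "I", "I"],  # whole phrase
    "GEOGRAPHY":  ["O", "O", "O", "O", "O", "O", "B", "I"],  # nested inside it
}

def spans_for(concept):
    """Recover (start, end) token spans from one concept's BIO sequence."""
    out, start = [], None
    for i, tag in enumerate(tags[concept]):
        if tag == "B":
            if start is not None:
                out.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(tags[concept])))
    return out

print(spans_for("POPULATION"))  # [(0, 8)]
print(spans_for("GEOGRAPHY"))   # [(6, 8)]
```

Training one tagger per concept in this way sidesteps the single-label-per-token constraint of a conventional joint tagger.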
Challenge 3: How can we address linguistic challenges to ensure consistency in annotation?
Variability in sentence structure and language posed challenges for creating span boundary rules that could be generalized to achieve consistency across annotators. Guidelines instructing which content should be included in a span initially seemed straightforward: "Include number of subjects, demographic information, and hospitalization status in the population span; annotate disease separately as the condition investigated." This rule could apply to the phrase: "We enrolled 125 adults hospitalized at Wuhan hospital [POPULATION] in Wuhan, China [GEOGRAPHY] with suspected or confirmed SARS-CoV-2 infection [CONDITION]." However, the same details are presented in a variety of ways across the literature. As a result, annotation rules may not provide clear guidance in some cases; for example: "We included all adult inpatients (>18 years) with laboratory confirmed COVID-19 from Jinyintan Hospital and Wuhan Pulmonary Hospital (Wuhan, China) who had been discharged or had died by Jan 31, 2020."
The POPULATION concept was particularly challenging to define, given the variable ways a group of interest was discussed across the abstracts. Sentences including demographic information were the logical place to search for population details, but these were sparse in abstracts. This may be a limitation of annotating abstracts, as such details are often provided in the methods section of full-text articles. We therefore broadened the scope of the POPULATION concept to include any defining characteristics of the group under investigation. The first model performed poorly, with an F1 score of 61%. While annotating the training data, it became clear that abstracts often described two distinct groups: one mentioned in the context of the study background, and one that was the population under investigation. We revised the annotation scheme to target only POPULATION spans describing the current study participants; after model re-training, the F1 score increased from 61% to 71%. Further challenges arise when authors reference a prior study cohort in the background section: it was sometimes difficult to determine whether this information was included as background to the study, as opposed to a meta-analysis in which multiple cohorts are described. A similar challenge related to the presentation of demographic information in association with clinical outcomes; for example: "We found older patients (>65 years old) with comorbidities and ARDS were at increased risk of death."
Challenge 4: How can we improve understanding of conditions with insufficient context?
In some abstracts, multiple condition concepts were intertwined within the same sentence, with no clear way to separate them that was both accurate and consistent. This synthesized phrasing may be another limitation of annotating abstracts rather than full-text articles. In the statement, “patients with severe COVID-19 seem to have higher rates of liver dysfunction,” it is unclear whether liver dysfunction is a comorbidity or a manifestation of COVID-19. We also considered the time investment required to achieve acceptable inter-annotator agreement on such fine-grained distinctions. For these reasons, condition concepts were consolidated into the broader category of CLINICAL_FINDINGS. This allowed these details to be captured efficiently, while leaving open the possibility of future work to parse sentences into more nuanced concepts.
This project yielded several key lessons regarding the use of NLP for COVID-19 modeling efforts. Importantly, annotation and modeling should happen in tandem, given the need to strike a balance between the objectives and what the data will support. Narrowing the scope of our research question facilitated improvements in annotation quality and reduced the time required to conduct analyses. A collaborative effort is required to establish guidelines on how concepts should be defined and identified across the dataset. We found that a combination of clinical and technical expertise was essential to develop a successful annotation scheme and ultimately achieve consistency across annotators. Leveraging tools that allow collaboration as part of the annotation process facilitated this merging of expertise, allowing for effective data exploration, model building, and analysis for rapid iteration. Stay tuned for an upcoming blog post describing the results of our COVID-19 modeling work.
- Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, et al. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med. 2020;382:727–33.
- Holshue ML, DeBolt C, Lindquist S, Lofy KH, Wiesman J, Bruce H, et al. First case of 2019 novel coronavirus in the United States. N Engl J Med. 2020;382:929–36.
- Johns Hopkins University Center for Systems Science and Engineering (CSSE). COVID-19 Dashboard [Internet]. Baltimore; 2020 [cited 2020 May 28]. Available from: https://coronavirus.jhu.edu/
- Mahase E. Covid-19: WHO declares pandemic because of “alarming levels” of spread, severity, and inaction. BMJ. 2020;368:m1036.
- World Health Organization. WHO Director-General’s opening remarks at the media briefing on COVID-19 - 11 March 2020 [Internet]. Geneva: WHO; 2020. Available from: https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020
- Wu D, Wu T, Liu Q, Yang Z. The SARS-CoV-2 outbreak: What we know. Int J Infect Dis. 2020;94:44–8.
- Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N Engl J Med. 2020;382:1199–207.
- Guan W, Ni Z, Hu Y, Liang W, Ou C, He J, et al. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med. 2020;382:1708–20.
- Ye Q, Wang B, Mao J. The pathogenesis and treatment of the ‘Cytokine Storm’ in COVID-19. J Infect. 2020;80:607–13.
- Verdoni L, Mazza A, Gervasoni A, Martelli L, Ruggeri M, Ciuffreda M, et al. An outbreak of severe Kawasaki-like disease at the Italian epicentre of the SARS-CoV-2 epidemic: an observational cohort study. Lancet. 2020;May 13. [Epub ahead of print]. doi: 10.1016/S0140-6736(20).
- Mahase E. Covid-19: Cases of inflammatory syndrome in children surge after urgent alert. BMJ. 2020;369:m1990.