Organizing the emerging COVID-19 literature
By Mark Graves
TL;DR We have open-sourced a tool that uses topic modeling and interactive visualizations to help people identify important themes in the COVID-19 Open Research Dataset.
Overview
The COVID-19 Open Research Dataset (CORD-19) was first released on March 16, 2020, and was most recently updated on April 3, 2020. We are releasing an NLP-powered tool to categorize the abstracts from the March 27 version of this dataset in the hopes of helping others explore the emerging literature. The tool organizes these scientific articles into 60 topics, based on their abstracts, to give a broad overview of the pandemic's rapidly expanding fields of study and to provide entry points into specific areas. Using the model, we have also derived a list of articles for each topic with links to the full articles (via PubMed and DOI), and we are releasing a visualization tool for the topics based on pyLDAvis.
Relevant terms for each topic
This webpage shows the 50 most relevant terms for each of the 60 topics. Each column header is a link to a listing of articles predominantly about that topic.
For example, if one is interested in the comorbidity of diabetes, one can search for "diabetes" using the browser search function, and it will show up twice in the column for Topic 51 (which is the 51st most prevalent topic). Selecting the column header (active link) opens a new web page (this one) with the list of documents predominantly about this topic, where each row entry includes the article title, authors, and links to the full text (if available). One can then skim the article titles or search again for the term in article titles.
The second entry point into our models is a topic modeling visualization tool, which shows a two-dimensional projection of the topic model's term vectors, where related topics are visually close to each other. More detailed instructions are in the Appendix at the end of the blog post.
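To give a sense of what "visually close" means in such a map: pyLDAvis-style layouts typically compute pairwise distances between topic-term distributions (commonly Jensen-Shannon divergence) and then project those distances to two dimensions. The following pure-stdlib sketch computes that divergence for two hypothetical topics; the distributions and the exact distance metric in our tool are illustrative assumptions.

```python
# Sketch: intertopic distance as Jensen-Shannon divergence between
# topic-term distributions, the usual basis for a pyLDAvis-style 2-D map.
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two hypothetical topic-term distributions over a 4-term vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]   # e.g. a clinical topic
topic_b = [0.05, 0.05, 0.20, 0.70]   # e.g. a molecular topic

d = js_divergence(topic_a, topic_b)
```

Topics with similar term distributions get small divergences and therefore land near each other after projection, which is why related topics cluster together on the map.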
For those analyzing the CORD-19 and related datasets, the Python notebooks used to generate the topic model are also available on our public GitHub repository.
We hope that these preliminary results may help others investigating COVID-19 from multiple perspectives. At Roam, we are using the organized articles to narrow the literature so we can build more specific models trained on supervised data. We plan to continue working to find additional ways to contribute to the ongoing global efforts, making additional results available when we believe they might be helpful.
Appendix: Instructions for CORD-19 abstract visualization tool
The horizontal axis of the graph (PC1) loosely organizes the current COVID-19 outbreak topics to the right, and prior related viral outbreaks to the left. The vertical axis loosely organizes the medical and clinical topics toward the upper part of the graph, and the molecular and cellular topics in the lower half of the graph.
For example, Topic 51 is immediately below the origin, as it captures articles about angiotensin-converting enzyme 2 (ACE2), which is overexpressed in diabetes and hypertension and is a receptor for the entry of SARS-CoV-2 and other coronaviruses into host cells.
In addition, there are several topics specific to foreign language articles, scientific publications, and some artifacts remaining from the text extraction process, which can be ignored and which a more curated topic modeling process could remove. Double-clicking on a topic (circle) also opens the list of articles for which it is most predominant.
By default, the right panel of the visualization tool shows the most salient terms in all the abstracts, and selecting one of the topics on the left causes the right panel to show the most relevant terms in that topic (along with their frequency in that topic and in all abstracts).
In the upper right-hand corner, a slider adjusts the relative weight placed on (i) the probability of the term within the selected topic (shown at the initial setting of λ=1) and (ii) the exclusivity of that term to the topic (shown by moving the slider left to λ=0). A setting of λ between 0.4 and 0.6 generally makes a topic's conceptual meaning easiest to interpret, and the web page of relevant terms per topic described in the body of this post was generated with λ=0.5.
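The λ slider implements pyLDAvis's notion of term "relevance": relevance(w, k; λ) = λ·log p(w|k) + (1−λ)·log(p(w|k)/p(w)), so λ=1 ranks terms by their in-topic probability alone and λ=0 ranks them by lift (exclusivity). The sketch below shows how the ranking flips as λ moves; the two terms and their probabilities are hypothetical.

```python
# Relevance of term w to topic k at a given lambda, per pyLDAvis:
#   relevance(w, k; lam) = lam * log p(w|k) + (1 - lam) * log(p(w|k) / p(w))
import math

def relevance(p_w_given_k, p_w, lam):
    return lam * math.log(p_w_given_k) + (1 - lam) * math.log(p_w_given_k / p_w)

# Hypothetical terms: "patient" is frequent everywhere; "ace2" is rarer
# overall but concentrated in this topic.
terms = {
    "patient": (0.05, 0.04),   # (p(w | topic), p(w) over all abstracts)
    "ace2":    (0.02, 0.002),
}

def ranking(lam):
    """Terms sorted from most to least relevant at this lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)
```

At λ=1 the common term "patient" ranks first; at λ=0 (and at the intermediate λ=0.5 used for our term listings) the topic-exclusive "ace2" takes over, which is why intermediate λ values tend to surface the terms that best characterize a topic.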