On the viability of crowdsourcing NLP annotations in healthcare
__author__ = 'Bruno Godefroy'
Overview

Obtaining high-quality annotations is a bottleneck for all natural language processing applications. The challenges are especially severe in healthcare, where we rely on annotators who have expertise in the practice of medicine and in understanding medical texts, and who are authorized to access sensitive data. Such annotators are in high demand and thus command high prices. The question then arises whether we can ease the burden on these annotators by crowdsourcing the less demanding annotation tasks involving publicly available data, reserving the experts for where they are truly needed. This post reports on a crowdsourcing experiment we ran to explore this issue. We defined a reasonably nuanced span-identification task and launched it on Figure Eight. As expected, the output was noisy, reflecting the highly variable pool of annotators we tapped into. To infer quality labels from this noisy process, we used a straightforward application of Expectation-Maximization, with quite good results, suggesting that crowdsourcing is an effective tool for obtaining annotations for at least some NLP problems in healthcare.
Crowdsourcing task definition

The publicly available FDA Drug Labels data set is a rich source of information about how drugs relate to disease states (among other things). Since it's a public dataset, we don't have to address privacy concerns, which are of course another limiting factor when it comes to annotating data in healthcare. For our pilot task, we decided to focus on developing annotations to facilitate automatic extraction of the core drug–disease relationships expressed in these labels, as exemplified in figure 1. This is a problem we've worked on before, so we have a good sense of what information should be extracted.
Figure 1: Annotated sentences from a drug label.
Figure 2: The interface for our first task. Crowdworkers are asked to select disease, symptom, and injury mentions in a sentence. Some drug and disease mentions are automatically underlined, using fixed lexicons, to help the workers understand the texts.
Who completed our task on Figure Eight?

We launched our task on Figure Eight with 10,000 sentences. It was completed within a few days. The job was done by 451 people from 45 countries, the majority from India and Venezuela. No special qualifications were imposed. Most workers made just a few contributions in a short period of time, as figures 3 and 4 show. Half of the sessions (continuous periods of work) lasted less than 20 minutes; at the upper end, 7% lasted more than an hour and a half. This is expected given recent studies of crowdworkers' behaviors [3, 4].
Figure 3: Number of work sessions per contributor.
Figure 4: Work session durations.
Assessment against gold labels

Given our large and diverse pool of workers, we expect some of them to be unreliable, perhaps due to a lack of expertise or a lack of attention. With Figure Eight, we can supply our own labeled examples for a subset of cases to help identify and filter out unreliable workers. Figure 5 summarizes the work of 100 annotators who were rejected from our task based on their performance on this gold data.
Figure 5: Main reasons for failing our gold-label assessment, based on a manual review of the output of 100 workers who did not pass the test.
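Figure Eight handles this quality control on the platform itself, but the underlying idea is simple: track each worker's accuracy on the gold (test) questions and drop workers who fall below a threshold. The sketch below is only an illustration of that idea, not the platform's actual mechanism; the function name, the 70% accuracy cut-off, and the minimum of five gold questions are hypothetical choices.

```python
from collections import defaultdict

def reliable_workers(gold_judgments, min_accuracy=0.7, min_gold=5):
    """gold_judgments: iterable of (worker_id, answered_correctly) pairs
    collected on the gold (test) questions. Returns the set of workers
    who answered enough gold questions and met the accuracy threshold."""
    correct, total = defaultdict(int), defaultdict(int)
    for worker, ok in gold_judgments:
        total[worker] += 1
        correct[worker] += int(ok)
    return {w for w in total
            if total[w] >= min_gold and correct[w] / total[w] >= min_accuracy}
```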
From noisy judgments to crowd truth

Should we blindly trust all the judgments of the workers who passed our gold-label assessment? Probably not. Some errors are inevitable even for careful workers, and some malicious workers are likely to slip past the assessment. Furthermore, there are bound to be cases that are ambiguous or open to interpretation, leading to multiple right answers that we ourselves might not have fully appreciated, as in figure 6.
Figure 6: An ambiguous case. Should "feelings of sadness related to winter" be selected as a disease? Or a symptom?
Figure 7: The Expectation-Maximization algorithm for inferring labels from crowd work.
Table 1: Dummy collected judgments from each worker, and the true response (which we don't have), for each question. Columns: Worker 1, Worker 2, Worker 3, Worker 4, True response.
Table 2: Estimated maximum likelihood responses. Columns: Maximum likelihood estimate, Derived crowd response, True response.
Table 3: Workers' performance estimates. Columns: Estimated precision, True precision, Estimated recall, True recall.
The wisdom of our crowd

There is evidence that, in many settings, a crowd of non-experts can collectively offer estimates that match or exceed those of individual experts [11, 12, 13, 14]. Is this true of our crowd of Figure Eight workers? To address this question, we applied the EM algorithm described above. Since many of our disease spans are multi-word expressions, we work at the token level: for each token selected by at least one worker, we estimate the probability that it is part of a disease mention. Ideally, most tokens end up with a probability close to 0 or 1, that is, a confident decision about whether they belong to an entity of interest. Figure 8 shows that this is indeed the case.
Figure 8: The distribution of probabilities estimated with EM, after 1, 2 and 10 iterations (convergence).
Figure 9: Precision, recall and F1 scores at various confidence thresholds for each aggregation method.
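For reference, such a sweep is straightforward to compute once we have token-level probabilities and gold reference labels for an evaluation set. The sketch below is illustrative only; the function name and the threshold grid are assumptions, and it could be applied to any aggregation method that produces per-token confidences.

```python
import numpy as np

def precision_recall_f1(probs, gold, thresholds=np.linspace(0.05, 0.95, 19)):
    """probs: per-token confidences from an aggregation method;
    gold: 0/1 reference labels for the same tokens.
    Returns a list of (threshold, precision, recall, f1) tuples."""
    probs, gold = np.asarray(probs, dtype=float), np.asarray(gold, dtype=int)
    scores = []
    for t in thresholds:
        pred = probs >= t
        tp = np.sum(pred & (gold == 1))
        p = tp / max(pred.sum(), 1)          # precision
        r = tp / max((gold == 1).sum(), 1)   # recall
        f1 = 2 * p * r / max(p + r, 1e-9)
        scores.append((t, p, r, f1))
    return scores
```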
Facing the crowd

EM estimates individual worker reliability and can therefore help us understand individual behaviors in the crowd. To that end, figure 10 shows the timeline of each work session (a continuous period of work for one contributor), grouped into four clusters; a rough sketch of how such a grouping can be computed follows the cluster descriptions below.
Figure 10: Precision and recall against time during work sessions. Clusters are represented with distinct colors.
- Light blue (1,050 timelines): The largest cluster, consisting mostly of short sessions.
- Yellow (6 timelines): Very low precision and very low recall. These workers unfortunately slipped through our gold-label assessment.
- Orange (15 timelines): High precision and very low recall, probably from workers who selected almost no text segments.
- Dark blue (31 timelines): Long sessions consisting of reliable annotations.
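The post does not spell out how the session timelines were grouped, so the following is a minimal sketch under an assumption: k-means over simple per-session summary features (duration plus the EM-derived precision and recall). The function name, the feature set, and the choice of four clusters are illustrative, not a description of the actual analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sessions(sessions, n_clusters=4, seed=0):
    """sessions: list of dicts with hypothetical keys 'duration_minutes',
    'precision', and 'recall' (the latter two estimated via EM).
    Returns a cluster label for each session."""
    X = np.array([[s['duration_minutes'], s['precision'], s['recall']]
                  for s in sessions])
    # Standardise features so duration does not dominate the distance metric.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
```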