Drug labels (prescribing information or package inserts) describe what a particular medicine is supposed to do, who should or should not take it, how to use it, and specific safety concerns. The US Food and Drug Administration (FDA) publishes regulations governing the content and format of this information to provide recommendations for applicants developing labeling for new drugs and revising labeling for already approved drugs. One of the major aspects of drug information is safety concerns in the form of Adverse Drug Reactions (ADRs). This evaluation focuses on the extraction of ADRs from prescription drug labels. FDA guidelines for applicants define ADRs as follows:
Adverse Event: refers to any untoward medical event associated with the use of a drug in humans, whether or not considered drug-related.
Adverse Reaction: an undesirable effect reasonably associated with the use of a drug, that may occur as part of the pharmacological action of the drug or may be unpredictable in its occurrence. This definition does not include all adverse events observed during use of a drug, only those for which there is some basis to believe there is a causal relationship between the drug and the occurrence of the adverse event. Adverse reactions may include signs and symptoms, changes in laboratory parameters, and changes in other measures of critical body function, such as vital signs and ECG.
Serious Adverse Reaction: refers to any reaction occurring at any dose that results in any of the following outcomes: death, a life-threatening adverse experience, inpatient hospitalization or prolongation of existing hospitalization, a persistent or significant disability or incapacity, or a congenital anomaly or birth defect.
The FDA is highly interested in automatic extraction of ADRs from drug labels for many purposes.
Two possible applications enabled by this task are
(1) comparing the ADRs present in labels from different manufacturers for the same drug, and
(2) performing post-marketing safety analysis (pharmacovigilance) by identifying new ADRs not currently present in the labels.
Specifically for the purposes of post-marketing safety analysis, the FDA relies on spontaneous adverse event reports submitted to the FDA Adverse Event Reporting System (FAERS). The current approach to FAERS case report review requires manually reading the text of a drug label to determine whether a given event is already noted (i.e., is a "labeled" event). To detect novel ADRs more efficiently and operationalize pharmacovigilance, the extraction of labeled events needs to be automated.
The results of this track will inform future FDA efforts at automating important safety processes, and could potentially lead to future FDA collaboration with interested researchers in this area.
The purpose of this TAC track is to test various natural language processing (NLP) approaches for their information extraction (IE) performance on adverse reactions. A large set of labels will be provided to participants, of which 100 will be annotated with adverse reactions. Additionally, the 100 training labels will be accompanied by the MedDRA Preferred Terms (PT) and Lower Level Terms (LLT) of the ADRs in the drug labels. This corresponds to the primary goal of the task: to identify the known ADRs in a SPL in the form of MedDRA concepts. Participants will be evaluated by their performance on a held-out set of 100 labeled SPLs.
The participants will be provided with over one thousand drug labels (structured product labels, or SPLs) as text documents in an XML format. Of these, 101 drug labels will form the official training set and contain gold standard annotations created by NLM and FDA.
The gold standard contains the following entity-style annotations:
AdverseReaction: Reported ADRs follow the guidelines for industry provided above and may include signs and symptoms, worsening medical conditions, changes in laboratory parameters, and changes in other measures of critical body function, such as vital signs and ECG. These ADRs can be associated with use of the drug or any of its components.
Severity: Measurement of the severity of a specific AdverseReaction. This can be qualitative terms (e.g., "major", "critical", "serious", "life-threatening") or quantitative grades (e.g., "grade 1", "Grade 3-4", "3 times upper limit of normal (ULN)", "240 mg/dL").
DrugClass: The class of drug that the specific drug for the label is part of. This is designed to capture drug class effects (e.g., "[beta blockers]DrugClass may result in...") that are not necessarily specific to the particular drug.
Negation: Trigger word for event negation.
Animal: Animal species in which an AdverseReaction was observed.
Factor: Any additional aspect of an AdverseReaction that is not covered by one of the other entities listed here. Notably, this includes hedging terms (e.g., may, risk, potential), references to the placebo arm of a clinical trial, or specific sub-populations (e.g., pregnancy, fetus).
Note: Other than AdverseReactions, entities are only annotated when related to an AdverseReaction by one of the following relations.
The following relations connect an AdverseReaction with one of the above entities. Each relation is limited to a specific subset of entity types.
Negated: A Negation or Factor that negates an AdverseReaction for the drug.
Hypothetical: An Animal, DrugClass, or Factor that speculates about, or qualifies the definitiveness of the drug's relationship with an AdverseReaction.
Effect: A Severity of an AdverseReaction for the drug.
Equiv: Alternative name or acronym for an event (between two AdverseReactions). For the most part, this is only annotated when the equivalent name is used in the same sentence as the AdverseReaction. The exception is if the equivalent name is mentioned in one sentence and then defined in the next.
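The entity and relation scheme above can be sketched as a simple in-memory data model. This is an illustration only: the class and field names are invented for clarity and do not reflect the official annotation XML format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    id: str
    kind: str    # AdverseReaction, Severity, DrugClass, Negation, Animal, Factor
    start: int   # character offset into the label text
    length: int
    text: str

@dataclass(frozen=True)
class Relation:
    kind: str     # Negated, Hypothetical, Effect, Equiv
    adr_id: str   # id of the AdverseReaction
    other_id: str # id of the related entity

# Example sentence with a drug-class effect, annotated per the scheme above.
sentence = "beta blockers may result in bradycardia"

def make(eid, kind, text):
    # Helper (hypothetical): locate the mention and record its span.
    start = sentence.find(text)
    return Entity(eid, kind, start, len(text), text)

e1 = make("E1", "DrugClass", "beta blockers")
e2 = make("E2", "Factor", "may")            # hedging term
e3 = make("E3", "AdverseReaction", "bradycardia")

# Both the DrugClass and the hedging Factor qualify the ADR as Hypothetical.
rels = [Relation("Hypothetical", "E3", "E1"),
        Relation("Hypothetical", "E3", "E2")]
```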
The ultimate aim is to know which ADRs are in the labels, not the precise offsets or relations, such that the ADRs may be linked to a structured knowledge source (MedDRA). Further, an ADR mentioned several times should not necessarily carry more weight than an ADR mentioned once. As such, the gold standard contains a list of unique ADRs aggregated at the document level (by string). These strings are then annotated with MedDRA Lower Level Terms (LLT) and the corresponding Preferred Term (PT).
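The document-level aggregation described here amounts to deduplicating mention strings caselessly. A minimal sketch (the gold standard already provides the aggregated list; this only illustrates the idea):

```python
def unique_adrs(mentions):
    """Collapse ADR mentions to unique caseless strings, keeping the
    first-seen spelling, so repeated mentions carry no extra weight."""
    seen = {}
    for m in mentions:
        seen.setdefault(m.lower(), m)
    return list(seen.values())

mentions = ["Nausea", "headache", "NAUSEA", "Headache", "dizziness"]
print(unique_adrs(mentions))  # ['Nausea', 'headache', 'dizziness']
```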
The participants may choose any one specific task described below or approach the tasks as building upon one another. Some tasks necessarily require the output of previous tasks (e.g., Task 2 requires Task 1), while others can be performed independently (e.g., Task 3).
Task 1: Extract AdverseReactions and related entities (Severity, Factor, DrugClass, Negation, Animal). This is similar to many NLP Named Entity Recognition (NER) evaluations.
Task 2: Identify the relations between AdverseReactions and related entities (i.e., Negated, Hypothetical, Effect, and Equiv). This is similar to many NLP relation identification evaluations.
Task 3: Identify the positive AdverseReaction entity names in the labels. For the purposes of this task, positive will be defined as the caseless strings of all the AdverseReactions that have not been negated and are not related by a Hypothetical relation to a DrugClass or Animal. Note that this means Factors related via a Hypothetical relation are considered positive (e.g., "[unknown risk]Factor of [stroke]AdverseReaction") for the purposes of this task. The result of this task will be a list of unique strings corresponding to the positive ADRs as they were written in the label.
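The Task 3 decision rule can be sketched as follows, assuming a simplified entity/relation representation (the dictionary and tuple shapes here are illustrative, not the official annotation format):

```python
def positive_adrs(entities, relations):
    """entities: id -> (type, text); relations: list of (rel_type, adr_id, other_id).

    Returns the caseless strings of AdverseReactions that have not been
    negated and are not Hypothetical-related to a DrugClass or Animal.
    """
    excluded = set()
    for rel_type, adr_id, other_id in relations:
        other_type = entities[other_id][0]
        if rel_type == "Negated":
            excluded.add(adr_id)
        elif rel_type == "Hypothetical" and other_type in ("DrugClass", "Animal"):
            excluded.add(adr_id)
        # A Hypothetical relation to a Factor does NOT exclude the ADR.
    return {
        text.lower()
        for eid, (etype, text) in entities.items()
        if etype == "AdverseReaction" and eid not in excluded
    }

entities = {
    "E1": ("AdverseReaction", "stroke"),
    "E2": ("Factor", "unknown risk"),
    "E3": ("AdverseReaction", "bradycardia"),
    "E4": ("DrugClass", "beta blockers"),
    "E5": ("AdverseReaction", "rash"),
    "E6": ("Negation", "no"),
}
relations = [
    ("Hypothetical", "E1", "E2"),  # Factor: stroke stays positive
    ("Hypothetical", "E3", "E4"),  # DrugClass: bradycardia excluded
    ("Negated", "E5", "E6"),       # Negation: rash excluded
]
print(positive_adrs(entities, relations))  # {'stroke'}
```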
Task 4: Provide MedDRA LLTs and PTs for the positive ADRs. For participants approaching the tasks sequentially, this can be viewed as normalization of the terms extracted in Task 3 to MedDRA LLTs/PTs. Information on obtaining MedDRA v18.1 will be provided to the participants.
Other resources such as the UMLS Terminology Services may be used to aid with the normalization process.
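Assuming MedDRA is available as a simple lookup table, a minimal baseline normalizer might reduce to exact caseless string matching. The LLT/PT codes below are invented placeholders, not real MedDRA identifiers, and a real system would likely add fuzzy or UMLS-assisted matching:

```python
# Toy LLT table: caseless LLT string -> (LLT code, PT code, PT name).
# Codes are placeholders; load the licensed MedDRA v18.1 tables in practice.
LLT_TABLE = {
    "heart attack":          ("LLT-001", "PT-001", "Myocardial infarction"),
    "myocardial infarction": ("LLT-002", "PT-001", "Myocardial infarction"),
    "nausea":                ("LLT-003", "PT-002", "Nausea"),
}

def normalize(adr_string):
    """Map a positive ADR string to (LLT code, PT code, PT name), or None
    when no exact caseless match exists."""
    return LLT_TABLE.get(adr_string.lower())

print(normalize("Heart Attack"))  # ('LLT-001', 'PT-001', 'Myocardial infarction')
```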
Participants will be asked to submit system results on all labels. An annotated test set of 100 labels will be used to evaluate performance, but participants will not be told which labels from the much larger set form the test set. Therefore, there is no need to release the test data immediately before the submission deadline: participants can submit their results at any time. The evaluation measures are:
Precision/Recall/F1-measure on entity-level annotations, using both partial and exact matching.
Primary Metric: micro-averaged F1 on exact matches.
Precision/Recall/F1-measure on relations.
Primary Metric: micro-averaged F1.
Precision/Recall/F1-measure on unique positive AdverseReaction strings.
Primary Metric: macro-averaged F1 (by label).
Precision/Recall/F1-measure on unique MedDRA LLTs and PTs.
Primary Metric: macro-averaged F1 of PTs (by label).
An evaluation script implementing these metrics will be provided by the organizers.
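The two averaging schemes used for the primary metrics can be illustrated with invented counts (this is not the organizers' script, only a sketch of the definitions):

```python
def f1(tp, fp, fn):
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Invented (tp, fp, fn) counts for two drug labels.
per_label = [(8, 2, 0), (2, 0, 8)]

# Micro-averaging pools the counts over all labels before computing F1,
# so labels with many ADRs dominate.
micro = f1(*[sum(counts) for counts in zip(*per_label)])  # tp=10, fp=2, fn=8

# Macro-averaging computes F1 per label, then averages, weighting each
# label equally regardless of how many ADRs it contains.
macro = sum(f1(*counts) for counts in per_label) / len(per_label)

print(round(micro, 3), round(macro, 3))  # 0.667 0.611
```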
Participants are allowed three separate submissions. Submissions that do not conform to the provided XML standards will be rejected without consideration.
Kirk Roberts (firstname.lastname@example.org)
Dina Demner-Fushman (email@example.com)
Joseph Tonning (firstname.lastname@example.org)