Objective

In response to the COVID-19 pandemic, the Epidemic Question Answering (EPIC-QA) track challenges teams to develop systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2, related coronaviruses, and the recommended response to the pandemic. While COVID-19 has been an impetus for a large body of emergent scientific research and inquiry, the response to COVID-19 also raises questions for consumers. The rapid increase in coronavirus literature and the evolving guidelines on community response create a challenging burden not only for the scientific and medical communities but also for the general public in staying up-to-date on the latest developments. Consequently, the goal of the track is to evaluate systems on their ability to provide timely and accurate expert-level answers, as expected by the scientific and medical communities, as well as answers in consumer-friendly language for the general public.

Background

A pneumonia of unknown origin was detected in Wuhan, China, and was first reported to the World Health Organization (WHO) on December 31, 2019. By January 30, 2020, the outbreak had escalated to the point that it was declared a Public Health Emergency of International Concern. The WHO officially named the 2019 coronavirus disease "COVID-19" on February 11, 2020. By March 11, 2020, after more than 118,000 reported cases in 114 countries resulting in 4,291 reported fatalities, the WHO formally characterized COVID-19 as a pandemic. In the United States, in March 2020, various states and cities began issuing mandatory quarantine ordinances as well as guidance on "social distancing" and forced closures of gyms, bars, and nightclubs. As of April 7, 41 states as well as the District of Columbia had issued mandatory self-quarantine directives, forbidding non-essential activity outside the home. Over this period, there has been a rapid escalation in scientific research on COVID-19 and related coronaviruses as well as in government and community responses to prevent or contain the outbreak. For example, the scientific community has sequenced the SARS-CoV-2 genome, proposed multiple potential vaccines, and explored antibody, anti-viral, and cell-based treatments. This rapid escalation has placed a large burden on consumers, scientists, and healthcare professionals seeking to maintain up-to-date knowledge of COVID-19, the recommended response, and the adjustments required in their daily lives. Consequently, the EPIC-QA track at TAC 2020 aims to help reduce this burden by fostering research in the design of automatic question answering systems to support scientific and consumer inquiry into COVID-19 and the recommended response.

It is our hope that the track will stimulate research in automatic question answering, not only to support the provision of high-quality, timely information about COVID-19, but also so that the resulting collection can be used to develop generalizable approaches to meeting information needs across varying levels of expertise.

Tasks

The 2020 EPIC-QA track involves two tasks:

Task A

Expert QA: In Task A, teams are provided with a set of questions asked by experts and are asked to provide a ranked list of expert-level answers to each question. In Task A, answers should provide information that is useful to researchers, scientists, or clinicians.

Task B

Consumer QA: In Task B, teams are provided with a set of questions asked by consumers and are asked to provide a ranked list of consumer-friendly answers to each question. In Task B, answers should be understandable by the general public.

While each task will have its own set of questions, many of the questions will overlap. This is by design, so that the collection can be used to explore whether the same approaches or systems can account for different types of users.

Answering Questions

In this track, answers must be in the form of consecutive sentences extracted from a single context in a single document. Below, we illustrate four consumer-friendly and four expert-level example answers (marked ¹ through ⁴) extracted for the question "What is the origin of COVID-19?":

Consumer Passage

¹COVID-19 is caused by a new coronavirus. ²Coronaviruses are a large family of viruses that are common in people and many different species of animals, including camels, cattle, cats, and bats. Rarely, animal coronaviruses can infect people and then spread between people such as with MERS-CoV, SARS-CoV, and now with this new virus (named SARS-CoV-2).

The SARS-CoV-2 virus is a betacoronavirus, like MERS-CoV and SARS-CoV. ³All three of these viruses have their origins in bats. ⁴The sequences from U.S. patients are similar to the one that China initially posted, suggesting a likely single, recent emergence of this virus from an animal reservoir.

Expert Passage

¹It is improbable that SARS-CoV-2 emerged through laboratory manipulation of a related SARS-CoV-like coronavirus. As noted above, the RBD of SARS-CoV-2 is optimized for binding to human ACE2 with an efficient solution different from those previously predicted.[7,11] ²Furthermore, if genetic manipulation had been performed, one of the several reverse-genetic systems available for betacoronaviruses would probably have been used.[19] ³However, the genetic data irrefutably show that SARS-CoV-2 is not derived from any previously used virus backbone.[20] ⁴Instead, we propose two scenarios that can plausibly explain the origin of SARS-CoV-2: (i) natural selection in an animal host before zoonotic transfer; and (ii) natural selection in humans following zoonotic transfer.

Contexts and sentence IDs will be provided to the participants as part of the collection. In the CORD-19 collection, contexts will correspond to paragraphs defined by the authors of their publications. In the government collection, contexts will correspond to sections identified through the HTML of government websites. Contexts that are longer than 15 sentences will be segmented into approximately 15-sentence chunks. Contexts will be further segmented into sentences, each associated with a unique ID. The participants will be required to provide the starting and ending IDs of the sentences that constitute each of their answers. To maintain provenance, each answer must also be associated with the document and context IDs from which it originated.
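To make these requirements concrete, below is a minimal sketch of how a long context might be chunked into roughly 15-sentence segments and how a single answer span could be recorded. The ID formats and field names are illustrative assumptions only; participants should use the context and sentence IDs distributed with the collection.

from dataclasses import dataclass
from typing import List

@dataclass
class Answer:
    # An answer is a run of consecutive sentences from a single context.
    question_id: str
    document_id: str
    context_id: str
    start_sentence_id: str  # first sentence of the answer (inclusive)
    end_sentence_id: str    # last sentence of the answer (inclusive)

def chunk_context(sentences: List[str], max_len: int = 15) -> List[List[str]]:
    """Split a long context into chunks of at most max_len sentences,
    mirroring the track's approximately 15-sentence segmentation."""
    return [sentences[i:i + max_len] for i in range(0, len(sentences), max_len)]

# Hypothetical IDs for illustration only.
answer = Answer(
    question_id="Q001",
    document_id="doc_0001",
    context_id="doc_0001-C001",
    start_sentence_id="doc_0001-C001-S002",
    end_sentence_id="doc_0001-C001-S004",
)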

Note:

In this task, the goal is to explore the landscape of answers asserted by the document collection. A statement that answers the question will be considered a valid answer regardless of whether or not it is factually accurate. The answers in this task are intended as an intermediate step wherein one explores all answers provided by the document collection, both correct answers and incorrect answers people may have discovered on their own.

Document Collection

Answers should originate from the EPIC-QA collection, which includes scientific and government articles about COVID-19, SARS-CoV-2, related coronaviruses, and information about community response. This collection consists of two parts:

Research Articles (Primary Evaluation)

We adapt the collection of biomedical articles released for the COVID-19 Open Research Dataset Challenge (CORD-19). The primary evaluation uses a snapshot of CORD-19 from October 22, 2020. The dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine — National Institutes of Health, in coordination with The White House Office of Science and Technology Policy. The CORD-19 collection includes a subset of articles in PubMed Central (PMC) as well as pre-prints from bioRxiv. Contexts in this collection will correspond to automatically identified paragraphs in the articles' abstracts or main texts.

By downloading this dataset you are agreeing to the Open COVID Pledge compatible Dataset License for the CORD-19 dataset that details the terms and conditions under which partner data and content is being made available. Specific licensing information for individual articles in the dataset is available in the metadata file.

Additional licensing information is available on the PMC website, medRxiv website and bioRxiv website.

Collection (1.4 GB) MD5 Checksum

Consumer Articles (Primary Evaluation)

We include a subset of the articles used by the Consumer Health Information Question Answering (CHIQA) service of the U.S. National Library of Medicine (NLM). This collection includes authoritative articles from: the Centers for Disease Control and Prevention (CDC); the Genetic and Rare Disease Information Center (GARD); the Genetics Home Reference (GHR); Medline Plus; the National Institute of Allergy and Infectious Diseases (NIAID); and the World Health Organization (WHO). Contexts in this collection will correspond to paragraphs or sections as indicated by the HTML markup of the document.

We also include 265 reddit threads from /r/askscience tagged with COVID-19, Medicine, Biology, or the Human Body, and filtered for COVID-19 content.

Finally, we include a subset of the CommonCrawl News crawl from January 1st to April 30th, 2020, as used in the TREC Health Misinformation Track. Documents in this subset were filtered by domain using SALSA, PageRank, and HITS and were further filtered for COVID-19 content.

Works produced by the federal government are not copyrighted under U.S. law. You may reproduce, redistribute, and link freely to non-copyrighted content, including on social media. Documents from the WHO may be reviewed, reproduced or translated for research or private study but not for sale or for use in conjunction with commercial purposes. Additional copyright information can be obtained from their respective websites.

Collection (812 MB) MD5 Checksum

Research Articles (Preliminary Evaluation)

We adapt the collection of biomedical articles released for the COVID-19 Open Research Dataset Challenge (CORD-19). The preliminary evaluation uses a snapshot of CORD-19 from June 19, 2020. The dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine — National Institutes of Health, in coordination with The White House Office of Science and Technology Policy. The CORD-19 collection includes a subset of articles in PubMed Central (PMC) as well as pre-prints from bioRxiv. Contexts in this collection will correspond to automatically identified paragraphs in the articles' abstracts or main texts.

By downloading this dataset you are agreeing to the Open COVID Pledge compatible Dataset License for the CORD-19 dataset that details the terms and conditions under which partner data and content is being made available. Specific licensing information for individual articles in the dataset is available in the metadata file.

Additional licensing information is available on the PMC website, medRxiv website and bioRxiv website.

11/18 Update: The preliminary CORD-19 collection has been updated to account for changes in the CORD-19 preprocessing pipeline. If you downloaded the preliminary CORD-19 collection before 11/18, it had a large number of duplicated contexts, so please download the updated version (version 3).

Collection (972 MB) MD5 Checksum

Consumer Articles (Preliminary Evaluation)

We include a subset of the articles used by the Consumer Health Information Question Answering (CHIQA) service of the U.S. National Library of Medicine (NLM). This collection includes authoritative articles from: the Centers for Disease Control and Prevention (CDC); DailyMed; the Genetic and Rare Disease Information Center (GARD); the Genetics Home Reference (GHR); the Mayo Clinic; Medline Plus; the National Heart, Lung, and Blood Institute (NHLBI); the National Institute of Allergy and Infectious Diseases (NIAID); the World Health Organization (WHO); and the Office on Women's Health of the U.S. Department of Health & Human Services. Contexts in this collection will correspond to paragraphs or sections as indicated by the HTML markup of the document.

Works produced by the federal government are not copyrighted under U.S. law. You may reproduce, redistribute, and link freely to non-copyrighted content, including on social media. Documents from the WHO may be reviewed, reproduced or translated for research or private study but not for sale or for use in conjunction with commercial purposes. Additional copyright information can be obtained from their respective websites.

Collection (2.1 MB) MD5 Checksum

The documents in this collection adhere to a modified version of the CORD-19 JSON schema.
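As a rough orientation, the snippet below reads a document assuming a CORD-19-style JSON layout extended with context and sentence IDs. The field names used here (document_id, contexts, context_id, sentences, sentence_id, text) are assumptions, not the official specification, and should be checked against the schema shipped with the collection.

import json

# Sketch of reading one document, assuming a CORD-19-style layout extended
# with context and sentence IDs; field names are assumptions, not a spec.
with open("example_document.json") as f:
    doc = json.load(f)

print(doc.get("document_id"))
for context in doc.get("contexts", []):
    print(context.get("context_id"), context.get("section"))
    for sentence in context.get("sentences", []):
        print("  ", sentence.get("sentence_id"), sentence.get("text"))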

Judgments

The human-generated answers and sentence-level answer annotations for the 21 questions judged in Task A and the 18 questions judged in Task B during the preliminary evaluation are now available. Note: these judgments were made using the preliminary evaluation collection; the context and sentence IDs may have shifted in the primary evaluation collection.

11/13 Update: The preliminary judgments have been corrected to account for an issue with the CORD-19 collection used during the preliminary evaluation. If you downloaded them before 11/13, please download them again.

Preliminary Judgments (49 KB) MD5 Checksum

Questions

In conjunction with the TREC-COVID track at the 2020 Text REtrieval Conference (TREC), we have prepared sets of approximately 45 questions. Specifically, two sets of questions will be provided: one for expert-level questions and one for consumer-level questions. By design, many of these questions will overlap, allowing us to evaluate the extent to which the background of the user affects their preference for answers. The majority of these questions originated from consumers' interactions with MedlinePlus. Additional scientific questions were developed based on group discussions from the National Institutes of Health (NIH) special interest group on COVID-19, questions asked by Oregon Health & Science University clinicians, and responses to a public call for questions.

A new set of 30 expert-level and consumer-friendly questions is provided for the final evaluation cycle. None of these questions has been evaluated in TREC-COVID; thus, systems will not be able to rely on existing document-level relevance judgments. However, relevance judgments produced during the preliminary evaluation are provided here.

Task A Questions (12 KB) Task B Questions (10 KB)

The goal of the first, preliminary evaluation cycle is to produce data that can be used to develop systems for the final evaluation cycle in the fall. To reduce the barrier to entry, we will be using 45 topics evaluated in the fourth round of TREC-COVID. Participants are free to use the document-level relevance judgments for these topics, available here, during the preliminary evaluation cycle. Please note that in the final evaluation cycle there will be no document-level relevance judgments; they are provided only during the preliminary evaluation cycle to expedite the process.

The results of the preliminary evaluation cycle will be provided to all participants, not just those that participated in the preliminary evaluation.

Task A

The first 45 topics evaluated in the fourth round of TREC-COVID are used as-is for Task A.

Task B

A subset of topics evaluated in the fourth round of TREC-COVID updated with consumer-friendly narratives are used for Task B.


The questions in this task will be provided in JSON format.
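For orientation, the sketch below loads a question file assuming a simple JSON list of question objects; the field names (question_id, question, query, background) mirror typical TREC-COVID-style topics and are assumptions to be verified against the released files.

import json

# Assumed layout: a JSON list of question objects with TREC-COVID-style fields.
with open("task_a_questions.json") as f:
    questions = json.load(f)

for q in questions:
    # "question" is the natural-language question; "query" (keywords) and
    # "background" (narrative) are assumed optional fields.
    print(q.get("question_id"), "-", q.get("question"))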


Evaluation

For each task, participants will provide ranked lists of answers for each question. Answer judgments will be provided by librarian indexers at the U.S. National Library of Medicine (NLM). Answers will be evaluated within the context from which they were extracted (where context refers to the paragraphs or sections described above). Within an evaluation cycle, answer assessment will proceed in two rounds:

  1. Answer-key Generation

    In the first evaluation round, assessors will have access to the answers produced by participating teams as well as an ad-hoc search engine over the document collection. The goal in this round is for assessors to explore the answers from teams as well as the document collection to determine a set of atomic facts that answer the question. Following TAC tradition, we refer to these facts as nuggets. For example, nuggets that may be produced for the question What is the origin of COVID-19? include Malayan pangolins, bats, Guangdong province, etc. The primary role of this round is to create an answer key for the question comprising the nuggets identified in participants' answers or in the assessor's own ad-hoc search of the collection. The search engine is provided to help assessors explore the topic and identify nuggets that may not have been returned by any teams. Assessors are not expected to exhaustively identify every possible nugget that was not returned by teams; rather, the intent is for them to identify important (at the discretion of the assessor) nuggets that they feel should be included in the answer key based on their understanding of the topic.

  2. Passage Annotation

    In the second evaluation round, the answer key (list of nuggets) will be fixed. Assessors will be given the same set of answer passages and contexts used in round one. This time, they will be asked to annotate sentences in each context, indicating which nugget(s) (if any) are addressed by each sentence. For example, the sentence Malayan pangolins (Manis javanica) illegally imported into Guangdong province contain coronaviruses similar to SARS-CoV-2 could be annotated as containing two nuggets: Malayan pangolins and Guangdong province. Unannotated sentences will be assumed to contain no nuggets.

The primary evaluation metric will favor exploring the landscape of answers (or nuggets) in the collection, encouraging systems to provide a diverse list of answers. Specifically, we will rank teams using a modified version of Normalized Discounted Cumulative Gain (NDCG) where (1) an answer is considered relevant if and only if it describes a nugget that has not been included in any of the answers at earlier ranks in the list and (2) answers are penalized based on their length (in sentences). The evaluation script is available for download.
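The evaluation script is authoritative for the exact definitions; the sketch below only illustrates the general idea of a novelty-based gain combined with a length penalty, where the specific gain and penalty formulas are assumptions rather than the official metric.

import math
from typing import Dict, List, Set

def novelty_dcg(ranked_answers: List[List[str]],
                nugget_annotations: Dict[str, Set[str]]) -> float:
    """ranked_answers: for each rank, the list of sentence IDs in that answer.
    nugget_annotations: sentence ID -> set of nugget IDs addressed by that sentence.
    Returns an (unnormalized) DCG in which an answer only gains credit for nuggets
    not already covered at earlier ranks, and the gain is divided by the answer's
    length in sentences (an assumed form of the length penalty)."""
    seen: Set[str] = set()
    dcg = 0.0
    for rank, sentence_ids in enumerate(ranked_answers, start=1):
        nuggets: Set[str] = set()
        for sid in sentence_ids:
            nuggets |= nugget_annotations.get(sid, set())
        novel = nuggets - seen
        seen |= nuggets
        gain = len(novel) / max(len(sentence_ids), 1)
        dcg += gain / math.log2(rank + 1)
    return dcg

Dividing this value by the DCG of an ideal ordering of answers would yield a normalized (NDCG-style) score.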

11/18 Update: The evaluation script has been updated to account for minor changes in the corrected preliminary judgment file. If you downloaded the script before 11/18 and want to use it with the corrected preliminary collection, please download it again.

EPIC Eval (9 KB)

Registration

To register for the EPIC-QA task, please use the TAC registration form available from NIST (coming soon).

Submission

Participants are allowed to submit three (3) runs for each task. Submissions that do not conform to the following file format will be rejected without consideration. The format for run submissions is inspired by the standard trec_eval format and is whitespace-delimited:

QUESTION_ID    Q0    START_SENTENCE_ID:END_SENTENCE_ID    RANK    SCORE    RUN_NAME
where QUESTION_ID is the ID of the question from the question JSON file, Q0 is a required constant for compatibility, START_SENTENCE_ID & END_SENTENCE_ID are (inclusive) IDs indicating a run of contiguous sentences from the same document context, RANK is the rank (1-1000) of the answer in the list of answers retrieved for that question, SCORE is a floating point numeric score or weight assigned to that answer, and RUN_NAME is the name of the run associated with this submission file (the RUN_NAME should not change within a submission file).

An example submission file is available for download.
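As an illustration of the format only, the sketch below writes two hypothetical answers for a single question; all IDs and the run name are placeholders rather than real collection identifiers.

# Hypothetical submission lines for a run named "my_run"; the IDs are placeholders.
answers = [
    ("Q001", "doc_0001-C001-S002", "doc_0001-C001-S004", 1, 12.7),
    ("Q001", "doc_0042-C003-S001", "doc_0042-C003-S003", 2, 11.3),
]
with open("my_run.txt", "w") as out:
    for question_id, start_id, end_id, rank, score in answers:
        out.write(f"{question_id}\tQ0\t{start_id}:{end_id}\t{rank}\t{score}\tmy_run\n")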

Timeline

August 21
Preliminary evaluation cycle begins.
September 21
Preliminary evaluation cycle ends.
October 26
Preliminary evaluation judgments become available.
November 3
Final evaluation cycle begins.
December 11
Final evaluation cycle ends.

Mailing List

The mailing list for this track will be epic-qa@list.nist.gov. Participants may join the mailing list by joining the EPIC-QA Google Group (i.e., by clicking "Join Group"). A Google account (not necessarily a Gmail account) is required to join the group. Group members will receive messages that are sent to the group mailing list. Messages to the mailing list should be sent to epic-qa@list.nist.gov.

EPIC-QA participants without a Google account who wish to join the mailing list should send a message to tac-manager@nist.gov with the subject line "subscribe epic-qa".

Organizers

Travis Goodwin & Dina Demner-Fushman
U.S. National Library of Medicine
Kyle Lo & Lucy Lu Wang
Allen Institute for AI
William R. Hersh
Oregon Health & Science University
Hoa T. Dang & Ian M. Soboroff
National Institute of Standards and Technology