Background

Health literacy, the ability of people to turn health-related information into informed actions, has a measurable impact on health outcomes, and its deficits help drive the health disparities linked to education level (McCray 2004, Berkman 2011). Though the Internet and open science initiatives have placed unprecedented amounts of biomedical knowledge at the fingertips of medical practitioners and medical consumers alike, consumers often run into a language barrier, even when the resources are in their native language.

Advances in deep learning may soon make it feasible to adapt difficult scientific text for patients and caregivers automatically and accurately. However, significant barriers remain before such models are viable in production (Ondov 2022). It is our hope that this track will stimulate research in automatic plain-language adaptation of biomedical text resources to help improve health literacy and engagement among patients and caregivers.

Task

The task is to adapt biomedical abstracts for the general public using plain language. Given a set of abstracts (the source), your system will provide output for each sentence of the source. When adapting, source sentences may be split, in which case the output for one source sentence will be multiple target sentences, or omitted, in which case the output for that source sentence will be blank. However, source sentences may not be merged, and the output for a given source sentence should not contain information from other source sentences. Both source and output will be in English. Source abstracts were retrieved to answer consumer questions asked on MedlinePlus. These questions will be used to guide manual evaluation (see Evaluation). Teams will have access to the questions and may provide their systems with them if desired.

Input and Output

Below, we illustrate example input and output:
Input (question):
how is strep throat treated?

Input (abstract) and output (adapted), shown sentence by sentence:

Source: Acute pharyngitis/tonsillitis, which is characterized by inflammation of the posterior pharynx and tonsils, is a common disease.
Adapted: Sore throat/tonsillitis, or when the back of the throat or tonsils is inflamed, is common.

Source: Several viruses and bacteria can cause acute pharyngitis; however, Streptococcus pyogenes (also known as Lancefield group A β-hemolytic streptococci) is the only agent that requires an etiologic diagnosis and specific treatment.
Adapted: Many viruses and bacteria can cause short-term sore throat. However, group A strep, caused by Group A strep bacteria, is the only cause that must be identified based on signs and symptoms and treated.

Source: S. pyogenes is of major clinical importance because it can trigger post-infection systemic complications, acute rheumatic fever, and post-streptococcal glomerulonephritis.
Adapted: Group A strep bacteria are important to identify because they can cause post-strep throat complications throughout the body, acute rheumatic fever (a disease that inflames the body's tissues), and post-strep throat kidney disease.

Source: Symptom onset in streptococcal infection is usually abrupt and includes intense sore throat, fever, chills, malaise, headache, tender enlarged anterior cervical lymph nodes, and pharyngeal or tonsillar exudate.
Adapted: Strep throat symptoms usually happen quickly and include severe sore throat, fever, chills, general discomfort, headache, swollen lymph nodes in the front of the neck, and white or yellow spots on the throat or tonsils.

Source: Cough, coryza, conjunctivitis, and diarrhea are uncommon, and their presence suggests a viral cause.
Adapted: Cough, cold symptoms, pink eye, and diarrhea are not common and might be caused by a virus.

Source: A diagnosis of pharyngitis is supported by the patient's history and by the physical examination.
Adapted: Learning the person's history and doing a physical exam are used to diagnose strep throat.

Source: Throat culture is the gold standard for diagnosing streptococcus pharyngitis.
Adapted: A throat swab to find, grow, and test bacteria in the throat that make you sick is the best way to diagnose strep throat.

Source: However, it has been underused in public health services because of its low availability and because of the 1- to 2-day delay in obtaining results.
Adapted: However, it has not been used as much as it should because it is not widely available and takes 1 to 2 days to get results.

Source: Rapid antigen detection tests have been used to detect S. pyogenes directly from throat swabs within minutes.
Adapted: Rapid strep tests have been used to find fragments of bacteria that cause strep throat from swabs within minutes.

Source: Clinical scoring systems have been developed to predict the risk of S. pyogenes infection.
Adapted: Scoring systems have been made to predict the risk of strep throat.

Source: The most commonly used scoring system is the modified Centor score.
Adapted: (blank; this sentence is omitted in the adaptation)

Source: Acute S. pyogenes pharyngitis is often a self-limiting disease.
Adapted: Short-term strep throat often goes away on its own without treatment.

Source: Penicillins are the first-choice treatment.
Adapted: Penicillins, a type of antibiotics, are prescribed most commonly.

Source: For patients with penicillin allergy, cephalosporins can be an acceptable alternative, although primary hypersensitivity to cephalosporins can occur.
Adapted: For people allergic to penicillin, cephalosporins, another type of antibiotics, can be prescribed, although people can be allergic to cephalosporins.

Source: Another drug option is the macrolides.
Adapted: Another drug option is macrolides, another type of antibiotics.

Source: Future perspectives to prevent streptococcal pharyngitis and post-infection systemic complications include the development of an anti-Streptococcus pyogenes vaccine.
Adapted: Making an anti-strep throat vaccine could be one way to prevent strep throat and post-strep throat complications throughout the body in the future.

Data

The training data are drawn from the publicly available PLABA dataset (Attal 2023), which comprises 750 abstracts, each manually adapted to plain language by at least one annotator, for a total of 7,643 sentence pairs.

The complete guidelines given to annotators are available in the Readme.pdf below.

Notable guidelines include:
  • Long, complex sentences may be split into multiple sentences.
  • Sentences that are not useful to a general audience may be omitted.
  • Source sentences may not be merged.
  • Expert terms should be replaced with common alternatives or explained in plain language.

data.json (3mb) Readme.pdf
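
After downloading, a quick sanity check can confirm the data arrived intact. The snippet below is a minimal Python sketch; it assumes only that data.json parses as a single JSON collection of abstracts (the authoritative schema is described in the Readme.pdf):

    # Minimal sketch: confirm the PLABA training data downloaded intact.
    # Assumes data.json is one JSON collection of abstracts; see Readme.pdf
    # for the real schema of sentences and their adaptations.
    import json

    with open("data.json", encoding="utf-8") as f:
        data = json.load(f)

    # The dataset comprises 750 abstracts in total.
    print(f"Loaded {len(data)} abstracts")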

Evaluation

Submissions will be evaluated with both automatic, reference-based metrics and manual assessments of quality.

  1. Automatic Evaluation

    Automatic evaluation will use SARI (Xu 2016). Like BLEU, SARI compares output to a set of gold-standard references. However, SARI makes special considerations for simplified text, making it more appropriate for this task. Specifically, it considers the source as well as the references, allowing it to measure n-grams that are kept, added, or removed relative to the source. The scores for these three operations are balanced to produce the final score. Additionally, since there are many possible ways to adapt a given sentence, the test set will have 4 gold references per sentence, each from a different expert annotator. The SARI score will thus account for different ways to paraphrase and explain while adapting. Test pairs will be manually adapted using the same guidelines as the training data (see Data).

    The evaluation notebook below will use SARI to evaluate system output for the validation and test splits of the original PLABA dataset (not the final test set for this task). The data file (data.json, above) is required.

    eval-sari.ipynb
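
    For a quick check outside the notebook, SARI can also be computed directly. The sketch below uses the Hugging Face evaluate package (our assumption for illustration, not the official tooling; the notebook above remains the reference implementation). The source and first reference are taken from the example abstract above; the prediction and second reference are invented:

        # Sketch: score one sentence with SARI via the `evaluate` package.
        import evaluate  # pip install evaluate

        sari = evaluate.load("sari")

        sources = ["Acute S. pyogenes pharyngitis is often a self-limiting disease."]
        predictions = ["Short-term strep throat often goes away on its own."]
        # The real test set provides 4 references per sentence; two shown here.
        references = [[
            "Short-term strep throat often goes away on its own without treatment.",
            "Strep throat usually clears up by itself.",
        ]]

        # SARI scores n-grams kept, added, and deleted relative to the source.
        print(sari.compute(sources=sources, predictions=predictions,
                           references=references))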

  2. Manual Evaluation
    Due to the high stakes of the biomedical domain, it is important to evaluate system outputs manually. Experts will rank system outputs for a sample of abstracts along the following axes:
    • For every sentence:
      • Sentence simplicity - Are long, complex sentences appropriately split?
      • Term simplicity - Are expert terms in the source replaced with alternatives or explained in the output?
      • Term accuracy - Are substitutions and explanations of expert terms accurate?
      • Fluency - Does the output follow grammatical rules and read smoothly?
    • For up to 3 sentences that answer the question or are the most relevant to it (from consensus of annotators):
      • Completeness - How much of the source information does the output provide?
      • Faithfulness - Do points made in the output match those of the source?

test_data.tar.gz (260kb)

Registration

Teams can register for the PLABA task through the TAC website.

Submission

Submissions will be made via TAC; the submission system will open soon.

You have two options for submitting:

  1. Aligned: One line for each sentence of the original abstract (see Task). Blank lines indicate dropped sentences. A line may contain multiple sentences, to split an input sentence or to add background. (A sanity-check sketch for this format follows the list below.)
  2. Unaligned: For the convenience of teams using document-level systems, you may submit each adapted abstract on a single line. Sentences will be automatically aligned to the source using Vecalign (Thompson & Koehn 2019). Note, however, that all evaluation still happens at the sentence level, and we do not guarantee that the automatic alignment will yield the highest score. Teams are therefore encouraged to perform alignment themselves to ensure accurate evaluation.
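
For aligned submissions, a simple check is that each output file has exactly one line per source sentence. Below is a minimal Python sketch of such a check; it assumes the source abstracts are distributed one sentence per line, and the filenames are hypothetical (see the example submission archives below for the official layout):

    # Sanity-check an "aligned" submission: one output line per source
    # sentence, with blank lines marking omitted sentences.
    from pathlib import Path

    def check_aligned(source_file: str, submission_file: str) -> None:
        src = Path(source_file).read_text(encoding="utf-8").splitlines()
        out = Path(submission_file).read_text(encoding="utf-8").splitlines()
        if len(out) != len(src):
            raise ValueError(f"expected {len(src)} lines, got {len(out)}")
        dropped = sum(1 for line in out if not line.strip())
        print(f"{len(src)} sentences; {dropped} omitted (blank) outputs")

    # Hypothetical filenames, for illustration only.
    check_aligned("abstract_001.src.txt", "abstract_001.txt")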

You can submit up to 3 runs, ranked by your team, though manual evaluation will only be performed on the top-ranked run.

example_submission-aligned.tar.gz (255kb) example_submission-unaligned.tar.gz (255kb)

Timeline

July 19: Evaluation data released ✔️
August 30 (extended from August 16): Submissions due ✔️
October 18: Results posted ✔️

Mailing List

The mailing list for this track is plaba2023@googlegroups.com. Participants may join it by joining the PLABA 2023 Google Group; a Google account (not necessarily a Gmail account) is required. Group members receive all messages sent to the list, and messages to the track should be sent to plaba2023@googlegroups.com.

Organizers

Brian Ondov & Dina Demner-Fushman
U.S. National Library of Medicine
Hoa T. Dang
National Institute of Standards and Technology