NLP resources at NLM

This page provides access to data collections created to support research in consumer-health question answering, extraction of adverse drug reactions, extraction of information from MEDLINE®/PubMed® citations, and many other projects of the Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine (NLM). Resources created for the Indexing Initiative projects can be found at https://ii.nlm.nih.gov/DataSets/. Global resources are listed in the Online Registry of Biomedical Informatics Tools (ORBIT).

claims_abstracts_v4_rectefied.json

Collection	Description	Created
Persistent PubMed Abstracts for BioNLP Research	A static online collection of MEDLINE®/PubMed® citations consisting of titles and abstracts (when available) for articles included in the MEDLINE database in a given year.	September 2016
CHQA-Corpus-1.0	A collection of 2,614 consumer health questions annotated with named entities, question topic, question triggers, and question frames.	August 2017
IOWA collection	Clinical questions described in Ely JW, Osheroff JA, Ebell MH, Bergus GR, Levy BT, Chambliss ML, et al. Analysis of questions asked by family doctors regarding patient care. BMJ 1999;319:358-361.	1999
SPL-ADR-200db	A collection of 200 Structured Product Labels fully annotated with adverse drug reactions (ADRs), and a database of distinct ADRs for each of the sections designated to report ADRs for each of the 200 drugs.	April 2017
PlacentaCollection	A collection of MEDLINE abstracts fully annotated with gene-disease relationships and gene and protein activity associated with placenta-mediated diseases.	April 2017
VQA 2018 collection (ImageCLEF)	A collection of medical images and visual question-answer pairs for the ImageCLEF 2018 evaluation.	April 2017
BART fine-tuned checkpoint	A Bidirectional and Auto-Regressive Transformers (BART) model fine-tuned on BioASQ data for single-document, question-driven summarization.	Feb 2020
MedVidQA and MedVidCL Video Features	Video features of the MedVidQA and MedVidCL datasets extracted using pretrained I3D and ViT models.	Jan 2022
MedVidQA at TRECVID 2023 Video Features	Video features for videos released under MedVidQA at TRECVID 2023, extracted using the I3D model.	June 2023
OpenI-Images (Train)	A collection of OpenI images (training) extracted from Open-I (https://openi.nlm.nih.gov/) used in this work (https://arxiv.org/pdf/2210.02401.pdf).	July 2023
OpenI-Images (Test)	A collection of OpenI images (test) extracted from Open-I (https://openi.nlm.nih.gov/) used in this work (https://arxiv.org/pdf/2210.02401.pdf).	July 2023
OpenI-ResNet (Train)	Image features of the OpenI dataset (training) extracted using the ResNet-50 model.	July 2023
OpenI-ResNet (Test)	Image features of the OpenI dataset (test) extracted using the ResNet-50 model.	July 2023
OpenI-ConvNeXt (Train)	Image features of the OpenI dataset (training) extracted using the ConvNeXt-L model.	July 2023
OpenI-ConvNeXt (Test)	Image features of the OpenI dataset (test) extracted using the ConvNeXt-L model.	July 2023
HealthVidQA	A collection of video question-answering datasets annotated with healthcare questions and visual answers from instructional videos.	March 2024
HealthVer	An evidence-based fact-checking dataset for verifying real-world claims about COVID-19 against scientific articles. https://aclanthology.org/2021.findings-emnlp.297.pdf	September 2021
ArchEHR-QA Shared Task Test Dataset	ArchEHR-QA shared task test dataset release for registered participants. Provided as-is while the primary host is unavailable.	Jan 29 2026
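
The listing above is a three-column, tab-separated table (Collection, Description, Created), so it can be read with standard tooling. The following is a minimal sketch, assuming the table has been saved as tab-separated text; `SAMPLE_TSV`, `parse_catalogue`, and `created_year` are hypothetical names introduced only for illustration and are not part of any NLM tool. Because the "Created" values use mixed formats ("August 2017", "Jan 29 2026", "1999"), the sketch relies only on the year, which is the last token in every row shown here.

```python
import csv
import io

# Three rows copied from the table above (header first, tab-separated).
SAMPLE_TSV = "\n".join([
    "Collection\tDescription\tCreated",
    "CHQA-Corpus-1.0\tA collection of 2,614 consumer health questions annotated "
    "with named entities, question topic, question triggers, and question frames."
    "\tAugust 2017",
    "HealthVidQA\tA collection of video question-answering datasets annotated "
    "with healthcare questions and visual answers from instructional videos."
    "\tMarch 2024",
    "ArchEHR-QA Shared Task Test Dataset\tArchEHR-QA shared task test dataset "
    "release for registered participants. Provided as-is while the primary host "
    "is unavailable.\tJan 29 2026",
])


def parse_catalogue(text):
    """Parse tab-separated catalogue text into a list of dicts keyed by header."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # descriptions may contain quote characters
    )
    return list(reader)


def created_year(row):
    """Return the year from a 'Created' value, assumed to be its last token."""
    return int(row["Created"].split()[-1])


if __name__ == "__main__":
    rows = parse_catalogue(SAMPLE_TSV)
    # List collections from newest to oldest by creation year.
    for row in sorted(rows, key=created_year, reverse=True):
        print(created_year(row), row["Collection"])
```

Sorting by year alone is a deliberate simplification: it avoids guessing a date format for entries that give only a year or an abbreviated month.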