Open-Access Data and Computational Resources to Address COVID-19

COVID-19 open-access data and computational resources are being provided by federal agencies, including NIH, public consortia, and private entities. These resources are freely available to researchers, and this page will be updated as more information becomes available.

The Office of Data Science Strategy seeks to provide the research community with links to open-access data, computational, and supporting resources. These resources are being aggregated and posted for scientific and public health interests. Inclusion of a resource on this list does not mean it has been evaluated or endorsed by NIH.

To suggest a new resource, please send an email with the name of the resource, the website, and a short description to datascience@nih.gov.


Data Resources to Address COVID-19

Resource Description Data Type

Broad Terra cloud commons for pathogen surveillance

The Broad Terra cloud workspace for best practices with COVID-19 genomics data

  • Raw COVID-19 sequencing data from the NCBI Sequence Read Archive (SRA)
  • Workflows for genome assembly, quality control, metagenomic classification, and aggregate statistics
  • Jupyter notebook that produces quality control plots for workflow output

genomics

ClinicalTrials.gov COVID-19 related studies

View listed clinical studies related to the coronavirus disease (COVID-19). Studies are submitted in a structured format directly by the sponsors and investigators conducting the studies. Submitted study information is generally posted on ClinicalTrials.gov within 2 days after initial submission, and site content is updated daily. Full website content is also available through the API.

clinical studies

CORD-19: COVID-19 Open Research Dataset and AI Challenge

Freely available dataset of 45,000 scholarly articles, including over 33,000 with full text, on COVID-19, SARS-CoV-2, and related coronaviruses. This machine-readable resource is provided to enable the application of natural language processing and other AI techniques.

See the CORD-19 Challenge, developed in partnership with Kaggle.

Read the accompanying call to action from the White House Office of Science & Technology Policy and learn more about the creation of CORD-19.

literature

COVID Digital Pathology Resource (COVID-DPR)

The COVID-DPR provides whole slide images of histopathologic samples relevant to COVID-19, including biopsy samples and autopsy specimens. The current focus of the repository includes tissue from the lungs, heart, liver, and kidney. The repository contains examples of H1N1, SARS, and MERS for comparison.

digital images

Dimensions COVID-19 publications, datasets, and clinical trials

All Dimensions publications, datasets, and clinical trials related to COVID-19, updated daily. Content is exported from the openly accessible Dimensions application at https://covid-19.dimensions.ai/.

literature

GenBank Nucleotide Sequences

Provides rapid, open, and unrestricted access to virus nucleotide sequences and is the repository recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.

genomics

GenBank Protein Sequences

Provides rapid, open, and unrestricted access to conceptually translated virus protein sequences and is the repository recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.

genomics

GEO DataSets

Human transcriptional responses to SARS-CoV-2 infection

RNA-seq and expression counts

GISAID

International database of hCoV-19 genome sequences and related clinical and epidemiological data

genomics

iSearch COVID-19 Portfolio

Comprehensive, expert-curated portfolio of COVID-19 publications and preprints that includes peer-reviewed articles from PubMed and preprints from medRxiv, bioRxiv, ChemRxiv, and arXiv

literature

LitCovid

NLM-curated literature hub for COVID-19

literature

Modeling Infectious Disease Agents Study (MIDAS) online portal for COVID-19

NIGMS-funded modeling research. Public-access data collections with documented metadata.

Case studies in Asia and Iran; dashboards and visualization tools

NCBI Virus: SARS-CoV-2 data hub

SARS-CoV-2 focused content from NCBI Virus, including links to related resources. Search, filter, and download the most up-to-date nucleotide and protein sequences from GenBank and RefSeq (taxid 2697049). Generate multiple sequence alignments and phylogenetic trees for sequences of interest. Provides one-click access to the Betacoronavirus BLAST database and relevant literature in PubMed.

genomics

Nextstrain COVID-19 genetic epidemiology

Open-source SARS-CoV-2 genome data and analytic and visualization tools

genomics

outbreak.info

A resource to aggregate data critical to scientific research during outbreaks of emerging diseases, such as COVID-19

various

PubChem

Small molecule compounds, bioactivity data, biological targets, bioassays, chemical substances, patents, and pathways

bioactivity

PubMed Central (PMC) COVID-19 Initiative

On March 13, national science and technology advisors from a dozen countries, including the United States, called on publishers to voluntarily agree to make their COVID-19 and coronavirus-related publications, and the available data supporting them, immediately accessible in PMC and other appropriate public repositories to support the ongoing public health emergency response efforts. The articles added to PMC are distributed through the PMC Open Access Subset and are made available in CORD-19.

literature

Research Data Alliance Working Group

Guidelines for data deposition in any common data hub or platform to facilitate data sharing in public health emergencies for scientific research

omics, clinical research, epidemiology, social sciences, community participation

Sequence Read Archive (SRA)

Provides rapid, open, and unrestricted access to virus nucleotide or metagenomic sequence data and is the repository recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.

genomics

UC Health clinical data warehouse

Data warehouse that uses the Observational Medical Outcomes Partnership standard to integrate patient data across University of California health systems

participant-level clinical data

Virus Outbreak Data Network (VODAN)

Federated, AI-ready repository of COVID-19 data that adheres to the FAIR principles (Findable, Accessible, Interoperable, Reusable)

various
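The NCBI Virus entry above identifies SARS-CoV-2 by its taxonomy ID (taxid 2697049), and the GenBank and SRA entries refer to term-based queries. As a minimal sketch of what such a query can look like, the Python snippet below builds a search URL for the NCBI E-utilities ESearch service. This page does not itself mention E-utilities, so the choice of service, the `nucleotide` database, and the helper name `build_esearch_url` are assumptions made for illustration. The snippet only constructs the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint (assumed here as one way
# to run a term-based query; the page above does not prescribe a method).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Taxonomy ID quoted in the NCBI Virus entry above.
SARS_COV_2_TAXID = 2697049


def build_esearch_url(taxid: int, db: str = "nucleotide", retmax: int = 20) -> str:
    """Return an ESearch URL for records assigned to the given taxonomy ID.

    `build_esearch_url` is a hypothetical helper, not part of any NCBI library.
    """
    params = {
        "db": db,                                # Entrez database to search
        "term": f"txid{taxid}[Organism:exp]",    # taxid query, including child taxa
        "retmax": retmax,                        # maximum number of IDs to return
        "retmode": "json",                       # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url(SARS_COV_2_TAXID))
```

Because of the indexing delay noted in the entries above, a query like this may not return the most recent submissions right away.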

Computational Resources to Address COVID-19

Resource Description

Atrio

Powered by the Atrio software platform, this resource offers easy access to large numbers of freely available, high-performing GPU and CPU resources. Contact support for help creating portable application containers that are performance-optimized for these powerful systems.

Betacoronavirus BLAST

BLAST database containing sequences from Betacoronavirus (taxid 694002), including the latest SARS-CoV-2 sequences in GenBank and RefSeq.

Cloud resources for COVID-19 research

High-performance computing resources that are freely and immediately available for COVID-19 research. Provided by Rescale, Google Cloud, and Microsoft Azure.

The COVID-19 High Performance Computing (HPC) Consortium

Computing infrastructure: XSEDE provides the portal, and the list of computing resources is updated regularly. The consortium includes DOE National Laboratories, IBM, NSF, NASA, technology companies, and academic computing centers.


Supporting Resources

Resource Description

Data-Against-COVID Team

A group of more than 600 volunteer data scientists, machine learning experts, bioinformaticians, and professional software developers who have joined together to offer their expertise for any data analysis problems that arise in the context of the ongoing coronavirus pandemic.

NASEM Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats

This National Academies of Sciences, Engineering, and Medicine (NASEM) standing committee provides rapid expert consultation on data elements and systems design for modeling and decision making for the COVID-19 pandemic.

GenBank/SRA SARS-CoV-2 Sequence Submissions

Quickly and easily submit assembled and unassembled SARS-CoV-2 data with help from NCBI if needed.

Generalist repositories supporting discoverability of COVID-19 data

These seven generalist repositories are supporting the discoverability and reusability of COVID-19 data and associated code in different ways: Vivli, Figshare, GitHub, Dryad, Zenodo, Harvard Dataverse, Mendeley Data.

NIAID Overview of Coronaviruses

Information about coronaviruses, including COVID-19, and resources for researchers

Schema.org

Schema.org 7.0 includes fast-tracked new vocabulary to assist the global response to the Coronavirus outbreak. Schema.org creates, maintains, and promotes schemas for structured data.

Viral Annotation DefineR (VADR) Sequence Annotation Tool

NCBI developed a system called Viral Annotation DefineR (VADR) that validates and annotates viral sequences, including SARS-CoV-2.

Virus Pathogen Resource (ViPR)

ViPR is an NIAID-funded resource that supports the research of viral pathogens in the NIAID Category A-C Priority Pathogen lists and those causing (re)emerging infectious diseases. It provides a dedicated gateway to SARS-CoV-2 data that integrates data from external sources (GenBank, UniProt, Immune Epitope Database, Protein Data Bank), direct submissions, analysis pipelines, and expert curation, and provides a suite of bioinformatics analysis and visualization tools for virology research.
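The Schema.org entry above refers to new vocabulary fast-tracked for the coronavirus response. As a rough illustration of how such structured data might be produced, the sketch below assembles a JSON-LD record in Python. It assumes the `SpecialAnnouncement` type and its `name`, `text`, and `datePosted` properties, which to our understanding were part of that release; the helper name and all field values are hypothetical placeholders, and the Schema.org documentation remains the authority on the actual vocabulary.

```python
import json


def make_special_announcement(name: str, text: str, date_posted: str) -> dict:
    """Build a JSON-LD dictionary for a Schema.org SpecialAnnouncement.

    Hypothetical helper for illustration; the property names are assumed
    from the Schema.org vocabulary and should be checked against it.
    """
    return {
        "@context": "https://schema.org",
        "@type": "SpecialAnnouncement",
        "name": name,                # short title of the announcement
        "text": text,                # body text of the announcement
        "datePosted": date_posted,   # ISO 8601 date string
    }


if __name__ == "__main__":
    # Placeholder values only; not taken from any real announcement.
    record = make_special_announcement(
        name="Example data resource update",
        text="Placeholder text describing a COVID-19 related update.",
        date_posted="2020-04-01",
    )
    print(json.dumps(record, indent=2))
```

The printed JSON could then be embedded in a web page inside a `script` element of type `application/ld+json`, which is the usual way to publish JSON-LD.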