A public data lake for COVID-19 research and development

The AWS COVID-19 data lake is a centralized repository of up-to-date and curated datasets focused on the spread and characteristics of the novel coronavirus (SARS-CoV-2). The data lake contains pre-processed, curated, and publicly readable datasets that are ready for analysis by anyone, many of which are sourced through AWS Data Exchange.

Hosted on the AWS cloud, this curated data lake contains useful data sets such as COVID-19 case tracking data from The New York Times, COVID-19 testing data from the COVID Tracking Project, hospital bed availability from Definitive Healthcare, health survey data from the Delphi Research Group, and research data from over 45,000 articles about COVID-19 and related coronaviruses from the Allen Institute for AI. As new versions of the datasets are published and other reliable sources become available, we will update the data lake.

We hope that organizations and individuals will use this data to help in the fight against COVID-19. For instance, local health authorities can build out their dashboards to efficiently deploy vital resources like hospital beds and ventilators as they track the spread of the disease. Epidemiologists can use it to complement their existing models and datasets and generate better forecasts of hotspots and trends, such as those related to testing availability and population size and density.

If you are interested in subscribing to this data in machine-readable form or contributing datasets for the public data lake, please visit our AWS Data Exchange page. You can also apply for funding for your diagnostic research project via the AWS Diagnostic Development Initiative (DDI).

Pricing

This data is hosted for free in Amazon S3. Normal charges to request data in S3 are disabled on the public data lake bucket, so you incur no cost there. However, you will still incur standard charges for the services that you use to analyze the data lake, like Amazon Athena.
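As a rough illustration of the analysis charges mentioned above, the sketch below estimates what a single Athena query might cost from the number of bytes it scans. The per-terabyte rate and the 10 MB per-query minimum are assumptions used for illustration, not figures from this page; check the current Amazon Athena pricing for your Region before relying on them.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def estimate_athena_cost(bytes_scanned, usd_per_tb=5.0, min_bytes=10 * MB):
    """Estimate the cost in USD of one Athena query.

    usd_per_tb and min_bytes are ASSUMED defaults for illustration only;
    replace them with the values published for your Region.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    # Queries that scan less than the minimum are billed at the minimum.
    billable_bytes = max(bytes_scanned, min_bytes)
    return billable_bytes / TB * usd_per_tb
```

The number of bytes a finished query scanned is reported by Athena in the `Statistics.DataScannedInBytes` field of the `get_query_execution` response, so that value can be passed straight into this function.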

Getting started

To make the data from the AWS COVID-19 data lake available in your AWS account, use the following AWS CloudFormation template to populate the Data Catalog. If you are signed in to your AWS account, the following link fills out most of the stack creation form for you; all you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get Started in the CloudFormation documentation. The template creates a covid-19 database in your Data Catalog, along with tables that point to the public AWS COVID-19 data lake.
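Once the stack has been created, one way to confirm that the tables exist is to list them from the Data Catalog. The following is a minimal sketch, assuming a boto3 Glue client is passed in and that the database is named covid-19 as described above; the function name `list_catalog_tables` is hypothetical.

```python
def list_catalog_tables(glue_client, database="covid-19"):
    """Return the sorted names of all tables in a Glue Data Catalog database.

    glue_client is expected to behave like boto3.client("glue").
    """
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page.get("TableList", []))
        # get_tables is paginated; keep going while a NextToken is returned.
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(names)
```

With AWS credentials configured, this could be called as `list_catalog_tables(boto3.client("glue"))`; the returned names should match the table names listed in the data catalog section.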

For information on how to set up the definitions for that data in an AWS Glue Data Catalog and then query it with Amazon Athena, please read this blog post and follow the step-by-step instructions. Questions about the data lake? Please reach out to aws-covid-19-data-lake@amazon.com.
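For readers who prefer to query from code rather than the console, the sketch below shows one possible way to run a query against the covid-19 database with the boto3 Athena client. It is not taken from the blog post; the function name `run_athena_query` and the S3 output location are placeholders, and the caller supplies the client.

```python
import time


def run_athena_query(athena_client, sql, output_location, database="covid-19",
                     poll_seconds=1.0, max_polls=60):
    """Run a SQL query in Athena and return the first page of results.

    athena_client is expected to behave like boto3.client("athena").
    output_location is an S3 URI in your own account for query results.
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    for _ in range(max_polls):
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Athena query {query_id} did not finish in time")

    # Only the first page of results is fetched here; for SELECT queries
    # the first row holds the column names.
    result = athena_client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT * FROM nytimes_states LIMIT 10", "s3://your-results-bucket/athena/")`, where the bucket name is a placeholder for a bucket you own. Standard Athena charges apply to each query.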

Data catalog

The following tables outline the data hosted in the data lake.

COVID-19 World Confirmed Cases, Deaths and Testing

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| world_cases_deaths_testing | This dataset includes data on confirmed cases, deaths, and testing. | Several | Rearc | Daily |

US Coronavirus (COVID-19) Cases

This data set tracks confirmed cases and deaths in the US by state and county.

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| nytimes_states | Data on COVID-19 cases at US state level | NY Times | Rearc | Daily |
| nytimes_counties | Data on COVID-19 cases at US county level | NY Times | Rearc | Daily |

Coronavirus Disease (COVID-19) Testing Data

This data set tracks the number of people tested, pending tests, and positive and negative tests for COVID-19.

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| covid_testing_states_daily | USA total test daily trend by state | COVID Tracking Project | Rearc | Daily |
| covid_testing_us_daily | USA total test daily trend | COVID Tracking Project | Rearc | Daily |
| covid_testing_us_total | USA total tests | COVID Tracking Project | Rearc | Daily |

USA Hospital Beds

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| hospital_beds | Data on hospital beds and their utilization in the USA | Definitive Healthcare | Rearc | Daily |

CORD19 Open Research Dataset Challenge

This is a collection of 45,000+ research articles (33,000+ with full text) about COVID-19, SARS-CoV-2, and related coronaviruses. We have preprocessed and enriched these with extracted annotations from Amazon Comprehend Medical.

| Table Name | Description | Source/Provider | Updated |
|---|---|---|---|
| alleninstitute_metadata | Metadata on papers pulled from the CORD-19 research challenge dataset. The 'sha' column indicates the paper id, which is the filename of the paper in the lake. | Allen Institute for AI | Weekly |
| alleninstitute_comprehend_medical | Amazon Comprehend Medical results run against the CORD-19 research challenge data set. | AWS | Weekly |

COVIDcast (COVID-19) Epidemiological Data

Delphi's COVIDcast datasets are based on a variety of data sources, including a CMU-run Facebook health survey, a Google-run health survey, lab test results provided by Quidel Inc, search data released by Google Health Trends, and outpatient doctor visits provided by a national health system.

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| covidcast_data | CMU Delphi's COVID-19 Surveillance data | Delphi Research Group (CMU) | Rearc | Daily |
| covidcast_metadata | CMU Delphi's COVID-19 Surveillance metadata | Delphi Research Group (CMU) | Rearc | Daily |

Tableau COVID-19 Data Hub

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| tableau_covid_datahub | This data set includes Coronavirus (COVID-19) data that has been gathered and unified from trusted sources including the New York Times and the European CDC. | NYT, ECDC | Tableau | Daily |

COVID-19 Prediction Models Counties & Hospitals

The Yu Group at UC Berkeley is working to help forecast the severity of the COVID-19 epidemic both for individual counties and individual hospitals.

| Table Name | Description | Source | Provider | Updated |
|---|---|---|---|---|
| prediction_models_severity_index | Severity Index models. | Yu Group at UC Berkeley | Rearc | Daily |
| prediction_models_county_predictions | County-level Predictions Data. | Yu Group at UC Berkeley | Rearc | Daily |

CORD19 Knowledge Graph

This is a graph-structured dataset created from the Allen Institute CORD-19 dataset. It contains sets of Nodes and Edges which constitute a graph network connecting paper metadata to itself, and to extracted annotations from Comprehend Medical.

| Table Name | Description | Source/Provider | Updated |
|---|---|---|---|
| covid_knowledge_graph_nodes_concept | Nodes based on dynamically generated output from Comprehend Medical | AWS | Weekly |
| covid_knowledge_graph_nodes_institution | Institution Nodes | AWS | Weekly |
| covid_knowledge_graph_nodes_author | Author Nodes | AWS | Weekly |
| covid_knowledge_graph_nodes_paper | Paper Nodes | AWS | Weekly |
| covid_knowledge_graph_nodes_topic | Topic Nodes based on a custom ontology | AWS | Weekly |
| covid_knowledge_graph_edges | Edges connecting the various Nodes | AWS | Weekly |
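To illustrate how node and edge tables combine into a graph, the sketch below builds an in-memory adjacency map from edge pairs and walks it to find everything within a given number of hops of a node. It assumes each edge has already been reduced to a (source id, target id) pair and treats edges as undirected; the actual column names and edge direction in covid_knowledge_graph_edges are not described here, so both are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (source_id, target_id) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def neighbors_within(adjacency, start, max_hops):
    """Return the ids of all nodes reachable from start in at most max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    seen.discard(start)
    return seen
```

With paper, author, and concept ids loaded from the tables above, a two-hop walk from a paper node would surface other papers that share an author or an extracted concept with it.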

Lookup tables to support visualizations

| Table Name | Description |
|---|---|
| country_codes | Lookup table for country codes |
| county_populations | Lookup table for population for each county based on recent census data |
| us_state_abbreviations | Lookup table for US state abbreviations |
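Because the database name covid-19 contains a hyphen, it generally needs to be double-quoted when it appears in an Athena SELECT statement. The sketch below builds a small preview query for any table listed in this catalog. The helper names are hypothetical, and no column names are assumed, so the query selects every column.

```python
import re

DATABASE = "covid-19"

# Table names in this catalog use lowercase letters, digits, and underscores.
_TABLE_NAME = re.compile(r"^[a-z0-9_]+$")


def qualified_name(table, database=DATABASE):
    """Return a double-quoted "database"."table" identifier."""
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f'"{database}"."{table}"'


def preview_sql(table, limit=10):
    """Build a query that returns the first few rows of a table."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {qualified_name(table)} LIMIT {limit}"
```

For example, `preview_sql("county_populations", 5)` produces a statement that can be pasted into the Athena query editor to inspect the lookup table before joining it to case data.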