A public data lake for COVID-19 research and development

The AWS COVID-19 data lake is a centralized repository of up-to-date and curated datasets focused on the spread and characteristics of the novel coronavirus (SARS-CoV-2). This data lake contains pre-processed, curated, publicly readable datasets that are ready for analysis by anyone, many of which are sourced through AWS Data Exchange.

Hosted on the AWS cloud, this curated data lake contains useful data sets such as COVID-19 case tracking data from The New York Times, COVID-19 testing data from the COVID Tracking Project, hospital bed availability from Definitive Healthcare, health survey data from the Delphi Research Group, and research data from over 45,000 articles about COVID-19 and related coronaviruses from the Allen Institute for AI. As new versions of the datasets are published and other reliable sources become available, we will update the data lake.

We hope that organizations and individuals will use this data to help in the fight against COVID-19. For instance, local health authorities can build out their dashboards to efficiently deploy vital resources like hospital beds and ventilators as they track the spread of the disease. Epidemiologists can use it to complement their existing models and datasets and generate better forecasts of hotspots and trends, such as those related to testing availability and population size and density.

If you are interested in subscribing to this data in machine-readable form or contributing datasets for the public data lake, please visit our AWS Data Exchange page. You can also apply for funding for your diagnostic research project via the AWS Diagnostic Development Initiative (DDI).

Pricing

This data is hosted for free in Amazon S3. Normal charges to request data in S3 are disabled on the public data lake bucket, so you incur no cost there. However, you will still incur standard charges for the services that you use to analyze the data lake, like Amazon Athena.

Getting started

To make the data from the AWS COVID-19 data lake available in your AWS account, use the following AWS CloudFormation template to populate the Data Catalog. If you are signed in to your AWS account, the following link fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get Started in the CloudFormation documentation. This template creates a covid-19 database in your Data Catalog and tables that point to the public AWS COVID-19 data lake.

For information on how to set up the definitions for that data in an AWS Glue Data Catalog and then query it with Amazon Athena, please read this blog post and follow the step-by-step instructions. Questions about the data lake? Please reach out to aws-covid-19-data-lake@amazon.com.
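As a rough illustration of querying the catalog once the stack exists, the sketch below builds a preview query against the covid-19 database and submits it to Amazon Athena with boto3. The helper names `preview_query` and `run_query` are hypothetical, and the S3 output location and region are values you would have to supply yourself; nothing here is taken from the blog post's own instructions.

```python
DATABASE = "covid-19"  # database name created by the CloudFormation template


def preview_query(table: str, limit: int = 10) -> str:
    """Build a SELECT that previews rows from one data lake table."""
    # Hypothetical safety check: only accept simple identifiers.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    # The hyphen in the database name requires double quotes in Athena SQL.
    return f'SELECT * FROM "{DATABASE}"."{table}" LIMIT {int(limit)}'


def run_query(sql: str, output_location: str, region: str) -> str:
    """Submit the query to Athena and return its execution ID.

    output_location is an S3 URI in your own account where Athena writes
    results; region is the region where you created the stack. Both are
    assumptions the caller must fill in.
    """
    import boto3  # imported here so the query builder works without boto3

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    print(preview_query("nytimes_states", limit=5))
```

Keep the pricing note above in mind: the data itself is free to read, but Athena bills for the data each query scans, so a small LIMIT on a preview does not by itself guarantee a small charge.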

Data catalog

The following tables outline the data hosted in the data lake.

Vaccine Allocations By US State

This data set provides information on vaccine allocations in the US by state.

| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| cdc_pfizer_vaccine_distribution | Data on distribution of Pfizer/BioNTech vaccine | CDC | Rearc | Daily |
| cdc_moderna_vaccine_distribution | Data on distribution of Moderna vaccine | CDC | Rearc | Daily |

World Vaccination Data

| Table Name | Description | Source/Provider | Updated |
| --- | --- | --- | --- |
| owid_world_vaccinations | This dataset includes data on COVID-19 vaccinations administered broken down by country. | Our World in Data | Daily |
| owid_us_state_vaccinations | This dataset includes data on COVID-19 vaccinations administered broken down by US State. | Our World in Data | Daily |
| owid_world_vaccinations_by_manufacturer | This dataset includes data on COVID-19 vaccinations administered broken down by country and manufacturer. | Our World in Data | Daily |

COVID-19 World Confirmed Cases, Deaths and Testing

| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| world_cases_deaths_testing | This dataset includes data on confirmed cases, deaths, and testing. | Several | Rearc | Daily |

US Coronavirus (COVID-19) Cases

This data set tracks confirmed cases and deaths in the US by state and county.

| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| nytimes_states | Data on COVID-19 cases at US state level | NY Times | Rearc | Daily |
| nytimes_counties | Data on COVID-19 cases at US county level | NY Times | Rearc | Daily |

Coronavirus Disease (COVID-19) Testing Data

This data set tracks the number of people tested, pending tests, and positive and negative tests for COVID-19.

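| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| covid_testing_states_daily | USA total test daily trend by state | COVID Tracking Project | Rearc | Daily |
| covid_testing_us_daily | USA total test daily trend | COVID Tracking Project | Rearc | Daily |
| covid_testing_us_total | USA total tests | COVID Tracking Project | Rearc | Daily |

USA Hospital Beds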

| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| hospital_beds | Data on hospital beds and their utilization in the USA | Definitive Healthcare | Rearc | Daily |

CORD19 Open Research Dataset Challenge

This is a collection of 45,000+ research articles (33,000+ with full text) about COVID-19, SARS-CoV-2, and related coronaviruses. We have preprocessed and enriched these with extracted annotations from Amazon Comprehend Medical. To learn more about Amazon Comprehend Medical, click here.

| Table Name | Description | Source/Provider | Updated |
| --- | --- | --- | --- |
| alleninstitute_metadata | Metadata on papers pulled from the CORD-19 research challenge dataset. The 'sha' column indicates the paper id which is the filename of the paper in the lake. | Allen Institute for AI | Weekly |
| alleninstitute_comprehend_medical | Amazon Comprehend Medical results run against CORD-19 research challenge data set. | AWS | Weekly |

COVIDcast (COVID-19) Epidemiological Data

Delphi's COVIDcast datasets are based on a variety of data sources including a CMU-run Facebook health survey, a Google-run health survey, lab test results provided by Quidel Inc, search data released by Google Health Trends, and outpatient doctor visits provided by a national health system.

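| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| covidcast_data | CMU Delphi's COVID-19 Surveillance data | Delphi Research Group (CMU) | Rearc | Daily |
| covidcast_metadata | CMU Delphi's COVID-19 Surveillance metadata | Delphi Research Group (CMU) | Rearc | Daily |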

Tableau COVID-19 Data Hub

| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| tableau_covid_datahub | This data set includes Coronavirus (COVID-19) data that has been gathered and unified from trusted sources including New York Times and the European CDC. | NYT, ECDC | Tableau | Daily |

COVID-19 Prediction Models Counties & Hospitals

Yu Group at UC Berkeley is working to help forecast the severity of the COVID-19 epidemic both for individual counties and individual hospitals.

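| Table Name | Description | Source | Provider | Updated |
| --- | --- | --- | --- | --- |
| prediction_models_severity_index | Severity Index models | Yu Group at UC Berkeley | Rearc | Daily |
| prediction_models_county_predictions | County-level Predictions Data | Yu Group at UC Berkeley | Rearc | Daily |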

CORD19 Knowledge Graph

This is a graph structured dataset that is created from the Allen Institute CORD-19 dataset. It contains sets of Nodes and Edges which constitute a graph network connecting paper metadata to itself, and to extracted annotations from Comprehend Medical.
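To make the node-and-edge structure concrete, here is a minimal sketch that turns edge rows into an adjacency map. It assumes each edge can be reduced to a (source ID, target ID) pair and treats connections as undirected; the actual column names and directionality in the edges table are not stated here, and the sample IDs below are made up.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each node ID to the set of node IDs it is connected to."""
    adjacency = defaultdict(set)
    for source_id, target_id in edges:
        # Undirected view (an assumption): record the link both ways.
        adjacency[source_id].add(target_id)
        adjacency[target_id].add(source_id)
    return dict(adjacency)


# Made-up sample rows; real IDs would come from the edges table.
sample_edges = [
    ("paper-1", "author-1"),
    ("paper-1", "concept-1"),
    ("paper-2", "author-1"),
]
graph = build_adjacency(sample_edges)
print(sorted(graph["author-1"]))  # papers linked to the same author
```

With a structure like this, finding papers that share an author or a Comprehend Medical concept becomes a matter of looking up one node and reading its neighbors.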

| Table Name | Description | Source/Provider | Updated |
| --- | --- | --- | --- |
| covid_knowledge_graph_nodes_concept | Nodes based on dynamically generated output from Comprehend Medical | AWS | Weekly |
| covid_knowledge_graph_nodes_institution | Institution Nodes | AWS | Weekly |
| covid_knowledge_graph_nodes_author | Author Nodes | AWS | Weekly |
| covid_knowledge_graph_nodes_paper | Paper Nodes | AWS | Weekly |
| covid_knowledge_graph_nodes_topic | Topic Nodes based on a custom ontology | AWS | Weekly |
| covid_knowledge_graph_edges | Edges connecting the various Nodes | AWS | Weekly |

Daily Global & U.S. COVID-19 Cases & Testing Data

Aggregation of COVID-19 data from Our World in Data, The New York Times, and the COVID Tracking Project.

| Table Name | Description | Source/Provider | Updated |
| --- | --- | --- | --- |
| enigma_aggregation_global | All geographies combined | Enigma | Daily |
| enigma_aggregation_global_countries | Country level only | Enigma | Daily |
| enigma_aggregation_us_states | US states only | Enigma | Daily |
| enigma_aggregation_us_counties | US counties only | Enigma | Daily |

AspireVC Clio Go Contact Tracing Data

| Table Name | Description | Source/Provider | Updated |
| --- | --- | --- | --- |
| aspirevc_crowd_tracing | Contact Tracing data from AspireVC | AspireVC | Daily |
| aspirevc_crowd_tracing_zipcode_3digits | Zip code to state lookup | AspireVC | Daily |

COVID-19 UK Data

| Table Name | Description | Source/Provider | Updated |
| --- | --- | --- | --- |
| uk_covid | Case and Testing Data for the United Kingdom | UK government | Daily |

Lookup tables to support visualizations

| Table Name | Description |
| --- | --- |
| country_codes | Lookup table for country codes |
| county_populations | Lookup table for population for each county based on recent census data |
| us_state_abbreviations | Lookup table for US state abbreviations |