A public data lake for COVID-19 research and development
The AWS COVID-19 data lake is a centralized repository of up-to-date and curated datasets focused on the spread and characteristics of the novel coronavirus (SARS-CoV-2). The data lake contains pre-processed, curated, publicly readable data that is ready for analysis by anyone, much of it sourced through AWS Data Exchange.
Hosted on the AWS cloud, this curated data lake contains useful data sets such as COVID-19 case tracking data from The New York Times, COVID-19 testing data from the COVID Tracking Project, hospital bed availability from Definitive Healthcare, health survey data from the Delphi Research Group, and research data from over 45,000 articles about COVID-19 and related coronaviruses from the Allen Institute for AI. As new versions of the datasets are published and other reliable sources become available, we will update the data lake.
We hope that organizations and individuals will use this data to help in the fight against COVID-19. For instance, local health authorities can build out their dashboards to efficiently deploy vital resources like hospital beds and ventilators as they track the spread of the disease. Epidemiologists can use it to complement their existing models and datasets and generate better forecasts of hotspots and trends, such as those related to testing availability and population size and density.
If you are interested in subscribing to this data in machine-readable form or contributing datasets for the public data lake, please visit our AWS Data Exchange page. You can also apply for funding for your diagnostic research project via the AWS Diagnostic Development Initiative (DDI).
Pricing
This data is hosted for free in Amazon S3. Normal charges to request data in S3 are disabled on the public data lake bucket, so you incur no cost there. However, you will still incur standard charges for the services that you use to analyze the data lake, like Amazon Athena.
Getting started
To make the data from the AWS COVID-19 data lake available in your AWS account, use the following AWS CloudFormation template to populate the Data Catalog. If you are signed in to your AWS account, the following link fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get Started in the CloudFormation documentation. This template creates a covid-19 database in your Data Catalog and tables that point to the public AWS COVID-19 data lake.
For information on how to set up the definitions for that data in an AWS Glue Data Catalog and then query it with Amazon Athena, please read this blog post and follow the step-by-step instructions. Questions about the data lake? Please reach out to aws-covid-19-data-lake@amazon.com.
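As a rough illustration of querying the catalog once the stack exists, the sketch below builds a SQL statement against the `covid-19` database and, optionally, submits it to Athena through boto3. The column names (`date`, `state`, `cases`, `deaths`), the helper names, and the S3 output location are assumptions made for this example; check the table definitions in your own Data Catalog before relying on them.

```python
import time


def build_state_cases_query(state: str, limit: int = 10) -> str:
    """Return SQL for the most recent rows of one US state (assumed columns)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Double any single quote so the state name is a safe SQL string literal.
    safe_state = state.replace("'", "''")
    # The database name contains a hyphen, so it is wrapped in double quotes.
    return (
        'SELECT "date", state, cases, deaths '
        'FROM "covid-19"."nytimes_states" '
        f"WHERE state = '{safe_state}' "
        'ORDER BY "date" DESC '
        f"LIMIT {int(limit)}"
    )


def run_athena_query(sql: str, output_location: str, poll_seconds: float = 1.0):
    """Submit a query to Athena and return the first page of results as lists."""
    import boto3  # imported lazily so the query builder works without the AWS SDK

    athena = boto3.client("athena")
    started = athena.start_query_execution(
        QueryString=sql,
        # output_location is an S3 URI in your own account (hypothetical here).
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    results = athena.get_query_results(QueryExecutionId=query_id)
    # For SELECT statements the first returned row holds the column headers.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


sql = build_state_cases_query("Washington", limit=5)
print(sql)
```

Calling `run_athena_query(sql, "s3://your-results-bucket/prefix/")` would then execute the statement; the bucket name is a placeholder, and Athena charges for the data the query scans.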
Data catalog
The following tables outline the data hosted in the data lake.
This data set provides information on vaccine allocations in the US by state.
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
cdc_pfizer_vaccine_distribution | Data on distribution of Pfizer/BioNTech vaccine | CDC | Rearc | Daily |
cdc_moderna_vaccine_distribution | Data on distribution of Moderna vaccine | CDC | Rearc | Daily |
Table Name | Description | Source/Provider | Updated |
---|---|---|---|
owid_world_vaccinations | This dataset includes data on COVID-19 vaccinations administered broken down by country. | Our World in Data | Daily |
owid_us_state_vaccinations | This dataset includes data on COVID-19 vaccinations administered broken down by US State. | Our World in Data | Daily |
owid_world_vaccinations_by_manufacturer | This dataset includes data on COVID-19 vaccinations administered broken down by country and manufacturer. | Our World in Data | Daily |
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
world_cases_deaths_testing | This dataset includes data on confirmed cases, deaths, and testing. | Several | Rearc | Daily |
This data set tracks confirmed cases and deaths in the US by state and county.
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
nytimes_states | Data on COVID-19 cases at US state level | NY Times | Rearc | Daily |
nytimes_counties | Data on COVID-19 cases at US county level | NY Times | Rearc | Daily |
This data set tracks the number of people tested, pending tests, and positive and negative tests for COVID-19.
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
covid_testing_states_daily | USA total test daily trend by state | COVID Tracking Project | Rearc | Daily |
covid_testing_us_daily | USA total test daily trend | COVID Tracking Project | Rearc | Daily |
covid_testing_us_total | USA total tests | COVID Tracking Project | Rearc | Daily |
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
hospital_beds | Data on hospital beds and their utilization in the USA | Definitive Healthcare | Rearc | Daily |
This is a collection of 45,000+ research articles (33,000+ with full text) about COVID-19, SARS-CoV-2, and related coronaviruses. We have preprocessed and enriched these with extracted annotations from Amazon Comprehend Medical. To learn more about Amazon Comprehend Medical, click here.
Table Name | Description | Source/Provider | Updated |
---|---|---|---|
alleninstitute_metadata | Metadata on papers pulled from the CORD-19 research challenge dataset. The 'sha' column indicates the paper id, which is the filename of the paper in the lake. | Allen Institute for AI | Weekly |
alleninstitute_comprehend_medical | Amazon Comprehend Medical results run against the CORD-19 research challenge dataset. | AWS | Weekly |
Delphi's COVIDcast datasets are based on a variety of data sources including a CMU-run Facebook health survey, a Google-run health survey, lab test results provided by Quidel Inc, search data released by Google Health Trends, and outpatient doctor visits provided by a national health system.
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
covidcast_data | CMU Delphi's COVID-19 Surveillance data | Delphi Research Group (CMU) | Rearc | Daily |
covidcast_metadata | CMU Delphi's COVID-19 Surveillance metadata | Delphi Research Group (CMU) | Rearc | Daily |
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
tableau_covid_datahub | This data set includes Coronavirus (COVID-19) data that has been gathered and unified from trusted sources including New York Times and the European CDC. | NYT, ECDC | Tableau | Daily |
Yu Group at UC Berkeley is working to help forecast the severity of the COVID-19 epidemic both for individual counties and individual hospitals.
Table Name | Description | Source | Provider | Updated |
---|---|---|---|---|
prediction_models_severity_index | Severity Index models | Yu Group at UC Berkeley | Rearc | Daily |
prediction_models_county_predictions | County-level Predictions Data | Yu Group at UC Berkeley | Rearc | Daily |
This is a graph-structured dataset created from the Allen Institute CORD-19 dataset. It contains sets of nodes and edges that form a graph connecting paper metadata to itself and to annotations extracted by Comprehend Medical.
Table Name | Description | Source/Provider | Updated |
---|---|---|---|
covid_knowledge_graph_nodes_concept | Nodes based on dynamically generated output from Comprehend Medical | AWS | Weekly |
covid_knowledge_graph_nodes_institution | Institution Nodes | AWS | Weekly |
covid_knowledge_graph_nodes_author | Author Nodes | AWS | Weekly |
covid_knowledge_graph_nodes_paper | Paper Nodes | AWS | Weekly |
covid_knowledge_graph_nodes_topic | Topic Nodes based on a custom ontology | AWS | Weekly |
covid_knowledge_graph_edges | Edges connecting the various Nodes | AWS | Weekly |
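To give a feel for how separate node and edge tables combine into one graph, the following sketch builds an undirected adjacency map from a few made-up rows. The field names (`id`, `label`, `source`, `target`) and the sample values are hypothetical and do not reflect the actual schema of the knowledge graph tables.

```python
from collections import defaultdict

# Made-up sample rows; the real tables have their own columns and values.
nodes = [
    {"id": "p1", "label": "Paper"},
    {"id": "a1", "label": "Author"},
    {"id": "c1", "label": "Concept"},
]
edges = [
    {"source": "p1", "target": "a1"},
    {"source": "p1", "target": "c1"},
]


def build_adjacency(nodes, edges):
    """Map each node id to a sorted list of its neighbours."""
    known_ids = {node["id"] for node in nodes}
    adjacency = defaultdict(set)
    for edge in edges:
        src, dst = edge["source"], edge["target"]
        # Ignore edges that reference nodes missing from the node rows.
        if src in known_ids and dst in known_ids:
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    return {node_id: sorted(neighbours) for node_id, neighbours in adjacency.items()}


graph = build_adjacency(nodes, edges)
print(graph["p1"])  # neighbours of the sample paper node
```

Treating edges as undirected is a simplification for the example; a directed graph would only add `dst` to `adjacency[src]`.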
An aggregation of COVID-19 data from Our World in Data, The New York Times, and the COVID Tracking Project.
Table Name | Description | Source/Provider | Updated |
---|---|---|---|
enigma_aggregation_global | All geographies combined | Enigma | Daily |
enigma_aggregation_global_countries | Country level only | Enigma | Daily |
enigma_aggregation_us_states | US states only | Enigma | Daily |
enigma_aggregation_us_counties | US counties only | Enigma | Daily |
Table Name | Description | Source/Provider | Updated |
---|---|---|---|
aspirevc_crowd_tracing | Contact Tracing data from AspireVC | AspireVC | Daily |
aspirevc_crowd_tracing_zipcode_3digits | Zip code to state lookup | AspireVC | Daily |
Table Name | Description | Source/Provider | Updated |
---|---|---|---|
uk_covid | Case and Testing Data for the United Kingdom | UK government | Daily |
Table Name | Description |
---|---|
country_codes | Lookup table for country codes |
county_populations | Lookup table for population for each county based on recent census data |
us_state_abbreviations | Lookup table for US state abbreviations |
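One way to use a lookup table such as county_populations is to join it with case counts and normalize by population. The sketch below does this in plain Python on made-up rows; the key and field names (`fips`, `cases`, `population`) and all values are assumptions for illustration, not the actual table schema or real data.

```python
def cases_per_100k(case_rows, population_rows):
    """Join case counts to populations on a shared key and return rates per 100,000."""
    population_by_fips = {row["fips"]: row["population"] for row in population_rows}
    rates = {}
    for row in case_rows:
        population = population_by_fips.get(row["fips"])
        # Skip counties with no population entry (or a zero population).
        if not population:
            continue
        rates[row["fips"]] = round(row["cases"] * 100_000 / population, 1)
    return rates


# Made-up sample rows with placeholder county codes.
case_rows = [
    {"fips": "00001", "cases": 250},
    {"fips": "00002", "cases": 40},
    {"fips": "00003", "cases": 5},  # no matching population row, so it is skipped
]
population_rows = [
    {"fips": "00001", "population": 500_000},
    {"fips": "00002", "population": 20_000},
]

rates = cases_per_100k(case_rows, population_rows)
print(rates)
```

The same join can be expressed in SQL in Athena once the real column names are known; the Python version just keeps the example self-contained.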