Ten Big Open Data Stories from 2017
AWS customers did amazing work with open data in 2017. Here are some of the stories from the year, including releases of massive new datasets, tutorials on how to work with data in the cloud, and stories of how people are using AWS to put open data to work.
How the Nonprofit Open Data Collective Came Together to Work on IRS 990 Data in the Cloud: This is a story about how a community can emerge around a dataset hosted on AWS. In this case, the community rallied around IRS 990 data to create new tools and guidance on how to work with it. A single authoritative copy of the data is available in Amazon Simple Storage Service (Amazon S3), making it possible for the community to work quickly and collaboratively on the data.
“As we moved into the consortium stage, cloud hosting allowed us to distribute iterations of the cleaned data and to build open-source software that automatically pulls the latest version. The low cost and easy provisioning of AWS lets us maximize our scarce human capital by intensifying our use of compute resources,” said Lindsey Struck, Director of Program Development at Charity Navigator.
It’s encouraging to see communities like this emerge to make data easier to access and use.
#EarthonAWS: How NASA Is Using AWS at re:Invent 2017: Kevin Murphy, Program Executive of Earth Science Data Systems at NASA, spoke at re:Invent 2017 about how NASA is “preparing for a big-data future” with AWS. The presentation provides a detailed explanation of how teams at NASA are using the cloud to give scientists fast access to massive amounts of authoritative scientific data. NASA missions generate 80 TB of data per day and reprocess 400 TB per day. NASA is exploring ways to use machine learning to store data more efficiently and optimize its availability.
Brain Workshop Meets Cloud: The Allen Institute for Brain Science experimented with AWS at its annual Summer Workshop on the Dynamic Brain, providing students with 35 TB of data through Amazon S3 and using Amazon EC2 for data analysis.
“This approach was extraordinarily successful, enabling reliable and high-powered computation and collaborative projects. Students spent more time analyzing data and less time configuring their software toolchains. We deployed a JupyterHub cluster, which dynamically provisioned Docker-based instances that come preconfigured with a host of hard-to-configure dependencies. Rather than spending days setting up development environments, students could click a link and start working immediately,” said David Feng, Associate Director of Technology at the Allen Institute for Brain Science. “Additionally, the ease of retrieving large, custom compute configurations enabled new types of projects. Students tend to limit their analyses to what they can easily run on their laptops. This year, the base instance we provided them was more powerful than most of the laptops they brought. One participant wanted to play with deep neural networks, so we spawned a GPU instance to use with the necessary dependencies and data volumes preconfigured.”
Data.world Census Data Now Available as AWS Public Dataset: This year, we worked with the US Census Bureau and data.world to make the American Community Survey (ACS) Public Use Microdata Sample (PUMS) available as an AWS Public Dataset. The ACS is the largest and most up-to-date annual survey performed by the US Census Bureau, detailing information about the American people and housing units. It affects $400 billion in annual spending and impacts local officials, community leaders, and businesses who rely on the data to understand the changes taking place in their communities. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly. Learn how to access the data at the ACS PUMS on AWS Public Dataset landing page.
Accessing NOAA’s GOES-R Series Satellite Weather Imagery Data on AWS: We made NOAA’s GOES-R satellite data available this year. Following a successful launch of the satellite in November 2016, NOAA started releasing GOES-R data in August. It was available on Amazon S3 from day one. GOES satellites provide critical atmospheric, oceanic, climatic, and space weather products that support weather forecasting and warnings, climatologic analysis and predictions, ecosystems management, safe and efficient transportation, and other national priorities. The availability of GOES-R Series data on AWS is the result of the NOAA Big Data Project (BDP) to explore the potential benefits of storing copies of key observations and model outputs in the cloud, which allows computing directly on the data without requiring further distribution.
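To give a feel for what computing directly on the data can look like, here is a minimal Python sketch that builds the Amazon S3 location for one hour of GOES-R imagery. The bucket name `noaa-goes16`, the product code `ABI-L2-CMIPF`, and the `product/year/day-of-year/hour/` key layout are assumptions that are not stated in this post, so check the dataset's landing page before relying on them.

```python
from datetime import datetime, timezone

# Assumed values, not taken from this post: verify them against the
# dataset documentation before use.
ASSUMED_BUCKET = "noaa-goes16"
ASSUMED_PRODUCT = "ABI-L2-CMIPF"


def goes_hour_prefix(when: datetime, product: str = ASSUMED_PRODUCT) -> str:
    """Build the assumed key prefix <product>/<year>/<day-of-year>/<hour>/."""
    # Treat naive datetimes as UTC; convert aware datetimes to UTC.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    else:
        when = when.astimezone(timezone.utc)
    day_of_year = when.timetuple().tm_yday
    return f"{product}/{when.year}/{day_of_year:03d}/{when.hour:02d}/"


def goes_hour_uri(when: datetime,
                  product: str = ASSUMED_PRODUCT,
                  bucket: str = ASSUMED_BUCKET) -> str:
    """Return an s3:// URI for the hour of imagery containing `when`."""
    return f"s3://{bucket}/{goes_hour_prefix(when, product)}"


if __name__ == "__main__":
    print(goes_hour_uri(datetime(2017, 8, 1, 15, tzinfo=timezone.utc)))
```

A prefix like this can be passed to any S3 listing tool to find the files for that hour, so an analysis can read only the imagery it needs instead of downloading the archive.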
Complete Sentinel-2 Archives Freely Available to Users: To date, Sentinel-2 has already produced more than 1 PB of data. The Sentinel-2 satellites are part of Copernicus, an EU programme operated by the European Space Agency (ESA), and focus on land observation. They provide insight into our planet on a weekly basis, at up to 10-meter resolution globally. “We are amazed, observing what users are doing with Sentinel data. The options are limitless. Now, with almost two years of Sentinel-2 data available, and with consistent quality and availability, people are starting to realize the potential. And with the infrastructure and dedicated services to efficiently exploit it, I am confident we will see hundreds of applications helping farmers, disaster relief, environmental monitoring, and many others,” said Grega Milcinski, CEO of Sinergise.
How Cloud Can Take Open Data to New Heights: Jed Sundwall, AWS Global Open Data Lead, wrote an op-ed in Federal Computer Week this summer to explain the advantages of sharing data in the cloud. He said, “Once uploaded to the public cloud, the potential use cases for open data are virtually endless, and government agencies are just getting started. Governments around the world are investing billions in new sensors, ranging from Internet of Things devices in parking meters to Earth-observing satellites, which are producing huge volumes of data. The best way to get a return on these investments is to make it easy for innovators across the country to access the data and put it to work. With a modernized data distribution method and some imagination, open data can be unleashed to become a tremendous force for the public good.”
Analyze OpenFDA Data in R with Amazon S3 and Amazon Athena: The Food and Drug Administration (FDA) makes valuable and authoritative FDA data available for analysis on Amazon S3. One of the great benefits of Amazon S3 is the ability to host, share, or consume public datasets. This provides transparency into data to which an external data scientist or developer might not normally have access. A team of AWS data engineers wrote a guide on our Big Data blog on how to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.
USAspending.gov on an Amazon RDS Snapshot: In May, USAspending.gov started making its entire public database available for anyone to copy via Amazon Relational Database Service (Amazon RDS). Typically, sharing a relational database involves extract, transform, and load (ETL) processes that require redundant storage capacity, time for data transfer, and often scripts to migrate the database schema from one database engine to another. By making their data available as a public Amazon RDS snapshot, the team at USAspending.gov has made it easy for anyone to get a copy of their entire production database for their own use within minutes. This will be useful for researchers and businesses who want to work with real data about all US Government spending and quickly combine it with their own data or other data resources. Learn how to make your own copy of the database.
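For readers who prefer to script the copy, the following Python sketch shows roughly how a public snapshot could be restored into a new database instance with boto3. The snapshot ARN, instance name, instance class, and Region shown here are hypothetical placeholders, not values published by USAspending.gov; take the real snapshot identifier from their instructions.

```python
def build_restore_request(snapshot_arn: str,
                          new_instance_id: str,
                          instance_class: str = "db.t2.medium") -> dict:
    """Assemble parameters for restoring a shared or public RDS snapshot.

    The default instance class is an illustrative assumption; choose one
    that suits the size of the database.
    """
    # Snapshots owned by another account must be referenced by full ARN.
    if not snapshot_arn.startswith("arn:"):
        raise ValueError("reference a shared or public snapshot by its full ARN")
    return {
        "DBInstanceIdentifier": new_instance_id,
        "DBSnapshotIdentifier": snapshot_arn,
        "DBInstanceClass": instance_class,
        "PubliclyAccessible": False,  # keep the copy private by default
    }


def restore_public_snapshot(snapshot_arn: str,
                            new_instance_id: str,
                            region: str = "us-east-1") -> dict:
    """Start the restore. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    rds = boto3.client("rds", region_name=region)
    return rds.restore_db_instance_from_db_snapshot(
        **build_restore_request(snapshot_arn, new_instance_id)
    )


if __name__ == "__main__":
    # Hypothetical ARN for illustration only.
    example_arn = "arn:aws:rds:us-east-1:123456789012:snapshot:example-public-snapshot"
    print(build_restore_request(example_arn, "my-usaspending-copy"))
```

Because the restore creates a full database instance in your own account, no ETL step or schema migration is involved; you connect to the new instance once it becomes available.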
Querying OpenStreetMap with Amazon Athena: Soon after we launched Amazon Athena, Seth Fitzsimmons, member of the 2017 OpenStreetMap US board of directors, reached out to us to collaborate on a project to make it easy for anyone to use Amazon Athena to query OpenStreetMap. “OpenStreetMap (OSM) is a free, editable map of the world, created and maintained by volunteers and available for use under an open license. Companies and nonprofits like Mapbox, Foursquare, Mapzen, the World Bank, the American Red Cross and others use OSM to provide maps, directions, and geographic context to users around the world,” said Fitzsimmons. “In the 12 years of OSM’s existence, editors have created and modified several billion features (physical things on the ground like roads or buildings). The main PostgreSQL database that powers the OSM editing interface is now over 2TB and includes historical data going back to 2007. As new users join the open mapping community, more and more valuable data is being added to OpenStreetMap, requiring increasingly powerful tools, interfaces, and approaches to explore its vastness.” This post explains how anyone can use Amazon Athena to quickly query publicly available OSM data stored in Amazon S3 (updated weekly) as an AWS Public Dataset.
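As a rough illustration of the kind of query this enables, the Python sketch below builds a SQL statement that finds health facilities inside a bounding box and submits it to Amazon Athena with boto3. The table name `planet`, its columns (`id`, `type`, `tags`, `lat`, `lon`), the database name, and the results location are assumptions made for this example; the post linked above describes the actual table definition.

```python
import re

ASSUMED_TABLE = "planet"  # assumption: table created over the OSM data in S3


def build_amenity_query(amenities, west, south, east, north,
                        table: str = ASSUMED_TABLE) -> str:
    """Build a query for OSM nodes with the given amenity tags in a bounding box."""
    if not amenities:
        raise ValueError("at least one amenity value is required")
    # Accept only simple tag values so they can be safely inlined as literals.
    for value in amenities:
        if not re.fullmatch(r"[a-z_]+", value):
            raise ValueError(f"unexpected amenity value: {value!r}")
    quoted = ", ".join(f"'{value}'" for value in amenities)
    return (
        f"SELECT id, lat, lon, element_at(tags, 'name') AS name\n"
        f"FROM {table}\n"
        f"WHERE type = 'node'\n"
        f"  AND element_at(tags, 'amenity') IN ({quoted})\n"
        f"  AND lon BETWEEN {float(west)} AND {float(east)}\n"
        f"  AND lat BETWEEN {float(south)} AND {float(north)}"
    )


def submit_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return its execution ID.

    Requires boto3 and AWS credentials. `database` and `output_location`
    (an s3:// URI you own) are supplied by the caller.
    """
    import boto3  # imported lazily so the builder works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    print(build_amenity_query(["hospital", "clinic"], -15.1, 4.3, -7.3, 12.7))
```

Athena reads the data in place in Amazon S3, so a query like this needs no database server or import step.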
This is just a sample of the great work being done by AWS customers with open data. We can’t wait to see what our community does in 2018!