AWS Public Sector Blog

Earth Science Information Partners: Promoting innovation for Earth science data

The Earth Science Information Partners (ESIP) is a US-based nonprofit organization funded by NASA, NOAA, and the USGS. ESIP is playing a critical role in facilitating collaborative efforts to improve the collection, stewardship, and use of Earth science data and information.

As part of the Amazon Sustainability Data Initiative, we invited Dr. Annie Burgess, ESIP Lab Director, to share the story of how ESIP is advancing knowledge of Earth-system science.


The Challenge of Managing and Extracting Insights from Large Data Portfolios

Our scientific understanding of the Earth is based on a combination of observational and numerical model data. Over the past twenty years, data volume has increased dramatically, primarily driven by the increase in satellite missions and complexity of numerical models, which now produce petabyte-scale datasets. Such large data volumes present new challenges and opportunities around scientific analysis and highlight the need for cloud-optimal data stewardship practices if the datasets are to be fully utilized. This is where community organizations like the Earth Science Information Partners (ESIP) can play a critical role.

Many of ESIP’s partner organizations are charged with the stewardship of Earth science data across the public (e.g. NASA, NOAA, USGS), private (e.g. Esri), and academic (e.g. IEDA) sectors and are asked to address the challenges associated with the acquisition, storage, analysis, and distribution of larger and larger data portfolios. Some data providers are hosting datasets on the cloud and giving scientists a preview of how they could analyze terabyte and petabyte-scale data with cloud-native services.

But not all researchers, developers, and data providers have the resources or expertise to develop and deploy new data systems in the cloud. This is the problem ESIP’s Lab is working to solve. Through grant funding, AWS Cloud Credits for Research, and community input, the ESIP Lab supports projects working towards the adoption of community-accepted best practices in scientific data management and analysis, and experimentation with emerging technology like machine learning (ML).

Below are two examples of projects funded through the ESIP Lab and supported by the AWS Cloud Credits for Research Program.

Recipes for data quality of streaming Earth science data

Natural events (e.g. tornadoes, tsunamis, and volcanic eruptions) and human-caused hazards (e.g. oil spills and climate change) threaten lives and livelihoods. To track these events, geoscientists often apply Internet of Things (IoT) technologies, where small, inexpensive sensors stream data to the Internet in real time for use by scientists, policy makers, and practitioners. Ensuring that these data are high quality and actionable in real time is critical. The ESIP Lab funded the Real-time Sensor Testbed for Improved Provenance and Data Quality project, which focused on the metadata and provenance of sensor data and on identifying data quality issues in cloud-hosted, real-time data streams.

The team used Amazon Elastic Compute Cloud (Amazon EC2) instances to host the NSF-sponsored Cloud-Hosted Real-time Data Services for the Geosciences (CHORDS) service, which they connected to various data streams, including several coming from the Nevada Climate Change Portal (NEVCAN). The team relied on the scalability of EC2 instances to handle added data streams and increases in sampling rate throughout the project. The team created data quality recipes in JSON-LD markup and a simulated user interface to show how data quality could be assessed through CHORDS in real time. After quality assurance and quality control (QA/QC), the data were written to an InfluxDB instance, also running on the EC2 server. The data were then visualized using Grafana (also on the EC2 server) and through a static web map (Leaflet.js) served from Amazon Simple Storage Service (Amazon S3), making the quality-controlled data available to users.
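As a rough illustration of what a data quality recipe applied to a stream could look like, the Python sketch below parses a small JSON document and flags each incoming reading. It is not the project's code. The recipe fields (`variable`, `min`, `max`), the `RangeCheck` type, the variable name, and the limits are all hypothetical, and a real JSON-LD recipe would tie its terms to a shared vocabulary through an `@context`.

```python
import json

# Hypothetical recipe: the field names, the "RangeCheck" type, and the limits
# are illustrative assumptions, not the project's actual JSON-LD vocabulary.
RECIPE_JSON = """
{
  "@type": "RangeCheck",
  "variable": "air_temperature",
  "unit": "degC",
  "min": -40.0,
  "max": 50.0
}
"""


def load_recipe(text):
    """Parse a recipe document into a plain dictionary."""
    return json.loads(text)


def apply_range_check(recipe, reading):
    """Return a copy of the reading with a 'qc_flag' of pass, fail, or missing."""
    value = reading.get(recipe["variable"])
    if value is None:
        flag = "missing"
    elif recipe["min"] <= value <= recipe["max"]:
        flag = "pass"
    else:
        flag = "fail"
    return {**reading, "qc_flag": flag}


def qc_stream(recipe, readings):
    """Lazily flag each reading as it arrives from an iterable stream."""
    for reading in readings:
        yield apply_range_check(recipe, reading)


if __name__ == "__main__":
    recipe = load_recipe(RECIPE_JSON)
    stream = [
        {"time": "2019-01-01T00:00:00Z", "air_temperature": 3.2},
        {"time": "2019-01-01T00:01:00Z", "air_temperature": 999.0},
        {"time": "2019-01-01T00:02:00Z"},
    ]
    for flagged in qc_stream(recipe, stream):
        print(flagged["time"], flagged["qc_flag"])
```

Keeping the check in a declarative recipe rather than in code means the same rule can travel with the data as metadata, and flagged readings can be stored alongside the raw values instead of being silently dropped.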

This project created a blueprint for researchers to ingest real-time data streams into their scientific workflows and quality-check them along the way. Researchers’ ability to implement appropriate metadata and quality control measures for real-time data streams is critical.

Improving snow-covered area mapping using machine learning

The recent increase in commercial Earth imagery with high spatiotemporal resolution has the potential to bridge the gap between ground-based point measurements and coarsely captured satellite data. This is important in fields like snow hydrology, where high-resolution measurements of snow-covered area are critical for accurate prediction of water availability. The ESIP Lab funded the Developing Workflows for Assessing High-Resolution CubeSat Imagery to Infer Detailed Snow-Covered Areas project, which explored the use of ML and Planet imagery to classify snow-covered areas at meter-scale spatial resolution and high temporal frequency.

Data ingest and preprocessing for the Developing Workflows for Assessing High-Resolution CubeSat Imagery to Infer Detailed Snow-Covered Areas project. Figure by: Anthony Cannistra.

The team used PlanetScope 4-band orthoimagery with NASA’s Airborne Snow Observatory LiDAR-derived snow depth data to generate snow masks across a California watershed. For processing the satellite imagery, the team used a cloud-based JupyterHub instance that integrated parallel computation through Dask, following the approach of the NSF-funded Pangeo project (http://pangeo.io). This specific setup of Pangeo was built using Amazon Elastic Kubernetes Service (Amazon EKS) for cluster and container management and Amazon Elastic File System (Amazon EFS) for storage.

Processing multiple high-spatial-resolution datasets in an ML framework requires resources exceeding common desktop computing capabilities. The team used the scalability of the cloud by provisioning Amazon EC2 instances (i.e. worker nodes) for computational workloads, GPUs for the ML pipeline, and Amazon S3 for storage. An Amazon Machine Image (AMI) will be preserved to ensure reproducibility.
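To make the snow mask idea and the split-the-work pattern concrete, here is a minimal Python sketch that thresholds a small grid of snow depths into a binary mask, processing strips of the grid in parallel. It is an illustration under stated assumptions, not the project's workflow: the 0.1 m threshold is an assumed value, the helper names are hypothetical, and the standard library's `concurrent.futures` stands in for Dask so that the example runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed threshold in meters; the value the project used is not stated here.
SNOW_DEPTH_THRESHOLD_M = 0.1


def snow_mask(depth_tile, threshold=SNOW_DEPTH_THRESHOLD_M):
    """Turn a 2-D tile of snow depths into a binary mask (1 = snow, 0 = no snow).

    None marks a no-data cell and stays None in the mask.
    """
    return [
        [None if depth is None else int(depth > threshold) for depth in row]
        for row in depth_tile
    ]


def split_rows(raster, rows_per_tile):
    """Split a raster (a list of rows) into horizontal strips."""
    return [raster[i:i + rows_per_tile] for i in range(0, len(raster), rows_per_tile)]


def mask_raster(raster, rows_per_tile=2, max_workers=4):
    """Mask each strip in a worker thread and stitch the results back in order."""
    tiles = split_rows(raster, rows_per_tile)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        masked_tiles = list(pool.map(snow_mask, tiles))
    return [row for tile in masked_tiles for row in tile]


def snow_covered_fraction(mask):
    """Fraction of valid (non-None) cells that are flagged as snow."""
    valid = [cell for row in mask for cell in row if cell is not None]
    return sum(valid) / len(valid) if valid else 0.0


if __name__ == "__main__":
    # Toy 3x3 grid of snow depths in meters, with one no-data cell.
    depths = [
        [0.0, 0.05, 0.5],
        [1.2, None, 0.0],
        [0.3, 0.3, 0.3],
    ]
    mask = mask_raster(depths)
    print(mask)
    print(snow_covered_fraction(mask))
```

Threads give no real speedup for pure-Python arithmetic like this; the point is only the shape of the computation. Each strip is independent, which is the property that lets a scheduler such as Dask spread the same work across many worker nodes.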

Water managers, climate scientists, and ecologists could benefit from improved snow-covered area data. The pre-trained model created in this project will be publicly available for future use (project GitHub repository: https://github.com/acannistra/planet-snowcover).


The ESIP Lab’s main objective is to be a multiplier and advocate for the individuals and ideas that advance our knowledge of Earth-system science. If you have an idea you would like to develop through our ESIP Lab program, contact us at lab@esipfed.org.