Improving our knowledge about the oceans by providing cloud-based access to large datasets
As a physical oceanographer focused on remote sensing, Dr. Chelle Gentemann, senior scientist at Farallon Institute, has worked for over 20 years on retrievals of ocean temperature from space. She uses measurements of sea surface temperature from satellites to understand how the ocean impacts our lives. Chelle’s work involves analyzing large volumes of data, which requires access to substantial data storage and computational resources. Although most large research institutions can secure those IT resources, that is not the case for smaller organizations or underserved communities around the world.
As part of the Amazon Sustainability Data Initiative, we invited Dr. Gentemann to share her perspective on the value of hosting high-resolution climate data on Amazon Web Services (AWS), and how that is making it faster, cheaper, and simpler to experiment and collaborate in ocean science.
Why is it important that we study the temperature of the oceans?
Oceans cover 71% of our world, and about 40% of the global population lives within 100 km of a coast. The ocean affects everything from weather, climate, and food security to recreational opportunities. Sea surface temperature (SST) is one of the key variables we can use to understand and predict ocean conditions and their impact on humans and the Earth.
What are the scientific and computational challenges in analyzing sea surface temperature data?
SST data is one of the longest climate data records from satellites, starting in 1981 with the launch of the polar-orbiting NOAA-7. There are many different satellites measuring SST, each with specific strengths and weaknesses. For example, some measure at very high spatial resolution but have gaps in data due to cloud cover. Others are able to estimate the temperature under the clouds but provide information at lower spatial resolutions. Some instruments are on polar-orbiting satellites and are useful for global studies, but may only see a specific location about once per day, while others are on geostationary satellites, which have limited geographic coverage but record data every 10 minutes over their regional view. While these products are varied, they have one thing in common: their volumes are too large for most users to store and analyze without large computational resources.
When SST data is provided on the cloud, these large datasets no longer need to be moved around. Scientists can bring their code to the data, which removes many of the barriers that prevent them from exploring the data and quickly testing their ideas. This opens the door to additional users, creating a more diverse and inclusive scientific community.
Cloud-based data access also helps us address another challenge: science’s reproducibility problem. When programs like the Amazon Sustainability Data Initiative (ASDI) stage foundational sustainability-related datasets on the cloud and make them available at no cost, it becomes easier for researchers to create cloud-based scientific analyses and make the analysis code available adjacent to the data. This enables anyone to test the reproducibility of scientific results, which is important for transparency and allows scientists to build on each other’s results and move the field forward.
How is having data staged on AWS impacting your work?
Having the data staged on AWS is affecting the way I do science in two ways: by making it easier for me to conduct data analysis and experimentation, and by enabling others to access my data and code. When analyzing a dataset staged on AWS, it is easy (and cost-effective) to start 100 processors and finish my analysis in a few minutes, rather than being restricted to the capacity of my own computer, which would need a day or two to complete the job.
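The speed-up described above comes from splitting one large job into independent pieces and combining the partial results. The following is a minimal sketch of that split-and-combine pattern, not the workflow actually used in this research. It uses a local thread pool from the Python standard library as a stand-in for many cloud processors, and the function names and sample temperatures are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum_count(values):
    """Return (sum, count) for one chunk, skipping missing values (None)."""
    valid = [v for v in values if v is not None]
    return sum(valid), len(valid)


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_mean(values, n_workers=4):
    """Mean of the valid values, computed chunk by chunk on a pool of workers."""
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(chunk_sum_count, chunks))
    # Combine step: add up the per-chunk sums and counts.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")


# Hypothetical SST samples in degrees Celsius; None marks a cloud-covered pixel.
sst = [14.0, 15.0, None, 16.0, 15.0, None, 14.0, 16.0]
mean_sst = parallel_mean(sst, n_workers=4)
```

Each worker only needs its own chunk, so the same structure carries over when the chunks are pieces of a cloud-hosted dataset and the workers are separate processors. Returning a sum and a count per chunk, rather than a per-chunk mean, keeps the combined result correct when chunks contain different numbers of valid values.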
Another benefit of the cloud is the ability to easily share my code and data with others. I have been working on satellite retrievals of ocean temperature for over 20 years and often receive requests from other scientists for help accessing and analyzing data. A dataset that I often analyze is the Multi-Scale Ultra High Resolution Sea Surface Temperature (MUR SST), which is now available as a public dataset on the Registry of Open Data on AWS. Because the data is in the cloud, I can now point to these resources, and others are able to run the analysis themselves and customize it as they need. To facilitate this process, I published a set of notebooks that demonstrate how to do simple analyses using the MUR SST dataset on AWS [https://github.com/pangeo-gallery/osm2020tutorial].
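As a rough illustration of what "bringing code to the data" can look like, here is a minimal sketch that opens a cloud-hosted Zarr store lazily with xarray and fsspec and averages SST over a region. It is not copied from the notebooks above. The store path `s3://mur-sst/zarr`, the variable name `analysed_sst`, the coordinate names `time`, `lat`, and `lon`, and the example region are all assumptions that should be checked against the dataset's page on the Registry of Open Data on AWS. The helper names are hypothetical.

```python
def bounding_box(lat_min, lat_max, lon_min, lon_max):
    """Build label-based selectors for a latitude/longitude box."""
    if lat_min > lat_max or lon_min > lon_max:
        raise ValueError("minimum must not exceed maximum")
    return {"lat": slice(lat_min, lat_max), "lon": slice(lon_min, lon_max)}


def regional_mean_sst(store_url, start, end, box):
    """Open a Zarr store lazily and return a simple area-mean SST time series.

    Requires xarray, fsspec, s3fs, and zarr, plus network access.
    Assumes the store has an 'analysed_sst' variable on ascending
    'lat' and 'lon' coordinates; the mean is not area-weighted.
    """
    import fsspec
    import xarray as xr

    # anon=True requests unauthenticated access to a public bucket.
    store = fsspec.get_mapper(store_url, anon=True)
    # Only metadata is read here; data is fetched when a result is computed.
    ds = xr.open_zarr(store, consolidated=True)
    sst = ds["analysed_sst"].sel(time=slice(start, end), **box)
    return sst.mean(dim=["lat", "lon"])


# Assumed values for illustration only; neither appears in the article.
MUR_SST_STORE = "s3://mur-sst/zarr"
california_coast = bounding_box(32.0, 42.0, -130.0, -117.0)
```

With the required packages installed, a call such as `regional_mean_sst(MUR_SST_STORE, "2019-01-01", "2019-01-31", california_coast).compute()` would download only the chunks that overlap the requested month and region rather than the full archive, provided the assumed names match the actual store.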
This paradigm shift is changing the way we do science. Our team created a cloud-based scientific tool for the North Pacific Marine Science Organization (PICES), which performs an ecosystem assessment every five years for 15 different regions and generates a report with the results. This report summarizes the status of the marine ecosystems in these regions and is used to understand how changes in the environment are affecting the whole ecosystem. In the past, this would have involved creating a web portal with some pre-determined analyses that scientists could access. Now, through a cloud-based tool, anyone can look at any of these regions, examine pre-determined analyses, and run other analyses they think are relevant.
How can AWS further enable science research?
The two largest barriers I encounter in my research are data availability and accessibility. Most researchers experience the 80/20 rule: 80 percent of our time is spent finding, downloading, reading, and processing data, which is redundant work that does not generate any new knowledge, and only 20 percent is spent on analyzing and interpreting the data and generating insight. My hope is that we find ways to flip this ratio, enabling us to spend the majority of our time doing science, communicating results, and sharing research in an open, reproducible manner.
Dr. Chelle Gentemann: Chelle is a passionate advocate for open science, open source software, and inclusivity. As a physical oceanographer focused on remote sensing, she has worked for over 25 years on retrievals of ocean temperature from space and on using that data to understand how the ocean impacts our lives. Her more recent research focuses on interdisciplinary science using cloud computing, open source software algorithm development, air-sea fluxes, biophysical interactions, and upper ocean physical processes. She has served on scientific committees, notably as co-chair of a standing committee for the National Academy of Sciences, and has presented to a federal house committee on NASA’s implementation of scientific community priorities.