Embracing the cloud for climate research
A guest post by Jessica Matthews, Jared Rennie, and Tom Maycock of North Carolina State University’s North Carolina Institute for Climate Studies
Scientists at NC State University’s North Carolina Institute for Climate Studies (NCICS) work with large datasets and complex computational analysis. Traditionally, they did their work using on-premises computational resources. As different projects were stretching the limits of those systems, NCICS decided to explore cloud computing.
As part of the Amazon Sustainability Data Initiative, we invited Jessica Matthews, Jared Rennie, and Tom Maycock to share what they learned from using AWS for climate research.
As they considered exploring the cloud to support their work, the idea of leaving the comfort of the local environment was a bit scary. And they had questions: How much will it cost? What does it take to deploy processing to the cloud? Will it be faster? Will the results match what they were getting with their own systems? Here is their story and what they learned.
From satellites to weather stations
One of our projects is part of an international collaboration to produce a dataset of global land surface albedo (a measurement of how effectively the Earth’s surface reflects incoming solar radiation) by stitching together information from multiple satellites. Our task involves converting file formats and running a complex algorithm that transforms raw images from NOAA’s GOES satellites into albedo data. With more than 270 terabytes of data stored in more than 12 million files, we estimated that the full analysis would take over 1,000 hours – roughly 42 days – on our Institute’s compute cluster. Because of the scale of the project, we experimented with a small subset of data (10 days of information from two satellites) to evaluate the cost and time required to do the analysis in the cloud.
Another project involves NOAA’s Global Historical Climatology Network–Monthly (GHCN-M) temperature dataset. Turning raw daily observations from tens of thousands of weather stations around the world into homogeneous, long-term temperature datasets requires applying an algorithm that detects and adjusts for non-climatic shifts in station data over time due to changes in station location, instrumentation type, and observing practices. While the dataset is relatively small, the algorithm is computationally and memory intensive. We also run the algorithm as an ensemble of more than 100 members to produce quantified estimates of uncertainty. Running just one ensemble member on-premises currently takes more than a day, which is problematic given that new data arrives every day. It can take several months to run the entire 100-member suite.
Getting data to the cloud
The first step was getting the data uploaded to AWS. Fortunately, our IT staff already had experience with pushing NOAA data to the cloud as part of our support for NOAA’s Big Data Project. We also had Dr. Brad Rubin (University of St. Thomas) visiting us on sabbatical and he was instrumental in helping us understand the cloud.
Dealing with the temperature data was straightforward: we simply transferred the files directly to Amazon Simple Storage Service (Amazon S3), which we treated as a secure FTP site with essentially unlimited capacity. We found the Amazon S3 costs to be acceptably low. Prices vary by region and by type of access request, and careful implementation of the transfer process can pay dividends.
The albedo data was more challenging because of limitations on how quickly files can be extracted from NOAA’s CLASS archive. Copying the entire dataset to the cloud would take more than a year. Storing all this data on Amazon S3 would be relatively expensive, but AWS offers other storage options at much lower prices. We chose Amazon S3 Glacier for our small subset of data and only copied the data to S3 for the time required to run the analysis. For the full project, we would use the new S3 Glacier Deep Archive service as an even more affordable option for staging the data from CLASS.
For both analyses, we used the Amazon Elastic Compute Cloud (Amazon EC2) service. With EC2, you only pay for the time you use. Prices vary depending on how much computing power you want to employ, so we ran some test jobs to help us balance costs per hour versus total run time.
Although our two tasks involved different kinds of data and computations, we were able to use a common architectural approach for both projects:
- We used Docker containers to package everything needed to run our jobs, including the operating system and required software tools. This allowed us to test our code locally and then recreate identical environments in the cloud.
- We used AWS Batch to automatically distribute code instances across multiple virtual machines. Array Jobs allowed us to leverage the scalable parallelism available in the cloud by submitting a specified number of workloads with a single query.
- We fed the jobs using the Amazon Simple Queue Service (SQS), controlled by a Python driver.
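The driver side of this architecture can be sketched roughly as follows. This is a minimal illustration, not the authors' actual driver: the function names, the `item-N` message IDs, and the JSON message layout are all assumptions. The two clients are passed in so the sketch has no dependencies; in practice they would presumably be `boto3.client("sqs")` and `boto3.client("batch")`, whose `send_message_batch` and `submit_job` calls accept the keyword arguments built here.

```python
import json
from typing import Any, Dict, Iterable, List

SQS_MAX_BATCH = 10  # SQS SendMessageBatch accepts at most 10 entries per call


def build_sqs_batches(work_items: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Turn work items (e.g. input file keys) into groups of SQS batch entries."""
    entries = [
        # Hypothetical message layout: one JSON document per input item.
        {"Id": f"item-{i}", "MessageBody": json.dumps({"input": item})}
        for i, item in enumerate(work_items)
    ]
    return [entries[i:i + SQS_MAX_BATCH] for i in range(0, len(entries), SQS_MAX_BATCH)]


def build_array_job_request(job_name: str, job_queue: str,
                            job_definition: str, n_children: int) -> Dict[str, Any]:
    """Build keyword arguments for an AWS Batch SubmitJob call that starts an array job."""
    if not 2 <= n_children <= 10000:
        raise ValueError("AWS Batch array size must be between 2 and 10,000")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": n_children},  # one child job per worker
    }


def run_driver(sqs_client, batch_client, queue_url: str,
               work_items: Iterable[str], **job_args: Any) -> Any:
    """Fill the queue, then launch the array job whose children drain it."""
    for entries in build_sqs_batches(work_items):
        sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
    return batch_client.submit_job(**build_array_job_request(**job_args))
```

Separating the queue from the array job means the number of workers can be tuned independently of the number of input files: each child container simply pulls messages until the queue is empty.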
After researching and tuning AWS settings, we ran our pilot albedo project on both AWS and our own system and verified that we were getting identical results on both platforms. Extrapolating from the test project to the full dataset, we determined that we could complete the entire albedo processing task about 50 times faster on AWS than on our own system, and at less than one-sixth the cost (Table 1).
| |Amazon Web Services|NCICS Compute Cluster|
|---|---|---|
|Processing time (20 years of daily observations from GOES satellites)|About 20 hours*|About 1,000 hours|
|Total cost|About $13,000**|About $84,000|
*Includes about 12 hours to pull data from Glacier to S3 and about 8 hours for processing
**Assumes about $5,000 for staging the data in Glacier Deep Archive during the lengthy process of copying data from NOAA CLASS and about $8,000 for approximately 20 hours of EC2 processing and temporary storage in S3.
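The "about 50 times faster" and "less than one-sixth the cost" figures follow directly from the table and its footnotes. A quick check, using only the approximate numbers reported there (the variable names are ours):

```python
# Approximate figures from Table 1 and its footnotes.
aws_hours = 12 + 8          # Glacier-to-S3 retrieval + processing
local_hours = 1000          # NCICS compute cluster
aws_cost = 5000 + 8000      # Deep Archive staging + EC2 and temporary S3 (USD)
local_cost = 84000          # NCICS compute cluster (USD)

speedup = local_hours / aws_hours
cost_ratio = aws_cost / local_cost

print(f"Speedup: about {speedup:.0f}x")
print(f"AWS cost as a share of local cost: {cost_ratio:.1%}")
print(f"Less than one-sixth? {cost_ratio < 1 / 6}")
```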
For the temperature dataset, we used a memory-optimized EC2 instance to run the code, selecting an option with 4 CPUs and 30.5 GB of memory. The on-demand price would have been $0.266 per hour, but we decided to try spot pricing instead. We requested and got a spot price equal to 25% of the on-demand rate ($0.0665 per hour). The lesson here is that if you can afford to wait a while to run your job, spot pricing can save you money. We put CentOS, awk, bash, Fortran, and our own Fortran code into a Docker container and ran 15 jobs at a time in parallel. Each job took roughly 18 hours, and the entire 100-member ensemble suite took about 6.5 days, compared to more than a month on our local system. The cost was about $142 for computing time (versus more than $500 with on-demand pricing) and about $18 per month for storage.
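The spot-pricing numbers above can be cross-checked with a few lines of arithmetic. This sketch uses only the rates and the approximate $142 total quoted in the text; the instance-hour count is inferred from them rather than reported by the authors, so treat it as a back-of-the-envelope estimate.

```python
ON_DEMAND_RATE = 0.266     # USD per instance-hour, from the text
SPOT_FRACTION = 0.25       # spot price as a fraction of on-demand, from the text
SPOT_COMPUTE_COST = 142.0  # USD, approximate compute total from the text


def spot_rate(on_demand_rate: float, fraction: float) -> float:
    """Hourly spot price given the on-demand rate and the requested fraction."""
    return on_demand_rate * fraction


def implied_instance_hours(total_cost: float, hourly_rate: float) -> float:
    """Instance-hours implied by a total bill at a fixed hourly rate."""
    return total_cost / hourly_rate


rate = spot_rate(ON_DEMAND_RATE, SPOT_FRACTION)
hours = implied_instance_hours(SPOT_COMPUTE_COST, rate)
on_demand_cost = hours * ON_DEMAND_RATE

print(f"Spot rate: ${rate:.4f} per hour")
print(f"Implied usage: about {hours:,.0f} instance-hours")
print(f"Same usage at on-demand pricing: about ${on_demand_cost:,.0f}")
```

The implied usage works out to roughly 2,135 instance-hours, which at the on-demand rate would cost about $568, consistent with the "more than $500" figure.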
We found that compared to a full-cost accounting of our current infrastructure, using AWS was much cheaper, and, with some guidance, our learning curve was relatively smooth and manageable. And the cloud can definitely be faster as there is almost no limit to the amount of parallel processing you can deploy, funds permitting. It was encouraging to see that using Docker containers led to consistent results across platforms.
We also learned that:
- Mistakes are cheap: With relatively low storage costs, you can run small test cases cheaply and easily. The major cloud providers have free options you can use to experiment.
- Costs are lower and more transparent: Depending on the costs of your local computing environment, you could save a significant amount by moving to the cloud. The cloud costs are generally more transparent as well.
- Risks are low: The pay-as-you-go model means there is little risk in exploring options. The risk of not trying cloud computing is that you could end up behind the technology curve.
As the saying goes, your mileage may vary, but the cloud environment worked well for us and we would encourage others to give it a try!