How to Arm a world-leading forecast model with AWS Graviton and Lambda
This post was contributed by Jake Hendy, Technical Lead, Surface Observations Collection at the UK Met Office.
The Met Office is the UK’s National Meteorological Service, providing 24×7 world-renowned scientific excellence in weather, climate and environmental forecasts and severe weather warnings for the protection of life and property. We provide forecasts and guidance to the public, to our government and defence colleagues, and to the private sector. As an example, if you’ve been on a plane over Europe, the Middle East, or Africa, that plane took off because the Met Office (as one of two World Area Forecast Centres) provided a forecast.
When the Met Office talks about High Performance Computing in the office, it’s usually with reference to our supercomputers, or our dense compute estates. Their focus is on numerical weather prediction or climate research. Yet there’s plenty more that goes on that qualifies, like the work we do in observations.
The Met Office collects a wide range of observations, from lightning strikes around the world to passing aircraft transmitting Mode-S data we collect to measure wind, all the way through to rain radar. These observations are then used in the creation of other products like our forecast models or become products of their own.
Today I want to explain one of the ways we use AWS to collect these observations, which has freed us to focus more on top quality delivery for our customers.
I lead the SurfaceNet service, which is our next-generation product for collecting and processing surface-based observations. SurfaceNet handles nearly 400 stations worldwide which collect near-surface parameters. These stations might be at our headquarters in Exeter, or on top of a mountain. They may be a moored buoy off the coast of the British Isles, or a ship sailing around the world.
These stations are all special. Marine stations are hardened and low-power, using a satellite network to email their observations in. Some land stations send data using a 4G modem, while others use a fixed line. Some are battery and solar-powered, others have a connection to the national grid. They can have one, two, or even three loggers, each with its own sensor array attached. Some measure only air temperature and pressure. Others measure the temperature of soil one metre underground, of concrete, of grass, and of the air, or some mix of these. Some loggers are a kilometre apart. Some land stations are so remote, they’re actually marine packages complete with satellite comms.
In all, SurfaceNet handles over 1 billion observations a year. But before these observations are disseminated, they go through several processing steps.
One step is calibration. For example, temperature is sometimes measured using thermistors. Before a sensor is deployed at a station, it’s calibrated in our lab. We then take the measured resistance, and (knowing how a sensor behaves in the lab) we improve the accuracy when we convert it to Celsius using a standard formula.
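The post does not name the formula used. As a minimal sketch, assuming the Steinhart-Hart equation (a common choice for thermistors) and hypothetical per-sensor coefficients obtained from lab calibration, the conversion from resistance to Celsius could look like this:

```python
import math


def thermistor_celsius(resistance_ohms, a, b, c):
    """Convert a measured thermistor resistance to degrees Celsius.

    Uses the Steinhart-Hart equation: 1/T = a + b*ln(R) + c*ln(R)^3,
    with T in kelvin. The coefficients a, b, c are assumed to come from
    calibrating that individual sensor in the lab.
    """
    if resistance_ohms <= 0:
        raise ValueError("resistance must be positive")
    ln_r = math.log(resistance_ohms)
    inverse_kelvin = a + b * ln_r + c * ln_r ** 3
    return 1.0 / inverse_kelvin - 273.15


# Hypothetical coefficients for one sensor (typical of a 10 kOhm NTC
# thermistor); real values would differ for every calibrated sensor.
EXAMPLE_COEFFS = (1.129148e-3, 2.34125e-4, 8.76741e-8)

temp_c = thermistor_celsius(10_000.0, *EXAMPLE_COEFFS)
print(f"{temp_c:.2f} C")
```

Storing the coefficients per sensor, rather than per sensor model, is what lets the lab calibration improve on a generic datasheet curve.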
Other, more intensive steps include quality control, derivation, and then more quality control on those derivations. Quality control checks that the values are good, for example that there’s enough variation (but not too much!) or that a value is within range. Derivation produces new observations, like cloud cover, or aggregations over time. Some of these derivations need only the current minute, while others need a full 12 hours. Parameters can be sampled anywhere between 2 and 240 times a minute.
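To make the two quality control ideas above concrete, here is a minimal sketch of a range check and a variation check over one minute of samples. The function names, the use of standard deviation as the measure of variation, and the thresholds are all assumptions for illustration, not SurfaceNet's actual rules.

```python
from statistics import pstdev


def range_check(value, low, high):
    """Pass if the value lies within the plausible range for the parameter."""
    return low <= value <= high


def variation_check(samples, min_spread, max_spread):
    """Pass if the samples vary enough (sensor not stuck) but not too much
    (sensor not noisy or faulty). Spread is the population standard deviation."""
    if len(samples) < 2:
        return False  # cannot judge variation from fewer than two samples
    spread = pstdev(samples)
    return min_spread <= spread <= max_spread


def quality_control(samples, low, high, min_spread, max_spread):
    """Run both checks and return one flag per check."""
    return {
        "in_range": all(range_check(s, low, high) for s in samples),
        "variation_ok": variation_check(samples, min_spread, max_spread),
    }


# Hypothetical air temperature samples (Celsius) and thresholds.
flags = quality_control([10.1, 10.2, 10.0, 10.3], low=-60.0, high=60.0,
                        min_spread=0.01, max_spread=2.0)
print(flags)
```

Returning flags rather than dropping values keeps the decision about what to disseminate separate from the checks themselves.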
Highly parallel and highly elastic
We only have a few minutes to complete these steps before the observations start losing their relevance. Observations from a station are independent of other stations, so this is embarrassingly parallel work. A few years ago there would be no question: this workload would be run on large instances. Perhaps even on bare metal hardware.
Using AWS, especially AWS Lambda, we’re free to focus on our business goals. However easy those tasks have become, we don’t need to worry about picking the right EC2 instance size, auto-scaling, or rebalancing containers across hosts.
Stations report at varying frequencies (minutes to hours). With Lambda, it doesn’t matter if we’re under normal operations, or if we’re recovering from a network outage and we receive 6 hours of observations at once.
With AWS Lambda, SurfaceNet flexes with little-to-no effort from us and maintains that timeliness. AWS Lambda is also simple to configure. We adjust the memory allocation, which in turn adjusts how many vCPUs we have access to. We’ve optimized for single-threaded execution speed, and we opt for small thread pools. Derivation is one area where we generate many derived parameters in parallel, and where running larger AWS Lambda functions is better for us in both time and cost.
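As a rough sketch of the small thread pool approach described above, the following runs several independent derivations over the same observations in parallel. The post does not say which runtime SurfaceNet uses, so Python, the derivation functions, and the pool size of 2 are all assumptions here.

```python
from concurrent.futures import ThreadPoolExecutor


def mean_temperature(obs):
    """Hypothetical derivation: mean air temperature over the window."""
    temps = obs["air_temp_c"]
    return sum(temps) / len(temps)


def max_temperature(obs):
    """Hypothetical derivation: maximum air temperature over the window."""
    return max(obs["air_temp_c"])


def derive_all(observations, derivations, max_workers=2):
    """Run each derivation on the same observations using a small pool.

    max_workers would be sized to the vCPUs that the function's memory
    setting provides; 2 is only an illustrative default.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, observations)
                   for name, fn in derivations.items()}
        # result() re-raises any exception from the worker thread.
        return {name: future.result() for name, future in futures.items()}


observations = {"air_temp_c": [19.0, 20.0, 21.0]}
derived = derive_all(observations, {"mean": mean_temperature,
                                    "max": max_temperature})
print(derived)
```

In CPython, threads only speed up CPU-bound work when the underlying code releases the global interpreter lock, so the pattern matters more than the language chosen for this sketch.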
Payback, and pay less
That simple configurability paid dividends for us recently. With a one-line change in our AWS CloudFormation templates we migrated to Arm and AWS Graviton2 (you’ve got modular templates, right?). We spent a day dual-running x86 and Arm to verify the values were exactly the same.
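The dual-running check described above can be sketched as a strict comparison of the outputs from both architectures. The result layout (a mapping from an observation key to a value) and the names below are hypothetical; the post does not describe how the comparison was actually done.

```python
_MISSING = object()  # sentinel for keys present on only one side


def compare_runs(x86_results, arm_results):
    """Return every key whose value differs between the two runs.

    Comparison is exact (==), matching the goal of verifying that values
    are exactly the same rather than merely close. Note that NaN values
    would be reported as mismatches, because NaN != NaN.
    """
    mismatches = {}
    for key in sorted(x86_results.keys() | arm_results.keys()):
        left = x86_results.get(key, _MISSING)
        right = arm_results.get(key, _MISSING)
        if left != right:
            mismatches[key] = (left, right)
    return mismatches


# Hypothetical outputs from the same input processed on each architecture.
x86 = {"station-1/air_temp_c": 12.5, "station-1/pressure_hpa": 1013.2}
arm = {"station-1/air_temp_c": 12.5, "station-1/pressure_hpa": 1013.2}

differences = compare_runs(x86, arm)
print("identical" if not differences else differences)
```

An empty result is the signal to ship; any entry points directly at the observation that needs investigating.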
They were, so we shipped it. No mess, no fuss. Ten minutes later, we saw that welcome drop in our Amazon CloudWatch metrics, with some Lambda functions executing 5% faster. Derivation is where we saw the most benefit, running up to 37% faster.
There were four big wins from this one-line change. Our Lambda functions execute faster, which delivers results to our downstream customers sooner. The reduced duration alone makes each invocation cheaper. Graviton2 execution is also cheaper per unit of time, so the cost savings start piling up. Finally, we know that in general Arm-based hardware demands less power and generates less heat, so there’s a positive environmental impact, too.
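Because Lambda compute charges scale with duration multiplied by configured memory, a shorter duration and a lower unit price compound. The arithmetic can be sketched as below. It assumes "37% faster" means a 37% shorter billed duration, uses an illustrative 20% lower per-GB-second price for Arm (check current pricing for your Region), and ignores per-request charges.

```python
def relative_cost(duration_reduction, price_reduction):
    """Compute cost after migration as a fraction of the original cost.

    duration_reduction: fraction by which billed duration shrinks (0.37 = 37%).
    price_reduction: fraction by which the per-GB-second price is lower.
    Memory allocation is assumed unchanged, so it cancels out.
    """
    return (1.0 - duration_reduction) * (1.0 - price_reduction)


# Illustrative inputs only: the duration figures come from the post,
# the price difference is an assumption.
for label, reduction in [("typical function", 0.05), ("derivation", 0.37)]:
    fraction = relative_cost(reduction, price_reduction=0.20)
    print(f"{label}: {fraction:.1%} of the original compute cost")
```

Under these assumptions, even a function that runs only 5% faster costs noticeably less, because the unit price difference applies to every invocation.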
The Shared Responsibility Model describes how we share responsibility for the security of our environments, but I think there’s a similar, unwritten principle for performance. AWS continues to innovate and provide opportunities for us to extract value, be it saving money, time, or both. For most of us, these take very little effort. We immediately upgrade when new instance families come out, such as from C4 to C5. We also take advantage of instances with lower-cost architectures, like the M5a with AMD processors, to give us the best price/performance ratio. Last, where we are able, we migrate to Graviton2 instances, such as using db.r6g for our Amazon Aurora for MySQL cluster. These kinds of changes give benefits almost instantly.
In our case, the more of these wins we take, the better value we deliver for the public’s money. With the upcoming evolution of our supercomputing capability and our continued investment in AWS, we’re looking forward to these easy-to-adopt innovations across our high-performance applications.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.