Use weather data to improve forecasts with Amazon SageMaker Canvas

Photo by Zbynek Burival on Unsplash

Time series forecasting is a specific machine learning (ML) discipline that enables organizations to make informed planning decisions. The main idea is to supply historic data to an ML algorithm that can identify patterns from the past and then use those patterns to estimate likely values for unseen periods in the future.

Amazon has a long heritage of using time series forecasting, dating back to the early days of having to meet mail-order book demand. Fast forward more than a quarter century and advanced forecasting using modern ML algorithms is offered to customers through Amazon SageMaker Canvas, a no-code workspace for all phases of ML. SageMaker Canvas enables you to prepare data using natural language, build and train highly accurate models, generate predictions, and deploy models to production—all without writing a single line of code.

In this post, we describe how to use weather data to build and implement a forecasting cycle that you can use to elevate your business’ planning capabilities.

Business use cases for time series forecasting

Today, companies of every size and industry that invest in forecasting capabilities can improve outcomes, whether measured financially or in customer satisfaction, compared to using intuition-based estimation. Regardless of industry, every customer desires highly accurate models that can maximize their outcome. Here, accuracy means that future estimates produced by the ML model end up being as close as possible to the actual future. If the ML model estimates either too high or too low, it can reduce the effectiveness the business was hoping to achieve.

To maximize accuracy, ML models benefit from rich, quality data that reflects demand patterns, including cycles of highs and lows, and periods of stability. The shape of these historic patterns may be driven by several factors. Examples include seasonality, marketing promotions, pricing, and in-stock availability for retail sales, or temperature, length of daylight, or special events for utility demand. Local, regional, and world factors such as commodity prices, financial markets, and events such as COVID-19 can also change demand trajectory.

Weather is a key factor that can influence forecasts in many domains, and comes in long-term and short-term varieties. The following are just a few examples of how weather can affect time series estimates:

Energy companies use temperature forecasts to predict energy demand and manage supply accordingly. Warmer weather and sunny days can drive up demand for air conditioning.
Agribusinesses forecast crop yields using weather data like rainfall, temperature, humidity, and more. This helps optimize planting, harvesting, and pricing decisions.
Outdoor events might be influenced by short-term weather forecasts such as rain, heat, or storms that could change attendance, fresh prepared food needs, staffing, and more.
Airlines use weather forecasts to schedule staff and equipment efficiently. Bad weather can cause flight delays and cancellations.

If weather has an influence on your business planning, it’s important to use weather signals from both the past and the future to help inform your planning. The remaining portion of this post discusses how you can source, prepare, and use weather data to help improve and inform your journey.

Find a weather data provider

First, if you have not already done so, you will need to find a weather data provider. There are many providers that offer a wide variety of capabilities. The following are just a few things to consider as you select a provider:

Price – Some providers offer free weather data, some offer subscriptions, and some offer meter-based packages.
Information capture method – Some providers allow you to download data in bulk, whereas others enable you to fetch data in real time through programmatic API calls.
Time resolution – Depending on your business, you might need weather at the hourly level, daily level, or another interval. Make sure the provider you choose provides data at the right level of granularity to support your business decisions.
Time coverage – It’s important to select a provider based on their ability to provide historic and future forecasts aligned with your data. If you have 3 years of your own history, then find a provider that has that amount of history too. If you’re an outdoor stadium manager who needs to know weather for several days ahead, select a provider that has a weather forecast out as far as you need to plan. If you’re a farmer, you might need a long-term seasonal forecast, so your data provider should have future-dated data in line with your forecast horizon.
Geography – Different providers have data coverage for different parts of the world, including both land and sea coverage. Providers may have information at GPS coordinates, ZIP code level, or other. Energy companies might seek to have weather by GPS coordinates, enabling them to personalize weather forecasts to their meter locations.
Weather features – There are many weather-related features available, including not only the temperature, but other key data points such as precipitation, solar index, pressure, lightning, air quality, and pollen, to name a few.

In making the provider choice, be sure to conduct your own independent search and perform due diligence. Selecting the right provider is crucial and can be a long-term decision. Ultimately, you will decide on one or more providers that are a best fit for your unique needs.

Build a weather ingestion process

After you have identified a weather data provider, you need to develop a process to harvest their data, which will be blended with your historic data. In addition to building a time series model, SageMaker Canvas is able to help build your weather data processing pipeline. The automated process might have the following steps, generally, though your use case might vary:

Identify your locations – In your data, you will need to identify all the unique locations through time, whether by postal code, address, or GPS coordinates. In some cases, you may need to geocode your data, for example, by converting a mailing address to GPS coordinates. You can use Amazon Location Service to assist with this conversion, as needed. Ideally, if you do geocode, you should only need to do this one time, and retain the GPS coordinates for your postal code or address.
Acquire weather data – For each of your locations, you should acquire historic data and persist this information so you only need to retrieve it one time.
Store weather data – For each of your locations, you need to develop a process to harvest future-dated weather predictions, as part of your pipeline to build an ML model. AWS has many databases to help store your data, including cost-effective data lakes on Amazon Simple Storage Service (Amazon S3).
Normalize weather data – Prior to moving to the next step, it’s important to make all weather data relative to location and set on the same scale. Barometric pressure can have values in the 1000+ range; temperature exists on another scale. Pollen, ultraviolet light, and other weather measures also have independent scales. Within a geography, any measure is relative to that location’s own normal. In this post, we demonstrate how to normalize weather features for each location to help make sure no feature has bias over another, and to help maximize the effectiveness of weather data on a global basis.
Combine internal business data with external weather data – As part of your time series pipeline, you will need to harvest historical business data to train a model. First, you will extract data, such as weekly sales data by product sold and by retail store for the last 4 years.

Don’t be surprised if your company needs several forecasts that are independent and concurrent. Each forecast can offer multiple perspectives to help navigate. For example, you may have a short-term weather forecast to make sure weather-volatile products are stocked. In addition, a medium-term forecast can help make replenishment decisions. Finally, you can use a long-term forecast to estimate growth of the company or make seasonal buying decisions that require long lead times.

At this point, you will combine weather and business data by joining (or merging) them on time and location. An example follows in the next section.
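As a minimal sketch of that join, assuming both datasets have already been reduced to rows keyed by location and time period, the following plain-Python example attaches weather features to each business row. The function name, the field names (location, week, units_sold, temperature), and the sample values are hypothetical and only illustrate the idea.

```python
def join_on_time_and_location(business_rows, weather_rows, keys=("location", "week")):
    """Left-join weather features onto business rows using the key columns."""
    # index weather rows by their (location, time) key for fast lookup
    weather_by_key = {tuple(w[k] for k in keys): w for w in weather_rows}
    joined = []
    for row in business_rows:
        match = weather_by_key.get(tuple(row[k] for k in keys), {})
        merged = dict(row)
        # copy only the non-key weather columns; rows without weather are kept unchanged
        merged.update({c: v for c, v in match.items() if c not in keys})
        joined.append(merged)
    return joined


# hypothetical sample data
business = [
    {"location": "Seattle", "week": "2024-01-01", "units_sold": 120},
    {"location": "Mumbai", "week": "2024-01-01", "units_sold": 340},
]
weather = [
    {"location": "Seattle", "week": "2024-01-01", "temperature": 5.5},
    {"location": "Mumbai", "week": "2024-01-01", "temperature": 27.0},
]
combined = join_on_time_and_location(business, weather)
```

The join here is driven from the business rows, so every historic record is kept even when a weather value is missing for it.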

Example weather ingestion process

The following screenshot and code snippet show an example of using SageMaker Canvas to geocode location data using Amazon Location Service.

This process submits a location to Amazon Location Service and receives a response in the form of latitude and longitude. The example provides a city as input—but your use cases should provide postal codes or specific street addresses depending on your need for location precision. As guidance, take care to persist the responses in a data store, so you aren’t continuously performing geocoding on the same locations each forecasting cycle. Instead, determine which locations you have not geocoded and only perform those. The latitude and longitude are important and are used in a later step to request weather data from your selected provider.

import boto3
from pyspark.sql.functions import col, udf
import pyspark.sql.types as types

def obtain_lat_long(place_search):
    # look up the place text in an Amazon Location Service place index
    location = boto3.client('location')
    response = location.search_place_index_for_text(IndexName='myplaceindex', Text=str(place_search))
    # take the first result; Point is returned as [longitude, latitude]
    return response['Results'][0]['Place']['Geometry']['Point']

# the struct fields are declared in the same order as Point: longitude, then latitude
UDF = udf(lambda z: obtain_lat_long(z),
    types.StructType([
        types.StructField('longitude', types.DoubleType()),
        types.StructField('latitude', types.DoubleType())
    ]))

# use the UDF to create a struct column with lat and long
df = df.withColumn('lat_long', UDF(col('Location')))
# extract the lat and long from the struct column
df = df.withColumn("latitude", col("lat_long.latitude"))
df = df.withColumn("longitude", col("lat_long.longitude"))
df = df.drop('lat_long')
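The guidance about persisting geocoding responses can be sketched as a small cache. This is an illustration under assumptions, not part of the original flow: the function name geocode_new_locations is hypothetical, the cache is a plain dictionary standing in for your data store, and fake_geocode is a stand-in for a real lookup such as obtain_lat_long.

```python
def geocode_new_locations(locations, cache, geocode):
    """Geocode only the locations missing from the cache and return the cache."""
    for place in locations:
        if place not in cache:
            cache[place] = geocode(place)
    return cache


calls = []

def fake_geocode(place):
    # stand-in for a real geocoding call; records which places were requested
    calls.append(place)
    return [-122.33, 47.61]  # placeholder [longitude, latitude], not a real lookup


# Mumbai was geocoded in an earlier cycle (placeholder coordinates)
cache = {"Mumbai": [72.88, 19.08]}
cache = geocode_new_locations(["Seattle", "Mumbai", "Seattle"], cache, fake_geocode)
```

Only Seattle triggers a lookup in this example, and only once, because Mumbai is already cached and the second Seattle entry is served from the cache.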

In the following screenshots, we show an example of calling a weather provider using the latitude and longitude. Each provider will have differing capabilities, which is why selecting a provider is an important consideration. The example we show in this post could be used for historical weather capture as well as future-dated weather forecast capture.

The following screenshot shows an example of using SageMaker Canvas to connect to a weather provider and retrieve weather data.

The following code snippet illustrates how you might provide a latitude and longitude pair to a weather provider, along with parameters such as specific types of weather features, time periods, and time resolution. In this example, a request for temperature and barometric pressure is made. The data is requested at the hourly level for the next day ahead. Your use case will vary; consider this as an example.

import requests
from pyspark.sql.functions import col, udf

# weather_provider_url is not defined in this snippet;
# set it to your selected provider's API endpoint

def get_weather_data(latitude, longitude):
    # request hourly temperature and surface pressure for the next day
    params = {
        "latitude": str(latitude),
        "longitude": str(longitude),
        "hourly": "temperature_2m,surface_pressure",
        "forecast_days": 1
    }

    response = requests.get(url=weather_provider_url, params=params)

    # return the raw JSON text; it is parsed into columns in a later step
    return response.content.decode('utf-8')

UDF = udf(lambda latitude, longitude: get_weather_data(latitude, longitude))
df = df.withColumn('weather_response', UDF(col('latitude'), col('longitude')))

After you retrieve the weather data, the next step is to convert structured weather provider data into a tabular set of data. As you can see in the following screenshot, temperature and pressure data are available at the hourly level by location. This will enable you to join the weather data alongside your historic demand data. It’s important you use future-dated weather data to train your model. Without future-dated data, there is no basis to use weather to help inform what might lie ahead.

The following code snippet is from the preceding screenshot. This code converts the weather provider nested JSON array into tabular features:

from pyspark.sql.functions import from_json, col, regexp_replace, explode, arrays_zip
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, ArrayType

#schema of the provider response: parallel arrays, one entry per hour
json_schema = StructType([
    StructField("hourly", StructType([
        StructField("time", ArrayType(StringType()), True),
        StructField("temperature_2m", ArrayType(DoubleType()), True),
        StructField("surface_pressure", ArrayType(DoubleType()), True)
    ]), True)
])

#parse string into structure
df = df.withColumn("weather_response", from_json(col("weather_response"), json_schema))

#extract feature arrays
df = df.withColumn("time",col("weather_response.hourly.time"))
df = df.withColumn("temperature_2m",col("weather_response.hourly.temperature_2m"))
df = df.withColumn("surface_pressure",col("weather_response.hourly.surface_pressure"))

#explode all arrays together
df = df.withColumn("zipped", arrays_zip("surface_pressure", "temperature_2m", "time")) \
  .withColumn("exploded", explode("zipped")) \
  .select("Location", "exploded.time", "exploded.surface_pressure", "exploded.temperature_2m")

#cleanup format of timestamp
df = df.withColumn("time", regexp_replace(col("time"), "T", " "))
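To make the reshaping step easier to follow, here is the same idea in plain Python for a single response. It is a sketch only: the function name hourly_to_rows and the sample values are hypothetical, and the response shape is assumed to match the schema used above (an hourly object holding parallel arrays).

```python
import json


def hourly_to_rows(location, weather_response):
    """Turn one provider response (JSON text) into one row per hourly timestamp."""
    hourly = json.loads(weather_response)["hourly"]
    rows = []
    # zip walks the parallel arrays together, like arrays_zip followed by explode
    for time, temperature, pressure in zip(
        hourly["time"], hourly["temperature_2m"], hourly["surface_pressure"]
    ):
        rows.append({
            "Location": location,
            "time": time.replace("T", " "),  # same timestamp cleanup as above
            "temperature_2m": temperature,
            "surface_pressure": pressure,
        })
    return rows


# hypothetical two-hour response
sample = json.dumps({
    "hourly": {
        "time": ["2024-01-01T00:00", "2024-01-01T01:00"],
        "temperature_2m": [5.1, 4.8],
        "surface_pressure": [1012.3, 1012.0],
    }
})
rows = hourly_to_rows("Seattle", sample)
```

Each element of rows corresponds to one line of the tabular output: a location, a timestamp, and the weather features for that hour.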

In this next step, we demonstrate how to set all weather features on the same scale, one that is also sensitive to each location’s range of values. In the preceding screenshot, observe how pressure and temperature in Seattle are on different scales. Temperature in Celsius is in the single or double digits, whereas pressure exceeds 1,000. Seattle may also have different ranges than any other city as a result of its unique climate, natural topography, and geographic position. In this normalization step, the goal is to bring all weather features onto the same scale, so pressure doesn’t outweigh temperature. We also want to place Seattle on its own scale, Mumbai on its own scale, and so forth. In the following screenshot, the minimum and maximum values per location are obtained. These are important intermediate computations for scaling, where each weather value is set based on its position within the range observed for its geography.

With the extreme values computed per location, the data frame of row-level values can be joined to the data frame of minimum and maximum values wherever the locations match. The result is scaled data, computed with the normalization formula that follows with example code.

First, this code example computes the minimum and maximum weather values per location. Next, the range is computed. Finally, a data frame is created with the location, range, and minimum per weather feature. Maximum is not needed because the range can be used as part of the normalization formula. See the following code:

from pyspark.sql import functions as F

# minimum and maximum of each weather feature, per location
df = df.groupBy("Location").agg(
    F.min("surface_pressure").alias("min_surface_pressure"),
    F.max("surface_pressure").alias("max_surface_pressure"),
    F.min("temperature_2m").alias("min_temperature_2m"),
    F.max("temperature_2m").alias("max_temperature_2m"),
)

# range = maximum - minimum
df = df.withColumn("range_surface_pressure",
    df.max_surface_pressure - df.min_surface_pressure)

df = df.withColumn("range_temperature_2m",
    df.max_temperature_2m - df.min_temperature_2m)

# keep only the range and minimum; the maximum is not needed for scaling
df = df.select("Location",
    "range_surface_pressure", "min_surface_pressure",
    "range_temperature_2m", "min_temperature_2m")

In this code snippet, the scaled value is computed according to the normalization formula shown. The minimum value is subtracted from the actual value at each time interval. Next, the difference is divided by the range. In the previous screenshot, you can see values range on a 0–1 scale. Zero is the lowest observed value for the location; 1 is the highest observed value for the location, for all the time periods where data exists.

Here, we compute the scaled x, represented as x′:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

# after the join, the location column appears twice (Location_0, Location_1); keep one
df = df.withColumnRenamed('Location_0','Location')

# scaled value = (value - minimum) / range, per location
df = df.withColumn('scaled_temperature_2m',
                     (df.temperature_2m-df.min_temperature_2m) /
                         df.range_temperature_2m)

df = df.withColumn('scaled_surface_pressure',
                     (df.surface_pressure-df.min_surface_pressure) /
                         df.range_surface_pressure)

# drop the duplicate location column and the intermediate minimum and range columns
df = df.drop('Location_1','min_surface_pressure','range_surface_pressure',
            'min_temperature_2m','range_temperature_2m')
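As a small worked example of per-location scaling, the following plain-Python sketch applies the same minimum-and-range calculation to a handful of made-up temperature readings. The function name scale_by_location and the sample values are hypothetical, and returning None for a feature with zero range is a choice made for this sketch.

```python
from collections import defaultdict


def scale_by_location(rows, feature):
    """Add scaled_<feature> to each row using that location's own minimum and range."""
    values = defaultdict(list)
    for r in rows:
        values[r["Location"]].append(r[feature])
    # per-location (minimum, range)
    stats = {loc: (min(v), max(v) - min(v)) for loc, v in values.items()}
    scaled = []
    for r in rows:
        low, span = stats[r["Location"]]
        out = dict(r)
        # a constant feature has zero range; None avoids dividing by zero
        out["scaled_" + feature] = (r[feature] - low) / span if span else None
        scaled.append(out)
    return scaled


# made-up readings for two locations
readings = [
    {"Location": "Seattle", "temperature_2m": 0.0},
    {"Location": "Seattle", "temperature_2m": 5.0},
    {"Location": "Seattle", "temperature_2m": 10.0},
    {"Location": "Mumbai", "temperature_2m": 20.0},
    {"Location": "Mumbai", "temperature_2m": 30.0},
]
scaled_readings = scale_by_location(readings, "temperature_2m")
```

In this sample, 5.0 degrees in Seattle scales to 0.5 because it sits halfway through Seattle's observed range, while 20.0 degrees in Mumbai scales to 0 because it is the lowest value observed there.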

Build a forecasting workflow with SageMaker Canvas

With your historic data and weather data now available to you, the next step is to bring your business data and prepared weather data together to build your time series model. The following high-level steps are required:

Combine weather data with your historic data on a point-in-time and location basis. Your actual data will end, but the weather data should extend out to the end of your horizon.

This is a crucial point—weather data can only help your forecast if it’s included in your future forecast horizon. The following screenshot illustrates weather data alongside business demand data. For each item and location, known historic unit demand and weather features are provided. The red boxes added to the screenshot highlight the concept of future data, where weather data is provided, yet future demand is not provided because it remains unknown.
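The shape of that combined dataset can be sketched in plain Python. This is an illustration under assumptions: the function name add_future_rows, the column names (item, location, date, demand, scaled_temperature_2m), and the sample values are hypothetical. The sketch appends one row per item, location, and horizon date, with weather filled in and demand left empty, and it raises an error if weather is missing for any date in the horizon.

```python
def add_future_rows(history, future_weather, horizon_dates, target="demand"):
    """Append future rows that carry weather features but no target value."""
    weather_by_key = {(w["location"], w["date"]): w for w in future_weather}
    # every item/location pair seen in history needs rows across the horizon
    series = sorted({(r["item"], r["location"]) for r in history})
    future_rows = []
    for item, location in series:
        for date in horizon_dates:
            w = weather_by_key.get((location, date))
            if w is None:
                raise ValueError(f"no weather for {location} on {date}")
            row = {"item": item, "location": location, "date": date, target: None}
            row.update({k: v for k, v in w.items() if k not in ("location", "date")})
            future_rows.append(row)
    return history + future_rows


# hypothetical sample: one day of history, one day of horizon
history = [{"item": "umbrella", "location": "Seattle", "date": "2024-06-01",
            "demand": 12, "scaled_temperature_2m": 0.4}]
future_weather = [{"location": "Seattle", "date": "2024-06-02",
                   "scaled_temperature_2m": 0.7}]
dataset = add_future_rows(history, future_weather, ["2024-06-02"])
```

The last row of dataset corresponds to the red boxes in the screenshot: the weather feature is present, and demand is None because it is not yet known.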

After your data is prepared, you can use SageMaker Canvas to build a time series model with a few clicks and no coding required.

As you get started, you should build a time series model in Canvas with and without weather data. This will let you quickly quantify how much of an impact weather data has for your forecast. You may find that some items are more impacted by weather than others.
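One way to quantify that comparison outside the visual interface is to compute the same error metric for both models on a holdout period. The sketch below uses weighted absolute percentage error (WAPE) as an example metric; the choice of metric, the function name wape, and all numbers are assumptions for illustration only, not results from a real model.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: total absolute error over total actuals."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return total_error / sum(abs(a) for a in actuals)


# made-up holdout values for one item
actuals = [100, 120, 80, 100]
without_weather = [90, 100, 100, 110]  # forecasts from the model built without weather
with_weather = [95, 115, 90, 100]      # forecasts from the model built with weather

wape_without = wape(actuals, without_weather)
wape_with = wape(actuals, with_weather)
# positive values mean the weather model has the lower error
improvement = wape_without - wape_with
```

Running the same calculation per item is one way to see which items benefit from weather data and which do not.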

After you add the weather data, use SageMaker Canvas feature importance scores to quantify which weather features are important, and retain these in the future. For example, if pollen value has no lift in accuracy but barometric pressure does, you can eliminate the pollen data feature to keep your process as simple as possible.

As an alternative to using a visual interface, we have also created a sample notebook on GitHub that demonstrates how to use SageMaker Canvas AutoML capabilities as an API. This method can be useful when your business prefers to orchestrate forecasting through programmatic APIs.
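For orientation, a programmatic request might be assembled along the following lines. This is a hedged sketch and not taken from the sample notebook: it assumes the SageMaker create_auto_ml_job_v2 API with a time series forecasting configuration, and the helper name build_forecast_job_request, the bucket, role ARN, column names, frequency, horizon, and quantiles are all placeholders. Verify the field names against the current boto3 documentation before use.

```python
def build_forecast_job_request(job_name, train_s3_uri, output_s3_uri, role_arn):
    """Assemble keyword arguments for a time series AutoML job (assumed request shape)."""
    return {
        "AutoMLJobName": job_name,
        "AutoMLJobInputDataConfig": [{
            "ChannelType": "training",
            "ContentType": "text/csv;header=present",
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "AutoMLProblemTypeConfig": {
            "TimeSeriesForecastingJobConfig": {
                "ForecastFrequency": "D",          # placeholder: daily data
                "ForecastHorizon": 14,             # placeholder: 14 periods ahead
                "ForecastQuantiles": ["p50", "p90"],
                "TimeSeriesConfig": {
                    # placeholder column names from your combined dataset
                    "TargetAttributeName": "demand",
                    "TimestampAttributeName": "date",
                    "ItemIdentifierAttributeName": "item",
                    "GroupingAttributeNames": ["location"],
                },
            }
        },
        "RoleArn": role_arn,
    }


request = build_forecast_job_request(
    "weather-forecast-demo",
    "s3://amzn-s3-demo-bucket/train/",             # placeholder bucket
    "s3://amzn-s3-demo-bucket/output/",            # placeholder bucket
    "arn:aws:iam::111122223333:role/ExampleRole",  # placeholder role
)

# Submitting the job requires AWS credentials and is left commented out:
# import boto3
# boto3.client("sagemaker").create_auto_ml_job_v2(**request)
```

Keeping the request in a plain dictionary makes it easy to review or version before anything is submitted.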

Clean up

Choose Log out in the left pane to log out of the Amazon SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours. This will release all resources used by the workspace instance.

Conclusion

In this post, we discussed the importance of time series forecasting to business, and focused on how you can use weather data to build a more accurate forecasting model in certain cases. This post described key factors you should consider when finding a weather data provider and how to build a pipeline that sources and stages the external data, so that it can be combined with your existing data, on a time-and-place basis. Next, we discussed how to use SageMaker Canvas to combine these datasets and train a time series ML model with no coding required. Finally, we suggested that you compare a model with and without weather data so you can quantify the impact and also learn which weather features drive your business decisions.

If you’re ready to start this journey, or improve on an existing forecast method, reach out to your AWS account team and ask for an Amazon SageMaker Canvas Immersion Day. You can gain hands-on experience and learn how to apply ML to improve forecasting outcomes in your business.

About the Author

Charles Laughlin is a Principal AI Specialist at Amazon Web Services (AWS). Charles holds an MS in Supply Chain Management and a PhD in Data Science. Charles works in the Amazon SageMaker service team where he brings research and voice of the customer to inform the service roadmap. In his work, he collaborates daily with diverse AWS customers to help transform their businesses with cutting-edge AWS technologies and thought leadership.