AWS Machine Learning Blog
Managing missing values in your target and related datasets with automated imputation support in Amazon Forecast
Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating product demand, supply chain optimization, resource planning, energy demand forecasting, and computing cloud infrastructure usage.
With Forecast, there are no servers to provision or ML models to build manually. Additionally, you only pay for what you use, and there is no minimum fee or upfront commitment. To use Forecast, you only need to provide historical data for the variable you want to forecast and any optional related data that may impact your forecasts. The latter may include time-varying data such as price, events, and weather, and categorical data such as color, genre, or region. The service automatically trains and deploys ML models based on your data and provides an API to retrieve forecasts.
A common occurrence in real-world forecasting is the presence of missing values in the raw data. A missing value in historical data (or time series) means that the true value for a given time period isn’t available for processing. Values can be marked as missing for several reasons: there were no transactions in the specified time period, a measurement error occurred, or the measurement couldn’t be taken correctly.
Forecast supports automated imputation of missing values (including existing NaNs) for the related and target time series datasets, in both the historical and forecast time periods. Related time series (RTS) data typically includes promotions, prices, or out-of-stock information that correlates with the target value (product demand) and can often improve the accuracy of the forecast. You can use several missing value logics, including zero, value, median, mean, min, max, and nan (target time series only), depending on the specific use case. To use this feature in Forecast, set the FeaturizationConfig in the CreatePredictor API.
This post uses the notebook example from the Forecast GitHub repo to use the missing value functionality for the related and target time series (TTS) datasets.
Handling missing values in Forecast
A missing value in a time series denotes that the true corresponding value isn’t available for further processing. For a time series representing product sales, a missing value can mean the product wasn’t available for sale; for example, the periods outside the existence of the product (before its launch, after its deprecation) or when it wasn’t available within its period of existence (partially out of stock). A missing value can also mean you don’t have sales data recorded for this time range.
Even though the corresponding target value in the “not available for sale” use case is typically zero, additional information is conveyed in the value being marked as missing (nan). The information that you sold zero units of an available item is very different from the information of selling zero units of an out-of-stock item. Because of this, zero filling in the target time series might cause the predictor to be under-biased in its predictions, while nan filling might ignore actual occurrences of zero available items sold and cause the predictor to be over-biased.
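The direction of each bias can be seen with simple arithmetic. The following is a minimal sketch, not how Forecast trains a model: it only compares the average demand a model would "see" under each filling choice. The demand values are made up for illustration.

```python
import math


def mean_ignoring_nan(values):
    """Average only the observed points, skipping NaN entries."""
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed)


# Hypothetical monthly demand for one item; NaN marks out-of-stock months.
demand = [10.0, 12.0, float("nan"), float("nan"), 11.0]

# Zero filling treats the out-of-stock months as real zero-demand months.
zero_filled = [0.0 if math.isnan(v) else v for v in demand]

zero_fill_mean = sum(zero_filled) / len(zero_filled)  # 33 / 5
nan_fill_mean = mean_ignoring_nan(demand)             # 33 / 3

print(f"zero fill: {zero_fill_mean:.1f}, nan fill: {nan_fill_mean:.1f}")
# prints "zero fill: 6.6, nan fill: 11.0"
```

Zero filling pulls the average down because the out-of-stock months count as true zeros. If some of those months had been genuine zero-sales months for an available item, ignoring them would instead push the average up.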
The following time-series graphs illustrate how choosing the wrong filling value can significantly affect the accuracy of your model. Graphs A and B plot the demand for an out-of-stock item, with the black lines representing actual sales data. Missing values in A1 are filled with zero, which leads to relatively under-biased predictions (represented by the dotted lines) in A2. In contrast, missing values in B1 are filled with nan, which leads to predictions that track the actuals more precisely in B2.
Forecast provides several filling methods to handle missing values in your TTS and RTS datasets. Filling is the process of adding standardized values to missing entries in your dataset. During backtesting, Forecast assumes the filled values (barring NaNs) to be true values and uses them in evaluating metrics. Forecast supports the following filling methods:
- Middle filling – Fills any missing values between the item start and end dates
- Back filling – Fills any missing values between the item’s last recorded data point and the dataset’s global end date (the max end date across all items)
- Future filling (RTS only) – Fills any missing values between the dataset’s global end date and the end of the forecast horizon.
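The date ranges behind these methods can be sketched in a few lines. This is an illustrative approximation, not the service's implementation: periods are encoded as integer month indexes (an assumption made for brevity), the function names are hypothetical, and the "front" region is taken to be anything before the item's first data point.

```python
def classify_fill_region(period, item_start, item_end, global_end, horizon):
    """Return which filling method would cover a missing value at `period`."""
    if period < item_start:
        return "front"   # assumed: before the item's first data point
    if period <= item_end:
        return "middle"  # between the item start and end dates
    if period <= global_end:
        return "back"    # after the item's last point, up to the global end date
    if period <= global_end + horizon:
        return "future"  # inside the forecast horizon (RTS only)
    return None          # beyond the forecast horizon


def fill_series(observed, item_start, item_end, global_end, middle=0.0, back=0.0):
    """Fill gaps in a {period: value} mapping up to the global end date."""
    filled = {}
    for period in range(item_start, global_end + 1):
        if period in observed:
            filled[period] = observed[period]
        elif period <= item_end:
            filled[period] = middle  # middle filling
        else:
            filled[period] = back    # back filling
    return filled


# The item has data for months 2 to 5 (month 4 is missing); the dataset ends at month 7.
observed = {2: 5.0, 3: 7.0, 5: 6.0}
filled = fill_series(observed, item_start=2, item_end=5, global_end=7)
print(filled)
# prints {2: 5.0, 3: 7.0, 4: 0.0, 5: 6.0, 6: 0.0, 7: 0.0}
```

Here month 4 is middle filled and months 6 and 7 are back filled, both with zero.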
The following image provides a visual representation of different filling methods.
The following table indicates the different filling options supported for each method. For more information, see Handling Missing Values.
| Filling method (TTS) | Default | Options |
| --- | --- | --- |
| Front filling | No filling | none |
| Middle filling | zero | nan, zero, value, median, mean, min, max |
| Back filling | zero | nan, zero, value, median, mean, min, max |
| Future filling (not supported) | n/a | n/a |

| Filling method (RTS) | Default | Options |
| --- | --- | --- |
| Front filling (not supported) | n/a | n/a |
| Middle filling | No default | zero, value, median, mean, min, max |
| Back filling | No default | zero, value, median, mean, min, max |
| Future filling | No default | zero, value, median, mean, min, max |
You first need to import data for both the TTS and RTS. For this use case, you use the data files tts.csv and rts.csv from the GitHub repo. The file tts.csv tracks demand for multiple items at a monthly frequency, and rts.csv tracks the average monthly price for each item. This mimics a very common retail scenario. Both datasets contain missing values, which you fill in using the filling methods and logic available in Forecast. You can visualize the demand for a sample item (item_001) using the following Python code example:
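Before plotting, it can help to confirm where the gaps actually are. The sketch below is a standard-library illustration that lists the months with no recorded demand for one item. The column layout (item_id, timestamp, demand, no header row), the date format, and the inline sample rows are all assumptions for the example; adjust them to match the real tts.csv.

```python
import csv
import io


def monthly_series(csv_text, item_id):
    """Return {(year, month): demand} for one item from headerless CSV text.

    Assumes columns item_id, timestamp (YYYY-MM-DD), demand; this layout is a guess.
    """
    series = {}
    for row_item, timestamp, demand in csv.reader(io.StringIO(csv_text)):
        if row_item == item_id and demand.strip().lower() not in ("", "nan"):
            year, month = int(timestamp[0:4]), int(timestamp[5:7])
            series[(year, month)] = float(demand)
    return series


def missing_months(series):
    """List the (year, month) periods absent between the first and last data points."""
    (year, month), last = min(series), max(series)
    gaps = []
    while (year, month) <= last:
        if (year, month) not in series:
            gaps.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return gaps


# Synthetic sample rows, not taken from the real dataset.
sample = """item_001,2019-11-01,12
item_001,2019-12-01,NaN
item_001,2020-02-01,9
item_002,2019-11-01,4
"""

series = monthly_series(sample, "item_001")
print(missing_months(series))
# prints [(2019, 12), (2020, 1)]
```

Both an explicit NaN entry and an absent row are reported as gaps, which mirrors how either case ends up being handled by the filling logic.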
Creating the dataset group and datasets
You create a dataset group and add the TTS and RTS datasets to it by completing the following steps:
- On the Forecast console, under Dataset groups, choose Create dataset group.
- For Dataset group name, enter
- For Forecasting domain, choose Retail.
- Choose Next.
- For Dataset name, enter
- For Data schema, enter the following code:
- Choose Next.
- For Dataset import name, enter
- For Timestamp format, enter
- For IAM role, choose AmazonForecast-ExecutionRole.
- Choose Create dataset import.
To import the RTS dataset, repeat these steps and use the following code for the schema:
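As a rough illustration of the shape such schemas take, the following sketch builds two schema documents in the Attributes format that Forecast expects. The attribute names follow the Retail domain (item_id, timestamp, demand, and price), but the column order shown here is an assumption; it must match the column order of your CSV files. The helper name `make_schema` is hypothetical.

```python
import json


def make_schema(attributes):
    """Build a Forecast-style schema dict from (name, type) pairs in column order."""
    return {
        "Attributes": [
            {"AttributeName": name, "AttributeType": attr_type}
            for name, attr_type in attributes
        ]
    }


# Assumed column order; reorder to match the actual files.
tts_schema = make_schema(
    [("item_id", "string"), ("timestamp", "timestamp"), ("demand", "float")]
)
rts_schema = make_schema(
    [("item_id", "string"), ("timestamp", "timestamp"), ("price", "float")]
)

print(json.dumps(tts_schema, indent=2))
print(json.dumps(rts_schema, indent=2))
```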
Model creation and inference
After you import the data, you can train your model and generate accuracy metrics. Forecast provides five algorithms; you can either choose a specific algorithm or choose Auto-ML to let Forecast select the algorithm that best meets the objective function defined by the service. For more information, see Choosing an Amazon Forecast Algorithm.
For this use case, you use DeepAR+ because you have 300 unique items with historical data over two years. When your dataset contains hundreds of time series, the DeepAR+ algorithm outperforms standard ARIMA and ETS methods. To train your predictor, complete the following steps:
- On the Forecast console, under Train a predictor, choose Start.
- For Predictor name, enter
- For Forecast horizon, enter
- For Forecast frequency, choose month.
- For Algorithm, choose Deep_AR_Plus.
- For Number of backtest windows, enter
- For Backtest window offset, enter
- For Training parameters, enter the following code:
You now set up the missing value logic for both the TTS (demand) and RTS (price). The logic you use for middlefill and backfill is mean for both TTS and RTS. For futurefill (the logic for specifying missing values in the forecast horizon), you use min for RTS. A common scenario with forecasting is evaluating the impact of different values of external variables (like price) during the forecast period. This helps with better planning and ensuring you can stock the right level of product irrespective of the scenario. You can achieve this in Forecast by either updating the data and regenerating the forecast (for instructions, see the GitHub repo) or using the futurefill method (like above) and creating predictors that use different filling options (for example, max) to mimic multiple scenarios.
- For Featurizations, enter the following code:
- Choose Train predictor.
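The filling logic described for the Featurizations field can be sketched as follows. This is an illustration under stated assumptions rather than the exact code entered in the console: the attribute names demand and price are assumed from the TTS and RTS descriptions, the helper names are hypothetical, and the extra parameter that the value option requires is not handled.

```python
import json

# Options per the filling tables; nan filling is target time series only.
TTS_OPTIONS = {"nan", "zero", "value", "median", "mean", "min", "max"}
RTS_OPTIONS = TTS_OPTIONS - {"nan"}


def filling_featurization(attribute, params, allowed):
    """Build one Featurizations entry that applies the 'filling' method."""
    for key, option in params.items():
        if option not in allowed:
            raise ValueError(f"{key}={option!r} is not supported for {attribute}")
    return {
        "AttributeName": attribute,
        "FeaturizationPipeline": [
            {
                "FeaturizationMethodName": "filling",
                "FeaturizationMethodParameters": dict(params),
            }
        ],
    }


def build_featurizations(rts_futurefill="min"):
    """Mean for middle and back filling; a configurable futurefill for the RTS."""
    return [
        filling_featurization(
            "demand", {"middlefill": "mean", "backfill": "mean"}, TTS_OPTIONS
        ),
        filling_featurization(
            "price",
            {"middlefill": "mean", "backfill": "mean", "futurefill": rts_futurefill},
            RTS_OPTIONS,
        ),
    ]


featurizations = build_featurizations("min")  # the setup described in this post
max_scenario = build_featurizations("max")    # an alternative price scenario
print(json.dumps(featurizations, indent=2))
```

Building one list per futurefill option is one way to prepare the separate predictors needed to compare scenarios such as a minimum-price and a maximum-price forecast period.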
After the predictor is trained, you can go to the predictor detail page to evaluate the relevant metrics.
Creating a forecast
To create a forecast, complete the following steps:
- On the Forecast console, under Forecast generation, choose Start.
- For Forecast name, enter
- For Predictor, choose filling_analysis_v1.
- For Forecast types, enter the quantiles you want to generate forecasts for; for example,
- Choose Create a forecast.
Querying and visualizing forecasts
Finally, you can visualize the forecasts for any item by using the QueryForecast API through the console.
To query a forecast, complete the following steps:
- On the Forecast console, on the Dashboard, choose Lookup Forecast.
- For Forecast, choose filling_value_v1_min.
- For Start date, choose
- For End date, choose
- For item_id, choose item_269 (you can choose any item here), and then choose Get Forecast.
You can now visualize the forecast and historical demand for the chosen item as shown below.
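If you retrieve the same forecast programmatically, the QueryForecast response groups predictions by quantile. The following sketch flattens a response of that shape into per-quantile lists that are easy to plot. The sample response is synthetic (the timestamps and values are invented), and the helper name is hypothetical.

```python
def predictions_by_quantile(response):
    """Flatten a QueryForecast-style response into {quantile: [(timestamp, value)]}."""
    predictions = response["Forecast"]["Predictions"]
    return {
        quantile: [(point["Timestamp"], point["Value"]) for point in points]
        for quantile, points in predictions.items()
    }


# With boto3, a real response would come from a call along these lines:
#   boto3.client("forecastquery").query_forecast(
#       ForecastArn=forecast_arn, Filters={"item_id": "item_269"})
# The dictionary below is a made-up stand-in with the same structure.
sample_response = {
    "Forecast": {
        "Predictions": {
            "p10": [
                {"Timestamp": "2020-01-01T00:00:00", "Value": 3.0},
                {"Timestamp": "2020-02-01T00:00:00", "Value": 4.0},
            ],
            "p50": [
                {"Timestamp": "2020-01-01T00:00:00", "Value": 6.0},
                {"Timestamp": "2020-02-01T00:00:00", "Value": 7.5},
            ],
        }
    }
}

by_quantile = predictions_by_quantile(sample_response)
for quantile, points in by_quantile.items():
    print(quantile, points)
```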
This post used the methods supported by Forecast to fill in missing values in TTS and RTS datasets. You can start using this feature in all Regions where Forecast is available. Please share any feedback either via the AWS forum or your regular AWS support channels.
About the authors
Rohit Menon is the Sr. Product Manager for Amazon Forecast and one of its founding members. His current focus is to democratize time series forecasting by using machine learning. In his spare time, he enjoys reading and watching documentaries.
Heesung Sohn is a big data and machine learning evangelist. In his role as Software Engineer at Amazon Web Services, he leverages his experience to build distributed systems focusing on database and AI services. He was one of the founding engineers for Amazon Forecast.
Tomonori Shimmura is a Senior Technical Program Manager on the AWS Vertical AI team, where he provides deeper technical consultation for customers and drives product development. In his free time, he enjoys playing video games, reading manga, and writing software.
Danielle Robinson is an Applied Scientist on the ML Forecasting team. Her research is in time series forecasting and in particular how we can apply new neural network based algorithms within Amazon Forecast. Her thesis research was focused on developing new, robust and physically accurate numerical models for computational fluid dynamics. Her hobbies include cooking, swimming and hiking.