AWS Machine Learning Blog

Build a model to predict the impact of weather on urban air quality using Amazon SageMaker

Air pollution in cities can be an acute problem, with damaging effects on people, animals, plants, and property. It is an important topic that is receiving increased attention as urban populations continue to grow, and it was the subject of the 2018 KDD Cup, the annual data mining and knowledge discovery competition organized by ACM SIGKDD.

The burning of fossil fuels for transport and home heating is a major contributor to air pollution in urban environments, producing the pollutant nitrogen dioxide (NO2). NO2 is a secondary pollutant formed by the oxidation of NO, and it is a major contributor to respiratory problems. In the European Union, the Cleaner Air For Europe (CAFÉ) Directive 2008/50/EC established an hourly limit of 200 μg/m3 and an annual mean limit of 40 μg/m3 for NO2. No more than 18 exceedances of the hourly limit value are allowed per year.

Many cities around the globe report on the air quality levels at least on a daily basis. We decided to look into air quality data by using Amazon SageMaker, a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.

The Scenario

For the demonstration in this blog post, we examine the relationship between an air pollutant (NO2) and weather in a selected city: Dublin, Ireland.

The air quality data comes from a long established monitoring station run by the Irish Environmental Protection Agency. The station is located in Rathmines, an inner suburb of Dublin about 3 kilometers south of the city center. Dublin, the capital city of the Republic of Ireland, has a population of approximately one million people. The city is bounded by the sea to the east, mountains to the south, and flat topography to the west and north. The mountains to the south of Dublin affect the wind speed and direction over the city: when the general flow of wind is from the south, the mountains deflect it to a south-westerly or south-easterly direction.

The weather data comes from the long established weather station located at Dublin Airport. Dublin Airport is located on the flat topography to the north of the city. It is about 12 kilometers north of Dublin city center.

The Tools

  • Amazon SageMaker for exploratory data analysis and machine learning
  • Amazon Simple Storage Service (Amazon S3) to stage the data for analysis.

The Data

Hourly air pollution datasets for the Rathmines monitoring station are published by the Irish Environmental Protection Agency as Open Data. The data we used spans the years 2011 to 2016. The provenance of the data is described, and the data can be downloaded, at the following link:

Metadata and Data Resources by EPA Air Quality

A daily weather data set for Dublin Airport stretching back to 1942 is published by the Irish Meteorological Service (Met Éireann) on their website under a Creative Commons License.

https://www.met.ie/climate/available-data/historical-data

In this blog post, we will be using these two data sets to perform our analysis and build a predictive model.

For global studies, there is a handy repository of air quality data available on OpenAQ. This data is also available on the Registry of Open Data on AWS.

Preparing the data for analysis and loading data from Amazon S3

The data is in CSV format. Before we put the data into our Amazon S3 bucket, we carried out some data wrangling to prepare it for analysis (a rough sketch of these steps follows the list below):

  • Weather Data: The data set contained more information than we needed for the purpose of this proof of concept. To prepare the weather data, the following actions were carried out on the original dataset:
    • Removed the header, which takes up the first 25 rows of the dataset.
    • Converted measurement unit for wind speed from knots to meters per second.
    • Selected a subset of the parameters available. Parameters were chosen based on results from scientific papers on this subject.
    • The names of the parameters selected were changed to reduce ambiguity.
      • ‘rain’ became ‘rain_mm’. The precipitation amount in mm.
      • ‘maxtp’ became ‘maxtemp’. The maximum air temperature in Celsius.
      • ‘mintp’ became ‘mintemp’. The minimum air temperature in Celsius.
      • ‘cbl’ became ‘pressure_hpa’. The mean air pressure in hectopascals.
      • ‘wdsp’ became ‘wd_speed_m_per_s’ (and the units converted from knots).
      • ‘ddhm’ became ‘winddirection’.
      • ‘sun’ became ‘sun_hours’. The sunshine duration.
      • ‘evap’ became ‘evap_mm’. Evaporation (mm).
  • Air Quality Data: Each year of air quality data came in a separate file, and the units used to measure the pollutants changed from standard (SI) units to an obsolete unit. We decided to only use the years where the SI units are used, which limited us to the period 2011 to 2016. These yearly files were merged into one file.
  • Sample Rate: The weather observations are 24-hour daily averages, and the air quality data came as 1-hour averages. We resampled the air quality data to 24-hour averages and changed the parameter names to indicate this. For example, NO2 became NO2_avg.
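
As a rough sketch, the wrangling described above could look like the following with pandas. The file names, the date column name, and other details here are illustrative assumptions, not the original preparation scripts.

# A rough sketch of the wrangling steps described above (file names, the 'date'
# column, and some details are illustrative assumptions)
import pandas as pd

KNOTS_TO_M_PER_S = 0.514444  # 1 knot = 0.514444 m/s

# Weather data: skip the 25-row header, keep a subset of parameters, rename them
weather_df = pd.read_csv('dublin_airport_daily.csv', skiprows=25, parse_dates=['date'])
weather_df = weather_df[['date', 'rain', 'maxtp', 'mintp', 'cbl',
                         'wdsp', 'ddhm', 'sun', 'evap']]
weather_df = weather_df.rename(columns={
    'rain': 'rain_mm', 'maxtp': 'maxtemp', 'mintp': 'mintemp',
    'cbl': 'pressure_hpa', 'wdsp': 'wd_speed_m_per_s',
    'ddhm': 'winddirection', 'sun': 'sun_hours', 'evap': 'evap_mm'})
weather_df['wd_speed_m_per_s'] *= KNOTS_TO_M_PER_S

# Air quality data: merge the yearly files and resample the hourly readings to
# daily (24-hour) averages, so NO2 becomes NO2_avg and so on
yearly_files = ['rathmines_2011.csv', 'rathmines_2012.csv']  # ... one file per year, 2011 to 2016
aq_df = pd.concat(pd.read_csv(f, parse_dates=['date'], index_col='date')
                  for f in yearly_files)
aq_df = aq_df.resample('D').mean().add_suffix('_avg')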

After this preliminary data transformation, we published the new data in our S3 bucket. We are now ready to look into the data by using the notebook-hosting capabilities of Amazon SageMaker.

Exploring the data using an Amazon SageMaker notebook

To start looking into our data, we decided to make use of Amazon SageMaker notebook hosting functionality, which enables you to have a Jupyter notebook with popular data analysis and machine learning libraries pre-installed, as well as access to the Amazon SageMaker Python SDK.

Let’s launch a notebook instance in Amazon SageMaker. To do this, go to the Amazon SageMaker console, and choose Create notebook instance to create a Jupyter Notebook.

Next, give your notebook instance a name and create a new IAM role that gives Amazon SageMaker access to S3 by choosing Create a new role in the IAM Role selector. It is important that this IAM role has access to the S3 bucket containing the dataset that you created in the previous step, so that it can be pulled into the notebook.

It takes a few minutes for the Amazon SageMaker notebook to be available. When it’s available, choose Open in the Actions column. This will take you to the Jupyter environment, where we’ll be able to run our notebook.

Now, download the companion notebook from here and upload it, using the Upload option in the Jupyter console at the top right.

After opening the notebook, we first load the relevant libraries.

%matplotlib inline
import pandas as pd
from datetime import datetime
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

These libraries will help us analyze the data using pandas, a popular data manipulation tool, as well as numpy, the de facto scientific computing library for Python. Seaborn and matplotlib power our visualizations.

Loading prepared data into the Amazon SageMaker notebook from S3

Now that we have the notebook ready for use with the right libraries imported, we can import the data. For this, we will use the pandas library, which is great for exploring and massaging tabular data directly in Python. We can use the pandas.read_csv command, supplying it with the location of our data in S3. For both the air pollution and weather data, we changed the column names to something more readable (see the following code for an air pollution example). We also had to create our own date parser, due to the specific format used for dates in the data.

#Air Pollution Data
#'parse' is the custom date parser defined in the companion notebook
col_names = ['daily_avg', 'nox_avg', 'no_avg', 'no2_avg']
nox_df = pd.read_csv('https://s3.amazonaws.com/aws-machine-learning-blog/artifacts/air-quality/Dublin_Rathmines_NOx_2011_2016_ugm3_daily.csv',
                    date_parser=parse,
                    parse_dates=['Daily_Avg'])
nox_df.columns = col_names

An initial check of the dates in the weather data revealed a problem with the imported dates. For example, 1945 was imported as 2045. This was fixed using the following piece of code.

from datetime import datetime
from dateutil.relativedelta import relativedelta

def convert_date_to_right_century(dt):
    if dt > datetime.now():
        dt -= relativedelta(years=100)
    return dt
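
As a quick sketch, the function can then be applied to the parsed date column of the weather data frame (the column name here is an assumption for illustration):

# Fix dates that were parsed into the wrong century (column name assumed)
weather_df['date'] = weather_df['date'].apply(convert_date_to_right_century)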

Finally, to ease the filtering of the data frames we set the index of the air pollution data frame.

#Setting date as the index
nox_df = nox_df.set_index('daily_avg')

# Print some records in the nox_df
nox_df.head()

Exploratory data analysis (EDA) – Data cleaning and exploration

Exploratory data analysis is the attempt to summarize the data through visualizations or summary statistics before performing any statistical modelling on a data set. You can use EDA to get a general understanding of the data structure and validity. It is also valuable to get a general overview of what associations might exist between the datasets, which will guide us in building the right predictive model based on the particularities of the data.

Cleaning the data

Initial graphs of the air pollution data showed that there was a problem in 2014, with negative readings of pollution concentrations for all the oxides of nitrogen. This is physically impossible and was probably due to a sensor calibration error at the air quality monitoring station. When we studied this data in more detail, we noticed gaps in the data for 2014, as shown in the following graph.

Given that there are gaps in 2014, it is reasonable to assume that there are gaps in other years as well. To fix this, we linearly interpolated over the whole data set with the ‘time’ method:

nox_df = nox_df.interpolate(method='time')

For the purposes of this exercise, we fixed the sensor drift in 2014 by simply replacing the spurious values. We decided to replace values that are less than or equal to 0 with 5:

nox_df["no2_avg"] = nox_df["no2_avg"].apply(lambda x: 5 if x <= 0 else x)

Visualizing some key insights of the data

All visualizations were produced within Amazon SageMaker using the open source Python matplotlib library.

The first visualization is a figure of boxplots of NO2 concentrations for each year, 2011 to 2016. A boxplot is a great way to understand the spread of the data. In this plot:

  • The box spans Q1 to Q3. Q1 (the first quartile) is the middle value between the lowest value and the median; Q3 (the third quartile) is the middle value between the highest value and the median.
  • The bar across the box is the median value.
  • The whiskers extend 1.5 times the interquartile range (Q3 – Q1) from the edges of the box.
  • The dots beyond the whiskers are the outliers.

From the figure we can see that there is some variation over the years but that the main body of the data (the box) stays within the range of 10 to 30 µg/m3.

The number of extreme values (values outside the whiskers) varies from year to year, which fits with the idea of pollution episodes being linked to unusual events. It is also clear that the data problems in 2014 have distorted the distribution of that year's data.
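
As a rough sketch, a yearly boxplot like this can be produced directly from the date-indexed data frame, reusing the libraries imported earlier:

# Yearly boxplots of daily NO2 averages (a sketch; styling details omitted)
nox_df['year'] = nox_df.index.year
sns.boxplot(x='year', y='no2_avg', data=nox_df)
plt.ylabel('NO2 daily average (µg/m3)')
plt.show()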

Some scatter plots of meteorological parameters against NO2 concentrations led to the following observations:

Wind speed versus NO2 concentration. As wind speed increases, pollution concentration decreases.

Atmospheric pressure versus NO2 concentration. As atmospheric pressure increases, so do the concentrations of NO2.

To get an insight into the possible relationship between air pollution and wind direction, an air quality rose graph was produced within Amazon SageMaker by adapting an open source script for a wind rose (see the following graph, and see the Acknowledgements for a link to the Wind Rose code). An air quality rose is a graph that gives a view of how air quality and wind direction are distributed. The length of each spoke around the rose is related to the frequency with which the wind blows from a particular direction. Each concentric circle represents a different frequency (see the values on the graph), starting with 0% at the center and increasing progressively outwards. The color bands on the spokes show the air pollution concentration range for that wind direction. When looking at the air quality rose, note that the prevailing wind for Dublin is approximately south-westerly and that wind speeds, other than for westerly winds, tend to be low.

AQ Rose: An air quality rose for Dublin Airport, showing the relationship between NO2 concentrations and wind direction for the years 2011 to 2016.

In the graph above, ‘Min’ is a value less than 7 µg/m3. The insight that we can take from the air quality rose is that higher concentrations of air pollutants are associated with winds coming from easterly to northerly directions.

A characteristic of time series data is that the data may be strongly associated with a lagged copy of itself. This autocorrelation property of the NO2 concentrations was investigated by graphing the NO2 data against a one-day lag of itself (see the following graph).

NO2 and NO2 one day lag, scatter plot

From this figure we can see that there is a strong correlation between the two, meaning that the previous day's NO2 concentration can be used as a strong predictor of the current day's NO2 concentration in our model.
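
As a sketch, this one-day lag feature (which appears later as no2_avg_prev) can be built with a pandas shift and plotted against the original series:

# Build and plot the one-day lag feature (column names as used above)
nox_df['no2_avg_prev'] = nox_df['no2_avg'].shift(1)
plt.scatter(nox_df['no2_avg_prev'], nox_df['no2_avg'], s=5)
plt.xlabel('NO2 previous day (µg/m3)')
plt.ylabel('NO2 (µg/m3)')
plt.show()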

Finally, let’s generate a correlation matrix of the variables in our data, to get final insights into the relations between our variables and guide us in selecting the model to power the predictions.
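
A sketch of how such a matrix can be produced with seaborn, assuming a data frame (called merged_df here purely for illustration) that holds the merged weather and air quality features:

# Correlation matrix of the merged features, rendered as a heatmap (a sketch;
# merged_df is an assumed name for the joined weather and air quality data)
corr = merged_df.corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()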

From this matrix, we can confirm that no2_avg and no2_avg_prev are strongly correlated (0.8), which makes sense because air quality is time-dependent, with the pollutant levels of the previous day impacting the next day. Interestingly, there is also a weak negative correlation between the NO2 concentrations and the wind speed, which means that we can expect NO2 concentrations to go down when the wind speed increases. This is also a result we derived from the air quality rose. Most of the other variables are weakly correlated.

Summarizing the EDA

The exploratory data analysis of the weather and air pollution data has led to the following insights for our air quality monitoring site in Dublin:

  • As wind speed increases, air quality improves.
  • As atmospheric pressure increases, air quality decreases.
  • Air quality is associated with wind direction. Easterly winds are associated with poorer air quality and lower wind speeds.
  • The time series of NO2 is strongly associated with a time lagged version of itself.

From those insights, it is clear that we need to build our features with a lagged version of the NO2 concentration, and incorporate the wind and atmospheric pressure variables as features in our model. In our case, we will build a model to predict the NO2 concentration of any given day. The NO2 concentration is called the target variable in a machine learning setting.

Generally speaking, we can follow two approaches when dealing with time-dependent data:

  • A pure time-series approach, where the aim will be to forecast the target (NO2) based on the NO2 value of the previous days and the other features of our data (wind speed, wind direction, and pressure).
  • A hybrid approach, where we use the features of our data as predictors of the NO2 concentration, and incorporate some information about the relation in time of NO2 (such as a lagged version of the NO2).

Creating a machine learning model to predict air quality

To start small, we will follow the second approach, where we will build a model that will predict the NO2 concentration of any given day based on wind speed, wind direction, maximum temperature, pressure values of that day, and the NO2 concentration of the previous day. For this we will use the Linear Learner algorithm provided in Amazon SageMaker, enabling us to quickly build a model with minimal work.

Our model will consist of taking all of the variables in our dataset, and using them as features to the Linear Learner algorithm available in Amazon SageMaker.

However, before we can even start developing the model, we need to build a training and test set from our data. The training set will be used to train our model, while the test set will be used to evaluate the quality of the predictions of our model based on data the model has never seen. As our data is in time series format, we need to take extra care when splitting the data into our different sets. We want to train only on values that occur prior to the values in our test set, so we don’t bias our model with future observations. In our case, we used the data from years 2011 up to 2015 as training, and the year 2016 as our candidate year for testing and validating our model.
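
A rough sketch of this time-based split, assuming the same date-indexed merged_df data frame used for the correlation matrix above (variable names are illustrative):

# Train on 2011-2015, hold out 2016 for testing (a sketch)
train_df = merged_df.loc['2011':'2015']
test_df = merged_df.loc['2016']
x_train, y_train = train_df.drop(columns=['no2_avg']), train_df['no2_avg']
x_test, y_test = test_df.drop(columns=['no2_avg']), test_df['no2_avg']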

The Linear Learner algorithm is a simple yet powerful algorithm that can be used both for classification tasks (will tomorrow be sunny?) and regression tasks (what are tomorrow’s predicted NO2 concentration levels?). In our case, we will use the regression setting of this algorithm. This will in turn be used to generate one-step ahead forecasts.

To showcase how simple it is to use the Linear Learner algorithm, we will use the Amazon SageMaker Python SDK that comes with a custom built Estimator for that algorithm. An Estimator is a general concept in machine learning that encapsulates an algorithm, along with its hyperparameters, and specific metadata.

# LinearLearner and get_execution_role come from the SageMaker Python SDK;
# output_location and data_location are S3 paths defined earlier in the notebook
from sagemaker import LinearLearner, get_execution_role

role = get_execution_role()

llearner = LinearLearner(role=role,
                predictor_type='regressor',
                normalize_data=True,
                normalize_label=True,
                train_instance_count=1,
                train_instance_type='ml.c5.xlarge',
                output_path=output_location,
                data_location=data_location)

The LinearLearner estimator in Amazon SageMaker requires you to pass in the Amazon SageMaker role that you created beforehand, which can be retrieved using the get_execution_role method. An important parameter in the estimator is predictor_type, which we set to make this a regression task. Machine learning algorithms often require you to normalize and scale your data, so that features spanning different ranges or magnitudes do not skew the model. Rather than doing this by hand, we can ask the Linear Learner estimator to do it for us during training, by setting the normalize_data and normalize_label parameters to true.

We will train this model on a single ml.c5.xlarge EC2 instance. The last two parameters specify where Amazon SageMaker can copy our training data in S3, and where to store the resulting model along with its learned parameters after the training.

Now that we’ve understood how to initialize the Linear Learner estimator, we can call the fit function on it, which will launch the training process. In the call to fit, we give our training data captured in the x_train variable for the features, and the y_train variable for the target.

There are two important aspects that you need to take into account when supplying the data to the estimator (a sketch of the resulting call follows this list):

  • The data needs to be supplied using the record_set function of the estimator, which converts our data into the right format for the algorithm to understand it.
  • We also need to explicitly give the types of each of our training data variables for the call to the record_set function to work. In our case, the features and target values are decimal float values, so we can use the ‘float32’ data type.
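
A minimal sketch of what this call can look like, reusing the x_train and y_train variables from the split above:

# Convert the training features and target to float32 records and launch the job
train_records = llearner.record_set(x_train.values.astype('float32'),
                                    labels=y_train.values.astype('float32'))
llearner.fit(train_records)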

If you are following along and you already ran the call to fit, behind the scenes your training data was copied into the Amazon S3 bucket that we defined earlier, and a training job using the Linear Learner algorithm was launched in Amazon SageMaker. In the output of the call to fit, we can see the training running, trying to find the best parameters for our final model.

After a few minutes, the job will be complete, which you can also see in the Amazon SageMaker console.

In the output of the call to fit, we can find some metrics on the training job itself. One metric of particular interest is the Mean Squared Error (MSE), which will differ across runs but gives us details about how well our model is performing. We get this metric because we supplied the test data in the call to fit. This can be particularly valuable as we iteratively build a better model: we can tweak the parameters of the algorithm, launch a new training job, and review its performance metrics until we are satisfied with it.

Go ahead and try different parameters in the estimator. For example, you could use the num_models parameter of the LinearLearner estimator to tweak how many models are trained in parallel, enabling you to find better solutions by exploring variations of the parameters. If you want to go further, you can also use Amazon SageMaker automatic model tuning to find the optimal parameters of your model, optimized against a metric that you choose, for example the MSE that we find in the output of the training job. Because this is a baseline, the current model is good enough for our use case, and we will now deploy it to an endpoint to serve predictions.

Creating an endpoint that serves our model is simple in Amazon SageMaker. A single call to deploy using the estimator variable is enough. We will only be using the endpoint for predicting a single batch of values, so we can use a single ml.t2.medium EC2 instance.

llearner_predictor = llearner.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')

Similar to the training job process, when calling deploy, an endpoint is created in Amazon SageMaker to serve the model created by the previous training job. This operation takes a few minutes, and we can monitor its progress both in the output of the call to deploy and in the Amazon SageMaker console.

Once the endpoint is up and running, we can get predictions by sending the data we want predictions for to the predict method. In our case, we have already prepared data for that purpose in the x_test variable, containing the features of our data except the target variable, which is the average NO2 concentration of the day.

result = llearner_predictor.predict(x_test.values.astype('float32'))

In the same manner as for the call to fit, we also need to specify the type of our data, which is still ‘float32’. The call to predict should be a matter of a few seconds, and the resulting predictions will be saved in the result variable.

To get the actual predictions back from the endpoint's response, there is some data conversion and transformation to perform, which is detailed in the companion notebook.
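
A sketch of that conversion, assuming the endpoint returns the SDK's protobuf record format with a single 'score' value per record (column names are illustrative assumptions):

# Extract the scalar prediction from each record and align the series with the test index
y_sm_pred = [r.label['score'].float32_tensor.values[0] for r in result]
y_sm_pred_df = pd.DataFrame(y_sm_pred, index=x_test.index, columns=['predicted'])
y_sm_test_df = pd.DataFrame(y_test.values, index=x_test.index, columns=['actual'])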

Now that we have the predictions for our test data, we will want to compare them to the actual expected values, also called the ground truth. To do this, we can plot the predicted values against the ground truth for our candidate test year of 2016.

plt.plot(y_sm_test_df, label='actual')
plt.plot(y_sm_pred_df, label='forecast')
plt.legend()
plt.show()

Generally speaking, we can see that the model manages to capture the main trends of 2016, following the same progression over time as the ground truth.

Let’s dive deeper into the quality of our model and its predictions by using some common forecasting metrics.

# sMAPE is used in KDD Air Quality challenge: https://biendata.com/competition/kdd_2018/evaluation/ 
def smape(actual, predicted):
    dividend= np.abs(np.array(actual) - np.array(predicted))
    denominator = np.array(actual) + np.array(predicted)
    
    return 2 * np.mean(np.divide(dividend, denominator, out=np.zeros_like(dividend), where=denominator!=0, casting='unsafe'))

# sqrt and the regression metrics used below come from the standard library and scikit-learn
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

def print_metrics(y_test, y_pred):
    print("RMSE: %.4f" % sqrt(mean_squared_error(y_test, y_pred)))
    print('Variance score: %.4f' % r2_score(y_test, y_pred))
    print('Explained variance score: %.4f' % explained_variance_score(y_test, y_pred))
    forecast_err = np.array(y_test) - np.array(y_pred)
    print('Forecast bias: %.4f' % (np.sum(forecast_err) * 1.0/len(y_pred)))
    print('sMAPE: %.4f' % smape(y_test, y_pred))

One metric of particular interest here is the Symmetric Mean Absolute Percentage Error (sMAPE). This is a metric that tells you how accurate your predictions are on the basis of the percentage of errors against the ground truth. Percentage-based metrics are often used in forecasting as your data can be defined in different scales, and those metrics are scale-independent. The sMAPE metric has been used, for example, in the KDD Air Quality Cup challenge that ran earlier this year.

Let’s get the scores for those metrics based on our actual values and the predicted values.
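
For example, reusing the actual and predicted series built above (a sketch):

print_metrics(y_sm_test_df['actual'], y_sm_pred_df['predicted'])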

Among those metrics, we can see that the RMSE is quite low, but as the graph shows, we are still making errors and not capturing all of the information, especially the peaks. The explained variance score shows that our model manages to explain a good part of the variance in our data.

Summary of our modelling

In this section, we made exclusive use of the Amazon SageMaker built-in algorithms, which enable you to use a wide variety of models developed to work at scale, with less code and overhead.

Conclusions and next steps

Let’s recap our steps. We first downloaded air quality data from Dublin and explored it using visualization techniques to gain valuable insights that helped us build a predictive model able to forecast NO2 concentration levels on a daily basis. We learned that NO2 concentrations are highly dependent on the previous day's levels, and that wind direction has a great impact on those levels. Using those insights, we built a baseline model using the Amazon SageMaker LinearLearner algorithm, enabling us to focus on the data rather than on setting up a model.

Amazon SageMaker provided a convenient, flexible, and powerful environment to execute a typical workflow for the study of air pollution.

We encourage you to follow our steps and use the air quality data of your city as an exercise. If you find some interesting outputs, share them with us in the comments!

Acknowledgements

Irish Weather Data:

Met Éireann retains Intellectual Property Rights and copyright over our data. If data are published in raw or processed format Met Éireann must be acknowledged as the source. Met Éireann does not accept any liability whatsoever for any error or omission in the data series, their availability, or for any loss or damage arising from their use. This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Irish Air Quality Data:

EPA, "EPA Ireland Archive of Nitrogen Oxides Monitoring Data". Associated datasets and digital information objects connected to this resource are available at the Secure Archive For Environmental Research Data (SAFER), managed by Environmental Protection Agency Ireland: http://erc.epa.ie/safer/resource?id=216a8992-76e5-102b-aa08-55a7497570d3 (Last Accessed: 2018-06-30). Both sources require, as their data usage license, that they be credited.

The Air Quality Rose was adapted from Wind Rose code that was published on GitHub under a BSD license:

https://github.com/Geosyntec/cloudside

Original code here:

https://gist.github.com/phobson/41b41bdd157a2bcf6e14#file-wind_rose-ipynb

About the Authors

Jonathan Taws is a Solutions Architect for AWS, specializing in Data Science. He focuses on helping customers make the right decisions for building and running their data science and machine learning workloads.

Conor Delaney is an Environmental Data Scientist on the AWS Open Data team. He specializes in helping customers utilize the environmental and Earth science data that AWS has made available through the Public Datasets Program.