Build regression models with Amazon Redshift ML

June 2023: This post was reviewed and updated for accuracy.

With the rapid growth of data, many organizations are finding it difficult to analyze their large datasets to gain insights. As businesses rely more and more on automation algorithms, machine learning (ML) has become a necessity to stay ahead of the competition.

Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates with Amazon SageMaker for ML. With Amazon Redshift ML, you can use simple SQL statements to create and train ML models from your data in Amazon Redshift and then use these models for a variety of use cases, such as classification of a binary or multiclass outcome or predicting a numeric value through regression. Amazon SageMaker Autopilot provides all the benefits of automatic model creation, but as an advanced user, you can also influence the model training by providing different parameters such as model type, objective, and so on.

Amazon Redshift ML allows you to address several ML challenges, such as the following:

Binary classification – Predict a true/false outcome, such as whether a customer will churn. To explore this specific use case further, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.
Multiclass classification – Identify the class of an input value within a discrete number of classes. For example, you can identify which will be the best-selling product.
Regression – Predict a numerical outcome, like the price of a house or how many people will use a city’s bike rental service.

You can use Amazon Redshift ML to automate data preparation, pre-processing, and selection of problem type as depicted in this blog post. In this post, we assume that you have a good understanding of your data and what problem type you want to use for your use case. We demonstrate how to use Amazon Redshift ML to solve a regression problem predicting bike rental counts. We also provide some best practices for creating test data, validating your model, and using it for inference. We also show you how you can use the SageMaker console to troubleshoot the training process as an advanced user.

Want to learn more about Amazon Redshift ML? These posts might interest you:

Prerequisites

As a prerequisite for implementing the example in this post, you need to set up an Amazon Redshift cluster with ML enabled on it. For the preliminary steps to get started, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

Solution overview

In this post, we use Amazon Redshift ML to build a regression model that predicts the number of people that may use the city of Toronto’s bike sharing service at any given hour of a day. The model accounts for various aspects, including holidays and weather conditions. Because we need to predict a numerical outcome, we create a regression model.

We walk you through the following high-level steps:

Input the raw data.
Prepare the input data.
Create the model.
Validate the predictions.

Input the raw data

To prepare the raw data for this model, we populated three tables in Amazon Redshift using different public datasets.

Table Name	Description	Attribution
Ridership	Bike Share Toronto Ridership Data for 2017 and 2018	Contains information licensed under the Open Government Licence – Toronto
Weather	Government of Canada historical weather data for 2017 and 2018	Based on Environment and Climate Change Canada data
Holiday	Historical holidays data by the University of Waterloo	Contains information provided by the University of Waterloo under license on an “as is” basis

You can use the following script to create and load the data in these tables in Amazon Redshift, and use the Amazon Redshift Query Editor to run these SQL scripts:

CREATE TABLE IF NOT EXISTS ridership
( trip_id               INT
, trip_duration_seconds INT
, trip_start_time       timestamp
, trip_stop_time        timestamp
, from_station_name     VARCHAR(50)
, to_station_name       VARCHAR(50)
, from_station_id       SMALLINT
, to_station_id         SMALLINT
, user_type             VARCHAR(20));

CREATE TABLE IF NOT EXISTS weather
( longitude_x         DECIMAL(5,2)
, latitude_y          DECIMAL(5,2)
, station_name        VARCHAR(20)
, climate_id          BIGINT
, datetime_utc        TIMESTAMP
, weather_year        SMALLINT
, weather_month       SMALLINT
, weather_day         SMALLINT
, time_utc            VARCHAR(5)
, temp_c              DECIMAL(5,2)
, temp_flag           VARCHAR(1)
, dew_point_temp_c    DECIMAL(5,2)
, dew_point_temp_flag VARCHAR(1)
, rel_hum             SMALLINT
, rel_hum_flag        VARCHAR(1)
, precip_amount_mm    DECIMAL(5,2)
, precip_amount_flag  VARCHAR(1)
, wind_dir_10s_deg    VARCHAR(10)
, wind_dir_flag       VARCHAR(1)
, wind_spd_kmh        VARCHAR(10)
, wind_spd_flag       VARCHAR(1)
, visibility_km       VARCHAR(10)
, visibility_flag     VARCHAR(1)
, stn_press_kpa       DECIMAL(5,2)
, stn_press_flag      VARCHAR(1)
, hmdx                SMALLINT
, hmdx_flag           VARCHAR(1)
, wind_chill          VARCHAR(10)
, wind_chill_flag     VARCHAR(1)
, weather             VARCHAR(10));

CREATE TABLE IF NOT EXISTS holiday
( holiday_date  DATE
, description VARCHAR(100));

To load the data, use the following COPY commands. Replace the AWS Identity and Access Management (IAM) role with the IAM role that you created as part of the prerequisite steps earlier.

COPY ridership FROM 
's3://redshift-ml-bikesharing-data/bike-sharing-data/ridership/'
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML'
FORMAT csv IGNOREHEADER 1 DATEFORMAT 'auto' TIMEFORMAT 'auto' REGION 'us-west-2' gzip;

COPY weather FROM
's3://redshift-ml-bikesharing-data/bike-sharing-data/weather/'
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML'
FORMAT csv IGNOREHEADER 1 DATEFORMAT 'auto' TIMEFORMAT 'auto' REGION 'us-west-2' gzip;

COPY holiday FROM
's3://redshift-ml-bikesharing-data/bike-sharing-data/holiday/'
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML'
FORMAT csv IGNOREHEADER 1 DATEFORMAT 'auto' TIMEFORMAT 'auto' REGION 'us-west-2' gzip;

Data preparation

Let’s discuss about how the data can be biased and how selecting the right distribution of data impacts accuracy. For most ML problems, data preparation is the most time-consuming process; it involves preparing the data, finding relevant attributes, and cleaning and curating it to be used as input to the ML model. Bias or anomalies in the input data distribution also play a key role in model accuracy, therefore it’s very important to curate that as much as possible. Let’s explore and prepare our input dataset.

Ridership

The ridership table contains Bike Share Toronto’s ridership information for 2017 and 2018. We performed the following data preparation steps to make it more meaningful for our ML model:

Transform the records to all be in the local time zone. The data before July 2017 was provided in UTC, whereas all data after that was in Toronto’s local time zone, EST.
Discard trips with a duration under 60 seconds.
Discard trips with a duration longer than a day.
Aggregate data to an hourly basis (rather than minute-level granularity) for better generalization.
Break down the trip date column into day, month, year, quarter, month_week, and week_day for better correlation with the number of rides.

We carried out all these transformations using the following simple view:

CREATE OR REPLACE VIEW ridership_view AS
SELECT
    trip_time
    , trip_count
    , TO_CHAR(trip_time,'hh24') ::INT trip_hour
    , TO_CHAR(trip_time, 'dd') :: INT trip_day
    , TO_CHAR(trip_time, 'mm') :: INT trip_month
    , TO_CHAR(trip_time, 'yy') :: INT trip_year
    , TO_CHAR(trip_time, 'q') :: INT trip_quarter
    , TO_CHAR(trip_time, 'w') :: INT trip_month_week
    , TO_CHAR(trip_time, 'd') :: INT trip_week_day
FROM  
    (SELECT  
         CASE
           WHEN TRUNC(r.trip_start_time) < '2017-07-01'::DATE
           THEN CONVERT_TIMEZONE('US/Eastern', DATE_TRUNC('hour',r.trip_start_time))
           ELSE DATE_TRUNC('hour',r.trip_start_time)
         END trip_time
         , COUNT(1) trip_count
     FROM    
         ridership r
     WHERE    r.trip_duration_seconds BETWEEN 60 AND 60 * 60 * 24
     GROUP BY
         1);

Weather

The weather table contains Government of Canada historical weather data for 2017 and 2018 in UTC format. We performed the following data preparation activities:

Transform the records from UTC to US/Eastern time zone.
Aggregate the data to an hourly basis for better generalization.
Because the majority of the attributes like visibility_km and wind_chill are empty in this dataset, we only use temp_c and precip_amount_mm columns for our model.

We use the following view in Amazon Redshift to apply these transformations:

CREATE OR REPLACE VIEW weather_view AS
SELECT  
    CONVERT_TIMEZONE('US/Eastern', 
      DATE_TRUNC('hour',datetime_utc)) daytime
    , ROUND(AVG(temp_c)) temp_c
    , ROUND(AVG(precip_amount_mm)) precip_amount_mm
FROM weather
GROUP BY 1;

Trip data

In this step, we create a new table, trip_data, in Amazon Redshift to combine all relevant attributes from the input tables. We also create a random ordered attribute named serial_number to denote the serial number of each record:

CREATE TABLE trip_data AS 
SELECT         
   r.trip_time
  ,r.trip_count
  ,r.trip_hour
  ,r.trip_day
  ,r.trip_month
  ,r.trip_year
  ,r.trip_quarter
  ,r.trip_month_week
  ,r.trip_week_day
  ,w.temp_c
  ,w.precip_amount_mm
  ,CASE
      WHEN h.holiday_date IS NOT NULL
      THEN 1
      WHEN TO_CHAR(r.trip_time,'D')::INT IN (1,7)
      THEN 1
      ELSE 0
    END is_holiday
  , ROW_NUMBER() OVER (ORDER BY RANDOM()) serial_number
FROM           
  ridership_view r
JOIN            weather_view w
  ON ( r.trip_time = w.daytime )
LEFT OUTER JOIN holiday h
  ON ( TRUNC(r.trip_time) = h.holiday_date );

The following is an example SELECT query:

select * from trip_data limit 5;

The following table summarizes our output.

Trip time	trip count	trip hour	trip day	trip month	trip year	trip quarter	trip month week	trip week day	temp c	precip amount mm	is holiday	serial number
2017-07-04 23:00	105	23	4	7	17	3	1	3	20	0	0	1
2018-09-03 9:00	233	9	3	9	18	3	1	2	25	0	1	2
2017-09-29 20:00	185	20	29	9	17	3	5	6	12	0	0	3
2017-04-09 23:00	28	23	9	4	17	2	2	1	12	0	1	4
2017-02-01 16:00	175	16	1	2	17	1	1	4	2	0	0	5

We can create an ML model in Amazon Redshift to predict the trip_count column in the preceding table for any given trip time. But before that, we may want to view the relationship of different attributes of this table with our target attribute, trip_count. The Pearson correlation coefficient is a popular method to find how strong a relationship is between two attributes. It returns a value between -1 and 1, in which 1 indicates a strong positive relationship and -1 indicates a strong negative relationship. A result of zero indicates no relationship at all.

We use the stored procedure sp_correlation, available in the Amazon Redshift Utilities GitHub repo, to view the relationship between our target column trip_count and the other numeric attributes in our dataset. This stored procedure allows us to get correlation of an attribute with other attributes in a table in Amazon Redshift.

call sp_correlation('public','trip_data','trip_count','tmp_corr_table');
select * from tmp_corr_table;

The following table summarizes our output.

trip count	trip hour	trip day	trip month	trip year	trip quarter	trip month week	trip week day	temp c	precip amount mm	is holiday	serial number
1	0.32	0.02	0.19	0.12	0.20	0.02	0.02	0.52	-0.07	-0.12	0.00

The output shows that temperature and hour of the day have a strong positive correlation with the number of bike rentals. Similarly, precipitation and holidays have a negative correlation with the number of bike rentals. But serial number data has zero correlation because it’s a random number and we can safely exclude such columns in our ML model.

Create a model in Amazon Redshift

Now that our data preparation steps are complete, we can use the create model statement in Amazon Redshift to create our ML regression model. But before that, we want to divide our data into two sets by splitting 80% of the dataset for training and 20% for validation, which a common practice in ML. The training data is input to the ML model to identify the best possible algorithm for the model. After the model is created, we use the validation data to validate the model accuracy. We create the tables training_data and validation_data using the 20% threshold for our random ordered column serial_number with the following code:

CREATE TABLE training_data AS SELECT
 trip_count,trip_hour,trip_day,trip_month,trip_year,trip_quarter,
 trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday
FROM trip_data WHERE serial_number > 
(SELECT COUNT(1) * 0.2 FROM  trip_data);

CREATE TABLE validation_data AS SELECT
 trip_count,trip_hour,trip_day,trip_month,trip_year,trip_quarter,
 trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday, trip_time
FROM trip_data WHERE serial_number <= 
(SELECT COUNT(1) * 0.2 FROM  trip_data);

After that, we run the create model command in Amazon Redshift, which creates a regression model to predict the trip_count value for any input date and time:

CREATE MODEL predict_rental_count 
FROM training_data
TARGET trip_count
FUNCTION predict_rental_count
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML'
PROBLEM_TYPE regression
OBJECTIVE 'mse'
SETTINGS (s3_bucket 'redshiftml-<your-account-id>',
          s3_garbage_collect off,
          max_runtime 5000);

We define the following parameters in our create model statement:

Problem type – Because we’re predicting a numeric outcome, we provide the problem type, but it’s an optional parameter.
Objective – We specified MSE (mean square error) as our objective, which is a common metric for evaluation of regression problems.
S3 garbage collect – Setting this parameter to OFF retains the intermediate dataset used to train the models by SageMaker, which may help us troubleshoot any issues.
Max runtime – This parameter denotes how long the model training can run. Specifying a larger value may help create a better tuned model. The default value for this parameter is 5,400 (90 minutes).

The preceding statement takes a few seconds to complete. It initiates an Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.

Validate predictions with mean square error

In this step, we evaluate the accuracy of our ML model against our validation data.

For regression problems, typical accuracy metrics are mean square error (MSE) and root mean square error (RMSE), which measure the distance between the predicted numeric target and the actual numeric answer, also known as ground truth. We square the differences between the actual and predicted values and then find the average to calculate the MSE. The square root of MSE computes the RMSE. A model with a low MSE and RMSE score indicates a good model.

While creating the model, Autopilot automatically splits the input data into train and validation sets and selects the model with the best objective metric, which is deployed in the Amazon Redshift cluster. We use the show model statement to view various metrics, including the calculated MSE by SageMaker while creating the model:

show model predict_rental_count;

The following output shows an MSE of 1960.272217.

Key	Value
Model Name	predict_rental_count
Schema Name	Public
Owner	Awsuser
Creation Time	Mon, 15.02.2021 16:13:18
Model State	READY
validation:mse	1960.272217
Estimated Cost	4.215379
TRAINING DATA:
Query	SELECT TRIP_COUNT, TRIP_HOUR, TRIP_DAY, TRIP_MONTH, TRIP_YEAR, TRIP_QUARTER, TRIP_MONTH_WEEK, TRIP_WEEK_DAY, TEMP_C, PRECIP_AMOUNT_MM, IS_HOLIDAY
	FROM TRAINING_DATA
Target Column	TRIP_COUNT
PARAMETERS:
Model Type	Auto
Problem Type	Regression
Objective	MSE
Function Name	predict_rental_count
Function Parameters	trip_hour trip_day trip_month trip_year trip_quarter trip_month_week trip_week_day temp_c precip_amount_mm is_holiday
Function Parameter Types	int4 int4 int4 int4 int4 int4 int4 numeric numeric int4
IAM Role	arn:aws:iam::xxxxxxxxxxxx:role/redshift-sagemaker-role
s3 Bucket	redshift-xxxxxxxxx-bucket
Max Runtime	5400

Let’s run inference queries against our validation data using the following SQL query:

 SELECT
      trip_time
    , actual_count
    , predicted_count
    , ( actual_count - predicted_count ) difference
FROM  
    (SELECT
         trip_time
         , trip_count AS actual_count
         , PREDICT_RENTAL_COUNT (trip_hour, trip_day, trip_month, trip_year, trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday) predicted_count
     FROM  
         validation_data) LIMIT 5;

The following table summarizes our output.

trip_time	actual_count	predicted_count	difference
2017-09-29 20:00	185	246	-61
2017-01-02 17:00	76	75	1
2018-12-24 18:00	70	111	-41
2018-02-02 14:00	70	64	6
2018-07-08 22:00	273	299	-26

We can also calculate the MSE and RMSE metrics based on our validation data:

SELECT
      ROUND(AVG(POWER(( actual_count - predicted_count ),2)),2) mse
    , ROUND(SQRT(AVG(POWER(( actual_count - predicted_count ),2))),2) rmse
FROM  
    (SELECT
           trip_time
         , trip_count AS actual_count
         , PREDICT_RENTAL_COUNT (trip_hour, trip_day, trip_month, trip_year, trip_quarter, trip_month_week, trip_week_day, temp_c, precip_amount_mm, is_holiday) predicted_count
     FROM  validation_data);

We get the following output.

MSE	RMSE
1692.59	41.14

The RMSE value indicates that our model is accurate enough to the actual values from our validation dataset.

Troubleshooting

Although the Create Model statement in Amazon Redshift automatically takes care of initiating Autopilot process to build, train, and tune the best ML model and deploy that model in Amazon Redshift, you can also view the intermediate steps performed in this process, which may help you with troubleshooting if something goes wrong.

Amazon S3 bucket

When you create the model, you need to enter an Amazon Simple Storage Service (Amazon S3) bucket name as the value for the parameter s3_bucket. This bucket is used to share training data and artifacts between Amazon Redshift and SageMaker. Amazon Redshift creates a subfolder in this bucket prior to unloading the training data. When training is complete, it deletes this subfolder and its contents unless you set the parameter s3_garbage_collect to off, which may be used for troubleshooting purposes. For more information, see CREATE MODEL.

SageMaker console

If you navigate to the SageMaker console, you can view the training jobs and hyperparameter tuning jobs initiated by the Create Model process. You need appropriate access permissions, like the AmazonSageMakerReadOnly policy, to view these details. For more information, see AWS Managed (Predefined) Policies for Amazon SageMaker.

In the hyperparameter tuning jobs section on the SageMaker console, you can view the best training job selected by SageMaker based on the MSE metrics, along with different hyperparameter values. The following screenshot shows that our best training job had an MSE value of 1960.272217, similar to what we saw in the Amazon Redshift show model statement.

Amazon SageMaker Studio

Amazon SageMaker Studio is a web-based integrated development environment (IDE) for ML. You can view your model training activities on the Experiments and trials page. You can also view job metrics by choosing the option Describe AutoML Job, as shown in the following screenshot.

Studio also allows you to view two useful notebooks:

Data exploration notebook – Describes what Autopilot learned about the input data from Amazon Redshift
Candidate generation notebook – Shows how Autopilot used this information about the data to generate candidates

For more information, see Amazon SageMaker Autopilot notebooks generated to manage AutoML tasks.

Conclusion

Amazon Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Amazon Redshift ML to gain business insights for your data. You provide the data that you want to train a model. Then Amazon Redshift ML creates a model that capture patterns in the input data. You can then use that model to make predictions using familiar SQL commands. To learn more about Amazon RedShift ML, visit the webpage.

About the Authors

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data driven solutions in different database and data warehousing technologies for over fifteen years. He loves to learn new technologies and solving, automating and simplifying customer problems with easy-to-use cloud data solutions on AWS.

Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.

Gokul Soundararajan is a principal engineer at AWS and received a PhD from University of Toronto and has been working in the areas of storage, databases, and analytics.

Piali Das is a Senior Software Engineer in the Amazon SageMaker Autopilot team. She previously contributed to building SageMaker Algorithms. She enjoys scientific programming in general and has developed an interest in machine learning and distributed systems.

AWS Machine Learning Blog

Build regression models with Amazon Redshift ML

Prerequisites

Solution overview

Input the raw data

Data preparation

Ridership

Weather

Trip data

Create a model in Amazon Redshift

Validate predictions with mean square error

Troubleshooting

Amazon S3 bucket

SageMaker console

Amazon SageMaker Studio

Conclusion

About the Authors

Resources

Blog Topics

Follow