AWS Big Data Blog

Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML

December 2022: Post was reviewed and updated to announce support of Prediction Probabilities for Classification problems using Amazon Redshift ML.

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price–performance. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become an expert in ML.

This post shows you how to use familiar SQL statements to create and train ML models from data in Amazon Redshift and use these models to make in-database predictions on new data for use cases such as churn prediction and fraud risk scoring.

ML use cases relevant to data warehousing

You may use different ML approaches according to what’s relevant for your business, such as supervised, unsupervised, and reinforcement learning. With this release, Redshift ML supports supervised learning, which is most commonly used in enterprises for advanced analytics. Supervised learning is preferred when you have a training dataset and an understanding of how specific input data predicts various business outcomes. The inputs used for the ML model are often referred to as features, and the outcomes or results are called targets or labels. Your training dataset is a table or a query, extracted from your data warehouse, whose attributes or columns comprise the features and the target. The following diagram illustrates this architecture.

You can use supervised training for advanced analytics use cases ranging from forecasting and personalization to customer churn prediction. Let’s consider a customer churn prediction use case. The columns that describe customer information and usage are features, and the customer status (active vs. inactive) is the target or label.

The following table shows different types of use cases and algorithms used.

Use Case	Algorithm / Problem Type
Customer churn prediction	Classification
Predict if a sales lead will close	Classification
Fraud detection	Classification
Price and revenue prediction	Linear regression
Customer lifetime value prediction	Linear regression
Detect if a customer is going to default on a loan	Logistic regression

Current ways to use ML in your data warehouse

You may rely on ML experts to build and train models on your behalf or invest a lot of time into learning new tools and technology to do so yourself. For example, you might need to identify the appropriate ML algorithms in SageMaker or use Amazon SageMaker Autopilot for your use case, then export the data from your data warehouse and prepare the training data to work with these model types.

Data analysts and database developers are familiar with SQL. Unfortunately, you often have to learn a new programming language (such as Python or R) to build, train, and deploy ML models in SageMaker. When the model is deployed and you want to use it with new data for making predictions (also known as inference), you need to repeatedly move the data back and forth between Amazon Redshift and SageMaker through a series of manual and complicated steps:

  1. Export training data to Amazon Simple Storage Service (Amazon S3).
  2. Train the model in SageMaker.
  3. Export prediction input data to Amazon S3.
  4. Run prediction in SageMaker.
  5. Import predicted columns back into the database.

The following diagram illustrates this workflow.
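To make the first step concrete, exporting training data typically means running an UNLOAD command along these lines (a hedged sketch; the bucket name and query here are illustrative, not part of this post's walkthrough):

-- Step 1, done by hand: export training data from Redshift to Amazon S3
-- (illustrative bucket name and query)
UNLOAD ('SELECT * FROM customer_activity')
TO 's3://your-training-bucket/churn/training_data_'
IAM_ROLE 'arn:aws:iam::<your-account-id>:role/RedshiftML'
CSV HEADER;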

This iterative process is time-consuming and prone to errors, and automating the data movement can take weeks or months of custom coding that then needs to be maintained. Redshift ML enables you to use ML with your data in Amazon Redshift without this complexity.

Introducing Amazon Redshift ML

To create an ML model, as a data analyst, you can use a simple SQL query to specify the data in Amazon Redshift you want to use as the data inputs to train your model and the output you want to predict. For example, to create a model that predicts customer churn, you can query columns in one or more tables in Amazon Redshift that include the customer profile information and historical account activity as the inputs, and the column showing whether the customer is active or inactive as the output you want to predict.

When you run the SQL command to create the model, Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training. Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation. After the model is trained, Redshift ML makes it available as a SQL function in your Amazon Redshift data warehouse by compiling it via Amazon SageMaker Neo. The following diagram illustrates this solution.

Benefits of Amazon Redshift ML

Redshift ML provides the following benefits:

  • Allows you to create and train ML models with simple SQL commands without having to learn external tools
  • Provides you with flexibility to use automatic algorithm selection
  • Automatically preprocesses data and creates, trains, and deploys models
  • Enables advanced users to specify problem type
  • Enables ML experts such as data scientists to select algorithms such as XGBoost or MLP and specify hyperparameters and preprocessors
  • Enables you to generate predictions using SQL without having to ship data outside your data warehouse
  • Allows you to pay only for training; prediction is included with the costs of your cluster (typically, ML predictions drive cost in production)

In this post, we look at a simple example that you can use to get started with Redshift ML.

To train data for a model that predicts customer churn, Autopilot preprocesses the training data, finds the algorithm that provides the best accuracy, and applies it to the training data to build a performant model.

We provide step-by-step guidance to create a cluster, create sample schema, load data, create your first ML model in Amazon Redshift, and invoke the prediction function from your queries.

Prerequisites for enabling Amazon Redshift ML

As an Amazon Redshift administrator, the following steps are required to create your Amazon Redshift cluster for using Redshift ML:

  1. On the Amazon S3 console, create an S3 bucket that Redshift ML uses for uploading the training data that SageMaker uses to train the model.

For this post, we name the bucket redshiftml-<your_account_id>. Make sure that you create your S3 bucket in the same AWS Region where you create your Amazon Redshift cluster.

  2. Create an AWS Identity and Access Management (IAM) role named RedshiftML with the policy that we provide in this section.

Although it’s easy to get started with AmazonS3FullAccess and AmazonSageMakerFullAccess, we recommend using the minimal policy that we provide (if you already have an existing IAM role, just add the policy to that role). Refer to the Amazon Redshift ML guide if you need to use AWS KMS or enhanced VPC routing.

To use or modify this policy, replace <your-account-id> with your AWS account number. The policy assumes that you have created the IAM role RedshiftML and the S3 bucket redshiftml-<your_account_id>. The S3 bucket redshift-downloads is from where we load the sample data used in this post.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
                "sagemaker:*Job*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "s3:AbortMultipartUpload",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:iam::<your-account-id>:role/RedshiftML",
                "arn:aws:s3:::redshiftml-<your-account-id>/*",
                "arn:aws:s3:::redshift-downloads/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::redshiftml-<your-account-id>",
                "arn:aws:s3:::redshift-downloads"
            
            ]
        }
    ]
} 

For instructions, see Creating IAM roles.

  3. On the role details page, on the Trust relationships tab, choose Edit trust relationship.
  4. Enter the following trust relationship definition to trust Amazon Redshift and SageMaker:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "redshift.amazonaws.com",
              "sagemaker.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

  5. On the Amazon Redshift console, create a new Amazon Redshift cluster.
  6. Attach the IAM role that you created earlier (RedshiftML).
  7. Create the cluster with the current track.

When your cluster creation is complete and the cluster is up and running, you can create accounts for data analysts on an Amazon Redshift cluster. For this post, we create a user named demouser.

  8. Use the Amazon Redshift Query Editor or your preferred SQL client to connect to Amazon Redshift as an administrator and run the following command:
    create user demouser with password '<yourpassword>';
    Alternatively, you can download a notebook here that runs all the SQL commands in this post; this blog includes instructions on how to import and use notebooks.
  9. Grant CREATE MODEL privileges to your users. The following code grants privileges to the demouser user for creating a model:
    GRANT CREATE MODEL TO demouser;

Loading sample data

We use a customer churn model in this post. As an admin or database developer, you have to create the schema and load data into Amazon Redshift. This dataset is attributed to the University of California Irvine Repository of Machine Learning Datasets (Jafari-Marandi, R., Denton, J., Idris, A., Smith, B. K., & Keramati, A. (2020). Optimum Profit-Driven Churn Decision Making: Innovative Artificial Neural Networks in Telecom Industry. Neural Computing and Applications). We have modified this data for use with Redshift ML.

  1. Create a schema named demo_ml that stores the example table and the ML model that we create:
    CREATE SCHEMA DEMO_ML;

In the next steps, we create the sample table and load data into the table that we use to train the ML model.

  2. Create the table in the demo_ml schema:
    CREATE TABLE demo_ml.customer_activity (
    state varchar(2), 
    account_length int, 
    area_code int,
    phone varchar(8), 
    intl_plan varchar(3), 
    vMail_plan varchar(3),
    vMail_message int, 
    day_mins float, 
    day_calls int, 
    day_charge float,
    total_charge float,
    eve_mins float, 
    eve_calls int, 
    eve_charge float, 
    night_mins float,
    night_calls int, 
    night_charge float, 
    intl_mins float, 
    intl_calls int,
    intl_charge float, 
    cust_serv_calls int, 
    churn varchar(6),
    record_date date);
  3. Load the sample data by using the following command. Replace the IAM role ARN and account ID as appropriate for your environment:
    COPY DEMO_ML.customer_activity 
    FROM 's3://redshift-downloads/redshift-ml/customer_activity/' 
    IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML' delimiter ',' IGNOREHEADER 1  
    region 'us-east-1';
  4. The demouser user should also have the usual SELECT access to the tables with the data used for training:
    GRANT SELECT on demo_ml.customer_activity TO demouser;
  5. You also need to grant CREATE and USAGE on the schema to allow users to create models and query using the ML inference functions in the demo_ml schema:
    GRANT CREATE, USAGE ON SCHEMA demo_ml TO demouser;

Now the analyst (demouser) can train a model.

Create and train your first ML model

Use your preferred SQL client to connect to your Amazon Redshift cluster as the demouser user that your admin created. Run the following command to create your model named customer_churn_model:

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend, 
             cust_serv_calls/account_length AS average_daily_cases,
             churn 
      FROM demo_ml.customer_activity
      WHERE record_date < '2020-01-01' 
     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<accountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml-<your-account-id>'
);

The SELECT query in the FROM clause specifies the training data. The TARGET clause specifies which column is the label that the model is trained to predict. The other columns in the training query are the features (inputs) used for the prediction. In this example, the training data provides the features state, area code, average daily spend, and average daily cases for customers who were active before January 1, 2020. The target column churn indicates whether the customer still has an active membership or has suspended their membership. For more information about CREATE MODEL syntax, see the Amazon Redshift Database Developer Guide.
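Although this example relies on Autopilot to select everything automatically, advanced users can guide training. As a hedged sketch (the model name, function name, and objective here are illustrative choices, not part of this post's walkthrough), a user could explicitly pin the model type, problem type, and objective like this:

CREATE MODEL demo_ml.customer_churn_model_xgb
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend,
             cust_serv_calls/account_length AS average_daily_cases,
             churn
      FROM demo_ml.customer_activity
      WHERE record_date < '2020-01-01'
     )
TARGET churn
FUNCTION predict_customer_churn_xgb
IAM_ROLE 'arn:aws:iam::<accountID>:role/RedshiftML'
MODEL_TYPE XGBOOST                      -- skip automatic algorithm selection
PROBLEM_TYPE BINARY_CLASSIFICATION      -- don't let Autopilot infer the problem type
OBJECTIVE 'F1'                          -- optimize for F1 instead of the default
SETTINGS (
  S3_BUCKET 'redshiftml-<your-account-id>'
);

Pinning these options skips part of Autopilot's exploration, which can shorten training time at the cost of possibly missing a better candidate model.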

With recent enhancements, Redshift ML now supports prediction probabilities for binary and multi-class classification models. For classification problems in machine learning, each label for a given record can be associated with a probability that indicates how likely the record belongs to that label. With prediction probabilities available alongside the label, you can act on classification results with confidence, based on a threshold value for the prediction probability returned by the model.

Prediction probabilities are calculated by default for binary and multi-class classification models; no extra parameter needs to be passed while creating a model. An additional function is created during model creation, without impacting the performance of the ML model.

Check the status of your ML model

You can check the status of your models by running the SHOW MODEL command from your SQL prompt. Enter the SHOW MODEL ALL command to see all the models that you have access to:

SHOW MODEL ALL

The following table summarizes our output.

SchemaName ModelName
demo_ml customer_churn_model

Enter the SHOW MODEL command with your model name to see the status for a specific model:

SHOW MODEL demo_ml.customer_churn_model

The following output provides the status of your model:

Key                         Value
Model Name                  customer_churn_model
Schema Name                 demo_ml
Owner                       awsuser
Creation Time               Tue, 01.11.2022 07:02:51
Model State                 READY
validation:f1               0.681240
Estimated Cost              0.990443

TRAINING DATA:
SELECT STATE, AREA_CODE, TOTAL_CHARGE/ACCOUNT_LENGTH AS AVERAGE_DAILY_SPEND, CUST_SERV_CALLS/ACCOUNT_LENGTH AS AVERAGE_DAILY_CASES, CHURN
FROM DEMO_ML.CUSTOMER_ACTIVITY
WHERE ACCOUNT_LENGTH > 120

Target Column               CHURN

PARAMETERS:
Model Type                  auto
Problem Type                BinaryClassification
Objective                   F1
Function Name               predict_customer_churn
Function Name               predict_customer_churn_prob
Function Parameters         state area_code average_daily_spend average_daily_cases
Function Parameter Types    varchar int4 float8 int4
IAM Role                    arn:aws:iam::99999999999:role/RedshiftML
s3 Bucket                   redshiftml

As the output above shows, the prediction probabilities enhancement adds a second function, named with the suffix _prob appended to the model's function name, which you can use to retrieve prediction probabilities.

Evaluate your model performance

You can see the F1 value for the example model customer_churn_model in the output of the SHOW MODEL command. The F1 score is a statistical measure that combines the precision and recall of all the classes in the model. The value ranges from 0 to 1; the higher the score, the better the accuracy of the model.

You can use the following example SQL query as an illustration to see which predictions are incorrect based on the ground truth:

WITH infer_data AS (
  SELECT area_code || phone AS accountid, churn,
    demo_ml.predict_customer_churn(
          state,
          area_code,
          total_charge/account_length,
          cust_serv_calls/account_length) AS predicted
  FROM demo_ml.customer_activity
  WHERE record_date < '2020-01-01'
)
SELECT * FROM infer_data WHERE churn != predicted;
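Building on this, you could also roll the comparison up into a single accuracy figure directly in SQL. The following is a minimal sketch (not part of the original walkthrough) over the same inference data:

-- Share of records where the predicted label matches the ground truth
WITH infer_data AS (
  SELECT churn,
    demo_ml.predict_customer_churn(
          state,
          area_code,
          total_charge/account_length,
          cust_serv_calls/account_length) AS predicted
  FROM demo_ml.customer_activity
  WHERE record_date < '2020-01-01'
)
SELECT SUM(CASE WHEN churn = predicted THEN 1 ELSE 0 END)::float / COUNT(*) AS accuracy
FROM infer_data;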

You can change the mismatch query above to also return the prediction probabilities of each label:

WITH infer_data AS (
  SELECT area_code || phone AS accountid, churn,
    demo_ml.predict_customer_churn_prob(
          state,
          area_code,
          total_charge/account_length,
          cust_serv_calls/account_length) AS predicted
  FROM demo_ml.customer_activity
  WHERE record_date < '2020-01-01'
)
SELECT * FROM infer_data;

Here is sample output of the above query, which shows the customer churn prediction along with the prediction probabilities mapped to each possible label value.

accountid	churn	predicted
415371-7191	False.	{"probabilities":[0.75614393,0.24385607],"labels":["False.","True."]}
408375-9999	False.	{"probabilities":[0.68209136,0.31790867],"labels":["False.","True."]}
510391-8027	False.	{"probabilities":[0.50511020,0.49488980],"labels":["True.","False."]}
510355-9993	False.	{"probabilities":[0.69801158,0.30198842],"labels":["True.","False."]}
415329-6603	True.	{"probabilities":[0.51449406,0.48550591],"labels":["False.","True."]}
415344-9403	False.	{"probabilities":[0.74465060,0.25534937],"labels":["False.","True."]}
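Note in the output above that the order of the labels array can differ from row to row. If you want the probability of one specific label, such as 'True.' (churn), regardless of which label ranks first, a sketch like the following can normalize it (this helper query is illustrative, not part of the original post):

-- Probability assigned to the 'True.' (churn) label for every account,
-- independent of which label the model ranked first
SELECT accountid,
       CASE WHEN predicted.labels[0]::varchar = 'True.'
            THEN predicted.probabilities[0]
            ELSE predicted.probabilities[1]
       END AS churn_probability
FROM (SELECT area_code || phone AS accountid,
             demo_ml.predict_customer_churn_prob(
                  state,
                  area_code,
                  total_charge/account_length,
                  cust_serv_calls/account_length) AS predicted
      FROM demo_ml.customer_activity
      WHERE record_date < '2020-01-01'
     ) t1;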

We can also observe that Redshift ML is able to identify the right combination of features to come up with a usable prediction model by using model explainability. Model explainability helps explain how these models make predictions, using a feature attribution approach, which in turn helps improve your machine learning (ML) models. We can check the impact of each attribute and its contribution and weight in the model's decisions using the following command:

SELECT EXPLAIN_MODEL('demo_ml.customer_churn_model');

The following output is from the above command, where each attribute's weight is representative of its role in the model's decision-making:

{
  "explanations": {
    "kernel_shap": {
      "label0": {
        "expected_value": -1.3973409436056444,
        "global_shap_values": {
          "area_code": 0.13002602338269313,
          "average_daily_cases": 0.0003483867828791604,
          "average_daily_spend": 1.542027875977079,
          "state": 0.5696699213668497
        }
      }
    }
  },
  "version": "1.0"
}

Invoke your ML model for inference

You can use your SQL function to apply the ML model to your data in queries, reports, and dashboards. For example, you can run the predict_customer_churn SQL function on new customer data in Amazon Redshift regularly to predict customers at risk of churning and feed this information to sales and marketing teams so they can take preemptive actions, such as sending these customers an offer designed to retain them.

For example, you can run the following query to predict which customers in area code 408 might churn:

SELECT area_code || phone AS accountid,
       demo_ml.predict_customer_churn(
          state,
          area_code,
          total_charge/account_length,
          cust_serv_calls/account_length)
          AS "predictedActive"
FROM demo_ml.customer_activity
WHERE area_code='408' and record_date > '2020-01-01';

The following output shows the account ID and whether the account is predicted to remain active.

accountId predictedActive
408393-7984 False.
408357-3817 True.
408418-6412 True.
408395-2854 True.
408372-9976 False.
408353-3061 True.

Additionally, you can change the above query to include the prediction probabilities of the label output for this scenario and decide whether you still want to use the prediction made by the model.

SELECT accountid, predictedActive.labels[0] AS predictedActive, predictedActive.probabilities[0] AS probabilities
FROM (SELECT area_code || phone AS accountid,
             demo_ml.predict_customer_churn_prob(
                  state,
                  area_code,
                  total_charge/account_length,
                  cust_serv_calls/account_length) AS predictedActive
      FROM demo_ml.customer_activity
      WHERE area_code='408' and record_date > '2020-01-01'
     ) t1;

The following output shows each account with the predicted label and its probability:

accountid	predictedactive	probabilities
408335-4719	"False."	0.57442170
408350-8884	"True."	0.53838992
408393-7984	"False."	0.57895380
408418-6412	"True."	0.66967559
408383-1121	"False."	0.79719913
408360-1596	"False."	0.80988538
408395-2854	"False."	0.71269226
408341-9764	"False."	0.59270906
408332-9891	"True."	0.57105911
408372-9976	"True."	0.53818357

As an example, for prediction probability values under 0.60, you could decide whether to continue using the predicted label based on the risk of predicting wrong, and set a threshold value going forward.
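As a minimal sketch of such a threshold (the 0.60 cutoff is illustrative, not a recommendation from this post), you could keep only the predictions the model is sufficiently confident about:

-- Keep only predictions whose top-label probability clears the threshold
SELECT accountid,
       predicted.labels[0] AS predicted_label,
       predicted.probabilities[0] AS probability
FROM (SELECT area_code || phone AS accountid,
             demo_ml.predict_customer_churn_prob(
                  state,
                  area_code,
                  total_charge/account_length,
                  cust_serv_calls/account_length) AS predicted
      FROM demo_ml.customer_activity
      WHERE area_code='408' and record_date > '2020-01-01'
     ) t1
WHERE predicted.probabilities[0] >= 0.60;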

You can also calculate the distribution of prediction probabilities across all inference data to determine the range of values, which can help you set the threshold for using the predicted label:

SELECT cast(prediction.probabilities[0] as decimal(4,1)) AS probabilities, count(*)
FROM (SELECT
        area_code || phone AS accountid,
        demo_ml.predict_customer_churn_prob(
          state,
          area_code,
          total_charge/account_length,
          cust_serv_calls/account_length) AS prediction
      FROM demo_ml.customer_activity
     ) t1
GROUP BY 1
ORDER BY 1 DESC;

probabilities	count
0.9	38
0.8	326
0.7	1231
0.6	1087
0.5	651

Provide privileges to invoke the prediction function

As the model owner, you can grant EXECUTE on the prediction function to business analysts so they can use the model. Because the prediction function belongs to the model, you grant the privilege on the model itself. The following code grants the EXECUTE privilege to the marketing_analyst_grp group:

GRANT EXECUTE ON MODEL demo_ml.customer_churn_model TO GROUP marketing_analyst_grp;

Cost control

Redshift ML uses your existing cluster resources for prediction, so you can avoid additional Amazon Redshift charges. There is no additional Amazon Redshift charge for creating or using a model, and because prediction happens locally in your Amazon Redshift cluster, you don’t have to pay extra unless you need to resize your cluster.

The CREATE MODEL request uses SageMaker for model training and Amazon S3 for storage, and incurs additional expense. The cost depends on the number of cells in your training data; the number of cells is the number of records (in the training query or table) multiplied by the number of columns. For example, if the SELECT query of CREATE MODEL produces 10,000 records for training and each record has five columns, the number of cells in the training data is 50,000. You can control the training cost by setting MAX_CELLS. If you don’t, the default value of MAX_CELLS is 1 million.

If the training data produced by the SELECT query of CREATE MODEL exceeds the MAX_CELLS limit you provided (or the default 1 million, if you didn’t provide one), CREATE MODEL randomly chooses approximately MAX_CELLS / (number of columns) records from the training dataset and trains using those randomly chosen tuples. The random choice ensures that the reduced training dataset doesn’t have any bias. Therefore, by setting MAX_CELLS, you can keep your cost within your limits. See the following code:

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend, 
             cust_serv_calls/account_length AS average_daily_cases,
             churn 
      FROM demo_ml.customer_activity
      WHERE account_length > 120 
     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<accountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml-<your_account_id>',
   MAX_CELLS 10000
)
;

For more information about costs associated with various cell numbers and free trial details, see Amazon Redshift pricing.

An alternate method of cost control is the MAX_RUNTIME parameter, also specified as a CREATE MODEL setting. If the training job in SageMaker exceeds the specified MAX_RUNTIME seconds, the CREATE MODEL ends the job.
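For illustration, a MAX_RUNTIME cap could be added to the same SETTINGS clause as MAX_CELLS (the 1,800-second value here is an arbitrary example, not a recommendation from this post):

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend,
             cust_serv_calls/account_length AS average_daily_cases,
             churn
      FROM demo_ml.customer_activity
      WHERE account_length > 120
     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<accountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml-<your_account_id>',
  MAX_CELLS 10000,
  MAX_RUNTIME 1800  -- stop the SageMaker training job after 30 minutes
);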

The prediction functions run within your Amazon Redshift cluster, and you don’t incur additional expense there.

Customer feedback

“We use predictive analytics and machine learning to improve operational and clinical efficiencies and effectiveness. With Redshift ML, we enabled our data and outcomes analysts to classify new drugs to market into appropriate therapeutic conditions by creating and utilizing ML models with minimal effort. The efficiency gained through leveraging Redshift ML to support this process improved our productivity and optimized our resources.”

– Karim Prasla, Vice President of Clinical Outcomes Analytics and Reporting, Magellan Rx Management

“Jobcase has several models in production using Amazon Redshift Machine Learning. Each model performs billions of predictions in minutes directly on our Redshift data warehouse with no data pipelines required. With Redshift ML, we have evolved to model architectures that generate a 5-10% improvement in revenue and member engagement rates across several different email template types, with no increase in inference costs.”

– Mike Griffin, EVP Optimization & Analytics, Jobcase

Refer to the Magellan Rx and Jobcase blog posts to learn more.

Conclusion

In this post, we briefly discussed ML use cases relevant for data warehousing. We introduced Redshift ML and outlined how it enables SQL users to create, train, deploy, and use ML with simple SQL commands without learning external tools. We also provided an example of how to get started with Redshift ML.

Redshift ML also enables ML experts such as data scientists to quickly create ML models to simplify their pipeline and eliminate the need to export data from Amazon Redshift. In the following posts, we discuss how you can use Redshift ML to import your pre-trained Autopilot, XGBoost, or MLP models into your Amazon Redshift cluster for local inference, or use custom ML models deployed on remote SageMaker endpoints for remote inference. With the recent prediction probabilities enhancement, customers can now use binary and multi-class classification predictions together with the confidence level of the predicted label, at no additional cost.


About the Authors

Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 20 years of experience in the IT world.

Yannis Papakonstantinou is a senior principal scientist at AWS and professor (on leave) of University of California at San Diego whose research on querying nested and semi-structured data, data integration, and the use and maintenance of materialized views has received over 16,500 citations.

Murali Balakrishnan Narayanaswamy is a principal machine learning scientist at AWS and received a PhD from Carnegie Mellon University on the intersection of AI, optimization, learning and inference to combat uncertainty in real-world applications.

Sriram Krishnamurthy is a senior software development manager for the Amazon Redshift query processing team and has been working on semi-structured data processing and SQL compilation and execution for over 15 years.

Sudipta Sengupta is a senior principal technologist at AWS who leads new initiatives in AI/ML, databases, and analytics and holds a Ph.D. in electrical engineering and computer science from Massachusetts Institute of Technology.

Stefano Stefani is a VP and distinguished engineer at AWS and has served as chief technologist for Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon SageMaker, and other services.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using other AWS Analytics services.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.