AWS Machine Learning Blog
Build, tune, and deploy an end-to-end churn prediction model using Amazon SageMaker Pipelines
The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents a huge potential revenue source for every online business. Depending on the industry and business objective, the problem statement can be multi-layered. The following are some business objectives based on this strategy:
- Develop a framework to build propensity models that estimate the probability that a given customer remains a paid customer over several time windows, such as 15-day, 30-day, and 45-day rolling windows
- Develop a framework for better targeting win-back campaigns
- Identify features that are the biggest differentiators amongst customers
This post discusses how you can orchestrate an end-to-end churn prediction model across each step: data preparation, experimenting with a baseline model and hyperparameter optimization (HPO), training and tuning, and registering the best model. You can manage your Amazon SageMaker training and inference workflows using Amazon SageMaker Studio and the SageMaker Python SDK. SageMaker offers all the tools you need to create high-quality data science solutions.
SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.
Studio provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity by up to 10 times.
Amazon SageMaker Pipelines is a tool for building ML pipelines that takes advantage of direct SageMaker integration. With Pipelines, you can easily automate the steps of building an ML model, catalog models in the model registry, and use one of several templates provided in SageMaker Projects to set up continuous integration and continuous delivery (CI/CD) for the end-to-end ML lifecycle at scale.
After the model is trained, you can use Amazon SageMaker Clarify to identify and limit bias and explain predictions to business stakeholders. You can share these automated reports with business and technical teams for downstream target campaigns or to determine features that are key differentiators for customer lifetime value.
By the end of this post, you should have enough information to successfully use this end-to-end template using Pipelines to train, tune, and deploy your own predictive analytics use case. The full instructions are available on the GitHub repo.
Solution overview
In this solution, your entry point is the Studio integrated development environment (IDE) for rapid experimentation. Studio offers an environment to manage the end-to-end Pipelines experience. With Studio, you can bypass the AWS Management Console for your entire workflow management. For more information on managing Pipelines from Studio, see View, Track, and Execute SageMaker Pipelines in SageMaker Studio.
The following diagram illustrates the high-level architecture of the data science workflow.
The workflow includes the following steps:
- Customer churn model development using Studio notebooks.
- Preprocess the data to build the features required and split data in train, validation, and test datasets.
- Apply hyperparameter tuning with the SageMaker XGBoost framework, based on the provided ranges, to find the best model as determined by the AUC score.
- Evaluate the best model using the test dataset.
- Check if the AUC score is above a certain threshold. If so, proceed to the next steps.
- Register the trained churn model in the SageMaker Model Registry.
- Create a SageMaker model by taking the artifacts of the best model.
- Apply batch transform on the given dataset by using the model created in the previous step.
- Create the config file, which includes information such as which columns to check bias on, baseline values for generating SHAP (Shapley value) plots, and more (a minimal sketch of this configuration follows the list).
- Apply Clarify using the config file created in the previous step to generate model explainability and bias information reports.
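The Clarify configuration in the last two steps can be expressed with the SageMaker Python SDK. The following is a minimal sketch, assuming a preprocessed DataFrame `df`, a target column named `retained`, and an illustrative facet column; the S3 paths, model name, and column names are placeholders rather than the exact values from the repository.

```python
import sagemaker
from sagemaker import clarify

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker_session,
)

# Dataset configuration: paths and column names are illustrative placeholders
data_config = clarify.DataConfig(
    s3_data_input_path="s3://<your-bucket>/churn/train/train.csv",
    s3_output_path="s3://<your-bucket>/churn/clarify-output",
    label="retained",                  # target column
    headers=df.columns.to_list(),      # column names of the preprocessed dataset
    dataset_type="text/csv",
)

# Model configuration: points to the SageMaker model created from the best artifacts
model_config = clarify.ModelConfig(
    model_name="churn-prediction-model",   # assumed model name
    instance_count=1,
    instance_type="ml.m5.xlarge",
    accept_type="text/csv",
)

# Which column to check bias on (the facet name here is an assumption)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",
)

# Baseline row used when computing SHAP values
shap_config = clarify.SHAPConfig(
    baseline=[df.drop("retained", axis=1).median().values.tolist()],
    num_samples=100,
    agg_method="mean_abs",
)

# clarify_processor.run_bias(data_config, bias_config, model_config)
# clarify_processor.run_explainability(data_config, model_config, shap_config)
```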
Prerequisites
To get started with the development journey, you need to first onboard to Studio and create a Studio domain for your AWS account within a given Region. For instructions on getting started with Studio, see Onboard to Amazon SageMaker Studio or watch the video Onboard Quickly to Amazon SageMaker Studio.
After you create the Studio domain, select your user name and choose Open Studio. A web-based IDE opens that allows you to store and collect all the things that you need—whether it’s code, notebooks, datasets, settings, or project folders.
Pipelines is integrated directly with SageMaker, so you don’t need to interact with any other AWS services. You also don’t need to manage any resources because Pipelines is a fully managed service, which means that it creates and manages resources for you. For more information about the various SageMaker components, which are available both as standalone Python APIs and as integrated components of Studio, see the SageMaker service page.
ML development workflow
For this use case, you use the following components for the fully automated model development process:
- Prepare stage
- SageMaker Processing – Built-in Python for feature engineering
- SageMaker Clarify – Understanding the model prediction and report generation
- Build stage
- SageMaker Studio notebooks – One-click notebooks with elastic compute
- SageMaker built-in algorithms – XGBoost as a built-in algorithm
- Train, tune, and score stage
- One-click training – Distributed infrastructure management
- Automatic model tuning – Automatic hyperparameter tuning
- SageMaker Experiments – Automatically capture, organize, and search every step of the build, train, and tune stages
- SageMaker batch transform – Score or predict on larger datasets
- Deploy stage
- SageMaker Pipelines – ML workflow orchestration and automation
A SageMaker pipeline is a series of interconnected steps that is defined by a JSON pipeline definition. This pipeline definition encodes a pipeline using a directed acyclic graph (DAG). This DAG gives information on the requirements for and relationships between each step of your pipeline. The structure of a pipeline’s DAG is determined by the data dependencies between steps. These data dependencies are created when the properties of a step’s output are passed as the input to another step.
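As a minimal illustration of how these data dependencies are expressed with the SageMaker Python SDK, the following sketch passes a processing step’s output into a training step; the script name, instance types, and S3 locations are assumptions for illustration only.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Processing step for feature engineering (the script name is an assumption)
processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
step_process = ProcessingStep(
    name="ChurnPreprocess",
    processor=processor,
    code="preprocess.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Training step that consumes the processing step's output, which creates a DAG edge
xgb_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")
estimator = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=f"s3://{session.default_bucket()}/churn/model",
)
step_train = TrainingStep(
    name="ChurnTrain",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        )
    },
)

pipeline = Pipeline(name="ChurnPipeline", steps=[step_process, step_train])
# pipeline.upsert(role_arn=role); pipeline.start()
```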
For this post, our use case is a classic ML problem that aims to understand which marketing strategies, based on consumer behavior, we can adopt to increase customer retention for a given retail store. The following diagram illustrates the complete ML workflow for the churn prediction use case.
Let’s go through the accelerated ML workflow development process in detail.
Collect and prepare data
To follow along with this post, you need to download and save the sample dataset in the default Amazon Simple Storage Service (Amazon S3) bucket associated with your SageMaker session, or in an S3 bucket of your choice. For rapid experimentation or baseline model building, you can save a copy of the dataset under your home directory in Amazon Elastic File System (Amazon EFS) and follow the Jupyter notebook Customer_Churn_Modeling.ipynb.
The following screenshot shows a sample of the dataset, with the target variable retained set to 1 if the customer is assumed to be active, or 0 otherwise.
Run the following code in a Studio notebook to preprocess the dataset and upload it to your own S3 bucket:
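The following is a minimal sketch of this step; the raw file name (storedata_total.csv), the identifier column, and the S3 prefix are assumptions for illustration, and the complete version is in the GitHub repo.

```python
import pandas as pd
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()   # or replace with your own bucket name
prefix = "sagemaker/churn"          # illustrative S3 prefix

# File name and column names below are assumptions for illustration
df = pd.read_csv("storedata_total.csv")

# Basic cleanup: drop rows with missing values and any identifier columns
df = df.dropna()
if "custid" in df.columns:
    df = df.drop(columns=["custid"])

# One-hot encode categorical features and move the target column first,
# as expected by the SageMaker XGBoost built-in algorithm
df = pd.get_dummies(df)
target = "retained"
df = pd.concat([df[target], df.drop(columns=[target])], axis=1)

# Save locally and upload to S3
df.to_csv("processed.csv", index=False, header=False)
s3_uri = session.upload_data("processed.csv", bucket=bucket, key_prefix=prefix)
print(f"Uploaded processed dataset to {s3_uri}")
```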
Develop the baseline model
With Studio notebooks with elastic compute, you can now easily run multiple training and tuning jobs. For this use case, you use the SageMaker built-in XGBoost algorithm and SageMaker HPO with the objective function "binary:logistic" and "eval_metric":"auc".
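Before walking through the notebook steps, here is a sketch of what the XGBoost estimator and tuner setup can look like; the hyperparameter ranges, instance types, and job counts are illustrative assumptions rather than the exact values used in the notebook.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in XGBoost container with the binary:logistic objective and AUC evaluation metric
xgb_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")
xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{session.default_bucket()}/churn/output",
)
xgb.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=100)

# Hyperparameter ranges and job counts here are illustrative
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
        "min_child_weight": ContinuousParameter(1, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
# tuner.fit({"train": train_input, "validation": validation_input})
```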
Let’s start by splitting the dataset into train, test, and validation sets:
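A minimal sketch of this split, assuming the preprocessed DataFrame `df` and the `session`, `bucket`, and `prefix` variables from the earlier preprocessing sketch, and an illustrative 70/15/15 split, could look like the following:

```python
import numpy as np

# Shuffle and split into ~70% train, ~15% validation, ~15% test (ratios are assumptions)
train_df, validation_df, test_df = np.split(
    df.sample(frac=1, random_state=1729),
    [int(0.7 * len(df)), int(0.85 * len(df))],
)

# Save each split without headers, with the target column first, for the built-in XGBoost algorithm
train_df.to_csv("train.csv", index=False, header=False)
validation_df.to_csv("validation.csv", index=False, header=False)
test_df.to_csv("test.csv", index=False, header=False)

# Upload each split to S3 for training and evaluation
train_uri = session.upload_data("train.csv", bucket=bucket, key_prefix=f"{prefix}/train")
validation_uri = session.upload_data("validation.csv", bucket=bucket, key_prefix=f"{prefix}/validation")
test_uri = session.upload_data("test.csv", bucket=bucket, key_prefix=f"{prefix}/test")
```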