
Simplifying machine learning operations with Trifacta and Amazon SageMaker (Part 2)

This is the second article of a two-part series. Part 1 covered data preparation for machine learning (ML) by using Trifacta. Part 2 covers training the model using Amazon SageMaker Autopilot and operationalizing the workflow.

Background

ML provides value to businesses by offering accurate insights to guide business decisions. Gathering those insights should be straightforward enough to be handled by business personnel with no ML background. Subodh's and my first blog post focused on data preparation using Trifacta, available in AWS Marketplace. In this blog post, we show how to build ML models using Amazon SageMaker Autopilot (Autopilot) and deploy them for inference using Amazon SageMaker. We also demonstrate the model's accuracy gain from training on datasets prepared with Trifacta.

Solution walkthrough: Simplifying machine learning operations with Trifacta and Amazon SageMaker

Preparing the data

The solution shows how to use the Bank Marketing data set to train a model that predicts whether a customer will enroll in a term deposit at a bank. The prediction is based on information about the customer and past marketing campaigns. Achieving high prediction accuracy enables the bank to better personalize its marketing.

Rather than assuming the availability of clean data, we started with messy, fragmented data that is more representative of what you find in the real world. The raw data consists of the following files:

  • A CSV file with core fields containing customer information and socio-economic indicators
  • A CSV file with details of marketing campaigns targeted at customers

This data was collected from a Portuguese marketing campaign related to bank term-deposit subscriptions for 45,211 clients, and the response is whether the client subscribed to a term deposit. We downloaded this data set from http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The marketing campaigns were based on phone calls; sometimes more than one contact with the same client was required.
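To make the raw inputs concrete, here is a minimal sketch of loading and joining the two files with pandas. The file names (customers.csv, campaigns.csv) and the customer_id join key are hypothetical, since the exact names depend on how the raw extract was split.

    import pandas as pd

    # Hypothetical file names; substitute the names of your two raw extracts.
    customers = pd.read_csv("customers.csv")  # customer info + socio-economic fields
    campaigns = pd.read_csv("campaigns.csv")  # per-customer campaign history

    # Join on a shared customer key (assumed to be "customer_id" here)
    # to reassemble the 45,211-row training set.
    raw = customers.merge(campaigns, on="customer_id", how="inner")
    print(raw.shape)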

The model created from raw data, without any data preparation, is called the Baseline Model. To measure the gain, we then created a model with data prepared by Trifacta and compared it to the baseline model. The following tables show the two input data structures: raw data and prepared data.

The customer details included: age, job, marital status, education level, and whether the customer has defaulted on a loan in the past, owns a house, or carries a loan.

The socioeconomic details included: employment variation rate, consumer price index, consumer confidence index, Euribor 3-month interbank lending rate in the euro zone, and number of employees in the bank.

This table shows the raw data.

This table shows the prepared data.

Machine learning model training, tuning, and deployment

Step 1: Run the experiment using Amazon SageMaker Autopilot

1.1 Create an experiment in Autopilot

  • To run the experiment, log in to your AWS account.
  • In the top search bar, enter Amazon SageMaker. From the options, select Amazon SageMaker.
  • In the left panel of the SageMaker page, select Amazon SageMaker Studio. This opens a list of Studio instances, under different users, that you have previously used to work in Amazon SageMaker Studio.
  • If you have not used Amazon SageMaker Studio before, you can add a new user. To do this, choose Add user, enter a name for the user, and then choose AmazonSageMaker-ExecutionRole-<nnnn>.
  • To the right of the user, choose Open Studio. This takes you to the SageMaker Studio page. In the right panel, under the Launcher tab, in the Build models automatically panel, select New Autopilot Experiment. This takes you to the Create experiment page, where you enter the information required to start your Autopilot job.

1.2 Start your Autopilot experiment

On the Create experiment page, enter the following information:

  • Experiment Name: enter a name to identify the experiment or model building process. I entered exp-bank-td-enrollment-prediction.
  • Input data location (S3 bucket): (check Find S3 bucket)
    • S3 bucket name: enter the name of the Amazon S3 bucket where the input data is stored. I entered sagemaker-autopilot-experiment-input-1.
    • Dataset file name: enter the name of the input file. I entered csv.
  • Target attribute name: enter the target variable. I entered y.
  • Output data location: (check Find S3 bucket)
    • S3 bucket name: enter the name of the S3 bucket where the SageMaker Autopilot output will be stored. It should be an existing bucket. I entered sagemaker-autopilot-experiment-output-1.
    • Dataset directory name: enter the directory where intermediate data, models, and performance metrics are stored. Artifacts in this directory provide transparency into model building and tuning. I entered output.
  • Machine learning problem type: select the problem type, or keep the default value for automatic selection. I selected Auto.
  • Do you want to run a complete experiment? A complete experiment includes hyperparameter tuning and the option to deploy the best model. I selected Yes.

You can optionally select an objective metric for comparing the candidate models. To start, choose Create Experiment.
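If you prefer to script this step rather than use the Studio console, the same experiment can be started with the SageMaker CreateAutoMLJob API. The following is a minimal boto3 sketch; the role ARN is a placeholder, and the bucket, job, and target names mirror the values entered above.

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_auto_ml_job(
        AutoMLJobName="exp-bank-td-enrollment-prediction",
        InputDataConfig=[{
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://sagemaker-autopilot-experiment-input-1/",
            }},
            # Target attribute, as entered on the Create experiment page
            "TargetAttributeName": "y",
        }],
        OutputDataConfig={
            "S3OutputPath": "s3://sagemaker-autopilot-experiment-output-1/output"
        },
        # Omitting ProblemType lets Autopilot detect it, like the Auto setting above.
        RoleArn="arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-example",  # placeholder
    )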

To train the model, Autopilot now goes through four stages: it analyzes the data, performs feature engineering, tunes the candidate models, and presents the best model along with the option to deploy it as a SageMaker endpoint.

1.3 Get data exploration notebooks

  • To access the Python code that was used to explore the data, choose Open data exploration notebook.
  • To access the Python code that generates the candidate models, choose Open candidate generation notebook.
  • Amazon SageMaker Autopilot provides these notebooks so you can see how the model was built and can refine the code. They're a resource for data scientists or developers who are starting out in ML. Notebooks generated by Amazon SageMaker Autopilot are hosted on this GitHub repository:
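Once the job completes, the same notebook locations, along with the best candidate found, can also be retrieved programmatically. A small sketch, assuming the experiment name used earlier:

    import boto3

    sm = boto3.client("sagemaker")
    job = sm.describe_auto_ml_job(AutoMLJobName="exp-bank-td-enrollment-prediction")

    # S3 locations of the generated notebooks
    artifacts = job["AutoMLJobArtifacts"]
    print(artifacts["DataExplorationNotebookLocation"])
    print(artifacts["CandidateDefinitionNotebookLocation"])

    # Best candidate and its objective metric
    best = job["BestCandidate"]
    print(best["CandidateName"], best["FinalAutoMLJobObjectiveMetric"])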

Step 2: Exploring the results

2.1 Results

Machine learning model performance was measured in terms of the following metrics:

  • True Positive (TP): Number of customers the model predicted to subscribe to term deposits who did subscribe
  • True Negative (TN): Number of customers the model predicted not to subscribe to term deposits who did not subscribe
  • False Positive (FP): Number of customers the model predicted to subscribe to term deposits who did not subscribe
  • False Negative (FN): Number of customers the model predicted not to subscribe to term deposits who did subscribe
  • F1 Score – TP/(TP + 0.5(FP+FN)): A single number summarizing how well the model balances false positives and false negatives (see the snippet after this list)
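As a quick illustration of the formula, here is a minimal helper that computes F1 from the confusion-matrix counts (the example counts are made up):

    def f1_score(tp: int, fp: int, fn: int) -> float:
        """F1 = TP / (TP + 0.5 * (FP + FN)), per the definition above."""
        return tp / (tp + 0.5 * (fp + fn))

    # Made-up example: 90 true positives, 10 false positives, 30 false negatives
    print(round(f1_score(tp=90, fp=10, fn=30), 3))  # 0.818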

The following results show the performance gain by using Trifacta for data preparation.

Autodetected problem type: binary classification

2.2 Result interpretation

Experiment 1

This means the model identified 160 customers out of 237 (160+77) who subscribed to term deposits. The model missed 77 (about 32%) of the potential customers and instead suggested focusing efforts on 158 respondents who were not potential customers.

We interpreted this as fairly ineffective for a bank, in terms of both potential revenue loss and increased cost.

Experiment 2

This time, the model identified 218 customers out of 237 (218+19) who subscribed to term deposits. The model missed only 8% of potential customers and misidentified only 22 respondents who were not potential customers.

We found this much more accurate than the result from Experiment 1. More complex algorithms, additional features, and more data could enhance it further.
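Plugging the counts reported above into the F1 formula makes the gain concrete (a quick check using the two confusion matrices):

    def f1_score(tp, fp, fn):
        return tp / (tp + 0.5 * (fp + fn))

    # Experiment 1 (raw data):               TP=160, FP=158, FN=77
    # Experiment 2 (Trifacta-prepared data): TP=218, FP=22,  FN=19
    print(round(f1_score(160, 158, 77), 3))  # 0.577
    print(round(f1_score(218, 22, 19), 3))   # 0.914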

Machine learning model monitoring and retraining using Amazon SageMaker

Amazon SageMaker provides options to monitor the performance of models in production and to retrain them as model quality degrades. You can integrate with AWS Lambda to automatically trigger retraining and deploy the retrained model, keeping model performance at an acceptable level.
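As an illustration of this integration, a Lambda handler along the following lines could watch a model-quality metric and start a retraining workflow when it drops too low. This is a sketch only: the metric namespace and name, the threshold, and the state machine ARN are all hypothetical.

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    sfn = boto3.client("stepfunctions")

    F1_THRESHOLD = 0.85  # hypothetical acceptable limit
    STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:model-training"  # placeholder

    def handler(event, context):
        # Read the last 24 hours of a custom model-quality metric
        # (assumed to be published by the monitoring job).
        now = datetime.utcnow()
        stats = cloudwatch.get_metric_statistics(
            Namespace="ModelQuality",  # hypothetical namespace
            MetricName="F1Score",      # hypothetical metric name
            Dimensions=[{"Name": "EndpointName", "Value": "bank-td-enrollment"}],
            StartTime=now - timedelta(hours=24),
            EndTime=now,
            Period=3600,
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        if datapoints and min(dp["Average"] for dp in datapoints) < F1_THRESHOLD:
            # Quality dropped below the limit: kick off retraining and redeployment.
            sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN)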

The following diagram shows a sample deployment architecture that enables business users with minimal ML knowledge or experience to create an inference pipeline. These users create the pipeline using Trifacta and Amazon SageMaker Autopilot.

  1. Training, deployment, and monitoring: The user starts the step function to initiate data wrangling, model development, and deployment. The step function runs the following sequence:
    1. Trifacta data wrangling routine exposed through an API
    2. SageMaker Autopilot model building
    3. SageMaker model deployment and monitoring of inference metrics using Amazon CloudWatch. Refer to the following diagram.
  2. Mini-batch inference: Performed by writing inference input data to an Amazon S3 bucket.
    1. A Lambda function is triggered and calls the Trifacta Run Job API to preprocess the data.
    2. It then calls the SageMaker endpoint, which writes the inference results into another S3 bucket (a sketch of this function follows the list). Refer to the following diagram.
  3. Model management and retraining: Model monitoring is performed by Amazon SageMaker, which stores deviation reports in an S3 bucket.
    1. A Lambda function monitors the bucket for violation reports and tracks CloudWatch metrics through the CloudWatch APIs.
    2. The Lambda function kicks off the model training step function if model performance drops below acceptable limits. Refer to the following diagram.
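For the mini-batch inference step (item 2), the Lambda function might look roughly like the sketch below. The endpoint and bucket names are placeholders, and the call to the Trifacta Run Job API is omitted because it depends on your Trifacta deployment.

    import boto3

    s3 = boto3.client("s3")
    runtime = boto3.client("sagemaker-runtime")

    ENDPOINT_NAME = "bank-td-enrollment"       # hypothetical endpoint name
    OUTPUT_BUCKET = "inference-output-bucket"  # hypothetical output bucket

    def handler(event, context):
        # Triggered by an S3 put; the new object holds the inference input rows.
        record = event["Records"][0]["s3"]
        bucket, key = record["bucket"]["name"], record["object"]["key"]

        # (The Trifacta Run Job API call that wrangles the raw input would go here.)

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Send the CSV payload to the SageMaker endpoint for inference.
        response = runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="text/csv",
            Body=body,
        )
        predictions = response["Body"].read()

        # Write the predictions to the output bucket.
        s3.put_object(Bucket=OUTPUT_BUCKET, Key="predictions/" + key, Body=predictions)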

Conclusion

In this post, we showed how Trifacta and Amazon SageMaker Autopilot can be combined so that non-data scientists can build and deploy high-quality models in a few steps. Data preparation with Trifacta enables Amazon SageMaker Autopilot to build models over complex data structures, such as the socioeconomic factors we used in our example.

To get started, subscribe to Trifacta in AWS Marketplace and download the flow used in this blog post from https://github.com/trifacta/trifacta-sagemaker-automl. To learn more about how Trifacta fits with AWS services, see Trifacta and Amazon Web Services on the Trifacta website.

Cleanup

After the test, identify the Amazon SageMaker notebook kernels and model endpoints that are still running. Stop all kernels from JupyterLab, and delete the endpoints from the Amazon SageMaker console. Running kernels and endpoints can incur unexpected charges.
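If you prefer to clean up from code, the endpoint and its configuration can be deleted with boto3. A sketch follows; the endpoint name is a placeholder, and it assumes the endpoint configuration shares the endpoint's name.

    import boto3

    sm = boto3.client("sagemaker")
    endpoint_name = "bank-td-enrollment"  # replace with your endpoint's name

    # Delete the endpoint first, then its configuration, to stop all charges.
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=endpoint_name)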

About the authors

Vijay Balasubramaniam, Director, Partner Solutions Architect, Trifacta

Vijay Balasubramaniam leverages his expertise in data management to help partners and customers be successful in large-scale analytics initiatives. He has over 18 years of experience helping large organizations manage their data sets and produce insights. He specializes in data preparation workflows and developing end-to-end solutions on the AWS platform. Outside of work, he enjoys biking, tennis, music, and spending time with family.

Subodh Kumar, Senior Manager, Partner SA, ISV, Amazon

Subodh Kumar heads the Partner Solution Architect team for Artificial Intelligence and Machine Learning, Internet of Things (IoT), and High-Performance Computing (HPC) at AWS. Subodh is a global technology leader with 15+ years of experience in leading digital transformation for top financial organizations. As a co-founder of multiple successful startups, he enjoys defining disruptive products and taking them to market. Subodh holds patents in software and hardware design. Outside of work, he enjoys cricket, music, and travel.