A/B Testing ML models in production using Amazon SageMaker

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to quickly build, train, and deploy machine learning (ML) models. Tens of thousands of customers, including Intuit, Voodoo, ADP, Cerner, Dow Jones, and Thomson Reuters, use Amazon SageMaker to remove the heavy lifting from the ML process. With Amazon SageMaker, you can deploy your ML models on hosted endpoints and get inference results in real time. You can easily view the performance metrics for your endpoints in Amazon CloudWatch, enable autoscaling to automatically scale endpoints based on traffic, and update your models in production without losing any availability.

In many cases, such as e-commerce applications, offline model evaluation isn’t sufficient, and you need to A/B test models in production before deciding whether to update them. With Amazon SageMaker, you can perform A/B testing on ML models by running multiple production variants on an endpoint. You can use production variants to test ML models that have been trained using different training datasets, algorithms, and ML frameworks; test how they perform on different instance types; or a combination of all of the above.

Until now, you could provide the traffic distribution for each variant on an endpoint, and Amazon SageMaker splits the inference traffic between the variants based on the specified distribution. This is helpful when you want to control how much traffic to send to each variant but don’t need to route requests to specific variants. For example, you may want to update a model in production and test how it compares with the existing model by directing some portion of the traffic to the new model. However, in some use cases, you want a specific model to process inference requests and need to invoke a specific variant. For example, you may want to test and compare how ML models perform across different customer segments and need all requests from customers in a segment to be processed by a specific variant.

You can now choose which variant processes an inference request. You simply provide the TargetVariant header on each inference request, and Amazon SageMaker ensures that the specified variant processes the request.
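
As a rough illustration of the customer-segment scenario, the following sketch maps a segment to a variant name and passes it as TargetVariant. The segment names, the SEGMENT_TO_VARIANT mapping, and the invoke_for_segment helper are hypothetical and not part of the original walkthrough; the runtime client is assumed to be a boto3 "sagemaker-runtime" client.

```python
# Hypothetical mapping from customer segment to production variant name.
SEGMENT_TO_VARIANT = {
    "new_customers": "Variant1",
    "long_term_customers": "Variant2",
}


def invoke_for_segment(runtime_client, endpoint_name, payload, segment):
    """Invoke the endpoint, pinning the request to the segment's variant.

    Requests from segments without a mapping omit TargetVariant, so they
    follow the endpoint's weighted traffic distribution instead.
    """
    kwargs = {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": payload,
    }
    variant = SEGMENT_TO_VARIANT.get(segment)
    if variant is not None:
        kwargs["TargetVariant"] = variant
    return runtime_client.invoke_endpoint(**kwargs)


# Example usage (assumes AWS credentials and an existing endpoint):
# import boto3
# runtime_client = boto3.client("sagemaker-runtime")
# invoke_for_segment(runtime_client, endpoint_name, csv_row, "new_customers")
```

Keeping the mapping in one place means a segment can be moved to another variant without touching the calling code.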

Use case: Amazon Alexa

Amazon Alexa uses Amazon SageMaker to manage various ML workloads. Amazon Alexa teams update their ML models frequently to stay ahead of emerging security threats. For this, the teams test, compare, and determine which version best meets their security, privacy, and business needs before releasing new model versions in production using the model testing capabilities in Amazon SageMaker. For more information about the kinds of research that these teams perform to protect their customers’ security and privacy, see Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text.

“The model testing capabilities in Amazon SageMaker enable us to test new versions of our privacy-preserving models, all of which meet our high standards for customer privacy,” says Nathanael Teissier, Software Development Manager, Alexa Experiences and Devices. “The new ability to select the desired production variant with each request will open up new possibilities for deployment strategies with A/B testing without modifying the existing setup.”

In this post, we show you how you can easily perform A/B testing of ML models in Amazon SageMaker by distributing traffic to variants and invoking specific variants. The models that we test have been trained using different training datasets and deployed as production variants on Amazon SageMaker endpoints.

A/B testing with Amazon SageMaker

In production ML workflows, data scientists and engineers frequently try to improve their models in various ways, such as by performing hyperparameter tuning, training on additional or more recent data, or improving feature selection. Performing A/B testing on the new model and the old model with production traffic can be an effective final step in the validation process for a new model. In A/B testing, you test different variants of your models and compare how the variants perform relative to one another. If the new version performs as well as or better than the existing version, you replace the older model.

Amazon SageMaker enables you to test multiple models or model versions behind the same endpoint using production variants. Each ProductionVariant identifies an ML model and the resources deployed for hosting the model. You can distribute endpoint invocation requests across multiple production variants by providing the traffic distribution for each variant or invoking a variant directly for each request. In the following sections, we look at both methods for testing ML models.

Testing models by distributing traffic to variants

To test multiple models by distributing traffic between them, set a weight for each production variant in the endpoint configuration. Amazon SageMaker routes to each variant a share of the traffic equal to that variant's weight divided by the sum of all variant weights. This is the default behavior when using production variants. The following diagram shows how this works in more detail. Each inference response also contains the name of the variant that processed the request.
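
The relationship between weights and traffic can be sketched as follows. The traffic_shares helper is hypothetical and runs locally; it only illustrates that each variant's expected share is its weight divided by the sum of all weights.

```python
def traffic_shares(weights):
    """Return each variant's expected share of traffic, given its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Weights do not need to sum to 1; only their ratios matter.
shares = traffic_shares({"Variant1": 1, "Variant2": 3})
print(shares)  # {'Variant1': 0.25, 'Variant2': 0.75}
```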

Testing models by invoking specific variants

To test multiple models by invoking specific models for each request, set the TargetVariant header in the request. If you provide both a traffic distribution (through weights) and a TargetVariant, the targeted routing overrides the traffic distribution. The following diagram shows how this works in more detail. In this example, the request invokes ProductionVariant3; you can invoke different variants concurrently, one per request.
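
The override behavior can be pictured with a small local simulation. This is a sketch of the routing rule described above, not Amazon SageMaker's implementation; the pick_variant function is hypothetical, and raising ValueError for an unknown variant name is an assumption of the simulation.

```python
import random


def pick_variant(weights, target_variant=None, rng=random):
    """Simulate which variant serves a single request.

    An explicit target_variant always wins; otherwise the variant is
    drawn at random in proportion to the configured weights.
    """
    if target_variant is not None:
        if target_variant not in weights:
            raise ValueError(f"unknown variant: {target_variant}")
        return target_variant
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


weights = {"ProductionVariant1": 0.4, "ProductionVariant2": 0.4, "ProductionVariant3": 0.2}
print(pick_variant(weights))                                       # weighted random choice
print(pick_variant(weights, target_variant="ProductionVariant3"))  # always ProductionVariant3
```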

Solution overview

This post walks you through an example of how you can use this new feature. You use a Jupyter notebook in Amazon SageMaker to create an endpoint that hosts two models (using ProductionVariant). Both models were trained using the Amazon SageMaker built-in XGBoost algorithm on a dataset for predicting mobile operator customer churn. For more information about how the models were trained, see Customer Churn Prediction with XGBoost. In the following use case, we trained each model on different subsets of the same dataset and used different versions of the XGBoost algorithm for each model.

Try these activities yourself by using the sample ‘A/B Testing with Amazon SageMaker’ Jupyter notebook. You can run it either in Amazon SageMaker Studio or in an Amazon SageMaker notebook instance. The dataset we use is publicly available and mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.

The walkthrough includes the following steps:

Creating and deploying the models
Invoking the deployed models
Evaluating variant performance
Dialing up inference traffic to your chosen variant in production

Creating and deploying the models

First, define where the models are located in Amazon Simple Storage Service (Amazon S3). You use these locations when deploying the models in subsequent steps. See the following code:

model_url = f"s3://{path_to_model_1}"
model_url2 = f"s3://{path_to_model_2}"

Next, create the model objects with the container image and model data. You deploy these model objects as production variants on an endpoint. The models you compare can differ in their training datasets, algorithms, ML frameworks, or hyperparameters. See the following code:

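from datetime import datetime

import boto3
# get_image_uri is the SageMaker Python SDK v1 helper; SDK v2 replaces it with sagemaker.image_uris.retrieve.
from sagemaker.amazon.amazon_estimator import get_image_uri

# Assumes sm_session (a sagemaker.Session) and role (an IAM role ARN) were defined during notebook setup.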

model_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
model_name2 = f"DEMO-xgb-churn-pred2-{datetime.now():%Y-%m-%d-%H-%M-%S}"
image_uri = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
image_uri2 = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-2')

sm_session.create_model(name=model_name, role=role, container_defs={
    'Image': image_uri,
    'ModelDataUrl': model_url
})

sm_session.create_model(name=model_name2, role=role, container_defs={
    'Image': image_uri2,
    'ModelDataUrl': model_url2
})

Create two production variants, each with its own model and resource requirements (instance type and instance count). To split requests evenly between the variants, set an initial_weight of 0.5 for both. See the following code:

from sagemaker.session import production_variant

variant1 = production_variant(model_name=model_name,
                              instance_type="ml.m5.xlarge",
                              initial_instance_count=1,
                              variant_name='Variant1',
                              initial_weight=0.5)
                              
variant2 = production_variant(model_name=model_name2,
                              instance_type="ml.m5.xlarge",
                              initial_instance_count=1,
                              variant_name='Variant2',
                              initial_weight=0.5)

Deploy these production variants on an Amazon SageMaker endpoint with the following code:

endpoint_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
print(f"EndpointName={endpoint_name}")

sm_session.endpoint_from_production_variants(
    name=endpoint_name,
    production_variants=[variant1, variant2]
)

Invoking the deployed models

You can now send data to this endpoint and get inferences in real time. For this post, we use both approaches for testing models supported in Amazon SageMaker: distributing traffic to variants and invoking specific variants.

Distributing traffic to variants

Amazon SageMaker distributes the traffic between production variants on an endpoint based on the respective weights that you configured in the preceding variant definitions. See the following code where we invoke the endpoint:

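import time

# Assumes sm_runtime is a boto3 "sagemaker-runtime" client created during notebook setup.
# Get a subset of the test data for a quick test.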
!tail -120 test_data/test-dataset-input-cols.csv > test_data/test_sample_tail_input_cols.csv
print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")

with open('test_data/test_sample_tail_input_cols.csv', 'r') as f:
    for row in f:
        print(".", end="", flush=True)
        payload = row.rstrip('\n')
        sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType="text/csv",
                                   Body=payload)
        time.sleep(0.5)
        
print("Done!")

Amazon SageMaker emits metrics such as latency and invocations for each variant in Amazon CloudWatch. For the full list of endpoint metrics, see Monitor Amazon SageMaker with Amazon CloudWatch. You can query Amazon CloudWatch to get the number of invocations per variant, to see how invocations are split across variants by default. Your findings should resemble the following graph.
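
One way to query these counts is sketched below. The invocations_per_variant helper is hypothetical; it assumes a boto3 CloudWatch client and uses the AWS/SageMaker namespace with the Invocations metric and the EndpointName and VariantName dimensions. The 60-minute look-back window is an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta, timezone


def invocations_per_variant(cloudwatch, endpoint_name, variant_names, minutes=60):
    """Sum the Invocations metric for each variant over the last `minutes`."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    totals = {}
    for variant_name in variant_names:
        response = cloudwatch.get_metric_statistics(
            Namespace="AWS/SageMaker",
            MetricName="Invocations",
            Dimensions=[
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            StartTime=start,
            EndTime=end,
            Period=60,  # one datapoint per minute
            Statistics=["Sum"],
        )
        totals[variant_name] = sum(p["Sum"] for p in response["Datapoints"])
    return totals


# Example usage (assumes AWS credentials and a deployed endpoint):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# print(invocations_per_variant(cloudwatch, endpoint_name, ["Variant1", "Variant2"]))
```

CloudWatch metrics can take a few minutes to appear, so an immediate query may return lower counts than the number of requests sent.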

Invoking specific variants

The following use case uses the new Amazon SageMaker variant feature to invoke a specific variant. To do this, set the TargetVariant parameter to the name of the ProductionVariant that you want to invoke. The following code invokes Variant1 for all requests; you can invoke the other variants in the same way:

print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")
with open('test_data/test_sample_tail_input_cols.csv', 'r') as f:
    for row in f:
        print(".", end="", flush=True)
        payload = row.rstrip('\n')
        sm_runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType="text/csv",
                                   Body=payload,
                                   TargetVariant="Variant1") # <- Note new parameter
        time.sleep(0.5)

To confirm that Variant1 processed all new invocations, query CloudWatch to get the number of invocations per variant. The following graph shows that for the most recent invocations (latest timestamp), Variant1 processed all requests. There were no invocations made for Variant2.
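
As a quicker check that doesn't wait for CloudWatch, you can also tally the variant name returned with each response. The sketch below assumes you collected the invoke_endpoint responses in a list; the count_invoked_variants helper is hypothetical and reads the InvokedProductionVariant field of each response.

```python
from collections import Counter


def count_invoked_variants(responses):
    """Count how many responses each production variant served."""
    return Counter(r["InvokedProductionVariant"] for r in responses)


# Stand-in responses for illustration; real ones come from invoke_endpoint.
sample_responses = [
    {"InvokedProductionVariant": "Variant1"},
    {"InvokedProductionVariant": "Variant1"},
    {"InvokedProductionVariant": "Variant1"},
]
print(count_invoked_variants(sample_responses))  # Counter({'Variant1': 3})
```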

Evaluating variant performance

The following graph evaluates the accuracy, precision, recall, F1 score, and ROC/AUC for Variant1.

The following graph evaluates the same metrics for the predictions Variant2 made.

Variant2 performed better on most of the defined metrics, so it is the variant you would likely choose to serve an increasing share of your inference traffic in production.
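
For reference, the threshold-based metrics can be computed as in the following sketch. The classification_metrics helper and the 0.5 threshold are assumptions for illustration, not the notebook's exact evaluation code, and ROC/AUC is omitted for brevity.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 for binary predictions.

    labels are 0/1 ground-truth values; scores are predicted probabilities.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: one true positive, one true negative, one false negative, one false positive.
print(classification_metrics([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```

Running the same function on each variant's predictions for the same test set gives a like-for-like comparison.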

Dialing up inference traffic to your chosen variant in production

Now that you have determined Variant2 to be better than Variant1, you can shift more traffic to it.

You can continue to use TargetVariant to invoke a chosen variant. A simpler approach is to update the weights assigned to each variant using UpdateEndpointWeightsAndCapacities. This changes the traffic distribution to your production variants without requiring updates to your endpoint.

Consider the scenario in which you specified variant weights to split traffic 50/50 when you created your models and endpoint configuration. The following CloudWatch metrics for the total invocations for each variant show the invocation patterns for each variant.

To shift 75% of the traffic to Variant2, assign new weights to each variant using UpdateEndpointWeightsAndCapacities. See the following code:

# Assumes sm is a boto3 "sagemaker" client created during notebook setup.
sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 0.25,
            "VariantName": variant1["VariantName"]
        },
        {
            "DesiredWeight": 0.75,
            "VariantName": variant2["VariantName"]
        }
    ]
)

Amazon SageMaker now sends 75% of the inference requests to Variant2 and the remaining 25% of requests to Variant1.
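
To check that the new weights have taken effect, you can read them back from the endpoint description. The following is a minimal sketch; the current_variant_weights helper is hypothetical and assumes a boto3 SageMaker client.

```python
def current_variant_weights(sm_client, endpoint_name):
    """Return the endpoint status and each variant's current weight."""
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    weights = {
        variant["VariantName"]: variant.get("CurrentWeight")
        for variant in description["ProductionVariants"]
    }
    return description["EndpointStatus"], weights


# Example usage (assumes AWS credentials and a deployed endpoint):
# import boto3
# sm_client = boto3.client("sagemaker")
# status, weights = current_variant_weights(sm_client, endpoint_name)
# print(status, weights)
```

While the change is being applied, the reported weights may still show the previous values.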

The following CloudWatch metrics for the total invocations for each variant show higher invocations for Variant2 compared to Variant1.

You can continue to monitor your metrics and, when you’re satisfied with a variant’s performance, you can route 100% of the traffic to it. For this use case, we used UpdateEndpointWeightsAndCapacities to update the traffic assignments for the variants. The weight for Variant1 is set to 0.0 and the weight for Variant2 is set to 1.0. Therefore, Amazon SageMaker sends 100% of all inference requests to Variant2. See the following code:

sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 0.0,
            "VariantName": variant1["VariantName"]
        },
        {
            "DesiredWeight": 1.0,
            "VariantName": variant2["VariantName"]
        }
    ]
)

The following CloudWatch metrics for the total invocations for each variant show that Variant2 processed all inference requests, and there are no inference requests processed by Variant1.

You can now safely update your endpoint and delete Variant1 from your endpoint. You can also continue testing new models in production by adding new variants to your endpoint and following the steps in this walkthrough.
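
One possible way to remove a variant is to create a new endpoint configuration that lists only the variant you want to keep, and then point the endpoint at it. The sketch below assumes a boto3 SageMaker client and a variant definition like the dictionaries returned by production_variant earlier; the keep_only_variant helper and the configuration name are hypothetical.

```python
def keep_only_variant(sm_client, endpoint_name, new_config_name, variant):
    """Switch the endpoint to a new configuration with a single variant.

    `variant` is a production variant definition (a dict with keys such as
    ModelName, VariantName, InstanceType, and InitialInstanceCount).
    """
    sm_client.create_endpoint_config(
        EndpointConfigName=new_config_name,
        ProductionVariants=[variant],
    )
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=new_config_name,
    )


# Example usage (assumes AWS credentials and the variant2 definition from earlier):
# import boto3
# sm_client = boto3.client("sagemaker")
# keep_only_variant(sm_client, endpoint_name, "DEMO-xgb-churn-pred-variant2-only", variant2)
```

The previous endpoint configuration is left in place by this sketch, so you can delete it separately once the update completes.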

Conclusion

Amazon SageMaker enables you to A/B test ML models in production by running multiple production variants on an endpoint. You can use its capabilities to test models that have been trained using different training datasets, hyperparameters, algorithms, or ML frameworks; test how they perform on different instance types; or a combination of all of the above. You can provide the traffic distribution between the variants on an endpoint, and Amazon SageMaker splits the inference traffic between the variants based on the specified distribution. Alternatively, if you want to test models for specific customer segments, you can specify the variant that should process an inference request by providing the TargetVariant header, and Amazon SageMaker routes the request to the variant that you specified. For more information about A/B testing, see AWS Developer Guide: Test models in production.

About the authors

Kieran Kavanagh is a Principal Solutions Architect at Amazon Web Services. He works with customers to design and build technology solutions on AWS, and has a particular interest in machine learning. In his spare time, he likes hiking, snowboarding, and practicing martial arts.

Aakash Pydi is a Software Development Engineer in the Amazon SageMaker team. He is passionate about helping developers efficiently productionize machine learning workflows. In his spare time, he loves reading (science fiction, economics, history, philosophy), gaming (real-time strategy), and long conversations.

David Nigenda is a Software Development Engineer in the Amazon SageMaker team. His current work focuses on providing useful insights on production machine learning workflows. In his spare time he tries to keep up with his kids.
