AWS Machine Learning Blog

Using A/B testing to measure the efficacy of recommendations generated by Amazon Personalize

Machine learning (ML)-based recommender systems aren’t a new concept, but developing such a system can be a resource-intensive task—from data management during training and inference, to managing scalable real-time ML-based API endpoints. Amazon Personalize allows you to easily add sophisticated personalization capabilities to your applications by using the same ML technology used on Amazon.com for over 20 years. No ML experience required. Customers in industries such as retail, media and entertainment, gaming, travel and hospitality, and others use Amazon Personalize to provide personalized content recommendations to their users. With Amazon Personalize, you can solve the most common use cases: providing users with personalized item recommendations, surfacing similar items, and personalized re-ranking of items.

Amazon Personalize automatically trains ML models from your user-item interactions and provides an API to retrieve personalized recommendations for any user. A frequently asked question is, “How do I compare the performance of recommendations generated by Amazon Personalize to my existing recommendation system?” In this post, we discuss how to perform A/B tests with Amazon Personalize, a common technique for comparing the efficacy of different recommendation strategies.

You can quickly create a real-time recommender system on the AWS Management Console or the Amazon Personalize API by following these simple steps:

  1. Import your historical user-item interaction data.
  2. Based on your use case, start a training job using an Amazon Personalize ML algorithm (also known as recipes).
  3. Deploy an Amazon Personalize-managed, real-time recommendations endpoint (also known as a campaign).
  4. Record new user-item interactions in real time by streaming events to an event tracker attached to your Amazon Personalize deployment.

The following diagram represents which tasks Amazon Personalize manages.

This diagram represents which tasks Amazon Personalize manages

Metrics overview

You can measure the performance of ML recommender systems through offline and online metrics. Offline metrics allow you to view the effects of modifying hyperparameters and algorithms used to train your models, calculated against historical data. Online metrics are the empirical results observed in your user’s interactions with real-time recommendations provided in a live environment.

Amazon Personalize generates offline metrics using test datasets derived from the historical data you provide. These metrics showcase how the model recommendations performed against historical data. The following diagram illustrates a simple example of how Amazon Personalize splits your data at training time.

The following diagram illustrates a simple example of how Amazon Personalize splits your data at training time

Consider a training dataset containing 10 users with 10 interactions per user; interactions are represented by circles and ordered from oldest to newest based on their timestamp. In this example, Amazon Personalize uses 90% of the users’ interactions data (blue circles) to train your model, and the remaining 10% for evaluation. For each of the users in the evaluation data subset, 90% of their interaction data (green circles) is used as input for the call to the trained model, and the remaining 10% of their data (orange circle) is compared to the output produced by the model to validate its recommendations. The results are displayed to you as evaluation metrics.

Amazon Personalize produces the following metrics:

  • Coverage – This metric is appropriate if you’re looking for what percentage of your inventory is recommended
  • Mean reciprocal rank (at 25) – This metric is appropriate if you’re interested in the single highest ranked recommendation
  • Normalized discounted cumulative gain (at K) – The discounted cumulative gain is a measure of ranking quality; it refers to how well the recommendations are ordered
  • Precision (at K) – This metric is appropriate if you’re interested in how a carousel of size K may perform in front of your users

For more information about how Amazon Personalize calculates these metrics, see Evaluating a Solution Version.

Offline metrics are a great representation of how your hyperparameters and data features influence your model’s performance against historical data. To find empirical evidence of the impact of Amazon Personalize recommendations on your business metrics, such as click-through rate, conversion rate, or revenue, you should test these recommendations in a live environment, getting them in front of your customers. This exercise is important because a seemingly small improvement in these business metrics can translate into a significant increase in your customer engagement, satisfaction, and business outputs, such as revenue.

The following sections include an experimentation methodology and reference architecture in which you can identify the steps required to expose multiple recommendation strategies (for example, Amazon Personalize vs. an existing recommender system) to your users in a randomized fashion and measure the difference in performance in a scientifically sound manner (A/B testing).

Experimentation methodology

The data collected across experiments enables you to measure the efficacy of Amazon Personalize recommendations in terms of business metrics. The following diagram illustrates the experimentation methodology we suggest adhering to.

The following diagram illustrates the experimentation methodology we suggest adhering to

The process consists of five steps:

  • Research – The formulation of your questions and definition of the metrics to improve are solely based on the data you gather before starting your experiment. For example, after exploring your historical data, you might be interested in why you experience shopping cart abandonment or high bounce rates from leads generated by advertising.
  • Develop a hypothesis – You use the data gathered during the research phase to make observations and develop a change and effect statement. The hypothesis must be quantifiable. For example, providing user personalization through an Amazon Personalize campaign on the shopping cart page will translate into an increase of the average cart value by 10%.
  • Create variations based on the hypothesis – The variations of your experiment are based on the hypothesized behavior you’re evaluating. A newly created Amazon Personalize campaign can be considered the variation of your experiment when compared against an existing rule-based recommendation system.
  • Run an experiment – You can use several techniques to test your recommendation system; this post focuses on A/B testing. The metrics data gathered during the experiment help validate (or invalidate) the hypothesis. For example, a 10% increase on the average cart value after adding Amazon Personalize recommendations to the shopping cart page over 1 month compared to the average cart value keeping the current system’s recommendations.
  • Measure the results – In this step, you determine if there is statistical significance to draw a conclusion and select the best performing variation. Was the increase in your cart average value a result of the randomness of your user testing set, or did the newly created Amazon Personalize campaign influence this increase?

A/B testing your Amazon Personalize deployment

The following architecture showcases a microservices-based implementation of an A/B test between two Amazon Personalize campaigns. One is trained with one of the recommendation recipes provided by Amazon Personalize, and the other is trained with a variation of this recipe. Amazon Personalize provides various predefined ML algorithms (recipes); HRNN-based recipes enable you to provide personalized user recommendations.

The following architecture showcases a microservices-based implementation of an A/B test between two Amazon Personalize campaigns

This architecture compares two Amazon Personalize campaigns. You can apply the same logic when comparing an Amazon Personalize campaign against a custom rule-based or ML-based recommender system. For more information about campaigns, see Creating a Campaign.

The basic workflow of this architecture is as follows:

  1. The web application requests customer recommendations from the recommendations microservice.
  2. The microservice determines if there is an active A/B test. For this post, we assume your testing strategy settings are stored in Amazon DynamoDB.
  3. When the microservice identifies the group your customer belongs to, it resolves the Amazon Personalize campaign endpoint to query for recommendations.
  4. Amazon Personalize campaigns provide the recommendations for your users.
  5. The users interact with their respective group recommendations.
  6. The web application streams user interaction events to Amazon Kinesis Data Streams.
  7. The microservice consumes the Kinesis stream, which sends the user interaction event to both Amazon Personalize event trackers. Recording events is an Amazon Personalize feature that collects real-time user interaction data and provides relevant recommendations in real time.
  8. Amazon Kinesis Data Firehose ingests your user-item interactions stream and stores the interactions data in Amazon Simple Storage Service (Amazon S3) to use in future trainings.
  9. The microservice keeps track of your pre-defined business metrics throughout the experiment.

For instructions on running an A/B test, see the Retail Demo Store Experimentation Workshop section in the Github repo.

Tracking well-defined business metrics is a critical task during A/B testing; seeing improvements on these metrics is the true indicator of the efficacy of your Amazon Personalize recommendations. The metrics measured throughout your A/B tests need to be consistent across your variations (groups A and B). For example, an ecommerce site can evaluate the change (increase or decrease) on the click-through rate of a particular widget after adding Amazon Personalize recommendations (group A) compared to the click-through rate obtained using the current rule-based recommendations (group B).

An A/B experiment runs for a defined period, typically dictated by the number of users necessary to reach a statistically significant result. Tools such as Optimizely, AB Tasty, and Evan Miller’s Awesome A/B Tools can help you determine how large your sample size needs to be. A/B tests are usually active across multiple days or even weeks, in order to collect a large enough sample from your userbase. The following graph showcases the feedback loop between testing, adjusting your model, and rolling out new features on success.

The following graph showcases the feedback loop between testing, adjusting your model, and rolling out new features on success.

For an A/B test to be considered successful, you need to perform a statistical analysis of the data gathered from your population to determine if there is a statistically significant result. This analysis is based on the significance level you set for the experiment; a 5% significance level is considered the industry standard. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. A lower significance level means that we need stronger evidence for a statistically significant result. For more information about statistical significance, see A Refresher on Statistical Significance.

The next step is to calculate the p-value. The p-value is the probability of seeing a particular result (or greater) from zero, assuming that the null hypothesis is TRUE. In other words, the p-value is the expected fluctuation in a given sample, similar to the variance. For example, imagine we ran an A/A test where we displayed the same variation to two groups of users. After such an experiment, we would expect the metrics results across groups to be very similar but not dramatically different: a p-value greater than your significance level. Therefore, in an A/B test, we hope to see a p-value that is less than our significance level so we can conclude the influence on the business metric was the result of your variance group. AWS Partners such as Amplitude or Optimizely provide A/B testing tools to facilitate the setup and analysis of your experiments.

A/B tests are statistical measures of the efficacy of your Amazon Personalize recommendations, allowing you to quantify the impact these recommendations have on your business metrics. Additionally, A/B tests allows you to gather organic user-item interactions that you can use to train subsequent Amazon Personalize implementations. We recommend spending less time on offline tests and getting your Amazon Personalize recommendations in front of your users as quickly as possible. This helps eliminate biases from existing recommender systems in your training dataset, which allows your Amazon Personalize deployments to learn from organic user-item interactions data.

Conclusion

Amazon Personalize is an easy-to-use, highly scalable solution that can help you solve some of the most popular recommendation use cases:

  • Personalized recommendations
  • Similar items recommendations
  • Personalized re-ranking of items

A/B testing provides invaluable information on how your customers interact with your Amazon Personalize recommendations. These results, measured according to well-defined business metrics, will give you a sense of the efficacy of these recommendations along with clues on how to further adjust your training datasets. After you iterate through this process multiple times, you will see an improvement on the metrics that matter most to improve customer engagement.

If this post helps you or inspires you to use A/B testing to improve your business metrics, please share your thoughts in the comments.

Additional resources

For more information about Amazon Personalize, see the following:


About the Author

Luis Lopez Soria is an AI/ML specialist solutions architect working with the AWS machine learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.