AWS HPC Blog

How vertical scaling and GPUs can accelerate mixed media modelling for marketing analytics

In marketing analytics, mixed media modeling (MMM) is a machine learning technique that combines information from various sources, like TV ads, online ads, and social media, to measure the impact of marketing and advertising campaigns. By using these techniques, businesses can make smarter decisions about where to invest their advertising budget, helping them get the best return on investment. It’s a bit like having a treasure map that guides you to the most valuable marketing strategies.

A key challenge for marketing analytics teams is the compute-heavy nature of these models. Popular libraries like LightweightMMM, Robyn, and PyMC-Marketing are not designed to scale horizontally, so jobs with a granular breakdown of demographic and geographic regions can take more than 24 hours to complete on workstations with limited compute, leading to delayed insights and hence delayed marketing optimization.

In this blog post, I’ll demonstrate how to accelerate your mixed media modeling (MMM) jobs using AWS Batch. Using the LightweightMMM open-source library as an example, I’ll use AWS Batch to address two key challenges: 1) giving teams access to larger GPU and CPU compute resources; and 2) efficiently provisioning and managing infrastructure while reducing the risk of cost overruns. Using larger compute instances works around these MMM libraries’ inability to scale horizontally and results in a significant reduction in model training time. At the end of the post, I’ll also walk through a time and cost analysis to show the benefits of using more expensive vertically-scaled instances for these types of workloads.

Overview of our sample application

First, let’s take a look at the sample application (Figure 1), which is also published in the AWS Samples repo here (see the link for the deployment method).

You can quickly submit training jobs for MMMs through this application while AWS Batch takes care of the provisioning and, importantly, the removal and cleanup of compute resources. You submit jobs using the Train new model button. The Details button shown in the completed-jobs table lets you visualize the results and run further inference to optimize a proposed marketing budget.

Figure 1 – The web frontend for our sample application. Data scientists can submit training jobs for mixed media models leveraging high-end compute and obtain results for further analysis after completion.

Architecture: AWS Batch integrated with web frontend

The architectural diagram for the sample application is shown in Figure 2. A web application provides users with a simple user interface (1). When you submit a job, the web application calls an API in the processing layer with the model-training parameters. The processing layer uses AWS Batch to run a training job (2). AWS Batch automatically provisions the required compute and executes the code for the training job. The training job accesses the specified tables in the data lake to acquire its training data (3). On completion, the trained model is saved back to the data lake (4). Once training is complete, AWS Batch makes sure the compute resources are shut down so you don’t pay for compute that is not being used. Finally, the processing layer provides an API for inference requests back to the web application by loading the saved model from Amazon Simple Storage Service (Amazon S3) on demand (5).

Figure 2 – Architectural diagram of a web application to train mixed media models. When a training request is submitted from the web frontend, the job is submitted to AWS Batch via Amazon API Gateway/AWS Lambda. Amazon Athena provides the training data directly to the training job. All the components are serverless, with no requirement for pre-provisioned, unmanaged infrastructure.
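To make step (2) concrete, here is a minimal sketch of what the job-submission Lambda function in the processing layer could look like, using boto3 to call AWS Batch. The job queue, job definition, and parameter names are illustrative assumptions, not the exact names used in the sample repository.

import json
import os

import boto3

batch = boto3.client("batch")

def handler(event, context):
    """Hypothetical API Gateway handler that submits an MMM training job to AWS Batch."""
    body = json.loads(event["body"])
    response = batch.submit_job(
        jobName=f"mmm-training-{body['modelName']}",
        jobQueue=os.environ["TRAINING_JOB_QUEUE"],            # queue backed by a GPU compute environment
        jobDefinition=os.environ["TRAINING_JOB_DEFINITION"],
        parameters={                                          # passed through to the container command
            "data_table": body["dataTable"],
            "number_chains": str(body["chains"]),
        },
    )
    return {"statusCode": 200, "body": json.dumps({"jobId": response["jobId"]})}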

Deep dive on components in the architecture

Web front-end

We used Amazon CloudFront and Amazon S3 to provide a simple, cost-effective way to serve a React-based front end, with Amazon Cognito providing authentication and authorization to control access. The combination of Amazon API Gateway, AWS Lambda, and Amazon DynamoDB provides a job management function exposed as a REST API that the web application can call.

Data lake for storage

Using Amazon Athena and Amazon S3, the data lake lets team members prepare variations of the model input data using standard SQL queries. We also include a Lambda function that generates sample data, using a feature of the LightweightMMM framework.
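As an illustration of step (3), the training job can pull its prepared input data from the data lake with a standard SQL query against Athena. This sketch uses the AWS SDK for pandas (awswrangler); the database, table, and column names are hypothetical.

import awswrangler as wr

# Load three years of weekly, per-geography media data from the data lake.
# Database, table, and column names are hypothetical examples.
media_df = wr.athena.read_sql_query(
    sql="""
        SELECT week_start, geography, channel, spend, impressions, conversions
        FROM media_weekly
        WHERE week_start >= date_add('year', -3, current_date)
    """,
    database="mmm_datalake",
)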

Processing layer

The processing layer provides the compute required to both train the models and run inference requests. Using Batch, we create a number of different job types, queues, and compute environments to allow training the model on small, medium, and large GPU or CPU instances. Batch automatically provisions the required compute using Amazon Elastic Compute Cloud (Amazon EC2) with the instance types specified per job. This allows us to expose a simple job submission system via the web front end while using Batch to manage the complexity of the underlying compute infrastructure. The trained models are saved into the S3 bucket in the data lake, and we use Lambda to expose inference requests on these trained models.
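As a rough sketch of how this vertically-scaled compute could be defined, the snippet below creates a managed, EC2-backed Batch compute environment restricted to GPU instance types, with minvCpus set to zero so instances are released when no jobs are queued. The names, subnets, and roles are placeholders; the sample repository defines its compute environments through infrastructure-as-code rather than direct API calls like this.

import boto3

batch = boto3.client("batch")

# Placeholder identifiers -- substitute your own subnets, security groups, and roles.
batch.create_compute_environment(
    computeEnvironmentName="mmm-gpu-large",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",
        "allocationStrategy": "BEST_FIT_PROGRESSIVE",
        "instanceTypes": ["g5.12xlarge", "p5.48xlarge"],  # 4 x A10G and 8 x H100 options
        "minvCpus": 0,        # scale to zero when no jobs are queued
        "maxvCpus": 256,
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole",
    },
    serviceRole="AWSBatchServiceRole",
)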

The LightweightMMM framework

The LightweightMMM library is written in Python and uses the open-source frameworks JAX and NumPyro to implement a Bayesian approach to MMM using Markov chain Monte Carlo (MCMC) sampling.

There are some key input parameters for training the model that can influence the run time; a short code sketch after the list shows how they map onto a LightweightMMM training call.

  • Historical data and granularity – the length and granularity of the historical data used can impact the time it takes a model to complete. For this blog post we will use 3 years’ worth of data with a weekly granularity, leading to 160 data points per feature.
  • Number of geographies – the number of geographies multiplies the complexity of the calculation, since each data point above must be repeated for each geography as well as for each marketing channel. For this blog post we will use different geography counts to show the impact on run time.
  • Marketing channels – each data point, multiplied by the number of geographies, is further multiplied by each marketing channel included in the analysis. For this blog post we will compare 3 and 6 channels.
  • Chains – each additional chain requires running the MCMC algorithm independently, which means the sampling and calculations are performed separately for each chain. For this blog post we will use a varying number of chains to demonstrate the impact on run time.
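Here is the sketch referenced above: a minimal example of how these parameters map onto a LightweightMMM training call. The arrays are random placeholders shaped for 160 weekly data points, 6 channels, and 300 geographies; the argument names follow the LightweightMMM documentation at the time of writing, so check them against the version you install.

import numpy as np
from lightweight_mmm import lightweight_mmm

n_weeks, n_channels, n_geos = 160, 6, 300  # historical data points x marketing channels x geographies

# Placeholder data -- in the sample application this comes from the data lake.
media_data = np.random.rand(n_weeks, n_channels, n_geos)  # media spend/impressions per week, channel, geo
costs = np.random.rand(n_channels)                        # total cost per channel (media prior)
target = np.random.rand(n_weeks, n_geos)                  # e.g. weekly sales per geography

mmm = lightweight_mmm.LightweightMMM(model_name="hill_adstock")
mmm.fit(
    media=media_data,
    media_prior=costs,
    target=target,
    number_warmup=1000,
    number_samples=1000,
    number_chains=2,  # each chain runs the MCMC sampler independently (one GPU per chain)
)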

How to maximize GPU memory

The current implementation of LightweightMMM runs each chain on a dedicated GPU. The default settings pre-allocate GPU memory for each chain, and we found that this can lead to out-of-memory errors when running large MMM models where each chain has a large number of data points (calculated as historical data x geographies x channels). The following configuration adjusts JAX’s GPU memory allocation behavior:

os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

Setting these environment variables in your Python application code (before JAX initializes its GPU backend) can help make the most of GPU memory for larger model training jobs. You can refer to the JAX documentation on GPU memory allocation for more details.

CUDA and cuDNN on AWS Batch

For JAX to work with an NVIDIA GPU, we needed to ensure the following:

  1. The host machine’s CUDA driver is on the same major version as the container image, and its minor version is the same or newer.
  2. Newer NVIDIA GPUs (e.g. the H100) perform much better when using CUDA 12.
  3. JAX has a prebuilt jaxlib Python wheel supporting CUDA 12.2 with cuDNN 8.9, so it is best to choose a container image that matches these versions.

The simplest solution is to use a pre-built NVIDIA container image with the required libraries and compiler tools. You can find these in the NVIDIA container catalog. We recommend using the developer (devel) releases with cuDNN included.

FROM nvcr.io/nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04

You can find the full Dockerfile we used in our sample application in our AWS Samples repository.
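Once the container is built, a quick sanity check (assuming jaxlib with CUDA support is installed) confirms that JAX can actually see the GPUs before you submit a long-running training job:

import jax

# Expect "gpu" (or "cuda" on newer JAX versions) rather than "cpu".
print(jax.default_backend())

# Expect one device entry per GPU, e.g. 4 on a g5.12xlarge or 8 on a p5.48xlarge.
print(jax.devices())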

Time and cost analysis

To understand the impact of scaling compute when training the models, we ran an experiment with LightweightMMM across a number of different compute configurations and model complexities. The results below are compiled from average run times and approximate costs (using On-Demand pricing in the us-east-1 Region for a consistent baseline) for Amazon EC2 C6i and C7i instance types for CPUs, and G5 and P5 instance types for GPUs.

For our experiment, we compared the results against a base configuration of 16 cores, which is reflective of the performance of a typical desktop workstation.

Figure 3 – Average runtime of training a smaller model with 3 marketing channels, 100 geographies, and 2 chains using different levels of compute

For smaller models, increasing the level of processing power by using a greater number of CPU cores did not have a significant impact on training time. We think this is due to the overhead of splitting the processing across a larger number of threads. At the extreme end, we found that a 192-core instance with high parallelism can actually have a negative impact on training time.

Using GPUs had a significant positive impact on the training job: a G5 instance with 4 x NVIDIA A10G GPUs trained the model 4x faster than a typical cloud desktop configuration, while costing about the same.

Figure 4 – Average runtime of training a larger model with 6 marketing channels and 300 geographies using different levels of compute

For larger, complex models, a 128-core C6i CPU-based instance provided more than a 2x reduction in training time before higher parallelism started to become detrimental. We found that G5 instances with A10G GPUs didn’t have enough GPU memory to run the larger training configuration. However, the P5 instances with NVIDIA H100 GPUs ran 58x faster, with training completing in approximately 45 minutes. For an approximate cost increase of $50, the run time dropped from 1.5 days to 45 minutes. This is a great trade-off if you’re short on time.

Conclusion

In this post I covered how you can use AWS Batch to run mixed media models (MMM) faster with Amazon EC2 accelerated compute instances. I explained how to configure NVIDIA CUDA libraries to enable frameworks like JAX to work on Batch. And I walked you through a time and cost analysis, showing that while larger CPU and GPU instances come with a higher per-hour price tag than the cheapest options, the time savings can enable data scientists to iterate faster and build better MMM models more efficiently.

To get started running mixed media models on AWS Batch, have a look at our GitHub repository.

Niro Amerasinghe

Niro Amerasinghe is a Senior Solutions Architect based out of Auckland, New Zealand. With experience in architecture, product development, and engineering, he helps customers use AWS to grow their businesses.