How SeatGeek simulates massive load with AWS Batch to prepare for big events

This post was contributed by Zach Hammer, Sr. Software Engineer, and Anderson Parra, Staff Software Engineer, from SeatGeek, and by Umesh Kalaspurkar, Sr. Solutions Architect, and Cody Collins, Solutions Architect, from AWS.

SeatGeek is a ticket platform that caters to mobile users and offers ticket purchase and reselling services for sports games, concerts, and theatrical productions. In 2022, SeatGeek had an average of over 47 million daily tickets available, and their customers downloaded their mobile app more than 33 million times.

Performing browser-based load testing is necessary for organizations to make sure their web applications can handle high levels of traffic. This type of testing simulates user behavior in a realistic manner, helping to identify and address issues before they occur in a live environment. Browser-based load testing can reveal performance bottlenecks, like slow loading times and inefficient code, which can significantly impact the user experience. By regularly conducting browser-based load testing, we can ensure web applications can handle traffic spikes, reduce the risk of costly outages or downtime, and improve overall application performance and security.

This post explores SeatGeek’s implementation of a browser-based load testing system to simulate user behaviors during high-traffic events. The goal of this system was to best prepare SeatGeek for a large traffic spike, such as a ticket on-sale event attracting over 50,000 fans at once. An important part of this work was to achieve this while keeping costs manageable.

Scale testing is table stakes

A major difficulty with operating any platform at scale is preparing infrastructure for large influxes of user activity. A good application needs to support these spikes in load, and a great one should do it without its users noticing it. To prepare for these events, a team needs to do extensive testing on its application to make sure it can handle abnormally high usage.

For SeatGeek, this means building a system to perform browser-based load testing, a form of testing that closely simulates user actions by invoking the application directly from a browser. If done manually, this would mean opening a browser window and executing a prepared set of actions against the application before closing the browser window again. Now imagine automating this and trying to run 50,000 of those tests in parallel on your laptop. In this scenario, the most likely outcome is a laptop crash, not a successful suite of tests. Inevitably the next step is performing these tests in the cloud at large scale and in a cost-effective way.

A protocol-based load-test is like a burst of purchases during a flash sale that comes in the form of thousands of HTTP requests against a specific service endpoint (e.g. “/purchase”). This approach comes with a vast collection of tools, like JMeter, Taurus, and Gatling. Those, however, focus more on network traffic and protocol compliance, without considering the actual end-user experience.

Browser-based load testing, however, simulates user browser activity: a burst of purchases during a flash sale now comes in the form of thousands of real-life browser sessions connecting to a website and submitting purchases on its checkout page. It’s a more comprehensive and realistic approach to testing web applications, leading to improved performance and a better user experience. But browser-based load testing comes with much sparser tooling options, especially when doing it at scale. And using this tooling can quickly become expensive, as pricing models are often bound to usage, which in this scenario would be significant.

SeatGeek decided to build its own load testing system because of the need for scale and flexibility, but the main driver was cost. Their initial estimates showed figures up to $100K for a single 30-minute, 50,000-browser test. While load testing is a high priority for SeatGeek, a bill of that magnitude is difficult to justify on a continuous basis. Building their own solution would allow them to test at the scale they desired, and give them the option to keep iterating on, and improving, their system as circumstances changed. They saw the flexibility to support changes in requirements (more complex tests, larger scale, more user behaviors…) as crucial to designing a sustainable system.

Designing a solution

SeatGeek decided to design the system using AWS Batch for the core compute layer. They did consider other options like AWS Lambda, but needed longer than the maximum execution time of 15 minutes to simulate realistic user behavior.

Another important feature was the ability to leverage AWS Batch array jobs to share job definitions among several parallelized child jobs. Remember that one of the driving requirements for the design was simulating 50,000 users at a time. Orchestrating that many jobs would be a large contributor to the system’s operational overhead, so being able to deploy thousands of child jobs with a single configuration immediately makes the system more manageable.
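To make the array-job idea concrete, here is a minimal sketch in Python. It is not SeatGeek's code, which the source does not show. It splits a target number of simulated users into as few array jobs as possible and builds the request parameters for each one, all pointing at the same job definition. It relies on the AWS Batch rule that an array job holds between 2 and 10,000 children. The names `plan_array_jobs`, `build_submit_requests`, and `sessions_per_child`, and the job names it generates, are hypothetical.

```python
import math

MAX_ARRAY_SIZE = 10_000  # AWS Batch limit on children per array job
MIN_ARRAY_SIZE = 2       # array jobs smaller than this are not accepted


def plan_array_jobs(total_users, sessions_per_child=1):
    """Return the array sizes needed to cover total_users browser sessions."""
    children = math.ceil(total_users / sessions_per_child)
    sizes = []
    while children > 0:
        size = min(children, MAX_ARRAY_SIZE)
        sizes.append(size)
        children -= size
    return sizes


def build_submit_requests(total_users, job_queue, job_definition,
                          sessions_per_child=1):
    """Build one request per array job; every request shares one job definition."""
    requests = []
    for n, size in enumerate(plan_array_jobs(total_users, sessions_per_child)):
        request = {
            "jobName": f"load-test-{n}",  # hypothetical naming scheme
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
        }
        # A leftover chunk of a single child is submitted as a plain job,
        # because an array size below 2 is not valid.
        if size >= MIN_ARRAY_SIZE:
            request["arrayProperties"] = {"size": size}
        requests.append(request)
    return requests
```

Under these assumptions, 50,000 one-session children become five array jobs of 10,000 children each. Each dictionary could then be passed to `boto3.client("batch").submit_job(**request)`. Inside a child container, Batch exposes the child's position through the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable.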

Finally, the option to run Batch on Amazon Elastic Compute Cloud (Amazon EC2) instances makes it much easier to use headless browsers. It’s also potentially much cheaper. Performing these load tests doesn’t inherently require much resiliency, so leveraging Spot Instances in EC2 is a great way to reduce the overall cost of the system. Better yet, it’s still possible to switch to On-Demand instances should the need for an emergency, uninterrupted test arise.

Architecture overview

Figure 1: An example execution of SeatGeek’s browser-based load test.

A load test comprises two elements: the simulator and the orchestrator.

The simulator is a small Node.js script that leverages Playwright, an end-to-end testing tool, and Chromium, an open-source web browser project, to simulate a user’s behavior. The simulated user follows a predefined set of steps to visit a URL that’s protected by SeatGeek’s virtual waiting room queueing software. After this, the simulated user either successfully arrives at the event page or encounters an error. SeatGeek packaged the script into a Docker image, which they deployed to AWS Batch as a job definition.
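The shape of such a simulator can be sketched as follows. SeatGeek's script is written in Node.js and is not shown in the source; this illustration uses Playwright's Python API instead, and the URL, the `event_selector` argument, and the ten-minute timeout are all hypothetical. It also assumes that the waiting room eventually forwards the browser to the event page, so that arrival can be detected by waiting for an element of that page.

```python
def simulate_user(page, url, event_selector, timeout_ms=600_000):
    """Visit a queue-protected URL and report whether the event page was reached."""
    try:
        page.goto(url)
        # Assumption: the waiting room forwards the browser once the user is
        # admitted, so an element of the event page eventually appears.
        page.wait_for_selector(event_selector, timeout=timeout_ms)
        return "arrived"
    except Exception as exc:  # Playwright raises on timeouts and navigation errors
        return f"error: {type(exc).__name__}"


def run_with_chromium(url, event_selector):
    """Run one session in headless Chromium (requires the playwright package)."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            return simulate_user(browser.new_page(), url, event_selector)
        finally:
            browser.close()
```

Keeping `simulate_user` independent of how the browser is launched means the same steps can be exercised against a stub page in a unit test and against real Chromium inside the container.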

The orchestrator is a Go application responsible for managing the entire test execution. It interacts with their queuing software to control access to the protected resource under test and dispatches concurrent simulator executions to AWS Batch as an array job.

Beyond the compute layer, SeatGeek decided to use Datadog to gather test execution data, GitLab CI for CI/CD, and Slack to notify the team of test results.

A sample run goes like this. The orchestrator kicks off the process by initiating concurrent executions of the simulator in AWS Batch using array jobs. Batch then provisions several Amazon EC2 instances according to the orchestrator’s request and runs the simulation jobs. Finally, the results are reported through Datadog and stakeholders are notified.
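The waiting-and-reporting part of such a run might look like the sketch below. It is written in Python for illustration and is not SeatGeek's Go orchestrator. It assumes the per-status child counts are read from the `statusSummary` field that `describe_jobs` returns under `arrayProperties` for a parent array job. The function names and the `fetch_status` callback are hypothetical.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED")


def summarize_array_job(status_summary, size):
    """Condense per-status child counts into a small progress report."""
    finished = sum(status_summary.get(state, 0) for state in TERMINAL_STATES)
    return {
        "done": finished == size,
        "succeeded": status_summary.get("SUCCEEDED", 0),
        "failed": status_summary.get("FAILED", 0),
        "pending": size - finished,
    }


def wait_for_array_job(fetch_status, size, poll_seconds=30, sleep=time.sleep):
    """Poll until every child has finished, then return the final report."""
    while True:
        summary = summarize_array_job(fetch_status(), size)
        if summary["done"]:
            return summary
        sleep(poll_seconds)
```

With boto3, `fetch_status` could be a small function that calls `describe_jobs(jobs=[job_id])` and returns the parent job's `statusSummary`. The returned report is the kind of data an orchestrator could forward to a metrics or chat integration.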

To further optimize for cost, each Amazon EC2 instance runs several browser sessions (with each session representing a test execution) to cut down on the overall number of servers needed to execute the job.
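The packing arithmetic can be sketched as follows. The source does not say how many sessions share an instance or how they are assigned, so this Python illustration assumes each array child runs a fixed block of sessions and derives that block from the `AWS_BATCH_JOB_ARRAY_INDEX` variable Batch sets in the container. All function names and numbers are hypothetical.

```python
import math
import os


def instances_needed(total_sessions, sessions_per_instance):
    """Estimate how many instances a test needs at a given packing density."""
    return math.ceil(total_sessions / sessions_per_instance)


def sessions_for_child(array_index, sessions_per_child, total_sessions):
    """Return the session ids that one array child is responsible for."""
    start = array_index * sessions_per_child
    stop = min(start + sessions_per_child, total_sessions)
    return range(start, stop)


def current_child_sessions(sessions_per_child, total_sessions, env=os.environ):
    """Look up this container's array index and return its block of sessions."""
    index = int(env.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return sessions_for_child(index, sessions_per_child, total_sessions)
```

For example, packing 50 sessions onto each instance would bring a 50,000-session test down to 1,000 instances. The density that works in practice depends on how much CPU and memory each headless browser consumes.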

In the compute layer, AWS Batch provides a multi-layer model of compute abstractions. To execute jobs on Batch, developers must first create a compute environment, which contains the Amazon Elastic Container Service (Amazon ECS) container instances that run containerized batch jobs. SeatGeek decided to employ a managed compute environment, in which AWS provisions an Amazon ECS cluster based on the resource needs of the batch jobs. An Auto Scaling group (ASG) is then set up to manage the Amazon EC2 instances behind the cluster.

Here is a simplified configuration of SeatGeek’s AWS Batch compute environment in Terraform:

resource "aws_batch_compute_environment" "load_test" {
    // "load_test" is a placeholder resource name; other required arguments
    // (subnets, security groups, IAM roles) are omitted from this simplified example
    compute_environment_name_prefix = "load-test-compute-environment"
    compute_resources {
        // "optimal" will use instances from the C4, M4 and R4 instance families
        instance_type = ["optimal"]
        // select instances with a preference for the lowest-cost type
        allocation_strategy = "BEST_FIT"
        // with type = "SPOT", Spot Instances are only launched while their price
        // stays at or below this percentage of the On-Demand price
        bid_percentage = 100
        max_vcpus = 10000
        type = "EC2"
    }
    // AWS manages the underlying ECS instances
    type = "MANAGED"
}

Results

Using this solution, SeatGeek has successfully conducted load tests using 60,000 browser-based users for approximately 30 minutes, costing less than $100 each time. This is a noteworthy achievement compared to the price of prevailing options.

This allowed the team to perform these tests much more often than they could before. SeatGeek now conducts high-traffic load tests every week to detect any significant regressions and to monitor the overall health of their application. It’s a key element for maintaining the stability of their application.

Here is the total cost of the AWS Region that SeatGeek reserves for running these tests in isolation:

Figure 2: The cost graph of SeatGeek’s AWS consumption. Tests typically cost less than $100 to run each week, and practically nothing the rest of the time.

Conclusion

When protocol-based load testing is insufficient, browser-based load testing is an excellent approach to adequately prepare web applications for high traffic. In the future, SeatGeek aims to enhance their in-house testing tool to establish a more customizable testing foundation that can assess other crucial pathways in their web applications.

We hope that SeatGeek’s experimentation with this technique encourages you to assess when browser-based load testing might be suitable for your applications and shows you how to do it affordably.

AWS HPC Blog