AWS for Industries

How TripleLift optimized real-time bidding with custom load balancing, Spot, and Graviton

TripleLift is a digital advertising company reinventing ad placements at the intersection of creative, media, and data. Its marketplace serves the world’s leading brands, publishers, streaming companies, and demand-side platforms, and its real-time bidding (RTB) service executes over 7 trillion ad transactions monthly. Customers choose TripleLift because of its addressable offerings (from native to online video to connected television) and supportive experts dedicated to maximizing partner performance.

TripleLift’s optimization journey started more than 4 years ago and, like many optimizations, is more akin to a marathon than a sprint. Working backward from a set of goals and issues, TripleLift has improved the efficiency and reduced the costs of its Ad Exchange using a combination of its own engineering, AWS best practices, open-source software, and service improvements.

The Ad Exchange service was built on the AWS Cloud and has evolved over the last 10 years. The digital advertising market is highly competitive, so the cost of running the service is constantly under scrutiny. Maintaining margins by reducing waste provides high leverage, particularly by focusing on two of the largest non-differentiated expenses: data transfer out (DTO) and compute (using Amazon Elastic Compute Cloud (Amazon EC2)). This blog post focuses on the latter.

Initial optimizations

The original service was built using Amazon EC2 on-demand compute and the Application Load Balancer (ALB). It had a CPU scaling target, used predictive scaling, and ran on a single instance type (c5.9xlarge). The following diagram depicts this initial setup at the AWS infrastructure level, distributing traffic across Availability Zones.

Figure 1: Exchange architecture using Application Load Balancer

This first iteration also utilized an equal traffic split across exchange instances. The following simplified traffic flow diagram illustrates how this worked in practice and recaps the main characteristics of the solution.

Figure 2: Original architecture using Application Load Balancer

The first and most straightforward optimization was to move from on-demand pricing to Savings Plans and Reserved Instances. This change greatly reduced costs for the steady-state workload but couldn’t be used for the spiky parts of the workload.

The next optimization was adopting Amazon EC2 Spot instances. The Exchange servers were modified to be stateless to accommodate the interruptible nature of Spot, which lets fault-tolerant workloads run for up to 90% less than on-demand pricing. This optimization worked well for the spiky parts of the workload. Apart from the addition of identical Spot instances, the architecture was largely unchanged, as the following diagram shows.

Figure 3: Architecture using Application Load Balancer with Spot

While Spot provided significant cost savings, Spot availability depends on unused EC2 capacity in the AWS Cloud. In addition, the application design favored the latest generation of compute-optimized instances (C type), which weren’t always available as Spot. Therefore, we manually maintained a list of acceptable instance types based on empirical testing. This was acceptable, but cumbersome.

We then incorporated attribute-based instance type selection for Auto Scaling groups. It allows the creation of a server profile that includes attributes such as vCPU, memory, storage, burstable, IncludeTypes, ExcludeTypes, and many more. This instance selection method has the benefit of automatically including new instance types as they’re released, giving the system more flexibility. It is coupled with the price-capacity-optimized allocation strategy for Spot, which makes Spot instance allocation decisions based on both the price and the capacity availability of Spot instances.
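As a rough illustration of what such a server profile can look like, the following sketch builds the `MixedInstancesPolicy` structure accepted by the EC2 Auto Scaling API. The field names follow that API; every value (launch template name, vCPU and memory minimums, on-demand percentage) is a hypothetical placeholder and not TripleLift's actual configuration.

```python
# Sketch of an attribute-based instance selection policy.
# Field names follow the EC2 Auto Scaling API; all values are hypothetical.
mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "exchange-server",  # hypothetical name
            "Version": "$Latest",
        },
        "Overrides": [
            {
                # Describe the server profile instead of listing instance types.
                "InstanceRequirements": {
                    "VCpuCount": {"Min": 36},         # assumed minimum size
                    "MemoryMiB": {"Min": 65536},      # assumed minimum memory
                    "BurstablePerformance": "excluded",
                    "InstanceGenerations": ["current"],
                }
            }
        ],
    },
    "InstancesDistribution": {
        "OnDemandPercentageAboveBaseCapacity": 0,     # assumed: Spot above the base
        "SpotAllocationStrategy": "price-capacity-optimized",
    },
}

print(mixed_instances_policy["InstancesDistribution"]["SpotAllocationStrategy"])
```

A dictionary of this shape could be passed as the `MixedInstancesPolicy` argument when creating or updating an Auto Scaling group. Because only minimums are set, the resulting pool can contain instances much larger than the minimum size.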

As part of the addition of Spot, TripleLift was able to adapt its Java code to also work on Graviton with minimal changes. Specifically, any components that call native code through the Java Native Interface (JNI) needed to be checked for compatibility with the ARM architecture. Flame chart analysis was also conducted to confirm that there were no architecture-specific performance regressions.

The Exchange architecture had evolved to use Amazon EC2 on-demand and Spot with the Application Load Balancer (ALB). The scaling target was now queries per second (QPS) with an equal traffic split, such that the smallest instances would run at a desired CPU utilization. Predictive scaling, multiple instance types, and multiple architectures were also enabled.

Challenges with Spot pools and HAProxy agent iteration

This seemed like it might be the end—financial and architectural optimizations had been applied and best practices had been followed. The system ran well, but now a new issue arose.

To allocate as many Spot instances as possible, the instance attributes targeted a minimum number of vCPUs and memory footprint. This meant that the eventual pool could contain instances of the minimum size and instances that could be twice as large or larger. TripleLift began to notice that as the instance pool became more heterogeneous, the utilization of the larger servers was going down, resulting in unused CPU cycles. The following diagram highlights this underutilization in greater detail and its impacts on the architecture.

Figure 4: Architecture using Application Load Balancer with a Spot pool

This led to another round of optimization aimed at improving compute utilization, lowering cost, and reducing our carbon footprint. Why was this happening? The Auto Scaling groups were using target tracking scaling, but the tracked metric is the average CPU utilization across the group. The smaller servers were hitting the CPU targets, while larger servers were underutilized.

The success metric and optimization target was the spread (maximum minus minimum) of CPU utilization across different-sized instances in the cluster. If that spread could be minimized by sending more traffic to larger instances and less traffic to smaller instances, every instance in the cluster would, in theory, run at the same desired utilization.
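The effect can be seen in a toy calculation. The sketch below assumes that CPU work scales linearly with traffic and that capacity scales with vCPU count, both simplifications; the pool sizes and load are made-up numbers. With an equal split, the smallest instances run at about 65% while the largest sits near 24%; with a capacity-proportional split, the spread drops to zero.

```python
def equal_split(vcpus, load):
    """Utilization per instance when every instance receives the same traffic."""
    per_instance = load / len(vcpus)
    return [per_instance / v for v in vcpus]


def proportional_split(vcpus, load):
    """Utilization per instance when traffic is proportional to vCPU count."""
    total = sum(vcpus)
    return [(load * v / total) / v for v in vcpus]


def spread(utils):
    """Max-min CPU spread, the success metric described above."""
    return max(utils) - min(utils)


# Hypothetical Spot pool (vCPUs per instance) and load (in vCPUs of work).
pool = [36, 36, 48, 72, 96]
load = 117.0

equal = equal_split(pool, load)
balanced = proportional_split(pool, load)
print("equal split: ", [f"{u:.0%}" for u in equal], f"spread={spread(equal):.0%}")
print("proportional:", [f"{u:.0%}" for u in balanced], f"spread={spread(balanced):.0%}")
```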

The next effort was to create a capacity-aware load balancer using HAProxy and to add a HAProxy agent (originally written in Golang) to each server. The agent reports utilization, and HAProxy, acting as the load balancer, can direct different amounts of traffic to each server based on its utilization. Consul serves as the service discovery layer of this system by automatically registering and deregistering Exchange servers as they spin up or terminate. The ensemble of HAProxy instances was fronted by a Network Load Balancer (NLB) to provide Layer 4 load balancing and SSL termination. The following architecture diagram illustrates how these components work together at an infrastructure level.

Figure 5: Architecture using Network Load Balancer with HAProxy agents
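One plausible way for a per-server agent to steer traffic is HAProxy's agent-check feature, in which HAProxy periodically opens a TCP connection to the agent and reads a short ASCII reply such as `75%`, which scales the server's weight relative to its configured value. The post does not say which mechanism TripleLift used, so the following is only a minimal sketch under that assumption; the port, the clamp bounds, and the `AgentState` holder are hypothetical.

```python
import socketserver


def agent_reply(weight_percent: float) -> bytes:
    """Format a weight as an agent-check reply: an integer percentage and a newline."""
    # Clamp bounds are assumptions; a floor of 1 keeps a minimum traffic allocation.
    pct = max(1, min(100, round(weight_percent)))
    return f"{pct}%\n".encode("ascii")


class AgentState:
    """Holds the weight that a separate control loop would keep up to date."""
    weight_percent = 100.0


class AgentHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # HAProxy connects, reads the reply, and closes the connection.
        self.request.sendall(agent_reply(AgentState.weight_percent))


def serve(port: int = 9999) -> None:  # hypothetical port
    """Run the agent until interrupted."""
    with socketserver.TCPServer(("0.0.0.0", port), AgentHandler) as server:
        server.serve_forever()
```

On the HAProxy side, a backend server line would enable this with the `agent-check`, `agent-port`, and `agent-inter` options. Calling `serve()` starts the listener; the control loop that updates `AgentState.weight_percent` is left out here.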

The original traffic allocation methodology used a weighted system based on the CPU architecture, the manufacturer, and the speed of the backend servers. The fastest architecture was designated as the benchmark, and the others were weighted as a percentage relative to it. While this brought some improvement, it created unwanted side work in trying to perfect the relative calculations; for example, layering in differences in clock speed and constantly updating the weights for new EC2 instance types. Additionally, there were still wasted CPU cycles because the model didn’t prove effective in converging on the desired utilization target at scale.

A second version of the HAProxy agent was developed with a different working model, the proportional controller. This is a system where you set a goal, measure current state, and apply a proportional correction to achieve that goal. HAProxy Agent v2 was launched with a CPU goal (setpoint) of 65%. Unfortunately, there were a few issues:

  • There were some race conditions in the CPU measurement process because of a naive Golang implementation.
  • There was a failure scenario when the cluster’s average CPU exceeded the setpoint, and all nodes would continuously shed load until all instances reached their minimum traffic allocation. This would cause the cluster to be wholly unbalanced and give inaccurate average utilization values to the autoscaling group controller, which would then fail to scale up the cluster.
  • The agents on each instance had no insight into the state of the cluster or other instances’ CPU utilization.
  • The Network Load Balancer had inconsistent traffic allocation across Availability Zones, likely because of round-robin DNS caching of IP addresses and long-lived connections.
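The load-shedding failure in the second bullet can be reproduced in a toy model. The sketch below is not the v2 agent; it assumes a multiplicative proportional update, a gain of 0.5, weight limits of 1 to 100, and a made-up pool whose balanced utilization (about 79%) is above the 65% setpoint. Every instance sees a CPU above the setpoint, so every instance keeps lowering its weight until all of them sit at the minimum, which is an equal split again.

```python
def utilizations(weights, vcpus, load):
    """CPU utilization per instance when traffic is split in proportion to weight."""
    total = sum(weights)
    return [load * w / total / c for w, c in zip(weights, vcpus)]


def adjust(weight, cpu, setpoint, gain=0.5, w_min=1.0, w_max=100.0):
    """One proportional-control step: scale the weight by gain * error, then clamp."""
    return min(w_max, max(w_min, weight * (1.0 + gain * (setpoint - cpu))))


vcpus = [36, 36, 48, 48]       # hypothetical pool
load = 132.0                   # vCPUs of work; balanced utilization is ~79%
weights = [50.0] * len(vcpus)
SETPOINT = 0.65                # fixed setpoint, as in agent v2

for _ in range(300):
    cpu = utilizations(weights, vcpus, load)
    weights = [adjust(w, u, SETPOINT) for w, u in zip(weights, cpu)]

final = utilizations(weights, vcpus, load)
print("weights:", weights)
print("cpu:    ", [f"{u:.0%}" for u in final])
```

In this model the run ends with every weight at the minimum, the smaller instances at roughly 92% CPU, and the larger ones at roughly 69%, so the cluster is unbalanced even though each agent followed its rule correctly.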

Finally, a third version of the HAProxy agent was developed. Sharding by Availability Zone was also added to reduce inter-AZ costs and allow each Availability Zone to scale independently based on its observed traffic patterns. The agent was rewritten in Java to be fully thread-safe. It still used proportional control, but a mathematical breakthrough was to use the average CPU of the observed Availability Zone as the setpoint. Each agent now published a custom Amazon CloudWatch metric and consumed the average CPU value across all of the instances in its own Availability Zone.

This resulted in convergence because all instances were now pulling toward the same dynamic, shared setpoint. An over-saturated cluster would no longer keep shedding load as before; instead, all instances would move toward the average CPU, even if that exceeded the Auto Scaling group’s target CPU. The Auto Scaling group’s job was solely to add instances until it reached its own CPU target, while converging on the average CPU (that is, minimizing the CPU spread by modulating the traffic to each instance) was the collective job of the agents. A spreadsheet simulation, followed by a real deployment, showed that the min-max CPU range was less than 5%, meaning the cluster had effectively converged. The following graph illustrates how this convergence manifested at a metric level.

Figure 6: CPU utilization convergence
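The same kind of convergence appears in a small simulation when the setpoint is the current average CPU of the group. This is a sketch under stated assumptions, not the authors' spreadsheet or the v3 agent: the update rule, gain, weight limits, pool, and load are all hypothetical, and the average is computed directly here, whereas the real agents read it from a CloudWatch metric.

```python
def utilizations(weights, vcpus, load):
    """CPU utilization per instance when traffic is split in proportion to weight."""
    total = sum(weights)
    return [load * w / total / c for w, c in zip(weights, vcpus)]


def adjust(weight, cpu, setpoint, gain=0.5, w_min=1.0, w_max=100.0):
    """One proportional-control step: scale the weight by gain * error, then clamp."""
    return min(w_max, max(w_min, weight * (1.0 + gain * (setpoint - cpu))))


vcpus = [36, 36, 48, 48]       # hypothetical pool in one Availability Zone
load = 132.0                   # vCPUs of work; balanced utilization is ~79%
weights = [50.0] * len(vcpus)  # start from an equal split

for _ in range(100):
    cpu = utilizations(weights, vcpus, load)
    zone_average = sum(cpu) / len(cpu)   # dynamic, shared setpoint
    weights = [adjust(w, u, zone_average) for w, u in zip(weights, cpu)]

final = utilizations(weights, vcpus, load)
cpu_spread = max(final) - min(final)
print("cpu:   ", [f"{u:.1%}" for u in final])
print("spread:", f"{cpu_spread:.2%}")
```

Instances above the average shed traffic and instances below it take on more, so the spread shrinks each step even though the average itself (about 79% here) is above a typical scaling target. Bringing that average down remains the job of the Auto Scaling group.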

Observing the rate of requests presented an opposite pattern. The even traffic split gave way to dynamic behavior in which slower instances received fewer requests per second, while faster instances received more. The following graph illustrates this behavior at the monitoring level.

Figure 7: Requests per second – static to dynamic

The final architecture

With this final architecture, cost savings are approximately 23–40% monthly. In APAC, 30% fewer instances were being used; in the EU, 40% fewer instances; and in the US, 10% fewer instances. At a conservative dollar value, these numbers equate to more than $2M in yearly savings both from improved Amazon EC2 compute usage and lower load balancer capacity units (LCU) pricing for NLBs (25% less than ALBs). The following diagram shows how these optimizations come together in the final architecture.

Figure 8: Architecture using Network Load Balancer and HAProxy

The journey isn’t over. There are more ideas for optimization, such as lowering maximum Spot pricing, scaling CPU core count, and implementing upgrades to support HTTP/2 on the backend servers. Other items include improving performance through connection multiplexing and header compression (binary format), supporting HTTP/3 on the frontend, and tuning the overall CPU and memory usage of the Ad Exchange process. Additionally, with this new architecture, TripleLift can onboard new partners by connecting them directly to HAProxy nodes through the AWS RTB Fabric service or through an AWS PrivateLink connection to the Ad Exchange NLB. By bypassing the public internet in favor of the AWS private network, these integration paths reduce latency and overhead, providing a faster, more cost-effective experience for the entire ecosystem.

Conclusion

Through a 4-year optimization journey, TripleLift transformed its Ad Exchange service by using Spot instances, Graviton processors, and a custom HAProxy agent that implements proportional control to dynamically balance traffic based on the average CPU utilization in each Availability Zone. This architectural evolution delivered remarkable results, exceeding cost savings projections and achieving more efficient CPU utilization than in previous iterations. With this new approach, TripleLift is well positioned to support future growth and better interoperability with partners on the AWS Cloud.

James Yuzawa

James Yuzawa is a Principal Software Engineer at TripleLift where he focuses on driving operational efficiency in Real-Time Bidding systems. Over the past decade, he has led the team responsible for building, scaling, and optimizing the TripleLift Exchange — turning complex engineering challenges into high-performance solutions that power programmatic advertising at scale. His technical interests span flame-chart analysis, low-level optimization, control theory, data compression, cryptography, geographic information systems, and networking. When he's not at the keyboard, you'll likely find him somewhere in the woods.

Mark Hoover

Mark Hoover is a Senior Solutions Architect at AWS where he is focused on helping customers build their ideas in the cloud. He has partnered with many enterprise clients to translate complex business strategies into innovative solutions that drive long-term growth.

Rajesh Kesaraju

Rajesh Kesaraju is a Sr. Specialist SA for EC2 Spot at AWS. He helps customers cost-optimize their workloads by using EC2 Spot instances for workloads such as big data, containers, HPC, CI/CD, and stateless applications.

Sabin Gautam

Sabin Gautam is a Senior Engineering Manager of Cloud Platform & Architecture at TripleLift where he focuses on building and scaling robust, secure, and high-performance cloud infrastructure. With fifteen years of experience in large-scale distributed systems engineering, he has led high-performing teams to translate complex technical challenges into solutions that drive meaningful business impact. His deep background in software engineering informs his approach to cloud architecture, platform reliability, and developer experience — ensuring engineering organizations can move fast without sacrificing security or scale. Give him a free afternoon and he'll find something to optimize that wasn't broken to begin with.

Sean Kumar

Sean Kumar is VP and Head of Cloud & Security at TripleLift where he focuses on modernizing infrastructure, strengthening security posture, and scaling distributed systems across the organization. With over 15 years of specialized experience spanning IT operations, cloud engineering, and cybersecurity, he brings a rare combination of deep technical expertise and economic insight — rooted in his background in both Computer Science and Economics. Sean has a proven track record of leading organizations through complex technical transformations, balancing engineering rigor with strategic business thinking to deliver outcomes that are secure, scalable, and built to last.

Shyamala Sivalingam

Shyamala Sivalingam is a Sr. Customer Solutions Manager at AWS with over 19 years of experience leading large-scale digital transformation and cloud migration initiatives. Since joining AWS in 2022, she has worked closely with enterprise customers to drive modernization, operational efficiency, and innovation through cloud and AI technologies.