Aeron performance enables capital markets to move to the cloud on AWS
With recent performance and operational improvements in the cloud, capital markets venues are now actively engaged in the race to the cloud. In many cases, this requires capital markets infrastructure to be re-engineered for resilience, performance and consistency. Aeron’s high performance message transport and cluster technology is uniquely suited to help move front-office trading and risk workflows, liquidity providers, trade execution, and market data into the public cloud.
The Aeron team at Adaptive and Amazon Web Services (AWS) have benchmarked Aeron Open Source and Aeron Premium to show capital markets participants the performance that can be achieved. We believe that the cloud is now ready for the most demanding capital markets workloads. This blog post outlines what Aeron does, the benchmarking process, and Aeron’s results.
In this post, we will show how Aeron Premium is roughly 500 times faster at the 99th percentile than other commonly used messaging protocols and almost 60 times faster for end-to-end encrypted transport than other commonly used encryption protocols. For clustered state replication, Aeron Premium almost halves latency while achieving up to 8x the throughput of Aeron Open Source.
What is Aeron?
Aeron is the cloud-native, open-source low-latency message transport and cluster technology developed by Adaptive and used by financial services firms globally to build sophisticated high-performance trading systems. Adaptive has worked with AWS since 2014, using its cloud-based technology to run two versions of Aeron to deploy purpose-built trading solutions for financial institutions. Open-source Aeron consists of Aeron Transport for messaging, and Aeron Cluster for sequenced, persisted, state replication. Aeron Premium, from Adaptive, provides a set of components to enhance performance, security and resilience.
Aeron is widely used in electronic trading and is relied upon for mission-critical systems. A public list of users can be found at http://www.aeron.io/.
What problems does Aeron solve?
Aeron Transport and Aeron Cluster solve two key pain points for capital markets systems in the cloud:
- Performance: Low-latency, high-throughput data transport and dissemination
- High-availability: 24/7, ‘always on’ systems
Data transport and reliable dissemination
The Challenge: Reliably moving data at predictable low latencies is fundamental to any system involved in low-latency, high-frequency markets. Data doesn’t just need to be moved from one machine to another. It often needs to be moved from one machine to many others—market data distribution is a common example of this. Hardware-based multicast solutions solved this problem for capital markets businesses before the cloud. Multicast in the cloud requires, unfortunately, a compromise in the performance these firms are used to.
The Solution: Until Aeron, very few solutions were suitable for this challenge. Aeron Transport reliably and predictably transports data across inter-process communication (IPC) and local and wide area networks. It adds minimal overhead to the latency of the underlying network, while providing flow and congestion control built for today’s multi-tenant high-capacity networks. Aeron Transport is especially well-suited to message transport in the cloud, with features such as:
- User Datagram Protocol (UDP)-based messaging with capital markets-tuned flow- and congestion-control algorithms
- Multi-destination cast, which provides a high-throughput, multicast-like pattern for the cloud
- Data Plane Development Kit (DPDK) kernel bypass, which provides much higher network throughput and lower latency by directly accessing the underlying physical network card
- Natural batching, which enables asynchronous message passing to reach incredibly high throughput
These features come together to allow Aeron Transport to transmit data reliably and at very fast rates on AWS. For reliable messaging, Aeron Transport records your data to storage and replays it from there. This enables users to create fast, complex messaging topologies tuned for their requirements, including large-scale, reliable market data distribution.
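The natural batching idea from the feature list can be illustrated with a small toy model. This is only a sketch of the general technique: the `NaturalBatcher` class, its method names, and the `max_batch` limit are hypothetical and are not part of any Aeron API. The point it shows is that the sender never waits for a batch to fill; it simply takes whatever accumulated while it was busy.

```python
from collections import deque


class NaturalBatcher:
    """Toy model of natural batching (hypothetical, not an Aeron API)."""

    def __init__(self, max_batch=64):
        self.pending = deque()      # messages waiting to be sent
        self.max_batch = max_batch  # upper bound on a single batch

    def offer(self, message):
        # Producers never wait for a batch to fill; they only enqueue.
        self.pending.append(message)

    def drain(self):
        # The sender takes whatever has accumulated since its last pass.
        # Under light load that is a single message (low latency); under
        # heavy load it is many messages (high throughput). No timer is used.
        batch = []
        while self.pending and len(batch) < self.max_batch:
            batch.append(self.pending.popleft())
        return batch


batcher = NaturalBatcher(max_batch=4)

batcher.offer("order-1")
light_load = batcher.drain()      # one message was waiting, so batch size is 1

for i in range(2, 8):
    batcher.offer(f"order-{i}")   # six messages arrive while the sender is busy
heavy_load = batcher.drain()      # batch grows with load, capped at max_batch
remainder = batcher.drain()       # whatever is left goes out on the next pass

print(light_load, heavy_load, remainder)
```

Because the batch size follows the backlog rather than a fixed timer or count, the same code path serves both quiet and busy periods without tuning.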
Sequenced, persisted, and highly-available, in the cloud
The Challenge: Traditional cloud approaches to resilience rely on scale-out architectures that sacrifice consistency for availability. Capital markets systems require consistency, performance, and sequencing, as well as recovery time objectives (RTO) and recovery point objectives (RPO) in the order of milliseconds. Firms often run entire markets on one or two machines to achieve this, making sacrifices on consistency or availability in the process. Recovering from system outages where data consistency or the ordering of transactions is required entails extremely complex reconciliations and, ultimately, some of the costliest payouts in compensation to impacted customers. The cloud engineering pattern, while offering fast provisioning and scalability, is also one where the underlying hardware can change at any moment: network interfaces are patched, processes are migrated between machines, and local storage can disappear. To date, meeting these engineering requirements in the cloud has been thought to be nearly impossible.
The Solution: Aeron Cluster is uniquely well-suited for this paradigm. It provides developers with a resilient, performant platform that can process over two million messages per second and, at 100,000 messages per second, achieves a 99th percentile latency of 130 microseconds (detailed results are in the Benchmark Results section).
When underlying cloud services are restarted or migrated, Aeron Cluster seamlessly continues service operation with recovery in milliseconds.
This architecture enables developers to build highly-available, resilient systems with a minimum of infrastructure and still achieve incredible throughput and performance. Developers concentrate on the logic of their business domain, relying on the resilience guarantees provided by Aeron Cluster for their RTO and RPO needs.
Benchmark Results
We have demonstrated exceptional throughput and latencies with Aeron on AWS. We have also open sourced the source code of the benchmarks that we ran. (See the end of this blog for our statement on why open, transparent, replicable performance tests are the only ones worth reviewing.)
Initial tests were run across different families of Amazon EC2 instances. The final tests were run on c5n.9xlarge and c6i.32xlarge instances.
We first tested Aeron at its lowest level, using Aeron Transport to push network throughput and latencies to their extreme. We also tested Aeron Transport Security (ATS), an Aeron Premium component that uses industry-standard cryptography primitives. We then tested Aeron Cluster, to show the throughput and latency that a highly available system using Aeron Cluster could achieve.
We tested both open-source Aeron and Aeron Premium. In these tests they differ only in how they access the network card: open-source Aeron uses Berkeley Software Distribution (BSD) sockets, and Aeron Premium uses DPDK. The Aeron DPDK kernel bypass allows applications to directly access network interfaces and hardware resources on virtual and physical instances (AWS Nitro System instances make the underlying hardware available). This reduces the overhead associated with traditional kernel-based networking, improving messaging latency and increasing throughput.
Aeron Transport – The latency of a round trip without persistence
Test Setup: We used Aeron Transport to benchmark the underlying latency and throughput when sending messages across the AWS network. We used Amazon EC2 Cluster Placement Groups to control the proximity of AWS instances within an AWS Availability Zone (AZ). A cluster placement group is a logical grouping of instances within a single Availability Zone. A cluster placement group can span peered virtual private clouds (VPCs) in the same Region.
To measure latency, we performed five test runs of our test case: an echo test of a 288-byte message at 100,000 messages per second.
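To make the percentile figures in this post concrete, here is a minimal sketch of how a 99th percentile can be derived from a set of round-trip times. It assumes the nearest-rank definition of a percentile, and the sample values are made up for illustration; the published benchmarks may compute percentiles differently.

```python
import math


def percentile(samples_us, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_us:
        raise ValueError("no samples")
    ordered = sorted(samples_us)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up round-trip times in microseconds: 100 samples with one outlier.
rtts_us = [40] * 95 + [45] * 4 + [900]

p50 = percentile(rtts_us, 50)    # the median
p99 = percentile(rtts_us, 99)    # 99 of the 100 samples are at or below this
p100 = percentile(rtts_us, 100)  # the worst case, which includes the outlier

print(p50, p99, p100)
```

Reporting a high percentile rather than an average shows what the slowest requests experience, which is why the results in this post are quoted at the 99th (or 99.9th) percentile.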
For throughput, we wanted to understand the maximum while still meeting a given latency within an Availability Zone. We chose a latency ceiling of one millisecond at the 99th percentile, discarding test runs at throughputs with latency results above this threshold.
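The selection rule described above (keep only runs whose 99th percentile latency is within the ceiling, then report the highest rate) can be sketched as follows. The function name and the `(rate, p99)` pairs are hypothetical and do not come from the benchmark data.

```python
def max_rate_under_ceiling(runs, ceiling_us):
    """Return the highest message rate whose p99 latency does not exceed
    the ceiling, or None if every run exceeded it."""
    passing = [rate for rate, p99_us in runs if p99_us <= ceiling_us]
    return max(passing) if passing else None


# Hypothetical runs: (messages per second, p99 latency in microseconds).
runs = [
    (100_000, 70),
    (200_000, 180),
    (300_000, 640),
    (400_000, 2_300),  # above a one millisecond ceiling, so discarded
]

best = max_rate_under_ceiling(runs, ceiling_us=1_000)
print(best)
```

Raising `ceiling_us` admits runs at higher rates, so a throughput figure is only meaningful together with the latency ceiling it was measured against.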
The results of this testing were as follows:
- Latency of 66 microseconds with Aeron Open Source, dropping to 43 microseconds with Aeron Premium kernel bypass*, at 100k messages/second.
  - This compares to 22,151 microseconds at 25k messages/second for other commonly used messaging protocols.
  - Aeron Premium is roughly 500 times faster at four times the message rate.
- For encrypted transport, using Aeron Premium Transport Security and kernel bypass, a latency of 46 microseconds* was measured.
  - This compares to 384 microseconds and 2,699 microseconds for other commonly used encryption protocols*.
  - This represents an improvement of between eight- and 58-fold for Aeron Transport Security.
- Throughput of 350,000 messages/second for a 288-byte message with Aeron Open Source.
  - With Aeron Premium, throughput leapt more than eight-fold, to over 3,000,000 messages/second.
*Apart from throughput results, all Aeron results quoted relate to a round trip time at the 99th percentile of a 288-byte message and a rate of 100,000 messages/second.
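The multiples quoted in the results above follow directly from the measured figures. The short calculation below reproduces them; the variable and function names are illustrative only, and every number is taken from the bullets above.

```python
def speedup(baseline, improved):
    """How many times larger the baseline figure is than the improved one."""
    return baseline / improved


# 99th percentile round-trip latencies in microseconds, from the results above.
messaging = speedup(22_151, 43)       # other messaging protocol vs Aeron Premium
encrypted_low = speedup(384, 46)      # faster of the two encryption baselines vs ATS
encrypted_high = speedup(2_699, 46)   # slower of the two encryption baselines vs ATS

# Throughput in messages per second; 3,000,000 is a lower bound ("over").
throughput_gain = 3_000_000 / 350_000

# The messaging comparison was made at different message rates.
rate_ratio = 100_000 / 25_000

print(round(messaging), round(encrypted_low, 1), round(encrypted_high, 1))
print(round(throughput_gain, 1), rate_ratio)
```

This gives roughly 515x for messaging, about 8.3x to 58.7x for encrypted transport, and at least 8.6x for throughput, with the Aeron latency measured at four times the message rate of the messaging baseline.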
Aeron Cluster – High-throughput and low-latency with high-availability
Moving on to Aeron Cluster, we wanted to benchmark the latency and throughput of a round-trip response where state is replicated across a three-node Aeron Cluster system. We tested two different deployment configurations. For more information on how Aeron Cluster works, please refer to the Aeron Cookbook for guides.
Set-up 1: Cluster Placement Group – a set-up optimized for performance
Here, we deployed Aeron Cluster nodes in the same AWS Availability Zone. This means that messages sent to the cluster are replicated to a quorum of other nodes within the same AZ for fault tolerance.
This configuration gives latency benefits but it comes with a redundancy trade-off when compared to deploying nodes across Availability Zones. When deployed using Cluster Placement Groups, if the primary Availability Zone is lost, the system can be brought back up from the messages replicated to a secondary AZ through the use of Aeron Premium Cluster Warm Standby.
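The idea of replicating to a quorum can be illustrated with a generic majority rule: a message counts as safely replicated once more than half of the nodes have it in their logs. The sketch below is a simplified illustration of that rule under our own assumptions, not Aeron Cluster's implementation, and the node names and log positions are hypothetical.

```python
def quorum_size(node_count):
    """Smallest number of nodes that forms a majority."""
    return node_count // 2 + 1


def commit_position(log_positions):
    """Highest log position that a majority of nodes have reached."""
    ordered = sorted(log_positions, reverse=True)
    return ordered[quorum_size(len(ordered)) - 1]


# Hypothetical log positions for a three-node cluster.
positions = {"leader": 1_024, "follower-a": 1_024, "follower-b": 960}

committed = commit_position(positions.values())

# One lagging follower does not hold back progress: two of the three nodes
# already hold everything up to position 1,024.
print(quorum_size(len(positions)), committed)
```

With three nodes the quorum is two, so the cluster keeps making progress while any single node is slow, restarting, or lost.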
As with the Aeron Transport tests, our Aeron Cluster testing covered latency and throughput. For latency, we tested the performance of Aeron Cluster with a 288-byte message at 100,000 messages per second. For throughput, we wanted to understand the maximum throughput while still meeting a given latency within an Availability Zone. We chose a latency ceiling of one millisecond at the 99th percentile, disregarding test runs at throughputs with latency results above this threshold.
The results of this test were as follows:
- For latency, we measured a round trip time of 235 microseconds when using Aeron Open Source.
- With Aeron Premium, we saw that latency almost halved – to 130 microseconds at the 99th percentile for 100,000 messages per second of a 288-byte message.
- For throughput, we maintained over 250,000 messages a second for a 288-byte message with Aeron Open Source while staying under our one millisecond threshold.
- This compares with over 2,000,000 messages a second with Aeron Premium, an 8x improvement over the open-source results.
Figure 5: Aeron Cluster latency using a Cluster Placement Group within an Availability Zone
Set-up 2: Partition Placement Group – a set-up optimized for redundancy
Here, the test was conducted with a Partition Placement Group. Partition placement groups help reduce the likelihood of correlated hardware failures for your application. When using partition placement groups, Amazon EC2 divides each group into logical segments called partitions. We deployed Aeron Cluster nodes across different Availability Zones within the same Region. That means that messages sent to the cluster are replicated to a quorum of other nodes across at least two other AZs.
This setup gives enhanced reliability with reduced RTO and RPO times, but comes with a trade-off in performance. The latency of transmitting data to nodes in other Availability Zones is simply higher than within a single Availability Zone.
Figure 7: Aeron Cluster test set-up using AWS Partition Placement Group set-up
For our throughput tests, we raised the latency ceiling above which test runs are disregarded from one millisecond to 10 milliseconds, at the 99th percentile. This simply accounts for the higher network latency of a cluster deployed across Availability Zones.
The results of this test were as follows:
- For latency, we measured a round trip of 3,428 microseconds using Aeron Open Source, at the 99.9th percentile.
- With Aeron Premium the latency fell by more than a third, to 2,109 microseconds, again at the 99.9th percentile, for 100,000 messages per second of a 288-byte message.
- For throughput, with Aeron Open Source we sustained 250,000 messages per second for a 288-byte message.
- Using Aeron Premium, throughput improved nearly 7x, to over 1,700,000 messages per second.
Transparent, repeatable results
Many public statements by technology firms about latency are designed to make the biggest impact but are often light on details. They are ambiguous, and almost always impossible to independently repeat in a reliable manner.
AWS believes in performance results that are straightforward to understand and can be independently repeated and verified. We hope this post has provided you with a transparent and clear understanding of how Aeron works and the impressive performance it can achieve on AWS.
Along with Aeron Open Source, we have made our performance benchmarks project open source so you can run these performance tests yourself and confirm the above results are reliable and repeatable. We will happily talk about the results with you and walk you through how to set up your AWS infrastructure to achieve the same results with the best performance for your application.
The cloud has been seen as ‘too slow’ and ‘too unreliable’ for the most demanding capital markets workloads. With Aeron, our benchmarks show that this is no longer true. The tests indicate that AWS Cloud services and features (the different Amazon EC2 instance types, Placement Groups, Amazon VPC and network optimization) can efficiently run low-latency, high-frequency workloads.
The open-source Aeron Transport is hundreds of times faster than other commonly used open-source messaging protocols. With Aeron Premium, Aeron Transport latency falls by roughly a further third, from 66 to 43 microseconds.
Against market-leading transport encryption implementations, Aeron Transport Security demonstrates between eight- and 58-times lower latency.
Aeron Cluster enables high throughput at very low latencies while reliably sequencing and persisting message flows, and is engineered to provide resilience in the face of transient failures.
If you are interested in using Aeron to move your trading infrastructure to AWS, please contact Aeron. If you want to run these performance tests in your own environment you can request the Aeron Performance Testing guide. The Aeron team also run regular community meet-ups. If you’d like to meet and discuss your needs, you can register for their mailing list.