AWS Storage Blog
Marqeta supercharges MySQL workloads using Amazon EBS io2 Block Express
Given the ubiquity of digital payments, cutting-edge fintech solutions hinge on seamless and highly available real-time transaction processing. Invariably, this needs the support of a performant, reliable, and secure datastore. And after considering technical requirements, fintech companies know that regulatory and compliance auditing never takes a back seat.
Enterprise AWS customer Marqeta needed all of this and more when deciding to migrate their card-issuing platform to the cloud. Marqeta is a heavy MySQL user, so their technical concerns are centered on transaction performance, high availability (HA), and disaster recovery (DR). After evaluating both managed and unmanaged service options, Marqeta strategically opted to run their most critical MySQL workloads on Nitro-based Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS) io2 Block Express.
Online transaction processing (OLTP) workloads are typically write heavy, often involving a large volume of rapid, small-sized requests that need to be quickly persisted to a database. Database architects and administrators need to account for fluctuating traffic patterns, business-driven performance requirements, planned and unplanned downtime, security, compliance, and cost. As part of this effort, choosing the right storage technology is imperative.
In this post, using insights from Marqeta’s journey, we discuss the storage-centric challenges of running high-performance database workloads in the cloud and how io2 Block Express can help. While our focus is on MySQL in this discussion, the io2 Block Express use case is applicable to most relational database engines for meeting stringent performance, availability, durability, and capacity requirements.
Amazon EBS io2 Block Express highlights
First, let’s cover what io2 Block Express brings to the table when attached to supported Nitro-based EC2 instance types. In addition to supporting standard Amazon EBS features, io2 Block Express adds the following high-performance attributes:
- Consistent sub-millisecond latencies, with up to 2.5x better average I/O latency than major cloud providers
- Maximum of 256,000 IOPS
- Maximum of 4,000 MB/s of throughput
- Storage capacity up to 64 TiB
- Amazon EBS Multi-Attach and NVMe reservations
io2 Block Express is best suited for workloads that need a single high-performance and high-capacity storage volume. io2 Block Express delivers the best outlier latency control in the cloud, making it a tailor-made storage solution for critical database workloads.
Transaction performance
A single database transaction may be composed of multiple queries of varying complexity, with each query requiring a certain number of storage input/output (I/O) operations. Storage performance is commonly expressed as the number of these operations completed per second (IOPS). At the storage layer, slow data writes can lead to transaction timeouts, upstream/downstream failures, and reduced availability. A typical culprit for this is I/O saturation, which occurs when IOPS demand outstrips a storage system’s performance capability. The primary symptoms of I/O saturation are an increase in storage queue depth and steadily increasing client response times.
Choosing an appropriate storage solution starts with estimating the IOPS performance that a workload needs. For example, if a transaction needs 50 I/O operations, and the expected traffic volume is 100 transactions per second, then the storage layer must be able to sustain 5,000 IOPS (50 I/O operations x 100 transactions/sec). If the storage solution in place cannot deliver 5,000 IOPS, or demand rises above 5,000 IOPS, then requests queue at the storage layer and queue depth increases, which leads to I/O saturation. For a workload of this size, Amazon EBS gp3 and io1 can be applicable storage solutions.
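The estimate above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only. The function names and the 3,000 IOPS figure used in the saturation check are hypothetical and chosen for the example.

```python
def required_iops(io_ops_per_transaction: float, transactions_per_second: float) -> float:
    """Sustained IOPS the storage layer must deliver for a given workload."""
    return io_ops_per_transaction * transactions_per_second


def is_saturated(demand_iops: float, provisioned_iops: float) -> bool:
    """True when demand exceeds what the volume can serve, so requests start to queue."""
    return demand_iops > provisioned_iops


# Worked example from the text: 50 I/O operations x 100 transactions/sec.
demand = required_iops(50, 100)
print(demand)  # 5000

# Hypothetical volume provisioned below the demand: queue depth would grow.
print(is_saturated(demand, provisioned_iops=3000))  # True
```

In practice, an estimate like this is usually padded with headroom for traffic spikes. How much headroom to add is a business decision that this sketch does not model.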
However, in the fintech space, OLTP transaction sizes are often smaller, and volume is much higher. For example, a single MySQL database might be expected to handle well above 100,000 IOPS. With a size of 16 KiB per I/O operation, that would need almost 1,600 MiB/s of throughput. Designed for high-performance workloads, io2 Block Express supports up to 256,000 IOPS (16 KiB per I/O operation) and up to 4,000 MB/s of throughput. IOPS can be provisioned independently of volume size (maximum ratio of 1,000 IOPS per GiB), making this a flexible storage option as well.
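As a rough check of these figures, the following Python sketch converts an IOPS target into throughput and derives the smallest volume size that the 1,000 IOPS per GiB ratio allows. The function names are hypothetical. The 16 KiB I/O size and the ratio are taken from the paragraph above.

```python
KIB_PER_MIB = 1024


def throughput_mib_per_s(iops: int, io_size_kib: int = 16) -> float:
    """Throughput implied by an IOPS rate at a fixed I/O size."""
    return iops * io_size_kib / KIB_PER_MIB


def min_volume_size_gib(iops: int, max_iops_per_gib: int = 1000) -> float:
    """Smallest volume that can be provisioned with the requested IOPS."""
    return iops / max_iops_per_gib


print(throughput_mib_per_s(100_000))  # 1562.5 (the "almost 1,600 MiB/s" above)
print(throughput_mib_per_s(256_000))  # 4000.0
print(min_volume_size_gib(256_000))   # 256.0
```

Under these assumptions, a volume needs to be at least 256 GiB before the full 256,000 IOPS can be provisioned on it.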
Low I/O latency is also a must-have for critical applications, which io2 Block Express provides at the sub-millisecond level. However, the overall average storage latency alone does not fully illustrate how well a storage solution performs for every transaction. Given that not every storage operation experiences identical latency, what can users expect in outlier scenarios? To support ultra-consistent performance, io2 Block Express has the lowest p99.9 I/O latency and up to 2.5x better p50 I/O latency among major cloud providers.
Capacity and durability
Storage capacity is a critical factor to consider, particularly for on-premises databases that have been vertically scaled over time. Managing a large MySQL database also demands the optimization of InnoDB, which is MySQL’s default storage engine. As database tables grow in size, partitioning can be used to spread sectioned table data across a single volume or horizontally scale the storage layer across multiple volumes. The latter allows for a single table to contain more data than can be stored on a single volume. For databases up to 64 TiB, MySQL instances can leverage a single io2 Block Express volume rather than striping data across multiple volumes, simplifying management and reducing operational overhead. And in the case of multi-volume partitions, the high capacity of io2 Block Express reduces maintenance overhead by enabling users to provision the lowest number of volumes needed.
In terms of physical durability, io2 Block Express volumes maintain an annual failure rate of 0.001% (99.999% durability). This equates to one volume failure per 100,000 actively running volumes over the course of a year. Combined with multi-replication and backup strategies, this level of durability is a great fit for workloads processing failure-sensitive data.
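The durability figure can be turned into a fleet-level expectation with a short calculation. The sketch below is illustrative only. It assumes volume failures are independent, which is a simplification made for this example and not a claim from the post. The function names are hypothetical.

```python
ANNUAL_FAILURE_RATE = 1e-5  # 0.001% expressed as a fraction


def expected_annual_failures(volume_count: int, afr: float = ANNUAL_FAILURE_RATE) -> float:
    """Average number of volume failures per year across a fleet."""
    return volume_count * afr


def prob_at_least_one_failure(volume_count: int, afr: float = ANNUAL_FAILURE_RATE) -> float:
    """Chance of one or more failures in a year, ASSUMING independent failures."""
    return 1 - (1 - afr) ** volume_count


print(f"{expected_annual_failures(100_000):.2f}")  # 1.00
print(f"{prob_at_least_one_failure(100_000):.3f}")  # 0.632
print(f"{prob_at_least_one_failure(10):.6f}")
```

An expectation of one failure per 100,000 volumes does not mean a failure is certain at that fleet size, which is why replication and backups still matter for small fleets as well.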
To enhance data durability, EBS volumes provide torn write prevention (TWP) when attached to Nitro-based EC2 instances. This feature benefits database engines such as InnoDB by providing storage-level protection against incomplete (torn) writes. In the absence of write atomicity, an operating system crash or power outage during write transactions can lead to data corruption.
InnoDB’s doublewrite buffer is enabled by default to guarantee that storage writes are atomic, although this comes with a performance cost. With the doublewrite buffer enabled, pages are written first to the buffer, which is then flushed to disk. After the buffer is successfully written to disk, a page is then written to its final location in storage. In essence, the doublewrite buffer serves as a valid copy that can be used to recover pages from storage failures.
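To make the mechanism concrete, here is a toy Python model of a torn write and a doublewrite-style recovery. It is a sketch for illustration and not InnoDB's implementation. The `ToyDisk` class, the 16-byte page, and the separately stored checksum are all simplifying assumptions.

```python
import zlib

PAGE_SIZE = 16  # toy page size in bytes (InnoDB's default page size is 16 KiB)


class ToyDisk:
    """Hypothetical disk with a doublewrite area and final page locations."""

    def __init__(self):
        self.doublewrite = {}  # page_id -> intact copy of the page
        self.data = {}         # page_id -> (page bytes, checksum of intended page)


def write_page(disk, page_id, new_page, crash_after_bytes=None):
    # Step 1: persist a full copy of the page in the doublewrite area.
    disk.doublewrite[page_id] = new_page
    # Step 2: write the page to its final location. A crash here may tear it,
    # leaving the first bytes new and the remaining bytes old.
    old_page, _ = disk.data.get(page_id, (bytes(PAGE_SIZE), 0))
    if crash_after_bytes is not None:
        new_page = new_page[:crash_after_bytes] + old_page[crash_after_bytes:]
    disk.data[page_id] = (new_page, zlib.crc32(disk.doublewrite[page_id]))


def recover(disk):
    """Replace any page whose checksum does not match with its doublewrite copy."""
    repaired = []
    for page_id, (page, stored_sum) in list(disk.data.items()):
        if zlib.crc32(page) != stored_sum:
            good = disk.doublewrite[page_id]
            disk.data[page_id] = (good, zlib.crc32(good))
            repaired.append(page_id)
    return repaired


disk = ToyDisk()
write_page(disk, 7, b"A" * PAGE_SIZE)                       # clean write
write_page(disk, 7, b"B" * PAGE_SIZE, crash_after_bytes=4)  # torn write
print(disk.data[7][0])  # b'BBBBAAAAAAAAAAAA'
print(recover(disk))    # [7]
print(disk.data[7][0])  # b'BBBBBBBBBBBBBBBB'
```

The model only covers a crash during the second step. Every page is written twice on the normal path, which is the performance cost mentioned above.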
With TWP, write transactions are inherently all-or-nothing, which eliminates the need for the doublewrite buffer. When leveraging io2 Block Express volumes, disabling the doublewrite buffer can lead to an increase in transactions per second of up to 30% and a write latency decrease of up to 50%, all without sacrificing data resiliency.
HA and DR
Cross-Availability Zone (AZ) and cross-Region replication for MySQL databases are best practice strategies for achieving comprehensive HA and DR postures. This is especially applicable to workloads handling financial transactions in real-time. By replicating data across different physical locations, the continuity of application uptime can better withstand infrastructure disruptions. However, there is no one-size-fits-all solution. An effective strategy is defined by how a business supports its customers.
The primary factors in designing an HA/DR strategy are Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the maximum amount of time an organization has to restore normal operations following an outage. RPO is the maximum amount of data, expressed as a span of time (typically minutes or hours) leading up to an outage, that an organization can afford to lose. As detailed in this AWS post focused on establishing RTO and RPO targets, the question of RPO can be thought of as “how old can the data be when this application is recovered?”.
RPO requirements can be very low for database workloads handling financial transactions. The following diagram outlines an example of a multi-Region architecture designed for HA with a low RPO value. In this example, AWS Region B represents a DR failover target in the event of service impact to AWS Region A. Cross-AZ and cross-Region replication between primary and replica instances is asynchronous to maximize transaction performance. To minimize cross-Region data transfer costs, a replica instance in AWS Region A replicates to a single instance in AWS Region B, which in turn replicates asynchronously to the other instances in AWS Region B.
Database replication strategies
By default, MySQL replication is asynchronous. In this configuration, a source database instance first writes events to its binary log, then replica database instances request updates. The source instance does not know if or when a replica has processed events, and a replica’s receipt of events is not guaranteed. This form of replication is the most performant from an application standpoint, as it does not require waiting for replication write completion to commit a transaction. From a business continuity standpoint, a non-zero RPO value must be accounted for when leveraging asynchronous replicas, as there is no guarantee they are fully and continuously synchronized with the primary database instance.
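One way to reason about the non-zero RPO of asynchronous replicas is to compare observed replication lag against the RPO target. The following Python sketch assumes lag values have already been collected, for example from the `Seconds_Behind_Source` field of `SHOW REPLICA STATUS`. The replica names, lag values, and the 30-second RPO are hypothetical, and lag is treated here only as an approximation of data-loss exposure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaStatus:
    name: str
    lag_seconds: Optional[float]  # None means lag is unknown (e.g. replication stopped)


def replicas_violating_rpo(replicas: List[ReplicaStatus], rpo_seconds: float) -> List[str]:
    """Names of replicas whose lag exceeds the RPO target or cannot be determined."""
    return [
        r.name
        for r in replicas
        if r.lag_seconds is None or r.lag_seconds > rpo_seconds
    ]


# Hypothetical fleet: two in-Region replicas, one cross-Region, one stopped.
fleet = [
    ReplicaStatus("replica-az-2", 0.4),
    ReplicaStatus("replica-az-3", 1.2),
    ReplicaStatus("replica-region-b", 45.0),
    ReplicaStatus("replica-stopped", None),
]

print(replicas_violating_rpo(fleet, rpo_seconds=30))
# ['replica-region-b', 'replica-stopped']
```

Treating unknown lag as a violation is a deliberately conservative choice, because a replica that has stopped applying events offers no bound on how stale its data is.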
Asynchronous replication is a performant HA solution when configured between primary and replica/standby database instances spread across three AZs within the same AWS Region. AZs are located no more than 60 miles (~100 km) from each other, which generally produces single-digit millisecond roundtrip latency between AZs in the same AWS Region. This level of network latency is adequate to support replication needs for write-heavy applications.
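A back-of-the-envelope calculation shows why this distance is compatible with low round-trip times. The sketch below assumes light travels through optical fiber at roughly 200 km per millisecond (about two-thirds of its speed in a vacuum). It ignores routing, switching, and non-direct fiber paths, so it gives a lower bound and not a measurement. The function name is hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    return 2 * distance_km / FIBER_KM_PER_MS


print(min_round_trip_ms(100))  # 1.0
```

Propagation over 100 km accounts for about 1 ms of round-trip time under these assumptions, which leaves room for network equipment and protocol overhead within a single-digit millisecond budget.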
When replicating a database across one or more AWS Regions, asynchronous replication is commonly a suitable and performant configuration. Depending on business requirements, cross-Region replicas can also be used to offload read traffic. Furthermore, cross-Region replication is an effective solution for DR in the case of primary AWS Region failure.
For HA and DR database replication, a sub-millisecond storage solution, such as io2 Block Express, can mitigate disk I/O performance risks for source and replica instances.
Conclusion
Designing for and managing a high-performance MySQL workload needs right-sized infrastructure, a business-driven architectural approach, and optimal configuration. Additionally, developing the right business continuity plan will help define storage requirements for HA and DR. These strategies, combined with io2 Block Express, enable Marqeta’s multi-Region approach to supporting its most critical and demanding database workloads.