AWS Database Blog
Scaling transaction peaks: Juspay’s approach using Amazon ElastiCache
This is a guest post by Jobin Sukumaran, Product Manager at Juspay, and Boaz John, Senior Architect at Juspay, in partnership with AWS.
Juspay powers global enterprises by streamlining payment process orchestration, enhancing security, reducing fraud, and providing seamless customer experiences. Juspay’s orchestration engine handles payments for a diverse clientele ranging from high-growth innovators to some of the world’s most demanding enterprises. Juspay processes over 30 million transactions a day, facilitating payment operations across diverse markets.
In this post, we walk you through how Juspay transformed their payment processing architecture to handle transaction peaks. Using Amazon ElastiCache and Amazon Relational Database Service (Amazon RDS) for MySQL, Juspay built a system that processes 7.6 million transactions per hour during peak events, achieves sub-millisecond latency, and reduces infrastructure costs by 80% compared to their previous solution.
Business challenge: Managing extreme traffic spikes
In the fiercely competitive digital payments industry, a highly available and efficient transaction processing system is critical: millions of customers make transactions every day, and load spikes to 20 times the average during events like festivals and sales. One such event is the Indian Premier League (IPL), a 2-month long cricket tournament staged across India that attracts approximately 600 million viewers and drives peaks of 7.6 million transactions per hour on the Juspay orchestration platform.
Juspay’s payment traffic generally follows predictable patterns on typical days but witnesses hockey-stick spikes during IPL, climbing to 20 times the baseline in a matter of seconds. Juspay requires a system that can scale to any traffic pattern. Traffic spikes, if not managed correctly, can delay processing, cause transaction failures, degrade customer experience, and ultimately cost the business revenue.
Solution overview
To address the tremendous variance in daily and seasonal traffic patterns, we built a robust, horizontally scalable in-memory key-value caching layer. This post explores how we implemented the Command Query Responsibility Segregation (CQRS) pattern using Amazon ElastiCache as a write-back cache to handle traffic spikes while maintaining consistency and reducing costs.
The architecture combines two patterns: CQRS separates our read and write models, allowing each to be optimized independently, and the write-back cache pattern delivers low-latency performance with durability. Amazon ElastiCache serves as our command-side store, handling all writes with eventual consistency to Amazon RDS for MySQL, which serves as our query-side store, optimized for complex reads and operational use cases.
The integration of ElastiCache with our existing Amazon RDS infrastructure enabled Juspay to handle traffic spikes efficiently, which yielded significant performance gains and enhanced our payment service resiliency. This architectural improvement has enabled us to expand to additional AWS Regions, including US East (Ohio), Asia Pacific (Hyderabad), Asia Pacific (Singapore), and Europe (Ireland).
Before implementing this solution, Juspay faced several challenges with our original architecture. To understand the improvements we made, let’s first examine our previous approach to handling transaction processing and its limitations.
Previous architecture
Juspay’s payment processor architecture consisted of application services deployed on Amazon Elastic Kubernetes Service (Amazon EKS) with Amazon RDS for MySQL as the primary data store. The application layers were designed to scale horizontally with traffic based on regular predicted patterns, with requests flowing from the client applications to the payment processing services. These services would process transactions, perform business logic operations, and store transaction states directly in Amazon RDS.

During high-traffic periods such as Black Friday or the IPL, scaling application pods in Amazon EKS (horizontally or vertically) was handled automatically, either for the complete season or for the match timeframe. While Amazon RDS served as the data persistence solution for all transaction states and processing, Juspay discovered that managing vertical scaling of the relational database instances was complex and time-consuming. It was also not cost-effective, because Juspay had to scale up the Amazon RDS instances for the complete season.
The following diagram illustrates this architecture.
Juspay would scale up from an r5.4xlarge instance to an r5.16xlarge to handle the traffic spikes, which was suboptimal from a price-performance perspective. Also, Juspay follows a zero-downtime principle for any maintenance activity, and upgrading a database without downtime requires coordination across almost all the product teams within Juspay.
In April 2019, we challenged ourselves to identify a more cost-effective way to run our system at scale for the relatively short window when payment traffic spikes each day during IPL and the Indian festival season.
We faced the following challenges:
- Scalability – Amazon RDS had to be vertically scaled to a particular instance class based on expected load. Even with a larger instance, limits remain, such as the maximum number of connections, and high transaction volume can produce slow-running queries that degrade performance.
- End-user experience – Juspay implemented request rate limiting to align incoming traffic with the database instance’s processing capacity, which was constrained by its specific configuration (for example, r5.16xlarge). This limitation directly affected the quality of service experienced by end users: requests might be throttled or rejected, leading to slower response times or even temporary unavailability of the service.
- Operational cost – Modifying the instance type caused temporary connection loss, resulting in downtime. To honor Juspay’s zero-downtime policy, we ran an over-provisioned instance for the entire year. Overall Amazon RDS CPU utilization during non-peak time was less than 15%, a key indicator that a more modern approach was required.
- Single point of failure – With relational databases, horizontal scaling of writer instances is complex. Additionally, because a single database instance in an Availability Zone can fail at any time, it presents a potential single point of failure. To address these challenges and enhance system resiliency, Juspay needed to develop a robust architecture with clearly defined fault isolation boundaries.
Exploring possible solutions
Juspay explored various solutions to enhance performance, elasticity, and resiliency in their system. They considered sharding the existing database instance into multiple smaller instances, but this approach was deemed impractical due to the extensive application changes required and the complexity of managing multiple new instances.
Alternatives like scale-up and scale-down operations were complicated by accumulated state, and zero-downtime migrations between instance sizes required hours of careful orchestration by multiple teams.
The use of NoSQL key-value stores like Apache Cassandra was also evaluated, but ultimately dismissed due to their inability to meet Juspay’s performance requirements. These systems exhibited latency in the range of single- to double-digit milliseconds for processing read and write requests, which proved inadequate given the high volume of write operations generated by the application.
This level of performance fell short of the demands required by Juspay’s payment processing workloads, necessitating the exploration of alternative solutions that could deliver the required speed and efficiency to handle the anticipated workload.
We then explored ElastiCache, which met our requirements. As a fully managed service, it integrates seamlessly with our existing AWS architecture, functions as a buffer for traffic spikes, and, in our performance testing, delivered read/write throughput of up to 350,000 operations per second. We implemented a drainer process from ElastiCache to Amazon RDS for MySQL that provides controlled write-back at a rate that prevents database overload.
The following table summarizes the performance metrics we measured when testing these possible solutions during our architecture design phase.
| Metric | Amazon RDS for MySQL | Open source Apache Cassandra | Amazon ElastiCache |
| --- | --- | --- | --- |
| Instance type | r6g.2xlarge | r6g.2xlarge | r6g.2xlarge |
| Read/write operations per second | 1,000–10,000 | 40,000–50,000 | Up to 350,000 |
| CPU utilization | 70–100% (under heavy load) | Up to 80% (for consistent throughput) | 70–80% (high throughput) |
| Latency (p50–p99) | 2–10 milliseconds | 5–10 milliseconds | 0.44–2.45 milliseconds |
Technical implementation
Let’s look into the various patterns we used in our architecture to build a payment gateway with Amazon ElastiCache and Amazon RDS. We implemented four key architectural patterns, starting with a memory-first approach that fundamentally changed how we handle transaction data.
Memory-first architecture
Juspay implemented Amazon ElastiCache in cluster mode to store active transactions with write-back to persistent storage. This fully managed service acted as a buffer for traffic spikes, scaling to handle incoming traffic and populating Amazon RDS for MySQL at a controlled rate. This solution integrated with Juspay’s existing AWS architecture and delivered the required throughput with sub-millisecond latency.
A key insight emerged during our evaluation: our payment platform only needs to maintain transaction state for short durations (averaging 2 minutes) while routing payments to gateways. This short-lived state characteristic made Amazon ElastiCache an ideal temporary key-value data store. This approach set the foundation for our write path optimization.
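To make the memory-first pattern concrete, the following minimal sketch shows how short-lived transaction state could be written to an ElastiCache cluster with a TTL, using the redis-py client. The endpoint, key names, and the 2-minute TTL are illustrative assumptions based on the averages described above, not Juspay’s actual schema.

```python
import json

from redis.cluster import RedisCluster

# Cluster mode enabled: the client discovers shards from the
# configuration endpoint and routes each key to its slot.
r = RedisCluster(host="my-cache.example.amazonaws.com", port=6379)

ACTIVE_TXN_TTL_SECONDS = 120  # in-flight state lives ~2 minutes on average

def save_active_transaction(txn_id: str, state: dict) -> None:
    # The TTL bounds memory use: the cache only ever holds short-lived,
    # in-flight payment state, never the full transaction history.
    r.set(f"txn:{txn_id}", json.dumps(state), ex=ACTIVE_TXN_TTL_SECONDS)

def load_active_transaction(txn_id: str) -> dict | None:
    raw = r.get(f"txn:{txn_id}")
    return json.loads(raw) if raw else None
```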
Write path optimization
Our implementation uses Amazon ElastiCache as a write-ahead log for sub-millisecond write acknowledgment. This approach allows us to quickly confirm transaction receipt while ensuring data is safely captured for later persistence. Amazon ElastiCache serves as a temporary storage layer that maintains transaction consistency and real-time accuracy before data is permanently written to the database.
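As a rough illustration, a write acknowledgment on this path can be as small as a single stream append. The stream name and fields below are hypothetical, and a stream is just one way to realize the write-ahead log described here.

```python
import json

from redis.cluster import RedisCluster

r = RedisCluster(host="my-cache.example.amazonaws.com", port=6379)

def ack_write(txn_id: str, event: dict) -> str:
    # XADD appends to the stream in O(1), so the caller receives its
    # acknowledgment without waiting on any database round trip.
    entry_id = r.xadd("txn-wal", {"txn_id": txn_id, "payload": json.dumps(event)})
    return entry_id.decode()  # the stream entry ID serves as a write receipt
```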
We designed a scalable queue system that buffers data from the ElastiCache instances and drains it to Amazon RDS for MySQL. The autoscaling of these “drainers” is based on the CPU utilization of the RDS instance, which allows us to maintain consistent database instance sizes while handling variable workloads. This eliminates the need for database scaling, providing operational stability and cost reduction. Because ElastiCache is an in-memory datastore, we implemented safeguards by writing critical streams to secondary instances, which prevents data loss during failover events. However, we have not encountered such a scenario so far, thanks to the high availability provided by the ElastiCache service.
The write-back processor batches updates for Amazon RDS for MySQL persistence, balancing performance and reliability. By controlling the write-back process, we prevent database overload while maintaining high throughput. This architecture combines the speed of Amazon ElastiCache with the durability of Amazon RDS for MySQL. This controlled write-back approach directly influences our consistency model implementation.
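The sketch below shows one possible drainer worker along these lines, assuming the hypothetical txn-wal stream from the previous snippet, a MySQL table named txn_events, and the PyMySQL client; the batch size and table layout are illustrative, not our production configuration.

```python
import pymysql
import redis
from redis.cluster import RedisCluster

r = RedisCluster(host="my-cache.example.amazonaws.com", port=6379)
db = pymysql.connect(host="my-rds.example.amazonaws.com",
                     user="app", password="...", database="payments")

GROUP, CONSUMER, BATCH = "drainers", "drainer-1", 500

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create("txn-wal", GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass

while True:
    # Read up to BATCH new entries, blocking for at most one second.
    resp = r.xreadgroup(GROUP, CONSUMER, {"txn-wal": ">"}, count=BATCH, block=1000)
    if not resp:
        continue
    entries = resp[0][1]  # [(entry_id, {field: value}), ...]
    rows = [(f[b"txn_id"].decode(), f[b"payload"].decode()) for _, f in entries]
    with db.cursor() as cur:
        cur.executemany(
            "INSERT INTO txn_events (txn_id, payload) VALUES (%s, %s)", rows)
    db.commit()
    # Acknowledge only after the MySQL commit, so a crashed worker's
    # unacknowledged batch is redelivered to another drainer.
    r.xack("txn-wal", GROUP, *[eid for eid, _ in entries])
```

Batching through executemany() keeps the insert rate against the database controlled and predictable, which is the property the write-back design depends on.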
Consistency models
Building on our write path optimization, we maintain data consistency between Amazon ElastiCache and Amazon RDS through our queue system. The auto-scaling of drainers based on Amazon RDS CPU utilization allows us to keep database instance sizes constant while controlling the write-back rate to the relational database.
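As a sketch of how such a scaling signal could be derived, the following code polls the RDS CPUUtilization metric through CloudWatch with boto3. The thresholds and step size are illustrative assumptions; in practice the same metric could equally drive a Kubernetes Horizontal Pod Autoscaler for the drainer fleet.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def rds_cpu_percent(instance_id: str) -> float:
    # Fetch the most recent 1-minute average CPU for the RDS instance.
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0

def desired_drainers(current: int, cpu: float) -> int:
    # Scale the fleet out while the database has headroom and back off
    # before it saturates; the 50%/75% thresholds are illustrative.
    if cpu < 50:
        return current + 1
    if cpu > 75:
        return max(1, current - 1)
    return current
```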
Our architecture implements a hybrid consistency model that balances performance with data integrity. For active transactions, we maintain strong consistency by storing them in Amazon ElastiCache, thus ensuring real-time accuracy for current payment operations. Data from Amazon ElastiCache is drained to Amazon RDS for MySQL at a configured frequency. This approach leads to an eventual consistency model at the database level, optimizing for efficient long-term storage and retrieval of completed transactions.
The append-only log implementation within Amazon ElastiCache ensures reliable write propagation between these systems, maintaining data coherence throughout the transaction lifecycle. To address potential read inconsistencies, we implemented cache-miss handlers that bridge any gaps in read operations, providing a continuous user experience across the entire data lifecycle. This consistency model works in conjunction with our scale-out architecture, which we explore next.
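A minimal sketch of such a cache-miss handler follows, assuming a PyMySQL connection to the read replica alongside the cache client; the table, key names, and TTL mirror the earlier hypothetical snippets rather than our actual schema.

```python
import json

import pymysql
from redis.cluster import RedisCluster

r = RedisCluster(host="my-cache.example.amazonaws.com", port=6379)
replica = pymysql.connect(host="my-rds-replica.example.amazonaws.com",
                          user="app", password="...", database="payments")

ACTIVE_TXN_TTL_SECONDS = 120

def read_transaction(txn_id: str) -> dict | None:
    # Fast path: the transaction is still active and lives in the cache.
    cached = r.get(f"txn:{txn_id}")
    if cached is not None:
        return json.loads(cached)
    # Miss: the entry expired or was already drained; fall back to the
    # read replica on the query side.
    with replica.cursor() as cur:
        cur.execute("SELECT payload FROM txn_events WHERE txn_id = %s", (txn_id,))
        row = cur.fetchone()
    if row is None:
        return None
    state = json.loads(row[0])
    # Repopulate the cache so subsequent reads stay sub-millisecond.
    r.set(f"txn:{txn_id}", json.dumps(state), ex=ACTIVE_TXN_TTL_SECONDS)
    return state
```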
Scale-out architecture
We implemented Amazon ElastiCache with cluster mode enabled, which provides online resharding and rebalancing capabilities for horizontal scaling. This automatic scaling distributes data across nodes without manual intervention, increasing system uptime and maintaining consistent application response times. The elasticity of Amazon ElastiCache allows us to scale out during traffic spikes that reach 7.6 million transactions per hour and scale in during normal operations, while maintaining sub-millisecond write acknowledgment. For complex operations, we implemented cross-shard transaction handling to execute multi-key operations across different shards. We also configured background write-back scaling to persist data to Amazon RDS for MySQL without affecting the cluster’s performance during peak loads. This architecture delivers 99.99% availability with consistent performance even during 20x traffic spikes.
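While our cross-shard handling lives at the application layer, a common building block for multi-key operations in cluster mode is the hash tag, which pins related keys to the same slot so they can be updated atomically on one shard. The following sketch illustrates that technique with hypothetical key names; it is not Juspay’s exact routing logic.

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="my-cache.example.amazonaws.com", port=6379)

# Update the order state and bump the attempt counter in one atomic Lua
# script; both keys share the {order_id} hash tag, so the cluster
# guarantees they live on the same shard.
MARK_CHARGED = """
redis.call('HSET', KEYS[1], 'status', ARGV[1])
return redis.call('INCR', KEYS[2])
"""

def mark_charged(order_id: str) -> int:
    return rc.eval(
        MARK_CHARGED, 2,
        f"order:{{{order_id}}}:state",     # hash tag: {order_id}
        f"order:{{{order_id}}}:attempts",  # same tag, same slot
        "CHARGED",
    )
```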
The following diagram illustrates the solution architecture.
The diagram illustrates how our application on Amazon EKS sends both reads and writes to Amazon ElastiCache, while our draining application writes to Amazon RDS for MySQL. The architecture includes an Amazon RDS Read Replica that Amazon ElastiCache can query during cache miss scenarios, creating a complete data flow cycle that maintains data consistency while optimizing for performance.
Implementation journey and results
The first implementation of the key-value architecture was tested in 2019. Since then, we have made several improvements to the caching model and subsequent draining to Amazon RDS. Over the years, ElastiCache has successfully handled sudden traffic spikes during various high-volume events, including IPL, Cricket World Cup ticket sales, Indian festival sales, and Indian Railways’ daily tatkal ticket bookings.
This solution resulted in the following benefits:
- Reliability – Moving Amazon RDS for MySQL out of the critical path eliminated our single point of failure. The design of Amazon ElastiCache isolates hardware failures to single nodes, preventing system-wide outages and maintaining service availability during node failures.
- Performance efficiency – Amazon ElastiCache provides up to 35 times higher throughput compared to Amazon RDS (refer to the table earlier in this post), reducing our application response times to sub-millisecond latency. In subsequent testing, Amazon ElastiCache achieved over 1 million requests per second on a single node and 500 million requests per second across a cluster, using AWS advances such as enhanced I/O and AWS Graviton3 instances.
- Cost optimization – Switching to Amazon Aurora MySQL with the I/O-Optimized configuration reduced our database footprint and Amazon RDS costs. Running our workload on a 10-node ElastiCache cluster costs approximately 80% less than scaling up RDS instances vertically to handle the same traffic volume.
Future enhancements
With the Amazon ElastiCache key-value model, we created multiple cells, each comprising a compute layer and an ElastiCache store. This architecture allows us to deploy and scale cells independently, with each cell eventually draining to the RDS instances. This approach enabled us to extend the architecture to multi-Region deployments without requiring a globally writable database. We are currently working on a shuffle sharding architecture to achieve multi-Region, active-active, and highly resilient systems.
In the future, we will upgrade to Amazon ElastiCache Serverless and the Valkey engine to improve performance and reduce costs. ElastiCache Serverless eliminates cluster management overhead and can scale to double the throughput within minutes. The Valkey engine upgrade delivers roughly 20% cost savings and adds new performance features. These changes align with our strategy of using managed services built on open source technology.
Conclusion
In this post, we shared how Juspay used Amazon ElastiCache as a write-back cache to manage extreme transaction peaks during events like IPL that drive 20-fold spikes in payment processing. Our solution used key architectural patterns including CQRS, memory-first architecture, and scale-out capabilities to handle volumes of 7.6 million transactions per hour with sub-millisecond latency.
We achieved measurable improvements in our payment processing infrastructure: 35 times higher throughput compared to our previous setup, 80% cost reduction in handling peak loads, and increased system reliability by eliminating single points of failure. The architecture maintains consistent performance during traffic spikes while reducing operational complexity.
Contact us at info@juspay.in to learn more about our platforms.