Mastering Multi-Region Resilience and Scalability: Active-Active Design with Amazon ElastiCache Redis

This post is co-written by Jayadev Sirimamilla from Citibank, along with Sayan Deb Ghosh and Dibyarup Basu from TCS.

Introduction: The Active-Active Redis Challenge in Cloud Migration

In today’s digital-first world, milliseconds (ms) matter, especially for global financial institutions processing millions of transactions. Customers expect their financial transactions to be recorded, reflected, and accessible across geographic regions in real time. Delivering sensitive data such as financial transactions reliably, with the latest state, in real time, across a geographically distributed landscape, and at a rate of millions of transactions per hour requires a distributed cache such as Redis.

While Amazon Web Services (AWS) drives digital transformation, migrating Active-Active Redis deployments presents challenges. Enterprises run Active-Active Redis across global data centers through Redis Enterprise, which provides cross-region reliability and seamless failover. Amazon ElastiCache for Redis offers superior elasticity, operational simplicity, cost efficiency, and cross-region replication, but it does not natively support an Active-Active configuration across regions, a feature many organizations rely on in their on-premises Redis Enterprise deployments.

This perceived gap often raises concerns among technical decision-makers looking to escape the constraints of traditional infrastructure:

Capital-intensive capacity planning requiring upfront investments
Burdensome annual licensing compliance and audits
Inflexible multi-year contractual commitments
Over-provisioning to accommodate peak loads

This post demonstrates how Tata Consultancy Services (TCS) developed an approach to achieve Active-Active functionality for Amazon ElastiCache across multiple AWS regions, bridging the gap between on-premises capabilities and cloud-native services while eliminating Redis Enterprise licensing costs.

TCS is an AWS Premier Tier Services Partner and Managed Cloud Services Provider (MSP) with Migration Competency.

Citigroup’s Existing Architecture and Migration Challenge:

Citigroup is one of the world’s leading financial institutions. Its digital banking systems relied on a dual-region infrastructure in Singapore and Tokyo, using an Active-Active Redis Enterprise cluster to cache user sessions and business rules. The system processed 3 million cache hits hourly (60% read, 40% write), with 70% of reads accessing local data and 30% requiring cross-region access. During the AWS migration, a critical challenge emerged: the existing Redis Enterprise deployment supported Active-Active configuration across regions, whereas Amazon ElastiCache (mandated by Citigroup’s Enterprise Architecture board) supported only Active-Passive setups. This limitation threatened to increase latency for write operations from the secondary region, potentially degrading the performance and customer experience of the global digital banking services. The situation required a solution that maintained global write capability within the AWS architecture.
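To make the workload figures above concrete, the short calculation below breaks the 3 million hourly cache hits into reads, writes, and local versus cross-region reads. It is only an illustrative sketch: the variable names are made up for this example, and the per-second figure assumes load is spread evenly across the hour, which the source does not state.

```python
# Workload figures quoted in the text; integer percentages avoid float rounding.
HITS_PER_HOUR = 3_000_000

reads_per_hour = HITS_PER_HOUR * 60 // 100              # 60% reads
writes_per_hour = HITS_PER_HOUR * 40 // 100             # 40% writes
local_reads_per_hour = reads_per_hour * 70 // 100       # 70% of reads are local
cross_region_reads_per_hour = reads_per_hour * 30 // 100  # 30% of reads are cross-region

# Assumption: uniform load over the hour (real traffic will have peaks).
avg_ops_per_second = HITS_PER_HOUR / 3600

print(f"reads/hour:              {reads_per_hour:,}")
print(f"writes/hour:             {writes_per_hour:,}")
print(f"local reads/hour:        {local_reads_per_hour:,}")
print(f"cross-region reads/hour: {cross_region_reads_per_hour:,}")
print(f"average ops/second:      {avg_ops_per_second:.1f}")
```

Under these figures, roughly 540,000 reads per hour involve data that originated in the other region, which is the portion of traffic the conflict-resolution path has to serve.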

Pilot Scope and Requirements:

The customer’s technology pilot focuses on validating the core Active-Active Redis architecture across regions while incorporating essential production features as a minimum viable product: at-rest and in-transit encryption, multi-AZ deployment, cluster mode sharding, and VPC peering. These production-grade features add complexity to baseline performance measurements and introduce some performance overhead, but including them gives a realistic assessment of the production-ready solution. The pilot should prove that the active-active geo-distribution capability can be replicated on Amazon ElastiCache for Redis without any application code refactoring. It should do so by deploying multiple clusters of Amazon ElastiCache for Redis with its native cross-region replication feature, so that the latest copy of the data is always stored locally.

Pilot Success Criteria:

Regional Writes: Achieve single-digit ms write latency in each geographic region, without adding the complexity of multi-AZ, encryption, etc.
Regional Reads without conflict resolution: All read operations that access data from the local region’s read/write node must complete in less than 10 ms. This covers the 70% of reads that access local data.
Regional Reads with conflict resolution: When data is replicated across regions, multiple versions of the same record may exist simultaneously. This applies to the 30% of reads that require cross-region data. Read requests that need conflict resolution must complete in less than 20 ms to maintain application performance.
Seamless Failover: If ElastiCache fails in any region, the replicated node in the surviving region should be able to accept the application’s write traffic from the other region, along with reads.

TCS Solution: Active-Active with Amazon ElastiCache across regions

The architecture employs a dual-cluster configuration across regions to achieve Active-Active ElastiCache.

Figure 1: Cross-Region Active-Active Amazon ElastiCache Architecture

Active-Active Amazon ElastiCache Architecture Walkthrough:

Set up two Amazon ElastiCache for Redis clusters in Active/Passive mode:
- Cluster One (C1): In the Singapore region, C1:Node1 serves as the primary Read/Write node, handling local write operations for Singapore traffic while also providing read access to all data written directly to this region. To ensure data availability across regions, all data written to C1:Node1 in Singapore is asynchronously replicated to C1:Node2 in Tokyo. This replication, which completes within 5-10 ms, enables local read access in the Tokyo region, so applications in Tokyo can retrieve Singapore-originated data without cross-region queries, optimizing read performance and reducing latency.
- Cluster Two (C2): In the Tokyo region, C2:Node1 functions as the primary Read/Write node, processing local write operations for Tokyo traffic while providing read access to all data written directly to this region. To maintain data consistency, all data written to C2:Node1 in Tokyo is asynchronously replicated to C2:Node2 in Singapore, enabling local read access in the Singapore region. This replication strategy allows Singapore-based applications to access Tokyo-originated data through local reads, optimizing performance and minimizing cross-region latency.
Write Operations: Write operations are processed by the local Node1 in each region, with data being asynchronously replicated to Node2 of the same cluster in the other region. For example, when a write occurs in Singapore, C1:Node1 (Singapore) processes it and then replicates the data to C1:Node2 (Tokyo). Each write operation includes timestamp metadata, which is crucial for the system’s conflict resolution mechanism during read operations.
Read Operations and Conflict Resolution: The architecture employs a concurrent dual-node query strategy for read operations, enhancing data consistency across regions. Applications simultaneously query both local nodes in their region (for example, in Tokyo, C2:Node1 and C1:Node2). To manage conflicts, each data record includes a timestamp field (“ts”). When the responses arrive, the system compares timestamps and treats the record with the most recent timestamp as authoritative. By automatically selecting the most up-to-date version, the system ensures users receive consistent information regardless of the region they query from. This approach handles asynchronous replication between regions and provides a coherent global data view.
Seamless Failover: If Amazon ElastiCache fails in any region, Node2 in the other region can be manually promoted to ACTIVE status. The promoted Node2 then handles both read and write operations. This failover mechanism ensures business continuity with minimal disruption.
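The write and read paths described in the walkthrough can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pilot’s actual implementation: `FakeNode` is a hypothetical in-memory stand-in for a node endpoint (a real application would use a Redis client connected to each local endpoint), the helper names `write_record` and `read_latest` are invented for this example, and cross-region replication is simulated by writing to the replica directly.

```python
import json
from concurrent.futures import ThreadPoolExecutor


class FakeNode:
    """Hypothetical in-memory stand-in for one ElastiCache node endpoint."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)  # None when the key is absent or expired


def write_record(local_primary, key, record, ts):
    """Write to the local Node1, stamping the record with its "ts" field."""
    stamped = dict(record, ts=ts)
    local_primary.set(key, json.dumps(stamped))
    return stamped


def read_latest(local_nodes, key):
    """Query all local nodes concurrently and return the newest record."""
    with ThreadPoolExecutor(max_workers=len(local_nodes)) as pool:
        raw_values = list(pool.map(lambda node: node.get(key), local_nodes))
    records = [json.loads(raw) for raw in raw_values if raw is not None]
    if not records:
        return None
    # Fixed-width YYYYMMDD strings compare chronologically as plain strings.
    return max(records, key=lambda rec: rec["ts"])


# Tokyo's two local nodes: its own primary and the replica of Singapore's primary.
c2_node1_tokyo = FakeNode()
c1_node2_tokyo = FakeNode()

# Simulated arrival of a replicated Singapore write on the Tokyo replica.
c1_node2_tokyo.set("user:1", json.dumps({"id": 1, "name": "John", "ts": "20240730"}))

# A later local write in Tokyo goes to Tokyo's own primary.
write_record(c2_node1_tokyo, "user:1", {"id": 1, "name": "JohnDoe"}, ts="20240731")

latest = read_latest([c2_node1_tokyo, c1_node2_tokyo], "user:1")
print(latest)
```

The day-resolution timestamps mirror the sample data used later in this post; with values that coarse, two writes on the same day would tie, and this sketch simply returns the first of the tied records.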

Testing and Validation:

Write Propagation Scenarios: When a new record is written in Region-1:

AWS Region-1 (Singapore) node	Record	AWS Region-2 (Tokyo) node	Record
Cluster1:Node1 (Read/Write)	{"id":1, "name":"John", "ts":"20240730"}	Cluster1:Node2 (Read)	{"id":1, "name":"John", "ts":"20240730"}

Reading from either region returns: {"id":1, "name":"John", "ts":"20240730"}

Update Conflict Resolution: When an update occurs in Region-2:

AWS Region-1 (Singapore) node	Record	AWS Region-2 (Tokyo) node	Record
Cluster1:Node1 (Read/Write)	{"id":1, "name":"John", "ts":"20240730"}	Cluster1:Node2 (Read)	{"id":1, "name":"John", "ts":"20240730"}
Cluster2:Node2 (Read)	{"id":1, "name":"JohnDoe", "ts":"20240731"}	Cluster2:Node1 (Read/Write)	{"id":1, "name":"JohnDoe", "ts":"20240731"}

The system fetches all records with ID 1, compares timestamps, and returns the most recent: {"id":1, "name":"JohnDoe", "ts":"20240731"}

Delete Operations and Data Consistency: When records expire in one or more regions:

AWS Region-1 (Singapore) node	Record	AWS Region-2 (Tokyo) node	Record
Cluster1:Node1 (Read/Write)	<<Expired Entry>>	Cluster1:Node2 (Read)	<<Expired Entry>>
Cluster2:Node2 (Read)	{"id":1, "name":"JohnDoe", "ts":"20240731"}	Cluster2:Node1 (Read/Write)	{"id":1, "name":"JohnDoe", "ts":"20240731"}

The system returns the only surviving record: {"id":1, "name":"JohnDoe", "ts":"20240731"}
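The three scenarios above can be replayed as a small table-driven check. This is an illustrative sketch rather than the pilot’s test harness: `resolve` is a hypothetical helper name, each scenario is reduced to the pair of values that the two local nodes in one region would return, and `None` stands for an expired or absent entry.

```python
def resolve(candidates):
    """Return the candidate with the newest "ts"; None means expired or absent."""
    live = [rec for rec in candidates if rec is not None]
    if not live:
        return None
    # Fixed-width YYYYMMDD strings sort chronologically as plain strings.
    return max(live, key=lambda rec: rec["ts"])


john = {"id": 1, "name": "John", "ts": "20240730"}
john_doe = {"id": 1, "name": "JohnDoe", "ts": "20240731"}

# Each scenario lists what the two local nodes return: (Cluster1 node, Cluster2 node).
scenarios = {
    "write propagation": (john, None),
    "update conflict": (john, john_doe),
    "expiry on cluster 1": (None, john_doe),
}

results = {name: resolve(nodes) for name, nodes in scenarios.items()}
for name, winner in results.items():
    print(f"{name}: {winner}")
```

Because both regions hold the same pair of values once replication has caught up, the same function yields the same answer in Singapore and Tokyo.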

Pilot Results:

Citigroup’s Expectation: For the migration to be approved, Amazon ElastiCache must match or exceed the performance metrics of the current on-premises Active-Active Redis Enterprise cluster.

Performance Requirements Comparison:

Performance Metric	Pilot Goal	Pilot Status
Regional Write Latency	< 10 ms	Met
Regional Read Latency without conflict resolution	< 10 ms	Met
Regional Read Latency with conflict resolution	< 20 ms	Met
Seamless Failover	Zero Disruption	Met

Business Benefits:

1. Time to Market: The solution built using Amazon ElastiCache Redis enables fast re-platform migration without major code refactoring, improving operational efficiency and market responsiveness.

2. Resilience: The architecture ensures business continuity during regional disruptions through seamless failover capabilities, maintaining enterprise-grade reliability for global banking operations.

3. Architecture Alignment: The architecture complies with Citigroup’s Enterprise Architecture standards, enabling smooth migration of critical workloads to AWS without exceptions.

4. Performance: The solution matches the previous Redis Enterprise system’s performance, maintaining single-digit ms latency and cross-region data consistency.

5. Cost Efficiency: The solution eliminates Redis Enterprise licensing fees and related compliance costs. It optimizes resources through auto-scaling and follows a pay-as-you-go model instead of over-provisioning.

Conclusion:

This approach to implementing Active-Active functionality with Amazon ElastiCache demonstrates how organizations can maintain enterprise-grade capabilities while adopting cloud-native services instead of procuring expensive licensed enterprise-grade products. For financial institutions and other enterprises with similar requirements, this architecture provides a blueprint for migrating mission-critical Redis workloads to AWS.

TCS has a proven record of migrating mission-critical applications to AWS with associates who are trained and certified in AWS services implementation. For more information about migrating Enterprise Redis workloads and implementing Active-Active functionality with Amazon ElastiCache on AWS, please contact the TCS team.
