AWS Database Blog

Build a multi-Region session store with Amazon ElastiCache for Valkey Global Datastore

Update (June 2026): This post has been updated to include additional guidance on resilience considerations for cross-Region write operations, read-after-write consistency, and failover mechanism dependencies. The architecture described introduces a dependency on inter-Region connectivity for write operations, which is an acceptable tradeoff for soft-state data such as user sessions. See the “Challenges and tradeoffs” section for recommended mitigations and alternative approaches.

As companies expand globally, they need to architect highly available, fault-tolerant systems across multiple AWS Regions. At that scale, designing a caching layer that spans the multi-Region infrastructure becomes a challenge of its own.

In this post, we dive deep into how to use Amazon ElastiCache for Valkey, a fully managed in-memory data store with Redis OSS and Valkey compatibility, and the Amazon ElastiCache for Valkey Global Datastore feature set. With ElastiCache Global Datastore, you can write to your ElastiCache cluster in one Region and have the data available to be read from up to two cross-Region replica clusters, thereby enabling low-latency reads and disaster recovery across Regions.

This solution provides application servers with a unified database caching layer and secure cross-Region replication. We discuss how it can solve the challenges of operating in multiple Regions while also enabling a disaster recovery strategy that meets strict business requirements.

Evolution of a caching architecture

Our example starts with an application stack in a single Region (us-west-1) with a typical setup. The following diagram shows the original stack, consisting of Amazon Virtual Private Cloud (Amazon VPC), Application Load Balancer (ALB), target groups, Amazon Elastic Compute Cloud (Amazon EC2) instances, and ElastiCache for Valkey for user session management.

Base Architecture Diagram

This setup works well, but let’s introduce new business and regulatory requirements. We have to make the infrastructure multi-Region for read capabilities in multiple Regions and disaster recovery capacity to minimize business impact during a single Region failure. The same infrastructure has to be duplicated to a second Region (us-west-2) where different users would connect. This is a fairly painless task and a great opportunity to implement AWS Global Accelerator to optimize connectivity to the ALBs in the two Regions. The following diagram shows the updated architecture with two Regions.

Multi-Region Architecture Diagram

Global Accelerator automatically routes traffic to a healthy endpoint nearest to the user and is designed to load balance traffic. This means that we can’t deterministically route a given user to a specific Region behind the accelerator. For this reason, we want each user’s session to be replicated in both Regions so that either Region can handle that user’s requests.

Challenges and tradeoffs of a cross-Region caching layer

We have to find a solution for this challenge or undo all of the new multi-Region configuration. We need a way to share the dataset across both Regions. In short, both us-west-1 and us-west-2 Regions have to accept logins and provide all users with their session data regardless of the location. This would be best accomplished with a unified dataset between us-west-1 and us-west-2.

Because we are already using ElastiCache, adopting the Global Datastore feature is the logical next step. However, before proceeding, it’s important to understand the architectural tradeoffs this pattern introduces.

Understanding the cross-Region write dependency

ElastiCache Global Datastore uses active-passive replication. Writes must go to the primary cluster in a single Region. When application servers in the secondary Region need to create or update sessions, they must write cross-Region to the primary cluster through VPC peering or Transit Gateway. This creates a hard dependency on inter-Region network connectivity for write operations.

This is a deliberate departure from the AWS Well-Architected Reliability Pillar best practice REL10-BP01, which recommends against creating dependencies between Regions. We accept this tradeoff here because:

  • Session data is soft-state. If a cross-Region write fails, the impact is small: the user must re-authenticate. There is no durable business data at risk.
  • Writes are a small fraction of operations. Sessions are created once and read many times. The majority of operations are local reads served by the secondary cluster, which remain unaffected by cross-Region connectivity issues.
  • The failure mode is recoverable. A cross-Region write failure doesn’t cascade — it affects only the individual session creation attempt, not existing sessions or read traffic.

Read-after-write consistency

Because ElastiCache Global Datastore replicates asynchronously from the primary Region to the secondary Region, there is a consistency gap to be aware of. Consider this scenario: a user in us-west-2 creates a session (the write goes cross-Region to the us-west-1 primary), then immediately attempts to read that session from the local us-west-2 replica. Replication typically completes in under one second, but this is not guaranteed. The read may return empty if the replica has not yet converged.

For session stores, this race condition is unlikely to cause issues in practice because the session write and subsequent read are typically separated by at least one HTTP round-trip (the response carrying the session cookie back to the client, followed by a new request). However, if your application performs write-then-immediate-local-read within the same request flow, consider one of the following mitigations:

  • Read-from-primary after write. Direct the immediate follow-up read to the primary cluster (cross-Region), then fall back to local reads for subsequent requests.
  • Short client-side delay or retry. Introduce a brief retry (for example, 100–200 ms) on a local read miss immediately after a session creation.
  • Accept re-authentication. For session data, a missed read simply means the user re-authenticates — the same graceful degradation that applies to write failures.
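The first two mitigations can be combined in application code. The following is a minimal sketch under stated assumptions, not the implementation used in this post: `local_get` and `primary_get` are hypothetical callables that wrap whatever client your application uses to read from the local replica and from the primary cluster, and the attempt count and delay are illustrative values.

```python
import time
from typing import Callable, Optional


def read_session_after_write(
    session_id: str,
    local_get: Callable[[str], Optional[bytes]],
    primary_get: Callable[[str], Optional[bytes]],
    local_attempts: int = 2,
    delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Read a just-written session while tolerating replication lag.

    Tries the local replica a few times with a short pause, then falls back
    to a single cross-Region read from the primary. Returns None if the
    session is still missing, in which case the caller can ask the user to
    re-authenticate.
    """
    for attempt in range(local_attempts):
        value = local_get(session_id)
        if value is not None:
            return value
        # Pause only between local attempts, not before the primary read.
        if attempt < local_attempts - 1:
            sleep(delay_seconds)
    return primary_get(session_id)
```

Use a helper like this only for the read that immediately follows session creation. Later requests should read from the local replica directly so that they don't depend on inter-Region connectivity.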

This consistency gap is inherent to the active-passive replication model. If your workload requires read-after-write consistency across Regions without these application-level workarounds, see the “Alternatives with local writes” section.

Failure modes to plan for

When relying on cross-Region writes, expect the following during transient network impairments between Regions:

  • Increased write latency (from single-digit milliseconds to hundreds of milliseconds or timeouts)
  • Intermittent write failures during AWS backbone maintenance or congestion events
  • Brief periods where new sessions cannot be created in the secondary Region while existing sessions continue to be served from the local replica
  • Asymmetric degradation where the secondary-to-primary path is impaired but primary-to-secondary replication remains intact (existing sessions continue replicating, but new session creation from the secondary Region fails)

Note that in this architecture, inter-Region network issues surface directly as application-layer performance degradation (write timeouts, failed session creation). This is a key distinction from architectures that support local writes, where inter-Region network issues affect replication (RPO/RTO) rather than real-time application performance.

Recommended mitigations

To operate this pattern reliably, implement the following in your application layer:

  • Configure write timeouts appropriate for cross-Region latency. Single-Region ElastiCache writes typically complete in under 1 ms. Cross-Region writes traverse the AWS backbone and should be configured with timeouts of 200–500 ms to accommodate normal inter-Region latency without triggering false failures.
  • Implement retry with exponential backoff and jitter. Transient cross-Region failures are recoverable. Retry session creation writes with increasing, randomized delays (for example, up to 100 ms, 200 ms, then 400 ms) before failing the request.
  • Implement graceful degradation. If the cross-Region write fails after retries, handle the failure gracefully rather than returning an error to the user. Options include creating a local-only session (the user may need to re-authenticate if subsequently routed to the other Region) or falling back to a stateless authentication token.
  • Monitor cross-Region write latency and errors. Use Amazon CloudWatch metrics on your ElastiCache cluster and application-level metrics to alarm on elevated p99 write latency or increased write error rates, which may indicate inter-Region connectivity degradation.
  • Automate detection and recovery. Configure CloudWatch alarms on cross-Region write error rates and latency percentiles. When thresholds are breached, trigger automated failover of the Global Datastore to promote the secondary Region to primary, eliminating the cross-Region write path. Combine this with the DNS failover mechanism described later in this post to redirect traffic accordingly.
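The retry and graceful degradation items above can be sketched as follows. This is an illustrative outline rather than a definitive implementation: `write_primary` and `fallback` are hypothetical callables supplied by your application (for example, a cross-Region session write and a local-only session write), and the default `retry_on` tuple uses Python's built-in exceptions as placeholders for the connection and timeout exceptions raised by your client library.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type


def write_session_with_retry(
    write_primary: Callable[[], None],
    fallback: Optional[Callable[[], None]] = None,
    max_attempts: int = 3,
    base_delay_seconds: float = 0.1,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Attempt a cross-Region write with exponential backoff and jitter.

    Returns "primary" if the write reached the primary cluster, or
    "degraded" if every attempt failed and the fallback ran instead.
    Re-raises the last error when no fallback is supplied.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            write_primary()
            return "primary"
        except retry_on as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # The backoff ceiling doubles each time: 0.1 s, 0.2 s, 0.4 s, ...
                ceiling = base_delay_seconds * (2 ** attempt)
                # Full jitter: wait a random time between 0 and the ceiling.
                sleep(random.uniform(0, ceiling))
    if fallback is None:
        raise last_error
    fallback()
    return "degraded"
```

Returning a status string instead of raising lets the request handler decide how to proceed, for example by flagging a local-only session so that the user is prompted to re-authenticate if later routed to the other Region.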

Alternatives with local writes

The cross-Region write dependency and the read-after-write consistency gap described above are inherent to the active-passive replication model of ElastiCache Global Datastore. No configuration of ElastiCache today provides active-active multi-Region replication with local writes in each Region. If your workload requires write availability in every Region, where inter-Region network issues must not surface as real-time application degradation, consider the following alternatives:

  • Amazon DynamoDB Global Tables — active-active multi-Region replication with local writes in each Region and single-digit millisecond read/write latency. Well-suited for session stores where write availability across Regions is non-negotiable and you can accept the latency tradeoff of not having an in-memory data store.
  • Amazon Aurora DSQL — a serverless, distributed SQL database with active-active multi-Region availability and strong consistency. Suitable when you need relational semantics or transactional guarantees on session-adjacent data (for example, user profiles or authorization state that must be consistent across Regions).

Both alternatives eliminate the cross-Region write dependency entirely. Each Region writes locally, and replication happens transparently. The tradeoff is higher read/write latency compared to an in-memory store like ElastiCache.

A complete multi-Region architecture

ElastiCache for Valkey Global Datastore provides fully managed, fast, reliable, and secure cross-Region replication.

Implementing Global Datastore automatically enables the application servers to read from ElastiCache regardless of the Region. Because Global Datastore provides failover capability between Regions, ElastiCache also meets our requirement for disaster recovery. The following diagram shows the updated architecture with two Regions using Global Datastore.

Multi-Region With ElastiCache Global Datastore Architecture Diagram

The cross-Region operations

For the two Regions to be fully operational, we have to implement a solution to enable write operations from all Regions to the Global Datastore primary cluster. To do so, we can configure both Regions using VPC peering, so the application servers running on Amazon EC2 in the secondary Global Datastore Region have the necessary cross-Region connectivity to access the cluster in the primary Region. This allows the application in the secondary Region to write in the primary cluster.

An alternative to VPC peering is AWS Transit Gateway. A VPC peering connection is a networking connection between two VPCs that routes traffic between them using private IPv4 or IPv6 addresses. VPC peering doesn’t support transitive routing, so a direct peering connection is required between each pair of VPCs that must communicate with one another. Transit Gateway connects VPCs to a single Transit Gateway instance, which consolidates an organization’s entire AWS routing configuration in one place. It supports transitive routing: traffic is routed among all the connected networks by using route tables. Transit Gateway has additional costs, whereas VPC peering doesn’t (only the regular data-transfer costs apply). Our recommendation is to use VPC peering if you connect up to 10 VPCs; otherwise, consider Transit Gateway.

To prevent code modification after a global datastore failover, you can create an Amazon Route 53 custom DNS record to resolve the endpoint of the primary cluster, then add this custom DNS record in the code of the application in both us-west-1 and us-west-2. In case of failover, the endpoint modification can be done at the DNS level.
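In application code, this split between cross-Region writes and local reads might look like the following sketch. It is an assumption-laden illustration, not the code from the accompanying repository: `SessionStore` is a hypothetical class, `writer` is assumed to be a client connected to the custom DNS record, and `reader` a client connected to the local cluster's endpoint. Any client exposing `get(name)` and `set(name, value, ex=...)`, as Valkey and Redis OSS clients for Python do, would fit.

```python
import json
from typing import Any, Dict, Optional


class SessionStore:
    """Route session writes to the primary cluster and reads to the local one."""

    def __init__(self, writer: Any, reader: Any, ttl_seconds: int = 1800) -> None:
        # writer: client connected to the custom DNS record (primary cluster).
        # reader: client connected to the cluster in the local Region.
        self._writer = writer
        self._reader = reader
        self._ttl_seconds = ttl_seconds

    @staticmethod
    def _key(session_id: str) -> str:
        return "session:{}".format(session_id)

    def save(self, session_id: str, data: Dict[str, Any]) -> None:
        # Writes always go to the primary, which may be in the other Region.
        self._writer.set(self._key(session_id), json.dumps(data), ex=self._ttl_seconds)

    def load(self, session_id: str) -> Optional[Dict[str, Any]]:
        # Reads stay in the local Region for low latency.
        raw = self._reader.get(self._key(session_id))
        if raw is None:
            return None
        return json.loads(raw)
```

Because the writer is addressed through the custom DNS record, a failover only requires a DNS change; the application keeps using the same two clients.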

The DNS automation

In this section, we introduce an additional functionality for automation.

You can implement an AWS Lambda function to automatically update the custom DNS record in the Route 53 private zone upon global datastore failover. The Lambda function uses an Amazon Simple Notification Service (Amazon SNS) notification of a global datastore failover as a trigger to update the DNS record. From there, the application in the secondary cluster Region resumes cross-Region write operations through the peering connection. At the same time, the application in the primary cluster Region operates locally. The details of the Lambda function are available later in this post.

You can see the updated architecture with the automation in the following diagram. The core elements of the DNS automation are Amazon SNS and Lambda: the Lambda function changes the value of the Route 53 DNS record upon the ElastiCache cluster primary failover. Those core elements are also part of the AWS CloudFormation template provided later in this post.

Multi-Region with ElastiCache Global Datastore and DNS automation Architecture Diagram

To validate the failover automation, you can make the secondary cluster a primary cluster that can accept write operations and test this functionality. For more details, see Promoting the secondary cluster to primary.
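If you prefer to script the test promotion, the following minimal sketch wraps the ElastiCache `FailoverGlobalReplicationGroup` API. The helper name and its arguments are hypothetical; `client` is assumed to be a boto3 ElastiCache client that you create yourself (for example, with `boto3.client("elasticache", region_name=...)`), and the identifiers are placeholders for your own global datastore.

```python
from typing import Any, Dict


def promote_secondary(
    client: Any,
    global_group_id: str,
    target_region: str,
    target_replication_group_id: str,
) -> Dict[str, Any]:
    """Ask ElastiCache to make the given secondary cluster the new primary.

    client is expected to expose failover_global_replication_group, as the
    boto3 ElastiCache client does. The call starts the promotion and returns
    the service response; it does not wait for the promotion to finish.
    """
    return client.failover_global_replication_group(
        GlobalReplicationGroupId=global_group_id,
        PrimaryRegion=target_region,
        PrimaryReplicationGroupId=target_replication_group_id,
    )
```

After the promotion completes, the SNS notification described in this post should trigger the Lambda function, and you can confirm that the custom DNS record now resolves to the new primary endpoint.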

Failover mechanism considerations

The failover mechanism described in this post uses an AWS Lambda function to call the Route 53 ChangeResourceRecordSets API to update DNS records after a Global Datastore failover. It’s important to understand that Route 53 record modifications depend on the Route 53 control plane, which operates from a single Region (us-east-1). If us-east-1 experiences an impairment, the ability to modify Route 53 records programmatically may be affected, even if your workload operates in other Regions.

For more detail on this dependency, see Fundamental 3: Failover mechanisms in the AWS Multi-Region Fundamentals prescriptive guidance.

For production workloads that require resilient failover independent of any single Region, consider the following alternatives:

  • Amazon Application Recovery Controller (ARC) routing controls. Replace the Route 53 record update with an ARC routing control state change. ARC routing controls are distributed across multiple Regions and do not depend on the Route 53 control plane in us-east-1. The Lambda function would toggle the ARC routing control instead of calling ChangeResourceRecordSets.
  • Route 53 health check–based failover routing. Configure Route 53 failover routing policies with health checks that monitor CloudWatch alarms (for example, an alarm on ElastiCache primary reachability). This approach uses the Route 53 data plane for routing decisions — health check evaluations and DNS responses continue to function even during control plane impairments, because they rely on pre-configured routing policies rather than in-flight record modifications.

Either approach provides a statically stable failover mechanism that does not require modifying Route 53 records during an incident.
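As a rough sketch of the ARC option, the Lambda function could flip a routing control by trying each ARC cluster endpoint in turn. Everything named here is an assumption for illustration: `set_routing_control` and `make_client` are hypothetical, the endpoint list and routing control ARN come from your own ARC cluster, and with boto3 `make_client` could be `lambda region, url: boto3.client("route53-recovery-cluster", region_name=region, endpoint_url=url)`.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def set_routing_control(
    endpoints: Iterable[Tuple[str, str]],
    make_client: Callable[[str, str], Any],
    routing_control_arn: str,
    state: str,
) -> str:
    """Set an ARC routing control to "On" or "Off".

    endpoints is a sequence of (region, endpoint_url) pairs for the ARC
    cluster. Each endpoint is tried in order until one accepts the update;
    the URL of that endpoint is returned.
    """
    if state not in ("On", "Off"):
        raise ValueError("state must be 'On' or 'Off'")
    last_error: Optional[Exception] = None
    for region, url in endpoints:
        try:
            client = make_client(region, url)
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return url
        except Exception as exc:  # Move on to the next endpoint on any failure.
            last_error = exc
    raise RuntimeError("no ARC cluster endpoint accepted the update") from last_error
```

Trying several endpoints is what keeps the mechanism usable when one Region is impaired, so avoid hard-coding a single endpoint.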

AWS services used

In this section, we dive deeper into the services used in this solution. We also provide a CloudFormation template in the accompanying GitHub repository that deploys the necessary AWS resources in both Regions to have an end-to-end working solution.

Amazon ElastiCache for Valkey

ElastiCache is a fully managed, Valkey-, Memcached-, and Redis OSS-compatible caching service that delivers real-time, cost-optimized performance for modern applications. ElastiCache scales to millions of operations per second with microsecond response time, and offers enterprise-grade security and reliability.

Valkey is an open source, in-memory key-value data store. It is a drop-in replacement for Redis OSS. It is stewarded by the Linux Foundation and rapidly improving with contributions from a vibrant developer community. AWS is actively contributing to Valkey; to learn more about AWS contributions for Valkey, see Amazon ElastiCache and Amazon MemoryDB announce support for Valkey.

With the ElastiCache for Valkey Global Datastore feature, you can write to your ElastiCache for Valkey cluster in one Region (primary) and have the data available to be read from up to two cross-Region secondary clusters. This enables low-latency reads and disaster recovery across Regions. The primary and secondary clusters in your global datastore should have the same number of primary nodes, node type, engine version, and number of shards (when cluster mode is enabled). Each cluster in your global datastore can have a different number of read replicas to accommodate the read traffic local to that cluster.

For this solution, the VPC for the primary ElastiCache for Valkey cluster and the VPC for the secondary cluster must use a different network CIDR.
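Before creating the peering connection, you can check that the two VPC ranges don't collide. The following is a small sketch using Python's standard `ipaddress` module; the function name and the CIDR values in the usage note are examples only.

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share any address."""
    network_a = ipaddress.ip_network(cidr_a)
    network_b = ipaddress.ip_network(cidr_b)
    return network_a.overlaps(network_b)
```

For example, `cidrs_overlap("10.0.0.0/16", "10.1.0.0/16")` returns `False`, so those two ranges could be peered, whereas `cidrs_overlap("10.0.0.0/16", "10.0.128.0/17")` returns `True`.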

Amazon Route 53

Route 53 is a highly available and scalable cloud DNS web service. To reduce application code changes after a failover, this solution uses a DNS private zone accessible by the two VPCs connected with VPC peering. For ElastiCache cluster mode disabled, a DNS record (CNAME) is created in this DNS private zone to resolve the primary endpoint of the primary cluster in the global datastore.

If the global datastore is based on ElastiCache for Valkey clusters with cluster mode enabled, the CNAME must resolve to the configuration endpoint of the primary cluster.

Amazon SNS

Amazon SNS is a fully managed messaging service for both application-to-application and application-to-person communication. In this solution, ElastiCache is configured to publish events in an SNS topic. For more details, see Managing ElastiCache Amazon SNS notifications. When a Region is promoted to the role of primary in the global datastore, the following event is published to its associated SNS topic:

ElastiCache:ReplicationGroupPromotedAsPrimary : $ClusterName
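A subscriber can pick this event out of the SNS payload as sketched below. The sketch assumes, as the Lambda function later in this post does, that the SNS `Message` field is a JSON object keyed by the event name with the cluster name as its value; `parse_promotion` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Optional

PROMOTION_EVENT = "ElastiCache:ReplicationGroupPromotedAsPrimary"


def parse_promotion(event: Dict[str, Any]) -> Optional[str]:
    """Return the promoted cluster name, or None if no record is a promotion."""
    for record in event.get("Records", []):
        try:
            message = json.loads(record["Sns"]["Message"])
        except (KeyError, TypeError, ValueError):
            # Skip records that are not SNS-shaped or whose message is not JSON.
            continue
        if isinstance(message, dict) and PROMOTION_EVENT in message:
            return message[PROMOTION_EVENT]
    return None
```

Returning `None` for every other notification lets the subscriber ignore unrelated ElastiCache events published to the same topic.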

AWS Lambda

Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes.

If a failover is manually triggered in the global datastore, the related notification is published in the SNS topic.

This SNS topic is used as a trigger to run a Lambda function. The code in the function updates the CNAME in the DNS private zone to replace the previous endpoint with the endpoint of the new primary cluster. This automation allows the application in both the primary and secondary Region to reconnect to the primary cluster for the write operations.

Prerequisites

To implement this solution, you need an AWS account with the necessary permissions.

Create the Lambda function and IAM policies

This section walks you through creating the Lambda function and the required AWS Identity and Access Management (IAM) policies. If you are using the CloudFormation template provided in the post, the Lambda function will already be created in your account.

  1. Create a new IAM policy called ElastiCache_Route53 with the following content:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1511707556511",
          "Action": [
            "route53:GetHostedZone",
            "route53:ChangeResourceRecordSets"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:route53:::hostedzone/{hosted-zone-id}"
        }
      ]
    }
    
  2. Create an IAM role for the Lambda function with the following policies:
    1. AmazonElastiCacheReadOnlyAccess
    2. AWSLambdaBasicExecutionRole
    3. ElastiCache_Route53
  3. Create the Lambda function in each Region with the following configuration:
    1. Runtime Python 3.12
    2. Set the trigger for Amazon SNS (choose the topic of the local cluster)
    3. Increase function timeout to 10 seconds
    4. Lambda variables:
      1. cname – Custom DNS record to update in the DNS private zone
      2. endpoint – Primary (or configuration) endpoint of the local cluster
      3. zone_id – Route 53 private zone ID

Use the following code:

import json
import os

import boto3


# Configuration comes from the Lambda environment variables described above
CNAME = os.environ['cname']
ENDPOINT = os.environ['endpoint']
ZONE_ID = os.environ['zone_id']


def aws_session(role_arn=None, session_name='my_session'):
    """
    If role_arn is given, assumes that role and returns a boto3 session;
    otherwise returns a regular session with the current IAM user/role
    """

    if role_arn:
        client = boto3.client('sts')
        response = client.assume_role(
            RoleArn=role_arn, RoleSessionName=session_name)
        session = boto3.Session(
            aws_access_key_id=response['Credentials']['AccessKeyId'],
            aws_secret_access_key=response['Credentials']['SecretAccessKey'],
            aws_session_token=response['Credentials']['SessionToken'])
        return session
    else:
        return boto3.Session()


def update_cname(endpoint, cname, zone, session):
    """
    Upsert the CNAME record so that it resolves to the given endpoint
    """

    route53 = session.client('route53')
    # Fail early with a clear Route 53 error if the hosted zone ID is wrong
    route53.get_hosted_zone(Id=zone)


    dns_changes = {
        'Changes': [
            {
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': cname,
                    'Type': 'CNAME',
                    'TTL': 10,
                    'ResourceRecords': [
                        {
                          'Value': endpoint,
                        }
                    ],
                }
            }
        ]
    }
    print(
        "DEBUG - Updating Route53 to create CNAME {} for {}".format(cname, endpoint))

    route53.change_resource_record_sets(HostedZoneId=zone, ChangeBatch=dns_changes)


def lambda_handler(event, context):
    """
    Main lambda function
    Parse and check the event validity
    """

    msg = json.loads(event['Records'][0]['Sns']['Message'])
    # dict.keys() is not subscriptable in Python 3, so take the first key via an iterator
    msg_type = next(iter(msg))
    msg_event = msg_type.split(':')[1]

    events = ['ReplicationGroupPromotedAsPrimary']

    if msg_event not in events:
        print('Event {} is not valid for Replica-autocname function'.format(msg_type))
        return
    else:
        print(
            'Event {} is valid, processing with Replica-autocname...'.format(msg_type))

    session = aws_session()

    # update_cname returns nothing, so there is no result to keep
    update_cname(ENDPOINT, CNAME, ZONE_ID, session)

After you promote the secondary Region to become a primary cluster, the SNS notification will trigger the Lambda function, which will update the custom DNS record in Route 53 with the new primary cluster endpoint.

Clean up

To avoid future charges after you have verified the solution, you should delete the CloudFormation stacks in both Regions using the AWS CloudFormation console.

Summary

ElastiCache for Valkey Global Datastore provides fully managed, fast, reliable, and secure cross-Region replication. You can write to your ElastiCache for Valkey cluster in one Region and have the data replicated to up to two other Regions, typically with latency of under one second. This enables low-latency reads and disaster recovery across Regions. The feature is available in 36 AWS Regions, 15 of which were recently launched.

In this post, we showed how to implement a multi-Region session store with ElastiCache for Valkey Global Datastore and transitioned from a single Region to a multi-Region architecture according to specific business requirements.

To get started with ElastiCache for Valkey Global Datastore, check out AWS What’s Next: ElastiCache Global Datastore Launch with AWS product expert Ruchita Arora and Replication across AWS Regions using global datastores.


About the Authors

Eran Balan is an AWS Senior Solutions Architect based in EMEA. He works with AWS enterprise customers to provide them with architectural guidance for building scalable architecture in AWS environments.

Ben Fields is an AWS Senior Solutions Architect based out of Seattle, Washington. His interests and experience include databases, containers, AI/ML, and full-stack software development. You can often find him out climbing at the nearest climbing gym, playing ice hockey at the closest rink, or enjoying the warmth of home with a good game.

Yann Richard is an AWS ElastiCache Solutions Architect. On a more personal side, his goal is to make data transit in less than 4 hours and run a marathon in sub-milliseconds, or the opposite.