Best practices: Valkey/Redis OSS clients and Amazon ElastiCache

May 2025: This post was reviewed for accuracy.

In this post, we cover best practices for interacting with Amazon ElastiCache for Valkey and Redis OSS resources with commonly used open-source Valkey or Redis OSS client libraries. ElastiCache is compatible with open-source Valkey and Redis OSS up to version 7.2. Redis OSS clients are compatible with ElastiCache for Valkey. However, you may still have questions about how to optimize your applications and associated client library configurations to interact with ElastiCache. These issues typically arise when operating ElastiCache clusters at a large scale, or gracefully handling cluster resize events. Learn best practices for common scenarios and follow along with code examples of some of the most popular Valkey and open source Redis client libraries (valkey-py, PHPRedis, Lettuce, and the new Valkey GLIDE).

Large number of connections

Individual ElastiCache for Valkey nodes support up to 65,000 concurrent client connections. However, to optimize for performance, we advise that client applications do not constantly operate at that level of connection. Valkey is a single-threaded process based on an event loop where incoming client requests are handled sequentially. That means the response time of a given client becomes longer as the number of connected clients increases.

You can take the following set of actions to avoid hitting a connection bottleneck on the Valkey server:

Perform read operations from read replicas. This can be done by using the ElastiCache reader endpoints in cluster mode disabled or by using replicas for reads in cluster mode enabled.
Distribute write traffic across multiple primary nodes. You can do this in two ways. You can use a multi-sharded Valkey/Redis cluster with a cluster mode capable client. You could also write to multiple primary nodes in cluster mode disabled with client-side sharding.
Use a connection pool when available in your client library.

In general, creating a TCP connection is a computationally expensive operation compared to typical Valkey/Redis commands. For example, handling a SET/GET request is an order of magnitude faster when reusing an existing connection. Using a client connection pool with a finite size reduces the overhead of connection management. It also bounds the number of concurrent incoming connections from the client application.

The following code is an example of using the new Valkey GLIDE’s Java client. Valkey GLIDE is a client that automatically multiplexes many concurrent commands over a single TCP connection, eliminating the need for explicit connection pooling. It also supports multiple languages, including Java, Go, Node.js, and Python.

GlideClientConfiguration config = GlideClientConfiguration.builder()
.address(NodeAddress.builder().host(HOST).port(PORT).build())
.inflightRequestsLimit(1000) // Allow up to 1000 concurrent in-flight requests
.build();

GlideClient client = GlideClient.createClient(config).get())
client.set("mykey", "myvalue").get();

The following code example of PHPRedis shows that a new connection is created for each new user request:

$redis = new Redis();
        if ($redis->connect($HOST, $PORT) != TRUE) {
            //ERROR: connection failed
            return;
        }
        $redis->set($key, $value);
        unset($redis);
        $redis = NULL;

We benchmarked this code in a loop on an Amazon Elastic Compute Cloud (Amazon EC2) instance connected to a Graviton2 (m6g.2xlarge) ElastiCache for Redis node. We placed both the client and server at the same Availability Zone. The average latency of the entire operation was 2.82 milliseconds.

When we updated the code and used persistent connections and a connection pool, the average latency of the entire operation was 0.21 milliseconds:

$redis = new Redis();
        if ($redis->pconnect($HOST, $PORT) != TRUE) {
            // ERROR: connection failed
            return;
        }
        $redis->set($key, $value);
        unset($redis);
        $redis = NULL;

Required redis.ini configurations:

1. redis.pconnect.pooling_enabled=1

2. redis.pconnect.connection_limit=10

The following code is an example of a Valkey-py connection pool:

import valkey
conn = valkey.Valkey(connection_pool=valkey.BlockingConnectionPool(host=HOST, max_connections=10))
conn.set(key, value)

The following code is an example of a Lettuce connection pool:

RedisClient client = RedisClient.create(RedisURI.create(HOST, PORT));
GenericObjectPool<StatefulRedisConnection> pool = ConnectionPoolSupport.createGenericObjectPool(() -> client.connect(), new GenericObjectPoolConfig());
pool.setMaxTotal(10); // Configure max connections to 10
try (StatefulRedisConnection connection = pool.borrowObject()) {
    RedisCommands syncCommands = connection.sync();
    syncCommands.set(key, value);
}

Valkey cluster client discovery and exponential backoff

When connecting to an ElastiCache for Valkey cluster in cluster mode enabled, the corresponding client library must be cluster aware. The clients must obtain a map of hash slots to the corresponding nodes in the cluster in order to send requests to the right nodes and avoid the performance overhead of handing cluster redirections. As a result, the client must discover a complete list of slots and the mapped nodes in two different situations:

The client is initialized and must populate the initial slots configuration
A MOVED redirection is received from the server, such as in the situation of a failover when all slots served by the former primary node are taken over by the replica, or re-sharding when slots are being moved from the source primary to the target primary node

Client discovery is usually done via issuing a CLUSTER SLOT or CLUSTER NODE command to the server. We recommend the CLUSTER SLOT method because it returns the set of slot ranges and the associated primary and replica nodes back to the client. This doesn’t require additional parsing from the client and is more efficient.

Depending on the cluster topology, the size of the response for the CLUSTER SLOT command can vary based on the cluster size. Larger clusters with more nodes produce a larger response. As a result, it’s important to ensure that the number of clients doing the cluster topology discovery doesn’t grow unbounded. For example, when the client application starts up or loses connection from the server and must perform cluster discovery, one common mistake is that the client application fires several reconnection and discovery requests without adding exponential backoff upon retry. This can render the server unresponsive for a prolonged period of time, with the CPU utilization at 100%. The outage is prolonged if each CLUSTER SLOT command must process a large number of nodes in the cluster bus. We have observed multiple client outages in the past due to this behavior across a number of different languages including Python (redis-py-cluster) and Java (Lettuce and Redisson).

To mitigate the impact caused by a sudden influx of connection and discovery requests, we recommend the following:

Implement a client connection pool with a finite size to bound the number of concurrent incoming connections from the client application.
When the client disconnects from the server due to timeout, retry with exponential backoff with jitter. This helps to avoid multiple clients overwhelming the server at the same time.
Use the ElastiCache Configuration Endpoint to perform cluster discovery. In doing so, you spread the discovery load across all nodes in the cluster (up to 90) instead of hitting a few hardcoded seed nodes in the cluster.

The following are some code examples for exponential backoff retry logic in Valkey GLIDE, valkey-py, PHPRedis, and Lettuce.

Backoff logic sample 1: Valkey-py

Valkey-py has a built-in retry mechanism that retries after failure with a customized retry policy. This mechanism can be enabled through the retry_on_timeout and retry arguments supplied when creating a Valkey object. Here, we demonstrate a custom retry mechanism with exponential backoff.

from valkey.backoff import ExponentialBackoff
from valkey.client import Valkey
from valkey.retry import Retry

retry_policy = Retry(ExponentialBackoff(), 3)
client = Valkey(host=HOST, retry_on_timeout=True, retry=retry_policy)
client.set("key", "value")

Depending on your workload, you might want to change the base backoff value to a few tens or hundreds of milliseconds for latency-sensitive workloads.

Backoff logic sample 2: PHPRedis

PHPRedis has a built-in retry mechanism that retries a (non-configurable) maximum of 10 times. There is a configurable delay between tries (with a jitter from the second retry onwards). We’ve submitted a pull request to natively implement exponential backoff in PHPredis (#1986) that has since been merged and documented. For those on the latest release of PHPRedis, it won’t be necessary to implement manually but we’ve included the reference here for those on previous versions. For now, the following is a code example that configures the delay of the retry mechanism:

$timeout = 2; // 2 seconds connection timeout
$retry_interval = 1000; // 1000 millisecond retry interval
$client = new Redis();
if($client->pconnect($HOST, $PORT, $timeout, NULL, $retry_interval) != TRUE){
    return; // ERROR: connection failed
}
$client->set($key, $value);

Tune the connection timeout and retry intervals based on your application needs.

Backoff logic sample 3: Lettuce

Lettuce has built-in retry mechanisms based on the exponential backoff strategies described in the post Exponential Backoff and Jitter. The following is a code excerpt showing the full jitter approach:

public static void main(String[] args)
{
    ClientResources resources = null;
    RedisClient client = null;

    try {
        resources = DefaultClientResources.builder()
                .reconnectDelay(Delay.fullJitter(
            Duration.ofSeconds(1),     // minimum 1 second delay
            Duration.ofSeconds(5),      // maximum 5 second delay
            1, TimeUnit.SECONDS) // 1 second base
        ).build();

        client = RedisClient.create(resources, RedisURI.create(HOST, PORT));
        client.setOptions(ClientOptions.builder()
            .socketOptions(SocketOptions.builder().connectTimeout(Duration.ofSeconds(2)).build()) // 2 second connection timeout
            .timeoutOptions(TimeoutOptions.builder().fixedTimeout(Duration.ofSeconds(5)).build()) // 5 second command timeout
            .build());

     // use the connection pool from above example
    } finally {
        if (connection != null) {
            connection.close();
        }

        if (client != null){
            client.shutdown();
        }

        if (resources != null){
            resources.shutdown();
        }

    }
}

Backoff logic sample 4: Valkey GLIDE

Valkey GLIDE implements exponential backoff for connection retries, and is used for handling reconnection attempts when connections fail. The exponential backoff strategy is defined with three key parameters:

factor: The base delay in milliseconds
exponent_base: The exponential growth factor (for example, 2 means the delay doubles with each retry)
number_of_retries: The maximum number of retry attempts with increasing delays

This logic is centralized in the Rust core, and exposed via all client bindings (Java, Go, Node.js, Python). The following is a code sample in Java.

BackoffStrategy reconnectStrategy = BackoffStrategy.builder()  
    .numOfRetries(5)  
    .exponentBase(2)  
    .factor(100)  
    .build();
  
GlideClientConfiguration config = GlideClientConfiguration.builder()  
    .address(NodeAddress.builder().host(HOST).port(PORT).build())  
    .reconnectStrategy(reconnectStrategy)  
    .build();

Configure a client-side timeout

Configure the client-side timeout appropriately to allow the server sufficient time to process the request and generate the response. This also allows it to fail fast if the connection to the server can’t be established. Certain Redis commands can be more computationally expensive than others. For example, Lua scripts or MULTI/EXEC transactions that contain multiple commands that must be run atomically. In general, a higher client-side timeout is recommended to avoid a time out of the client before the response is received from the server, including the following:

Running commands across multiple keys
Running MULTI/EXEC transactions or Lua scripts that consist of multiple individual Redis commands
Reading large values
Performing blocking operations such as BLPOP

In case of a blocking operation such as BLPOP, the best practice is to set the command timeout to a number lower than the socket timeout.

The following are code examples for implementing a client-side timeout in valkey-py, PHPRedis, Lettuce, and Valkey GLIDE.

Timeout configuration sample 1: valkey-py

The following is a code example with valkey-py:

import valkey
# connect to server with a 2 second timeout
# give every command a 2 second timeout
client = valkey.Valkey(connection_pool=valkey.BlockingConnectionPool(host=HOST, max_connections=10, socket_connect_timeout=2, socket_timeout=2))

res = client.set("key", "value") # will timeout after 2 seconds
print(res)                       # if there is a connection error

res = client.blpop("list", timeout=1) # will timeout after 1 second
                                      # less than the 2 second socket timeout
print(res)

Timeout config sample 2: PHPRedis

The following is a code example with PHPRedis:

// connect to Redis server with a 2s timeout
// give every Redis command a 2s timeout
$client = new Redis();
$timeout = 2; // 2 seconds connection timeout
$retry_interval = 1000; // 1000 millisecond retry interval
$client = new Redis();
if($client->pconnect($HOST, $PORT, $timeout, NULL, $retry_interval, $read_timeout=2) != TRUE){
    return; // ERROR: connection failed
}
$client->set($key, $value);

$res = $client->set("key", "value"); // will timeout after 2 seconds
print "$res\n";                      // if there is a connection error

$res = $client->blpop("list", 1); // will timeout after 1 second
print "$res\n";                   // less than the 2 second socket timeout

Timeout config sample 3: Lettuce

The following is a code example with Lettuce:

// connect to Redis server and give every command a 2 second timeout
public static void main(String[] args)
{
    RedisClient client = null;
    StatefulRedisConnection<String, String> connection = null;
    try {
        client = RedisClient.create(RedisURI.create(HOST, PORT));
        client.setOptions(ClientOptions.builder()
.socketOptions(SocketOptions.builder().connectTimeout(Duration.ofSeconds(2)).build()) // 2 second connection timeout
.timeoutOptions(TimeoutOptions.builder().fixedTimeout(Duration.ofSeconds(2)).build()) // 2 second command timeout 
.build());

        // use the connection pool from above example

        commands.set("key", "value"); // will timeout after 2 seconds
        commands.blpop(1, "list"); // BLPOP with 1 second timeout
    } finally {
        if (connection != null) {
            connection.close();
        }

        if (client != null){
            client.shutdown();
        }
    }
}

Timeout configuration sample 4: Valkey GLIDE

The following is a code example with Valkey GLIDE:

GlideClientConfiguration config = GlideClientConfiguration.builder()  
    .address(NodeAddress.builder().host(HOST).port(PORT).build())  
    .requestTimeout(2000) // 2 seconds in milliseconds  
    .build();
  
GlideClient client = GlideClient.createClient(config).get()  
    // Use the client

Configure a server-side idle timeout

We have observed cases when a customer’s application has a high number of idle clients connected but isn’t actively sending commands. In such scenarios, you can exhaust all 65,000 connections with a high number of idle clients. To avoid such scenarios, configure the timeout setting appropriately on the server via ElastiCache Valkey/Redis parameter groups. This ensures that the server actively disconnects idle clients to avoid an increase in the number of connections.

Lua scripts

Valkey supports more than 200 commands, including those to run Lua scripts. However, when it comes to Lua scripts, there are several pitfalls that can affect memory and availability of Valkey.

Unparameterized Lua scripts

Each Lua script is cached on the Valkey server before it runs. Unparameterized Lua scripts are unique, which can lead to the Valkey server storing many Lua scripts and consuming more memory. To mitigate this, ensure that all Lua scripts are parameterized and regularly perform SCRIPT FLUSH to clean up cached Lua scripts if needed.

The following example shows how to use parameterized scripts. First, we have an example of an unparameterized approach that results in three different cached Lua scripts and is not recommended:

      eval "return redis.call('set','key1','1')" 0
      eval "return redis.call('set','key2','2')" 0
      eval "return redis.call('set','key3','3')" 0

Instead, use the following pattern to create a single script that can accept passed parameters:

      eval "return redis.call('set',KEYS[1],ARGV[1])" 1 key1 1 
      eval "return redis.call('set',KEYS[1],ARGV[1])" 1 key2 2 
      eval "return redis.call('set',KEYS[1],ARGV[1])" 1 key3 3

Long-running Lua scripts

Lua scripts can run multiple commands atomically, so it can take longer to complete than a regular Valkey command. If the Lua script only runs read-only operations, you can stop it in the middle. However, as soon as the Lua script performs a write operation, it becomes unkillable and must run to completion. A long-running Lua script that is mutating can cause the server to be unresponsive for a long time. To mitigate this issue, avoid long-running Lua scripts and test the script out in a pre-production environment.

Lua script with stealth writes

There are a few ways a Lua script can continue to write new data into Valkey even when the server is over maxmemory:

The script starts when the Valkey server is below maxmemory, and contains multiple write operations inside
The script’s first write command isn’t consuming memory (such as DEL), followed by more write operations that consume memory

You can mitigate this problem by configuring a proper eviction policy in Valkey server other than noeviction. This allows Redis to evict items and free up memory in between Lua scripts.

Storing large composite items

We have observed cases where an application stores large composite items in Valkey (such as a multi-GB hash dataset). This is not a recommended practice because it often leads to performance problems. For example, the client can do a HGETALL command to retrieve the entire multi GB hash collection. This can generate significant memory pressure to the Valkey server buffering the large item in the client output buffer. Also, for slot migration in cluster mode, ElastiCache doesn’t migrate slots that contain items with serialized size that is larger than 256 MB.

To solve the large item problems, we have the following recommendations:

Break up the large composite item into multiple smaller items. For example, break up a large hash collection into individual key-value fields with a key name scheme that appropriately reflects the collection, such as using a common prefix in the key name to identify the collection of items. If you must access multiple fields in the same collection atomically, you can use the MGET command to retrieve multiple key-values in the same command.
If you evaluated all options and still can’t break up the large collection dataset, try to use commands that operate on a subset of the data in the collection instead of the entire collection. Avoid having a use case that requires you to atomically retrieve the entire multi-GB collection in the same command. One example is using HGET or HMGET commands instead of HGETALL on hash collections.

Conclusion

In this post, we reviewed Valkey/Redis client library best practices when using ElastiCache, and ways to avoid common pitfalls. By adhering to best practices, you can increase the performance, reliability, and operational excellence of your ElastiCache environments. If you have any questions or feedback, reach out on the Amazon ElastiCache discussion forum or in the comments.

About the Authors

Qu Chen is a senior software development engineer at Amazon ElastiCache – the team responsible for building, operating and maintaining the highly scalable and performant Redis managed service at AWS. In addition, he is an active contributor to the open-source Redis project. In his spare time, he enjoys sports, outdoor activities and playing piano music.

Jim Gallagher is an Amazon ElastiCache Specialist Solutions Architect based in Austin, TX. He helps AWS customers across the world best leverage the power, simplicity, and beauty of Redis. Outside of work he enjoys exploring the Texas Hill Country with his wife and son.

Nathaniel Braun is a Senior Software Development Engineer at Amazon Web Services, based in Tel Aviv, Israel. He designs and operates large-scale distributed systems and likes to tackle difficult problems with his team. Outside of works he enjoys hiking, sailing, and drinking coffee.

Asaf Porat Stoler is a Software Development Manager at Amazon ElastiCache, based in Tel Aviv, Israel. He has vast and diverse experience in storage systems, data reduction, and in-memory databases, and likes performance and resource optimizations. Outside of work he enjoys sport, hiking, and spending time with his family.

AWS Database Blog