Best practices: Redis clients and Amazon ElastiCache for Redis

In this post, we cover best practices for interacting with Amazon ElastiCache for Redis resources with commonly used open-source Redis client libraries. ElastiCache is compatible with open-source Redis. However, you may still have questions about how to optimize your applications and associated Redis client library configurations to interact with ElastiCache. These questions typically arise when operating ElastiCache clusters at a large scale, or when gracefully handling cluster resize events. Learn best practices for common scenarios and follow along with code examples for some of the most popular open-source Redis client libraries (redis-py, PHPRedis, and Lettuce).

Large number of connections

Individual ElastiCache for Redis nodes support up to 65,000 concurrent client connections. However, to optimize for performance, we advise that client applications do not constantly operate at that level of connection. Redis is a single-threaded process based on an event loop where incoming client requests are handled sequentially. That means the response time of a given client becomes longer as the number of connected clients increases.

You can take the following set of actions to avoid hitting a connection bottleneck on the Redis server:

Perform read operations from read replicas. This can be done by using the ElastiCache reader endpoints in cluster mode disabled or by using replicas for reads in cluster mode enabled.
Distribute write traffic across multiple primary nodes. You can do this in two ways. You can use a multi-sharded Redis cluster with a Redis cluster mode capable client. You could also write to multiple primary nodes in cluster mode disabled with client-side sharding.
Use a connection pool when available in your client library.

In general, creating a TCP connection is a computationally expensive operation compared to typical Redis commands. For example, handling a SET/GET request is an order of magnitude faster when reusing an existing connection. Using a client connection pool with a finite size reduces the overhead of connection management. It also bounds the number of concurrent incoming connections from the client application.

The following code example of PHPRedis shows that a new connection is created for each new user request:

        $redis = new Redis();
        if ($redis->connect($HOST, $PORT) != TRUE) {
            //ERROR: connection failed
            return;
        }
        $redis->set($key, $value);
        unset($redis);
        $redis = NULL;

We benchmarked this code in a loop on an Amazon Elastic Compute Cloud (Amazon EC2) instance connected to a Graviton2 (m6g.2xlarge) ElastiCache for Redis node. We placed both the client and server at the same Availability Zone. The average latency of the entire operation was 2.82 milliseconds.

When we updated the code and used persistent connections and a connection pool, the average latency of the entire operation was 0.21 milliseconds:

        $redis = new Redis();
        if ($redis->pconnect($HOST, $PORT) != TRUE) {
            // ERROR: connection failed
            return;
        }
        $redis->set($key, $value);
        unset($redis);
        $redis = NULL;

Required redis.ini configurations:

1. redis.pconnect.pooling_enabled=1

2. redis.pconnect.connection_limit=10

The following code is an example of a redis-py connection pool:

import redis

conn = redis.Redis(connection_pool=redis.BlockingConnectionPool(host=HOST, max_connections=10))
conn.set(key, value)

The following code is an example of a Lettuce connection pool:

RedisClient client = RedisClient.create(RedisURI.create(HOST, PORT));
GenericObjectPool<StatefulRedisConnection> pool = ConnectionPoolSupport.createGenericObjectPool(() -> client.connect(), new GenericObjectPoolConfig());
pool.setMaxTotal(10); // Configure max connections to 10
try (StatefulRedisConnection connection = pool.borrowObject()) {
    RedisCommands syncCommands = connection.sync();
    syncCommands.set(key, value);
}

Redis cluster client discovery and exponential backoff

When connecting to an ElastiCache for Redis cluster in cluster mode enabled, the corresponding Redis client library must be cluster aware. The client must obtain a map of hash slots to the corresponding nodes in the cluster in order to send requests to the right nodes and avoid the performance overhead of handling cluster redirections. As a result, the client must discover a complete list of slots and the mapped nodes in two different situations:

The client is initialized and must populate the initial slots configuration
A MOVED redirection is received from the server, such as in the situation of a failover when all slots served by the former primary node are taken over by the replica, or re-sharding when slots are being moved from the source primary to the target primary node

Client discovery is usually done by issuing a CLUSTER SLOTS or CLUSTER NODES command to the Redis server. We recommend the CLUSTER SLOTS method because it returns the set of slot ranges and the associated primary and replica nodes back to the client. This doesn’t require additional parsing from the client and is more efficient.
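To make the slot map concrete, the following is a minimal Python sketch of how a cluster-aware client might route a key once it has the discovery result. It assumes the reply has already been reduced to (start slot, end slot, primary, replicas) rows; the node addresses and the helper names (key_slot, build_slot_table, node_for_key) are hypothetical and not part of any client library. Real clients such as the ones covered in this post do this for you.

```python
import binascii
from bisect import bisect_right

SLOT_COUNT = 16384  # number of hash slots in a Redis cluster


def key_slot(key: str) -> int:
    """Return the hash slot for a key: CRC16(key) mod 16384."""
    data = key.encode("utf-8")
    # Hash tags: if the key has a non-empty "{...}" section, only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant Redis uses.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def build_slot_table(rows):
    """Sort (start, end, primary, replicas) rows by start slot for binary search."""
    return sorted(rows, key=lambda row: row[0])


def node_for_key(table, key):
    """Look up the primary node that owns the slot of the given key."""
    slot = key_slot(key)
    starts = [row[0] for row in table]
    row = table[bisect_right(starts, slot) - 1]
    if not row[0] <= slot <= row[1]:
        raise LookupError(f"slot {slot} is not covered by the slot table")
    return row[2]


# Hypothetical three-shard topology, shaped like a simplified CLUSTER SLOTS reply.
SAMPLE_ROWS = [
    (0, 5460, "10.0.0.1:6379", ["10.0.0.4:6379"]),
    (5461, 10922, "10.0.0.2:6379", ["10.0.0.5:6379"]),
    (10923, 16383, "10.0.0.3:6379", ["10.0.0.6:6379"]),
]

table = build_slot_table(SAMPLE_ROWS)
print(key_slot("foo"), node_for_key(table, "foo"))
```

When a MOVED redirection arrives, a client would typically rebuild this table from a fresh discovery call rather than patch individual entries.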

The size of the response to the CLUSTER SLOTS command varies with the cluster size: larger clusters with more nodes produce a larger response. As a result, it’s important to ensure that the number of clients performing cluster topology discovery doesn’t grow unbounded. For example, when the client application starts up or loses its connection to the server and must perform cluster discovery, one common mistake is that the client application fires several reconnection and discovery requests without adding exponential backoff upon retry. This can render the Redis server unresponsive for a prolonged period of time, with the CPU utilization at 100%. The outage is prolonged if each CLUSTER SLOTS command must process a large number of nodes in the cluster bus. We have observed multiple client outages in the past due to this behavior across a number of different languages including Python (redis-py-cluster) and Java (Lettuce and Redisson).

To mitigate the impact caused by a sudden influx of connection and discovery requests, we recommend the following:

Implement a client connection pool with a finite size to bound the number of concurrent incoming connections from the client application.
When the client disconnects from the server due to timeout, retry with exponential backoff with jitter. This helps to avoid multiple clients overwhelming the server at the same time.
Use the ElastiCache Configuration Endpoint to perform cluster discovery. In doing so, you spread the discovery load across all nodes in the cluster (up to 90) instead of hitting a few hardcoded seed nodes in the cluster.

The following are some code examples for exponential backoff retry logic in redis-py, PHPRedis, and Lettuce.

Backoff logic sample 1: redis-py

redis-py has a built-in retry mechanism that retries one time immediately after a failure. This mechanism can be enabled through the retry_on_timeout argument supplied when creating a Redis object. Here we demonstrate a custom retry mechanism with exponential backoff and jitter. We’ve submitted a pull request to natively implement exponential backoff in redis-py (#1494). In the future it may not be necessary to implement manually.

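import random
from time import sleep

# redis-py raises its own exception classes, not the Python built-ins
from redis.exceptions import ConnectionError, TimeoutError

def run_with_backoff(function, retries=5):
  base_backoff = 0.1 # base 100ms backoff
  max_backoff = 10 # sleep for maximum 10 seconds
  tries = 0
  while True:
    try:
      return function()
    except (ConnectionError, TimeoutError):
      if tries >= retries:
        raise
      # exponential term plus random jitter, capped at max_backoff
      backoff = min(max_backoff, base_backoff * (pow(2, tries) + random.random()))
      print(f"sleeping for {backoff:.2f}s")
      sleep(backoff)
      tries += 1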

You can then use the following code to set a value:

client = redis.Redis(connection_pool=redis.BlockingConnectionPool(host=HOST, max_connections=10))
res = run_with_backoff(lambda: client.set("key", "value"))
print(res)

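Depending on your workload, you might want to tune the base backoff value. The preceding example uses 100 milliseconds; latency-sensitive workloads may prefer a few tens of milliseconds, whereas less latency-sensitive workloads can start from a larger value such as 1 second.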
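To get a feel for how the base value affects total wait time, the following sketch computes the lowest and highest possible delay for each retry, using the same formula as run_with_backoff (an exponential term plus jitter in the range 0 to 1, capped at the maximum). The function name backoff_bounds and the sample base values are illustrative assumptions, not part of redis-py.

```python
def backoff_bounds(base_backoff, max_backoff, retries):
    """Return (lowest, highest) possible sleep in seconds for each retry."""
    bounds = []
    for tries in range(retries):
        # jitter of 0 gives the lower bound, jitter approaching 1 the upper bound
        low = min(max_backoff, base_backoff * 2 ** tries)
        high = min(max_backoff, base_backoff * (2 ** tries + 1))
        bounds.append((low, high))
    return bounds


# Compare a few hypothetical base values with a 10 second cap and 5 retries.
for base in (0.02, 0.1, 1.0):
    worst_case = sum(high for _, high in backoff_bounds(base, 10, 5))
    print(f"base={base}s, worst-case total wait over 5 retries: {worst_case:.2f}s")
```

A smaller base keeps the worst-case wait short for latency-sensitive callers, at the cost of retrying sooner against a server that may still be recovering.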

Backoff logic sample 2: PHPRedis

PHPRedis has a built-in retry mechanism that retries a (non-configurable) maximum of 10 times. There is a configurable delay between tries (with a jitter from the second retry onwards). We’ve submitted a pull request to natively implement exponential backoff in PHPRedis (#1986) that has since been merged and documented. If you’re on the latest release of PHPRedis, you don’t need to implement it manually, but we’ve included the reference here for those on previous versions. The following code example configures the delay of the built-in retry mechanism:

$timeout = 0.1; // 100 millisecond connection timeout
$retry_interval = 100; // 100 millisecond retry interval
$client = new Redis();
if($client->pconnect($HOST, $PORT, $timeout, NULL, $retry_interval) != TRUE){
    return; // ERROR: connection failed
}
$client->set($key, $value);

Backoff logic sample 3: Lettuce

Lettuce has built-in retry mechanisms based on the exponential backoff strategies described in the post Exponential Backoff and Jitter. The following is a code excerpt showing the full jitter approach:

public static void main(String[] args)
{
    ClientResources resources = null;
    RedisClient client = null;
    StatefulRedisConnection<String, String> connection = null;

    try {
        resources = DefaultClientResources.builder()
                .reconnectDelay(Delay.fullJitter(
                        Duration.ofMillis(100),      // minimum 100 millisecond delay
                        Duration.ofSeconds(5),       // maximum 5 second delay
                        100, TimeUnit.MILLISECONDS)) // 100 millisecond base
                .build();

        client = RedisClient.create(resources, RedisURI.create(HOST, PORT));
        client.setOptions(ClientOptions.builder()
                .socketOptions(SocketOptions.builder().connectTimeout(Duration.ofMillis(100)).build()) // 100 millisecond connection timeout
                .timeoutOptions(TimeoutOptions.builder().fixedTimeout(Duration.ofSeconds(5)).build()) // 5 second command timeout
                .build());

        // obtain a connection here, for example from the connection pool in the earlier example
    } finally {
        if (connection != null) {
            connection.close();
        }

        if (client != null) {
            client.shutdown();
        }

        if (resources != null) {
            resources.shutdown();
        }
    }
}

Configure a client-side timeout

Configure the client-side timeout appropriately to allow the server sufficient time to process the request and generate the response. A well-chosen timeout also allows the client to fail fast if the connection to the server can’t be established. Certain Redis commands are more computationally expensive than others, for example Lua scripts or MULTI/EXEC transactions that contain multiple commands that must be run atomically. In general, a higher client-side timeout is recommended in the following cases, so that the client doesn’t time out before the response is received from the server:

Running commands across multiple keys
Running MULTI/EXEC transactions or Lua scripts that consist of multiple individual Redis commands
Reading large values
Performing blocking operations such as BLPOP

In case of a blocking operation such as BLPOP, the best practice is to set the command timeout to a number lower than the socket timeout.
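One way to enforce this rule in application code is a small guard that derives the blocking timeout from the socket timeout. The following is a sketch under stated assumptions: the helper name blocking_timeout and the 1 second safety margin are hypothetical choices, and the returned value is meant to be passed as the timeout argument of a blocking command such as BLPOP.

```python
def blocking_timeout(requested, socket_timeout, margin=1.0):
    """Return a blocking-command timeout (seconds) that expires before the socket timeout."""
    ceiling = socket_timeout - margin
    if ceiling <= 0:
        raise ValueError("socket_timeout must be larger than the safety margin")
    # A timeout of 0 asks Redis to block indefinitely, which would always
    # outlast the socket timeout, so it is replaced by the largest safe value.
    if requested <= 0:
        return ceiling
    return min(requested, ceiling)


# With a 2 second socket timeout, a 5 second request is capped at 1 second.
print(blocking_timeout(5, socket_timeout=2))
```

Some client library versions accept only whole seconds for blocking commands, so you may need to round the result down to an integer before using it.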

The following are code examples for implementing a client-side timeout in redis-py, PHPRedis, and Lettuce.

Timeout configuration sample 1: redis-py

The following is a code example with redis-py:

# connect to Redis server with a 100 millisecond timeout
# give every Redis command a 2 second timeout
client = redis.Redis(connection_pool=redis.BlockingConnectionPool(
    host=HOST, max_connections=10, socket_connect_timeout=0.1, socket_timeout=2))

res = client.set("key", "value") # will timeout after 2 seconds
print(res)                       # if there is a connection error

res = client.blpop("list", timeout=1) # will timeout after 1 second
                                      # less than the 2 second socket timeout
print(res)

Timeout configuration sample 2: PHPRedis

The following is a code example with PHPRedis:

// connect to Redis server with a 100ms timeout
// give every Redis command a 2s timeout
$timeout = 0.1; // 100 millisecond connection timeout
$retry_interval = 100; // 100 millisecond retry interval
$read_timeout = 2; // 2 second command (read) timeout
$client = new Redis();
if($client->pconnect($HOST, $PORT, $timeout, NULL, $retry_interval, $read_timeout) != TRUE){
    return; // ERROR: connection failed
}

$res = $client->set("key", "value"); // will timeout after 2 seconds
print "$res\n";                      // if there is a connection error

$res = $client->blpop("list", 1); // will timeout after 1 second
print_r($res);                    // less than the 2 second socket timeout

Timeout configuration sample 3: Lettuce

The following is a code example with Lettuce:

// connect to Redis server and give every command a 2 second timeout
public static void main(String[] args)
{
    RedisClient client = null;
    StatefulRedisConnection<String, String> connection = null;
    try {
        client = RedisClient.create(RedisURI.create(HOST, PORT));
        client.setOptions(ClientOptions.builder()
                .socketOptions(SocketOptions.builder().connectTimeout(Duration.ofMillis(100)).build()) // 100 millisecond connection timeout
                .timeoutOptions(TimeoutOptions.builder().fixedTimeout(Duration.ofSeconds(2)).build()) // 2 second command timeout
                .build());

        // for brevity, open a single connection; the connection pool from the earlier example works as well
        connection = client.connect();
        RedisCommands<String, String> commands = connection.sync();

        commands.set("key", "value"); // will timeout after 2 seconds
        commands.blpop(1, "list"); // BLPOP with 1 second timeout
    } finally {
        if (connection != null) {
            connection.close();
        }

        if (client != null) {
            client.shutdown();
        }
    }
}

Configure a server-side idle timeout

We have observed cases where a customer’s application keeps a high number of idle clients connected that aren’t actively sending commands. In such scenarios, idle clients alone can exhaust all 65,000 connections. To avoid this, configure the timeout setting appropriately on the server via ElastiCache Redis parameter groups. This ensures that the server actively disconnects idle clients, which keeps the number of connections from growing.
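As an illustration, the following Python sketch sets the timeout parameter on a parameter group through the AWS SDK for Python (boto3). The parameter group name and the 300 second value are hypothetical, and the sketch assumes a custom parameter group, because default parameter groups can’t be modified. The request builder is kept separate from the API call so it can be inspected without AWS access.

```python
def idle_timeout_request(parameter_group_name, idle_seconds):
    """Build the arguments for ModifyCacheParameterGroup to set the idle timeout."""
    if idle_seconds < 0:
        raise ValueError("idle_seconds must not be negative")
    return {
        "CacheParameterGroupName": parameter_group_name,
        "ParameterNameValues": [
            # "timeout" is the number of seconds a client may stay idle
            # before the server closes the connection (0 disables it)
            {"ParameterName": "timeout", "ParameterValue": str(idle_seconds)},
        ],
    }


def apply_idle_timeout(parameter_group_name, idle_seconds):
    """Send the change to ElastiCache; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    client = boto3.client("elasticache")
    return client.modify_cache_parameter_group(
        **idle_timeout_request(parameter_group_name, idle_seconds)
    )


# Hypothetical parameter group name and a 5 minute idle timeout.
print(idle_timeout_request("my-redis-params", 300))
```

Pick a value comfortably above the longest legitimate pause between commands, so that pooled connections aren’t closed while still in use.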

Redis Lua scripts

Redis supports more than 200 commands, including those to run Lua scripts. However, when it comes to Lua scripts, there are several pitfalls that can affect memory and availability of Redis.

Unparameterized Lua scripts

Each Lua script is cached on the Redis server before it runs. Unparameterized Lua scripts are unique, which can lead to the Redis server storing a large number of Lua scripts and consuming more memory. To mitigate this, ensure that all Lua scripts are parameterized and regularly perform SCRIPT FLUSH to clean up cached Lua scripts if needed.

The following example shows how to use parameterized scripts. First, we have an example of an unparameterized approach that results in three different cached Lua scripts and is not recommended:

      eval "return redis.call('set','key1','1')" 0
      eval "return redis.call('set','key2','2')" 0
      eval "return redis.call('set','key3','3')" 0

Instead, use the following pattern to create a single script that can accept passed parameters:

      eval "return redis.call('set',KEYS[1],ARGV[1])" 1 key1 1 
      eval "return redis.call('set',KEYS[1],ARGV[1])" 1 key2 2 
      eval "return redis.call('set',KEYS[1],ARGV[1])" 1 key3 3
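The difference is easy to see without a server, because Redis identifies a cached script by the SHA1 digest of its source text (the same digest that EVALSHA takes). The following Python sketch counts the distinct digests for each approach; the helper name script_cache_key is an illustrative assumption.

```python
import hashlib


def script_cache_key(script: str) -> str:
    """Return the SHA1 hex digest that identifies a script body in the script cache."""
    return hashlib.sha1(script.encode("utf-8")).hexdigest()


# Three unparameterized scripts: the key and value are baked into the source text.
unparameterized = [f"return redis.call('set','key{i}','{i}')" for i in (1, 2, 3)]

# One parameterized script reused three times with different KEYS/ARGV.
parameterized = ["return redis.call('set',KEYS[1],ARGV[1])"] * 3

print(len({script_cache_key(s) for s in unparameterized}))  # 3 distinct cache entries
print(len({script_cache_key(s) for s in parameterized}))    # 1 cache entry
```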

Long-running Lua scripts

Lua scripts can run multiple commands atomically, so they can take longer to complete than a regular Redis command. If the Lua script only runs read-only operations, you can stop it in the middle with the SCRIPT KILL command. However, as soon as the Lua script performs a write operation, it becomes unkillable and must run to completion. A long-running Lua script that mutates data can cause the Redis server to be unresponsive for a long time. To mitigate this issue, avoid long-running Lua scripts and test each script in a pre-production environment.

Lua script with stealth writes

There are a few ways a Lua script can continue to write new data into Redis even when Redis is over maxmemory:

The script starts when the Redis server is below maxmemory, and contains multiple write operations inside
The script’s first write command isn’t consuming memory (such as DEL), followed by more write operations that consume memory

You can mitigate this problem by configuring a proper eviction policy in Redis server other than noeviction. This allows Redis to evict items and free up memory in between Lua scripts.

Storing large composite items

We have observed cases where an application stores large composite items in Redis (such as a multi-GB hash dataset). This is not a recommended practice because it often leads to performance problems in Redis. For example, the client can issue an HGETALL command to retrieve the entire multi-GB hash collection. This can put significant memory pressure on the Redis server, which must buffer the large item in the client output buffer. Also, for slot migration in cluster mode, ElastiCache doesn’t migrate slots that contain items whose serialized size is larger than 256 MB.

To solve the large item problems, we have the following recommendations:

Break up the large composite item into multiple smaller items. For example, break up a large hash collection into individual key-value fields with a key name scheme that appropriately reflects the collection, such as using a common prefix in the key name to identify the collection of items. If you must access multiple fields in the same collection atomically, you can use the MGET command to retrieve multiple key-values in the same command.
If you evaluated all options and still can’t break up the large collection dataset, try to use commands that operate on a subset of the data in the collection instead of the entire collection. Avoid having a use case that requires you to atomically retrieve the entire multi-GB collection in the same command. One example is using HGET or HMGET commands instead of HGETALL on hash collections.
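Both recommendations can be sketched in a few lines of Python. The helpers below (member_key, split_collection, chunked) and the colon-separated key scheme are hypothetical choices for illustration; the commented redis-py calls assume a client object like the one created in the earlier examples.

```python
from itertools import islice


def member_key(collection, field):
    """Name an individual item with the collection name as a common prefix."""
    return f"{collection}:{field}"


def split_collection(collection, items):
    """Map every field of a large hash to its own key-value pair."""
    return {member_key(collection, field): value for field, value in items.items()}


def chunked(names, size):
    """Yield lists of at most `size` names, for bounded MGET or HMGET calls."""
    iterator = iter(names)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


profile = {"name": "Ada", "email": "ada@example.com", "plan": "pro"}
pairs = split_collection("user:42", profile)
print(pairs)

# With redis-py, the pairs could be written and read back in bounded batches:
#   client.mset(pairs)
#   for batch in chunked(list(pairs), 2):
#       values = client.mget(batch)
# If the data must stay in one hash, read a subset of fields instead of HGETALL:
#   for batch in chunked(list(profile), 2):
#       values = client.hmget("user:42", batch)
for batch in chunked(list(pairs), 2):
    print(batch)
```

In cluster mode enabled, a multi-key command such as MGET only works when all keys map to the same hash slot, so a key scheme for that case would need a shared hash tag, for example {user:42}:name.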

Conclusion

In this post, we reviewed Redis client library best practices when using ElastiCache, and ways to avoid common pitfalls. By adhering to best practices, you can increase the performance, reliability, and operational excellence of your ElastiCache environments. If you have any questions or feedback, reach out on the Amazon ElastiCache discussion forum or in the comments.

About the Authors

Qu Chen is a senior software development engineer at Amazon ElastiCache – the team responsible for building, operating and maintaining the highly scalable and performant Redis managed service at AWS. In addition, he is an active contributor to the open-source Redis project. In his spare time, he enjoys sports, outdoor activities and playing piano music.

Jim Gallagher is an Amazon ElastiCache Specialist Solutions Architect based in Austin, TX. He helps AWS customers across the world best leverage the power, simplicity, and beauty of Redis. Outside of work he enjoys exploring the Texas Hill Country with his wife and son.

Nathaniel Braun is a Senior Software Development Engineer at Amazon Web Services, based in Tel Aviv, Israel. He designs and operates large-scale distributed systems and likes to tackle difficult problems with his team. Outside of work he enjoys hiking, sailing, and drinking coffee.

Asaf Porat Stoler is a Software Development Manager at Amazon ElastiCache, based in Tel Aviv, Israel. He has vast and diverse experience in storage systems, data reduction, and in-memory databases, and likes performance and resource optimizations. Outside of work he enjoys sport, hiking, and spending time with his family.