Optimize Redis Client Performance for Amazon ElastiCache and MemoryDB
Redis users typically access a Redis service, such as Amazon ElastiCache or Amazon MemoryDB for Redis, using their choice of language-specific open source client libraries. These libraries are built and maintained by independent teams, with contributions from others including AWS. In this post, we share best practices for optimizing Redis client performance for popular Redis client libraries in Python, Java, C#, Node.js, and PHP. The benchmarks in this post were done with Amazon ElastiCache, but most of the performance enhancement principles apply to other Redis systems, including Amazon MemoryDB.
We first provide general definitions that are relevant to all client libraries, and then we dive into each client library. We compare client libraries for the same language to help you make an informed decision about which library best suits your needs. Feel free to jump to the sections on the clients you’re interested in. The sections on the different clients in this post aren’t dependent on one another and can be read in any order.
We also share guidelines to help you avoid some common pitfalls and easily avoidable performance issues we have seen customers face. We cover common scenarios and include our testing setup (you can review the code we used for testing on GitHub), as well as our benchmarking results.
In this section, we provide general definitions relevant to all client libraries.
Synchronous vs. asynchronous API
The different APIs for accessing Redis can essentially be divided into two broad categories: synchronous and asynchronous. A single client library may offer both a synchronous and an asynchronous API, and most of them do.
A synchronous API is blocking. This means that an application must receive a response to an API call before it can move on to other tasks. When using a synchronous API, the application spends a lot of time waiting for responses.
An asynchronous API is non-blocking. This means that an application is free to move on to other tasks before it receives a response to an API call. With an asynchronous API, instead of idly waiting for a response, the application can run other tasks before the response is received, and handle the response once it arrives.
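The difference can be sketched without a Redis server. The following is a minimal illustration in Python's asyncio, assuming a simulated round trip: `fake_get`, `ROUND_TRIP_SECONDS`, and the two calling styles are hypothetical names introduced here, not part of any Redis client.

```python
import asyncio
import time

ROUND_TRIP_SECONDS = 0.02  # simulated network latency (an assumption, not a measurement)


async def fake_get(key: str) -> str:
    """Hypothetical stand-in for a Redis GET: waits one simulated round trip."""
    await asyncio.sleep(ROUND_TRIP_SECONDS)
    return f"value-of-{key}"


async def blocking_style(keys):
    # Synchronous style: wait for each reply before issuing the next call.
    return [await fake_get(k) for k in keys]


async def non_blocking_style(keys):
    # Asynchronous style: issue every call first, then handle the replies.
    return await asyncio.gather(*(fake_get(k) for k in keys))


def timed(coro_fn, keys):
    """Run one of the styles above and return (replies, elapsed seconds)."""
    start = time.perf_counter()
    replies = asyncio.run(coro_fn(keys))
    return replies, time.perf_counter() - start
```

With 10 keys, the first style pays roughly 10 simulated round trips while the second pays roughly one, which is the effect the definitions above describe.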
Connection pools
A connection pool is a store of connections that can be reused when requests to a database are made. Why use a connection pool at all, rather than give every thread or process its own dedicated connection? There are two main reasons. First, Redis has a limit on the number of open connections it can handle, and using a connection pool reduces the risk of crossing that threshold. For more information, see Best practices: Redis clients and Amazon ElastiCache for Redis. Second, for short-lived interactions, using a connection pool can boost performance by saving the overhead of establishing a new connection.
Pipelining and batching
The Redis documentation defines pipelining as the ability to send multiple commands to the server without waiting for the replies at all, and finally reading the replies in a single step.
There are two major ways a client library can implement pipelines.
- Buffering a sequence of commands in memory and sending them to the Redis server as a single batch. We refer to this method as “batching”.
- Using an async API. We refer to this method as “pipelining”.
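To make the batching method concrete, here is a minimal sketch that encodes several commands in the Redis wire protocol (RESP) and joins them into one buffer, so that they could be handed to the socket in a single write. The helper names `encode_command` and `encode_batch` are hypothetical; real client libraries have their own encoders.

```python
def encode_command(*parts: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    chunks = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode()
        chunks.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(chunks)


def encode_batch(commands) -> bytes:
    """Concatenate several encoded commands into a single payload."""
    return b"".join(encode_command(*command) for command in commands)


# Two commands, one buffer: a client could send this with a single write.
payload = encode_batch([("SET", "k", "v"), ("GET", "k")])
```

The server still replies to each command separately; the client reads the two replies in order after the single write.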
Transactions (MULTI/EXEC blocks)
A transaction is a sequence of commands that run atomically. Either all commands are run in order, with no other commands run in-between, or no commands are run at all. In Redis, this is achieved with the MULTI command. After receiving the MULTI command, Redis queues all commands it receives. Redis atomically runs the queued commands only after it receives the EXEC command. Because commands are queued on the server side after receiving the MULTI command, the client is free to send the sequence of commands either one by one or in batches.
For more information on Redis transactions, see Transactions.
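The queueing behavior can be illustrated with a toy, in-memory model. This is only a sketch of the semantics described above, assuming a single client and supporting only SET and GET; `ToyServer` is a hypothetical class and does not model isolation from other clients or error handling inside a transaction.

```python
class ToyServer:
    """Toy single-client model of MULTI/EXEC queueing (not real Redis)."""

    def __init__(self):
        self.data = {}
        self._queue = None  # None means "not inside a transaction"

    def send(self, *command):
        name = command[0].upper()
        if name == "MULTI":
            self._queue = []
            return "OK"
        if name == "EXEC":
            if self._queue is None:
                raise RuntimeError("EXEC without MULTI")
            queued, self._queue = self._queue, None
            # Run everything that was queued, in order, in one step.
            return [self._apply(*c) for c in queued]
        if self._queue is not None:
            self._queue.append(command)  # queued, not yet executed
            return "QUEUED"
        return self._apply(*command)

    def _apply(self, name, key, value=None):
        if name.upper() == "SET":
            self.data[key] = value
            return "OK"
        return self.data.get(key)  # treat anything else as GET
```

Between MULTI and EXEC, `send` only records commands, so the data is unchanged until EXEC runs the whole queue.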
Test setup and system specs
We use the following specs for our testing:
- Commands – 80%
- Size of values in bytes – Drawn from normal distribution with mean = 1024 bytes and 99.99% chance of falling between 4 and 2048 bytes
- Duration – Approximately 10 minutes per test
- Size of key space used for get – 3.75 million
- Size of key space used for set – 3 million
- Before each test – Flush the database from the previous test, and set every key in the set key space (this is to get an 80% hit rate for every test)
- Client EC2 instance – Amazon Elastic Compute Cloud (Amazon EC2) instance c5.4xlarge for Amazon Linux 2
- Amazon ElastiCache specs – We use the following parameters for Amazon ElastiCache:
  - Version 6.0
  - Cluster mode disabled
  - TLS disabled
  - 1 shard, 3 nodes (1 primary, 2 replicas)
  - Node type r6g.2xlarge
  - Availability zone: client and server are in the same AZ
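Two numbers in the setup above can be checked with a little arithmetic. The sketch below assumes that get keys are drawn uniformly and that the set key space is a subset of the get key space (which gives the stated 80% hit rate), and it treats the 4 to 2048 byte range as roughly symmetric around the mean in order to estimate a standard deviation. The standard deviation is our estimate, not a figure from the test setup.

```python
from statistics import NormalDist

GET_KEY_SPACE = 3_750_000
SET_KEY_SPACE = 3_000_000

# Fraction of get requests that find a key populated before the test.
expected_hit_rate = SET_KEY_SPACE / GET_KEY_SPACE

MEAN, LOW, HIGH = 1024, 4, 2048
COVERAGE = 0.9999

# Two-sided quantile of the standard normal for 99.99% coverage.
z = NormalDist().inv_cdf(1 - (1 - COVERAGE) / 2)

# Estimated standard deviation so that mean + z * sigma lands on HIGH.
sigma = (HIGH - MEAN) / z

# Sanity check: probability mass between LOW and HIGH for that sigma.
value_sizes = NormalDist(MEAN, sigma)
mass_in_range = value_sizes.cdf(HIGH) - value_sizes.cdf(LOW)
```

Under these assumptions the standard deviation comes out at a few hundred bytes, so most values are within a few hundred bytes of 1 KB.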
When measuring performance, we usually talk about requests per second (RPS) and latency. The higher the RPS, the better. Latency is a bit more complex to figure out because when we increase the load on the server, the queues get longer and the latency increases. This would force us to limit the load on the server to compare latencies between different clients and scenarios. In addition, when using batches, it’s less clear what we mean by latency. Do we want to measure the latency of a single request or the entire batch? Therefore, for the simplicity of this post, we use only the RPS metric as an indicator of performance.
Our recommendation is to monitor the client and server CPU utilization. High CPU utilization can indicate suboptimal performance. CPU is measured in percent per core; for example, a CPU utilization of 250% is equivalent to a CPU utilization of two and a half cores.
When sending commands in batches, we increase performance by reducing the number of system calls and CPU usage both on the client machine and on the server side.
In this configuration, when using a single connection, the typical latency for one command is 0.2 milliseconds. To measure the latency, we ran redis-benchmark -t ping -c 1 -h <host name>. The following table depicts the latency results in milliseconds.
Therefore, when sending commands through a single connection with a synchronous API, the best possible throughput (using this configuration) is approximately 5,000 RPS (1 second / 0.2 milliseconds).
Both Redis and all client libraries are capable of operating at much higher speeds than the ones displayed in the preceding table; the speed of the network here limits performance. Therefore, when issuing commands through a single connection with a synchronous API, all clients achieve a throughput of approximately 5,000 RPS.
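The ceiling above follows directly from the round-trip time. As a small worked example, the hypothetical helper below computes the upper bound for a synchronous client on one connection, and shows how the bound scales if several commands share one round trip. It ignores server processing time and payload size, so it is an upper bound rather than a prediction.

```python
def max_sync_rps(round_trip_ms: float, commands_per_round_trip: int = 1) -> float:
    """Upper bound on requests per second for one synchronous connection."""
    return commands_per_round_trip * 1000.0 / round_trip_ms


single = max_sync_rps(0.2)        # one command per round trip: about 5,000 RPS
batched = max_sync_rps(0.2, 10)   # ten commands per round trip: about 50,000 RPS
```

This is why batching and pipelining, discussed for each client below, matter so much: they amortize one round trip over many commands.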
Don’t open a new connection for every command (or for every few commands). This is a common mistake. Establishing a TCP connection is slow compared with sending a single command, and doing it frequently has a drastic effect on performance. A connection should be reused throughout the program.
We tested the performance of opening a new connection for every command. The following table summarizes our results.
|Reused Connection||New Connection for Every Command|
|5,000 RPS||250 RPS|
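The table can be turned into a per-command cost. The arithmetic below uses only the two RPS figures above; the implied setup cost is a derived estimate that lumps together everything extra a new connection requires, not a separately measured value.

```python
REUSED_RPS = 5_000
NEW_CONNECTION_RPS = 250

per_command_reused_ms = 1000 / REUSED_RPS        # time per command, reused connection
per_command_new_ms = 1000 / NEW_CONNECTION_RPS   # time per command, new connection each time

# Extra time attributable to opening (and closing) a connection per command.
implied_setup_ms = per_command_new_ms - per_command_reused_ms

slowdown = REUSED_RPS / NEW_CONNECTION_RPS       # how many times slower
```

In this setup, reconnecting for every command is 20 times slower, with nearly all of the roughly 4 milliseconds per command spent on connection setup rather than on the command itself.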
Redis-py is a popular Redis client library for Python. This post works with redis-py version 3.5.3. The documentation and code for redis-py are available on GitHub.
Make sure you install the
hiredis module in addition to the
From the section on parsers in the redis-py README:
“Hiredis is a C library maintained by the core Redis team. Pieter Noordhuis was kind enough to create Python bindings. Using Hiredis can provide up to a 10x speed improvement in parsing responses from the Redis server. The performance increase is most noticeable when retrieving many pieces of data, such as from LRANGE or SMEMBERS operations.”
The Hiredis parser increased performance by approximately 10% when using the workload defined in the test setup section. The following table summarizes the effect on performance:
|Pipeline Size||With Hiredis (RPS)||Without Hiredis (RPS)|
|3||about 10,500||about 10,500|
Batching in redis-py is achieved using a Pipeline object. A Pipeline object in redis-py buffers commands on the client side and flushes them to the server only after the Pipeline.execute method is called.
By default, Pipeline.execute wraps commands in a MULTI/EXEC block. This hurts performance and can be disabled if atomicity isn’t required. To disable it, pass transaction=False when creating the Pipeline object, for example client.pipeline(transaction=False).
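A minimal sketch of non-transactional batching with redis-py might look like the following. The `pipeline(transaction=False)`, `set`, and `execute` calls are redis-py's; the helper name `set_many`, the batch size, and the placeholder host name are assumptions made for illustration. The function only needs an object that behaves like a redis-py client, so it is written without importing redis directly.

```python
def set_many(client, items, batch_size=100, transaction=False):
    """Write (key, value) pairs in batches through a redis-py style pipeline."""
    replies = []
    for start in range(0, len(items), batch_size):
        # transaction=False skips the MULTI/EXEC wrapper around the batch.
        pipe = client.pipeline(transaction=transaction)
        for key, value in items[start:start + batch_size]:
            pipe.set(key, value)        # buffered on the client side
        replies.extend(pipe.execute())  # one flush to the server per batch
    return replies


# Usage with redis-py (requires a reachable server; the host is a placeholder):
#   import redis
#   client = redis.Redis(host="<host name>", port=6379)
#   set_many(client, [("k1", "v1"), ("k2", "v2")])
```

Keeping the batch size as a parameter makes it easy to repeat the pipeline-size comparison from this section against your own workload.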
For an example on how to use batching in redis-py, see Pipelines in the GitHub repo.
The following table summarizes the performance of pipelines with a single connection (in RPS).
|Pipeline Size||Default Behavior (Transaction)||Transactions = False||Percent Improvement|
Because of the global interpreter lock (GIL), which allows only one thread to execute Python code at any given time, Python is less than ideal for writing multithreaded code. Nevertheless, behind the scenes, redis-py manages a connection pool that is shared between threads. Even though only one thread runs at any given time, multithreaded code can still have a positive impact on performance. This is because while one thread is waiting for a response from Redis, a preemptive context switch can allow a different thread to send an additional request.
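The effect can be seen without a Redis server. In the sketch below, `time.sleep` stands in for a blocking socket read (both release the GIL while waiting), so threads overlap their waiting time even though only one runs Python code at a time. `fake_command`, `run`, and the latency constant are hypothetical and chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ROUND_TRIP_SECONDS = 0.01  # simulated latency (an assumption, not a measurement)


def fake_command(_):
    """Stand-in for a blocking Redis call; sleeping releases the GIL."""
    time.sleep(ROUND_TRIP_SECONDS)
    return "OK"


def run(n_commands: int, n_threads: int):
    """Issue n_commands using n_threads and return (replies, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        replies = list(executor.map(fake_command, range(n_commands)))
    return replies, time.perf_counter() - start
```

Running the same number of commands with four threads instead of one should take noticeably less wall-clock time, because the waits overlap.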
For more information, see Connection Pools in the GitHub repo.
By default, redis-py initializes an unbounded and thread-safe connection pool. By unbounded, we mean that by default there is no limit on the number of newly created connections and therefore on the size of the connection pool. By thread-safe, we mean that the user doesn’t need to use any locking mechanisms when accessing the connection pool. Connections are only closed when the entire program terminates. Upon every command (all commands issued using pipeline.execute are treated as one command), the running thread attempts to take a connection from the pool, issues the command, and upon receiving the response from the Redis server, returns the connection to the pool so that it can be used by other threads.
Initially, the pool is empty. If a client attempts to take a connection from an empty pool (all connections are in use or none have been created), the client creates a new connection. When the client is finished using the newly created connection, it places it in the pool exactly as it would have done if it had initially taken it from the pool. The following table summarizes the performance of the multithreaded approach using the default connection pool.
|between 4 and 100||about 11,000||162%|
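The take, create-if-empty, and return cycle described above can be sketched as follows. This is a simplified illustration of the behavior, not redis-py's actual ConnectionPool implementation; `LazyPool` and its `factory` argument are hypothetical.

```python
import threading


class LazyPool:
    """Unbounded pool that starts empty and creates connections on demand."""

    def __init__(self, factory):
        self._factory = factory       # callable that opens a new connection
        self._idle = []
        self._lock = threading.Lock() # makes the pool safe to share between threads
        self.created = 0

    def get(self):
        with self._lock:
            if self._idle:
                return self._idle.pop()  # reuse an idle connection
            self.created += 1
        return self._factory()           # pool is empty: open a new one

    def release(self, connection):
        with self._lock:
            self._idle.append(connection)  # make it available to other threads
```

Because connections are created only when the pool is empty, the pool never holds more connections than the peak number of threads that were waiting on Redis at the same time.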
StackExchange.Redis is a popular Redis client for .NET languages. This post works with StackExchange.Redis version 2.2.50. The documentation and code are available on GitHub.
StackExchange.Redis offers both a synchronous and an asynchronous API. The asynchronous API offers better performance and uses the task-based asynchronous pattern (TAP).
Dividing the workload between several threads has the potential of increasing performance. For this, StackExchange.Redis offers ConnectionMultiplexer, a single thread-safe connection that StackExchange.Redis manages asynchronously. Different threads should share the same ConnectionMultiplexer. Because it’s thread-safe, no user-defined locking mechanisms are required when sharing it between threads.
Some Redis commands, such as BLPOP and BRPOP, block the connection from which they’re sent until certain criteria are met. Because multiple threads using StackExchange.Redis all access the Redis server through a single ConnectionMultiplexer, connection-blocking commands may block the ConnectionMultiplexer indefinitely. Therefore, StackExchange.Redis does not support such connection-blocking commands. For more information about ConnectionMultiplexer, see Basic Usage.
The following table summarizes the performance we measured using the synchronous API in multithreaded scenarios.
We see that multithreading can give a significant performance boost; 20 threads gave more than 8 times the RPS of a single thread. On the other hand, using more than 20 threads significantly increases CPU utilization but does not improve performance. Although using multiple threads with the synchronous API gives a performance boost, we recommend using the asynchronous API as described in the following sections.
Pipelining in StackExchange.Redis is done by sending a sequence of commands asynchronously and waiting for the corresponding tasks to complete. For example, a pipeline of size 2:
When pipelining in StackExchange.Redis, you must use the asynchronous API.
We tested the performance of a range of pipeline sizes using a single thread. The following table summarizes our results.
|Pipeline Size||RPS||CPU Utilization|
StackExchange.Redis also supports batching. This is slightly different than pipelining in that commands are first buffered in client memory and then sent to the Redis server in one batch by calling
In StackExchange.Redis, batched commands are not wrapped in a Multi/Exec block and are therefore not guaranteed to run atomically.
For example, a batch of size 2:
We tested the performance of a few different batch sizes. The following table summarizes our results.
|Batch Size||RPS||CPU Utilization|
We recommend batching over pipelining when possible. Batching makes fewer I/O calls and causes fewer context switches, giving similar RPS for much lower CPU utilization. As the preceding tables show, both pipelines and batches of size 100 give an RPS of about 180,000, but CPU utilization is 300% with pipelines versus about 200% with batches.
Node-redis is a popular Redis client for Node.js. This post works with node-redis version 3.1. The documentation and code are available on GitHub.
Because it’s written in Node.js, node-redis only offers an asynchronous API.
It can be difficult to bound the amount of concurrency when writing async code. Unbounded concurrency can lead to over-consumption of resources. For example, the following code may exhaust the heap, causing the program to crash:
Each call to client.set allocates memory on the heap. The allocated memory can only be freed after the corresponding callback completes. Therefore, if commands are issued faster than they complete, the allocated memory piles up until the heap is exhausted and the program crashes.
For more information on memory leak using node-redis, see Node.js + Redis memory leak.
One way of bounding the amount of concurrency is by limiting the initial amount of asynchronous function calls and only issuing subsequent commands via callbacks. For example, to cap the amount of concurrency of 3,000,000 requests at 1,000 concurrent requests:
This limits the number of incomplete commands to at most 1,000 at any given time. The following table summarizes performance with bounded concurrency.
|Concurrency Bound||RPS||CPU Utilization|
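The bounding technique itself is not specific to Node.js. The sketch below expresses the same idea in Python's asyncio rather than in node-redis: a fixed number of workers each issue their next request only after the previous one completes, so the number of in-flight requests never exceeds the bound. `run_bounded` and `fake_set` are hypothetical names, and the simulated request does no network I/O.

```python
import asyncio


async def run_bounded(total_requests: int, bound: int):
    """Issue total_requests with at most `bound` in flight; return (completed, peak)."""
    in_flight = 0
    peak = 0
    completed = 0
    pending = iter(range(total_requests))  # shared source of work

    async def fake_set(i):
        nonlocal in_flight, peak, completed
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)  # stand-in for waiting on the server reply
        in_flight -= 1
        completed += 1

    async def worker():
        # Each worker sends its next request only after the previous one finished.
        for i in pending:
            await fake_set(i)

    await asyncio.gather(*(worker() for _ in range(bound)))
    return completed, peak
```

For example, `asyncio.run(run_bounded(3_000_000, 1_000))` would mirror the numbers used in the text, with memory use proportional to the bound rather than to the total number of requests.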
Node-redis (v3.1) also supports batching. Batches are not transactions in node-redis.
Batches are sent via a batch object, which buffers commands until the batch.exec method is called, after which it sends all of the buffered commands. For example, a batch of size 2:
Note: node-redis v4 doesn’t have the client.batch command anymore, but it still supports batching. Refer to How are commands batched? to learn more.
We tested the performance of batches, sending the next batch after all the callbacks from the previous batch completed running. The following table summarizes batching performance.
|Batch Size||RPS||CPU Utilization|
We recommend batching when possible. Batching uses less CPU by making fewer I/O calls and fewer context switches. As seen in the first table, a concurrency bound of 1000 requests gives a performance of about 200,000 RPS at 100% CPU. In this case, the CPU is the bottleneck limiting performance. Batching on the other hand uses less CPU and does not reach the CPU bottleneck; hence it is able to get to 300,000 RPS.
Predis vs. phpredis
The Redis open-source community recommends two popular PHP clients: predis and phpredis. Predis is written in PHP and therefore slower than phpredis, which is an extension written in C. This post works with predis version 1.1.9 and phpredis version 5.3.4.
Both predis and phpredis offer synchronous APIs.
Both predis and phpredis support batching via a pipeline object, which buffers a sequence of commands and sends it to the server after all commands have been buffered. The syntax varies slightly between these two clients, but the idea is the same.
We tested the performance of these two clients using a range of pipeline sizes. The following table summarizes the performance comparison.
|Pipeline Size||predis||phpredis|
|1||Approximately 5,000 RPS, 16% CPU utilization||Approximately 5,000 RPS, 10% CPU utilization|
|2||5,373 RPS, 16% CPU utilization||9,126 RPS, 10% CPU utilization|
|3||7,865 RPS, 20% CPU utilization||13,431 RPS, 13% CPU utilization|
|10||20,985 RPS, 25% CPU utilization||35,867 RPS, 20% CPU utilization|
|20||32,902 RPS, 50% CPU utilization||56,682 RPS, 28% CPU utilization|
|50||52,241 RPS, 71% CPU utilization||87,417 RPS, 40% CPU utilization|
|100||67,913 RPS, 91% CPU utilization||115,041 RPS, 48% CPU utilization|
|1,000||75,206 RPS, 100% CPU utilization||161,303 RPS, 75% CPU utilization|
Lettuce is a popular Redis client for the Java programming language. This post works with Lettuce version 6.0.2. The documentation and code are available on GitHub.
Lettuce offers both synchronous and asynchronous APIs. In Lettuce, asynchronous methods return Lettuce futures, which are a handle on Lettuce asynchronous function calls. Among other things, you can use Lettuce futures to wait for asynchronous function calls to complete.
You can implement pipelines in Lettuce in several ways. You can achieve pipelines by asynchronously sending a sequence of commands and waiting for the corresponding futures to complete only after the entire sequence has been sent. For example, a pipeline of size 3:
Lettuce also supports batching. Batching is achieved by setting AutoFlushCommands to false, which causes commands to be buffered instead of being immediately flushed, and calling flushCommands to empty the buffer and send the commands as a single batch. For example, a batch of size 3:
For more information about pipelining and batching in Lettuce, see Pipelining and command flushing.
The following table compares the performance of pipelines vs. batching with a single connection (in RPS).
|Size||Pipeline (RPS)||Batching (RPS)|
|5||28,412 RPS, 37% CPU utilization||34,208 RPS, 36% CPU utilization|
|10||46,896 RPS, 53% CPU utilization||53,918 RPS, 46% CPU utilization|
|20||82,228 RPS, 81% CPU utilization||78,931 RPS, 45% CPU utilization|
|50||135,264 RPS, 105% CPU utilization||115,775 RPS, 50% CPU utilization|
|100||170,867 RPS, 135% CPU utilization||149,172 RPS, 60% CPU utilization|
|200||189,636 RPS, 145% CPU utilization||175,913 RPS, 76% CPU utilization|
|500||239,079 RPS, 160% CPU utilization||188,090 RPS, 81% CPU utilization|
|1,000||231,085 RPS, 160% CPU utilization||215,157 RPS, 93% CPU utilization|
Batching makes fewer I/O calls and causes fewer context switches than pipelining. Hence it uses less CPU than pipelining. We recommend batching over pipelining if your application can tolerate a slightly higher latency. As can be seen from the table above, pipelining can achieve a higher RPS but at the cost of almost doubling the CPU utilization. For example, a pipeline of size 1000 has a CPU utilization of about 160% whereas a batch of size 1000 has a CPU utilization of about 93%. If your application has CPU resources to spare then consider pipelining.
Lettuce supports using multiple connections via a connection pool. Lettuce is built on top of the Netty framework, which is a multi-threaded, event-driven I/O framework (the connections are processed by several threads).
Lettuce uses the Apache Commons-pool2 GenericObjectPool, which we discuss in more detail later in this post.
To initialize a connection pool:
and borrow a connection from the pool:
For more information on connection pools in Lettuce, see Connection Pooling.
We tested the performance of various connection pool sizes in Lettuce in multithreaded scenarios measured in RPS. The following table summarizes our results:
|# of Threads||Connection Pool MaxTotal||RPS|
We found that having more threads than connections in the pool has a negative impact on performance. Nevertheless, the Redis server has a limit on the number of open connections it can handle, and having too many open connections decreases performance on the server side. We explain how to best configure connection pools in the Using GenericObjectPool section.
Jedis is a popular client library for the Java programming language. This post works with Jedis version 3.6.0. The documentation and code are available on GitHub.
Jedis only offers a synchronous API. To send several requests concurrently, we can either use batching or multithreading.
Jedis supports batching via a Pipeline object. The Pipeline object buffers commands on the client side and sends them as a single batch after the Pipeline.sync method is called. For example:
The following table summarizes our batching performance results:
|5||39,004 RPS, 15% CPU utilization|
|10||67,725 RPS, 18% CPU utilization|
|20||101,747 RPS, 22% CPU utilization|
|50||157,015 RPS, 29% CPU utilization|
|100||222,373 RPS, 35% CPU utilization|
|200||285,389 RPS, 38% CPU utilization|
|500||328,210 RPS, 52% CPU utilization|
|1,000||380,912 RPS, 63% CPU utilization|
For more information on batching in Jedis, see Pipelining in the GitHub repo.
Jedis also supports batching for transactions. This is done through a multi object, which, like the Pipeline object, buffers commands on the client side. The buffered commands are sent to the Redis server as a single batch after the exec method is called. Batches that are sent via the multi object are wrapped in a MULTI/EXEC block, and therefore are run atomically by the Redis server.
The following table summarizes the performance of batched transactions.
|5||25,296 RPS, 13% CPU utilization|
|10||36,739 RPS, 15% CPU utilization|
|20||57,763 RPS, 17% CPU utilization|
|50||94,972 RPS, 20% CPU utilization|
|100||140,527 RPS, 26% CPU utilization|
|200||174,067 RPS, 29% CPU utilization|
|500||200,817 RPS, 32% CPU utilization|
|1,000||226,517 RPS, 38% CPU utilization|
Because atomicity takes a toll on performance, we recommend avoiding transactions when atomicity isn’t required.
For more information on transactions in Jedis, see Transactions.
Jedis vs. Lettuce
The following table compares Jedis and Lettuce performance.
In our experience Jedis is up to twice as fast as Lettuce.
Using GenericObjectPool
When working with GenericObjectPool, consider the following:
- maxTotal – The maximum number of connections allowed in the pool (default is 8).
- maxIdle – The maximum number of idle connections allowed in the pool (default is 8).
If your workload is consistent over time, we recommend setting maxTotal = maxIdle to prevent closing connections unnecessarily (we did so in our tests).
Although creating new connections is expensive and should be avoided, if you expect to have short and intense peaks in the usage of the pool’s resources, we recommend setting maxIdle lower than maxTotal in order to reduce the resource consumption on both the server and client side. Using this configuration impacts latency, because connections returned to the pool above maxIdle are closed and must be re-established at the next peak. The assumption here is that usage of concurrent connections above maxIdle is not common.
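The trade-off between the two settings can be illustrated with a toy model. The sketch below is written in Python and only mimics the maxTotal and maxIdle rules described above; `ToyObjectPool` and `simulate_peaks` are hypothetical, and unlike a real pool this one raises an error when exhausted instead of waiting.

```python
class ToyObjectPool:
    """Toy model of a pool governed by max_total and max_idle."""

    def __init__(self, max_total=8, max_idle=8):
        self.max_total, self.max_idle = max_total, max_idle
        self.idle = []
        self.active = 0
        self.created = 0
        self.closed = 0

    def borrow(self):
        if self.idle:
            connection = self.idle.pop()   # reuse an idle connection
        elif self.active < self.max_total:
            self.created += 1              # open a new connection
            connection = object()
        else:
            raise RuntimeError("pool exhausted")
        self.active += 1
        return connection

    def give_back(self, connection):
        self.active -= 1
        if len(self.idle) < self.max_idle:
            self.idle.append(connection)   # keep it for later reuse
        else:
            self.closed += 1               # above max_idle: close it


def simulate_peaks(pool, peaks, concurrent):
    """Borrow `concurrent` connections at once, return them, repeat `peaks` times."""
    for _ in range(peaks):
        borrowed = [pool.borrow() for _ in range(concurrent)]
        for connection in borrowed:
            pool.give_back(connection)
    return pool.created, pool.closed
```

With three peaks of 8 concurrent borrowers, a pool with max_idle equal to max_total creates 8 connections in total, whereas a pool with max_idle of 2 creates 20 and closes 18 of them: fewer idle connections between peaks, at the price of reconnecting during each peak.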
In this blog post, we shared best practices for optimizing the performance of Redis clients. We explored the performance of synchronous and asynchronous APIs, and discussed different methods of pipelining and methods of sharing connections between threads.
For all of the Redis clients that we investigated, we found that batching, pipelining, multithreading, and connection pooling can increase end-to-end performance (RPS) by a significant factor, and that transactions are best avoided when atomicity isn’t required. In some cases, a system can run as much as 5 times faster.
To learn more about best practices for configuring Redis clients in environments with many connections, see Best practices: Redis clients and Amazon ElastiCache for Redis.
Please ask any questions you may have, and let us know what performance you achieve in the comments.
About the Authors
Adi Emanuel Pinsky is a Software Development Engineer at Amazon ElastiCache, based in Tel Aviv, Israel. He holds a dual degree in Mathematics and Computer Science from the Technion – Israel Institute of Technology. He enjoys learning and spends way too much time watching video lectures. When he is not working, Adi enjoys sports, eating great food, and spending time with friends and family.
Barak Gilboa is an SDE at Amazon ElastiCache, based in Tel Aviv, Israel. He works to create a better user experience for the customer. Outside of work he loves spending time with his family, reading books, and long-distance running.
Asaf Porat Stoler is a Software Development Manager at Amazon ElastiCache, based in Tel Aviv, Israel. He has vast and diverse experience in storage systems, data reduction, and in-memory databases, and likes performance and resource optimizations. Outside of work he enjoys sport, hiking, and spending time with his family.
Tzach Kaufmann is a Principal Product Manager for Amazon ElastiCache in the In-Memory Databases team at Amazon Web Services, based in Israel. When not in front of the computer he loves to spend time with his family, hike, ride bicycles, and play sports.