How do I troubleshoot high latency issues when using ElastiCache for Redis?
Last updated: 2022-08-04
The following are common reasons for elevated latencies or time-out issues in ElastiCache for Redis:
- Latency caused by slow commands.
- High memory usage leading to increased swapping.
- Latency caused by network issues.
- Client-side latency issues.
- ElastiCache cluster events.
Latency caused by slow commands
Redis is mostly single-threaded. When a request is slow to serve, all other clients must wait for it, and this waiting adds to command latencies. Redis commands also have a defined time complexity, expressed in Big O notation.
Use the Amazon CloudWatch metrics provided by ElastiCache to monitor the average latency for different classes of commands. Note that common Redis operations complete in microseconds. CloudWatch metrics are sampled every minute, and the latency metrics show an aggregate of multiple commands. So, a single slow command can cause unexpected results, such as timeouts, without showing significant changes in the metric graphs. In these situations, use the SLOWLOG command to determine which commands are taking longer to complete. Connect to the cluster and run the SLOWLOG GET 128 command in redis-cli to retrieve the list. For more information, see How do I turn on Redis Slow log in an ElastiCache for Redis cache cluster?
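As an illustration, the raw reply from SLOWLOG GET is an array of entries (entry ID, Unix timestamp, duration in microseconds, command arguments, and in newer Redis versions the client address and name). The following sketch formats such entries for quick triage; the sample data is hypothetical:

```python
from datetime import datetime, timezone

def format_slowlog(entries):
    """Format raw SLOWLOG GET entries (id, timestamp, duration in
    microseconds, command args) into readable lines, slowest first."""
    lines = []
    for entry in sorted(entries, key=lambda e: e[2], reverse=True):
        entry_id, ts, micros, args = entry[:4]
        when = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
        cmd = " ".join(a.decode() if isinstance(a, bytes) else str(a)
                       for a in args)
        lines.append(f"#{entry_id} {when} {micros}us {cmd}")
    return lines

# Hypothetical sample entries, in the raw array form Redis returns:
sample = [
    [14, 1660000000, 120345, [b"KEYS", b"user:*"]],
    [13, 1659999990, 2045, [b"GET", b"user:42"]],
]
for line in format_slowlog(sample):
    print(line)
```

Sorting by duration puts the command most likely to have blocked the engine at the top, which is usually the one worth investigating first.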
You might also see an increase in the EngineCPUUtilization metric in CloudWatch due to slow commands blocking the Redis engine. For more information, see Why am I seeing high or increasing CPU usage in my ElastiCache for Redis cluster?
Examples of complex commands include:
- KEYS, which sweeps the entire keyspace (O(N)) searching for matching patterns. Use SCAN instead to iterate incrementally.
- Long-running Lua scripts run with EVAL or EVALSHA, which block the engine until they finish.
- Commands that operate on large data structures, such as SORT, LREM, SMEMBERS, or HGETALL against big keys.
High memory usage leading to increased swapping
Redis starts to swap pages when there is increased memory pressure on the cluster by using more memory than what is available. Latency and timeout issues increase when memory pages are transferred to and from the swap area. The following are indications in CloudWatch metrics of increased swapping:
- Increasing of SwapUsage.
- Very low FreeableMemory.
- High BytesUsedForCache and DatabaseMemoryUsagePercentage metrics.
SwapUsage is a host-level metric that indicates the amount of memory being swapped. It's normal for this metric to show non-zero values because it's controlled by the underlying operating system and can be influenced by many dynamic factors. These factors include OS version, activity patterns, and so on. Linux proactively swaps idle keys (rarely accessed by clients) to disk as an optimization technique to free up memory space for more frequently used keys.
Swapping becomes a problem when there isn't enough available memory and the system starts moving pages back and forth between disk and memory. SwapUsage of less than a few hundred megabytes doesn't negatively impact Redis performance. There are performance impacts if SwapUsage is high and actively changing, and there isn't enough memory available on the cluster.
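The guidance above can be expressed as a simple check. The thresholds here are illustrative assumptions, not official limits; modest swap alone is normal, while high swap combined with very low freeable memory suggests real pressure:

```python
def swap_pressure(swap_usage_bytes, freeable_memory_bytes,
                  swap_threshold=100 * 1024 * 1024,
                  memory_floor=50 * 1024 * 1024):
    """Flag swap pressure: high SwapUsage is only a concern when
    FreeableMemory is also nearly exhausted. Thresholds are
    illustrative, not official guidance."""
    high_swap = swap_usage_bytes > swap_threshold
    low_memory = freeable_memory_bytes < memory_floor
    return high_swap and low_memory

# Healthy node: some swap, plenty of freeable memory.
print(swap_pressure(20 * 1024 * 1024, 2 * 1024**3))        # False
# Pressured node: high swap and almost no freeable memory.
print(swap_pressure(600 * 1024 * 1024, 10 * 1024 * 1024))  # True
```

Feeding this with SwapUsage and FreeableMemory datapoints retrieved from CloudWatch gives a quick per-minute pressure signal.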
Latency caused by network issues
Network latency between the client and the ElastiCache cluster
To isolate network latency between the client and cluster nodes, use TCP traceroute or mtr tests from the application environment. Or, use a debugging tool such as the AWSSupport-SetupIPMonitoringFromVPC AWS Systems Manager document (SSM document) to test connections from the client subnet.
The cluster is hitting network limits
An ElastiCache node shares the same network limits as that of corresponding type Amazon Elastic Compute Cloud (Amazon EC2) instances. For example, the node type of cache.m6g.large has the same network limits as the m6g.large EC2 instance. For information on checking the three key network performance components of bandwidth capability, packet-per-second (PPS) performance, and connections tracked, see Monitor network performance for your EC2 instance.
To troubleshoot network limit issues on your ElastiCache node, see Troubleshooting - Network-related limits.
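As a rough sketch, you can convert one-minute NetworkBytesOut samples (Sum statistic, in bytes) into gigabits per second and compare them against your node type's documented baseline. The baseline value and the sample datapoints below are assumptions for illustration:

```python
def gbps_from_bytes(byte_count, period_seconds=60):
    """Convert a per-period byte total (such as a CloudWatch
    NetworkBytesOut Sum over one minute) to gigabits per second."""
    return byte_count * 8 / period_seconds / 1e9

def near_bandwidth_limit(datapoints_bytes, baseline_gbps, margin=0.8):
    """True if any sample exceeds `margin` of the instance's baseline
    bandwidth. Look up the real baseline for your node type in the
    EC2 instance documentation; the value passed in is an assumption."""
    return any(gbps_from_bytes(b) > margin * baseline_gbps
               for b in datapoints_bytes)

# Hypothetical one-minute samples for a node with an assumed
# 10 Gbps baseline: 70e9 bytes/min is roughly 9.3 Gbps.
samples = [30e9, 45e9, 70e9]
print(near_bandwidth_limit(samples, baseline_gbps=10))  # True
```

Sustained samples near the baseline, or microbursts between samples, are a sign to check the allowance-exceeded counters described in the EC2 network performance documentation.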
TCP/SSL handshake latency
Clients connect to Redis clusters using a TCP connection. Creating a TCP connection takes a few milliseconds, which adds overhead to the Redis operations run by your application and extra pressure on the Redis CPU. It's important to control the volume of new connections when your cluster uses the ElastiCache in-transit encryption feature, because of the extra time and CPU utilization needed for a TLS handshake. A high volume of rapidly opened and closed connections (NewConnections) might impact the node's performance. You can use connection pooling to cache established TCP connections in a pool. The connections are then reused each time a client needs to talk to the cluster. You can implement connection pooling using your Redis client library (if supported), with a framework available for your application environment, or build it from the ground up. You can also use aggregated commands such as MSET/MGET as an optimization technique.
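Most Redis client libraries ship a pooled mode; where they don't, the pattern can be built by hand. The following is a minimal, generic sketch of the idea, with a DummyConn factory standing in for opening a real connection to the cluster:

```python
import queue

class ConnectionPool:
    """Minimal sketch of the connection-pooling pattern: reuse
    established connections instead of paying a TCP/TLS handshake
    per request. `connect` is a caller-supplied factory."""
    def __init__(self, connect, size=10):
        self._connect = connect
        self._pool = queue.LifoQueue(maxsize=size)

    def acquire(self):
        try:
            return self._pool.get_nowait()  # reuse an idle connection
        except queue.Empty:
            return self._connect()          # or open a new one

    def release(self, conn):
        try:
            self._pool.put_nowait(conn)     # return it for reuse
        except queue.Full:
            conn.close()                    # pool full: discard

# Usage with a dummy connection factory:
class DummyConn:
    def close(self):
        pass

pool = ConnectionPool(DummyConn, size=2)
c = pool.acquire()
pool.release(c)
print(pool.acquire() is c)  # True: the same connection is reused
```

A production pool also needs health checks and thread safety around the connection objects themselves; the point here is only that releases feed subsequent acquires, so steady-state traffic creates no new connections.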
There are a large number of connections on the ElastiCache node
It's a best practice to track the CurrConnections and NewConnections CloudWatch metrics. These metrics monitor the number of TCP connections accepted by Redis. A large number of TCP connections might exhaust the 65,000 maxclients limit, which is the maximum number of concurrent connections per node. If you reach the 65,000 limit, you receive the ERR max number of clients reached error. If connections continue beyond the limit of the Linux server, or beyond the maximum number of tracked connections, then additional client connections result in connection timed out errors. For information on preventing a large number of connections, see Best practices: Redis clients and Amazon ElastiCache for Redis.
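A quick way to keep an eye on the limit is to compare CurrConnections samples against the 65,000 maxclients ceiling. The 80% warning ratio below is an illustrative choice, not an official threshold:

```python
MAXCLIENTS = 65000  # per-node concurrent connection limit

def connection_headroom(curr_connections, warn_ratio=0.8):
    """Return (remaining headroom, warning flag). The warning fires
    when CurrConnections crosses an illustrative fraction of the
    maxclients limit."""
    headroom = MAXCLIENTS - curr_connections
    warn = curr_connections > warn_ratio * MAXCLIENTS
    return headroom, warn

print(connection_headroom(10000))  # (55000, False)
print(connection_headroom(60000))  # (5000, True)
```

Alarming on the warning flag, rather than on the hard limit, leaves time to roll out connection pooling before clients start seeing errors.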
Client-side latency issues
Latency and timeouts might originate from the client itself. Verify the memory, CPU, and network utilization on the client side to determine whether any of these resources are hitting their limits. If the application runs on an EC2 instance, then use the same CloudWatch metrics discussed previously to check for bottlenecks. Latency can also occur in the operating system and go undetected by the default CloudWatch metrics. Consider installing a monitoring tool on the EC2 instance, such as atop or the CloudWatch agent.
If the timeout configuration values set up on the application side are too small, you might receive unnecessary timed out errors. Configure the client-side timeout appropriately to allow the server sufficient time to process the request and generate the response. For more information, see Best practices: Redis clients and Amazon ElastiCache for Redis.
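At the socket level, a client-side timeout setup typically looks like the following sketch. The timeout values are assumptions to adapt to your workload, and most Redis client libraries expose equivalent connect and read timeout settings rather than requiring raw sockets:

```python
import socket

# Illustrative client-side timeouts: the connect timeout bounds the
# TCP handshake, the read timeout bounds how long we wait for a
# response. Values here are assumptions, not recommendations.
CONNECT_TIMEOUT = 2.0  # seconds
READ_TIMEOUT = 1.0     # seconds; must give the server time to respond

def open_with_timeouts(host, port):
    sock = socket.create_connection((host, port),
                                    timeout=CONNECT_TIMEOUT)
    sock.settimeout(READ_TIMEOUT)  # applies to subsequent recv() calls
    return sock
```

Setting the read timeout too close to the command's normal latency produces the unnecessary timed out errors described above; leave headroom for occasional slow commands and network jitter.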
The timeout error received from your application can reveal additional details. These details include whether a specific node is involved, the name of the Redis data type that's causing timeouts, the exact timestamp when the timeout occurred, and so on. Use this information to find the pattern of the issue and to answer questions such as the following:
- Do timeouts happen frequently at a specific time of day?
- Did the timeout occur on one client or on multiple clients?
- Did the timeout occur on one Redis node or on multiple nodes?
- Did the timeout occur on one cluster or on multiple clusters?
Use these patterns to investigate the most likely client or ElastiCache node. You can also use your application logs and VPC Flow Logs to determine whether the latency happened on the client side, on the ElastiCache node, or in the network.
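One way to surface such patterns is to tally timeout events by client, node, and hour of day. The event fields and sample data below are hypothetical; adapt the parsing to whatever your application log actually records:

```python
from collections import Counter

def timeout_hotspots(events):
    """Count timeout events by client, node, and hour of day to
    surface patterns. `events` is a list of dicts with hypothetical
    fields: client, node, and timestamp as 'HH:MM'."""
    by_client = Counter(e["client"] for e in events)
    by_node = Counter(e["node"] for e in events)
    by_hour = Counter(e["timestamp"].split(":")[0] for e in events)
    return by_client, by_node, by_hour

# Hypothetical timeout events extracted from an application log:
sample = [
    {"client": "app-1", "node": "0001", "timestamp": "14:02"},
    {"client": "app-2", "node": "0001", "timestamp": "14:17"},
    {"client": "app-1", "node": "0001", "timestamp": "14:45"},
]
clients, nodes, hours = timeout_hotspots(sample)
print(nodes.most_common(1))  # node 0001 accounts for all timeouts
```

Here every timeout lands on node 0001 during the same hour across two clients, which points at the node or its network path rather than at a single misbehaving client.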
Synchronization of Redis
Synchronization of Redis is initiated during backup, node replacement, and scaling events. This is a compute-intensive workload that can cause latency. Use the SaveInProgress CloudWatch metric to determine whether synchronization is in progress. For more information, see How synchronization and backup are implemented.
ElastiCache cluster events
Check the Events section in the ElastiCache console for the time period when latency was observed. Look for background activities, such as node replacement or failover events caused by ElastiCache managed maintenance and service updates, or by unexpected hardware failures. You receive notification of scheduled events through the AWS Personal Health Dashboard (PHD) and email.
The following is a sample event log:
Finished recovery for cache nodes 0001
Recovering cache nodes 0001
Failover from master node <cluster_node> to replica node <cluster_node> completed