How can I troubleshoot high latency on DynamoDB Accelerator (DAX) clusters?
Last updated: 2022-12-14
My read or write requests in Amazon DynamoDB Accelerator (DAX) experience high latency. How do I troubleshoot this?
There are multiple reasons why your requests might experience high latency. Work through each of the following potential causes to troubleshoot the issue.
The cluster or node is experiencing high load
Latency often occurs when a DAX cluster, or one of its nodes, is under high load. The impact is greater if your client is configured with a single node URL instead of the cluster URL. In this case, if that node has any issue during a period of high load, then the client's requests experience latency or throttling.
Misconfiguration in the DAX client
If you lower the withMinIdleConnectionSize parameter, then latency across the DAX cluster is likely to increase. This parameter sets the minimum number of idling connections with the DAX cluster. For every request, the client will use an available idle connection. If a connection isn't available, then the client establishes a new one. For example, if the parameter is set to 20, then there is a minimum of 20 idle connections with the DAX cluster.
The client maintains a connection pool. When an application makes an API call to DynamoDB or DAX, the client leases a connection from the connection pool. Then, the client makes the API call and returns the connection to the pool. However, the connection pool has an upper limit. If you make a large number of API calls to DAX at once, then they might exceed the limit of the connection pool. In this case, some requests must wait for other requests to complete before obtaining leases from the connection pool. This results in requests queuing up at the connection pool level. As a result, the application experiences an increase in round-trip latency.
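The queuing effect described above can be illustrated with a small model. This is only a sketch, not DAX client code: the function name simulate_pool, the pool sizes, and the 2 ms service time are assumptions chosen for illustration. It assumes that every request arrives at the same moment and holds one connection for a fixed time.

```python
import heapq

def simulate_pool(num_requests, pool_size, service_ms):
    """Return the round-trip latency (ms) of each request when all requests
    arrive at time 0 and each one holds a connection for service_ms."""
    # Each heap entry is the time at which one pooled connection becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    latencies = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)    # wait for the earliest free connection
        finish = start + service_ms       # hold the connection for the API call
        heapq.heappush(free_at, finish)   # return the connection to the pool
        latencies.append(finish)          # arrival was at t=0, so latency == finish
    return latencies

# A burst of 50 calls against a pool of 10 connections: later calls queue.
print(max(simulate_pool(num_requests=50, pool_size=10, service_ms=2.0)))
# The same burst when the pool can serve every call at once.
print(max(simulate_pool(num_requests=50, pool_size=50, service_ms=2.0)))
```

In this model, the slowest request in the first burst waits behind four earlier batches, so its round-trip latency is several times the service time even though the server itself is no slower.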
Therefore, to reduce latency during periodic traffic spikes in your application, tune the minimum idle connection size. Use setMinIdleConnectionSize or withMinIdleConnectionSize to set the value, and getMinIdleConnectionSize to read the current value. This setting plays a key role in the latency of a DAX cluster. Configure it for your API call volume so that the client keeps an appropriate number of idle connections and doesn't need to establish new connections during a spike.
Missed items in the cache
If a read request misses an item, then DAX sends the request to DynamoDB. DynamoDB processes the requests using eventually consistent reads and then returns the items to DAX. DAX stores them in the item cache and then returns them to the application. Latency in the underlying DynamoDB table can cause latency in the request.
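The read path described above is a read-through cache. The following minimal sketch shows the idea. It isn't the DAX implementation: the class name ReadThroughCache and the plain dictionary that stands in for the DynamoDB table are hypothetical.

```python
class ReadThroughCache:
    """Toy read-through cache: a miss falls back to the backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stands in for the DynamoDB table
        self.items = {}                     # stands in for the DAX item cache
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.hits += 1
            return self.items[key]          # fast path: served from the cache
        self.misses += 1
        value = self.backing_store[key]     # slow path: pays the table's latency
        self.items[key] = value             # store the item for later reads
        return value

table = {"user#1": {"name": "Ana"}}
cache = ReadThroughCache(table)
cache.get("user#1")   # first read misses and goes to the table
cache.get("user#1")   # second read is served from the cache
print(cache.hits, cache.misses)
```

Only the first read pays the latency of the backing store, which is why a high miss rate makes the latency of the table visible to the application.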
Cache misses commonly happen for two reasons:
1. Strongly consistent reads: DAX doesn't cache the results of strongly consistent reads. These requests pass through DAX and are retrieved from the DynamoDB table itself, so each one results in a cache miss. You can use eventually consistent reads to solve this issue, but note that the first read of an item must still go to DynamoDB before the item can be cached.
2. Eviction policy in DAX: Queried data that's already evicted from the cache results in a miss. DAX uses three different mechanisms to determine cache evictions:
- DAX clusters use a Least Recently Used (LRU) algorithm to prioritize items. Items with the lowest priority are evicted when the cache is full.
- DAX uses a Time-to-Live (TTL) value for the period of time that items are available in the cache. After an item's TTL value is exceeded, the item is evicted.
Note: If you're using the default TTL value of five minutes, then check whether you're querying the data after its TTL has expired.
- DAX uses write-through functionality to evict older values as new values are written. This helps keep the DAX item cache consistent with the underlying data store, using a single API call.
To extend the TTL value of your items, see Configuring TTL settings.
Note: You can't modify a parameter group while it's in use in a running DAX instance.
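To make the interaction between the LRU and TTL eviction rules concrete, here is a minimal sketch of a cache that applies both. It's an illustration under stated assumptions, not how DAX is implemented: the class name LruTtlCache, the capacity of 2, and the injected clock function are all hypothetical. The 300-second TTL mirrors the five-minute default mentioned above.

```python
from collections import OrderedDict

class LruTtlCache:
    """Toy cache that evicts by least-recent use and by age."""

    def __init__(self, capacity, ttl_seconds, clock):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock            # callable that returns the current time in seconds
        self.entries = OrderedDict()  # key -> (value, expires_at), least recent first

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None               # miss: never cached or already evicted
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.entries[key]     # TTL exceeded: evict and report a miss
            return None
        self.entries.move_to_end(key) # mark as most recently used
        return value

    def put(self, key, value):
        # A write replaces any older value and restarts the TTL.
        self.entries[key] = (value, self.clock() + self.ttl)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # cache full: evict least recently used

now = [0]                              # fake clock so the example is deterministic
cache = LruTtlCache(capacity=2, ttl_seconds=300, clock=lambda: now[0])
cache.put("a", "A")
cache.put("b", "B")
cache.get("a")                         # "a" becomes the most recently used item
cache.put("c", "C")                    # cache is full, so "b" is evicted
print(cache.get("b"))                  # miss caused by LRU eviction
now[0] = 300
print(cache.get("a"))                  # miss caused by TTL expiry
```

Both misses in the example would be served by the underlying table in a real deployment, so either a small cache or a short TTL can raise request latency.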
Cache misses can also occur when maintenance patching is applied to a DAX cluster. Use a cluster with multiple nodes to reduce the impact of this downtime.
Latency might occur during the weekly maintenance window, especially if there are software upgrades, patches, or system changes to the cluster's nodes. In most cases, requests are handled successfully by other available nodes that aren't undergoing maintenance. However, a cluster that receives a high number of requests during heavy maintenance might experience request failures.
To reduce the chance of latency or failure, configure the maintenance window for your off-peak hours. Doing so allows the cluster to upgrade during a period of lighter request load.
Latency in the DynamoDB table
With write operations, data is first written to the DynamoDB table and then to the DAX cluster. The operation is successful only if the data is successfully written to both the table and to DAX. Latency in the underlying DynamoDB table can cause latency in the request. To reduce this latency, see How can I troubleshoot high latency on an Amazon DynamoDB table?
To further configure DynamoDB to your application's latency requirements, see Tuning AWS Java SDK HTTP request settings for latency-aware Amazon DynamoDB applications.
Request timeout period
The parameter setIdleConnectionTimeout determines the timeout period for idle connections, and setConnectTimeout determines the timeout period for connections with the DAX cluster. These two parameters deal with timeouts of the connection pools, which can affect the latency of your cluster.
Configure the request timeout for connections with the DAX cluster by adjusting the setRequestTimeout parameter. For more information, see setRequestTimeout in the DAX documentation.
It's also a best practice to use exponential backoff retries, which reduce request errors and also operational costs.
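Exponential backoff can be sketched as follows. This is a generic illustration and not part of any DAX SDK: the function name call_with_backoff, its default values, and the use of random jitter are assumptions. For brevity the sketch retries on any exception. A real application would retry only on errors that it knows to be retryable, such as throttling or timeouts.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying failures with exponentially growing,
    randomly jittered delays. Re-raises the error after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the error
            # The delay ceiling doubles with each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)             # jitter spreads out retries

# Hypothetical operation that fails twice before it succeeds.
attempts = {"count": 0}
def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated throttling")
    return "item"

print(call_with_backoff(flaky_read))
```

The sleep and rng parameters exist only so that the delays can be replaced in tests. In normal use the defaults apply.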
Note: DAX doesn't provide a cluster latency metric in Amazon CloudWatch metrics.