Tuning the AWS SDK for Java to Improve Resiliency
In this blog post, we will discuss why it’s important to protect your application from downstream service failures, offer advice for tuning configuration options in the SDK to fit the needs of your application, and introduce new configuration options that help you enforce stricter SLAs on service calls.
Service failures are inevitable. Even AWS services, which are highly available and fault-tolerant, can have periods of increased latency or error rates. When there are problems in one of your downstream dependencies, latency increases, retries start, and generally API calls take longer to complete, if they complete at all. This can tie up connections, preventing other threads from using them, congest your application’s thread pool, and hold onto valuable system resources for a call or connection that may ultimately be doomed. If the AWS SDK for Java is not tuned correctly, then a single service dependency (even one that may not be critical to your application) can end up browning out or taking down your entire application. We will discuss techniques you can use to safeguard your application and show you how to find data to tune the SDK with the right settings.
The metrics system in the AWS SDK for Java has several predefined metrics that give you insight into the performance of each of your AWS service dependencies. Metric data can be aggregated at the service level or per individual API action. There are several ways to enable the metrics system. In this post, we will take a programmatic approach on application startup.
To enable the metrics system, add the following lines to the startup code of your application.
AwsSdkMetrics.enableDefaultMetrics();
AwsSdkMetrics.setCredentialProvider(credentialsProvider);
AwsSdkMetrics.setMetricNameSpace("AdvancedConfigBlogPost");
Note: The metrics system is geared toward longer-lived applications. It uploads metric data to Amazon CloudWatch at one-minute intervals. If you are writing a simple program or test case to test-drive this feature, it may terminate before the metrics system has a chance to upload anything. If you aren’t seeing metrics in your test program, try adding a sleep interval of a couple of minutes before terminating to allow metrics to be sent to CloudWatch.
For more information about the features of the metrics system and other ways to enable it, see this blog post.
Interpreting Metrics to Tune the SDK
After you have enabled the metrics system, the metrics will appear in the CloudWatch console under the namespace you’ve defined (in the preceding example, AdvancedConfigBlogPost).
Let’s take a look at the metrics one by one to see how the data can help us tune the SDK.
HttpClientGetConnectionTime: Time, in milliseconds, for the underlying HTTP client library to get a connection.
- Typically, the time it takes to establish a connection won’t vary much within a service (that is, all APIs in the same service should have similar SLAs for establishing a connection). For this reason, it is valid to look at this metric aggregated across each AWS service.
- Use this metric to determine a reasonable value for the connection timeout setting in ClientConfiguration.
- The default value for this setting is 50 seconds, which is unreasonably high for most production applications, especially those hosted within AWS itself and making service calls to the same region. Connection latencies, on average, are on the order of milliseconds, not seconds.
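As a sketch, lowering the connection timeout looks like the following. The one-second value is illustrative, not a recommendation; derive a value from the HttpClientGetConnectionTime metric you observe in your own environment, and assume credentialsProvider is already configured.

```java
ClientConfiguration clientConfig = new ClientConfiguration();

// 1 second is an illustrative value; tune it using the
// HttpClientGetConnectionTime metric observed in your environment
clientConfig.setConnectionTimeout(1 * 1000);

AmazonDynamoDBClient ddb = new AmazonDynamoDBClient(credentialsProvider, clientConfig);
```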
HttpClientPoolAvailableCount: Number of idle persistent connections in the underlying HTTP client’s pool. This metric is collected from the pool’s PoolStats before a connection is obtained for a request.
- A high number of idle connections is typical of applications that perform a batch of work at intervals. For example, consider an application that uploads all files in a directory to Amazon S3 every five minutes. When the application is uploading files, it creates several connections to S3, then does nothing with the service for five minutes. The connections are left in the pool with nothing to do and will eventually become idle. If this is the case for your application, and there are constantly idle connections in the pool that aren’t serving a useful purpose, you can tune the connectionMaxIdleMillis setting and use the idle connection reaper (enabled by default) to more aggressively purge these connections from the pool.
- Setting the connectionMaxIdleMillis too low can result in having to establish connections more frequently, which can outweigh the benefits of freeing up system resources from idle connections. Take caution before acting on the data from this metric.
- If your application does have a bursty workload and you find that the cost of establishing connections is more damaging to performance than keeping idle connections, you can also increase the connectionMaxIdleMillis setting to allow the connections to persist between periods of work.
- Note: The connectionMaxIdleMillis setting is effectively capped by the Keep-Alive time specified by the service. For example, if you set connectionMaxIdleMillis to five minutes but the service only keeps connections alive for sixty seconds, the SDK will still discard connections after sixty seconds, because they are no longer usable.
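If you do decide to purge idle connections more aggressively, a minimal sketch looks like the following. The five-second value is illustrative (the SDK’s default is sixty seconds), and credentialsProvider is assumed to be configured already.

```java
ClientConfiguration clientConfig = new ClientConfiguration();

// Discard connections that have been idle for more than 5 seconds
// (illustrative; the default is 60 seconds). The idle connection
// reaper, enabled by default, enforces this limit in the background.
clientConfig.setConnectionMaxIdleMillis(5 * 1000);

AmazonS3Client s3 = new AmazonS3Client(credentialsProvider, clientConfig);
```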
HttpClientPoolPendingCount: Number of connection requests blocked while waiting for a free connection from the underlying HTTP client’s pool. This metric is collected from the pool’s PoolStats before a connection is obtained for a request.
- A high value for this metric can indicate a problem with your connection pool size or improper handling of service failures.
- If your usage of the client exceeds the ability of the default connection pool size to satisfy your requests, you can increase the pool size through the maxConnections setting in ClientConfiguration.
- Connection contention can also occur when a service is experiencing increased latency or error rates and the SDK is not tuned to handle it. Connections can quickly be tied up waiting for a response from a faulty server or waiting for retries per the configured retry policy. Increasing the connection pool size in this case might only make things worse by allowing the application to tie up more threads trying to communicate with a service under duress. If you suspect this may be the case, look at the other metrics to see how you can tune the SDK to handle situations like this in a more robust way.
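If the metric indicates that healthy traffic is simply outgrowing the pool, raising the pool size is a one-line change. A sketch follows; the default pool size is 50 connections, the value of 100 is illustrative, and credentialsProvider is assumed to be configured already.

```java
ClientConfiguration clientConfig = new ClientConfiguration();

// The default pool size is 50; raise it only when HttpClientPoolPendingCount
// shows contention under healthy (not failing) traffic
clientConfig.setMaxConnections(100);

AmazonDynamoDBClient ddb = new AmazonDynamoDBClient(credentialsProvider, clientConfig);
```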
HttpRequestTime: Number of milliseconds for a logical request/response round-trip to AWS. Captured on a per request-type level.
- This metric records the time it takes for a single HTTP request to a service. This metric can be recorded multiple times per operation, depending on the retry policy in use.
- We’ve recently added a new configuration setting that allows you to specify a timeout on each underlying HTTP request made by the client. Latency expectations can vary widely between APIs, or even between individual requests to the same API, so it’s important to use the provided metrics and choose this timeout carefully.
- This new setting can be specified per request or for the entire client (through ClientConfiguration). Although it can be hard to choose a single timeout that suits every API, it often makes sense to set a conservative default on the client and override it per request where needed.
- By default, this feature is disabled.
- Request timeouts are only supported in Java 7 and later.
Using the DynamoDB client as an example, let’s look at how you can use this new feature.
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setRequestTimeout(20 * 1000);
AmazonDynamoDBClient ddb = new AmazonDynamoDBClient(credentialsProvider, clientConfig);

// Will inherit the 20-second request timeout from the client-level setting
ddb.listTables();

// Request timeout overridden at the request level
ddb.listTables(new ListTablesRequest().withSdkRequestTimeout(10 * 1000));

// Turns off the request timeout for this request
ddb.listTables(new ListTablesRequest().withSdkRequestTimeout(-1));
ClientExecuteTime: Total number of milliseconds for a request/response including the time to execute the request handlers, the round-trip to AWS, and the time to execute the response handlers. Captured on a per request-type level.
- This metric includes any time spent executing retries per the configured retry policy in ClientConfiguration.
- We have just launched a new feature that allows you to specify a timeout on the entire execution time, which matches up very closely to the ClientExecuteTime metric.
Using the DynamoDB client as an example, let’s look at how you would enable a client execution timeout.
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setClientExecutionTimeout(20 * 1000);
AmazonDynamoDBClient ddb = new AmazonDynamoDBClient(credentialsProvider, clientConfig);

// Will inherit the 20-second client execution timeout from the client-level setting
ddb.listTables();

// Client execution timeout overridden at the request level
ddb.listTables(new ListTablesRequest().withSdkClientExecutionTimeout(10 * 1000));

// Turns off the client execution timeout for this request
ddb.listTables(new ListTablesRequest().withSdkClientExecutionTimeout(-1));
The new settings for request timeouts and client execution timeouts are complementary. Using them together is especially useful because you can use client execution timeouts to set harder limits on the API’s total SLA and use request timeouts to prevent one bad request from consuming too much of your total time to execute.
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setClientExecutionTimeout(20 * 1000);
clientConfig.setRequestTimeout(5 * 1000);

// Allow as many retries as possible until the client execution timeout expires
clientConfig.setMaxErrorRetry(Integer.MAX_VALUE);

AmazonDynamoDBClient ddb = new AmazonDynamoDBClient(credentialsProvider, clientConfig);

// Will inherit timeout settings from the client configuration. Each HTTP request
// is allowed 5 seconds to complete, and the SDK will retry as many times as
// possible (per the retry condition in the retry policy) within the 20-second
// client execution timeout
ddb.listTables();
Configuring the SDK with aggressive timeouts and appropriately sized connection pools goes a long way toward protecting your application from downstream service failures, but it’s not the whole story. There are many techniques you can apply on top of the SDK to limit the negative effects of a dependency’s outage on your application. Hystrix is an open source library specifically designed to make fault tolerance easier and your application even more resilient. To use Hystrix to its fullest potential, you’ll need data to tune it to match the actual service SLAs in your environment. The metrics we discussed in this blog post can give you that information. Hystrix also has an embedded metrics system that can complement the SDK’s metrics.
We would love your feedback on the configuration options and metrics provided by the SDK and what you would like to see in the future. Do we provide enough settings and hooks to allow you to tune your application for optimal performance? Do we provide too many settings and make configuring the SDK overwhelming? Does the SDK provide you with enough information to intelligently handle service failures?