AWS Compute Blog
Dynamic Scaling with EC2 Spot Fleet
Tipu Qureshi, AWS Senior Cloud Support Engineer
The RequestSpotFleet API allows you to launch and manage an entire fleet of EC2 Spot Instances with one request. A fleet is a collection of Spot Instances that work together as part of a distributed application while providing cost savings. With the ModifySpotFleetRequest API, you can dynamically scale a Spot fleet’s target capacity as your capacity requirements change over time. As an example, consider a batch processing application that uses a Spot fleet and Amazon SQS. As discussed in our previous blog post on Additional CloudWatch Metrics for Amazon SQS and Amazon SNS, you can scale up when the ApproximateNumberOfMessagesVisible SQS metric for one of your queues grows too large, and scale down once it returns to a normal level.
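The scale-up/scale-down rule can be expressed as a small Node.js function. This is a minimal sketch and is not code from the sample script. Two parts of it are assumptions: the function name, and the use of two separate thresholds. The higher threshold triggers scaling up and the lower one triggers scaling down, so the fleet does not flap when the queue depth hovers near a single value.

```javascript
// Hypothetical helper: decide which way to scale based on the current
// ApproximateNumberOfMessagesVisible value for one SQS queue.
// Using two thresholds (an assumption) leaves a "normal" band between them
// in which the fleet is left alone, which avoids flapping.
function scalingDecision(visibleMessages, scaleUpThreshold, scaleDownThreshold) {
  if (scaleDownThreshold >= scaleUpThreshold) {
    throw new RangeError("scaleDownThreshold must be lower than scaleUpThreshold");
  }
  if (visibleMessages > scaleUpThreshold) {
    return "up"; // backlog has grown too large
  }
  if (visibleMessages < scaleDownThreshold) {
    return "down"; // backlog is back below the normal level
  }
  return "none"; // inside the normal band: no change
}
```

For instance, `scalingDecision(500, 100, 10)` returns `"up"`. A depth of 50 falls inside the band and returns `"none"`.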
There are multiple ways to accomplish this dynamic scaling. For example, a script can be scheduled (e.g., via cron) to periodically retrieve the value of the ApproximateNumberOfMessagesVisible SQS metric and then scale the Spot fleet according to defined thresholds. The script can obtain the current size of the Spot fleet with the DescribeSpotFleetRequests API and carry out the scaling with the new ModifySpotFleetRequest API. A sample script written for Node.js is available here. The following is a sample IAM policy for an IAM role that an EC2 instance could use to run the script:
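The role needs permission for every call the script makes. Assuming the script reads the queue metric through CloudWatch's GetMetricStatistics API, a policy along the lines of the following sketch would cover those calls. It is written as a JavaScript object so it can be serialized with `JSON.stringify`. The statement IDs are hypothetical, and `"Resource": "*"` is a permissive assumption you may want to narrow.

```javascript
// Assumed minimal policy: allows only the API calls the scaling script makes.
// "Resource": "*" is a permissive assumption; narrow it where possible.
const spotFleetScalingPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "ReadQueueMetric", // hypothetical statement ID
      Effect: "Allow",
      Action: ["cloudwatch:GetMetricStatistics"],
      Resource: "*",
    },
    {
      Sid: "ScaleSpotFleet", // hypothetical statement ID
      Effect: "Allow",
      Action: ["ec2:DescribeSpotFleetRequests", "ec2:ModifySpotFleetRequest"],
      Resource: "*",
    },
  ],
};

// Print the JSON text to paste into the IAM console or pass to the CLI.
console.log(JSON.stringify(spotFleetScalingPolicy, null, 2));
```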
Because the EC2 instance has this IAM role attached, the script can call the AWS APIs described above to scale the Spot fleet dynamically. You can configure variables such as the Spot fleet request, SQS queue name, SQS metric thresholds, and instance thresholds according to your application’s needs. In the example configuration below, we have set the minimum instance count threshold (minCount) to 2 so that the Spot fleet never drops below 2 instances. This ensures that a new job is processed immediately, even after an extended period with no batch jobs.
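The following sketch shows what such a configuration might look like, together with a function that turns a metric reading into a new target capacity. Only minCount = 2 comes from the post. The other field names, the placeholder values, and the fixed step size are hypothetical and are not taken from the sample script.

```javascript
// Hypothetical configuration; only minCount = 2 comes from the post.
const config = {
  spotFleetRequestId: "sfr-EXAMPLE", // placeholder Spot fleet request ID
  queueName: "batch-jobs",           // placeholder SQS queue name
  scaleUpThreshold: 100,  // visible messages above which capacity is added
  scaleDownThreshold: 10, // visible messages below which capacity is removed
  step: 2,                // instances added or removed per run (assumption)
  minCount: 2,            // never go below 2 instances
  maxCount: 20,           // upper bound on fleet size (assumption)
};

// Compute the next TargetCapacity from the current capacity and the metric
// value, then clamp it to the range [minCount, maxCount].
function nextTargetCapacity(current, visibleMessages, cfg) {
  let target = current;
  if (visibleMessages > cfg.scaleUpThreshold) {
    target = current + cfg.step;
  } else if (visibleMessages < cfg.scaleDownThreshold) {
    target = current - cfg.step;
  }
  return Math.min(cfg.maxCount, Math.max(cfg.minCount, target));
}
```

With these values, a fleet at capacity 4 facing 500 visible messages moves to 6. An idle fleet at capacity 2 stays at 2 because of the minCount floor.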
You can modify this sample script to meet your application’s requirements.
You could also use AWS Lambda to scale your Spot fleet dynamically. As depicted in the diagram below, an AWS Lambda function can be scheduled (e.g., using AWS Data Pipeline, cron, or any other form of scheduling) to get the ApproximateNumberOfMessagesVisible SQS metric for the SQS queue in a batch processing application. The function checks the current size of the Spot fleet using the DescribeSpotFleetRequests API. It then checks constraints such as the fleet’s state and size, similar to the script discussed above, and scales the fleet using the ModifySpotFleetRequest API.
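The Lambda flow can be sketched as a handler factory. To keep the sketch runnable without AWS credentials, the three data-access functions (`getVisibleMessages`, `describeFleet`, `modifyFleet`) are hypothetical injected wrappers. In a real function, they would call CloudWatch GetMetricStatistics, EC2 DescribeSpotFleetRequests, and EC2 ModifySpotFleetRequest through the AWS SDK. The response fields read here follow the DescribeSpotFleetRequests response shape. The thresholds and step size in `cfg` are assumptions.

```javascript
// Minimal sketch of the Lambda flow with injected (hypothetical) wrappers:
//   getVisibleMessages(queueName) -> number   (CloudWatch metric value)
//   describeFleet(id) -> DescribeSpotFleetRequests-shaped response
//   modifyFleet(params) -> ModifySpotFleetRequest call
function makeHandler({ getVisibleMessages, describeFleet, modifyFleet, cfg }) {
  return async function handler() {
    const visible = await getVisibleMessages(cfg.queueName);

    const described = await describeFleet(cfg.spotFleetRequestId);
    const fleet = (described.SpotFleetRequestConfigs || [])[0];
    if (!fleet) {
      throw new Error(`Spot fleet request ${cfg.spotFleetRequestId} not found`);
    }

    // Constraint check: only modify a fleet that is active
    // (for example, skip while a previous change is still "modifying").
    if (fleet.SpotFleetRequestState !== "active") {
      return { changed: false, reason: `fleet is ${fleet.SpotFleetRequestState}` };
    }

    // Size check: compute the new target, clamped to [minCount, maxCount].
    const current = fleet.SpotFleetRequestConfig.TargetCapacity;
    let target = current;
    if (visible > cfg.scaleUpThreshold) target = current + cfg.step;
    else if (visible < cfg.scaleDownThreshold) target = current - cfg.step;
    target = Math.min(cfg.maxCount, Math.max(cfg.minCount, target));

    if (target === current) {
      return { changed: false, reason: "no change needed" };
    }
    await modifyFleet({ SpotFleetRequestId: cfg.spotFleetRequestId, TargetCapacity: target });
    return { changed: true, from: current, to: target };
  };
}
```

In a deployment, you would export something like `exports.handler = makeHandler({ ... })` with SDK-backed wrappers. Note that ModifySpotFleetRequest can only modify Spot fleet requests of type `maintain`.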
You can also use the sample IAM policy provided above to create an IAM role for the AWS Lambda function. A sample Lambda deployment package for dynamically scaling a Spot fleet based on the value of the ApproximateNumberOfMessagesVisible SQS metric can be found here. You can modify it to use any other CloudWatch metric that suits your use case. The sample script and Lambda function are provided for reference only and should be tested before being used in a production environment.
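To scale on a different CloudWatch metric, it helps to build the GetMetricStatistics input from parameters instead of hard-coding the SQS metric. The following sketch uses hypothetical helper names. Its defaults target the AWS/SQS ApproximateNumberOfMessagesVisible metric. The five-minute period and the two-period look-back window are assumptions.

```javascript
// Hypothetical helper: build GetMetricStatistics input for any metric.
// Defaults target the SQS queue-depth metric; override them to use another.
function metricQuery({
  namespace = "AWS/SQS",
  metricName = "ApproximateNumberOfMessagesVisible",
  dimensions,               // e.g. [{ Name: "QueueName", Value: "batch-jobs" }]
  periodSeconds = 300,      // assumed 5-minute period
  statistic = "Average",
  now = new Date(),
} = {}) {
  return {
    Namespace: namespace,
    MetricName: metricName,
    Dimensions: dimensions,
    // Look back two periods to make a complete datapoint more likely
    // (an assumption about metric publishing delay).
    StartTime: new Date(now.getTime() - 2 * periodSeconds * 1000),
    EndTime: now,
    Period: periodSeconds,
    Statistics: [statistic],
  };
}

// Return the requested statistic from the newest datapoint, or undefined.
function latestValue(datapoints, statistic = "Average") {
  if (!datapoints || datapoints.length === 0) return undefined;
  const newest = datapoints.reduce((a, b) =>
    new Date(a.Timestamp) > new Date(b.Timestamp) ? a : b);
  return newest[statistic];
}
```

For example, `metricQuery({ namespace: "AWS/EC2", metricName: "CPUUtilization", dimensions: [{ Name: "InstanceId", Value: "i-EXAMPLE" }] })` would query an EC2 metric instead. `latestValue(response.Datapoints)` extracts the newest reading from the response.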