AWS Compute Blog

Dynamic Scaling with EC2 Spot Fleet

Tipu Qureshi, AWS Senior Cloud Support Engineer

The RequestSpotFleet API allows you to launch and manage an entire fleet of EC2 Spot Instances with one request. A fleet is a collection of Spot Instances that are all working together as part of a distributed application and providing cost savings. With the ModifySpotFleetRequest API, it’s possible to dynamically scale a Spot fleet’s target capacity according to changing capacity requirements over time. Let’s look at a batch processing application that is utilizing Spot fleet and Amazon SQS as an example. As discussed in our previous blog post on Additional CloudWatch Metrics for Amazon SQS and Amazon SNS, you can scale up when the ApproximateNumberOfMessagesVisible SQS metric starts to grow too large for one of your SQS queues, and scale down once it returns to a more normal value.

There are multiple ways to accomplish this dynamic scaling. For example, a script can be scheduled (e.g., via cron) to periodically read the ApproximateNumberOfMessagesVisible SQS metric and then scale the Spot fleet according to defined thresholds. The current size of the Spot fleet can be obtained using the DescribeSpotFleetRequests API, and the scaling can be carried out using the new ModifySpotFleetRequest API. A sample script written for NodeJS is available here, and the following is a sample IAM policy for an IAM role that could be used on an EC2 instance running the script:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1441252157702",
      "Action": [
        "ec2:DescribeSpotFleetRequests",
        "ec2:ModifySpotFleetRequest",
        "cloudwatch:GetMetricStatistics"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

By leveraging the IAM role on an EC2 instance, the script uses the AWS API methods described above to scale the Spot fleet dynamically. You can configure variables such as the Spot fleet request, SQS queue name, SQS metric thresholds, and instance thresholds according to your application’s needs. In the example configuration below, the minimum instance threshold (minCount) is set to 2 so that the Spot fleet never shrinks below two instances. This ensures that a new job is still processed immediately after an extended period with no batch jobs.

// Sample script for Dynamically scaling Spot Fleet
// define configuration
var config = {
    spotFleetRequest:'sfr-c8205d41-254b-4fa9-9843-be06585e5cda', //Spot Fleet Request Id
    queueName:'demojobqueue', //SQS queuename
    maxCount:100, //maximum number of instances
    minCount:2, //minimum number of instances
    stepCount:5, //increment of instances
    scaleUpThreshold:20, //CW metric threshold at which to scale up
    scaleDownThreshold:10, //CW metric threshold at which to scale down
    period:900, //period in seconds for CW
    region:'us-east-1' //AWS region
};

// dependencies
var AWS = require('aws-sdk');
var ec2 = new AWS.EC2({region: config.region, maxRetries: 5});
var cloudwatch = new AWS.CloudWatch({region: config.region, maxRetries: 5});

console.log ('Loading function');
main();

//main function
function main() {
    var now = new Date();
    var startTime = new Date(now - (config.period * 1000));
    console.log ('Timestamp: '+now);
    var cloudWatchParams = {
        StartTime: startTime,
        EndTime: now,
        MetricName: 'ApproximateNumberOfMessagesVisible',
        Namespace: 'AWS/SQS',
        Period: config.period,
        Statistics: ['Average'],
        Dimensions: [
            {
                Name: 'QueueName',
                Value: config.queueName,
            },
        ],
        Unit: 'Count'
    };
    cloudwatch.getMetricStatistics(cloudWatchParams, function(err, data) {
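        if (err) console.log(err, err.stack); // an error occurred
        else if (!data.Datapoints || data.Datapoints.length === 0)
            console.log ('no datapoints returned for the metric'); // nothing to evaluate
        else {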
            //set Metric Variable
            var metricValue = data.Datapoints[0].Average;
            console.log ('Cloudwatch Metric Value is: '+ metricValue);
            var up = 1;
            var down = -1;
            // check if scaling is required
            if (metricValue > config.scaleDownThreshold && metricValue < config.scaleUpThreshold)
                console.log ("metric not breached for scaling action");
            else if (metricValue >= config.scaleUpThreshold)
                scale(up); //scaleup
            else
                scale(down); //scaledown
        }
    });
};

//defining scaling function
function scale (direction) {
    //adjust stepCount depending upon whether we are scaling up or down
    config.stepCount = Math.abs(config.stepCount) * direction;
    //describe Spot Fleet Request Capacity
    console.log ('attempting to adjust capacity by: '+ config.stepCount);
    var describeParams = {
        DryRun: false,
        SpotFleetRequestIds: [
            config.spotFleetRequest
        ]
    };
    //get current fleet capacity
    ec2.describeSpotFleetRequests(describeParams, function(err, data) {
        if (err) {
            console.log('Unable to describeSpotFleetRequests: ' + err); // an error occurred
            return 'Unable to describeSpotFleetRequests';
        }
        //set current capacity variable
        var currentCapacity = data.SpotFleetRequestConfigs[0].SpotFleetRequestConfig.TargetCapacity;
        console.log ('current capacity is: ' + currentCapacity);
        //set desired capacity variable
        var desiredCapacity = currentCapacity + config.stepCount;
        console.log ('desired capacity is: '+ desiredCapacity);
        //find out if the spot fleet is already modifying
        var fleetModifyState = data.SpotFleetRequestConfigs[0].SpotFleetRequestState;
        console.log ('current state of the spot fleet is: ' + fleetModifyState);
        //only proceed forward if  maxCount or minCount hasn't been reached
        //or spot fleet isn't being currently modified.
        if (fleetModifyState == 'modifying')
            console.log ('fleet is currently being modified');
        else if (desiredCapacity < config.minCount)
            console.log ('capacity already at min count');
        else if (desiredCapacity > config.maxCount)
            console.log ('capacity already at max count');
        else {
            console.log ('scaling');
            var modifyParams = {
                SpotFleetRequestId: config.spotFleetRequest,
                TargetCapacity: desiredCapacity
            };
            ec2.modifySpotFleetRequest(modifyParams, function(err, data) {
                if (err) {
                    console.log('unable to modify spot fleet due to: ' + err);
                }
                else {
                    console.log('successfully modified capacity to: ' + desiredCapacity);
                    return 'success';
                }
            });
        }
    });
}
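
One behavior of the script above is worth noting: when a full step would move the target capacity outside the range from minCount to maxCount, the script skips the change entirely, so a fleet at 98 instances with a stepCount of 5 would never reach a maxCount of 100. The following is a minimal sketch of an alternative that clamps the result to the configured limits instead. The helper name `clampedCapacity` is hypothetical and is not part of the original sample; it assumes the same `minCount`, `maxCount`, and `stepCount` fields as the configuration above.

```javascript
// Hypothetical helper (not part of the original sample): compute the next
// target capacity and clamp it to the range [minCount, maxCount].
// direction is 1 to scale up and -1 to scale down, as in the script above.
function clampedCapacity(currentCapacity, direction, config) {
    var desired = currentCapacity + direction * Math.abs(config.stepCount);
    return Math.min(config.maxCount, Math.max(config.minCount, desired));
}

// Illustrative limits matching the sample configuration.
var limits = { minCount: 2, maxCount: 100, stepCount: 5 };
console.log(clampedCapacity(98, 1, limits));  // 100 (clamped to maxCount)
console.log(clampedCapacity(5, -1, limits));  // 2 (clamped to minCount)
console.log(clampedCapacity(50, 1, limits));  // 55 (full step)
```

With this approach, the caller would likely want to skip the ModifySpotFleetRequest call when the returned value equals the current capacity.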

You can modify this sample script to meet your application’s requirements.
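
One such modification concerns how the metric value is chosen. The script reads the first element of `Datapoints`, but CloudWatch does not return datapoints in chronological order, and a quiet window may contain no datapoints at all. The sketch below picks the most recent datapoint by its `Timestamp` and returns `null` for an empty result. The helper name `latestAverage` is hypothetical, and the sample data is made up for illustration.

```javascript
// Hypothetical helper (not part of the original sample): return the Average
// of the most recent datapoint, or null when there are no datapoints.
function latestAverage(datapoints) {
    if (!datapoints || datapoints.length === 0) {
        return null;
    }
    // Keep whichever datapoint has the later Timestamp.
    var latest = datapoints.reduce(function (a, b) {
        return new Date(a.Timestamp) >= new Date(b.Timestamp) ? a : b;
    });
    return latest.Average;
}

// Made-up datapoints shaped like the Datapoints array of a
// getMetricStatistics response, deliberately out of order.
var sample = [
    { Timestamp: '2015-09-01T10:15:00Z', Average: 12 },
    { Timestamp: '2015-09-01T10:30:00Z', Average: 27 },
    { Timestamp: '2015-09-01T10:00:00Z', Average: 4 }
];
console.log(latestAverage(sample)); // 27
console.log(latestAverage([]));     // null
```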

You could also leverage AWS Lambda to dynamically scale your Spot fleet. As depicted in the diagram below, an AWS Lambda function can be scheduled (e.g., using AWS Data Pipeline, cron, or any other form of scheduling) to get the ApproximateNumberOfMessagesVisible SQS metric for the SQS queue in a batch processing application. The Lambda function checks the current size of the Spot fleet using the DescribeSpotFleetRequests API and then scales the fleet using the ModifySpotFleetRequest API, after first checking constraints such as the state and size of the fleet, as the script discussed above does.

Figure: Dynamic Spot Fleet Scaling Architecture

You could also use the sample IAM policy provided above to create an IAM role for the AWS Lambda function. A sample Lambda deployment package for dynamically scaling a Spot fleet based on the value of the ApproximateNumberOfMessagesVisible SQS metric can be found here. You could modify it to use any CloudWatch metric that suits your use case. The sample script and Lambda function are provided for reference only and should be tested before being used in a production environment.
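
To make the Lambda flow described above concrete, here is a minimal sketch of how such a handler could be structured. It is not the contents of the sample deployment package. The factory `makeHandler` and the three injected functions (`getQueueDepth`, `getFleet`, `setCapacity`) are hypothetical names; in a real function they would wrap the GetMetricStatistics, DescribeSpotFleetRequests, and ModifySpotFleetRequest calls. Injecting them lets the logic be exercised locally with stubs, as shown at the end.

```javascript
// Sketch of a Lambda-style handler. The dependencies are hypothetical
// wrappers you would implement with the AWS SDK:
//   getQueueDepth() -> Promise<number|null>       (GetMetricStatistics)
//   getFleet()      -> Promise<{capacity, state}> (DescribeSpotFleetRequests)
//   setCapacity(n)  -> Promise                    (ModifySpotFleetRequest)
function makeHandler(config, deps) {
    return async function handler(event) {
        var depth = await deps.getQueueDepth();
        if (depth === null) {
            return 'no datapoints';
        }
        // Metric is between the two thresholds: leave the fleet alone.
        if (depth > config.scaleDownThreshold && depth < config.scaleUpThreshold) {
            return 'no action';
        }
        var step = depth >= config.scaleUpThreshold ? config.stepCount : -config.stepCount;
        var fleet = await deps.getFleet();
        var desired = fleet.capacity + step;
        // Same constraints as the script: fleet state and min/max size.
        if (fleet.state === 'modifying' || desired < config.minCount || desired > config.maxCount) {
            return 'skipped';
        }
        await deps.setCapacity(desired);
        return 'scaled to ' + desired;
    };
}

// Local dry run with stubbed dependencies (no AWS calls are made).
var calls = [];
var handler = makeHandler(
    { minCount: 2, maxCount: 100, stepCount: 5, scaleUpThreshold: 20, scaleDownThreshold: 10 },
    {
        getQueueDepth: function () { return Promise.resolve(42); },
        getFleet: function () { return Promise.resolve({ capacity: 10, state: 'active' }); },
        setCapacity: function (n) { calls.push(n); return Promise.resolve(); }
    }
);
handler({}).then(function (result) { console.log(result); });
```

In a deployed function you would export the result, for example `exports.handler = makeHandler(config, realDeps)`, on a Node.js runtime that supports async handlers.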