AWS Compute Blog

Cost-effective Batch Processing with Amazon EC2 Spot

Tipu Qureshi, AWS Senior Cloud Support Engineer

With Spot Instances, you can save up to 90% of costs by bidding on spare Amazon Elastic Compute Cloud (Amazon EC2) capacity. This reference architecture can help you realize those cost savings for batch processing applications while maintaining high availability. We recommend tailoring and testing it for your application before implementing it in a production environment.

Below is a multi-part job processing architecture that can be used to deploy a heterogeneous, scalable “grid” of worker nodes that can quickly crunch through large batch processing tasks in parallel. There are numerous batch-oriented applications in place today that can leverage this style of on-demand processing, including claims processing, large-scale data transformation, media processing, and multi-part data processing work.

Spot Batch Architecture Diagram

Raw job data is uploaded to Amazon Simple Storage Service (Amazon S3), which is a highly available and persistent data store. An AWS Lambda function will be invoked by S3 every time a new object is uploaded to the input S3 bucket. AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information. Information about Lambda functions is available here and a walkthrough on triggering Lambda functions on S3 object uploads is available here.
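
A minimal sketch of this wiring with the AWS CLI is shown below; the bucket name, function name, ARNs, and account ID are hypothetical placeholders that you would replace with your own.

# Allow S3 to invoke the Lambda function (names and ARNs are placeholders).
aws lambda add-permission \
  --function-name demojobfunction \
  --statement-id s3-invoke-demojobfunction \
  --action "lambda:InvokeFunction" \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::demo-input-bucket

# Invoke the Lambda function whenever an object is created in the input bucket.
aws s3api put-bucket-notification-configuration \
  --bucket demo-input-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:demojobfunction",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'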

AWS Lambda automatically runs your code with an IAM role that you select, making it easy to access other AWS resources, such as Amazon S3, Amazon SQS, and Amazon DynamoDB. AWS Lambda can be used to place a job message into an Amazon SQS queue. Amazon Simple Queue Service (SQS) is a fast, reliable, scalable, fully managed message queuing service that makes it simple and cost-effective to decouple the components of a cloud application. Depending on the application’s needs, multiple SQS queues might be required for different functions and priorities.
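
For example, a standard job queue and a separate higher-priority queue could be created with the AWS CLI; the priority queue name and the visibility timeout value below are hypothetical, and the timeout should exceed the longest expected job part duration.

# Main job queue (name matches the sample policy and function below); the
# visibility timeout is a placeholder sized to the longest expected task.
aws sqs create-queue --queue-name demojobqueue \
  --attributes '{"VisibilityTimeout": "900"}'

# Hypothetical additional queue for higher-priority job parts.
aws sqs create-queue --queue-name demojobqueue-priority \
  --attributes '{"VisibilityTimeout": "900"}'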

The AWS Lambda function will also store state information for each job task in Amazon DynamoDB. DynamoDB is a regional service, meaning that data is automatically replicated across Availability Zones. AWS Lambda can be used to trigger other types of workflows as well, such as an Amazon Elastic Transcoder job. EC2 Spot can also be used with Amazon Elastic MapReduce (Amazon EMR).
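
As a hedged sketch, the job state table used by the sample Lambda function below (hash key srcKey, range key datetime) could be created with the AWS CLI as follows; the provisioned throughput values are examples only.

# Create the job state table used by the sample Lambda function below.
# Adjust the read/write capacity to match your workload.
aws dynamodb create-table \
  --table-name demojobtable \
  --attribute-definitions AttributeName=srcKey,AttributeType=S AttributeName=datetime,AttributeType=S \
  --key-schema AttributeName=srcKey,KeyType=HASH AttributeName=datetime,KeyType=RANGE \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
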
Below is a sample IAM policy that can be attached to an IAM role for AWS Lambda. You will need to change the ARNs to match your resources.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1438283855455",
      "Action": [
        "dynamodb:PutItem"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:us-east-1::table/demojobtable"
    },
    {
      "Sid": "Stmt1438283929844",
      "Action": [
        "sqs:SendMessage"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sqs:us-east-1::demojobqueue"
    }
  ]
}

Below is a sample Lambda function that will send an SQS message and put an item into a DynamoDB table in response to an S3 object upload. You will need to change the SQS queue URL and DynamoDB table name to match your resources.

// create an IAM Lambda role with access to Amazon SQS queue and DynamoDB table
// configure S3 to publish events as shown here: http://docs.aws.amazon.com/lambda/latest/dg/walkthrough-s3-events-adminuser-configure-s3.html

// dependencies
var AWS = require('aws-sdk');

// get reference to clients
var s3 = new AWS.S3();
var sqs = new AWS.SQS();
var dynamodb = new AWS.DynamoDB();

console.log ('Loading function');

exports.handler = function(event, context) {
  // Read options from the event.
  var srcBucket = event.Records[0].s3.bucket.name;
  // Object key may have spaces or unicode non-ASCII characters.
  var srcKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
  // prepare SQS message
  var params = {
    MessageBody: 'object '+ srcKey + ' ',
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com//demojobqueue',
    DelaySeconds: 0
  };
  //send SQS message
  sqs.sendMessage(params, function (err, data) {
    if (err) {
      console.error('Unable to put object ' + srcKey + ' into SQS queue due to an error: ' + err);
      context.fail(srcKey, 'Unable to send message to SQS');
    } // an error occurred
    else {
      //define DynamoDB table variables 
      var tableName = "demojobtable";
      var datetime = new Date().getTime().toString();
      //Put item into DynamoDB table where srcKey is the hash key and datetime is the range key
      dynamodb.putItem({
        "TableName": tableName,
        "Item": {
          "srcKey": {"S": srcKey },
          "datetime": {"S": datetime },
        }
      }, function(err, data) {
        if (err) {
          console.error('Unable to put object ' + srcKey + ' into DynamoDB table due to an error: ' + err);
          context.fail(srcKey, 'Unable to put data to DynamoDB Table');
        }
        else {
          console.log('Successfully put object ' + srcKey + ' into SQS queue and DynamoDB');
          context.succeed(srcKey, 'Data put into SQS and DynamoDB');
        }
      });
    }
  });
};

Worker nodes are Amazon EC2 Spot and On-demand instances deployed in Auto Scaling groups. These groups are containers that ensure the health and scalability of the worker nodes. Worker nodes pick up job parts from the input queue automatically and perform single tasks based on the job task state in DynamoDB. Worker nodes will store the input objects in a file system such as Amazon Elastic File System (Amazon EFS) for processing. Amazon EFS is a file storage service for EC2 that provides elastic capacity to your applications, automatically adding storage as you add files. Depending on IO and application needs, the job data can also be stored on local instance store or Amazon Elastic Block Store (EBS) volumes. Each job can be further split into multiple sub-parts if there is a mechanism to stitch the outputs together (as in the case of some media processing, where pre-segmenting the output may be possible). Once completed, the objects will be uploaded back to S3 using multi-part upload.
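
A minimal sketch of a worker's polling loop using the AWS CLI is shown below; the queue URL, bucket names, EFS mount path, and process_job.sh command are hypothetical placeholders, and jq is assumed to be installed for JSON parsing.

#!/bin/bash
# Hypothetical queue URL, buckets, and processing command; replace with your own.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/demojobqueue"
while true
do
  # Long-poll for a single job message.
  MSG=$(aws sqs receive-message --queue-url "$QUEUE_URL" \
        --max-number-of-messages 1 --wait-time-seconds 20)
  RECEIPT=$(echo "$MSG" | jq -r '.Messages[0].ReceiptHandle // empty')
  # The sample Lambda function sends messages of the form "object <srcKey> ".
  KEY=$(echo "$MSG" | jq -r '.Messages[0].Body // empty' | sed -e 's/^object //' -e 's/ $//')
  if [ -n "$RECEIPT" ] && [ -n "$KEY" ]; then
    # Download the raw job data, process it, and upload the result back to S3
    # (aws s3 cp uses multi-part upload automatically for large files).
    LOCAL="/mnt/efs/$(basename "$KEY")"
    aws s3 cp "s3://demo-input-bucket/$KEY" "$LOCAL"
    ./process_job.sh "$LOCAL" "$LOCAL.out"
    aws s3 cp "$LOCAL.out" "s3://demo-output-bucket/$KEY.out"
    # Remove the message only after the job part has completed successfully.
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
  fi
done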

Similar to our blog on an EC2 Spot architecture for web applications, Auto Scaling groups running On-demand instances can be used together with groups running Spot instances. Spot Auto Scaling groups can use different Spot bid prices and even different instance types to give you more flexibility and to meet changing traffic demands.
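
As a hedged example, one such Spot Auto Scaling group with its own bid price could be created as follows; the AMI ID, instance profile, subnets, group sizes, and bid price are placeholders.

# Launch configuration that bids $0.05 per hour for c4.large Spot instances
# (AMI, instance profile, and bid price are hypothetical examples).
aws autoscaling create-launch-configuration \
  --launch-configuration-name demojob-spot-c4large-lc \
  --image-id ami-12345678 \
  --instance-type c4.large \
  --spot-price "0.05" \
  --iam-instance-profile demojob-worker-profile

# Spot worker Auto Scaling group spread across two subnets/Availability Zones.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name demojob-spot-c4large-asg \
  --launch-configuration-name demojob-spot-c4large-lc \
  --min-size 0 --max-size 20 --desired-capacity 2 \
  --vpc-zone-identifier "subnet-11111111,subnet-22222222"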

The availability of Spot instances can vary depending on how many unused Amazon EC2 instances are available. Because real-time supply and demand dictates the available supply of Spot instances, you should architect your application to be resilient to instance termination. When the Spot price exceeds the price you named (i.e. the bid price), the instance will receive a warning that it will be terminated after two minutes. You can manage this event by creating IAM roles with the relevant SQS and DynamoDB permissions for the Spot instances to run the required shutdown scripts. Details about Creating an IAM Role Using the AWS CLI can be found here and Spot Instance Termination Notices documentation can be found here.

A script like the following can be run on startup (e.g., via systemd or rc.local) to detect Spot instance termination. It can then update the job task state in DynamoDB and re-insert the job task into the queue if required. We recommend that applications poll for the termination notice at five-second intervals.

#!/bin/bash
while true
do
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then
    # Termination time found: run cleanup scripts and stop polling.
    /env/bin/runterminationscripts.sh
    break
  else
    # Spot instance not yet marked for termination.
    sleep 5
  fi
done
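
As a hedged sketch of what a termination script such as /env/bin/runterminationscripts.sh might do, assume the worker exports the key and timestamp of its current job part in CURRENT_KEY and CURRENT_DATETIME (both hypothetical), and that a jobstatus attribute is used to track task state in the table.

#!/bin/bash
# Hypothetical cleanup: mark the in-flight task as interrupted and requeue it.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/demojobqueue"

# Update the job task state in the DynamoDB table.
aws dynamodb update-item \
  --table-name demojobtable \
  --key '{"srcKey": {"S": "'"$CURRENT_KEY"'"}, "datetime": {"S": "'"$CURRENT_DATETIME"'"}}' \
  --update-expression "SET jobstatus = :s" \
  --expression-attribute-values '{":s": {"S": "interrupted"}}'

# Put the job part back on the queue so another worker can pick it up.
aws sqs send-message --queue-url "$QUEUE_URL" --message-body "object $CURRENT_KEY "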

An Auto Scaling group running on-demand instances in tandem with Auto Scaling groups running Spot instances will help ensure your application’s availability in case of changes in Spot market price and Spot instance capacity. Additionally, using multiple Spot Auto Scaling groups with different bids and capacity pools (groups of instances that share the same attributes) can help with both availability and cost savings. By having the ability to run across multiple pools, you reduce your application’s sensitivity to price spikes that affect a pool or two (in general, there is very little correlation between prices in different capacity pools). For example, if you run in five different pools, your price swings and interruptions can be cut by 80%.

For Auto Scaling to scale according to your application’s needs, you must define how you want to scale in response to changing conditions. You can assign more aggressive scaling policies to Auto Scaling groups that run Spot instances, and more conservative ones to Auto Scaling groups that run on-demand instances. The Auto Scaling groups will scale out based on the SQS queue depth CloudWatch metric (or a more relevant metric for your application), but will scale in based on the EC2 instance CPU utilization CloudWatch metric (or a more relevant metric) to ensure that a job actually gets completed before an instance is terminated. For information about using Amazon CloudWatch metrics (such as SQS queue depth) to scale automatically, see Dynamic Scaling. As a safety buffer, a grace period (e.g., 300 seconds) should also be configured for the Auto Scaling group to prevent termination before a job completes.
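
A hedged sketch of such a scale-out policy and its queue-depth alarm is shown below; the group name, threshold, and adjustment values are placeholders, and the policy ARN returned by the first command is passed to the alarm action.

# Add two instances to the Spot worker group when the alarm fires
# (group name, adjustment, and threshold are hypothetical examples).
POLICY_ARN=$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name demojob-spot-c4large-asg \
  --policy-name demojob-scale-out \
  --scaling-adjustment 2 \
  --adjustment-type ChangeInCapacity \
  --query PolicyARN --output text)

# Trigger the policy when more than 100 messages are waiting in the job queue.
aws cloudwatch put-metric-alarm \
  --alarm-name demojobqueue-depth-high \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=demojobqueue \
  --statistic Average --period 300 --evaluation-periods 1 \
  --threshold 100 --comparison-operator GreaterThanThreshold \
  --alarm-actions "$POLICY_ARN"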

If the On-demand EC2 instances are consistently in use, Reserved Instances can be purchased for them to realize even more cost savings.

To automate further, you can optionally create a Lambda function to dynamically manage Auto Scaling groups based on the Spot market. The Lambda function could periodically invoke the EC2 Spot APIs to assess market prices and availability and respond by creating new Auto Scaling launch configurations and groups automatically. Information on the Describe Spot Price History API is available here. This function could also delete any Spot Auto Scaling groups and launch configurations that have no instances. AWS Data Pipeline can be used to invoke the Lambda function using the AWS CLI at regular intervals by scheduling pipelines. More information about scheduling pipelines is available here and information about invoking AWS Lambda functions using AWS CLI is here.
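
For example, the function (or the AWS CLI directly) might sample recent Spot prices for candidate instance types before deciding which launch configurations to create; the instance types, product description, time window, and the demospotmanager function name below are illustrative assumptions.

# Recent Spot price history for two candidate instance types in us-east-1
# (instance types, product description, and time window are examples).
aws ec2 describe-spot-price-history \
  --instance-types c4.large m4.large \
  --product-descriptions "Linux/UNIX (Amazon VPC)" \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice,Timestamp]' \
  --output table

# The same assessment logic can be packaged as a Lambda function and invoked
# on a schedule, for example by AWS Data Pipeline running the AWS CLI.
aws lambda invoke --function-name demospotmanager /tmp/output.json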

Spot Architecture Automation Diagram