AWS Compute Blog

Amazon ECS Events in February

by Chris Barclay | in Amazon ECS

Here are some upcoming events for Amazon ECS this month:

Container World: Abby Fuller, senior AWS technical evangelist, will be speaking about Amazon ECS at Container World on Feb 21-23. Check out her schedule.

Microservices Day @ AWS NY Loft: Microservices Day is on Feb 24 as part of the DevOps | AWS Loft Architecture Week. Learn more about how to build and deploy microservices architectures on AWS. We will cover how to use Amazon ECS and AWS Lambda to build microservices. Sign up here.

Seattle AWS Architects & Engineers Meetup: Join us Feb 28 at SURF Incubator to learn more about AWS Batch and Amazon ECS. Food and drinks provided. RSVP here.

Implementing Serverless Manual Approval Steps in AWS Step Functions and Amazon API Gateway

by Bryan Liston

Ali Baghani, Software Development Engineer

A common use case for AWS Step Functions is a task that requires human intervention (for example, an approval process). Step Functions makes it easy to coordinate the components of distributed applications as a series of steps in a visual workflow called a state machine. You can quickly build and run state machines to execute the steps of your application in a reliable and scalable fashion.

In this post, I describe a serverless design pattern for implementing manual approval steps. You can use a Step Functions activity task to generate a unique token that can be returned later indicating either approval or rejection by the person making the decision.

Key steps to implementation

When the execution of a Step Functions state machine reaches an activity task state, Step Functions schedules the activity and waits for an activity worker. An activity worker is an application that polls for activity tasks by calling GetActivityTask. When the worker successfully calls the API action, the activity is vended to that worker as a JSON blob that includes a token for callback.

At this point, the activity task state and the branch of the execution that contains the state is paused. Unless a timeout is specified in the state machine definition, which can be up to one year, the activity task state waits until the activity worker calls either SendTaskSuccess or SendTaskFailure using the vended token. This pause is the first key to implementing a manual approval step.

The second key is that, in a serverless environment, the code that fetches the work and acquires the token can be separate from the code that responds with the completion status and sends the token back, as long as the token can be shared between them. In this example, the activity worker is a serverless application supervised by a single activity task state.

In this walkthrough, you use a short-lived AWS Lambda function invoked on a schedule to implement the activity worker, which acquires the token associated with the approval step, and prepares and sends an email to the approver using Amazon SES.

It is convenient if the application that returns the token can directly call the SendTaskSuccess and SendTaskFailure API actions on Step Functions. The easiest way to achieve this is to expose these two actions through Amazon API Gateway, so that an email client or web browser can return the token to Step Functions. By combining a Lambda function that acquires the token with an API Gateway API that returns it, you can implement a serverless manual approval step.

In this pattern, when the execution reaches a state that requires manual approval, the Lambda function prepares and sends an email to the user with two embedded hyperlinks for approval and rejection.

If the authorized user clicks on the approval hyperlink, the state succeeds. If the authorized user clicks on the rejection link, the state fails. You can also choose to set a timeout for approval and, upon timeout, take action, such as resending the email request using retry/catch conditions in the activity task state.
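The approve and reject hyperlinks carry the task token as a query string parameter. Task tokens may contain characters such as `+`, `/`, and `=`, so the token must be URL-encoded. Below is a minimal sketch of building both links. It assumes the `/succeed` and `/fail` resources and the `respond` stage that are set up later in this post. `buildApprovalLinks` is a hypothetical helper, and the Invoke URL is a placeholder.

```javascript
'use strict';

// Hypothetical helper: build the approve/reject links embedded in the approval email.
// invokeUrl is a placeholder for the Invoke URL of your API Gateway "respond" stage.
function buildApprovalLinks(invokeUrl, taskToken) {
    // Encode the token so characters such as '+', '/' and '=' survive in a query string.
    const query = '?taskToken=' + encodeURIComponent(taskToken);
    const base = invokeUrl.replace(/\/+$/, ''); // drop any trailing slash
    return {
        approve: base + '/succeed' + query,
        reject: base + '/fail' + query
    };
}

// Usage with a made-up token and a placeholder Invoke URL.
const links = buildApprovalLinks(
    'https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/respond/',
    'AAAA+bbb/ccc=='
);
console.log(links.approve);
console.log(links.reject);
```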

Employee promotion process

As an example use case, you can design a simple employee promotion process that involves a single task: getting a manager’s approval through email. When an employee is nominated for promotion, a new execution starts. The name of the employee and the email address of the employee’s manager are provided to the execution.

You’ll use the design pattern to implement the manual approval step, and SES to send the email to the manager. After acquiring the task token, the Lambda function generates and sends an email to the manager with embedded hyperlinks to URIs hosted by API Gateway.

In this example, I have administrative access to my account, so that I can create IAM roles. Moreover, I have already registered my email address with SES, so that I can send emails with the address as the sender/recipient. For detailed instructions, see Send an Email with Amazon SES.

Here is a list of what you do:

  1. Create an activity
  2. Create a state machine
  3. Create and deploy an API
  4. Create an activity worker Lambda function
  5. Test that the process works

Create an activity

In the Step Functions console, choose Tasks and create an activity called ManualStep.


Remember to keep the ARN of this activity at hand.


Create a state machine

Next, create the state machine that models the promotion process on the Step Functions console. Use StatesExecutionRole-us-east-1, the default role created by the console. Name the state machine PromotionApproval, and use the following definition. Remember to replace the value for Resource with your activity ARN.

{
  "Comment": "Employee promotion process!",
  "StartAt": "ManualApproval",
  "States": {
    "ManualApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:us-east-1:ACCOUNT_ID:activity:ManualStep",
      "TimeoutSeconds": 3600,
      "End": true
    }
  }
}

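Earlier in this post, retry/catch conditions are suggested as a way to resend the approval email after a timeout. One possible variant of the definition adds a `Retry` rule for the `States.Timeout` error to the task state. When the retry reschedules the activity, the polling worker should pick it up again and send another email. The sketch below builds that variant as a JavaScript object and prints it as JSON. The retry values (one retry after 60 seconds) and the `ApprovalTimedOut` Fail state are illustrative assumptions, not part of the original walkthrough.

```javascript
'use strict';

// Builds a variant of the PromotionApproval definition that retries the
// approval step once on timeout, then ends in an explicit Fail state.
// activityArn is a placeholder; replace ACCOUNT_ID with your account ID.
function buildDefinition(activityArn) {
    return {
        Comment: 'Employee promotion process!',
        StartAt: 'ManualApproval',
        States: {
            ManualApproval: {
                Type: 'Task',
                Resource: activityArn,
                TimeoutSeconds: 3600,
                // Reschedule the activity (so the worker sends a new email) once after a timeout.
                Retry: [{
                    ErrorEquals: ['States.Timeout'],
                    IntervalSeconds: 60,
                    MaxAttempts: 1
                }],
                // If the retry also times out, move to a terminal failure state.
                Catch: [{
                    ErrorEquals: ['States.Timeout'],
                    Next: 'ApprovalTimedOut'
                }],
                End: true
            },
            ApprovalTimedOut: {
                Type: 'Fail',
                Error: 'ApprovalTimedOut',
                Cause: 'No response from the approver.'
            }
        }
    };
}

// Print the JSON to paste into the Step Functions console.
const definition = buildDefinition('arn:aws:states:us-east-1:ACCOUNT_ID:activity:ManualStep');
console.log(JSON.stringify(definition, null, 2));
```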
Create and deploy an API

Next, create and deploy public URIs for calling the SendTaskSuccess or SendTaskFailure API action using API Gateway.

First, navigate to the IAM console and create the role that API Gateway can use to call Step Functions. Name the role APIGatewayToStepFunctions, choose Amazon API Gateway as the role type, and create the role.

After the role has been created, attach the managed policy AWSStepFunctionsFullAccess to it.


In the API Gateway console, create a new API called StepFunctionsAPI. Create two new resources under the root (/) called succeed and fail, and for each resource, create a GET method.


You now need to configure each method. Start with the /fail GET method and configure it with the following values:

  • For Integration type, choose AWS Service.
  • For AWS Service, choose Step Functions.
  • For HTTP method, choose POST.
  • For Region, choose the region where you created the state machine (this walkthrough uses us-east-1). (For a list of regions where Step Functions is available, see AWS Region Table.)
  • For Action Type, choose Use action name, and for Action, enter SendTaskFailure.
  • For Execution role, enter the APIGatewayToStepFunctions role ARN.


To be able to pass the taskToken through the URI, navigate to the Method Request section, and add a URL Query String parameter called taskToken.


Then, navigate to the Integration Request section and add a Body Mapping Template of type application/json to inject the query string parameter into the body of the request. Accept the change suggested by the security warning. This sets the body pass-through behavior to When there are no templates defined (Recommended). The following code does the mapping:

{
   "cause": "Reject link was clicked.",
   "error": "Rejected",
   "taskToken": "$input.params('taskToken')"
}

When you are finished, choose Save.

Next, configure the /succeed GET method. The configuration is very similar to the /fail GET method. The only difference is for Action: choose SendTaskSuccess, and set the mapping as follows:

{
   "output": "\"Approve link was clicked.\"",
   "taskToken": "$input.params('taskToken')"
}

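The two integrations make the same calls that code using the AWS SDK for JavaScript (v2) could make directly: `sendTaskSuccess` with a JSON `output`, or `sendTaskFailure` with an `error` and a `cause`. Because `output` must be a valid JSON document, the /succeed mapping template wraps the message in escaped quotes so that it becomes a JSON string. The sketch below shows the SDK equivalent. It uses the v2 callback style and receives the client as a parameter, so it can run against a stand-in object. In Lambda, you would pass `new aws.StepFunctions()`. `respondToApproval` is a hypothetical helper.

```javascript
'use strict';

// Hypothetical helper: report the approver's decision to Step Functions.
// stepfunctions is an AWS SDK v2 StepFunctions client (or a stand-in with the same methods).
function respondToApproval(stepfunctions, taskToken, approved, callback) {
    if (approved) {
        // output must be valid JSON, so the plain message is serialized as a JSON string.
        stepfunctions.sendTaskSuccess({
            taskToken: taskToken,
            output: JSON.stringify('Approve link was clicked.')
        }, callback);
    } else {
        stepfunctions.sendTaskFailure({
            taskToken: taskToken,
            error: 'Rejected',
            cause: 'Reject link was clicked.'
        }, callback);
    }
}

// Stand-in client that records calls instead of contacting AWS.
const calls = [];
const fakeClient = {
    sendTaskSuccess: (params, cb) => { calls.push(['success', params]); cb(null, {}); },
    sendTaskFailure: (params, cb) => { calls.push(['failure', params]); cb(null, {}); }
};

respondToApproval(fakeClient, 'example-token', true, (err) => { if (err) console.error(err); });
respondToApproval(fakeClient, 'example-token', false, (err) => { if (err) console.error(err); });
```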
The last step on the API Gateway console after configuring your API actions is to deploy them to a new stage called respond. You can test your API by choosing the Invoke URL links under either of the GET methods. Because no token is provided in the URI, a ValidationException message should be displayed.


Create an activity worker Lambda function

In the Lambda console, create a Lambda function with a CloudWatch Events Schedule trigger using a blank function blueprint for the Node.js 4.3 runtime. The rate entered for Schedule expression is the poll rate for the activity. Set it above the rate at which activities are scheduled, with a safety margin.

The safety margin accounts for the possibility of lost tokens, retried activities, and polls that happen while no activities are scheduled. For example, if you expect 3 promotions to happen in a certain week, you can schedule the Lambda function to run 4 times a day during that week. Alternatively, a single Lambda function can poll for multiple activities, either in parallel or in series. For this example, name the rule ManualStepPollSchedule and use a rate of one time per minute, but do not enable the trigger yet.
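CloudWatch Events rate expressions use a singular unit when the value is 1 and a plural unit otherwise. For example, `rate(1 minute)` is correct, but a value of 5 needs `rate(5 minutes)`. This is easy to get wrong when the poll rate is computed rather than typed. Below is a small sketch of a hypothetical helper that formats the expression for the minute, hour, and day units.

```javascript
'use strict';

// Hypothetical helper: format a CloudWatch Events rate expression.
// The unit is pluralized for any value other than 1.
function rateExpression(value, unit) {
    const units = ['minute', 'hour', 'day'];
    if (!Number.isInteger(value) || value < 1) {
        throw new Error('Rate value must be a positive integer.');
    }
    if (units.indexOf(unit) === -1) {
        throw new Error('Unsupported unit: ' + unit);
    }
    return 'rate(' + value + ' ' + unit + (value === 1 ? '' : 's') + ')';
}

console.log(rateExpression(1, 'minute')); // poll rate used in this walkthrough
console.log(rateExpression(6, 'hour'));   // four polls a day
```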


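Next, create the Lambda function ManualStepActivityWorker using the following Node.js 4.3 code. The function receives the taskToken, employee name, and manager’s email from Step Functions. It embeds the information into an email and sends the email to the manager.

'use strict';
console.log('Loading function');
const aws = require('aws-sdk');
const stepfunctions = new aws.StepFunctions();
const ses = new aws.SES();

// Placeholder: replace YOUR_API_ID (and the region) with the Invoke URL of your "respond" stage.
const apiUrl = 'https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/respond';

exports.handler = (event, context, callback) => {
    var taskParams = {
        // Replace ACCOUNT_ID with your AWS account ID.
        activityArn: 'arn:aws:states:us-east-1:ACCOUNT_ID:activity:ManualStep'
    };

    // Long-polls for up to 60 seconds waiting for a scheduled activity task.
    stepfunctions.getActivityTask(taskParams, function(err, data) {
        if (err) {
            console.log(err, err.stack);
            callback('An error occurred while calling getActivityTask.');
        } else if (!data || !data.taskToken) {
            // No activities scheduled: the poll returns without a task token.
            callback(null, 'No activities received after 60 seconds.');
        } else {
            var input = JSON.parse(data.input);
            // The token is URL-encoded because it travels as a query string parameter.
            var token = encodeURIComponent(data.taskToken);
            var approveUrl = apiUrl + '/succeed?taskToken=' + token;
            var rejectUrl = apiUrl + '/fail?taskToken=' + token;
            var emailParams = {
                Destination: {
                    ToAddresses: [
                        input.managerEmailAddress
                    ]
                },
                Message: {
                    Subject: {
                        Data: 'Your Approval Needed for Promotion!',
                        Charset: 'UTF-8'
                    },
                    Body: {
                        Html: {
                            Data: 'Hi!<br />' +
                                input.employeeName + ' has been nominated for promotion!<br />' +
                                'Can you please approve:<br />' +
                                '<a href="' + approveUrl + '">' + approveUrl + '</a><br />' +
                                'Or reject:<br />' +
                                '<a href="' + rejectUrl + '">' + rejectUrl + '</a>',
                            Charset: 'UTF-8'
                        }
                    }
                },
                // The sender must be verified in SES; this example reuses the manager's address.
                Source: input.managerEmailAddress,
                ReplyToAddresses: [
                    input.managerEmailAddress
                ]
            };

            ses.sendEmail(emailParams, function (err, data) {
                if (err) {
                    console.log(err, err.stack);
                    callback('Internal Error: The email could not be sent.');
                } else {
                    callback(null, 'The email was successfully sent.');
                }
            });
        }
    });
};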

In the Lambda function handler and role section, for Role, choose Create a new role, and name it LambdaManualStepActivityWorkerRole.


Add two policies to the role: one that allows the Lambda function to call the GetActivityTask API action on Step Functions, and one that allows it to send email through SES. The first statement below grants the function permission to write its CloudWatch Logs. The result should look as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": "states:GetActivityTask",
      "Resource": "arn:aws:states:*:*:activity:ManualStep"
    },
    {
      "Effect": "Allow",
      "Action": "ses:SendEmail",
      "Resource": "*"
    }
  ]
}

In addition, as the GetActivityTask API action performs long-polling with a timeout of 60 seconds, increase the timeout of the Lambda function to 1 minute 15 seconds. This allows the function to wait for an activity to become available, and gives it extra time to call SES to send the email. For all other settings, use the Lambda console defaults.


After this, you can create your activity worker Lambda function.

Test the process

You are now ready to test the employee promotion process.

In the Lambda console, enable the ManualStepPollSchedule trigger on the ManualStepActivityWorker Lambda function.

In the Step Functions console, start a new execution of the state machine with the following input, using an email address verified with SES as the value of managerEmailAddress:

{ "managerEmailAddress": "", "employeeName" : "Jim" } 

Within a minute, you should receive an email with links to approve or reject Jim’s promotion. Choosing one of those links should succeed or fail the execution.
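You can also start executions from code instead of the console, for example from the system where nominations are recorded. Below is a minimal sketch with the `startExecution` call of the AWS SDK for JavaScript (v2). The client is passed in so the sketch can run against a stand-in object. The state machine ARN and the email address are placeholders, and `startPromotion` is a hypothetical helper. Note that `input` must be a JSON string, not an object.

```javascript
'use strict';

// Hypothetical helper: start a PromotionApproval execution for one nomination.
// stepfunctions is an AWS SDK v2 StepFunctions client (or a stand-in with the same method).
function startPromotion(stepfunctions, stateMachineArn, employeeName, managerEmailAddress, callback) {
    const params = {
        stateMachineArn: stateMachineArn,
        // The execution input is passed as a JSON string.
        input: JSON.stringify({
            managerEmailAddress: managerEmailAddress,
            employeeName: employeeName
        })
    };
    stepfunctions.startExecution(params, callback);
}

// Stand-in client that captures the request instead of calling AWS.
let captured = null;
const fakeClient = {
    startExecution: (params, cb) => { captured = params; cb(null, { executionArn: 'example-execution-arn' }); }
};

startPromotion(
    fakeClient,
    'arn:aws:states:us-east-1:ACCOUNT_ID:stateMachine:PromotionApproval',
    'Jim',
    'manager@example.com',
    (err, data) => { if (err) { console.error(err); } else { console.log(data.executionArn); } }
);
```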



In this post, you created a state machine containing an activity task with Step Functions, an API with API Gateway, and a Lambda function to dispatch the approval/failure process. Your Step Functions activity task generated a unique token that was returned later indicating either approval or rejection by the person making the decision. Your Lambda function acquired the task token by polling the activity task, and then generated and sent an email to the manager for approval or rejection with embedded hyperlinks to URIs hosted by API Gateway.

If you have questions or suggestions, please comment below.

Amazon Kinesis Firehose Data Transformation with AWS Lambda

by Bryan Liston

Shiva Narayanaswamy, Solution Architect

Amazon Kinesis Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service (Amazon ES). You configure your data producers to send data to Firehose, and it automatically delivers the data to the specified destination. You can send data to your delivery stream with the Amazon Kinesis Agent, or with the Firehose API through the AWS SDK.

Customers have told us that they want to perform light preprocessing or mutation of the incoming data stream before writing it to the destination. Other use cases include normalizing data produced by different producers, adding metadata to the record, or converting incoming data to a format suitable for the destination. Until now, customers have delivered data to an intermediate destination, such as an S3 bucket, and used S3 event notifications to trigger a Lambda function that performs the transformation before delivering the data to the final destination.

In this post, I introduce data transformation capabilities on your delivery streams, to seamlessly transform incoming source data and deliver the transformed data to your destinations.

Introducing Firehose Data Transformations

With the Firehose data transformation feature, when you create a delivery stream, you can now specify a Lambda function that performs transformations directly on the stream.

When you enable Firehose data transformation, Firehose buffers incoming data and invokes the specified Lambda function asynchronously with each buffered batch. The transformed data is sent from Lambda to Firehose for buffering and then delivered to the destination. You can also enable source record backup, which backs up all untransformed records to your S3 bucket while the transformed records are delivered to the destination.

To get you started, we provide the following Lambda blueprints, which you can adapt to suit your needs:

  • Apache Log to JSON
  • Apache Log to CSV
  • Syslog to JSON
  • Syslog to CSV
  • General Firehose Processing

Setting up Firehose Data Transformation

Now I'm going to walk you through the setup of a Firehose stream with data transformation.

In the Firehose console, create a new delivery stream with an existing S3 bucket as the destination.


In the Configuration section, enable data transformation, and choose the generic Firehose processing Lambda blueprint, which takes you to the Lambda console.


Edit the code inline, and paste the following Lambda function, which I'm using to demonstrate the Firehose data transformation feature. Choose a timeout of 5 minutes. This function matches the records in the incoming stream to a regular expression. On match, it parses the JSON record. The function then does the following:

  • Picks only the RETAIL sector and drops the rest (filtering)
  • Adds a TIMESTAMP to the record (mutation)
  • Converts from JSON to CSV (transformation)
  • Passes the processed record back into the stream for delivery

'use strict';
console.log('Loading function');

/* Stock Ticker format parser */
const parser = /^\{\"TICKER_SYMBOL\"\:\"[A-Z]+\"\,\"SECTOR\"\:"[A-Z]+\"\,\"CHANGE\"\:[-.0-9]+\,\"PRICE\"\:[-.0-9]+\}/;

exports.handler = (event, context, callback) => {
    let success = 0; // Number of valid entries found
    let failure = 0; // Number of invalid entries found
    let dropped = 0; // Number of dropped entries

    /* Process the list of records and transform them */
    const output = event.records.map((record) => {
        const entry = Buffer.from(record.data, 'base64').toString('utf8');
        const match = parser.exec(entry);
        if (match) {
            const parsed_match = JSON.parse(match[0]);
            const milliseconds = new Date().getTime();
            /* Add timestamp and convert to CSV */
            const result = `${milliseconds},${parsed_match.TICKER_SYMBOL},${parsed_match.SECTOR},${parsed_match.CHANGE},${parsed_match.PRICE}` + "\n";
            const payload = Buffer.from(result, 'utf8').toString('base64');
            if (parsed_match.SECTOR != 'RETAIL') {
                /* Dropped event, notify and leave the record intact */
                dropped++;
                return {
                    recordId: record.recordId,
                    result: 'Dropped',
                    data: record.data,
                };
            } else {
                /* Transformed event */
                success++;
                return {
                    recordId: record.recordId,
                    result: 'Ok',
                    data: payload,
                };
            }
        } else {
            /* Failed event, notify the error and leave the record intact */
            failure++;
            console.log("Failed event : " + record.data);
            return {
                recordId: record.recordId,
                result: 'ProcessingFailed',
                data: record.data,
            };
        }
        /* This transformation is the "identity" transformation, the data is left intact
        return {
            recordId: record.recordId,
            result: 'Ok',
            data: record.data,
        } */
    });
    console.log(`Processing completed. Successful records ${success}, dropped records ${dropped}, failed records ${failure}.`);
    callback(null, { records: output });
};

In the Firehose console, choose the newly created Lambda function. Enable source record backup, and choose the same S3 bucket and an appropriate prefix. Firehose delivers the raw data stream to this bucket under this prefix.


Choose an S3 buffer size of 1 MB and a buffer interval of 60 seconds. Create a Firehose Delivery IAM role.


Review the configuration and create the Firehose delivery stream.

Testing Firehose Data Transformation

You can use the AWS Management Console to ingest simulated stock ticker data. The console runs a script in your browser to put sample records in your Firehose delivery stream. This enables you to test the configuration of your delivery stream without having to generate your own test data. The following is an example from the simulated data:


To test the Firehose data transformation, the Lambda function created in the previous section adds a timestamp to the records and delivers only the stocks from the “RETAIL” sector. This test demonstrates the ability to add metadata to the records in the incoming stream and to filter the delivery stream.

Choose the newly created Firehose delivery stream, and choose Test with demo data, Start sending demo data.


Firehose provides CloudWatch metrics about the delivery stream. Additional metrics to monitor the data processing feature are also now available.


The destination S3 bucket now contains the prefixes with the source data backup and the processed stream. Download a file of the processed data, and verify that the records contain the timestamp and only “RETAIL” sector data.
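If you would rather check a downloaded file with a script than by eye, the following sketch may help. It parses each line using the CSV layout produced by the transformation function in this post: timestamp, ticker symbol, sector, change, price. It reports any record that lacks a numeric timestamp or belongs to a sector other than RETAIL. `checkProcessedFile` and the file name in the usage comment are hypothetical.

```javascript
'use strict';

// Hypothetical helper: sanity-check a downloaded file of processed records.
// Each record written by the transformation is: timestamp,ticker,sector,change,price
function checkProcessedFile(text) {
    const problems = [];
    const records = text.split('\n').filter((line) => line.length > 0);
    records.forEach((line, i) => {
        const fields = line.split(',');
        const label = 'record ' + (i + 1);
        if (fields.length !== 5) {
            problems.push(label + ': expected 5 fields, found ' + fields.length);
        } else if (!/^\d+$/.test(fields[0])) {
            problems.push(label + ': timestamp is not an integer number of milliseconds');
        } else if (fields[2] !== 'RETAIL') {
            problems.push(label + ': unexpected sector ' + fields[2]);
        }
    });
    return problems;
}

// Usage (the file name is a placeholder for a file downloaded from the processed prefix):
// const text = require('fs').readFileSync('processed-records.txt', 'utf8');
// console.log(checkProcessedFile(text));
```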



With the Firehose data transformation feature, you now have a powerful, scalable way to perform data transformations on streaming data. You can create a data lake with the raw data, and simultaneously transform data to be consumed in a suitable format by a Firehose destination.

For more information about Firehose, see the Amazon Kinesis Firehose Developer Guide.

If you have any questions or suggestions, please comment below.

Authorizing Access Through a Proxy Resource to Amazon API Gateway and AWS Lambda Using Amazon Cognito User Pools

by Bryan Liston | in Amazon API Gateway, AWS Lambda

Ed Lima, Solutions Architect

Want to create your own user directory that can scale to hundreds of millions of users? Amazon Cognito user pools are fully managed so that you don’t have to worry about the heavy lifting associated with building, securing, and scaling authentication to your apps.

The AWS Mobile blog post Integrating Amazon Cognito User Pools with API Gateway back in May explained how to integrate user pools with Amazon API Gateway using an AWS Lambda custom authorizer. Since then, we’ve released a new feature where you can directly configure a Cognito user pool authorizer to authenticate your API calls; more recently, we released a new proxy resource feature. In this post, I show how to use these great new features together to secure access to an API backed by a Lambda proxy resource.


In this post, I assume that you have some basic knowledge about the services involved. If not, feel free to review our documentation and tutorials on:

Start by creating a user pool called “myApiUsers”, and enable verifications with optional MFA access for extra security:


Be mindful that if you use a similar solution for production workloads, you need to request an SMS spending threshold limit increase from Amazon SNS to send SMS messages to users for phone number verification or MFA. For the purposes of this post, the default limit suffices, because you are only testing API authentication with a single user.

Now, create an app in your user pool, making sure to clear Generate client secret:


Using the client ID of your newly created app, add a user, “jdoe”, with the AWS CLI. The user needs a valid email address and phone number to receive MFA codes:

aws cognito-idp sign-up \
--client-id 12ioh8c17q3stmndpXXXXXXXX \
--username jdoe \
--password P@ssw0rd \
--region us-east-1 \
--user-attributes '[{"Name":"given_name","Value":"John"},{"Name":"family_name","Value":"Doe"},{"Name":"email","Value":""},{"Name":"gender","Value":"Male"},{"Name":"phone_number","Value":"+61XXXXXXXXXX"}]'  

In the Cognito User Pools console, under Users, select the new user and choose Confirm User and Enable MFA:


Your Cognito user is now ready and available to connect.

Next, create a Node.js Lambda function called LambdaForSimpleProxy with a basic execution role. Here’s the code:

'use strict';
console.log('Loading CUP2APIGW2Lambda Function');

exports.handler = function(event, context, callback) {
    var responseCode = 200;
    console.log("request: " + JSON.stringify(event));

    // Claims of the authenticated user, added to the request by the Cognito user pool authorizer
    var claims = event.requestContext.authorizer.claims;
    var responseBody = {
        message: "Hello, " + claims.given_name + " " + claims.family_name + "!" + " You are authenticated to your API using Cognito user pools!",
        method: "This is an authorized " + event.httpMethod + " to Lambda from your API using a proxy resource.",
        body: event.body
    };

    //Response including CORS required header
    var response = {
        statusCode: responseCode,
        headers: {
            "Access-Control-Allow-Origin" : "*"
        },
        body: JSON.stringify(responseBody)
    };

    console.log("response: " + JSON.stringify(response));
    callback(null, response);
};

For the last piece of the back-end puzzle, create a new API called CUP2Lambda from the Amazon API Gateway console. Under Authorizers, choose Create, Cognito User Pool Authorizer with the following settings:


Create an ANY method under the root of the API as follows:


After that, choose Save, OK to give API Gateway permissions to invoke the Lambda function. It’s time to configure the authorization settings for your ANY method. Under Method Request, enter the Cognito user pool as the authorization for your API:


Finally, choose Actions, Enable CORS. This creates an OPTIONS method in your API:


Now it’s time to deploy the API to a stage (such as prod) and generate a JavaScript SDK from the SDK Generation tab. You can use other methods to connect to your API; however, in this post I show how to use the API Gateway SDK. Because you are using an ANY method, the generated SDK has no calls for specific HTTP methods other than the OPTIONS method created by Enable CORS. You therefore have to add a couple of extra functions to the apigClient.js file so that your SDK can perform GET and POST operations on your API:

    apigClient.rootGet = function (params, body, additionalParams) {
        if(additionalParams === undefined) { additionalParams = {}; }
        apiGateway.core.utils.assertParametersDefined(params, [], ['body']);

        var rootGetRequest = {
            verb: 'get'.toUpperCase(),
            path: pathComponent + uritemplate('/').expand(apiGateway.core.utils.parseParametersToObject(params, [])),
            headers: apiGateway.core.utils.parseParametersToObject(params, []),
            queryParams: apiGateway.core.utils.parseParametersToObject(params, []),
            body: body
        };

        return apiGatewayClient.makeRequest(rootGetRequest, authType, additionalParams, config.apiKey);
    };

    apigClient.rootPost = function (params, body, additionalParams) {
        if(additionalParams === undefined) { additionalParams = {}; }
        apiGateway.core.utils.assertParametersDefined(params, ['body'], ['body']);

        var rootPostRequest = {
            verb: 'post'.toUpperCase(),
            path: pathComponent + uritemplate('/').expand(apiGateway.core.utils.parseParametersToObject(params, [])),
            headers: apiGateway.core.utils.parseParametersToObject(params, []),
            queryParams: apiGateway.core.utils.parseParametersToObject(params, []),
            body: body
        };

        return apiGatewayClient.makeRequest(rootPostRequest, authType, additionalParams, config.apiKey);
    };


You can now use a little front end web page to authenticate users and test authorized calls to your API. In order for it to work, you need to add some external libraries and dependencies including the API Gateway SDK you just generated. You can find more details in our Cognito as well as API Gateway SDK documentation guides.

With the dependencies in place, you can use the following JavaScript code to authenticate your Cognito user pool user and connect to your API to perform authorized calls (replace the user pool ID and client ID with your own):

<script type="text/javascript">
  //Configure the AWS client with the Cognito role and a blank identity pool to get initial credentials
  AWS.config.update({
    region: 'us-east-1',
    credentials: new AWS.CognitoIdentityCredentials({
      IdentityPoolId: ''
    })
  });

  AWSCognito.config.region = 'us-east-1';
  AWSCognito.config.update({accessKeyId: 'null', secretAccessKey: 'null'});
  var token = "";

  //Authenticate user with MFA
  document.getElementById("buttonAuth").addEventListener("click", function(){
    var authenticationData = {
      Username : document.getElementById('username').value,
      Password : document.getElementById('password').value
    };

    var showGetPut = document.getElementById('afterLogin');
    var hideLogin = document.getElementById('login');

    var authenticationDetails = new AWSCognito.CognitoIdentityServiceProvider.AuthenticationDetails(authenticationData);

    // Replace with your user pool details
    var poolData = {
        UserPoolId : 'us-east-1_XXXXXXXXX',
        ClientId : '12ioh8c17q3stmndpXXXXXXXX',
        Paranoia : 7
    };

    var userPool = new AWSCognito.CognitoIdentityServiceProvider.CognitoUserPool(poolData);

    var userData = {
        Username : document.getElementById('username').value,
        Pool : userPool
    };

    var cognitoUser = new AWSCognito.CognitoIdentityServiceProvider.CognitoUser(userData);
    cognitoUser.authenticateUser(authenticationDetails, {
      onSuccess: function (result) {
        token = result.getIdToken().getJwtToken(); // CUP Authorizer = ID Token
        console.log('ID Token: ' + token); // Show ID Token in the console
        var cognitoGetUser = userPool.getCurrentUser();
        if (cognitoGetUser != null) {
          cognitoGetUser.getSession(function(err, result) {
            if (result) {
              console.log("User Successfully Authenticated!");
            }
          });
        }

        //Hide Login form after successful authentication
        showGetPut.style.display = 'block';
        hideLogin.style.display = 'none';
      },
      onFailure: function(err) {
        alert(err);
      },
      mfaRequired: function(codeDeliveryDetails) {
        var verificationCode = prompt('Please input a verification code.', '');
        cognitoUser.sendMFACode(verificationCode, this);
      }
    });
  });

  //Send a GET request to the API
  document.getElementById("buttonGet").addEventListener("click", function(){
    var apigClient = apigClientFactory.newClient();
    var additionalParams = {
      headers: {
        Authorization: token
      }
    };

    apigClient.rootGet({}, {}, additionalParams)
      .then(function(response) {
        document.getElementById("output").innerHTML = ('<pre align="left"><code>Response: '+JSON.stringify(response.data, null, 2)+'</code></pre>');
      }).catch(function (response) {
        document.getElementById('output').innerHTML = ('<pre align="left"><code>Error: '+JSON.stringify(response, null, 2)+'</code></pre>');
      });
  });

  //Send a POST request to the API
  document.getElementById("buttonPost").addEventListener("click", function(){
    var apigClient = apigClientFactory.newClient();
    var additionalParams = {
      headers: {
        Authorization: token
      }
    };
    var body = {
      "message": "Sample POST payload"
    };

    apigClient.rootPost({}, body, additionalParams)
      .then(function(response) {
        document.getElementById("output").innerHTML = ('<pre align="left"><code>Response: '+JSON.stringify(response.data, null, 2)+'</code></pre>');
      }).catch(function (response) {
        document.getElementById('output').innerHTML = ('<pre align="left"><code>Error: '+JSON.stringify(response, null, 2)+'</code></pre>');
      });
  });
</script>

As far as the front end is concerned, you can use some simple HTML code to test, such as the following snippet:

<div id="container" class="container">
    <img src="">
    <h1>Cognito User Pools and API Gateway</h1>
    <form name="myform">
        <ul>
          <li class="fields">
            <div id="login">
              <label>User Name: </label>
              <input id="username" size="60" class="req" type="text"/>
              <label>Password: </label>
              <input id="password" size="60" class="req" type="password"/>
              <button class="btn" type="button" id='buttonAuth' title="Log in with your username and password">Log In</button>
              <br />
            </div>
            <div id="afterLogin" style="display:none;">
              <br />
              <button class="btn" type="button" id='buttonPost'>POST</button>
              <button class="btn" type="button" id='buttonGet' >GET</button>
              <br />
            </div>
          </li>
        </ul>
    </form>
    <div id="output"></div>
</div>

After adding some extra CSS styling of your choice (for example, adding "list-style: none" to remove list bullet points), the front end is ready. You can test it by using a local web server on your computer or a static website on Amazon S3.

Enter the user name and password details for John Doe and choose Log In:


An MFA code is then sent to the user and can be validated accordingly:


After authentication, you can see the ID token generated by Cognito for further access testing:


If you go back to the API Gateway console and test your Cognito user pool authorizer with the same token, you get the authenticated user claims accordingly:
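To look at the same claims outside the console, you can decode the payload segment of the ID token in Node.js. The sketch below only base64url-decodes the middle part of the JWT and does not verify the signature. Use it for debugging, never for authorization decisions; in this setup, verification is done by the API Gateway authorizer. `decodeJwtPayload` is a hypothetical helper.

```javascript
'use strict';

// Hypothetical debugging helper: decode (NOT verify) the payload of a JWT such as a Cognito ID token.
function decodeJwtPayload(token) {
    const parts = token.split('.');
    if (parts.length !== 3) {
        throw new Error('Expected a JWT with three dot-separated parts.');
    }
    // The payload is base64url-encoded: map it back to standard base64 and restore padding.
    let b64 = parts[1].replace(/-/g, '+').replace(/_/g, '/');
    while (b64.length % 4 !== 0) {
        b64 += '=';
    }
    return JSON.parse(Buffer.from(b64, 'base64').toString('utf8'));
}

// Usage: pass the ID token printed to the browser console after login, e.g.
// const claims = decodeJwtPayload(idToken);
// console.log(claims.given_name, claims.family_name, claims['cognito:username']);
```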


In your front end, you can now perform authenticated GET calls to your API by choosing GET.


Or you can perform authenticated POST calls to your API by choosing POST.


The calls reach your Lambda proxy and return a valid response. You can also test from the command line with cURL by sending the user pool ID token that you retrieved from the developer console earlier in the “Authorization” header:
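As a Python alternative to cURL, the sketch below builds, but does not send, requests that carry the ID token in the Authorization header. It uses only the standard library. The invoke URL and the truncated token are hypothetical placeholders.

```python
import urllib.request
from typing import Optional


def build_authenticated_request(invoke_url: str, id_token: str, method: str = "GET",
                                body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (without sending) a request that carries a Cognito user pool ID token
    in the Authorization header, the header the authorizer in this post reads."""
    headers = {"Authorization": id_token}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(invoke_url, data=body, headers=headers, method=method)


# Hypothetical invoke URL and truncated token; substitute your own values.
get_request = build_authenticated_request(
    "https://abc123.execute-api.us-east-1.amazonaws.com/prod/", "eyJraWQiOi...")
post_request = build_authenticated_request(
    "https://abc123.execute-api.us-east-1.amazonaws.com/prod/", "eyJraWQiOi...",
    method="POST", body=b'{"message": "Sample POST payload"}')
# Sending is left to the caller, e.g. urllib.request.urlopen(get_request).
```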


You could improve this solution by integrating an Amazon DynamoDB table, for instance. In the Lambda function, check the request method in event.httpMethod, and then issue a GetItem call to the table for a GET request or a PutItem call for a POST request. There are many possibilities for this kind of proxy resource integration.
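The dispatch described above could look roughly like the following Python sketch of a Lambda proxy handler. The handler expects `table` to behave like a boto3 DynamoDB `Table` resource, which provides `get_item(Key=...)` and `put_item(Item=...)`. The `id` key attribute, the `?id=` query parameter, and the `TABLE` global are hypothetical conventions.

```python
import json

# Hypothetical: in Lambda you might set this at cold start, e.g.
# TABLE = boto3.resource("dynamodb").Table("my-table")
TABLE = None


def handler(event, context, table=None):
    """Lambda proxy integration sketch: dispatch on event["httpMethod"].
    `table` is assumed to behave like a boto3 DynamoDB Table resource."""
    table = table if table is not None else TABLE
    method = event.get("httpMethod")
    if method == "GET":
        # Hypothetical convention: the item key arrives as ?id=<value>.
        item_id = (event.get("queryStringParameters") or {}).get("id")
        payload = table.get_item(Key={"id": item_id}).get("Item", {})
    elif method == "POST":
        item = json.loads(event.get("body") or "{}")
        table.put_item(Item=item)
        payload = {"stored": item}
    else:
        return {"statusCode": 405, "body": json.dumps({"error": "method not allowed"})}
    # Lambda proxy responses need a statusCode and a string body.
    return {"statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload)}
```

One caveat for a real table: DynamoDB returns numbers as `Decimal`, and `json.dumps` cannot serialize `Decimal` unless you pass a custom `default`.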


The Cognito user pools integration with API Gateway provides a new way to secure your API workloads, and the new proxy resource for Lambda allows you to perform any business logic or transformations to your API calls from Lambda itself instead of using body mapping templates. These new features provide very powerful options to secure and handle your API logic.

I hope this post helps with your API workloads. If you have questions or suggestions, please comment below.

Managing Secrets for Amazon ECS Applications Using Parameter Store and IAM Roles for Tasks

by Chris Barclay | in Amazon ECS

Thanks to my colleague Stas Vonholsky for a great blog post on managing secrets with Amazon ECS applications.


As containerized applications and microservice-oriented architectures become more popular, managing secrets, such as a password to access an application database, becomes more challenging and critical.

Some examples of the challenges include:

  • Support for various access patterns across container environments such as dev, test, and prod
  • Isolated access to secrets on a container/application level rather than at the host level
  • Multiple decoupled services with their own needs for access, both as services and as clients of other services

This post focuses on newly released features that support further improvements to secret management for containerized applications running on Amazon ECS. My colleague, Matthew McClean, also published an excellent post on the AWS Security Blog, How to Manage Secrets for Amazon EC2 Container Service–Based Applications by Using Amazon S3 and Docker, which discusses some of the limitations of passing and storing secrets with container parameter variables.

Most secret management tools provide the following functionality:

  • Highly secured storage system
  • Central management capabilities
  • Secure authorization and authentication mechanisms
  • Integration with key management and encryption providers
  • Secure introduction mechanisms for access
  • Auditing
  • Secret rotation and revocation

Amazon EC2 Systems Manager Parameter Store

Parameter Store is a feature of Amazon EC2 Systems Manager. It provides a centralized, encrypted store for sensitive information and has many advantages when combined with other capabilities of Systems Manager, such as Run Command and State Manager. The service is fully managed, highly available, and highly secured.

Because Parameter Store is accessible using the Systems Manager API, AWS CLI, and AWS SDKs, you can also use it as a generic secret management store. Secrets can be easily rotated and revoked. Parameter Store is integrated with AWS KMS so that specific parameters can be encrypted at rest with the default or custom KMS key. Importing KMS keys enables you to use your own keys to encrypt sensitive data.

Access to Parameter Store is controlled by IAM policies, which support resource-level permissions: a policy that grants permissions to specific parameters or to a namespace limits access to those parameters. CloudTrail logs, if enabled for the service, record any attempt to access a parameter.

While Amazon S3 has many of the above features and can also be used to implement a central secret store, Parameter Store has the following added advantages:

  • Easy creation of namespaces to support different stages of the application lifecycle.
  • KMS integration that abstracts parameter encryption from the application while requiring the instance or container to have access to the KMS key and for the decryption to take place locally in memory.
  • Stored history about parameter changes.
  • A service that can be controlled separately from S3, which is likely used for many other applications.
  • A configuration data store, reducing overhead from implementing multiple systems.
  • No usage costs.

Note: At the time of publication, Systems Manager doesn’t support VPC private endpoint functionality. To enforce stricter access to a Parameter Store endpoint from a private VPC, use a NAT gateway with a fixed Elastic IP address, together with IAM policy conditions that restrict parameter access to a limited set of IP addresses.
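An IP-restricted policy of this kind might be generated as in the following minimal sketch. It uses the standard IAM `IpAddress` condition on the `aws:SourceIp` key. The account ID and the 203.0.113.10/32 address, which stands in for the NAT gateway's Elastic IP, are hypothetical.

```python
import json


def parameter_policy_with_source_ip(region, account_id, parameter_pattern, allowed_cidrs):
    """Return an IAM policy document (as a dict) that allows ssm:GetParameters on
    the given parameter pattern only for requests from one of allowed_cidrs."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ssm:GetParameters"],
            "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/{parameter_pattern}",
            "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
        }],
    }


# Hypothetical account ID and Elastic IP.
policy = parameter_policy_with_source_ip(
    "us-east-1", "123456789012", "prod.app1.*", ["203.0.113.10/32"])
print(json.dumps(policy, indent=2))
```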

IAM roles for tasks

With IAM roles for Amazon ECS tasks, you can specify an IAM role to be used by the containers in a task. Applications interacting with AWS services must sign their API requests with AWS credentials. This feature provides a strategy for managing credentials for your applications to use, similar to the way that Amazon EC2 instance profiles provide credentials to EC2 instances.

Instead of creating and distributing your AWS credentials to the containers or using the EC2 instance role, you can associate an IAM role with an ECS task definition or the RunTask API operation. For more information, see IAM Roles for Tasks.

You can use IAM roles for tasks to securely introduce and authenticate the application or container with the centralized Parameter Store. Access to the secret manager should include features such as:

  • Limited TTL for credentials used
  • Granular authorization policies
  • An ID to track the requests in the logs of the central secret manager
  • Integration support with the scheduler that could map between the container or task deployed and the relevant access privileges

IAM roles for tasks support this use case well, as the role credentials can be accessed only from within the container for which the role is defined. The role exposes temporary credentials and these are rotated automatically. Granular IAM policies are supported with optional conditions about source instances, source IP addresses, time of day, and other options.

The source IAM role can be identified in the CloudTrail logs based on a unique Amazon Resource Name and the access permissions can be revoked immediately at any time with the IAM API or console. As Parameter Store supports resource level permissions, a policy can be created to restrict access to specific keys and namespaces.

Dynamic environment association

In many cases, the container image does not change when moving between environments, which supports immutable deployments and ensures that results are reproducible. What does change is the configuration: in this context, specifically the secrets. For example, a database and its password might be different in the staging and production environments. That leaves the question of how to point the application at the correct secret. Should it retrieve prod.app1.secret, test.app1.secret, or something else?

One option is to pass the environment type to the container as an environment variable. The application then concatenates the environment type (prod, test, etc.) with the relative key path and retrieves the relevant secret. In most cases, this leads to a number of separate ECS task definitions, one per environment.
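On the application side, the concatenation could be as simple as the following sketch. It follows the dotted `<env>.<app>.<secret>` names used in this post. `ENVIRONMENT_TYPE` is a hypothetical variable name, not something ECS sets for you.

```python
import os
from typing import Optional


def parameter_name(app: str, secret: str, env: Optional[str] = None) -> str:
    """Build a name such as "prod.app1.db-pass" from the environment type,
    following the dotted <env>.<app>.<secret> convention used in this post."""
    env = env or os.environ.get("ENVIRONMENT_TYPE")  # hypothetical variable name
    if not env:
        raise RuntimeError("ENVIRONMENT_TYPE is not set for this container")
    return f"{env}.{app}.{secret}"
```

The resulting name could then be passed to boto3, for example `ssm.get_parameters(Names=[parameter_name("app1", "db-pass")], WithDecryption=True)`.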

When you describe the task definition in a CloudFormation template, you could derive both the task’s environment property and the IAM role’s access to Parameter Store and the KMS key from a single CloudFormation parameter, such as “environment type.” This approach could support a single task definition type based on a generic CloudFormation template.
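In CloudFormation itself this derivation would use intrinsic functions such as Fn::Sub. The Python sketch below only illustrates the idea: one input yields the container's environment entry plus the parameter and KMS key ARNs that the task role's policy would reference. The `ENVIRONMENT_TYPE` name and all argument values are hypothetical.

```python
def environment_settings(environment_type: str, app: str, region: str,
                         account_id: str, kms_key_id: str) -> dict:
    """Derive every per-environment value from one input, the way a single
    CloudFormation "environment type" parameter could drive the template."""
    namespace = f"{environment_type}.{app}"
    return {
        # Goes into the container definition's "environment" list.
        "container_environment": [{"name": "ENVIRONMENT_TYPE", "value": environment_type}],
        # Resources the task role's policy would reference.
        "parameter_arn": f"arn:aws:ssm:{region}:{account_id}:parameter/{namespace}.*",
        "kms_key_arn": f"arn:aws:kms:{region}:{account_id}:key/{kms_key_id}",
    }
```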

Walkthrough: Securely access Parameter Store resources with IAM roles for tasks

This walkthrough is configured for the US East (N. Virginia) Region (us-east-1). I recommend using the same Region.

Step 1: Create the keys and parameters

First, create the following KMS keys, with the default key policy, to encrypt the various parameters:

  • prod-app1: used to encrypt any secrets for app1.
  • license-key: used to encrypt license-related secrets (the command below creates it with the description license-code).
aws kms create-key --description prod-app1 --region us-east-1
aws kms create-key --description license-code --region us-east-1

Note the KeyId property in the output of both commands. You use it throughout the walkthrough to identify the KMS keys.
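On the CLI, adding `--query KeyMetadata.KeyId --output text` to the commands above prints only the key ID. The same step through an SDK might look like the following sketch, which expects a boto3 KMS client or any object with the same `create_key` method.

```python
def create_key(kms_client, description: str) -> str:
    """Create a KMS key with the default key policy and return its KeyId,
    the value this walkthrough asks you to note."""
    response = kms_client.create_key(Description=description)
    return response["KeyMetadata"]["KeyId"]

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   kms = boto3.client("kms", region_name="us-east-1")
#   key_ids = {name: create_key(kms, name) for name in ("prod-app1", "license-code")}
```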

The following commands create three parameters in Parameter Store:

  • prod.app1.db-pass (encrypted with the prod-app1 KMS key)
  • general.license-code (encrypted with the license-key KMS key)
  • prod.app2.user-name (stored as a standard string without encryption)
aws ssm put-parameter --name prod.app1.db-pass --value "AAAAAAAAAAA" --type SecureString --key-id "<key-id-for-prod-app1-key>" --region us-east-1
aws ssm put-parameter --name general.license-code --value "CCCCCCCCCCC" --type SecureString --key-id "<key-id-for-license-code-key>" --region us-east-1
aws ssm put-parameter --name prod.app2.user-name --value "BBBBBBBBBBB" --type String --region us-east-1
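The same three parameters could be created through boto3's SSM `put_parameter` call, as in this sketch. `key_ids` is assumed to map each key's description to its KeyId, for example the dictionary built in the previous step. The parameter names, values, and types are the ones in the commands above.

```python
# (name, value, type, description of the encrypting key; None for a plain String)
PARAMETERS = [
    ("prod.app1.db-pass", "AAAAAAAAAAA", "SecureString", "prod-app1"),
    ("general.license-code", "CCCCCCCCCCC", "SecureString", "license-code"),
    ("prod.app2.user-name", "BBBBBBBBBBB", "String", None),
]


def put_parameters(ssm_client, key_ids: dict) -> None:
    """Create the walkthrough parameters; only SecureString values get a KeyId."""
    for name, value, param_type, key_name in PARAMETERS:
        kwargs = {"Name": name, "Value": value, "Type": param_type}
        if key_name is not None:
            kwargs["KeyId"] = key_ids[key_name]
        ssm_client.put_parameter(**kwargs)

# Usage: put_parameters(boto3.client("ssm", region_name="us-east-1"), key_ids)
```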

Step 2: Create the IAM role and policies

Now, create an IAM role and an IAM policy to associate with the ECS task that you create later on.
The trust policy for the IAM role needs to allow the ecs-tasks service principal to assume the role.

{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Sid": "",
       "Effect": "Allow",
       "Principal": {
         "Service": ""
       },
       "Action": "sts:AssumeRole"
     }
   ]
}

Save the above policy as a file in the local directory with the name ecs-tasks-trust-policy.json.

aws iam create-role --role-name prod-app1 --assume-role-policy-document file://ecs-tasks-trust-policy.json
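For reference, the following sketch builds the same trust policy in Python and creates the role through boto3's IAM `create_role`. For IAM roles for tasks, the service principal is `ecs-tasks.amazonaws.com`. The client is injected so that the sketch runs without AWS access.

```python
import json

# Trust policy that lets Amazon ECS tasks assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}


def create_task_role(iam_client, role_name: str) -> str:
    """Create the role with the trust policy above and return its ARN."""
    response = iam_client.create_role(RoleName=role_name,
                                      AssumeRolePolicyDocument=json.dumps(TRUST_POLICY))
    return response["Role"]["Arn"]

# Usage: create_task_role(boto3.client("iam"), "prod-app1")
```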

The following policy is attached to the role and later associated with the app1 container. It grants access to parameters in the prod.app1.* namespace, to the KMS key required to decrypt the prod.app1.db-pass parameter, and to the general.license-code parameter (but not to the KMS key that encrypts it). This namespace-based resource permission structure is useful for building various hierarchies (based on environments, applications, etc.).

Make sure to replace <key-id-for-prod-app1-key> with the key ID for the relevant KMS key and <account-id> with your account ID in the following policy.

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Action": [],
             "Resource": "*"
         },
         {
             "Sid": "Stmt1482841904000",
             "Effect": "Allow",
             "Action": [],
             "Resource": []
         },
         {
             "Sid": "Stmt1482841948000",
             "Effect": "Allow",
             "Action": [],
             "Resource": []
         }
     ]
}

Save the above policy as a file in the local directory with the name app1-secret-access.json:

aws iam create-policy --policy-name prod-app1 --policy-document file://app1-secret-access.json

Replace <account-id> with your account ID in the following command:

aws iam attach-role-policy --role-name prod-app1 --policy-arn "arn:aws:iam::<account-id>:policy/prod-app1"
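The following is a minimal sketch of a policy document that matches the access described for app1. It assumes the essential permissions are `ssm:GetParameters` on the prod.app1.* namespace and on general.license-code, plus `kms:Decrypt` on the prod-app1 key only. The action lists are not shown in the policy above, so treat these actions as assumptions rather than a copy of it.

```python
import json


def app1_policy(region: str, account_id: str, prod_app1_key_id: str) -> dict:
    """Sketch of app1's access: read prod.app1.* and general.license-code,
    but decrypt only with the prod-app1 key (so the license code stays encrypted)."""
    ssm_arn = f"arn:aws:ssm:{region}:{account_id}:parameter"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["ssm:GetParameters"],
             "Resource": [f"{ssm_arn}/prod.app1.*", f"{ssm_arn}/general.license-code"]},
            {"Effect": "Allow", "Action": ["kms:Decrypt"],
             "Resource": [f"arn:aws:kms:{region}:{account_id}:key/{prod_app1_key_id}"]},
        ],
    }

# Usage: iam.create_policy(PolicyName="prod-app1",
#                          PolicyDocument=json.dumps(app1_policy("us-east-1", ACCOUNT_ID, KEY_ID)))
```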

Step 3: Add the testing script to an S3 bucket

Create a file with the script below, name it and add it to an S3 bucket in your account. Make sure the object is publicly accessible and note down the object link, for example

#!/bin/bash
# This is a simple bash script that tests access to the EC2 Systems Manager Parameter Store.
# Install the AWS CLI
apt-get -y install python2.7 curl
curl -O
pip install awscli
# Getting region
EC2_AVAIL_ZONE=`curl -s`
EC2_REGION="`echo \"$EC2_AVAIL_ZONE\" | sed -e 's:\([0-9][0-9]*\)[a-z]*\$:\\1:'`"
# Trying to retrieve parameters from the EC2 Parameter Store
APP1_WITH_ENCRYPTION=`aws ssm get-parameters --names prod.app1.db-pass --with-decryption --region $EC2_REGION --output text 2>&1`
APP1_WITHOUT_ENCRYPTION=`aws ssm get-parameters --names prod.app1.db-pass --no-with-decryption --region $EC2_REGION --output text 2>&1`
LICENSE_WITH_ENCRYPTION=`aws ssm get-parameters --names general.license-code --with-decryption --region $EC2_REGION --output text 2>&1`
LICENSE_WITHOUT_ENCRYPTION=`aws ssm get-parameters --names general.license-code --no-with-decryption --region $EC2_REGION --output text 2>&1`
APP2_WITHOUT_ENCRYPTION=`aws ssm get-parameters --names prod.app2.user-name --no-with-decryption --region $EC2_REGION --output text 2>&1`
# nginx is started after this script finishes, so prepare the folder for the HTML page.
if [ ! -d /usr/share/nginx/html/ ]; then
mkdir -p /usr/share/nginx/html/;
chmod 755 /usr/share/nginx/html/
fi
# Creating an HTML file to be accessed at http://<public-instance-DNS-name>/ecs.html
cat > /usr/share/nginx/html/ecs.html <<EOF
<!DOCTYPE html>
<html>
<head>
<style>
body {padding: 20px;margin: 0 auto;font-family: Tahoma, Verdana, Arial, sans-serif;}
code {white-space: pre-wrap;}
result {background: hsl(220, 80%, 90%);}
</style>
</head>
<body>
<h1>Hi there!</h1>
<p style="padding-bottom: 0.8cm;">Following are the results of different access attempts as experienced by "App1".</p>

<p><b>Access to prod.app1.db-pass:</b><br/>
<pre><code>aws ssm get-parameters --names prod.app1.db-pass --with-decryption</code><br/>
<code><result>$APP1_WITH_ENCRYPTION</result></code><br/>
<code>aws ssm get-parameters --names prod.app1.db-pass --no-with-decryption</code><br/>
<code><result>$APP1_WITHOUT_ENCRYPTION</result></code></pre></p>

<p><b>Access to general.license-code:</b><br/>
<pre><code>aws ssm get-parameters --names general.license-code --with-decryption</code><br/>
<code><result>$LICENSE_WITH_ENCRYPTION</result></code><br/>
<code>aws ssm get-parameters --names general.license-code --no-with-decryption</code><br/>
<code><result>$LICENSE_WITHOUT_ENCRYPTION</result></code></pre></p>

<p><b>Access to prod.app2.user-name:</b><br/>
<pre><code>aws ssm get-parameters --names prod.app2.user-name --no-with-decryption</code><br/>
<code><result>$APP2_WITHOUT_ENCRYPTION</result></code></pre></p>

<p><em>Thanks for visiting</em></p>
</body>
</html>
EOF

Step 4: Create a test cluster

I recommend creating a new ECS test cluster with the latest ECS AMI and ECS agent on the instance. Use the following field values:

  • Cluster name: access-test
  • EC2 instance type: t2.micro
  • Number of instances: 1
  • Key pair: No EC2 key pair is required, unless you’d like to SSH to the instance and explore the running container.
  • VPC: Choose the default VPC. If unsure, you can find the VPC ID with the IP range in the Amazon VPC console.
  • Subnets: Pick a subnet in the default VPC.
  • Security group: Create a new security group with CIDR block and port 80 for inbound access.

Leave other fields with the default settings.

Create a simple task definition that uses the public NGINX image and the role that you created for app1. Specify properties such as the available container resources and port mappings. Note that the command option downloads and invokes a test script that installs the AWS CLI in the container, runs a number of get-parameters commands, and creates an HTML file with the results.

In the following commands, replace <account-id> with your account ID and <your-S3-URI> with the link to the S3 object that you created in step 3:

aws ecs register-task-definition --family access-test --task-role-arn "arn:aws:iam::<account-id>:role/prod-app1" --container-definitions name="access-test",image="nginx",portMappings="[{containerPort=80,hostPort=80,protocol=tcp}]",readonlyRootFilesystem=false,cpu=512,memory=490,essential=true,entryPoint="sh,-c",command="\"/bin/sh -c \\\"apt-get update ; apt-get -y install curl ; curl -O <your-S3-URI> ; chmod +x ; ./ ; nginx -g 'daemon off;'\\\"\"" --region us-east-1

aws ecs run-task --cluster access-test --task-definition access-test --count 1 --region us-east-1
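The same two calls could be made through boto3's ECS client, as in the following sketch. It mirrors the CLI flags above, with `command` standing for the shell string passed to `sh -c`, which depends on the name of your test script. The client is injected, and the function raises if `run_task` returns no task.

```python
def register_and_run(ecs_client, account_id: str, command: str) -> str:
    """Register the access-test task definition with the prod-app1 task role,
    start one copy on the access-test cluster, and return the task ARN."""
    ecs_client.register_task_definition(
        family="access-test",
        taskRoleArn=f"arn:aws:iam::{account_id}:role/prod-app1",
        containerDefinitions=[{
            "name": "access-test",
            "image": "nginx",
            "cpu": 512,
            "memory": 490,
            "essential": True,
            "readonlyRootFilesystem": False,
            "portMappings": [{"containerPort": 80, "hostPort": 80, "protocol": "tcp"}],
            "entryPoint": ["sh", "-c"],
            "command": [command],
        }],
    )
    response = ecs_client.run_task(cluster="access-test", taskDefinition="access-test", count=1)
    if not response["tasks"]:
        raise RuntimeError(f"run_task failed: {response.get('failures')}")
    return response["tasks"][0]["taskArn"]
```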

Verifying access

After the task is in a running state, check the public DNS name of the instance and navigate to the following page:


After a short time, you should see the results of the different access tests run from the container.

If the test results don’t appear immediately, wait a few seconds and refresh the page.
Make sure that inbound traffic for port 80 is allowed on the security group attached to the instance.

The results on the static HTML page should match the output of running the following commands from the container.


aws ssm get-parameters --names prod.app1.db-pass --with-decryption --region us-east-1
aws ssm get-parameters --names prod.app1.db-pass --no-with-decryption --region us-east-1

Both commands should work, as the policy provides access to both the parameter and the required KMS key.


aws ssm get-parameters --names general.license-code --no-with-decryption --region us-east-1
aws ssm get-parameters --names general.license-code --with-decryption --region us-east-1

Only the first command, with the --no-with-decryption option, should work. The policy allows access to the parameter in Parameter Store but not to the KMS key, so the second command should fail with an access denied error.


aws ssm get-parameters --names prod.app2.user-name --no-with-decryption --region us-east-1

The command should fail with an access denied error, as there are no permissions associated with the namespace for prod.app2.
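The three outcomes follow from two independent checks: whether the parameter name matches an allowed pattern, and whether decryption needs a key that the role may use. The sketch below is a deliberately simplified model of those two checks and not real IAM policy evaluation. The pattern list and the name-to-key mapping are taken from this walkthrough.

```python
from fnmatch import fnmatchcase

# Simplified model of the walkthrough policy (not real IAM evaluation).
ALLOWED_PARAMETERS = ["prod.app1.*", "general.license-code"]   # ssm:GetParameters
ALLOWED_KEYS = {"prod-app1"}                                    # kms:Decrypt
PARAMETER_KEYS = {"prod.app1.db-pass": "prod-app1",
                  "general.license-code": "license-code",
                  "prod.app2.user-name": None}                   # None: plain String


def can_get(name: str, with_decryption: bool) -> bool:
    """Predict whether app1's get-parameters call succeeds."""
    if not any(fnmatchcase(name, pattern) for pattern in ALLOWED_PARAMETERS):
        return False  # no Parameter Store permission for this name
    key = PARAMETER_KEYS.get(name)
    if with_decryption and key is not None:
        return key in ALLOWED_KEYS  # decryption also needs access to the KMS key
    return True
```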

Finishing up

Remember to delete all resources (such as the EC2 instance and the ECS cluster) and to schedule the KMS keys for deletion, so that you don’t incur charges.


Central secret management is an important aspect of securing containerized environments. By using Parameter Store and IAM roles for tasks, customers can create a central secret management store and a well-integrated access layer that lets applications access only the keys they need, restricts access on a per-container basis, and further encrypts secrets with custom KMS keys.

Whether the secret management layer is implemented with Parameter Store, Amazon S3, Amazon DynamoDB, or a solution such as Vault or KeyWhiz, it’s a vital part of the process of managing and accessing secrets.

Resize Images on the Fly with Amazon S3, AWS Lambda, and Amazon API Gateway

by Bryan Liston

John Pignata, Solutions Architect

With the explosion of device types used to access the Internet, each with different capabilities, screen sizes, and resolutions, developers must often provide images in an array of sizes to ensure a great user experience. This can be complex to manage and can drive up costs.

Images stored using Amazon S3 are often processed into multiple sizes to fit within the design constraints of a website or mobile application. It’s a common approach to use S3 event notifications and AWS Lambda for eager processing of images when a new object is created in a bucket.

In this post, I explore a different approach and outline a method of lazily generating images, in which a resized asset is only created if a user requests that specific size.

Resizing on the fly

Instead of processing and resizing images into all necessary sizes upon upload, the approach of processing images on the fly has several upsides:

  • Increased agility
  • Reduced storage costs
  • Resilience to failure

Increased agility

When you redesign your website or application, you can add new dimensions on the fly, rather than working to reprocess the entire archive of images that you have stored.

Running a batch process to resize all original images into new, resized dimensions can be time-consuming, costly, and error-prone. With the on-the-fly approach, a developer can instead specify a new set of dimensions and lazily generate new assets as customers use the new website or application.

Reduced storage costs

With eager image processing, the resized images must be stored indefinitely as the operation only happens one time. The approach of resizing on-demand means that developers do not need to store images that are not accessed by users.

Because a user request initiates resizing, this approach also unlocks options for optimizing storage costs for resized image assets, such as S3 lifecycle rules that expire older images and can be tuned to an application’s specific access patterns. If a user attempts to access a resized image that a lifecycle rule has removed, the API resizes it on demand to fulfill the request.
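An expiration rule of this kind could be expressed as the following lifecycle configuration. It has the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID, the prefix, and the 30-day period are hypothetical and should be tuned to your access patterns. An empty prefix suits a bucket that holds only resized images.

```python
def expire_resized_images(prefix: str, days: int) -> dict:
    """Lifecycle configuration that expires objects under `prefix` after `days` days.
    Use prefix="" for a bucket that holds only resized images."""
    return {
        "Rules": [{
            "ID": "expire-resized-images",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }

# Usage (hypothetical bucket name and retention):
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-resized-images",
#       LifecycleConfiguration=expire_resized_images("", 30))
```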

Resilience to failure

A key best practice outlined in the Architecting for the Cloud: Best Practices whitepaper is “Design for failure and nothing will fail.” When building distributed services, developers should be pessimistic and assume that failures will occur.

If image processing is designed to occur only one time upon object creation, an intermittent failure in that process (or any loss of the processed images) could cause continual failures for future users. When resizing images on demand, each request initiates processing if a resized image is not found, meaning that future requests could recover from a previous failure automatically.

Architecture overview


Here’s the process:

  1. A user requests a resized asset from an S3 bucket through its static website hosting endpoint. The bucket has a routing rule configured to redirect any request for an object that cannot be found to the resize API.
  2. Because the resized asset does not exist in the bucket, the request is temporarily redirected to the resize API method.
  3. The user’s browser follows the redirect and requests the resize operation via API Gateway.
  4. The API Gateway method is configured to trigger a Lambda function to serve the request.
  5. The Lambda function downloads the original image from the S3 bucket, resizes it, and uploads the resized image back into the bucket as the originally requested key.
  6. When the Lambda function completes, API Gateway permanently redirects the user to the file stored in S3.
  7. The user’s browser requests the now-available resized image from the S3 bucket. Subsequent requests from this and other users will be served directly from S3 and bypass the resize operation. If the resized image is deleted in the future, the above process repeats and the resized image is re-created and replaced into the S3 bucket.
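Steps 4 to 6 could be sketched in Python as follows. The sketch assumes a key layout of `<width>x<height>/<original key>`, for example `300x300/blue_marble.jpg`. The `resize_and_store` callback is hypothetical and stands in for step 5 (download, resize, upload). `website_url` corresponds to the URL environment variable configured below.

```python
import re
from typing import Callable

# Assumed key layout: "<width>x<height>/<original key>", e.g. "300x300/blue_marble.jpg".
KEY_PATTERN = re.compile(r"^(\d+)x(\d+)/(.+)$")


def handle_resize(key: str, website_url: str,
                  resize_and_store: Callable[[str, int, int, str], None]) -> dict:
    """Parse the requested key, create the resized object, then permanently
    redirect the browser to it on the S3 website endpoint.

    resize_and_store(original_key, width, height, target_key) is a hypothetical
    callback standing in for the download/resize/upload work."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return {"statusCode": 400, "body": f"Unsupported key: {key}"}
    width, height, original_key = int(match.group(1)), int(match.group(2)), match.group(3)
    resize_and_store(original_key, width, height, key)
    return {"statusCode": 301,
            "headers": {"Location": f"{website_url}/{key}"},
            "body": ""}
```

In a Lambda proxy handler, `key` would come from the query string that the bucket's redirect rule builds, which is also an assumption about how the rule is configured.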

Set up resources

A working example with code is open source and available in the serverless-image-resizing GitHub repo. You can create the required resources by following the README directions, which use an AWS Serverless Application Model (AWS SAM) template, or by following the manual directions below.

To create and configure the S3 bucket

  1. In the S3 console, create a new S3 bucket.
  2. Choose Permissions, Add Bucket Policy. Add a bucket policy to allow anonymous access.
  3. Choose Static Website Hosting, Enable website hosting and, for Index Document, enter index.html.
  4. Choose Save.
  5. Note the name of the bucket that you’ve created and the hostname in the Endpoint field.
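An anonymous-read bucket policy for step 2 might look like the following sketch, which allows `s3:GetObject` on every object for any principal. The bucket name in the usage comment is hypothetical. On newer accounts, S3 Block Public Access settings may also have to be relaxed before a public policy like this can be applied.

```python
import json


def public_read_policy(bucket_name: str) -> str:
    """Bucket policy allowing anonymous s3:GetObject on every object,
    so the static website endpoint can serve the images."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }],
    })

# Usage: s3.put_bucket_policy(Bucket="my-image-bucket", Policy=public_read_policy("my-image-bucket"))
```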

To create the Lambda function

  1. In the Lambda console, choose Create a Lambda function, Blank Function.
  2. To select an integration, choose the dotted square and choose API Gateway.
  3. To allow all users to invoke the API method, for Security, choose Open and then Next.
  4. For Name, enter resize. For Code entry type, choose Upload a .ZIP file.
  5. Choose Function package and upload the .ZIP file of the contents of the Lambda function.
  6. To configure your function, for Environment variables, add two variables:
    • For Key, enter BUCKET; for Value, enter the bucket name that you created above.
    • For Key, enter URL; for Value, enter the endpoint field that you noted above, prefixed with http://.
  7. To define the execution role permissions for the function, for Role, choose Create a custom role. Choose View Policy Document, Edit, OK.
  8. Replace __YOUR_BUCKET_NAME_HERE__ with the name of the bucket that you’ve created and copy the following policy into the policy document. Note that any leading spaces in your policy may cause a validation error.
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [],
        "Resource": "arn:aws:logs:*:*:*"
      },
      {
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::__YOUR_BUCKET_NAME_HERE__/*"
      }
    ]
  }
  9. For Memory, choose 1536. For Timeout, enter 10 sec. Choose Next, Create function.
  10. Choose Triggers, and note the hostname in the URL of your function.


To set up the S3 redirection rule

  1. In the S3 console, open the bucket that you created above.
  2. Expand Static Website Hosting, Edit Redirection Rules.
  3. Replace YOUR_API_HOSTNAME_HERE with the hostname that you noted above and copy the following into the redirection rules configuration:
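As an alternative to the console, a redirect of this kind could be set with boto3's `put_bucket_website`. The following sketch builds a website configuration whose routing rule turns 404s into a temporary (307) redirect to the API host. Because the rule has no `KeyPrefixEquals` condition, `ReplaceKeyPrefixWith` effectively prepends its value to the missing key. The `prod/resize?key=` stage and path are hypothetical; use whatever your API actually exposes.

```python
def website_configuration(api_hostname: str, key_prefix: str = "prod/resize?key=") -> dict:
    """Static website configuration whose routing rule sends 404s to the resize
    API with a temporary redirect, passing the missing key along."""
    return {
        "IndexDocument": {"Suffix": "index.html"},
        "RoutingRules": [{
            "Condition": {"HttpErrorCodeReturnedEquals": "404"},
            "Redirect": {
                "Protocol": "https",
                "HostName": api_hostname,
                "ReplaceKeyPrefixWith": key_prefix,  # hypothetical stage/path
                "HttpRedirectCode": "307",
            },
        }],
    }

# Usage: s3.put_bucket_website(Bucket="my-image-bucket",
#     WebsiteConfiguration=website_configuration("abc123.execute-api.us-east-1.amazonaws.com"))
```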

Test image resizing

Upload a test image to your bucket. The Blue Marble image is a good sample for testing because it is large and square. Once it’s uploaded, try to retrieve resized versions of the image using your bucket’s static website hosting endpoint:




You should see a smaller version of the test photo. If not, choose Monitoring in your Lambda function and check CloudWatch Logs for troubleshooting. You can also refer to the serverless-image-resizing GitHub repo for a working example that you can deploy to your account.


The solution I’ve outlined is a simplified example of how to implement this functionality. For example, in a real-world implementation, there would likely be a list of permitted sizes to prevent a requestor from filling your bucket with randomly sized images. Further cost optimizations could be employed, such as using S3 lifecycle rules on a bucket dedicated to resized images to expire resized images after a given amount of time.
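A permitted-size check could be as small as the following sketch, run before any resizing happens. The allowlist values are hypothetical, and the sketch assumes the same `<width>x<height>/` key prefix as the examples above.

```python
import re
from typing import Optional, Tuple

# Hypothetical allowlist; use the sizes your design actually needs.
ALLOWED_SIZES = frozenset({"150x150", "300x300", "640x480"})
SIZE_PREFIX = re.compile(r"^(\d+)x(\d+)/")


def requested_size(key: str) -> Optional[Tuple[int, int]]:
    """Return (width, height) if the key's size prefix is on the allowlist, else None."""
    match = SIZE_PREFIX.match(key)
    if match is None or f"{match.group(1)}x{match.group(2)}" not in ALLOWED_SIZES:
        return None
    return int(match.group(1)), int(match.group(2))
```

Rejecting unknown sizes before calling the resize step keeps a requester from filling the bucket with arbitrary dimensions.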

This approach allows you to lazily generate resized images while taking advantage of serverless architecture. This means you have no operating systems to manage, secure, or patch; no servers to right-size, monitor, or scale; no risk of over-spending by over-provisioning; and no risk of delivering a poor user experience due to poor performance by under-provisioning.

If you have any questions, feedback, or suggestions about this approach, please let us know in the comments!

Amazon ECS at The Climate Corporation: Using Amazon ECR and Multiple Accounts for Isolated Regression Testing

by Chris Barclay

This is a guest post from Nathan Mehl, Site Reliability Engineering Manager at The Climate Corporation.

The Climate Corporation aims to help all the world’s farmers sustainably increase their productivity through the use of digital tools. The integrated Climate FieldView™ digital agriculture platform provides farmers with a comprehensive, connected suite of digital tools. Bringing together seamless field data collection, advanced agronomic modeling and local weather monitoring into simple mobile and web software solutions, the Climate FieldView platform gives farmers a deeper understanding of their fields so they can make more informed operating decisions to optimize yields, maximize efficiency, and reduce risk.

Like many large users of AWS, The Climate Corporation uses multiple AWS accounts in order to provide strict isolation between a production environment and various pre-production environments (testing, staging, etc.). Applications are continuously deployed from a build/CI system (Jenkins CI) into a testing environment, and then, after either automated or manual testing (depending on the application and team), are “promoted” into a staging/pre-production environment where full regression/load tests are done, and then finally promoted again into production.

But when it came time to evaluate the use of Amazon EC2 Container Registry (Amazon ECR) and Amazon EC2 Container Service (Amazon ECS), our multi-account deployment presented a challenge: if the unit of deployment is a container, and there are potentially multiple full-fledged independent accounts where that container could be running, how do we safely and conveniently manage the lifecycle of an individual container image as it makes its way through the testing pipeline and into staging and production?

One obvious approach that we quickly discarded was simply having an independent ECR deployment in each account. The problem with this approach is that it would be extremely difficult to know the current deployment state of any given image without building and maintaining an external state-tracking system that could easily fall out of sync with ground truth in ECR. Also, “promoting” an image between environments would require copying it from one account’s ECR registry to another’s: in addition to being slow, this would require careful construction of inter-account IAM policies to let a single client pull from ECR in account A and then push to account B.

We chose to have a single “ECR registry of record” in the AWS account that hosts our Jenkins continuous integration system: Jenkins builds create the container images and push them to ECR. An out-of-band, scheduled AWS Lambda function iterates over the list of ECR repositories and applies a repository access policy that grants all of our other accounts read access.
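A scheduled function of that kind might look roughly like the following sketch. It builds a repository policy that grants the three ECR actions needed to pull an image, then applies it to every repository through boto3's `describe_repositories` paginator and `set_repository_policy`. The account IDs are hypothetical, and this is not The Climate Corporation's actual code. The pulling accounts also need `ecr:GetAuthorizationToken` in their own IAM policies.

```python
import json
from typing import Iterable, List

# Actions another account needs in order to pull from a repository.
PULL_ACTIONS = [
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "ecr:BatchCheckLayerAvailability",
]


def read_only_policy(account_ids: Iterable[str]) -> str:
    """Repository policy text granting pull access to the given AWS accounts."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "CrossAccountPull",
            "Effect": "Allow",
            "Principal": {"AWS": [f"arn:aws:iam::{a}:root" for a in account_ids]},
            "Action": PULL_ACTIONS,
        }],
    })


def apply_to_all_repositories(ecr_client, account_ids: Iterable[str]) -> List[str]:
    """(Re)apply the read-only policy to every repository; return the names touched."""
    policy_text = read_only_policy(account_ids)
    touched = []
    for page in ecr_client.get_paginator("describe_repositories").paginate():
        for repo in page["repositories"]:
            ecr_client.set_repository_policy(repositoryName=repo["repositoryName"],
                                             policyText=policy_text)
            touched.append(repo["repositoryName"])
    return touched
```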

The next question was: how to track the state of any given container image as it moved through the different accounts/environments on its way to production? Again, we could build an external state-tracking system, but after some thought we realized that we could use the tag metadata offered by the v2.2 Docker Manifest specification and the ECR put_image API operation to provide the same information.

To track the state of the container image through the continuous deployment pipelines, we use multiple Docker tags to ensure that the image’s current state is implicit in the current state of its repository in ECR. Each image has a unique tag applied to it at creation time, which we call the BUILD TAG. The build tag is treated as immutable: once created, none of our other tooling alters it. The build tag ties the image back to a particular build in Jenkins and to a particular git hash, for example, for build #11 of the “ecs-demo” project:

Image Tag                                   Image Digest
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11    sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b

But as our tooling “promotes” the image from our testing environment to staging to production, it adds or updates additional tags—which we call ENVIRONMENT TAGS—pointing at the same image.

When a new image is built in our Jenkins continuous integration server (usually because a git pull request has been approved and merged to the master branch of that project’s repository), a common build script takes care of building the image, applying the build tag and the “testing” environment tag, and then pushing the image and both of those tags to ECR and then kicking off the deployment into our testing environment. A tool called, simply enough, “promote” can be used either by automated processes or humans to move an image from the testing environment to the staging, or from there to production.
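The core of a “promote” step can be sketched with the retagging approach that ECR supports: fetch the manifest that a build tag points to with `batch_get_image`, then `put_image` the same manifest under the environment tag, which moves that tag to the image. The function below is a hypothetical illustration, not the internals of The Climate Corporation's tool. ECR rejects the put if the tag already points at that exact manifest, so a real tool would need to handle that case.

```python
def promote(ecr_client, repository: str, source_tag: str, environment_tag: str) -> str:
    """Point environment_tag (for example "staging") at the image currently
    tagged source_tag by putting the same manifest under the new tag.
    Returns the image digest, which should not change since the manifest is reused."""
    response = ecr_client.batch_get_image(repositoryName=repository,
                                          imageIds=[{"imageTag": source_tag}])
    if not response["images"]:
        raise LookupError(f"{repository}:{source_tag} not found: {response.get('failures')}")
    image = response["images"][0]
    ecr_client.put_image(repositoryName=repository,
                         imageManifest=image["imageManifest"],
                         imageTag=environment_tag)
    return image["imageId"]["imageDigest"]
```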

For example, if build #11 had made it through the testing and staging environments but had not yet hit production, there might be two images in play, but five tags:

Image Tag                                   Image Digest
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11    sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-10    sha256:62985ec242857128fa0acea55e3c760e85594d6a2868
testing                                     sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
staging                                     sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
production                                  sha256:62985ec242857128fa0acea55e3c760e85594d6a2868

Later on, when build #11 is promoted to production but build #12 has hit the testing environment, the “testing” tag is applied to the same image as the build tag for build #12:

Image Tag                                   Image Digest
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12    sha256:c79b3e5b3459eb6f0d08a26eb304b8b70235d2eb7622
2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11    sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
testing                                     sha256:c79b3e5b3459eb6f0d08a26eb304b8b70235d2eb7622
staging                                     sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b
production                                  sha256:749a1f13eff516cc4fcbc2a9a28ea1685440a1fbf29b

The key here is that each image can potentially have multiple tags pointing at it, and the state of those image-to-tag mappings tells us where the image is in the CD pipeline.

So how do we manage all of these tags in ECR?

Tying this all together requires a bit of code: the Docker manifest format does not allow for the same tag to be associated with multiple images, and that’s a good thing as otherwise we would have no way of enforcing the uniqueness of a tag-to-hash mapping.

But that means that, to find out what the mapping is, we need to list the images in the repository with ECR’s list_images API, fetch them with its batch_get_image API, and then derive the mapping of environment tag to build tag: if two tags point to the same image digest, we can infer that they are both tagging the same image:

import json
import logging
import re
import sys
from collections import defaultdict

import boto3

BUILD_TAG_RE = re.compile(r'^\d{4}\.\d{2}\.\d{2}T\d{2}\.\d{2}\.\d{2}Z')
ENVS = frozenset(['testing', 'staging', 'production'])
# batch_get_image accepts at most 100 image IDs per request
BATCH_SIZE = 100

def get_tags_by_image(all_images):
    """ Iterate over the output of batch_get_image; return a dictionary
        that maps image digest hashes to a set of image tags."""
    images = defaultdict(set)
    for image in all_images:
        image_id = image['imageId']
        images[image_id['imageDigest']].add(image_id['imageTag'])
    return images

def get_equivalent_tags(all_images):
    """ Iterate over the output of batch_get_image; return a dictionary
        that maps environment tag names to build tag names. """
    equivs = {}
    tags_by_image = get_tags_by_image(all_images)
    for tags in tags_by_image.values():
        # any given tag can only apply to one image, but
        # an image may have multiple tags. So each possible
        # tag will only ever appear one time in one of the images
        # dict's value sets. Index those sets by the env tag.
        if len(tags) > 1:
            env_tags = [x for x in tags if x in ENVS]
            build_tags = [x for x in tags if BUILD_TAG_RE.match(x)]
            for env in env_tags:
                for build in build_tags:
                    # this is safe because `env` will only ever
                    # occur once in tags_by_image.values()
                    equivs[env] = build
    return equivs

def get_all_images(ecr, repo, tag_digest_list):
    """ Query ECR for all image manifests for the images listed
        in tag_digest_list. Return the 'images' sections of the
        responses; warn if there are errors. """
    all_images = []
    for start in range(0, len(tag_digest_list), BATCH_SIZE):
        response = ecr.batch_get_image(
            repositoryName=repo,
            imageIds=tag_digest_list[start:start + BATCH_SIZE])
        if response['failures']:
            # not even sure this could ever happen as we're
            # feeding it directly from ecr.list_images()...
            logging.warning(
                'Some Docker images could not be fetched:\n %s',
                response['failures'])
        all_images.extend(response['images'])
    return all_images

def get_digests_and_tags(ecr, repo):
    """ Iterate over the paginated output of ecr.list_images for
        our repository; return a list of
        {"imageTag": "foo", "imageDigest": "bar"} dicts """
    try:
        response_images = []
        paginator = ecr.get_paginator('list_images')
        for page in paginator.paginate(
                repositoryName=repo,
                filter={'tagStatus': 'TAGGED'}):
            response_images.extend(page['imageIds'])
    except Exception:
        logging.exception(
            'Failed to fetch images for {0} from ECR'.format(repo))
        raise
    return response_images

def main(argv):
    repo = argv[1]
    ecr = boto3.client('ecr')
    tag_digest_list = get_digests_and_tags(ecr, repo)
    all_images = get_all_images(ecr, repo, tag_digest_list)
    equivs = get_equivalent_tags(all_images)
    print(json.dumps(equivs, indent=2))

if __name__ == '__main__':
    main(sys.argv)

This code dumps out a JSON object with the current mapping of environment tag to build tag:

$ oe/ecs-demo
{
  "staging": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11",
  "production": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11",
  "testing": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12"
}

The last problem to solve is updating the tags as the image moves along the pipeline. We could, of course, use the standard Docker tools to download the image, tag it, and push it back up:

$ eval $(aws ecr get-login)
$ docker pull oe/ecs-demo:2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12
$ docker tag oe/ecs-demo:2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12 oe/ecs-demo:staging
$ docker push oe/ecs-demo:staging

But pulling down a multi-gigabyte container image just to change a few bytes of tag data is slow and wasteful. Surely, there’s a better way?

As it happens, there is! The new v2.2 manifest format for Docker images finally separated the tag text from the secure hash of the image layers, and the most recent version of the ECR API lets us push up new image manifests and specify tags to apply to them as part of the request via the ecr.put_image API. All this means that we can easily create new manifests with the same contents but different tags, without having to actually pull down the layers themselves.
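The reason this works is that a registry identifies an image by the SHA-256 digest of its manifest bytes, and a schema 2 manifest contains no tag field. The sketch below illustrates that with a hypothetical, trimmed manifest (placeholder config digest, no layers) and a toy dict standing in for the registry's tag table; it is an illustration of the principle, not ECR's implementation.

```python
import hashlib
import json


def manifest_digest(raw_manifest):
    """Content digest of a manifest: sha256 over its exact bytes."""
    return 'sha256:' + hashlib.sha256(raw_manifest).hexdigest()


# A hypothetical, trimmed schema 2 manifest; the config digest is a placeholder
# and real manifests list layers. Note there is no tag field anywhere in it.
raw = json.dumps({
    'schemaVersion': 2,
    'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
    'config': {'mediaType': 'application/vnd.docker.container.image.v1+json',
               'size': 1234, 'digest': 'sha256:' + '0' * 64},
    'layers': [],
}, indent=3).encode('utf-8')

# Toy registry: a tag is only a name pointing at a digest.
tags = {}
tags['2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12'] = manifest_digest(raw)
tags['staging'] = manifest_digest(raw)  # same bytes pushed again under a new tag

print(tags['staging'] == tags['2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12'])  # True
```

Because the digest covers the exact bytes, a promotion tool should push back the manifest string it fetched, unmodified; re-serializing the JSON with different whitespace or key order would produce a different digest.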

So, after we know the current mapping, updating the image’s tags is a matter of finding out the SHA256 hash that the build tag currently maps to, grabbing the manifest for that hash from ECR, and pushing the manifest back up to ECR using the ECR put_image API and setting a new tag there. Building on the code above:

import logging

def push_new_image(ecr, repo, manifest, dst_env_tag):
    """ Attach the desired tag to an existing image in our
        repository by creating a new image with the exact
        same manifest but a different tag name. """
    response = ecr.put_image(
        repositoryName=repo,
        imageManifest=manifest,  # same manifest
        imageTag=dst_env_tag)    # new tag!
    return response['image']['imageId']

def get_img_manifest(build_tag, all_images):
    """ Iterate over the output of ecr.batch_get_image();
        return the manifest of the first image with
        an imageTag matching build_tag. """
    for image in all_images:
        if image['imageId'].get('imageTag') == build_tag:
            return image['imageManifest']
    raise LookupError('Manifest not found for {0}'.format(build_tag))

def promote_tag(ecr, repo, equivs, all_images, src_env_tag, dst_env_tag):
    """ Update the state of our Docker repo, telling it that the destination
        environment tag should now be associated with the same image as the
        source environment tag. """
    # "equivs" is the mapping of environment tags to build tags, from
    # get_equivalent_tags(), above
    build_tag = equivs[src_env_tag]
    manifest = get_img_manifest(build_tag, all_images)
    response = push_new_image(ecr, repo, manifest, dst_env_tag)
    logging.info('Created new image: \n%s', response)
    return response

This code:

1. Finds the build tag that points at the same image digest as our source environment tag, using the “equivs” dict built by get_equivalent_tags() in the previous code example.
2. Gets the image manifest for that build tag.
3. Pushes that manifest back up to ECR using put_image, but using the imageTag attribute to attach the name of the destination environment to the manifest.
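The three steps can be exercised offline against a stand-in registry. In this sketch, `FakeECR` is a hypothetical in-memory class that mimics only the parameters and response keys of the two ECR calls used, the manifests are toy strings, and `promote` is a simplified variant of `promote_tag`: it skips the build-tag lookup and asks `batch_get_image` for the source environment tag directly, which that API also accepts.

```python
import hashlib


class FakeECR(object):
    """Hypothetical in-memory stand-in for the two boto3 ECR calls used below.
    It mimics only the request parameters and response keys this sketch reads."""

    def __init__(self):
        self.manifests = {}  # digest -> manifest string
        self.tags = {}       # tag -> digest (a tag points at exactly one image)

    def put_image(self, repositoryName, imageManifest, imageTag):
        digest = 'sha256:' + hashlib.sha256(imageManifest.encode('utf-8')).hexdigest()
        self.manifests[digest] = imageManifest
        self.tags[imageTag] = digest  # moves the tag if it already existed
        return {'image': {'repositoryName': repositoryName,
                          'imageManifest': imageManifest,
                          'imageId': {'imageDigest': digest, 'imageTag': imageTag}}}

    def batch_get_image(self, repositoryName, imageIds):
        images = []
        for image_id in imageIds:
            digest = self.tags[image_id['imageTag']]
            images.append({'imageId': {'imageTag': image_id['imageTag'],
                                       'imageDigest': digest},
                           'imageManifest': self.manifests[digest]})
        return {'images': images, 'failures': []}


def promote(ecr, repo, src_env_tag, dst_env_tag):
    """Point dst_env_tag at the image src_env_tag points at, without pulling layers."""
    # Fetch the manifest the source tag currently resolves to...
    response = ecr.batch_get_image(repositoryName=repo,
                                   imageIds=[{'imageTag': src_env_tag}])
    manifest = response['images'][0]['imageManifest']
    # ...and push the identical string back under the destination tag.
    result = ecr.put_image(repositoryName=repo, imageManifest=manifest,
                           imageTag=dst_env_tag)
    return result['image']['imageId']


repo = 'oe/ecs-demo'
b11 = '2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11'
b12 = '2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12'
ecr = FakeECR()
# Toy manifests standing in for the two real images.
for tag, manifest in [(b11, '{"build": 11}'), ('staging', '{"build": 11}'),
                      ('production', '{"build": 11}'),
                      (b12, '{"build": 12}'), ('testing', '{"build": 12}')]:
    ecr.put_image(repositoryName=repo, imageManifest=manifest, imageTag=tag)

promote(ecr, repo, 'testing', 'staging')
print(ecr.tags['staging'] == ecr.tags[b12])  # True
```

Against real ECR, pushing a manifest under a tag that already points at that exact image raises `ImageAlreadyExistsException`, so a production tool may want to treat that case as a no-op.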

After promote_tag() runs, the new mapping is visible to any client that looks it up. For example, if we promoted build #12 from testing to staging:

$ oe/ecs-demo
{
  "staging": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12",
  "production": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-11",
  "testing": "2016.08.24T17.13.38Z.5ad95f2-ecs-demo-12"
}

Of course, that doesn’t actually deploy any code; it just moves the tags around to reflect what the deployment system has done. How that part works might be material for another post.


By using the ecr.put_image() API and ECR support for the v2.2 Docker Manifest format, we’ve implemented state tracking for our containers using nothing but the Docker tagging system. There’s no external state database to keep synced up: the intended state of each image can be found by using ecr.batch_get_image().

If you have questions or suggestions, please comment below.

How to Automate Container Instance Draining in Amazon ECS

by Chris Barclay | in Amazon ECS

My colleague Madhuri Peri sent a nice guest post that describes how to use container instance draining to remove tasks from an instance before scaling down a cluster with Auto Scaling groups.

There are times when you might need to remove an instance from an Amazon ECS cluster; for example, to perform system updates, update the Docker daemon, or scale down the cluster size. Container instance draining enables you to remove a container instance from a cluster without impacting tasks in your cluster. It works by preventing new tasks from being scheduled for placement on the container instance while it is in the DRAINING state, replacing service tasks on other container instances in the cluster if the resources are available, and enabling you to wait until tasks have successfully moved before terminating the instance.

You can change a container instance’s state to DRAINING manually, but in this post, I demonstrate how to use container instance draining with Auto Scaling groups and AWS Lambda to automate the process.
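For reference, the manual step boils down to two ECS API calls. This minimal sketch assumes you pass in a boto3 ECS client (or any object with the same two methods); `drain_instance` is a hypothetical helper name, and it uses the cluster query language's `ec2InstanceId` attribute to find the container instance.

```python
def drain_instance(ecs, cluster, ec2_instance_id):
    """Set the ECS container instance backing ec2_instance_id to DRAINING.

    `ecs` is assumed to be a boto3 ECS client or an object with the same methods.
    Returns the status that ECS reports for each updated container instance.
    """
    # Translate the EC2 instance ID into the container instance ARN.
    arns = ecs.list_container_instances(
        cluster=cluster,
        filter='ec2InstanceId == {0}'.format(ec2_instance_id))['containerInstanceArns']
    if not arns:
        raise LookupError('{0} is not registered in {1}'.format(ec2_instance_id, cluster))
    # Stop new placements on the instance and let the service scheduler move tasks.
    response = ecs.update_container_instances_state(
        cluster=cluster, containerInstances=arns, status='DRAINING')
    return [ci['status'] for ci in response['containerInstances']]
```

With a real client this would be called as `drain_instance(boto3.client('ecs'), 'my-cluster', 'i-0123456789abcdef0')`, where the cluster name and instance ID are placeholders.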

Amazon ECS overview

Amazon ECS is a container management service that makes it easy to run, stop, and manage Docker containers on a cluster (a logical grouping of EC2 instances). When you run tasks using ECS, you place them on a cluster. Amazon ECS downloads your container images from a registry that you specify, and runs those images on the container instances within your cluster.

Using the container instance draining state

Auto Scaling groups support lifecycle hooks that can be invoked to allow custom processes to finish before instances launch or terminate. For this example, the lifecycle hook invokes a Lambda function that performs two tasks:

  1. Sets the ECS container instance state to DRAINING.
  2. Checks if there are any tasks left on the container instance. If tasks are still in the process of draining, it posts a message to SNS so that the Lambda function is called again.

Lambda repeats step 2 until there are no tasks running on the container instance or the heartbeat timeout on the lifecycle hook is reached (set to 15 minutes in the sample CloudFormation template), whichever occurs first. Afterward, control returns to the Auto Scaling lifecycle hook, and the instance terminates. This process is shown in the following diagram:

Try it out!

Use the CloudFormation template to set up the resources described in this post. To use the CloudFormation template you will need to upload the Lambda deployment package to an S3 bucket in your account. This template creates the following resources:

  • The VPC and associated network elements (subnets, security groups, route table, etc.)
  • An ECS cluster, ECS service, and sample ECS task definition
  • An Auto Scaling group with two EC2 instances and a termination lifecycle hook
  • A Lambda function
  • An SNS topic
  • IAM roles for Lambda to execute
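The sample template creates the termination lifecycle hook for you. If you wanted to attach an equivalent hook to an existing group from code instead, a minimal boto3-style sketch might look like the following. The hook name, topic ARN, and role ARN are placeholders, and `DefaultResult='CONTINUE'` is an assumption about the desired behavior, not something taken from the template.

```python
def add_termination_hook(autoscaling, asg_name, topic_arn, role_arn,
                         hook_name='ecs-drain-hook', timeout_seconds=900):
    """Register a termination lifecycle hook that notifies an SNS topic.

    `autoscaling` is assumed to be a boto3 Auto Scaling client; hook_name and
    the ARNs are placeholders you supply.
    """
    autoscaling.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleTransition='autoscaling:EC2_INSTANCE_TERMINATING',
        NotificationTargetARN=topic_arn,   # SNS topic that triggers the Lambda function
        RoleARN=role_arn,                  # lets Auto Scaling publish to that topic
        HeartbeatTimeout=timeout_seconds,  # 900 seconds = 15 minutes
        DefaultResult='CONTINUE')          # let termination proceed if the timeout expires
    return hook_name
```

With a real client: `add_termination_hook(boto3.client('autoscaling'), 'my-asg', topic_arn, role_arn)`.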

Create the CloudFormation stack and then see how this works by triggering an instance termination event.

In the Amazon EC2 console, choose Auto Scaling Groups and select the name of the Auto Scaling group created by CloudFormation (from the resources section of the CloudFormation template).

Choose Actions, Edit, and reduce the group’s desired capacity by 1. This starts the termination process for one of the instances.

Select the Auto Scaling group Instances tab; one instance state value should show the lifecycle state “Terminating:Wait”.

This is when the lifecycle hook gets activated and posts a message to SNS. The Lambda function is then executed in response to the SNS message trigger.

The Lambda function changes the ECS container instance state to DRAINING. The ECS service scheduler then stops the tasks on the instance and starts replacement tasks on other available instances.

You can go to the ECS console to confirm that the container instance state is DRAINING.

After the tasks have drained, the Auto Scaling group activity history confirms that the EC2 instance is terminated.
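The console walkthrough above can also be scripted. This sketch assumes a boto3 Auto Scaling client; `scale_in_by_one` is a hypothetical helper, and which instance gets terminated is still decided by the group's termination policy, after which the lifecycle hook takes over as described.

```python
def scale_in_by_one(autoscaling, asg_name):
    """Lower the group's desired capacity by one, like the console steps above.

    `autoscaling` is assumed to be a boto3 Auto Scaling client. Returns the new
    desired capacity.
    """
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]
    new_capacity = group['DesiredCapacity'] - 1
    if new_capacity < group['MinSize']:
        # The API would reject this too; fail early with a clearer message.
        raise ValueError('desired capacity would drop below MinSize')
    autoscaling.set_desired_capacity(AutoScalingGroupName=asg_name,
                                     DesiredCapacity=new_capacity,
                                     HonorCooldown=False)
    return new_capacity
```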

How it works

Take a moment to see the inner workings of the Lambda function. The function first checks to see if the event received has a LifecycleTransition value matching autoscaling:EC2_INSTANCE_TERMINATING.

# If the event received is instance terminating...
if 'LifecycleTransition' in message.keys():
    print("message autoscaling {}".format(message['LifecycleTransition']))
    if message['LifecycleTransition'].find('autoscaling:EC2_INSTANCE_TERMINATING') > -1:

If there is a match, it proceeds to call the function “checkContainerInstanceTaskStatus”. This function looks up the container instance that corresponds to the EC2 instance ID received and sets its state to ‘DRAINING’.

# Get lifecycle hook name
lifecycleHookName = message['LifecycleHookName']
print("Setting lifecycle hook name {} ".format(lifecycleHookName))

# Check if there are any tasks running on the instance
tasksRunning = checkContainerInstanceTaskStatus(Ec2InstanceId)

It then checks to see if there are tasks running on the instance. If there are tasks, it publishes a message to the SNS topic to trigger the Lambda function again and then exits.

# Use the task ARNs to describe the tasks
# (describe_tasks needs at least one task ARN, so only call it when there are tasks)
if listTaskResp['taskArns']:
    descTaskResp = ecsClient.describe_tasks(cluster=clusterName, tasks=listTaskResp['taskArns'])
    for key in descTaskResp['tasks']:
        print("Task status {}".format(key['lastStatus']))
        print("Container instance ARN {}".format(key['containerInstanceArn']))
        print("Task ARN {}".format(key['taskArn']))

    # Check if any tasks are running
    if len(descTaskResp['tasks']) > 0:
        print("Tasks are still running..")
        return 1
print("NO tasks are on this instance {}..".format(Ec2InstanceId))
return 0

When the Lambda function sees that no more tasks are running on the container instance, it proceeds to complete the lifecycle hook and terminate the EC2 instance.

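try:
    # Complete the lifecycle hook so Auto Scaling can terminate the instance.
    response = asgClient.complete_lifecycle_action(
        LifecycleHookName=lifecycleHookName,
        AutoScalingGroupName=message['AutoScalingGroupName'],
        LifecycleActionResult='CONTINUE',
        InstanceId=Ec2InstanceId)
    print("Response = {}".format(response))
    print("Completed lifecycle hook action")
except Exception as e:
    print(str(e))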
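Putting the excerpts together, one pass of the drain loop can be sketched as a single function. This is a simplified reconstruction, not the function from the sample: it assumes `message` is the already-parsed lifecycle notification (in Lambda, `json.loads(event['Records'][0]['Sns']['Message'])`), that the clients are boto3 ECS, SNS, and Auto Scaling clients, and that `cluster` and `topic_arn` are supplied by you.

```python
import json

TERMINATING = 'autoscaling:EC2_INSTANCE_TERMINATING'


def handle_lifecycle_message(message, ecs, sns, autoscaling, cluster, topic_arn):
    """One pass of the drain loop for a parsed lifecycle notification.

    Returns 'ignored', 'waiting' (tasks remain, message re-published),
    or 'completed' (lifecycle action completed, termination proceeds).
    """
    if message.get('LifecycleTransition') != TERMINATING:
        return 'ignored'  # e.g. the test notification Auto Scaling sends
    instance_id = message['EC2InstanceId']
    arns = ecs.list_container_instances(
        cluster=cluster,
        filter='ec2InstanceId == ' + instance_id)['containerInstanceArns']
    if arns:
        # Ask ECS to stop scheduling here and move service tasks elsewhere.
        ecs.update_container_instances_state(
            cluster=cluster, containerInstances=arns, status='DRAINING')
        running = ecs.list_tasks(cluster=cluster,
                                 containerInstance=arns[0])['taskArns']
        if running:
            # Re-publish the same notification so SNS invokes the function again.
            sns.publish(TopicArn=topic_arn, Message=json.dumps(message))
            return 'waiting'
    # No container instance or no tasks left: let Auto Scaling terminate it.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=message['LifecycleHookName'],
        AutoScalingGroupName=message['AutoScalingGroupName'],
        LifecycleActionResult='CONTINUE',
        InstanceId=instance_id)
    return 'completed'
```

Each 'waiting' pass triggers another invocation through SNS; the lifecycle hook's heartbeat timeout still bounds how long Auto Scaling waits overall.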


Container instance draining simplifies cluster scale-down and operational activities such as new AMI rollouts. For example, with the integration described in this post, you could use CloudFormation and CodePipeline to create a rolling deployment that launches new instances and terminates instances in batches.

To learn more about container instance draining, see the Amazon ECS Developer Guide.

If you have questions or suggestions, please comment below.

Seamlessly Scale Predictions with AWS Lambda and MXNet

by Bryan Liston

Sunil Mallya, Solutions Architect

Building AI solutions at scale can be challenging. In this post, we look at how to use AWS Lambda and MXNet to build a scalable prediction pipeline.

Companies that leverage machine and deep learning invest in much more than just training models. They have sophisticated pipelines that include the following stages:

  • Data storage
  • Pre-processing
  • Feature extraction
  • Model generation
  • Model analysis
  • Feature engineering
  • Evaluation and feedback


Each stage of the pipeline requires:

  • Elasticity to adapt to changing workload demands
  • Scalability to adapt well to the overall size of the workload
  • Cost effectiveness to optimize total cost of ownership (TCO)

Amazon S3 meets all of the requirements for data and model storage. But the unpredictability of user demand and location can make scaling up for batch predictions (or the results of the model analysis) challenging, and can affect the overall user experience. In this post, we show how to use MXNet and AWS Lambda to deploy models at scale for predictions.

What is MXNet?

MXNet is a full-featured, flexibly programmable, and highly scalable deep learning framework that supports state-of-the-art deep models, including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). It is the result of collaboration between researchers at several top universities, including its founding institutions, the University of Washington and Carnegie Mellon University.

As discussed in Werner Vogels’s post MXNet – Deep Learning Framework of Choice at AWS, not only is MXNet scalable for multi-instance training, it also scales down to a variety of devices and small memory footprints, even when serving predictions on very large models. MXNet is available as open source under the Apache License, Version 2.0.

Challenges with the prediction pipeline

As previously mentioned, ML model training and validation are just a small part of the story. After the model is built, the real work begins. To serve millions of customers seamlessly, every application must scale. However, scaling the prediction portion of the pipeline can be challenging and expensive, especially when end users are geographically dispersed.

Delivering model updates, deploying globally, and maintaining high availability can be difficult. Lambda is a very good deployment option for the prediction pipeline, and it is also well suited to serverless web applications, real-time batch processing, and MapReduce tasks.

How can you leverage Lambda?

“Code, test, deploy, and let the service do the heavy lifting” is the Lambda customer motto. Regardless of your traffic patterns—high concurrency or bursty workloads—Lambda scales to service your needs.

We compiled and built the MXNet libraries to demonstrate how Lambda scales the prediction pipeline to provide this ease and flexibility for machine learning or deep learning model prediction. We built a sample application that predicts image labels using an 18-layer deep residual network. The model architecture is based on ResNet (deep residual networks), the model family that won the 2015 ImageNet competition. This architecture produces state-of-the-art results for problems like image classification.

Putting it all together

Lambda has a deployment package limit of 50 MB. This limit means that you might not always be able to package your models along with the code. To accommodate this limitation, you can use S3 to store the model, and download the model when you service the request.

For optimal performance, download and load the model outside of the lambda_handler function so that it persists in memory across requests. On subsequent invocations served by the same Lambda container, MXNet uses the model that’s already in memory instead of fetching it again. For more information about this optimization, see AWS Lambda: How It Works.
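The pattern is easy to see in isolation. In this sketch, `get_model` is a hypothetical stand-in for the S3 download and MXNet bind, and the counter exists only to make reuse visible. It initializes lazily on the first call rather than at import time; either way the object lives in module scope, so a warm container reuses it, while a cold start runs the module code again.

```python
_model = None   # module scope: survives between invocations in a warm container
load_count = 0  # only here to make the reuse visible


def get_model():
    """Hypothetical stand-in for the S3 download + mx.mod.Module bind step."""
    global _model, load_count
    if _model is None:
        load_count += 1
        _model = {'name': 'resnet-18'}  # placeholder for the bound module
    return _model


def lambda_handler(event, context):
    model = get_model()  # expensive only on the first call in this container
    return {'statusCode': 200, 'body': model['name']}


for _ in range(3):
    lambda_handler({}, None)
print(load_count)  # 1
```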

The following reference Lambda function for prediction is quite simple. It loads the model, downloads an image from the specified URL, transforms the image into an NDArray, and uses the model to make a prediction that outputs the top five labels with their associated probabilities. The implementation code is provided below.

import json
import tempfile
from collections import namedtuple
from urllib.request import urlopen  # Python 3; the original sample used urllib2

import boto3
import cv2
import mxnet as mx
import numpy as np

Batch = namedtuple('Batch', ['data'])
f_params = 'resnet-18-0000.params'
f_symbol = 'resnet-18-symbol.json'

bucket = 'my-model-bucket'
s3_client = boto3.client('s3')

# Download the model files once per container, outside the handler.
f_params_file = tempfile.NamedTemporaryFile()
s3_client.download_file(bucket, f_params, f_params_file.name)

f_symbol_file = tempfile.NamedTemporaryFile()
s3_client.download_file(bucket, f_symbol, f_symbol_file.name)

def load_model(s_fname, p_fname):
    symbol = mx.symbol.load(s_fname)
    save_dict = mx.nd.load(p_fname)
    arg_params = {}
    aux_params = {}
    for k, v in save_dict.items():
        tp, name = k.split(':', 1)
        if tp == 'arg':
            arg_params[name] = v
        if tp == 'aux':
            aux_params[name] = v
    return symbol, arg_params, aux_params

def predict(url, mod, synsets):
    req = urlopen(url)
    arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
    cv2_img = cv2.imdecode(arr, cv2.IMREAD_COLOR)  # always 3-channel BGR
    if cv2_img is None:  # the URL did not return a decodable image
        return None
    img = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)  # HWC -> CHW
    img = img[np.newaxis, :]      # add batch axis -> (1, 3, 224, 224)

    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]
    out = ''

    for i in a[0:5]:
        out += 'probability=%f, class=%s\n' % (prob[i], synsets[i])

    return out

with open('synset.txt', 'r') as f:
    synsets = [l.rstrip() for l in f]

# Load and bind the model once per container, so warm invocations reuse it.
sym, arg_params, aux_params = load_model(f_symbol_file.name, f_params_file.name)
mod = mx.mod.Module(symbol=sym)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

def lambda_handler(event, context):
    url = ''
    if event['httpMethod'] == 'GET':
        url = event['queryStringParameters']['url']
    elif event['httpMethod'] == 'POST':
        data = json.loads(event['body'])
        url = data['url']

    print("image url: ", url)
    labels = predict(url, mod, synsets)
    out = {
        "headers": {
            "content-type": "application/json",
            "Access-Control-Allow-Origin": "*"
        },
        # json.dumps escapes the newlines in `labels`, keeping the body valid JSON
        "body": json.dumps({"labels": labels}),
        "statusCode": 200
    }
    return out
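The two array manipulations in predict() are easy to check without MXNet or a real image. This NumPy-only sketch reuses the same two `swapaxes` calls and the same `argsort` ranking; `to_nchw` and `top_k` are hypothetical helper names, and the small non-square dummy array makes the axis order visible.

```python
import numpy as np


def to_nchw(img_hwc):
    """(H, W, C) image as OpenCV returns it -> (1, C, H, W) batch, using the same two swaps."""
    img = np.swapaxes(img_hwc, 0, 2)  # (H, W, C) -> (C, W, H)
    img = np.swapaxes(img, 1, 2)      # (C, W, H) -> (C, H, W)
    return img[np.newaxis, :]         # add the batch axis -> (1, C, H, W)


def top_k(prob, k=5):
    """Indices of the k largest probabilities, highest first (as in predict())."""
    return np.argsort(prob)[::-1][:k]


# A small non-square dummy image makes the axis order easy to check.
img = np.arange(4 * 6 * 3).reshape(4, 6, 3)
batch = to_nchw(img)
print(batch.shape)                                        # (1, 3, 4, 6)
print(np.array_equal(batch[0], img.transpose(2, 0, 1)))   # True
print(top_k(np.array([0.1, 0.5, 0.05, 0.3, 0.05]), k=2))  # [1 3]
```

The two swaps are equivalent to the single call `img.transpose(2, 0, 1)`, which some readers may find clearer.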

What can you expect in production?

Because the code must download the model from S3, download an image from the web, and run the image through the model, we benchmarked it to evaluate how it scales. We ran a local benchmark on a laptop in San Francisco, using wrk against the endpoint deployed in the US West (Oregon) Region (us-west-2). We observed an average latency of 1.18 seconds at a sustained rate of 75 requests per second, as shown in the following output.
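As a rough cross-check on what those wrk numbers imply, Little's law says the average number of requests in flight equals the arrival rate times the average time each request spends in the system. The sketch below applies it to the 75 requests per second and 1.18 seconds reported above; it assumes steady state, and because client-side latency includes network time, it likely overestimates the concurrent Lambda executions actually needed.

```python
def concurrent_executions(requests_per_second, avg_latency_seconds):
    """Little's law: average requests in flight = arrival rate x average time in system."""
    return requests_per_second * avg_latency_seconds


# Figures from the wrk run above: 75 requests/second at 1.18 s average latency.
print(round(concurrent_executions(75, 1.18), 1))  # 88.5
```

Roughly 89 concurrent executions is the figure to compare against your account's concurrency limit when planning for higher request rates.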


To benchmark global prediction latencies, we used Goad, a Lambda-based distributed load tester. We observed average latencies ranging from 1.2 seconds to 1.5 seconds. The following figure shows the observed latencies from various regions with the endpoint hosted in the US West (Oregon) Region (us-west-2).



For a state-of-the-art image label prediction model, these numbers are impressive. Average global latencies of 1.5 seconds or less make it worth exploring Lambda as part of your machine learning or deep learning pipeline for batch predictions. The libraries and code samples for deployment are available in the mxnet-lambda GitHub repo.

If you have questions or suggestions, please comment below.

Continuous Deployment to Amazon ECS using AWS CodePipeline, AWS CodeBuild, Amazon ECR, and AWS CloudFormation

by Chris Barclay | in Amazon ECS

Thanks to my colleague John Pignata for a great post on how to create a continuous deployment pipeline to Amazon ECS.

Delivering new iterations of software at a high velocity is a competitive advantage in today’s business environment. The speed at which organizations can deliver innovations to customers and adapt to changing markets is increasingly a pivotal attribute that can make the difference between success and failure.

AWS provides a set of flexible services designed to enable organizations to embrace the combination of cultural philosophies, practices, and tools called DevOps that increases an organization’s ability to deliver applications and services at high velocity.

In this post, I explore the DevOps practice called continuous deployment and outline a reference architecture to implement an automated deployment pipeline for applications delivered as Docker containers onto Amazon ECS using AWS CodePipeline, AWS CodeBuild, and AWS CloudFormation.

What is continuous deployment?

Agility is often cited as a key advantage of cloud computing over the traditional delivery of IT resources. Instead of waiting weeks or months for other departments to provision a new server, developers can create new instances with a click or an API call and start using them within minutes. This newfound speed and autonomy frees developers to experiment and deliver new products and features to their customers as quickly as possible.

On top of the cloud, teams are embracing DevOps practices in order to achieve a faster time-to-market, better code quality, and more reliable releases of their products and services. Continuous deployment is a DevOps practice in which new software revisions are automatically built, tested, packaged, and released to production.

Continuous deployment enables developers to ship features and fixes through an entirely automated software release process. Instead of batching up large releases over a period of weeks or months and conducting deployments manually, developers can use automation to deliver versions of their applications many times a day as new software revisions are ready for users. In the same way cloud computing abbreviates the delivery time of resources, continuous deployment reduces the release cycle of new software to your users from weeks or months to minutes.

Embracing this speed and agility has many benefits including:

  • New features and bug fixes are released to users quickly; code sitting in a source code repository does not deliver business value or benefit your customers. By releasing new software revisions as close to immediately as possible, customers start benefiting from your work more quickly and teams can get more focused feedback.
  • Change sets are smaller; large change sets create challenges in pinpointing root causes of issues, bugs, and other regressions. By releasing smaller change sets more frequently, teams can more easily attribute and correct introduced issues.
  • Automated deployment encourages best practices; as any change committed to your source code repository can be deployed immediately via automation, teams have to ensure that changes are well-tested and that their production environments are closely monitored.

How does continuous deployment work?

Continuous deployment is conducted by an automated pipeline that coordinates the activities related to software release and provides visibility into the process. During the process, a releasable artifact is built, tested, packaged, and deployed into a production environment. The releasable artifact might be an executable file, a package of script files, a container, or some other component that ultimately must be delivered to production.

AWS CodePipeline is a continuous delivery and deployment service that coordinates the building, testing, and deployment of your code each time there is a new software revision. CodePipeline provides visible, central orchestration for taking a code change and moving it through a workflow and ultimately into the hands of your users. The pipeline defines stages to retrieve code from a source code repository, build the source code into a releasable artifact, test that artifact, and deliver it to production while ensuring that these stages happen in order and are halted if a failure occurs.

While CodePipeline powers the delivery pipeline and orchestrates the process, it does not have facilities for building or testing the software itself. For these stages, CodePipeline integrates with several other tools, including AWS CodeBuild, which is a fully managed build service. CodeBuild compiles source code, runs tests, and produces software packages that are ready to deploy. That makes it ideal for the build and test stages of a continuous deployment pipeline. Out of the box, CodeBuild has native support for many different kinds of build environments, including building Docker containers.

Containers are a powerful mechanism for software delivery, as they allow for a predictable and reproducible environment and provide a high level of confidence that changes tested in one environment can be successfully deployed. AWS provides several services to run and manage Docker container images. Amazon ECS is a highly scalable and high performance container management service that allows you to run applications on a cluster of Amazon EC2 instances. Amazon ECR is a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images.

Finally, CodePipeline integrates with several services to facilitate deployment, including AWS Elastic Beanstalk, AWS CodeDeploy, AWS OpsWorks, and your own custom deployment code or process using AWS Lambda or AWS CloudFormation. These deployment actions can be used to power the final step in your pipeline to push the newly built changes live onto your production environment.

Continuous deployment to Amazon ECS

Here’s a reference architecture that puts these components together to deliver a continuous deployment pipeline of Docker applications onto ECS:

This architecture demonstrates how to use CodePipeline to build a fully automated continuous deployment pipeline on AWS that pushes container images to ECR and deploys them onto ECS. This approach to continuous deployment is entirely serverless and uses managed services for the orchestration, build, and deployment of your software.

The pipeline created in the reference architecture looks like the following:

In this post, I discuss each stage in this reference architecture. What happens when a developer changes some copy on a landing page and pushes that change into the source code repository?

First, in the Source stage, the pipeline is configured with details for accessing a source code repository system. In the reference architecture, you have a sample application hosted in a GitHub repository. CodePipeline polls this repository and initiates a new pipeline execution for each new commit. In addition to GitHub, CodePipeline also supports source locations such as a Git repository in AWS CodeCommit or a versioned object stored in Amazon S3. Each new revision is retrieved from the source code repository, packaged as a zip file, stored on S3, and sent to the next stage of the pipeline.

The Source stage also defines a template artifact stored on Amazon S3. This is the template that defines the deployment environment used by the deployment stage after a successful build of the application.

The Build stage uses CodeBuild to create a new Docker container image based upon the latest source code and pushes it to an ECR repository. CodePipeline also integrates with a number of third-party build systems, such as Jenkins, CloudBees, Solano CI, and TeamCity.

Finally, the Deploy stage uses CloudFormation to create a new task definition revision that points to the newly built Docker container image and updates the ECS service to use the new task definition revision. After this is done, ECS starts a deployment: it pulls the new container image from ECR and replaces the service’s running tasks with tasks based on the new revision.
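The reference architecture performs this step through CloudFormation. To show what that amounts to at the API level, here is a sketch of the same two operations done directly against ECS, which is a swapped-in technique rather than what the pipeline runs. It assumes a boto3 ECS client; `deploy_image`, the `CARRY_OVER` field list, and all names passed in are illustrative.

```python
# Optional task definition fields carried over to the new revision when present.
CARRY_OVER = ('taskRoleArn', 'executionRoleArn', 'networkMode', 'volumes',
              'placementConstraints', 'requiresCompatibilities', 'cpu', 'memory')


def deploy_image(ecs, cluster, service, family, container_name, image_uri):
    """Register a new revision of `family` running image_uri and point the service at it.

    `ecs` is assumed to be a boto3 ECS client. Returns the new task definition ARN.
    """
    # A bare family name resolves to its latest ACTIVE revision.
    current = ecs.describe_task_definition(taskDefinition=family)['taskDefinition']
    containers = current['containerDefinitions']
    for container in containers:
        if container['name'] == container_name:
            container['image'] = image_uri  # the newly built image in ECR
    extra = {key: current[key] for key in CARRY_OVER if key in current}
    new_arn = ecs.register_task_definition(
        family=family, containerDefinitions=containers,
        **extra)['taskDefinition']['taskDefinitionArn']
    # Updating the service starts ECS's rolling replacement of its tasks.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_arn)
    return new_arn
```

Driving the change through CloudFormation, as the reference architecture does, has the advantage that the task definition and service stay described in one versioned template.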

After all of the pipeline’s stages are green, you can reload the application in a web browser and see the developer’s copy changes live in production. This happened automatically without any human intervention.

This pipeline is now in production, listening for new code in the source code repository, and ready to ship any future changes that your team pushes into production. It’s also extensible, meaning that new stages can be added to include additional steps. For example, you could include a test stage to execute unit and acceptance tests to ensure the new code revision is safe to deploy to production. After it’s deployed, a notification step could be added to alert your team via email or a Slack channel that a new version is live, along with the details about the change set deployed to production.


We’re excited to see what kinds of applications you can deliver to your users using this approach and how it affects your product development processes. The cloud unlocks massive advantages in agility, and the ability to implement techniques like continuous deployment unlocks a significant competitive advantage.

You’ll find an AWS CloudFormation template with everything necessary to spin up your own continuous deployment pipeline at the AWS Labs EC2 Container Service – Reference Architecture: Continuous Deployment repo on GitHub. If you have any questions, feedback, or suggestions, please let us know!