Use AWS CloudFormation Stack Termination Protection and Rollback Triggers to Maintain Infrastructure Availability

Managing your infrastructure as code using AWS CloudFormation provides a consistent way to rapidly deliver AWS environments for your applications. As your pace of delivery increases, it’s important to ensure you have the appropriate guardrails to protect your most critical infrastructure resources.

AWS CloudFormation now includes two additional tools to help you ensure the consistent health and stability of your application environments:

Stack Termination Protection provides a low-friction mechanism to quickly protect stacks that contain critical resources.
Rollback Triggers allow you to quickly revert infrastructure changes that are having a negative impact on the performance of your applications.

In this post, I’m going to examine strategies for adding these new features to your infrastructure management tool belt.

Stack Termination Protection

Use the new Stack Termination Protection setting to prevent the accidental deletion of stacks that contain critical resources. When termination protection is enabled on a stack, AWS CloudFormation denies any delete action against that stack. This new feature gives you an extra layer of protection for stacks containing critical resources such as AWS IAM roles or AWS CloudTrail trails.

You can enable termination protection while creating a new stack using the AWS Command Line Interface (CLI), AWS APIs, or the AWS Management Console. In this post, I’m using the CloudFormation console to create a new stack. Under the Advanced section, next to Termination Protection, I’ve selected the Enable check box. This protects my stack, which contains a critical application deployment pipeline, from deletion.
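For the CLI route, a minimal sketch might look like the following. The function name `create_protected_stack`, the stack name, and the template file are placeholders chosen for illustration; `--enable-termination-protection` is the `create-stack` option that turns protection on at creation time.

```shell
# Hypothetical helper: create a stack with termination protection enabled.
# Usage: create_protected_stack <stack-name> <template-file>
create_protected_stack() {
  local stack_name="$1" template_file="$2"
  aws cloudformation create-stack \
    --stack-name "$stack_name" \
    --template-body "file://$template_file" \
    --enable-termination-protection
}
```

A template that creates IAM resources would also need the appropriate `--capabilities` value on this command.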

After you create your stack, you can verify that protection is on by checking for the stack termination protection icon in the Overview section of your stack. If you are using nested stacks, termination protection cascades down to the nested stacks of the parent, so you don’t need to manage protection individually on each stack.

In my case, I’ve enabled termination protection for the stack that contains the AWS CodePipeline pipeline that deploys my application.
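The same check can be made from the CLI. This is a sketch only: the function name `termination_protection_status` is made up for this example, and it reads the `EnableTerminationProtection` field that `describe-stacks` returns for a stack, using the region from your CLI configuration.

```shell
# Hypothetical helper: print whether termination protection is enabled.
# Usage: termination_protection_status <stack-name>
termination_protection_status() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].EnableTerminationProtection' \
    --output text
}
```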

With termination protection enabled, an attempt to delete the stack fails and you see a message explaining that the stack is protected.

You can add or remove termination protection from existing stacks using a simple API call. Control access to this API operation using IAM permissions. Make sure protection is removed only when you’re ready to delete the stack.
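As a rough sketch of that API call from the CLI, the helper below wraps `update-termination-protection`. The function name and the `on`/`off` argument are conventions invented for this example. The IAM action to allow or deny for this operation is `cloudformation:UpdateTerminationProtection`.

```shell
# Hypothetical helper: turn termination protection on or off for an existing stack.
# Usage: set_termination_protection <stack-name> on|off
set_termination_protection() {
  local stack_name="$1" mode="$2"
  case "$mode" in
    on)
      aws cloudformation update-termination-protection \
        --stack-name "$stack_name" \
        --enable-termination-protection
      ;;
    off)
      aws cloudformation update-termination-protection \
        --stack-name "$stack_name" \
        --no-enable-termination-protection
      ;;
    *)
      echo "usage: set_termination_protection <stack-name> on|off" >&2
      return 2
      ;;
  esac
}
```

Keeping the `off` path behind a separate, tightly scoped IAM permission is one way to make sure protection is removed deliberately.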

You can start adding termination protection to your critical stacks right now.

Rollback Triggers

We announced rollback triggers back in August, but I wanted to take a little time to revisit them in this context.

The rollback triggers functionality allows you to integrate application- and resource-level alarms from Amazon CloudWatch into the update process for your stacks. If a change to the stack causes any of the registered alarms to fire, CloudFormation immediately stops the update and rolls back to the last good state. You can include a monitoring window after all updates are complete to allow additional time for the change to stabilize. This window happens prior to the CloudFormation cleanup phase, allowing attribute changes and replaced resources to be quickly restored.

For an example of rollback triggers in action, this post starts with the reference architecture for containerized batch processing using Amazon ECS. The stack in this reference architecture contains an Amazon EC2 Container Service (ECS) cluster running an image processing service and the Amazon Simple Queue Service (SQS) queue that feeds it. To begin, follow the first three steps in the reference architecture’s instructions. When you come to step four, deploy your ECS service using the following simple CloudFormation template along with the parameter values from the stack that you created in step two:

batch-service.yml

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  TaskDefinition:
    Type: String
    Description: "ARN of an existing ECS Task Definition"
  ECSCluster:
    Type: String
    Description: "Existing ECS Cluster"
  ProcessCount:
    Type: Number
    Default: 1
    Description: "Number of processes to run"
Resources:
  service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref 'ECSCluster'
      DesiredCount: !Ref 'ProcessCount'
      TaskDefinition: !Ref 'TaskDefinition'
Outputs:
  ecsservice:
    Value: !Ref 'service'

You can put your parameter values into a JSON document to make it easier to quickly perform stack creation and updates:

batch-service-config.json

[
    {
      "ParameterKey": "TaskDefinition",
      "ParameterValue": "arn:aws:ecs:us-east-2:xxxxxxxxxxxx:task-definition/ecs-batch-processing-TaskDefinition-xxxxxxxxxxxx:1"
    },
    {
      "ParameterKey": "ProcessCount",
      "ParameterValue": "1"
    },
    {
      "ParameterKey": "ECSCluster",
      "ParameterValue": "ecs-batch-processing-ECSCluster-xxxxxxxxxxxxxx"
    }
]

Then, you can create your stack from the AWS CLI like this:

aws cloudformation create-stack --region us-east-2 \
  --stack-name ImageProcService \
  --template-body file://batch-service.yml \
  --parameters file://batch-service-config.json

You can skip step five in the reference architecture. We don’t need Auto Scaling for this example. After stack creation is completed, you should have a single batch-processing worker up and running and ready to receive .jpg files in the input bucket. Upload a few images to make sure it’s working.

Now you will make a breaking change to your batch service, but with a rollback trigger in place to protect your processing capabilities. As part of the stack you deployed in step two, CloudFormation created a CloudWatch alarm monitoring the SQS queue that feeds your batch workers. You can build a rollback trigger using this alarm. For example:

batch-service-rollbacktrigger.json

{
  "RollbackTriggers": [
    {
      "Arn": "arn:aws:cloudwatch:us-east-2:225704381548:alarm:SQSQueueDepth",
      "Type": "AWS::CloudWatch::Alarm"
    }
  ],
  "MonitoringTimeInMinutes": 10
}

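With this configuration, CloudFormation monitors the alarm during the update and for 10 more minutes after the deployment is completed. If the queue alarm goes into an alarm state at any point in that time, the update is rolled back.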
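If you would rather generate this file than write it by hand, the sketch below looks up the alarm ARN and writes the rollback configuration. Both function names are hypothetical, and the alarm name `SQSQueueDepth` is taken from the example ARN above, so the alarm in your own stack may be named differently. The range check reflects the 0 to 180 minute limit on `MonitoringTimeInMinutes`.

```shell
# Hypothetical helper: look up a CloudWatch alarm's ARN by name.
# Usage: lookup_alarm_arn <alarm-name> <region>
lookup_alarm_arn() {
  local alarm_name="$1" region="$2"
  aws cloudwatch describe-alarms \
    --region "$region" \
    --alarm-names "$alarm_name" \
    --query 'MetricAlarms[0].AlarmArn' \
    --output text
}

# Hypothetical helper: write a rollback configuration file for one alarm.
# Usage: write_rollback_config <alarm-arn> <minutes> <output-file>
write_rollback_config() {
  local alarm_arn="$1" minutes="$2" out_file="$3"
  # Reject anything that is not a non-negative whole number.
  case "$minutes" in
    ''|*[!0-9]*)
      echo "monitoring time must be a whole number" >&2
      return 2
      ;;
  esac
  # CloudFormation accepts a monitoring time of 0 to 180 minutes.
  if [ "$minutes" -gt 180 ]; then
    echo "monitoring time must be between 0 and 180 minutes" >&2
    return 2
  fi
  cat > "$out_file" <<EOF
{
  "RollbackTriggers": [
    {
      "Arn": "$alarm_arn",
      "Type": "AWS::CloudWatch::Alarm"
    }
  ],
  "MonitoringTimeInMinutes": $minutes
}
EOF
}
```

For this walkthrough, you could capture the result of `lookup_alarm_arn SQSQueueDepth us-east-2` in a variable and pass it to `write_rollback_config` with `10` and `batch-service-rollbacktrigger.json` as the remaining arguments.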

Reduce the number of processes to 0 by changing the ProcessCount value in your parameter file:

batch-service-config.json

[
    {
      "ParameterKey": "TaskDefinition",
      "ParameterValue": "arn:aws:ecs:us-east-2:xxxxxxxxxxxx:task-definition/ecs-batch-processing-TaskDefinition-xxxxxxxxxxxx:1"
    },
    {
      "ParameterKey": "ProcessCount",
      "ParameterValue": "0"
    },
    {
      "ParameterKey": "ECSCluster",
      "ParameterValue": "ecs-batch-processing-ECSCluster-xxxxxxxxxxxxxx"
    }
]

Now you can update your stack using the rollback trigger you defined earlier:

aws cloudformation update-stack --region us-east-2 \
  --stack-name ImageProcService \
  --template-body file://batch-service.yml \
  --parameters file://batch-service-config.json \
  --rollback-configuration file://batch-service-rollbacktrigger.json
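As an alternative to editing the parameter file, the sketch below changes only ProcessCount and reuses the template and the other parameter values already stored with the stack. The function name `scale_with_rollback` is invented for this example. `--use-previous-template` and `UsePreviousValue=true` are existing `update-stack` options, and the parameter names come from batch-service.yml.

```shell
# Hypothetical helper: change ProcessCount while keeping everything else as deployed.
# Usage: scale_with_rollback <stack-name> <process-count> <rollback-config-file>
scale_with_rollback() {
  local stack_name="$1" process_count="$2" rollback_file="$3"
  aws cloudformation update-stack --region us-east-2 \
    --stack-name "$stack_name" \
    --use-previous-template \
    --parameters \
      "ParameterKey=ProcessCount,ParameterValue=$process_count" \
      "ParameterKey=TaskDefinition,UsePreviousValue=true" \
      "ParameterKey=ECSCluster,UsePreviousValue=true" \
    --rollback-configuration "file://$rollback_file"
}
```

Calling `scale_with_rollback ImageProcService 0 batch-service-rollbacktrigger.json` would request the same breaking change as the command above.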

After this update reduces the number of ECS processes running in your stack to 0, there won’t be any workers to handle the queue. Upload at least five more .jpg files to the input bucket before the monitoring time ends (you have 10 minutes). Things will continue to run for a little while, but soon the queue depth crosses a critical threshold and the CloudWatch alarm is triggered. Because this alarm is registered as a rollback trigger, CloudFormation automatically begins rolling back the update when the alarm fires.
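To follow the rollback from a terminal, one option is to poll the stack status until it leaves an in-progress state. This is a sketch: the function name `watch_stack_status` and the default 15-second interval are arbitrary choices for this example. In this walkthrough, the status you would expect to end on is UPDATE_ROLLBACK_COMPLETE.

```shell
# Hypothetical helper: print the stack status until it is no longer in progress.
# Usage: watch_stack_status <stack-name> <region> [interval-seconds]
watch_stack_status() {
  local stack_name="$1" region="$2" interval="${3:-15}" status
  while :; do
    status=$(aws cloudformation describe-stacks \
      --region "$region" \
      --stack-name "$stack_name" \
      --query 'Stacks[0].StackStatus' \
      --output text) || return 1
    echo "$status"
    case "$status" in
      # Every in-flight CloudFormation status ends in _IN_PROGRESS.
      *_IN_PROGRESS) sleep "$interval" ;;
      *) return 0 ;;
    esac
  done
}
```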

Once the rollback is completed, batch processing is restored and the SQS queue will begin to drain, removing the alarm.
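To confirm the recovery, you could check the alarm state and the approximate queue depth from the CLI. The function names below are hypothetical, the alarm name `SQSQueueDepth` is taken from the example ARN earlier in this post, and the queue URL is something you would look up in your own account.

```shell
# Hypothetical helper: print the alarm state (OK, ALARM, or INSUFFICIENT_DATA).
# Usage: alarm_state <alarm-name> <region>
alarm_state() {
  local alarm_name="$1" region="$2"
  aws cloudwatch describe-alarms \
    --region "$region" \
    --alarm-names "$alarm_name" \
    --query 'MetricAlarms[0].StateValue' \
    --output text
}

# Hypothetical helper: print the approximate number of messages waiting in a queue.
# Usage: queue_depth <queue-url> <region>
queue_depth() {
  local queue_url="$1" region="$2"
  aws sqs get-queue-attributes \
    --region "$region" \
    --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}
```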

Use rollback triggers to monitor the state of your application during the stack creation and update process. You can specify the alarms and the thresholds you want AWS CloudFormation to monitor, and if any of the alarms goes into an alarm state, CloudFormation rolls back the entire stack operation to the previously deployed state.

You can start using rollback triggers today to let CloudFormation monitor critical metrics for your application in continuous integration/continuous deployment (CI/CD) pipelines and other deployment automation.

AWS Cloud Operations & Migrations Blog
