Recovering AWS CloudFormation stacks using ContinueUpdateRollback

AWS CloudFormation treats a stack as a collection of AWS resources that you can manage as a single unit. After you launch a stack, you can use the AWS CloudFormation console, API, or AWS CLI to update the resources in it. You should not change stack resources outside of CloudFormation, because doing so causes what we call resource drift: out-of-band changes to CloudFormation-managed resources that can cause errors when you later update or delete the stack.

This blog post walks you through examples of how to recover your stack from states such as UPDATE_ROLLBACK_FAILED. We show you how to regain control of the stack to perform further updates using CloudFormation without intervention from AWS Support.
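Out-of-band changes can often be spotted before they cause a failed rollback. The following is a minimal sketch, assuming the AWS CLI is installed and configured, that runs CloudFormation drift detection on a stack and lists the resources reported as modified or deleted. The function name, the REGION default, and the POLL_SECONDS variable are our own illustrative choices.

```shell
# Sketch only: REGION, POLL_SECONDS, and detect_drift are illustrative names.
REGION="${REGION:-eu-west-2}"

detect_drift() {
  local stack_name="$1" detection_id status

  # Start drift detection; the call returns an ID that we poll on.
  detection_id=$(aws cloudformation detect-stack-drift \
    --stack-name "$stack_name" --region "$REGION" \
    --query 'StackDriftDetectionId' --output text) || return 1

  # Poll until detection is no longer in progress.
  while :; do
    status=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$detection_id" --region "$REGION" \
      --query 'DetectionStatus' --output text) || return 1
    [ "$status" != "DETECTION_IN_PROGRESS" ] && break
    sleep "${POLL_SECONDS:-5}"
  done
  [ "$status" = "DETECTION_COMPLETE" ] || return 1

  # Print the logical ID and drift status of modified or deleted resources.
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "$stack_name" --region "$REGION" \
    --stack-resource-drift-status-filters MODIFIED DELETED \
    --query 'StackResourceDrifts[].[LogicalResourceId,StackResourceDriftStatus]' \
    --output text
}
```

You would call it as detect_drift <stack-name>. Empty output suggests that no supported resource has drifted. Not every resource type supports drift detection, so treat a clean result as a hint rather than a guarantee.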

UPDATE_ROLLBACK_FAILED

Out-of-band changes can easily occur when newcomers to your organization accidentally change resources created by CloudFormation, or when members of other teams who aren’t aware of the service modify them. Such changes can leave your CloudFormation stack in a state in which you can no longer modify it. A stack in the UPDATE_ROLLBACK_FAILED state was attempting an update, the update failed, and CloudFormation began a rollback. A further issue then stopped CloudFormation from returning to the previous “good” state during the rollback. As a result, the stack can’t update and can’t roll back, so it remains in this half-way state. The API then rejects any further actions on the stack other than ContinueUpdateRollback and DeleteStack.
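To check whether a stack is stuck in this state from a script, a small helper along the following lines could be used. It is a sketch that assumes a configured AWS CLI. The function names and the REGION default are hypothetical.

```shell
# Sketch only: REGION and the function names are illustrative.
REGION="${REGION:-eu-west-2}"

# Print the current status of a stack, for example UPDATE_ROLLBACK_FAILED.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" --region "$REGION" \
    --query 'Stacks[0].StackStatus' --output text
}

# Succeed (exit status 0) only when the stack is in UPDATE_ROLLBACK_FAILED.
is_rollback_failed() {
  [ "$(stack_status "$1")" = "UPDATE_ROLLBACK_FAILED" ]
}
```

For example: if is_rollback_failed <stack-name>; then echo "stack needs recovery"; fi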

ContinueUpdateRollback

The ContinueUpdateRollback API operation gives customers an override for stacks in the UPDATE_ROLLBACK_FAILED state: it forces CloudFormation to continue the rollback. Use this operation only for troubleshooting. The operation simply retries the rollback and does nothing to fix the underlying issue. For example, if the rollback failed because of an account limit and you don’t address that limit, ContinueUpdateRollback will retry the rollback and fail once more. You still need to fix the underlying problem separately and, in a break from best practices, outside of AWS CloudFormation.
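Because the operation only retries the rollback, it is worth checking how the retry ends. The sketch below, assuming a configured AWS CLI, calls continue-update-rollback and then polls the stack status until the rollback either completes or fails again. The function name, REGION, and POLL_SECONDS are illustrative.

```shell
# Sketch only: REGION, POLL_SECONDS, and the function name are illustrative.
REGION="${REGION:-eu-west-2}"

continue_rollback_and_wait() {
  local stack_name="$1" status

  # Ask CloudFormation to retry the rollback.
  aws cloudformation continue-update-rollback \
    --stack-name "$stack_name" --region "$REGION" || return 1

  # Poll until the stack reaches a terminal rollback state.
  while :; do
    status=$(aws cloudformation describe-stacks --stack-name "$stack_name" \
      --region "$REGION" --query 'Stacks[0].StackStatus' --output text) || return 1
    case "$status" in
      UPDATE_ROLLBACK_COMPLETE) echo "$status"; return 0 ;;
      # The retry failed again: the underlying issue still needs fixing.
      UPDATE_ROLLBACK_FAILED)   echo "$status"; return 1 ;;
    esac
    sleep "${POLL_SECONDS:-10}"
  done
}
```

A non-zero exit status means that the stack is back in UPDATE_ROLLBACK_FAILED and that the root cause has not been addressed yet.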

A practical example

We’ll walk you through a scenario involving stacks entering the UPDATE_ROLLBACK_FAILED state and the use of the Auto Scaling service. Let’s take a look at a CloudFormation template.

Note: This template can be deployed in any Region as long as you specify a valid HVM-compatible Amazon Machine Image (AMI) on launch. Alternatively, to launch it as-is use the Europe (London) (eu-west-2) Region.

AWSTemplateFormatVersion: "2010-09-09"
Description: "A template used to illustrate the use of ContinueUpdateRollback API when recovering CloudFormation stacks from Update Rollback Failed"
Parameters: 
  pAMI:
    Type: "AWS::EC2::Image::Id"
    Description: "Pick any HVM compatible AMI-ID for the AutoScaling group, the default works in eu-west-2 only"
    Default: "ami-1a7f6d7e"
Resources:
  ASG:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      AvailabilityZones: 
        - !Select 
          - 0
          - Fn::GetAZs: !Ref 'AWS::Region'
      DesiredCapacity: 1
      LaunchConfigurationName: !Ref "LC"
      MaxSize: 1
      MinSize: 1
  LC:
    Type: "AWS::AutoScaling::LaunchConfiguration"
    Properties: 
      ImageId: !Ref "pAMI"
      InstanceType: "t2.micro"

As you can see, we create an Auto Scaling group with a launch configuration and our stack is created just as we expected.
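If you want to follow along from the command line, the stack could be launched roughly as follows. This is a sketch that assumes the template above has been saved locally. The file name asg.yaml used in the usage line, the function name, and the REGION default are hypothetical.

```shell
# Sketch only: REGION, launch_stack, and the template file name are illustrative.
REGION="${REGION:-eu-west-2}"

launch_stack() {
  local stack_name="$1" template_file="$2" ami_id="${3:-ami-1a7f6d7e}"

  # Create the stack, passing the AMI as the pAMI template parameter.
  aws cloudformation create-stack \
    --stack-name "$stack_name" \
    --template-body "file://$template_file" \
    --parameters "ParameterKey=pAMI,ParameterValue=$ami_id" \
    --region "$REGION" || return 1

  # Block until creation completes; the waiter exits non-zero on failure.
  aws cloudformation wait stack-create-complete \
    --stack-name "$stack_name" --region "$REGION"
}
```

For example: launch_stack <stack-name> asg.yaml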

Chaos ensues

To continue our hypothetical, let’s pretend that we hire Jimmy, a new member of the operations team fresh out of his AWS Certified Solutions Architect Associate exam. One day, while you are not around, the development team asks him to decrease the size of the instance type for the Auto Scaling group. He decides it’s best to create a cool new launch configuration for the running Auto Scaling group and assign the new launch configuration to the group:

aws autoscaling create-launch-configuration --launch-configuration-name Jimmys_new_LC --image-id ami-1a7f6d7e --instance-type t2.nano --region eu-west-2
ASName=$(aws cloudformation describe-stack-resources --stack-name <stack-name> --logical-resource-id ASG --output text --query 'StackResources[0].PhysicalResourceId' --region eu-west-2)
aws autoscaling update-auto-scaling-group --auto-scaling-group-name "$ASName" --launch-configuration-name Jimmys_new_LC --region eu-west-2

Of course, he also decides to clean up the old Auto Scaling launch configuration for good measure:

(For your ease of replication, I include the first line below to fetch the physical ID of the launch configuration created by the previous template.)

LCName=$(aws cloudformation describe-stack-resources --stack-name <stack-name> --logical-resource-id LC --output text --query 'StackResources[0].PhysicalResourceId' --region eu-west-2)
aws autoscaling delete-launch-configuration --launch-configuration-name "$LCName" --region eu-west-2

Oh Jimmy, we had such high hopes…

Rollbacks and stabilization

In the majority of my tests, AWS CloudFormation attempts to fail fast. For example, if we were to add a HealthCheckType with an incorrect value, the initial API call would fail and the stack would roll back. Because CloudFormation did not replace or successfully change the configuration of the target resource, it correctly assumes that no API call is necessary to roll that resource back.

A rollback that reapplies a previous configuration can also be triggered some time after the initial API call, during what CloudFormation refers to as stabilization. CloudFormation performs an API call on behalf of a user and, in addition, attempts to ensure that when a resource is labeled CREATE_COMPLETE, the resource is actually running in the desired state. For Auto Scaling, for example, it sends a CreateAutoScalingGroup API call and then calls DescribeAutoScalingGroups until the group has a Min/Max and Desired count equal to the count defined in the template. If this doesn’t occur, CloudFormation must then reapply the previous configuration.
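To make the idea of stabilization more concrete, here is a rough sketch of a similar check done by hand with the AWS CLI. It is not CloudFormation’s internal implementation. As a simplifying assumption of ours, it treats a group as stable once the number of InService instances equals the desired capacity. The function names, REGION, and POLL_SECONDS are illustrative.

```shell
# Sketch only: an approximation of a stabilization check, not CloudFormation's own logic.
REGION="${REGION:-eu-west-2}"

# Succeed when the number of InService instances equals the desired capacity.
asg_is_stable() {
  local asg_name="$1" out desired in_service
  out=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" --region "$REGION" \
    --query "AutoScalingGroups[0].[DesiredCapacity, length(Instances[?LifecycleState=='InService'])]" \
    --output text) || return 1
  # Text output puts both values on one whitespace-separated line.
  set -- $out
  desired="$1"
  in_service="$2"
  [ -n "$desired" ] && [ "$desired" = "$in_service" ]
}

# Poll up to a fixed number of attempts; fail if the group never stabilizes.
wait_until_stable() {
  local asg_name="$1" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    asg_is_stable "$asg_name" && return 0
    sleep "${POLL_SECONDS:-10}"
    i=$((i + 1))
  done
  return 1
}
```

When a check like this times out during a stack update, the update is considered failed and the rollback begins.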

Forcing a rollback

In our example, we could force this type of failure by increasing the Auto Scaling group’s capacity past our account limit for running On-Demand instances. This would in turn fail stabilization and trigger our rollback. At this point CloudFormation would be unable to find the previously defined launch configuration and the stack would then enter the UPDATE_ROLLBACK_FAILED state. However, this would involve launching a number of unused instances, which would not be very frugal. Instead, we can use the CloudFormation Wait Condition resource to simulate a failure and force a rollback.

This is how we update the CloudFormation template:

AWSTemplateFormatVersion: "2010-09-09"
Description: "A template used to illustrate the use of ContinueUpdateRollback API when recovering CloudFormation stacks from Update Rollback Failed"
Parameters: 
  pAMI:
    Type: "AWS::EC2::Image::Id"
    Description: "Pick any HVM compatible AMI-ID for the AutoScaling group, the default works in eu-west-2 only"
    Default: "ami-1a7f6d7e"
Resources:
  ASG:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      AvailabilityZones: 
        - !Select 
          - 0
          - Fn::GetAZs: !Ref 'AWS::Region'
      DesiredCapacity: 1
      LaunchConfigurationName: !Ref "LC"
      MaxSize: 1
      MinSize: 1
  LC:
    Type: "AWS::AutoScaling::LaunchConfiguration"
    Properties: 
      ImageId: !Ref "pAMI"
      InstanceType: "t2.large"
  WaitCondition:
    Type: "AWS::CloudFormation::WaitCondition"
    DependsOn: ASG
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT1M

As you can see, the template updates our launch configuration to use t2.large instances. However, because we never signal our WaitCondition, the stack will roll back. At this point CloudFormation will attempt to reapply the launch configuration. However, as we saw earlier, Jimmy deleted our launch configuration. So any update on the stack that updates the Auto Scaling group and triggers a rollback will fail.
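Once the update has been attempted, the stack events show which resources failed and why. The following sketch, assuming a configured AWS CLI, lists the logical ID and status reason of every event with the UPDATE_FAILED status. The function name and REGION default are illustrative.

```shell
# Sketch only: REGION and failed_resources are illustrative names.
REGION="${REGION:-eu-west-2}"

# Print "<LogicalResourceId> <ResourceStatusReason>" for each UPDATE_FAILED event.
failed_resources() {
  aws cloudformation describe-stack-events --stack-name "$1" --region "$REGION" \
    --query "StackEvents[?ResourceStatus=='UPDATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output text
}
```

Events from earlier updates of the same stack are included in the output, so read the result together with the event timestamps in the console if the stack has a long history.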

General guidance on recovering stuck stacks

To address issues with CloudFormation stacks that have entered UPDATE_ROLLBACK_FAILED state you have three options:

1. Delete the stack. If the deletion fails for any reason, you can then use the DeleteStack API operation with the RetainResources option listing resources that failed deletion.

2. Manually make changes outside the scope of the stack to bring the underlying resources back in line with what the stack expects, and then perform ContinueUpdateRollback.

3. If you can’t fix the issues with the underlying stack resources, you can use ContinueUpdateRollback along with the ResourcesToSkip option. CloudFormation will mark the problematic resources as UPDATE_COMPLETE and continue with the rest of the rollback.
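The second and third options are demonstrated later in this post. For the first option, a sketch of the delete call could look like the following, assuming a configured AWS CLI. The function name and REGION default are illustrative. Note that the retain-resources list is only accepted for stacks that are already in the DELETE_FAILED state, so the first call is typically made without it.

```shell
# Sketch only: REGION and delete_stack_retaining are illustrative names.
REGION="${REGION:-eu-west-2}"

# Usage: delete_stack_retaining <stack-name> [logical-id ...]
# With no logical IDs, a plain delete is attempted.
# With logical IDs, those resources are retained; this form is only
# valid once the stack is in the DELETE_FAILED state.
delete_stack_retaining() {
  local stack_name="$1"
  shift
  if [ "$#" -gt 0 ]; then
    aws cloudformation delete-stack --stack-name "$stack_name" \
      --retain-resources "$@" --region "$REGION"
  else
    aws cloudformation delete-stack --stack-name "$stack_name" --region "$REGION"
  fi
}
```

Retained resources are left in your account and are no longer managed by the stack, so remember to clean them up separately.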

Saving the day

Now to address Jimmy’s mistake. With experience we know that CloudFormation relies on resource names or IDs to keep track of ownership. With launch configurations, we can see that the service uses a combination of the StackName, LogicalId, and a random string to name the launch configuration.

What we can do is create a new launch configuration using the precise name of the previous one. We can grab that name by describing our stack as before, and then follow that up with a ContinueUpdateRollback:

LCName=$(aws cloudformation describe-stack-resources --stack-name <stack-name> --logical-resource-id LC --output text --query 'StackResources[0].PhysicalResourceId' --region eu-west-2)
aws autoscaling create-launch-configuration --launch-configuration-name "$LCName" --image-id ami-1a7f6d7e --instance-type t2.micro --region eu-west-2
aws cloudformation continue-update-rollback --stack-name <stack-name> --region eu-west-2

Our stack should now be back to UPDATE_ROLLBACK_COMPLETE and ready for us to perform our update again.

aws cloudformation describe-stacks --stack-name <stack-name> --query 'Stacks[0].StackStatus' --region eu-west-2
"UPDATE_ROLLBACK_COMPLETE"
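In a recovery like this one, it can help to confirm that the re-created launch configuration really exists under the expected name before retrying the rollback. A minimal sketch, assuming a configured AWS CLI, with an illustrative function name and REGION default:

```shell
# Sketch only: REGION and launch_config_exists are illustrative names.
REGION="${REGION:-eu-west-2}"

# Succeed when exactly one launch configuration with the given name exists.
launch_config_exists() {
  local count
  count=$(aws autoscaling describe-launch-configurations \
    --launch-configuration-names "$1" --region "$REGION" \
    --query 'length(LaunchConfigurations)' --output text) || return 1
  [ "$count" = "1" ]
}
```

For example: launch_config_exists "$LCName" || echo "launch configuration is still missing"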

In our example we forced the rollback by adding a never-completing WaitCondition. In your case the cause might be due to account limitations or other issues.

When replacement is not an option

In our example, we were in the lucky situation of being able to replace the removed resource simply by giving a new resource the name that the stack expects. Let’s take a look at a situation in which the deleted resource cannot be re-created under a name of our choosing.

Consider the following CloudFormation template:

AWSTemplateFormatVersion: "2010-09-09"
Description: "A template used to illustrate the use of ContinueUpdateRollback API when recovering CloudFormation stacks from Update Rollback Failed"
Parameters: 
  pAMI:
    Type: "AWS::EC2::Image::Id"
    Description: "Pick any HVM compatible AMI-ID for the AutoScaling group, the default works in eu-west-2 only"
    Default: "ami-1a7f6d7e"
  pVPC:
    Type: "AWS::EC2::VPC::Id"
    Description: "Please use your *default* VPC so that the steps that follow are successful"
  pSubnet:
    Type: "AWS::EC2::Subnet::Id"
    Description: "A Subnet within the VPC provided"
Resources:
  Instance:
    Type: "AWS::EC2::Instance"
    Properties: 
      ImageId: !Ref pAMI
      SubnetId: !Ref pSubnet
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup
      InstanceType: t2.micro
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref pVPC
      GroupDescription: Allow http to client host
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0
      SecurityGroupEgress:
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0

In this template we are creating an instance and a security group. Let’s pretend our old friend Jimmy was this time instructed to alter the EC2 instance and reference the default VPC security group:

defaultSecGroup=$(aws ec2 describe-security-groups --group-names default --query 'SecurityGroups[0].GroupId' --output text --region eu-west-2)
stackSecGroup=$(aws cloudformation describe-stack-resources --stack-name <stack-name> --logical-resource-id InstanceSecurityGroup --output text --query 'StackResources[0].PhysicalResourceId' --region eu-west-2)
instance=$(aws cloudformation describe-stack-resources --stack-name <stack-name> --logical-resource-id Instance --output text --query 'StackResources[0].PhysicalResourceId' --region eu-west-2)
aws ec2 modify-instance-attribute --instance-id "$instance" --groups "$defaultSecGroup" --region eu-west-2
aws ec2 delete-security-group --group-id "$stackSecGroup" --region eu-west-2

And then, once again we update our stack with the failing wait condition to simulate a failure and initiate a rollback:

AWSTemplateFormatVersion: "2010-09-09"
Description: "A template used to illustrate the use of ContinueUpdateRollback API when recovering CloudFormation stacks from Update Rollback Failed"
Parameters: 
  pAMI:
    Type: "AWS::EC2::Image::Id"
    Description: "Pick any HVM compatible AMI-ID for the AutoScaling group, the default works in eu-west-2 only"
    Default: "ami-1a7f6d7e"
  pVPC:
    Type: "AWS::EC2::VPC::Id"
    Description: "Any VPC with at least 1 available subnet"
  pSubnet:
    Type: "AWS::EC2::Subnet::Id"
    Description: "A Subnet within the VPC provided"
Resources:
  Instance:
    Type: "AWS::EC2::Instance"
    Properties: 
      ImageId: !Ref pAMI
      SubnetId: !Ref pSubnet
      InstanceType: t2.micro
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref pVPC
      GroupDescription: Allow http to client host
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0
      SecurityGroupEgress:
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0
  WaitCondition:
    Type: "AWS::CloudFormation::WaitCondition"
    DependsOn: Instance
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT1M

The stack is now in the UPDATE_ROLLBACK_FAILED state because the ‘Instance’ resource is unable to roll back, with an error like:

The security group ‘sg-xxxxxxxx’ does not exist

This time, to recover, because we can’t re-create sg-xxxxxxxx with the exact ID, we will have to continue the rollback and skip past the instance so that we can continue to modify the stack and regain a working and synced state.

The first step is to perform continue-update-rollback skipping the resource:

aws cloudformation continue-update-rollback --stack-name <stack-name> --resources-to-skip Instance --region eu-west-2

The stack is now in the UPDATE_ROLLBACK_COMPLETE state. However, the state of the stack no longer matches the state of the underlying resources. The instance, while still running, has the default EC2 security group that Jimmy assigned to it instead of the security group defined in the stack. To rectify this, we modify our template so that the instance is updated with a different security group ID. We remove the old, now deleted group from the template, define a new group (InstanceSecurityGroup2), and use it as the new group for the instance:

AWSTemplateFormatVersion: "2010-09-09"
Description: "A template used to illustrate the use of ContinueUpdateRollback API when recovering CloudFormation stacks from Update Rollback Failed"
Parameters: 
  pAMI:
    Type: "AWS::EC2::Image::Id"
    Description: "Pick any HVM compatible AMI-ID for the AutoScaling group, the default works in eu-west-2 only"
    Default: "ami-1a7f6d7e"
  pVPC:
    Type: "AWS::EC2::VPC::Id"
    Description: "Any VPC with at least 1 available subnet"
  pSubnet:
    Type: "AWS::EC2::Subnet::Id"
    Description: "A Subnet within the VPC provided"
Resources:
  Instance:
    Type: "AWS::EC2::Instance"
    Properties: 
      ImageId: !Ref pAMI
      SubnetId: !Ref pSubnet
      InstanceType: t2.micro
      SecurityGroupIds:
        - !Ref InstanceSecurityGroup2
  InstanceSecurityGroup2:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref pVPC
      GroupDescription: Allow http to client host
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0
      SecurityGroupEgress:
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0

Our stack should now be back in a state that is synchronized with the underlying resources and in a state that allows for further modifications down the line.
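To confirm that the stack and the instance agree after this final update, we could compare the security group the stack reports with the one actually attached to the instance. The sketch below assumes a configured AWS CLI and the logical IDs used in the template above (Instance and InstanceSecurityGroup2). The function names and REGION default are illustrative.

```shell
# Sketch only: REGION and the function names are illustrative.
REGION="${REGION:-eu-west-2}"

# Print the physical ID of a stack resource, given its logical ID.
stack_resource_id() {
  aws cloudformation describe-stack-resources --stack-name "$1" \
    --logical-resource-id "$2" --region "$REGION" \
    --query 'StackResources[0].PhysicalResourceId' --output text
}

# Print the IDs of the security groups attached to an instance.
instance_security_groups() {
  aws ec2 describe-instances --instance-ids "$1" --region "$REGION" \
    --query 'Reservations[0].Instances[0].SecurityGroups[].GroupId' --output text
}

# Succeed when the instance has exactly the group that the stack defines.
verify_instance_group() {
  local stack_name="$1" instance_id expected actual
  instance_id=$(stack_resource_id "$stack_name" Instance) || return 1
  expected=$(stack_resource_id "$stack_name" InstanceSecurityGroup2) || return 1
  actual=$(instance_security_groups "$instance_id") || return 1
  [ "$actual" = "$expected" ]
}
```

For example: verify_instance_group <stack-name> && echo "stack and instance are in sync"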

Conclusion

In this blog post, we covered CloudFormation concepts such as rollbacks and stabilization, and the ContinueUpdateRollback operation. We showed you two examples of using ContinueUpdateRollback to rescue a stack from a stuck state. In the first example, we re-created the missing resource under the name the stack expected and then performed the API call. In the second example, we had to skip past the problematic resource because it is identified by a generated ID rather than a user-defined value.

About the Author

Nishant Casey is a Cloud Support Engineer on the AWS Deployment team where he focuses on AWS CloudFormation and AWS Elastic Beanstalk. Nishant is passionate about Infrastructure as Code and DevOps. A gigging DJ and guitarist in a previous life, he enjoys music, as well as cooking, traveling, and video and board gaming.

AWS Cloud Operations & Migrations Blog