AWS Cloud Operations Blog

AWS CloudFormation Guardrails: Protecting your Stacks and Ensuring Safer Updates

“I wonder what will happen if I touch these two wires together.” – Unix fortune

If you’ve worked with cloud-hosted applications or large distributed architectures for any extended period of time, chances are you’ve heard colleagues invoke Murphy’s law: “Anything that can go wrong, will go wrong”. All of us have experienced one of those events in the middle of the night where a usually well-intentioned colleague decides to run maintenance or cleanup on some systems…and accidentally deletes or changes a volume, server, endpoint, function, or other critical resource.

If your applications, functions, servers, and other resources are in AWS, and you’re using AWS CloudFormation to automate the deployment and changes to your stacks, you are well positioned to implement several levels of safety guardrails to reduce the likelihood of many of these unplanned events. In this blog post we cover many of these guardrails. We’ll also present ideas collected from user surveys, support cases, and other sources so you can build a strategy to use these safety provisions and improve them over time.

There are four primary features that you can use to protect your stacks and resources in CloudFormation:

Guardrail               Description
Termination Protection  Stack level attribute to prevent deletion; also works with nested stacks
Deletion Policies       Resource level attribute; can be Delete (default), Retain or Snapshot
Stack Policies          Restrict operations at a stack level to multiple resource groups
IAM Policies            Restrict operations by users, groups or roles

These features vary in scope and the granularity of options. Consider implementing several of these features in a layered way, as opposed to using only one of them.

Using stack termination protection

Using this stack attribute, you can prevent a new or existing stack from being accidentally deleted. This setting is disabled by default, so you have to explicitly enable it when you create new stacks. For existing, non-nested stacks, you can change termination protection using the AWS Management Console or the AWS CLI. For existing nested stacks, you must enable termination protection on the root stack. After it is enabled on the root stack, the protection is also set for the nested or child stacks. However, keep in mind that if you perform a stack update on the root stack that would delete the nested stack, CloudFormation will delete the nested stack. If you attempt to delete a nested stack when its root stack has termination protection in place, the operation will fail and the nested stack will remain unchanged. Like many other CloudFormation operations, you can control who can enable or disable termination protection by using an IAM policy.

{
    "Version":"2012-10-17",
    "Statement":[{
        "Effect":"Allow",
        "Action":[
            "cloudformation:UpdateTerminationProtection"
        ],
        "Resource":"*"
    }]
}

Figure 1: Sample IAM policy granting permissions to change stack termination protection
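
As a minimal sketch of changing this setting programmatically, the Python helper below calls the `UpdateTerminationProtection` API and then reads the flag back. It assumes the caller passes in a boto3 CloudFormation client; the helper name `set_termination_protection` and the stack name in the comment are hypothetical.

```python
def set_termination_protection(cfn, stack_name, enabled=True):
    """Enable or disable termination protection and return the resulting state.

    Assumption: `cfn` is a boto3 CloudFormation client created by the caller,
    for example with boto3.client("cloudformation").
    """
    # This call needs the cloudformation:UpdateTerminationProtection
    # permission granted in Figure 1.
    cfn.update_termination_protection(
        StackName=stack_name,
        EnableTerminationProtection=enabled,
    )
    # Read the flag back so the caller can confirm the change.
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return bool(stack.get("EnableTerminationProtection", False))


# Example (hypothetical stack name):
#   import boto3
#   set_termination_protection(boto3.client("cloudformation"), "network-core")
```

Because the helper also calls `DescribeStacks`, the caller needs that permission in addition to the one shown in Figure 1.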

 

So, now that you know about termination protection, should you enable it on all or most of your stacks? Maybe, but you should consider the lifecycle of all your stacks first. Adding termination protection to seldom-changing network resource stacks makes sense, providing yet another layer that can supplement existing controls without interfering with daily application changes. On the other hand, if the application stack changes often, you’ll end up enabling and disabling termination protection often as well. For those types of ephemeral stacks, other guardrails outlined in this article might be more appropriate.

Using resource-specific deletion policies

You can use Deletion Policies on a resource-by-resource basis in your template code. By default, when a resource is removed from a stack template and the stack is updated, CloudFormation deletes the resource. (The exceptions are some Amazon RDS database resources, which have a different default behavior.) For more information on resource-specific deletion policies, see the CloudFormation DeletionPolicy Attribute documentation.

Keep in mind the Retain option, which removes the resource from CloudFormation’s management (it is no longer tracked by the stack or its template) but doesn’t delete it from your AWS account or Region. This can be critical for stateful resources like databases and queues, and semi-durable resources like state machines when using AWS Step Functions. For state machines in particular, it’s advisable to retain them because doing so also retains their execution history. In the interest of utmost safety, you should liberally set Deletion Policies to Retain unless and until you are sure that deleting a resource won’t cost you valuable historical data for troubleshooting. You can always delete these resources later using other means.

For some stateful resources like Amazon EBS volumes (the AWS::EC2::Volume resource type), Amazon ElastiCache, Amazon RDS, and Amazon Redshift, you also have the option to have CloudFormation create a snapshot before it deletes those resources. To further protect your more critical stateful resources, you can group them into separate stacks with stricter policies, create dependencies between stacks using cross-stack references, or both. Cross-stack references add an implicit check: CloudFormation won’t delete a stack while another stack still imports one of its exported outputs.

{
    "AWSTemplateFormatVersion":"2010-09-09",
    "Resources": {
        "myVolume": {
            "Type":"AWS::EC2::Volume",
            "DeletionPolicy":"Snapshot",
            "Properties": {
                "AvailabilityZone":"us-east-1a",
                "Size":"200"
            }
        }
    }
}

Figure 2: Using Deletion Policy to take a snapshot of an EC2 Volume, if deleted
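
To find resources that might deserve a Retain or Snapshot policy, you could scan a template for stateful resource types that don't set one. The sketch below does this for a JSON template that has already been parsed. The `STATEFUL_TYPES` set and the function name `unprotected_resources` are assumptions made for illustration, not a list defined by CloudFormation.

```python
import json

# Assumption: resource types treated as stateful for this sketch.
# Adjust the set to match the resources in your own templates.
STATEFUL_TYPES = {
    "AWS::EC2::Volume",
    "AWS::RDS::DBInstance",
    "AWS::SQS::Queue",
    "AWS::StepFunctions::StateMachine",
}


def unprotected_resources(template, stateful_types=STATEFUL_TYPES):
    """Return the logical IDs of stateful resources whose DeletionPolicy
    is missing or explicitly set to Delete."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in stateful_types:
            continue
        # A missing attribute is reported the same way as an explicit Delete.
        if resource.get("DeletionPolicy", "Delete") == "Delete":
            findings.append(logical_id)
    return sorted(findings)


template = json.loads("""
{
  "Resources": {
    "myVolume": {"Type": "AWS::EC2::Volume", "DeletionPolicy": "Snapshot"},
    "myQueue":  {"Type": "AWS::SQS::Queue"}
  }
}
""")
print(unprotected_resources(template))  # prints ['myQueue']
```

The check is deliberately conservative: it flags any listed resource that has no explicit policy, including types whose default behavior isn't Delete, so that every retention decision ends up visible in the template.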

 

Using stack-level policies

A stack policy is a JSON document that defines the update actions that can be performed on a single resource or a group of resources in a flexible yet compact way (versus Deletion Policies, which are defined on a resource-by-resource basis). Stack policies are evaluated and applied in advance of any update actions, which include cases when resources are modified, recreated, or removed. Resources for a rule can be selected with wildcards or by evaluating a condition expression.

{
  "Statement" : [
    {
      "Effect" : "Deny",
      "Action" : "Update:*",
      "Principal": "*",
      "Resource" : "*",
      "Condition" : {
        "StringEquals" : {
          "ResourceType" : ["AWS::RDS::DBInstance"]
        }
      }
    },
    {
      "Effect" : "Allow",
      "Action" : "Update:*",
      "Principal": "*",
      "Resource" : "*"
    }
  ]
}

Figure 3: A Stack policy that prevents updates to all RDS DB Instances
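
To see why the policy in Figure 3 blocks RDS updates even though its second statement allows everything, it can help to model the evaluation locally. The sketch below is a simplified approximation, not CloudFormation's implementation. It assumes that updates are denied unless a statement allows them and that an explicit Deny overrides any Allow, and it handles only `Action`, `Resource`, and `StringEquals`/`StringLike` conditions on `ResourceType`. The function names are hypothetical.

```python
import re


def _like(pattern, value):
    """Match value against a pattern in which * matches any run of characters."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, value) is not None


def _as_list(item):
    return item if isinstance(item, list) else [item]


def _applies(stmt, action, logical_id, resource_type):
    """Return True if the statement covers this action on this resource."""
    if not any(_like(a, action) for a in _as_list(stmt["Action"])):
        return False
    target = "LogicalResourceId/" + logical_id
    if not any(_like(r, target) for r in _as_list(stmt["Resource"])):
        return False
    for operator, keys in stmt.get("Condition", {}).items():
        for key, patterns in keys.items():
            if key != "ResourceType":
                raise ValueError("unsupported condition key: " + key)
            if operator == "StringEquals":
                matched = resource_type in _as_list(patterns)
            elif operator == "StringLike":
                matched = any(_like(p, resource_type) for p in _as_list(patterns))
            else:
                raise ValueError("unsupported operator: " + operator)
            if not matched:
                return False
    return True


def is_update_allowed(policy, action, logical_id, resource_type):
    """Simplified model: explicit Deny wins, and anything not allowed is denied."""
    effects = [s["Effect"] for s in policy["Statement"]
               if _applies(s, action, logical_id, resource_type)]
    return "Deny" not in effects and "Allow" in effects


FIGURE_3_POLICY = {
    "Statement": [
        {"Effect": "Deny", "Action": "Update:*", "Principal": "*", "Resource": "*",
         "Condition": {"StringEquals": {"ResourceType": ["AWS::RDS::DBInstance"]}}},
        {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"},
    ]
}

# Hypothetical logical IDs, used only for illustration.
print(is_update_allowed(FIGURE_3_POLICY, "Update:Modify", "MyDatabase",
                        "AWS::RDS::DBInstance"))  # prints False
print(is_update_allowed(FIGURE_3_POLICY, "Update:Modify", "MyServer",
                        "AWS::EC2::Instance"))    # prints True
```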

 

Stack policies can be set when you create a stack using the AWS Management Console or by using the AWS CLI. However, to set a stack policy on an existing stack, you must do it using the CLI or API. You also must use the CLI or API to modify an existing policy on a stack. You can also opt to create a strict permanent stack policy, and then update policy-protected resources by creating temporary policies that override the permanent stack policy. Finally, if you use AWS Config, you can also record configuration changes to the attributes of a stack policy, as well as other permissions and rollback settings.
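
The two API-based workflows above can be sketched in Python as follows. This is a sketch under assumptions rather than a definitive implementation: `cfn` is assumed to be a boto3 CloudFormation client supplied by the caller, and the helper names are hypothetical. The first helper sets a permanent policy on an existing stack. The second passes a temporary override policy with a single update, which leaves the permanent policy unchanged.

```python
import json


def apply_stack_policy(cfn, stack_name, policy):
    """Set (or replace) the permanent stack policy on an existing stack."""
    cfn.set_stack_policy(
        StackName=stack_name,
        StackPolicyBody=json.dumps(policy),
    )


def temporary_override_policy(logical_ids):
    """Build a policy that allows every update action on the named resources."""
    return {
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "Update:*",
                "Principal": "*",
                "Resource": ["LogicalResourceId/" + lid for lid in logical_ids],
            }
        ]
    }


def update_with_override(cfn, stack_name, logical_ids, **update_args):
    """Run one stack update under a temporary override policy.

    `update_args` carries the rest of the UpdateStack request, for example
    UsePreviousTemplate=True together with new Parameters.
    """
    policy = temporary_override_policy(logical_ids)
    return cfn.update_stack(
        StackName=stack_name,
        StackPolicyDuringUpdateBody=json.dumps(policy),
        **update_args,
    )


# Example (hypothetical stack and resource names):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   update_with_override(cfn, "orders-db", ["OrdersDatabase"],
#                        UsePreviousTemplate=True)
```

Naming the resources in the override, rather than allowing `*`, keeps the temporary exception as narrow as the change you intend to make.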

Using IAM Policies

IAM Policies explicitly enforce access controls on users, groups, or roles. Beyond restricting updates and deletions to a subset of users, you can also restrict users to use only specific templates, use only specific stack policies, create only a few resource types, or assume specific roles. Gaining expertise with IAM policies can benefit you beyond your CloudFormation usage, such as controlling access to logs, controlling who can invoke Lambda functions, and many other use cases across all AWS services.

{
    "Version":"2012-10-17",
    "Statement":[
        {
            "Effect":"Allow",
            "Action":["cloudformation:CreateStack"],
            "Resource":"*"
        },
        {
            "Effect":"Deny",
            "Action":["cloudformation:CreateStack"],
            "Resource":"*",
            "Condition":{
                "ForAnyValue:StringLike":{
                    "cloudformation:ResourceTypes":["AWS::IAM::*"]
                }
            }
        }
    ]
}

Figure 4: IAM policy allowing users to create stacks, except stacks that declare IAM resources
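
The condition in Figure 4 matches when any resource type declared in the request fits any of the listed patterns. The sketch below models only that string matching, with `*` for any run of characters and `?` for a single character. It isn't a full IAM policy evaluation, and the function names are hypothetical.

```python
import re


def string_like(pattern, value):
    """Approximate StringLike: * matches any run of characters, ? matches one."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, value) is not None


def for_any_value_string_like(request_values, patterns):
    """True if at least one request value matches at least one pattern."""
    return any(string_like(p, v) for v in request_values for p in patterns)


def create_stack_denied(resource_types, denied_patterns=("AWS::IAM::*",)):
    """Would the Deny statement's condition in Figure 4 match these types?"""
    return for_any_value_string_like(resource_types, denied_patterns)


print(create_stack_denied(["AWS::S3::Bucket", "AWS::IAM::Role"]))  # prints True
print(create_stack_denied(["AWS::S3::Bucket", "AWS::SQS::Queue"]))  # prints False
```

An empty list of resource types never matches in this model, so it is worth testing how your policy behaves when a request doesn't declare its resource types at all.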

 

A few more suggestions

Beyond understanding how these four guardrails work, here are a few other suggestions and ideas you can study as you look to implement these controls, or improve existing controls you may have inherited:

  • By having smaller, multiple policies that affect a smaller set of resources and stacks, you can limit the blast radius of policy changes as you improve them over time, to either make them more restrictive or to add layers of control.
  • Stack and IAM policies should be treated as code and, with that in mind, they should be periodically tested and versioned. Consider implementing validation pipelines and creating chaos engineering-like tests where critical resource deletions are attempted.
  • For your own custom resources, it’s up to you to add code to determine what happens to your resources when they get deleted. For stateful custom resources, it becomes your responsibility to ensure a given resource is backed up or retained. You can also use a Lambda-backed custom resource to pass a stack’s policy from a root stack to a nested stack using the SetStackPolicy API call, because nested stacks don’t automatically inherit the root stack’s policy.
  • These guardrails are preventive steps that you can execute within CloudFormation. If you go outside of CloudFormation and update or delete a resource through that service’s console, CLI, or API, your template will fall out of sync with the resource’s actual state. You should update your template to reflect the changes, and use additional IAM policies to prevent such out-of-band changes in the future.
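
The policy hand-off to nested stacks mentioned in the list above could look like the following sketch. It assumes a boto3 CloudFormation client passed in as `cfn`; the function name is hypothetical. In a Lambda-backed custom resource, the handler would call a helper like this and then report success or failure back to CloudFormation, a step omitted here.

```python
def propagate_stack_policy(cfn, root_stack, nested_stack):
    """Copy the root stack's policy onto a nested stack.

    Returns True if a policy was copied, or False if the root stack has none.
    """
    response = cfn.get_stack_policy(StackName=root_stack)
    body = response.get("StackPolicyBody")
    if not body:
        # Nothing to propagate; leave the nested stack untouched.
        return False
    cfn.set_stack_policy(StackName=nested_stack, StackPolicyBody=body)
    return True
```

Copying the policy body verbatim keeps the root and nested stacks consistent. If the nested stack uses different logical IDs, statements that name specific resources may need adjusting before they are applied.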

You should plan to use most (if not all) of these options. Let’s say, for example, that you’ve just inherited a group of applications (and, with those, the infrastructure stacks and templates associated with them) and are looking at adding layers of protection using these guardrails. A defensible approach may look like this:

  • Review your stacks and templates, and review the resources that are more critical, like stateful or semi-durable resources. Determine which of these critical resources are useful candidates for a DeletionPolicy of Retain or Snapshot.
  • Go one level up from resources to stacks, and consider what operations will be allowed for resources (or groups of resources) within that stack. On top of your resource deletion policy layer, add stack policy restrictions that reflect the criticality of those stacks. For the more critical stacks, consider also using stack termination protection. Even if it becomes a nuisance, at least you’ll learn how frequently those stacks need updates, which stacks have cross-stack references or parent/child relationships, and so on. Armed with this information, you can adjust your controls accordingly.
  • Finally, once you’ve protected the resources and stacks themselves, you should then restrict which users should have the ability to run updates on those stacks using IAM policies.
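
For the stack-level part of this review, a short audit can show which stacks already have guardrails in place. The sketch below assumes a boto3 CloudFormation client passed in as `cfn`. The function names and the report keys are hypothetical.

```python
def audit_stack_guardrails(cfn, stack_name):
    """Report which stack-level guardrails are present on one stack."""
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    policy = cfn.get_stack_policy(StackName=stack_name)
    return {
        "stack": stack_name,
        "termination_protection": bool(stack.get("EnableTerminationProtection")),
        "stack_policy": bool(policy.get("StackPolicyBody")),
        # Nested stacks report a ParentId; their termination protection
        # is managed on the root stack.
        "nested": "ParentId" in stack,
    }


def stacks_needing_attention(cfn, stack_names):
    """Return the names of stacks missing at least one stack-level guardrail."""
    reports = [audit_stack_guardrails(cfn, name) for name in stack_names]
    return [
        r["stack"]
        for r in reports
        if not (r["termination_protection"] and r["stack_policy"])
    ]
```

A report like this only says whether a guardrail exists, not whether it is strict enough, so treat it as a starting list for the manual review described above.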

Conclusion

We just walked through a sequence that starts with the most granular controls at the individual resource level, moves up to stacks, and finishes with users; following it will likely leave you with tightly enforced, least-privilege controls. Alternatively, you can start with users, groups, and roles, and then work your way down through stacks and resources. That order is easier to justify if the history of your unplanned downtime events already suggests that your IAM policies need more urgent attention. In either case, you want to ensure that there are multiple layers of safety in place by having multiple guardrails apply to your most critical resources.

Overall, the key to making the most of these features is to carefully test and adapt your use of these guardrails over time, and to ensure that you have multiple guardrail layers in place for the most critical stacks and resources. Many existing CloudFormation best practices still apply; for example, smaller stacks will be easier to test and ultimately protect than large, complex ones. Finally, consider ways to establish automated tests for your template code by implementing processes like validation pipelines.

 

About the Author

Luis Colon is a Senior Developer Advocate for the AWS CloudFormation team. He works with customers and internal development teams to focus on and improve the developer experience for CloudFormation users. In his spare time, he mixes progressive trance music.