Coordinating complex resource dependencies across CloudFormation stacks

There are many benefits to using Infrastructure as Code (IaC), but as you grow your infrastructure or your IaC coverage, the number of components and their dependencies can become increasingly complex. In this post, we walk through strategies to address this complexity.

CloudFormation has built-in support for defining dependencies across resources in your template. Not only does it allow you to explicitly specify that a resource depends on another, but it can also infer dependencies from the way resources reference each other. Often, though, you’ll be working with multiple interconnected templates. You may do that to promote reusability, maintainability, and ownership, or because CloudFormation is a regional service and your deployments target multiple regions and/or accounts (e.g., primary-secondary, hub-spoke architectures).

When you have dependencies across templates, the easy, non-automated way of addressing them is to deploy those templates manually in the desired order. That goes against IaC principles and introduces a human component in the provisioning process, making it slower, less reproducible, and more error-prone. A better approach is to use automated steps that deploy templates in a specific order. In some cases, for instance when using StackSets configured to AutoDeploy to new accounts, human intervention is not an option at all, and there’s no guarantee of the order in which stack instances will be deployed.

To achieve greater infrastructure automation, we also need to define these inter-dependencies as IaC. CloudFormation is flexible enough to accomplish this through different mechanisms. In this blog post, we explore how to use custom resources with Lambda functions and EventBridge rules to achieve that.

Building a custom cross-stack dependency mechanism

CloudFormation allows for customized behavior through Custom Resources, Template Macros and Resource Types. We will explore how to use Custom Resources along with WaitCondition resources, so we can coordinate multiple stacks based on custom events within the same account and region.

Consider the scenario where you want to deploy a new network with VPC baseline configurations and a bastion host. Let’s suppose we have this split into two templates for better reusability: a network baseline template and a bastion host template. There’s a dependency between them, but they do not need to run completely sequentially. At a resource level, only the EC2 instance in the bastion host template depends on the subnet in the network baseline template. This means that the bastion host template could, for example, safely create IAM roles and instance profiles at any time, concurrently with the network baseline.

The two CloudFormation templates would look something like this:

# Network baseline template

Resources:
  rVpc:
    Type: AWS::EC2::VPC
    Properties:
      ...

  rSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      ...
      VpcId: !Ref rVpc

And:

# Bastion host template

Resources:
  rEc2InstanceRole: 
    Type: AWS::IAM::Role
    Properties:
      ...      

  rEc2InstanceProfile: 
    Type: AWS::IAM::InstanceProfile
    Properties: 
      Path: /
      Roles: 
        - !Ref rEc2InstanceRole

  rEc2Instance:
    Type: AWS::EC2::Instance
    Properties:
      ...
      SubnetId: <<ID for rSubnet in the other stack>>
      IamInstanceProfile: !Ref rEc2InstanceProfile

The real dependency in this case is that rEc2Instance has to be created after rSubnet. Because they are in different templates, we can’t use CloudFormation’s DependsOn attribute or rely on dependencies inferred from references. Instead, we can use a WaitCondition and a cross-stack mechanism to signal to it when the condition is satisfied. Figure 1 shows the high-level idea of how to coordinate stack creation using a WaitCondition.

Figure 1. High-level idea to coordinate stack creation using WaitCondition

In the bastion host template:
1. Create a rEc2Instance depending on a rRequireSubnetCondition WaitCondition
In the network baseline template:
1. Create a rResolvePending custom resource, depending on rSubnet, that will signal the rRequireSubnetCondition WaitCondition in the bastion host stack to continue

That would guarantee that if the bastion host stack is deployed first, it will wait until rSubnet in the network baseline stack is provisioned before continuing to provision rEc2Instance.
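The signal itself is an HTTP PUT of a small JSON document to the presigned URL that the WaitConditionHandle exposes. The following is a minimal sketch of that step, not the code from the post's repository: the function names `build_signal` and `send_signal` and the `{"subnetId": ...}` payload shape are assumptions made for illustration.

```python
import json
import urllib.request


def build_signal(unique_id, data, status="SUCCESS", reason="Dependency provisioned"):
    """Build the JSON body a WaitCondition handle expects.

    `data` is serialized to a JSON string so the waiting stack can parse it
    later; that payload shape is an assumption of this sketch.
    """
    return json.dumps({
        "Status": status,       # "SUCCESS" or "FAILURE"
        "Reason": reason,
        "UniqueId": unique_id,  # identifies this signal on the WaitCondition
        "Data": json.dumps(data),
    }).encode("utf-8")


def send_signal(handle_url, body):
    """PUT the signal to the presigned WaitConditionHandle URL."""
    request = urllib.request.Request(
        handle_url,
        data=body,
        method="PUT",
        # An empty Content-Type keeps urllib from adding its form-encoded
        # default, which may not match what the presigned URL was signed for.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

In the network baseline stack, the function behind rResolvePending would call something like `send_signal(url, build_signal("mySubnetDep", {"subnetId": subnet_id}))` once rSubnet exists.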

But how would the custom resource know what resource to signal to in which stack, and how do we retrieve the id of the subnet dependency? For that we can use Parameter Store, a capability of AWS Systems Manager, and WaitCondition’s callback data. The bastion host stack can create a Parameter Store parameter with all the information the custom resource needs in order to send the signal. We just need a naming strategy for our parameters so stacks can look up dependencies. An additional custom resource is also required to parse the JSON returned after the WaitCondition is satisfied. Figure 2 shows how parameters and WaitCondition callback data could be utilized.

Figure 2. Sharing data using parameters and WaitCondition callback data

In the bastion host template:
1. Create a rSubnetParameter parameter using a unique name, containing the callback handle URL to rRequireSubnetCondition
2. Create a rParseJson custom resource that will parse the data attribute received from rSignalFunction’s call
3. Have rEc2Instance get the subnetId from the JSON parsed by rParseJson
In the network baseline template:
1. Have rSignalFunction read the callback URL from Parameter Store using the unique name defined previously and call it, passing the id for rSubnet
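The parsing step in the bastion host stack can be sketched as follows. A WaitCondition's Data attribute is a JSON object that maps each signal's UniqueId to the Data string it carried, so the parser decodes twice: once for the outer object and once for each payload. The function names and the assumption that each payload is itself a JSON object are illustrative and not taken from the post's repository.

```python
import json


def parse_wait_condition_data(raw):
    """Flatten a WaitCondition Data attribute into a dict of attributes.

    `raw` looks like '{"<UniqueId>": "<Data string>"}'. Each Data string is
    assumed to be a JSON object (for example '{"subnetId": "subnet-..."}');
    anything else is kept as-is under its UniqueId.
    """
    outer = json.loads(raw)
    attributes = {}
    for unique_id, payload in outer.items():
        try:
            inner = json.loads(payload)
        except (TypeError, ValueError):
            inner = None
        if isinstance(inner, dict):
            attributes.update(inner)
        else:
            attributes[unique_id] = payload
    return attributes


def build_response(event, attributes):
    """Build the custom resource response that exposes the parsed values.

    Keys under "Data" become available through !GetAtt on the custom
    resource. The body still has to be PUT to event["ResponseURL"].
    """
    return {
        "Status": "SUCCESS",
        "PhysicalResourceId": event.get("PhysicalResourceId", event["LogicalResourceId"]),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": attributes,
    }
```

With this shape, a handler would call `parse_wait_condition_data(event["ResourceProperties"]["String"])`, and the resulting `subnetId` key is what rEc2Instance reads from rParseJson.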

Because templates can be deployed in any order, there’s a scenario we have not covered yet: the network baseline stack (or rResolvePending, more specifically) is provisioned before the bastion host stack (or rRequireSubnetCondition, more specifically). In that case, by the time the WaitCondition is created, the Lambda function has already run, so the WaitCondition would never receive the signal and would eventually time out. We need a mechanism to signal WaitConditions that are created after the resource they depend on is already provisioned.

Amazon EventBridge can define rules based on AWS events. We can define a rule that matches Parameter Store parameter creation events and restrict it to the unique name we chose to represent the dependency. The rule would target our existing rSignalFunction. Figure 3 shows the final design, including the mechanism to signal to resources created afterwards.

Figure 3. Setting up EventBridge rules to signal to future stacks
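With this design, rSignalFunction is invoked in two ways: as a custom resource when the network baseline stack is created, and as an EventBridge target when a matching parameter appears later. The sketch below shows one way the function might decide which handle URLs to signal in each case. The function names are hypothetical, and `ssm_client` is assumed to be a boto3 Systems Manager client (or any object with the same `get_parameters_by_path` and `get_parameter` methods) passed in by the caller.

```python
def list_handle_urls(ssm_client, prefix):
    """Collect the value of every parameter stored under `prefix`."""
    urls = []
    kwargs = {"Path": prefix, "Recursive": True}
    while True:
        page = ssm_client.get_parameters_by_path(**kwargs)
        urls.extend(p["Value"] for p in page.get("Parameters", []))
        token = page.get("NextToken")
        if not token:
            return urls
        kwargs["NextToken"] = token  # fetch the next page of results


def handles_to_signal(event, ssm_client, prefix):
    """Return the WaitCondition handle URLs this invocation should signal."""
    if "RequestType" in event:
        # Custom resource invocation: on Create, signal every dependent
        # stack that has already registered its handle.
        if event["RequestType"] == "Create":
            return list_handle_urls(ssm_client, prefix)
        return []
    if event.get("detail-type") == "Parameter Store Change":
        # EventBridge invocation: a dependent stack registered afterwards.
        detail = event["detail"]
        name = detail["name"]
        if detail.get("operation") == "Create" and name.startswith(prefix + "/"):
            return [ssm_client.get_parameter(Name=name)["Parameter"]["Value"]]
    return []
```

When invoked as a custom resource, the handler still has to send a response to CloudFormation for every request type, including Update and Delete, or the stack operation will hang until it times out.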

The structure of CloudFormation templates for our final design would be:

# Network baseline template
Transform: AWS::Serverless-2016-10-31
...

Resources:

  # Resources being dependent upon
  
  rVpc:
    Type: AWS::EC2::VPC
    Properties:
      ...

  rSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      ...
      VpcId: !Ref rVpc

  # Dependency Coordination

  rSignalFunction:
    Type: AWS::Serverless::Function
    Properties:
      ...
      Environment:
        Variables:
          DEPENDENCY_ID: mySubnetDep
          SUBNET_ID: !Ref rSubnet
      Events:
        ParameterRule:
          Type: EventBridgeRule
          Properties:
            Pattern:
              source: [ aws.ssm ]
              detail-type: [ Parameter Store Change ]
              detail: 
                name: [ { prefix: !Sub '/cf-deps/mySubnetDep' } ]
                operation: [ Create ]

  rResolvePending:
    Type: AWS::CloudFormation::CustomResource
    Properties:
      ServiceToken: !GetAtt rSignalFunction.Arn

And:

# Bastion host template
Transform: AWS::Serverless-2016-10-31
...

Resources:  
  # Dependency Coordination

  rRequireSubnetHandle:
    Type: AWS::CloudFormation::WaitConditionHandle

  rRequireSubnetCondition:
    Type: AWS::CloudFormation::WaitCondition
    Properties:
      ...
      Handle: !Ref rRequireSubnetHandle

  rSubnetParameter:
    Type: AWS::SSM::Parameter
    Properties:
      ...
      Name: !Sub /cf-deps/mySubnetDep/${AWS::StackName}
      Value: !Ref rRequireSubnetHandle

  rParserFunction:
    Type: AWS::Serverless::Function
    Properties:
      ...

  rParseJson:
    Type: AWS::CloudFormation::CustomResource
    Properties:
      ServiceToken: !GetAtt rParserFunction.Arn
      String: !GetAtt rRequireSubnetCondition.Data

  # Dependent

  rEc2InstanceRole: 
    Type: AWS::IAM::Role
    Properties:
      ...      

  rEc2InstanceProfile: 
    Type: AWS::IAM::InstanceProfile
    Properties: 
      Path: /
      Roles: 
        - !Ref rEc2InstanceRole

  rEc2Instance: 
    Type: AWS::EC2::Instance
    Properties:
      ...
      IamInstanceProfile: !Ref rEc2InstanceProfile 
      SubnetId: !GetAtt rParseJson.subnetId

Using this method has a few advantages:

Both stacks can be created simultaneously, in any order, independent of each other. If a stack has any dependencies that are not satisfied, that stack will wait until the dependency is provisioned. This means you achieve higher parallelism and consequently faster provisioning times.
You only work with the stack you are creating or updating, instead of an over-arching template with nested templates dependent on each other. That means a reduced blast radius as there’s no risk of unintentionally updating unrelated nested stacks.
There’s no central template that needs to be updated, and the build process is simplified.
Dependency information is now where it should be: in the template using the resource, instead of being contextual based on how a parent template defined it. That makes it easy to understand dependencies and keeps others from unintentionally deploying the template outside of a parent stack or stack set.
It’s possible to expand on the concept to make it cross-region and cross-account. This means you can have other automation mechanisms to deploy stacks directly to the intended accounts.

There are also some nuances to this solution:

There’s a small overhead as this solution creates additional resources (Lambda functions, Parameter Store parameters and EventBridge rules).
The naming strategy for Parameter Store parameters needs to be consistent, since it’s used to indicate the dependencies. If you are launching the same template multiple times (e.g., networkBaselineAnalytics and networkBaselineDev), make sure your naming strategy accounts for that. The dependent resource should use the right parameter name to create a dependency on the right resource. For example, you may want to use a naming strategy that includes a qualifier and the stack name, such as /cf-deps/mySubnetDep-${team}/${AWS::StackName}.
CloudFormation supports referencing Parameter Store parameters through dynamic references, but they are evaluated when the templates are first deployed. These dynamic references will not work on parameters created after the stack is launched.
Stacks dependent on other stacks can be spun up at any time and concurrently with their parents, but will eventually time out if the dependencies are not resolved in time. This timeout is good design and comes for free with CloudFormation’s WaitCondition. If that happens, you will get a clear error and it should be easy to trace back the unsatisfied dependency.
You cannot retroactively depend on an already existing resource that wasn’t set up using this mechanism.
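To make the naming point above concrete, here is a small sketch of helpers that build parameter names and lookup prefixes following the /cf-deps/... convention. The helper names, the `qualifier` argument, and the trailing slash on the prefix are assumptions of this sketch; the trailing slash is one way to keep a prefix match on mySubnetDep from also matching a differently named dependency such as mySubnetDepOther.

```python
def dependency_segment(dependency_id, qualifier=None):
    """Return the path segment that identifies one dependency."""
    return f"{dependency_id}-{qualifier}" if qualifier else dependency_id


def dependency_parameter_name(dependency_id, stack_name, qualifier=None, root="/cf-deps"):
    """Name of the parameter a dependent stack creates to register itself."""
    return f"{root}/{dependency_segment(dependency_id, qualifier)}/{stack_name}"


def dependency_prefix(dependency_id, qualifier=None, root="/cf-deps"):
    """Prefix the providing stack matches on; ends with '/' on purpose."""
    return f"{root}/{dependency_segment(dependency_id, qualifier)}/"
```

For example, `dependency_parameter_name("mySubnetDep", "bastionDev", qualifier="analytics")` yields `/cf-deps/mySubnetDep-analytics/bastionDev`, which starts with the prefix for the analytics dependency but not with the prefix for an unqualified mySubnetDep.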

Conclusion

As organizations mature their IaC practices, it’s common to build an increasing number of assets like CloudFormation templates that are composable and reusable. This granularity comes with the added complexity of cross-stack dependencies and the challenge of defining them in an automation-friendly way.

We described how to build a mechanism to coordinate dependencies across CloudFormation templates, and walked through the thought process and example templates for a network baseline and a bastion host. The proposed mechanism allows dependencies to be defined at a finer-grained level, which in turn allows for faster provisioning times and a reduced blast radius.

To see these strategies employed in a real use case, take a look at the GitHub repository for the Automate Networking foundation in multi-account environments blog post. The full example discussed in this post is also available on GitHub.

AWS Cloud Operations & Migrations Blog
