Accelerating development feedback loops with AWS CDK hotswap deployments for Amazon ECS

Introduction

Culture Amp is an employee experience platform that provides the insights into employee engagement, performance, and development that organizations need to build a category-defining culture. Culture Amp’s mission is to create a better world of work.

Developer feedback loops

An efficient developer feedback loop is a critical part of an effective development process – ideally the engineer has timely “loops” that provide them with reliable feedback on their implementation thus far. Successive loops often increase in complexity – providing more realistic feedback at the cost of increased “loop time”. Delays or overheads due to poor feedback loops are a common source of friction, frustration, and lost productivity for engineers.

Figure 1. Developer feedback loop

As in the preceding diagram, an engineer’s feedback loop often includes deployment to a cloud development environment for integration testing with other services and components in the ecosystem. Modern cloud software ecosystems are often driven by continuous integration/continuous delivery (CI/CD) patterns that promote repeatability and safety through pipelines and automation. When these cloud deployments become part of the developer feedback loop, as they do for development deploys, there is a natural tension between deployment speed and the safety of the deployment automation that allows us to deploy with confidence.

At Culture Amp we maintain Buildkite pipelines that orchestrate the build and deploy of infrastructure implemented as AWS Cloud Development Kit (CDK) applications. Many of our services use Amazon Elastic Container Service (ECS) and AWS Fargate for their compute needs. This results in a deployment toolchain with the following components:

Figure 2. Deployment toolchain

At Culture Amp we have 200+ developers performing 2000-3000 deployments per week across the development environments for the 204 ECS services that make up our platform. The majority of these deployments through the AWS CDK are just application code changes. Not all of these deployments have an engineer on the other end waiting to continue their feedback loop, but if even half of them do, then even small improvements in feedback time due to faster deployments can translate to large boosts in productivity.

This article explores the use of ECS hotswap deployment support in the AWS CDK to save time on code-only deployments and accelerate the developer experience – focusing on how we quantified the benefit using real data from our environments.

Hotswap deployments for faster deployments

The AWS CDK allows for AWS infrastructure to be defined in code and then deployed to AWS safely and repeatably through AWS CloudFormation stacks.

The AWS CDK supports “hotswap” deployments for selected resource types such as ECS services and AWS Lambda functions that allow us to optimize for deployment speed at the cost of some safety. If a change to be deployed is limited to one of these supported resources, then the AWS CDK can perform a specific update on that resource, skipping the CloudFormation engine.

Figure 3. AWS CDK hotswap deployment

The key use case positioned for these “hotswap” deployments has been local development speed, but at Culture Amp we suspected that the same speed-up could also be harnessed in our wider CI/CD systems to shorten development feedback loops.

Figure 4. Use cases for hotswap deployment

Hotswap deployments and CloudFormation “drift”

The drawback of “hotswap” deployments is that they intentionally introduce “drift” into the associated CloudFormation stacks. This “drift” can disrupt future updates to the affected stack, because the underlying resources are no longer in the state that CloudFormation expects. Accepting this risk is not recommended in production, but it is acceptable in our situation because we are specifically looking to improve development feedback loops.

Using “hotswap” deployments for Amazon ECS can improve our development feedback loops by increasing CI/CD deployment speed in our development environments. But by how much?

Building a data-driven view of this opportunity allows us to balance the productivity benefits against any costs to introduce hotswaps into our pipelines and any drawbacks presented by allowing CloudFormation drift into our development environments.

Measuring deployment performance of AWS CDK for AWS Fargate

To understand the impact that using “hotswap” deployments would have for our organization, we needed to be able to measure how much time our current deployments using the AWS CDK onto AWS Fargate were spending in the CloudFormation engine.

Measuring overheads in creation of CloudFormation change sets

By default, the AWS CDK performs updates to CloudFormation through a CloudFormation change set. This process gives the developer more clarity about the updates to be performed, but it does add a little time.

Figure 5. CDK and CloudFormation change set

We determined that we can measure the time our pipeline spends creating a change set before executing it by comparing the CreationTime of the change set with the LastUpdatedTime of the stack.

In the following example we can see that the comparison of these values determines that this deployment spent 32 seconds on CloudFormation change set creation.

Figure 6. Calculating time to create change set
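The comparison itself is a timestamp subtraction. The following is a minimal sketch, assuming the two values have already been read from CloudFormation (for example, CreationTime from a DescribeChangeSet response and LastUpdatedTime from a DescribeStacks response) and are available as ISO 8601 strings. The function name and the sample timestamps are hypothetical; the timestamps were chosen only to reproduce the 32-second example.

```python
from datetime import datetime


def change_set_creation_seconds(change_set_creation_time: str,
                                stack_last_updated_time: str) -> float:
    """Seconds between the change set being created and the stack update starting."""
    created = datetime.fromisoformat(change_set_creation_time)
    updated = datetime.fromisoformat(stack_last_updated_time)
    return (updated - created).total_seconds()


# Hypothetical timestamps, not taken from a real stack.
print(change_set_creation_seconds("2024-01-15T03:10:00+00:00",
                                  "2024-01-15T03:10:32+00:00"))  # 32.0
```

Running the same calculation over many stacks gives the kind of distribution needed to estimate an average.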

Analyzing a set of our development CloudFormation stacks that orchestrate ECS services (n = 605) showed an average CloudFormation change set creation time of 18 seconds. The change set creation time appears to vary primarily with the number of resources in the stack.

Figure 7. Distribution of CloudFormation change set creation time

Switching to hotswap deployments for ECS may save 18 seconds per deployment on change set creation, on average.

Measuring overheads in deploying Amazon ECS through the CloudFormation stack

We can determine the time spent in the CloudFormation engine during a normal deployment by constructing a timeline using relevant CloudFormation stack events and ECS service events.

Figure 8. ECS service events showing when Amazon ECS started and completed the deployment

Figure 9. CloudFormation stack events showing when CloudFormation started and completed the update on the ECS service resource

Figure 10. Capturing timestamps for CloudFormation and ECS service events

In the preceding example, we see a delay of 14 seconds between CloudFormation signaling the beginning of its update and Amazon ECS beginning the deployment, then 13 seconds between Amazon ECS completing the deployment and CloudFormation signaling the resource update is complete. The addition of the CloudFormation engine orchestrating the deployment has added 27 seconds.
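The timeline arithmetic can be sketched as follows. This is an illustration only: the function name and the four timestamps are hypothetical, with the timestamps chosen to reproduce the 14-second and 13-second gaps described above. In practice the values would come from the CloudFormation stack events and the ECS service events for the same deployment.

```python
from datetime import datetime


def cloudformation_overhead_seconds(cfn_update_started: str,
                                    ecs_deploy_started: str,
                                    ecs_deploy_completed: str,
                                    cfn_update_completed: str):
    """Return (delay before ECS starts, delay after ECS completes, total overhead)."""
    parse = datetime.fromisoformat
    before = (parse(ecs_deploy_started) - parse(cfn_update_started)).total_seconds()
    after = (parse(cfn_update_completed) - parse(ecs_deploy_completed)).total_seconds()
    return before, after, before + after


# Hypothetical event timestamps for a single deployment.
print(cloudformation_overhead_seconds(
    "2024-01-15T10:00:00+00:00",  # CloudFormation begins updating the ECS service
    "2024-01-15T10:00:14+00:00",  # Amazon ECS starts the deployment
    "2024-01-15T10:03:00+00:00",  # Amazon ECS completes the deployment
    "2024-01-15T10:03:13+00:00",  # CloudFormation reports the update complete
))  # (14.0, 13.0, 27.0)
```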

Analysis of our own development deploys over a period of a week (n = 1926) showed an average overhead from the CloudFormation engine of 50 seconds.

Figure 11. Comparing CloudFormation and ECS hotswap deployments

Switching to hotswap deployments for Amazon ECS may save as much as 50 seconds per deployment by avoiding the overheads of the CloudFormation engine.

Implementing hotswaps in a Buildkite pipeline

Hotswap deployments are enabled in the AWS CDK CLI using one of two command-line flags, depending on the desired behavior.

Figure 12. CDK CLI hotswap flags

For our situation, the transparent --hotswap-fallback behaviour is desired: developers benefit from faster hotswap deployments when possible, without additional decision-making or cognitive load on their part when a change is not “code-only” and a full deployment is needed.

Dynamic control through environment variables

The AWS CDK uses yargs to manage the command-line experience and is configured so that command-line flags can also be set through environment variables. This allows us to avoid dynamically constructing a deploy command and instead control behavior in existing scripts that manage environment variables.

At Culture Amp, branch-based deployments always run in a development context, so the branch becomes a convenient differentiator that our scripts can use to enable hotswap deployments.

Figure 13. Controlling hotswap behavior using environment variables
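As a rough sketch of this idea (not Culture Amp's actual script), the following helper builds the environment for a deploy step. It rests on assumptions that should be verified before use: that the CDK CLI's yargs configuration uses a CDK_ prefix, so that --hotswap-fallback corresponds to CDK_HOTSWAP_FALLBACK; that the default branch is named main; and that any other branch is a development deploy. The function name is hypothetical. BUILDKITE_BRANCH is the variable Buildkite sets for the branch being built.

```python
import os


def cdk_deploy_env(branch: str, default_branch: str = "main", base_env=None) -> dict:
    """Return the environment for `cdk deploy`, enabling hotswap on branch deploys.

    Assumption: the CLI maps CDK_HOTSWAP_FALLBACK to --hotswap-fallback.
    """
    env = dict(os.environ if base_env is None else base_env)
    # An unknown (empty) branch is treated as non-development, the safer default.
    is_development = bool(branch) and branch != default_branch
    if is_development:
        env["CDK_HOTSWAP_FALLBACK"] = "true"
    return env


# A branch deploy gets the flag; a default-branch deploy is left untouched.
print(cdk_deploy_env("feature/faster-deploys", base_env={}))  # {'CDK_HOTSWAP_FALLBACK': 'true'}
print(cdk_deploy_env("main", base_env={}))  # {}

# In a pipeline step this might be used as:
#   subprocess.run(["cdk", "deploy"], check=True,
#                  env=cdk_deploy_env(os.environ.get("BUILDKITE_BRANCH", "")))
```

Because the deploy command itself is unchanged, the existing pipeline scripts only need to decide whether to set the variable.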

Early results at Culture Amp

Live hotswap deployments in a test scenario confirm the analysis, saving on average 70 seconds per deployment (a 36% improvement). Tests on our internal Backstage services were even more promising – saving on average 131 seconds per deployment (a 45% improvement).

Figure 14. Comparing hotswap and full deployments

Figure 15. Our internal Backstage services deploying through hotswap in development – approximately three minutes

Figure 16. Our internal Backstage services deploying through a full deployment in production – approximately five minutes

Next steps

We are now looking to identify the right pilot opportunities to enable hotswap deployments in services and see hotswaps in action more widely at Culture Amp. The behaviour of our current deployments is well understood by SRE and development teams, and it is a priority for us to understand clearly whether there are novel behaviors to accommodate before enabling hotswap deployments generally. As an example, we want to understand more clearly how the failure and rollback modes of hotswap deployments differ from those of full deployments so that we can best support developers in their workflows.

Conclusion

At Culture Amp we identified that hotswap Amazon ECS deployments in our AWS CDK deployment pipelines could accelerate our developer feedback loops. An analysis of our existing CloudFormation-based AWS CDK deployments allowed us to take a data-driven approach to quantifying the potential productivity benefits. Practical pilots of using hotswap deployments confirm or exceed the analysis.

Further pilots are aimed at better understanding novel failure scenarios for hotswaps as compared with full CloudFormation-based deployments.

In our tests so far, hotswap Amazon ECS deployments achieved time savings of 30-40% per deployment. Applied to the 2000-3000 development deployments that we perform at Culture Amp each week, this translates to a potential 40-60 hours saved on deployment time each week.
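The weekly estimate can be checked with a short calculation. This sketch assumes the 70-second average saving from the early test results applies to every development deployment, which is an upper bound, since not every deployment is a code-only change that can be hotswapped. The constant and function names are illustrative.

```python
SECONDS_SAVED_PER_DEPLOY = 70   # average saving reported in the early results
WEEKLY_DEPLOYS = (2000, 3000)   # range of development deployments per week


def weekly_hours_saved(deploys_per_week: int,
                       seconds_saved: float = SECONDS_SAVED_PER_DEPLOY) -> float:
    """Hours of deployment time saved per week if every deploy saves `seconds_saved`."""
    return deploys_per_week * seconds_saved / 3600


for deploys in WEEKLY_DEPLOYS:
    print(f"{deploys} deploys/week -> {weekly_hours_saved(deploys):.1f} hours saved")
# 2000 deploys/week -> 38.9 hours saved
# 3000 deploys/week -> 58.3 hours saved
```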
