Implement automatic drift remediation for AWS CloudFormation using Amazon CloudWatch and AWS Lambda

“Stack drift” is a common occurrence for organizations using AWS CloudFormation, and remediating stack drift represents a persistent and tedious challenge for organizations managing critical infrastructure with CloudFormation stacks. Stack drift occurs when the actual configuration of an infrastructure resource differs from its expected configuration. Typically, this is caused by users editing resources directly by using the underlying service that created the resource. Changes that cause stack drift may be accidental, or may be made intentionally to respond to time-sensitive operational events. For example, you can manually add more capacity to a DynamoDB table to respond to increased demand. Regardless of the origin, minimizing stack drift helps to ensure configuration consistency and successful stack operations.

When resources are created as part of a CloudFormation stack, they are created according to the specifications in the stack template. However, once created, these resources can be edited directly, causing their specification to no longer match the specification outlined in the template. For instance, an IAM role created in a CloudFormation stack can be modified with additional policies after creation. Although this IAM role is still a part of the CloudFormation stack (and would be deleted if the CloudFormation stack were deleted), the specifications of the IAM role no longer match those laid out in the template.

AWS CloudFormation offers a “drift detection” feature to automatically detect unmanaged configuration changes to stacks and resources. With this feature, AWS CloudFormation compares the current specifications of resources in a stack against the specifications defined in the stack template, and reports the differences. To return a resource to compliance with the specifications in the stack template, you can edit the resource directly, manually import it into a new stack, or destroy the stack and recreate it with new resources.
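
To make the idea of comparing expected and actual specifications concrete, the following is a minimal conceptual sketch. It is not how AWS CloudFormation implements drift detection internally; the function name diff_properties and the sample property values are hypothetical and used only for illustration.

```python
def diff_properties(expected, actual):
    """Return {property: (expected_value, actual_value)} for each differing property."""
    differences = {}
    for name in sorted(set(expected) | set(actual)):
        expected_value = expected.get(name)
        actual_value = actual.get(name)
        if expected_value != actual_value:
            differences[name] = (expected_value, actual_value)
    return differences


# Hypothetical property values for an IAM role, for illustration only.
expected = {
    "RoleName": "ExampleRole",
    "ManagedPolicyArns": ["arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess"],
}
actual = {
    "RoleName": "ExampleRole",
    "ManagedPolicyArns": [
        "arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess",
        "arn:aws:iam::aws:policy/AdministratorAccess",
    ],
}

drift = diff_properties(expected, actual)
for name, (want, have) in drift.items():
    print(f"{name}: expected {want}, found {have}")
```

In this sketch only ManagedPolicyArns is reported, because RoleName is identical in both dictionaries.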

In this post, I demonstrate how an organization that uses AWS CloudFormation for mission-critical resource management can use a custom AWS Lambda function and Amazon CloudWatch to implement automatic drift remediation and return resources created in a CloudFormation stack to compliance with the stack template. Using a custom Lambda function for remediating stack drift offers a familiar, automated, scalable, and customizable alternative to manually resolving stack drift.

Prerequisites

To build the solution outlined in this post, you need:

An AWS account
Access to the AWS Management Console with permissions to create resources and manage applications
Basic knowledge of AWS CloudFormation, AWS Lambda, and Python 3.7

Benefits of implementing drift remediation

Implementing automatic drift remediation can provide the following benefits:

Simplify stack operations and maintenance capabilities: changes to resources outside of AWS CloudFormation can complicate stack operations. Changes can also complicate troubleshooting and replication processes for complex resource configurations, as it is difficult to know the exact state of resources at a given time. Automatically remediating configuration drifts can ensure that resources are always operating according to their stack template specifications, and stack operations proceed smoothly.
Reduce risk to sensitive resources: prevent resources that are highly sensitive (for example, IAM roles, S3 bucket policies, security groups) from being accidentally modified. This can ensure that all applications remain secure, and simplify audit and compliance processes.
Define custom remediation and notification logic: the approach used in this post lets you write custom code that determines which resources are automatically returned to compliance and which resources can, optionally, remain out of compliance. Drift notifications to services such as Amazon SNS can easily be added as well, if desired.

For organizations using large teams to operate agile, sensitive applications, implementing automatic drift remediation can be a valuable addition to ensure that resources remain in compliance with stack templates.

Solution overview

The following diagram shows the high-level architecture that you use to implement automatic drift remediation.

Automatic drift remediation solution architecture

To monitor the resources in the CloudFormation stack, you create a Lambda function that is triggered on a schedule by a CloudWatch Events rule. This Lambda function checks if any resource in the stack has drifted, and if so, returns the resource to compliance.

Getting started

To build the architecture described in the solution overview, you need a CloudFormation stack in which to monitor resources, detect configuration drift, and enforce compliance. The following AWS CloudFormation template defines several resources that are used in this post to demonstrate implementation of automatic drift remediation:

an IAM role, “AutomaticDriftRemediationRole”. This role combines AWS managed policies with the two customer managed policies described next, which the template defines inline in the role's Policies property.
a customer managed IAM policy, “AutomaticDriftRemediationPolicyOne”. This policy grants read-only access to Amazon S3 and Amazon S3 Glacier.
a customer managed IAM policy, “AutomaticDriftRemediationPolicyTwo”. This policy grants the permissions required for working with specific Amazon ECS clusters.

The resources defined in this template are not meant to be taken as an example of practical security policies, but rather as an example of types of resources that can be used with the automatic drift remediation patterns described in this post.

AWSTemplateFormatVersion: "2010-09-09"
Description: Contains IAM Role and policies to support drift remediation demo
Resources:
  AutomaticDriftRemediationRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: 'AutomaticDriftRemediationRole'
      Path: /
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
              - ec2.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
      - arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
      Policies:
      - PolicyDocument:
          Statement:
          - Action:
            - s3:Get*
            - s3:List*
            Effect: Allow
            Resource: '*'
          - Action:
            - glacier:DescribeJob
            - glacier:DescribeVault
            - glacier:GetDataRetrievalPolicy
            - glacier:GetJobOutput
            - glacier:GetVaultAccessPolicy
            - glacier:GetVaultLock
            - glacier:GetVaultNotifications
            - glacier:ListJobs
            - glacier:ListMultipartUploads
            - glacier:ListParts
            - glacier:ListTagsForVault
            - glacier:ListVaults
            Effect: Allow
            Resource: '*'
        PolicyName: 'AutomaticDriftRemediationPolicyOne'
      - PolicyDocument:
          Statement:
          - Action:
            - ecs:ListClusters
            - ecs:DescribeContainerInstances
            Effect: Allow
            Resource:
            - arn:aws:ecs:us-east-1:<YOUR_AWS_ACCOUNT>:service/exampleClusterOne*
            - arn:aws:ecs:us-east-1:<YOUR_AWS_ACCOUNT>:service/exampleClusterTwo*
        PolicyName: 'AutomaticDriftRemediationPolicyTwo'

To create this CloudFormation stack, download this template and run the command below after replacing the placeholder values:

<YOUR_AWS_ACCOUNT>: AWS account ID used in the Amazon ECS resource ARNs in the template.
<YOUR_AWS_REGION>: AWS Region in which to create resources.
<YOUR_TEMPLATE_LOCATION>: local path of the saved CloudFormation template.

Be sure to configure your AWS CLI with an IAM user that has permissions to create the resources described in the template. Refer to managing IAM permissions for more details on creating custom IAM users and policies.

aws cloudformation create-stack \
--region <YOUR_AWS_REGION> \
--capabilities CAPABILITY_NAMED_IAM \
--stack-name drift-remediation-demo \
--template-body file://<YOUR_TEMPLATE_LOCATION>

This command creates a CloudFormation stack, drift-remediation-demo, that contains the IAM role and policies that you use to test the solution architecture.

Sample CloudFormation stack

You should find that your IAM role has been created and the policies defined in the template have been created and attached.
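
If you prefer to check the role programmatically rather than in the console, a minimal sketch along the following lines could be used. The helper name summarize_role_policies is hypothetical. It assumes iam_client is a boto3 IAM client (for example, boto3.client('iam')) and that the role has few enough policies that pagination is not needed.

```python
def summarize_role_policies(iam_client, role_name):
    """Return the managed policy ARNs and embedded policy names for a role.

    Assumption: iam_client behaves like boto3.client('iam'). Only the first
    page of results is read, which is enough for the small demo role.
    """
    attached = iam_client.list_attached_role_policies(RoleName=role_name)
    embedded = iam_client.list_role_policies(RoleName=role_name)
    return {
        "managed_policy_arns": sorted(
            policy["PolicyArn"] for policy in attached["AttachedPolicies"]
        ),
        "embedded_policy_names": sorted(embedded["PolicyNames"]),
    }


# Example usage (requires AWS credentials and the boto3 package):
#   import boto3
#   print(summarize_role_policies(boto3.client("iam"),
#                                 "AutomaticDriftRemediationRole"))
```

For the stack in this post, the result should list the two AWS managed policy ARNs from the template and the names AutomaticDriftRemediationPolicyOne and AutomaticDriftRemediationPolicyTwo.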

IAM role created by CloudFormation stack

Lambda drift remediation function

To ensure that the resources defined in the CloudFormation stack remain in compliance with the stack template, you define a Lambda function to detect stack drift and, if necessary, return the resources to compliance. The Lambda function in this post is written in Python 3.7, but it can be written for any runtime that AWS Lambda supports.

The core functionality of this Lambda function is to use the AWS CloudFormation drift detection feature exposed through the AWS CloudFormation API operations in combination with user-defined functions to return resources to compliance. The logic defined in your Lambda function depends entirely on the resources present in your CloudFormation stack. The example in this post defines two functions to enforce compliance for the IAM role created by the CloudFormation stack in the previous step:

repair_managed_policies(): add or remove AWS managed policies from the IAM role
repair_policies(): add or remove customer managed policies from the IAM role

These functions represent the “remediation” component of the solution. If you need to keep other aspects of your stack resources in compliance, you should write that functionality and include it in your Lambda function. For example, the two preceding functions serve only to keep AWS managed policies and customer managed policies in compliance, respectively. If another aspect of the IAM role were to change, such as the role description, the Lambda function may correctly identify this configuration drift but would not remediate it. The Lambda function in this post only contains functionality to detect configuration drift and return the policies attached to IAM roles to compliance. You can extend it to cover additional resource types and to add features such as Amazon SNS notifications or the option to leave selected resources out of compliance.
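
One possible way to organize such extensions is a small registry that maps resource types to remediation functions, combined with an allow-list of resources that may stay out of compliance. The following is only a sketch under those assumptions. The names REMEDIATORS, EXEMPT_LOGICAL_IDS, remediator, and remediate are hypothetical and do not appear in the function used in this post. The drift argument is assumed to be one entry of the StackResourceDrifts list returned by the drift detection API.

```python
import logging

LOGGER = logging.getLogger(__name__)

# Hypothetical registry: CloudFormation resource type -> remediation callable.
REMEDIATORS = {}

# Hypothetical allow-list of logical resource IDs that may remain drifted.
EXEMPT_LOGICAL_IDS = set()


def remediator(resource_type):
    """Decorator that registers a remediation function for a resource type."""
    def register(func):
        REMEDIATORS[resource_type] = func
        return func
    return register


def remediate(drift, notify=None):
    """Dispatch one drift record; return 'exempt', 'unsupported', or 'remediated'."""
    logical_id = drift["LogicalResourceId"]
    resource_type = drift["ResourceType"]
    if notify is not None:
        # notify is any callable that accepts a message string.
        notify(f"Drift detected on {logical_id} ({resource_type})")
    if logical_id in EXEMPT_LOGICAL_IDS:
        LOGGER.info("Leaving %s out of compliance (exempt)", logical_id)
        return "exempt"
    handler = REMEDIATORS.get(resource_type)
    if handler is None:
        LOGGER.info("No remediation logic registered for %s", resource_type)
        return "unsupported"
    handler(drift)
    return "remediated"


@remediator("AWS::IAM::Role")
def remediate_iam_role(drift):
    """Placeholder: the policy repair functions from this post would run here."""
    LOGGER.info("Remediating IAM role %s", drift["LogicalResourceId"])
```

The notify argument could wrap an Amazon SNS call, for example a function that passes the message to sns_client.publish(TopicArn=..., Message=...), where the topic ARN is one you supply.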

You configure this function to run on a schedule. Upon invocation, the function performs the following procedure:

Initiate a stack drift detection process for all stack resources. For more information on this process, refer to detect drift on an entire CloudFormation stack.
Wait for the drift detection to complete. For highly complex stacks, this can take up to a few minutes. For the simple stack in this example, this operation only takes a few seconds.
Evaluate whether configuration drift has occurred for any stack resources. Note that AWS CloudFormation does not support drift detection for all resources. Refer to resources that support drift detection for a complete list.
Retrieve the “actual” and “expected” configuration values for resources that have experienced configuration drift. These values come from the AWS CloudFormation drift detection API operations.
For resources that have drifted, use the user-defined functions to return the AWS managed policies and customer managed policies for the IAM role to compliance.

"""Lambda function to return IAM resources to compliance with CloudFormation stack template"""
import json
import logging
import time
import boto3

# Initialize Logger
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

# Define CloudFormation stack information
STACK_NAME = "drift-remediation-demo"
ROLE_NAME = "AutomaticDriftRemediationRole"

# Define boto3 clients
CF_CLIENT = boto3.client('cloudformation')
IAM_CLIENT = boto3.client('iam')

def repair_managed_policies(expected_value, actual_value):
    """Repair managed policies assigned to IAM role"""
    if expected_value != actual_value:
        role = boto3.resource('iam').Role(ROLE_NAME)
        # Remove undesired policies
        policies_to_remove = []
        for policy in actual_value:
            if policy not in expected_value:
                policies_to_remove.append(policy)
        for policy_to_remove in policies_to_remove:
            LOGGER.info("Removing policy %s from %s", policy_to_remove, ROLE_NAME)
            role.detach_policy(
                PolicyArn=policy_to_remove
                )
        # Add desired policies
        policies_to_add = []
        for policy in expected_value:
            if policy not in actual_value:
                policies_to_add.append(policy)
        for policy_to_add in policies_to_add:
            LOGGER.info("Adding policy %s to %s", policy_to_add, ROLE_NAME)
            role.attach_policy(
                PolicyArn=policy_to_add
                )
    else:
        LOGGER.info("No updates needed to managed policies for %s", ROLE_NAME)

def repair_policies(expected_value, actual_value):
    """Repair policies assigned to IAM role"""
    if expected_value != actual_value:
        iam = boto3.resource('iam')
        # Remove undesired policies
        policies_to_remove = []
        for policy in actual_value:
            if policy not in expected_value:
                policies_to_remove.append(policy)
        for policy_to_remove in policies_to_remove:
            LOGGER.info("Removing policy %s from %s", policy_to_remove, ROLE_NAME)
            role_policy = iam.RolePolicy(ROLE_NAME, policy_to_remove["PolicyName"])
            role_policy.delete()
        # Add desired policies
        policies_to_add = []
        for policy in expected_value:
            if policy not in actual_value:
                policies_to_add.append(policy)
        for policy_to_add in policies_to_add:
            LOGGER.info("Adding policy %s to %s", policy_to_add, ROLE_NAME)
            role_policy = iam.RolePolicy(ROLE_NAME, policy_to_add["PolicyName"])
            role_policy.put(
                PolicyDocument=json.dumps(policy_to_add["PolicyDocument"])
                )

def lambda_handler(event, context):
    """Handle Lambda invocations from CloudWatch"""
    # Initiate a stack drift detection
    initiate_stack_drift_detection = CF_CLIENT.detect_stack_drift(
                StackName=STACK_NAME
    )
    stack_drift_detection_id = initiate_stack_drift_detection["StackDriftDetectionId"]
    LOGGER.info("Initiating drift detection.  Stack Drift Detection Id: %s",
                stack_drift_detection_id)
    # Wait for the stack drift detection to complete
    drift_detection_status = ""
    while drift_detection_status not in ["DETECTION_COMPLETE",  "DETECTION_FAILED"]:
        check_stack_drift_detection_status = CF_CLIENT.describe_stack_drift_detection_status(
            StackDriftDetectionId=stack_drift_detection_id
        )
        drift_detection_status = check_stack_drift_detection_status["DetectionStatus"]
        # Add artificial delay to avoid throttling by CloudFormation APIs
        time.sleep(1)
    LOGGER.info("Drift detection complete. Stack Drift Status: %s", drift_detection_status)
    if drift_detection_status == "DETECTION_FAILED":
        LOGGER.info("The stack drift detection did not complete successfully for at "
                    "least one resource. Results will be available for resources that "
                    "successfully completed drift detection")
    # Check if the stack has drifted
    if check_stack_drift_detection_status["StackDriftStatus"] == "DRIFTED":
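        # Retrieve resources that have drifted. Only MODIFIED resources are requested,
        # so that both expected and actual properties are present for each result.
        stack_resource_drift = CF_CLIENT.describe_stack_resource_drifts(
            StackName=STACK_NAME,
            StackResourceDriftStatusFilters=["MODIFIED"]
        )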
        LOGGER.info("Drifted stack resources: %s", str(stack_resource_drift))
        # Iterate over drifted resources and return to compliance
        for drifted_stack_resource in stack_resource_drift["StackResourceDrifts"]:
            resource_type = drifted_stack_resource["ResourceType"]
            expected_properties = json.loads(drifted_stack_resource["ExpectedProperties"])
            actual_properties = json.loads(drifted_stack_resource["ActualProperties"])
            if resource_type == "AWS::IAM::Role":
                repair_managed_policies(expected_properties.get("ManagedPolicyArns", []),
                                        actual_properties.get("ManagedPolicyArns", []))
                repair_policies(expected_properties.get("Policies", []),
                                actual_properties.get("Policies", []))
    else:
        LOGGER.info("No drift detected")
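
The polling loop in the handler above runs until drift detection finishes. Because detection on complex stacks can take minutes, one option is to bound the wait so that the function fails with a clear error before the Lambda timeout is reached. The following sketch shows one way this might look. The helper wait_for_drift_detection and its parameters are hypothetical, and cf_client is assumed to behave like boto3.client('cloudformation').

```python
import time


def wait_for_drift_detection(cf_client, detection_id, timeout_seconds=25.0,
                             poll_interval=1.0, sleep=time.sleep,
                             clock=time.monotonic):
    """Poll until drift detection finishes; raise TimeoutError past the deadline.

    sleep and clock are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        response = cf_client.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if response["DetectionStatus"] in ("DETECTION_COMPLETE", "DETECTION_FAILED"):
            return response
        if clock() >= deadline:
            raise TimeoutError(
                f"Drift detection {detection_id} still in progress "
                f"after {timeout_seconds} seconds"
            )
        # Pause between polls to avoid throttling by the CloudFormation API.
        sleep(poll_interval)
```

The default of 25 seconds is an assumption chosen to stay below a 30-second function timeout; adjust both values together for larger stacks.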

To deploy your Lambda function, you must first create and deploy a Lambda deployment package containing your function code and dependencies (if applicable). For more information on packaging and deploying a Lambda function, see getting started with AWS Lambda. For this demo, you should name your Lambda function drift-remediation-demo-lambda and use a Python 3.7 runtime environment with a timeout of 30 seconds.

Additionally, you need to configure the Lambda function to use an AWS Lambda execution role with sufficient permissions to monitor the CloudFormation stack and modify the resources described in the stack.
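
As a starting point for that execution role, the following sketch builds a policy document for the demo stack. The helper name build_execution_policy is hypothetical, and the list of actions is an assumption based on the API calls made by the function in this post plus the read access that drift detection needs on the role. Verify it against your own stack and tighten the resource scopes before using it.

```python
import json


def build_execution_policy(account_id, role_name):
    """Return an IAM policy document (as a dict) for the remediation function."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Start drift detection and read its results.
                "Effect": "Allow",
                "Action": [
                    "cloudformation:DetectStackDrift",
                    "cloudformation:DetectStackResourceDrift",
                    "cloudformation:DescribeStackDriftDetectionStatus",
                    "cloudformation:DescribeStackResourceDrifts",
                ],
                "Resource": "*",
            },
            {
                # Read the monitored role and repair its policies.
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:GetRolePolicy",
                    "iam:ListAttachedRolePolicies",
                    "iam:ListRolePolicies",
                    "iam:AttachRolePolicy",
                    "iam:DetachRolePolicy",
                    "iam:PutRolePolicy",
                    "iam:DeleteRolePolicy",
                ],
                "Resource": role_arn,
            },
            {
                # Write function logs to CloudWatch Logs.
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


# The account ID below is a placeholder, not a real account.
policy = build_execution_policy("123456789012", "AutomaticDriftRemediationRole")
print(json.dumps(policy, indent=2))
```

Note that a role allowed to attach policies to another role is itself highly privileged, so scoping the IAM statement to the single monitored role, as sketched here, is worth keeping.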

Drift remediation Lambda function

CloudWatch Events rule

To continuously monitor the CloudFormation stack, you need to execute the Lambda function on a regular schedule. You use a CloudWatch Events rule to accomplish this. Follow the walkthrough for creating a CloudWatch Events rule that triggers on a schedule to create a rule that executes the Lambda function, drift-remediation-demo-lambda, every five minutes.
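
If you would rather create the schedule with code than through the console walkthrough, a sketch like the following could be used. The helper schedule_function, the rule name, and the target ID are hypothetical. events_client and lambda_client are assumed to behave like boto3.client('events') and boto3.client('lambda').

```python
def schedule_function(events_client, lambda_client, function_arn,
                      rule_name="drift-remediation-demo-schedule",
                      schedule="rate(5 minutes)"):
    """Create a scheduled rule, let it invoke the function, and add the target."""
    # Create (or update) the rule that fires on the given schedule.
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]
    # Grant the rule permission to invoke the Lambda function.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    # Point the rule at the Lambda function.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "drift-remediation-lambda", "Arn": function_arn}],
    )
    return rule_arn


# Example usage (requires AWS credentials and the boto3 package):
#   import boto3
#   schedule_function(boto3.client("events"), boto3.client("lambda"),
#                     "<ARN of drift-remediation-demo-lambda>")
```

Running the helper twice with the same rule name would fail at add_permission, because the statement ID already exists; a production version would need to handle that case.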

Sample CloudWatch Events rule

In addition to triggering the Lambda function on a timer, you can optionally use AWS CloudTrail combined with Amazon CloudWatch Events to trigger the Lambda function directly in response to resource modifications.
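
For the event-driven option, the rule needs an event pattern instead of a schedule. The following sketch builds one possible pattern for policy changes on the demo role. The helper name build_iam_change_pattern is hypothetical, and the requestParameters.roleName filter is an assumption about the shape of the CloudTrail record that you should confirm against a real event.

```python
import json


def build_iam_change_pattern(role_name):
    """Return a JSON event pattern matching policy changes on one IAM role."""
    pattern = {
        "source": ["aws.iam"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["iam.amazonaws.com"],
            # IAM API operations that change a role's policies.
            "eventName": [
                "AttachRolePolicy",
                "DetachRolePolicy",
                "PutRolePolicy",
                "DeleteRolePolicy",
            ],
            # Assumed field name in the CloudTrail record for these calls.
            "requestParameters": {"roleName": [role_name]},
        },
    }
    return json.dumps(pattern)


pattern_json = build_iam_change_pattern("AutomaticDriftRemediationRole")
print(pattern_json)
```

The resulting string could be passed as the EventPattern argument when creating the rule. Two caveats apply: IAM is a global service whose CloudTrail events are delivered in the US East (N. Virginia) Region, and the remediation function makes these same API calls, so each repair triggers one more invocation, which should find no drift and exit.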

Results

The following example shows intentional configuration drift applied to the IAM role defined in the CloudFormation stack. Here, the following configurations have been altered:

Added additional AWS managed policies: AdministratorAccess, AmazonCognitoPowerUser
Removed AWS managed policy: AmazonEC2ReadOnlyAccess
Added additional resource to policy statement in AutomaticDriftRemediationPolicyTwo

IAM role with modifications

These changes represent significant, and potentially dangerous, deviations from the configurations outlined in the stack template. After the Lambda function is invoked by CloudWatch, the resources are returned to compliance.

IAM role returned to compliance

You can check the CloudWatch Log Group of the Lambda function to see the operations performed and logged during stack remediation.

Logs from drift remediation process

Cleanup

To avoid additional infrastructure costs associated with the example in this post, be sure to delete all CloudWatch Events rules, CloudFormation stacks, and Lambda functions.
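
The cleanup can also be scripted. The following sketch assumes the three clients behave like the boto3 events, lambda, and cloudformation clients. The helper name clean_up is hypothetical, and the rule name and target ID must match whatever you used when you created the rule.

```python
def clean_up(events_client, lambda_client, cf_client,
             rule_name, target_id, function_name, stack_name):
    """Delete the schedule rule, the Lambda function, and the demo stack."""
    # A rule cannot be deleted while it still has targets.
    events_client.remove_targets(Rule=rule_name, Ids=[target_id])
    events_client.delete_rule(Name=rule_name)
    lambda_client.delete_function(FunctionName=function_name)
    # Deleting the stack also deletes the IAM role it created.
    cf_client.delete_stack(StackName=stack_name)


# Example usage (requires AWS credentials and the boto3 package):
#   import boto3
#   clean_up(boto3.client("events"), boto3.client("lambda"),
#            boto3.client("cloudformation"),
#            rule_name="<your rule name>", target_id="<your target ID>",
#            function_name="drift-remediation-demo-lambda",
#            stack_name="drift-remediation-demo")
```

The Lambda execution role and the function's CloudWatch log group are not covered by this sketch and would need to be removed separately.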

Conclusion

In this post, you saw how Lambda functions can be used with CloudWatch and AWS CloudFormation to implement automatic drift remediation for resources in a CloudFormation stack. The example in this post focuses on IAM policies. However, this pattern can be extended to any type of resource supported by AWS CloudFormation drift detection. Additional resources can be supported by expanding the Lambda function with additional remediation logic. For organizations seeking to implement automatic drift remediation, this solution can offer the following benefits:

Prevent complications with stack operations that result from stack drift.
Automatically detect stack drift for mission-critical resources and return them to compliance on the next scheduled invocation.
Provide a familiar and scalable operating environment (that is, Lambda) in which developers can implement customizable drift remediation logic.

Author Bio

Bryant Bost is an Application Consultant for AWS Professional Services based out of Washington, DC. As a consultant, he supports customers with architecting, developing, and operating new applications, as well as migrating existing applications to AWS. In addition to web application development, Bryant specializes in serverless and container architectures, and has authored several posts on these topics.

AWS Cloud Operations & Migrations Blog