Managing Your AWS Resources Through a Serverless Policy Engine

Stephen Liedig, Solutions Architect

Customers are using AWS Lambda in new and interesting ways every day, from data processing of Amazon S3 objects, Amazon DynamoDB streams, and Amazon Kinesis triggers, to providing back-end processing logic for Amazon API Gateway.

In this post, I explore ways in which you can use Lambda as a policy engine to manage your AWS infrastructure. Lambda’s ability to react to platform events makes it an ideal solution for handling changes to your AWS resource state and enforcing organizational policy.

With support for a growing number of triggers, Lambda provides a lightweight, customizable, and cost-effective solution to do things like:

Shut down idle resources or schedule regular shutdowns during nights, weekends, and public holidays
Clean up snapshots older than 6 months
Execute regular patching/server maintenance by automating execution of Amazon EC2 Run Command scripts
React to changes in your environment by evaluating AWS Config events
Perform a custom action if resources are created in regions where you do not wish to run workloads

I have created a sample application that demonstrates how to create a Lambda function to verify whether instances launched into a VPC conform to organizational tagging policies.

Tagging policy solution

Tagging policies are important because they help customers manage and control their AWS resources. Many customers use tags to identify the lifespan of a resource or its security or operational context, or to assist with billing and cost tracking by assigning cost center codes to resources and later using them to generate billing reports. For these reasons, it is not uncommon for customers to take a “hard-line” approach and simply terminate or isolate compute resources that haven’t been tagged appropriately, in order to drive cost efficiencies and maintain integrity in their environments.

The tagging policy example in this post takes a middle-ground approach, in that it applies some decision-making logic based on a collection of policy rules, and then notifies system administrators of the actions taken on an EC2 instance.

A high-level view of the solution looks like this:

The tagging policy function uses an Amazon CloudWatch scheduled event, which allows you to schedule the execution of your Lambda functions using cron or rate expressions, thereby enabling policy control checks at regular intervals on new and existing EC2 resources.
Tag policies are pulled from DynamoDB, which provides a fast and extensible solution for storing policy definitions that can be modified independently of the function execution.
The function looks for EC2 instances within a specified VPC and verifies that the tags associated with each instance conform to the policy rules.
If required, missing information, such as user name of the IAM user who launched the instance, is retrieved from AWS CloudTrail.
A summary notification of actions undertaken is pushed to an Amazon SNS topic to notify administrators of the policy violations and actions performed.

Note that, while I have chosen to demonstrate the CloudWatch scheduled event trigger to invoke the Lambda function, there are a number of other ways in which you could trigger a tagging policy function. Using AWS CloudTrail or AWS Config, for example, you could filter events of type ‘RunInstances’ or create a custom Config rule to determine whether newly created EC2 resources match your tagging policies.
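As a rough sketch of the event-driven alternative, assume a CloudWatch Events rule that matches EC2 API calls recorded by CloudTrail. The function below pulls the new instance IDs out of a RunInstances event so that only those instances need to be checked. The pattern constant and the function name are illustrative and not part of the sample application. The field layout (responseElements, instancesSet, items) reflects the usual shape of a RunInstances record, so verify it against events from your own account.

```python
# Illustrative event pattern for a rule that fires on EC2 RunInstances calls
# recorded by CloudTrail.
RUN_INSTANCES_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventName": ["RunInstances"]},
}


def launched_instance_ids(event):
    """Return the instance IDs in a RunInstances event, or [] if there are none."""
    detail = event.get("detail") or {}
    if detail.get("eventName") != "RunInstances":
        return []
    # responseElements is null when the API call failed.
    response = detail.get("responseElements") or {}
    items = (response.get("instancesSet") or {}).get("items") or []
    return [item["instanceId"] for item in items if "instanceId" in item]
```

With this approach the function reacts to each launch instead of rescanning the VPC, at the cost of not catching tags that are removed later.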

Define the policies

This walkthrough uses DynamoDB to store the policies for each of the tags. DynamoDB provides a scalable data store with single-digit millisecond latency. It supports both document and key-value data models, which allows me to extend and evolve my policy model easily over time. Given the nature and size of the data, DynamoDB is also a cost-effective option compared with a relational database solution. The table you create for this example is straightforward, using a single HASH key to identify the rule.

CLI

Use the following AWS CLI command to create the table:

aws dynamodb create-table --table-name acme_cloud_policy_tagging_def --key-schema AttributeName=RuleId,KeyType=HASH --attribute-definitions AttributeName=RuleId,AttributeType=N  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

The sample policy items have been extended with additional attributes:

TagKey
Action
Required
Default

These attributes help build a list of policy definitions for each tag, along with the behavior that your function should implement if a tag is missing or has no value assigned to it.

The following items have been added to the tagging policy table:

RuleId (N)	TagKey (S)	Action (S)	Required (S)	Default (S)
1	ProjectCode	Update	Y	Proj007
2	CreatedBy	UserLookup	Y
3	Expires	Function	N	today()
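One way to load these items is with a short boto3 script. The sketch below mirrors the three rows in the table above; the names SAMPLE_POLICIES and seed_policies are illustrative, and the table argument is assumed to be a boto3 DynamoDB Table resource for acme_cloud_policy_tagging_def.

```python
# The three sample rules from the table above. Rule 2 has no Default value.
SAMPLE_POLICIES = [
    {'RuleId': 1, 'TagKey': 'ProjectCode', 'Action': 'Update',
     'Required': 'Y', 'Default': 'Proj007'},
    {'RuleId': 2, 'TagKey': 'CreatedBy', 'Action': 'UserLookup',
     'Required': 'Y'},
    {'RuleId': 3, 'TagKey': 'Expires', 'Action': 'Function',
     'Required': 'N', 'Default': 'today()'},
]


def seed_policies(table, policies=SAMPLE_POLICIES):
    """Write each policy item to the given DynamoDB Table resource."""
    for item in policies:
        table.put_item(Item=item)
    return len(policies)
```

Calling seed_policies(boto3.resource('dynamodb').Table('acme_cloud_policy_tagging_def')) would write the three rules. Because RuleId is the HASH key, running it again overwrites the same items.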

In this example, the default behavior for instances launched into the VPC with no tags is to terminate them immediately. This action may not be appropriate for all scenarios, and could be enhanced by stopping the instance (rather than terminating it) and notifying the resource owners that further action is required.

The Update action creates the tag and sets its default value if a required tag is missing, or sets the default value if the tag key is present but has no value.

The UserLookup action in this case searches CloudTrail logs for the IAM user that launched the EC2 instance, and sets the value if it is missing.

Now that the policies have been defined, take a closer look at the actual Lambda function implementation.

Set up the trigger

Before you create the Lambda function that executes the tagging policy, create a trigger that runs the function automatically on a schedule, defined either as a fixed rate expression such as “rate(1 hour)” or as a cron expression.
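If you prefer to script the trigger, a minimal sketch with boto3 might look like the following. The function name, rule name, and target Id are placeholders of my own. The events_client and lambda_client arguments are assumed to be boto3.client('events') and boto3.client('lambda'), and function_arn is the ARN of the Lambda function once it exists.

```python
def schedule_policy_check(events_client, lambda_client, function_arn,
                          rule_name='tagging-policy-schedule',
                          schedule='rate(1 hour)'):
    """Create a scheduled rule and point it at the Lambda function."""
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State='ENABLED',
    )['RuleArn']

    # Allow CloudWatch Events to invoke the function, for this rule only.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId='{0}-invoke'.format(rule_name),
        Action='lambda:InvokeFunction',
        Principal='events.amazonaws.com',
        SourceArn=rule_arn,
    )

    # Register the function as the rule's target.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{'Id': 'tagging-policy-function', 'Arn': function_arn}],
    )
    return rule_arn
```

A cron expression such as 'cron(0 18 ? * MON-FRI *)' can be passed as the schedule argument instead of a rate expression.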

After it’s configured, the resulting event looks something like this:

{
  "account": "123456789012",
  "region": "ap-southeast-2",
  "detail": {},
  "detail-type": "Scheduled Event",
  "source": "aws.events",
  "time": "1970-01-01T00:00:00Z",
  "id": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c",
  "resources": [
    "arn:aws:events:ap-southeast-2:123456789012:rule/my-schedule"
  ]
}

Create the Lambda execution policy

The next thing you need to do is define the IAM role under which this Lambda function executes. In addition to the CloudWatch Logs permissions that enable logging for the function, the role needs ec2:DescribeInstances to find tag information for the instances in your environment, and ec2:CreateTags and ec2:TerminateInstances to act on instances that violate the policy. It also needs permission to look up events in CloudTrail, to read policy definitions from the DynamoDB table, and to publish the policy reports via Amazon SNS. Working on the basis of least privilege, the IAM role policy looks something like the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StmtReadOnlyDynamoDB",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:GetItem",
                "dynamodb:Scan"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:dynamodb:ap-southeast-2:123456789012:table/acme_cloud_policy_tagging_def"
        },
        {
            "Sid": "StmtLookupCloudTrailEvents",
            "Action": [
                "cloudtrail:LookupEvents"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "StmtLambdaCloudWatchLogs",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Sid": "StmtPublishSnsNotifications",
            "Action": [
                "sns:Publish"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sns:ap-southeast-2:123456789012:acme_cloud_policy_notifications"
        },
        {
            "Sid": "StmtDescribeEC2",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:CreateTags",
                "ec2:TerminateInstances"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

Create the Lambda function

For this example, you create a Python function. The function itself is broken into a number of subroutines, each performing a specific function in the policy execution.

AWS Lambda function handler

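import boto3
from datetime import datetime

# Collects one line per action taken; summarized in the SNS notification.
report_items = []

def lambda_handler(event, context):
    print('Beginning Policy check.')
    # Module-level state survives warm invocations, so reset the report first.
    del report_items[:]
    policies = get_policy_definitions()
    for instance in find_instances('vpc-abc123c1'):
        validate_instance_tags(instance, policies)
    if len(report_items) > 0:
        send_notification()
    print('Policy check complete.')
    return 'OK'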

The Lambda function orchestrates the policy logic in the following way:

Load the policy rules from the DynamoDB table:

def get_policy_definitions():
    dynamodb = boto3.resource('dynamodb')
    policy_table = dynamodb.Table('acme_cloud_policy_tagging_def')
    response = policy_table.scan()
    policies = response['Items']
    return policies
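A single Scan call returns at most 1 MB of data, which is plenty for a handful of rules. If the policy table grows, the scan has to follow LastEvaluatedKey to read the remaining pages. The helper below is a sketch of that loop. The name scan_all_items is my own, and the table argument is assumed to be the same boto3 Table resource used above.

```python
def scan_all_items(table):
    """Return every item in the table, following scan pagination."""
    items = []
    kwargs = {}
    while True:
        response = table.scan(**kwargs)
        items.extend(response.get('Items', []))
        last_key = response.get('LastEvaluatedKey')
        if not last_key:
            return items
        # Resume the scan from where the previous page stopped.
        kwargs['ExclusiveStartKey'] = last_key
```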

Find the tags for all EC2 instances within a specified VPC. Note that this rule processing revalidates every instance; this is to ensure that no changes have been made to instance tagging after the last policy execution. For simplicity, the VPC ID has been hard-coded into the function. In a production scenario, you would look this value up:

def find_instances(vpc_id):
    ec2 = boto3.resource('ec2')
    vpc = ec2.Vpc('%s' % vpc_id)
    return list(vpc.instances.all())
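One simple way to avoid the hard-coded VPC ID is to read it from the function's configuration. The sketch below assumes an environment variable named POLICY_VPC_ID, which is a name I have made up for illustration. A DynamoDB configuration item would work equally well.

```python
import os


def get_target_vpc_id(environ=None):
    """Read the VPC ID to check from the POLICY_VPC_ID environment variable."""
    environ = os.environ if environ is None else environ
    vpc_id = environ.get('POLICY_VPC_ID', '').strip()
    if not vpc_id:
        raise ValueError('POLICY_VPC_ID is not set.')
    return vpc_id
```

The handler would then call find_instances(get_target_vpc_id()) instead of passing a literal value.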

After you have all the instances in the VPC, apply the policies:

def validate_instance_tags(instance, policies):
    print(u'Validating tags for instance: {0:s} '.format(instance.id))
    tags = instance.tags
    if tags is None:
        instance.terminate()
        report_items.append(u'{0:s} has been terminated. Reason: No tags found.'.format(instance.id))
        return

    for p in policies:
        policy_key = p['TagKey']
        policy_action = p['Action']
        if 'Default' in p:
            policy_default_value = p['Default']
        else:
            policy_default_value = ''
        if not policy_key_exists(tags, policy_key):
            print(u'Instance {0:s} is missing tag {1:s}. Applying policy.'.format(instance.id, policy_key))

            if policy_action == 'Update':
                instance.create_tags(Tags=[{'Key': policy_key, 'Value': policy_default_value}])
                report_items.append(u'Instance {0:s} missing tag {1:s}. New tag created.'.format(instance.id, policy_key))

            elif policy_action == 'UserLookup':
                try:
                    user_id = find_who_launched_instance(instance.id)
                    report_items.append(u'Instance {0:s} missing tag {1:s}. User name set.'.format(instance.id, policy_key))
                except Exception:
                    # The lookup failed, so fall back to a placeholder value.
                    user_id = "Undefined"
                    report_items.append(u'Instance {0:s} missing tag {1:s}. User name set to Undefined.'.format(instance.id, policy_key))

                instance.create_tags(Tags=[{'Key': policy_key, 'Value': user_id}])

            elif policy_action == 'Function':
                if policy_default_value == 'today()':
                    instance.create_tags(Tags=[{'Key': policy_key, 'Value': str(datetime.now().date())}])
                    report_items.append(u'Instance {0:s} missing tag {1:s}. New tag created.'.format(instance.id, policy_key))
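The validation code calls policy_key_exists, which is not listed in this post. A minimal sketch of what such a helper might look like follows. It assumes that instance.tags is the list of {'Key': ..., 'Value': ...} dictionaries that boto3 returns, and that a tag with an empty value should be treated the same as a missing tag, in line with the rule descriptions earlier.

```python
def policy_key_exists(tags, policy_key):
    """Return True if a tag with this key exists and has a non-empty value."""
    for tag in tags or []:
        if tag.get('Key') == policy_key:
            return bool((tag.get('Value') or '').strip())
    return False
```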

The CreatedBy tag rule is defined as UserLookup, meaning that if the tag is missing or empty, you search the CloudTrail logs to determine the IAM user that launched the specified instance. If the IAM user name is found, it is set as the tag value on the instance:

def find_who_launched_instance(instance_id):
    cloudtrail = boto3.client('cloudtrail')
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {
                'AttributeKey': 'EventName',
                'AttributeValue': 'RunInstances'
            }
        ],
        StartTime=datetime(2016, 6, 4),
        EndTime=datetime.now(),
        MaxResults=50
    )

    events_list = response['Events']
    for event in events_list:
        resources = event['Resources']
        for resource in resources:
            if (resource['ResourceType'] == 'AWS::EC2::Instance') and (resource['ResourceName'] == instance_id):
                return event['Username']

    # Only give up after every returned event has been checked.
    raise Exception("Unable to determine IAM user that launched instance.")
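The lookup above fetches the most recent RunInstances events and searches them for the instance, so a busy account can push the relevant event out of the first 50 results. An alternative sketch is to ask CloudTrail for events that reference the instance directly and then pick out the RunInstances event. The function name, the days parameter, and the choice to return None when nothing is found are my own assumptions. The sketch reads only the first page of results, and the cloudtrail_client argument is assumed to be boto3.client('cloudtrail').

```python
from datetime import datetime, timedelta, timezone


def find_launch_user(cloudtrail_client, instance_id, days=7):
    """Return the user name on the RunInstances event for instance_id, or None."""
    end_time = datetime.now(timezone.utc)
    response = cloudtrail_client.lookup_events(
        # lookup_events accepts a single lookup attribute, so filter on the
        # resource here and on the event name below.
        LookupAttributes=[
            {'AttributeKey': 'ResourceName', 'AttributeValue': instance_id}
        ],
        StartTime=end_time - timedelta(days=days),
        EndTime=end_time,
    )
    for event in response.get('Events', []):
        if event.get('EventName') == 'RunInstances':
            return event.get('Username')
    return None
```

Using a relative time window instead of a fixed start date keeps the lookup within the period of event history that CloudTrail retains for lookups.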

Finally, after all the policy rules have been applied to the instances in your VPC, send an Amazon SNS notification, to which your system administrators have been subscribed, to inform them of any policy violations and the actions taken by the Lambda function:

from botocore.exceptions import ClientError

def send_notification():
    print("Sending notification.")

    topic_arn = 'arn:aws:sns:ap-southeast-2:123456789012:acme_cloud_policy_notifications'

    message = 'The following tagging policy violations occurred:\n'

    for ri in report_items:
        message += '-- {0:s} \n'.format(ri)

    try:
        sns = boto3.client('sns')
        sns.publish(TopicArn=topic_arn,
                    Subject='ACME Cloud Tagging Policy Report',
                    Message=message)
    except ClientError as ex:
        # Re-raise with the service error text so it appears in the Lambda logs.
        raise Exception(str(ex))

The report generated by the policy engine is delivered by email to the topic's subscribers. The format of the notification is, of course, customizable and can contain as much or as little information as needed. These notifications can also act as triggers themselves, allowing you to link policies.

Summary

As I have demonstrated, using Lambda as a policy engine to manage your AWS resources and to maintain the operational integrity of your environment is an extremely lightweight, powerful, and customizable solution.

Policies can be composed in a number of ways, and integrating them with various triggers provides an ideal mechanism for creating a secure, automated, proactive, event-driven infrastructure across all your regions. And given that the first 1 million requests per month are free, you’d be able to manage a significant portion of your infrastructure for little or no cost.

Furthermore, the concepts presented in this post aren’t specific to managing your infrastructure; they can just as easily be applied to a security context. Monitoring changes in your security groups or network ACLs through services like AWS Config allows you to proactively take action on unauthorized changes in your environment.

If you have questions or suggestions, please comment below.

AWS Compute Blog