Operating serverless at scale: Implementing governance – Part 1

This post is written by Jerome Van Der Linden, Solutions Architect.

With serverless services, infrastructure management tasks like capacity provisioning and patching are handled by AWS, so you can focus on writing code and delivering value to your customers. With less operational overhead, developers can iterate faster and release new features more often.

But with increased agility and productivity, you must also keep control. When scaling to thousands of AWS Lambda functions, hundreds of AWS Step Functions workflows, and millions of Amazon EventBridge events sent throughout the company, you must maintain visibility. What is provisioned, what is running, and how does everything work as a whole?

This three-part series covers important topics to help you maintain control over a growing set of serverless resources.

Maintaining visibility on resources and workloads

For governance, the first recommendation is to have clear visibility of your environment:

- Visibility on resources: APIs, Lambda functions, state machines, event buses, queues, or topics. It is essential to have an up-to-date inventory of all your resources, together with metadata such as the application each one belongs to, the environment where it is deployed, and its owner. This is needed to track cost, manage compliance, and evaluate risks.
- Visibility on how these resources are linked together. They may be components of the same application. You must track who is calling whom (in synchronous calls) and who is consuming which message or event from whom (in asynchronous calls). This dynamic view is as necessary as the inventory. It gives you important insight into the architecture and into potential security and compliance issues.

This visibility into your AWS environment is essential to understand what you do with it and be able to operate it. It allows you to understand your workloads and make sure they follow your compliance rules. It allows you to track your usage of AWS and potentially optimize and reduce your costs. This is a best practice for building and growing on AWS.

Tagging your resources

Assign tags to all your resources on AWS. A tag consists of a label (the key) and an optional value. Tags make it easier to organize, search for, and filter resources by application, environment, or other criteria. They can serve different purposes: automation, access control, cost allocation, and risk management. Above all, they provide an additional layer of information you can use to understand why each resource exists.

Assigning tags during provisioning is the preferred method. Using the AWS Serverless Application Model (AWS SAM), you can define tags. For example, for a Lambda function:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: function/
      Handler: app.lambda_handler
      Runtime: python3.8
      Tags:
        mycompany:environment: "dev"
        mycompany:application-id: "ecommerce"
        mycompany:service-id: "products"
        mycompany:business-owner: "john.smith@mycompany.com"

As the number of resources increases in a template, you can use the --tags option of the sam deploy command. This applies a set of tags to all compatible resources declared in the template:

sam deploy \
--tags mycompany:environment=dev \
mycompany:application-id=ecommerce \
mycompany:service-id=products \
mycompany:business-owner=john.smith@mycompany.com

You can also add these tags in the samconfig.toml file so you don’t need to specify them each time on the command line:

tags = "mycompany:environment=dev mycompany:application-id=ecommerce mycompany:service-id=products mycompany:business-owner=john.smith@mycompany.com"
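When the same tag set feeds both the command line and samconfig.toml, generating the string from a single mapping can help keep the two in sync. The following is a minimal sketch: the `format_sam_tags` helper is hypothetical (it is not part of the AWS SAM CLI) and assumes that keys and values contain no spaces.

```python
def format_sam_tags(tags):
    """Render a mapping of tags as the space-separated key=value string
    used above for `sam deploy --tags` and for samconfig.toml."""
    parts = []
    for key, value in tags.items():
        # Quoting rules for values with spaces are out of scope for this sketch.
        if " " in key or " " in value:
            raise ValueError(f"spaces are not handled by this sketch: {key!r}={value!r}")
        parts.append(f"{key}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    company_tags = {
        "mycompany:environment": "dev",
        "mycompany:application-id": "ecommerce",
        "mycompany:service-id": "products",
        "mycompany:business-owner": "john.smith@mycompany.com",
    }
    # Paste the result after --tags or into the tags entry of samconfig.toml.
    print(format_sam_tags(company_tags))
```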

Enforcing consistency in tags

To maintain organization, tags must be consistent. For example, do you use “environment”, “Environment”, or “env”? And is the value “dev”, “Dev”, or “Development”?

1. Define which tags are necessary and what you need to identify: the owner, the application, the environment, the business line, and so on. You can have up to 50 tags per resource, but you should restrict yourself to a set of needed tags and iterate. Apply the YAGNI principle to minimize maintenance.
2. Agree on the syntax. You may use spinal-case (lower case with hyphens to separate words) and a prefix to identify your company. For example, mycompany:application-id. Having a tag dictionary shared with developers and administrators may be useful. See this guide for best practices.
3. Enforce the “rules” that are established:
- At the Organization level, use Service Control Policies to block the creation of a resource if not correctly tagged. The following example denies the creation of Lambda functions when the environment tag is absent. You can find more examples here. Before applying, test policies in a sandbox:
```
{
   "Sid": "DenyCreateLambdaWithNoEnvironmentTag",
   "Effect": "Deny",
   "Action": "lambda:CreateFunction",
   "Resource": [
      "arn:aws:lambda:*:*:function:*"
   ],
   "Condition": {
      "Null": {
         "aws:RequestTag/environment": "true"
      }
   }
}
```

- Use Config Rules in AWS Config to verify that resources have the appropriate tags (see this example for Lambda). You can also perform remediation if necessary.
- Use Tag Policies to enforce consistency across your accounts and resources. Tag Policies verify the syntax and values set on your resources. They mark as non-compliant all the resources that do not match the policies. For example, the following policy defines the tag “environment” and its possible values:
```
{
    "tags": {
        "mycompany:environment": {
            "tag_value": {
                "@@assign": [
                    "dev",
                    "test",
                    "qa",
                    "prod"
                ]
            }
        }
    }
}
```

4. Monitor the percentage of resources untagged or badly tagged and try to improve. Also iterate on your tag dictionary as your requirements evolve by adding new tags or removing unused ones.
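The steps above can be sketched as a small check that runs against an exported inventory. This is an illustration only: the `TAG_DICTIONARY` contents, the key pattern, and the sample inventory are assumptions made for the example, and in practice the inventory would come from a service such as AWS Config or the Resource Groups Tagging API.

```python
import re

# Hypothetical tag dictionary: required keys and, where relevant, allowed values.
TAG_DICTIONARY = {
    "mycompany:environment": {"dev", "test", "qa", "prod"},
    "mycompany:application-id": None,  # None means any non-empty value
    "mycompany:service-id": None,
    "mycompany:business-owner": None,
}

# Company prefix followed by a spinal-case name, e.g. mycompany:application-id.
KEY_PATTERN = re.compile(r"^mycompany:[a-z0-9]+(-[a-z0-9]+)*$")


def find_tag_issues(tags):
    """Return a list of human-readable problems for one resource's tags."""
    issues = []
    for key in tags:
        if not KEY_PATTERN.match(key):
            issues.append(f"key '{key}' does not follow the naming convention")
    for key, allowed in TAG_DICTIONARY.items():
        value = tags.get(key)
        if not value:
            issues.append(f"missing required tag '{key}'")
        elif allowed is not None and value not in allowed:
            issues.append(f"value '{value}' is not allowed for '{key}'")
    return issues


def compliance_percentage(resources):
    """resources maps a resource identifier to its tag mapping."""
    if not resources:
        return 100.0
    compliant = sum(1 for tags in resources.values() if not find_tag_issues(tags))
    return 100.0 * compliant / len(resources)


if __name__ == "__main__":
    inventory = {
        "products-function": {
            "mycompany:environment": "dev",
            "mycompany:application-id": "ecommerce",
            "mycompany:service-id": "products",
            "mycompany:business-owner": "john.smith@mycompany.com",
        },
        "legacy-function": {"Environment": "Dev"},
    }
    for name, tags in inventory.items():
        print(name, find_tag_issues(tags))
    print(f"{compliance_percentage(inventory):.1f}% of resources are correctly tagged")
```

Tracking the percentage over time gives a simple indicator of whether tagging discipline is improving as the tag dictionary evolves.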

Applying tags on serverless resources

For serverless resources, there are additional tags you may add:

- For an API or a microservice, you may want to know if it is public (B2C), semi-public (B2B), or private (internal). This can help you adjust the recovery time objective (RTO) and the level of support. You can add a tag “exposition” with a value “public” or “private” to your API Gateway and Lambda functions.
- For an API or a microservice again, you may want to know the route that traffic comes from. For a Lambda function, use the tag “route” with a value like “POST /products” or “GET /products/_id_” (“{” and “}” are not valid characters in tags).
- For Lambda functions, add sources (triggers) and destinations within tags. For example, add the SNS topic name or SQS queue name to a “trigger” or “destination” tag. This helps document the dependencies between resources and gives a better view of the architecture.

Adding more tags increases maintenance, and unmaintained tags can be misleading. Add tags only if you really need them and can automate their maintenance, for example with AWS CloudFormation or AWS SAM, or with scripts or scheduled Lambda functions (using propagate-cfn-tags, for example).
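As one example of such automation, a “route” tag value can be derived from the API definition rather than typed by hand. The sketch below is an assumption-laden illustration: `route_to_tag_value` is a hypothetical helper, and it conservatively keeps only ASCII letters, digits, spaces, and the characters + - = . _ : / @, which are generally accepted in tag values.

```python
import re

# Conservative allow-list: ASCII letters, digits, spaces and + - = . _ : / @
_INVALID_TAG_CHARS = re.compile(r"[^A-Za-z0-9 +\-=._:/@]")
MAX_TAG_VALUE_LENGTH = 256


def route_to_tag_value(method, path):
    """Turn an HTTP method and path into a value usable in a 'route' tag."""
    # Path parameters such as {id} become _id_ because braces are not valid.
    path = re.sub(r"\{([^}]*)\}", r"_\1_", path)
    value = f"{method.upper()} {path}"
    # Replace anything else outside the allow-list with an underscore.
    value = _INVALID_TAG_CHARS.sub("_", value)
    return value[:MAX_TAG_VALUE_LENGTH]


if __name__ == "__main__":
    print(route_to_tag_value("post", "/products"))
    print(route_to_tag_value("get", "/products/{id}"))
```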

Grouping related resources

In addition to tags, use resource groups to organize related resources. For example, you can group the Lambda functions and APIs of a microservice or, more broadly, all the components of the same application.

If you already created tagged resources, you can create a resource group based on these tags, using the following command. This example groups all supported resources that are tagged with the tag “mycompany:service-id” and value “products”:

aws resource-groups create-group \
--name products-service \
--resource-query '{"Type":"TAG_FILTERS_1_0","Query":"{\"ResourceTypeFilters\":[\"AWS::AllSupported\"],\"TagFilters\":[{\"Key\":\"mycompany:service-id\",\"Values\":[\"products\"]}]}"}'
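The nested, escaped JSON in the --resource-query argument is easy to get wrong by hand. A minimal sketch of generating it follows; the `build_tag_query` helper is hypothetical, and it relies only on the structure shown in the command above, where the inner query is itself a JSON-encoded string.

```python
import json


def build_tag_query(tag_key, tag_values, resource_types=("AWS::AllSupported",)):
    """Build the value for --resource-query for a tag-based resource group."""
    inner_query = {
        "ResourceTypeFilters": list(resource_types),
        "TagFilters": [{"Key": tag_key, "Values": list(tag_values)}],
    }
    # The Query field is a JSON document serialized as a string,
    # so it is encoded once here and again as part of the outer object.
    return json.dumps({"Type": "TAG_FILTERS_1_0", "Query": json.dumps(inner_query)})


if __name__ == "__main__":
    # The printed string can be passed, single-quoted, to --resource-query.
    print(build_tag_query("mycompany:service-id", ["products"]))
```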

If you use infrastructure as code (AWS CloudFormation or AWS SAM), which is the recommended approach, you can have a resource group created for a complete stack. This is convenient for serverless applications with a reasonable number of related resources:

Resources:
  ResourceGroup:
    Type: AWS::ResourceGroups::Group
    Properties:
      Name: products-service

With a group defined, visualize the list of its resources in the AWS Management Console or using the following CLI command:

aws resource-groups list-group-resources --group products-service

For more details on resource groups, read this blog post.

Getting a dynamic view of your resources

Tags and resource groups give you a static and declarative view of your resources and their relations. To know what is running, and how everything is linked, it is also important to have a dynamic view. A dynamic view is the best representation of your environment, but because it is often maintained by hand as documentation and rarely automated, it quickly becomes inaccurate and outdated.

Using AWS X-Ray, you can trace requests made from one resource to another, and thus have a complete map of interactions between components of your application. Combine it with Amazon CloudWatch Logs and metrics and you have Amazon CloudWatch ServiceLens.

The primary use case for ServiceLens is to help you debug and troubleshoot performance issues. But you can also use the Service Map to gain visibility into your applications. What are the components that make up your application? What are the transactions and dependencies between those components?

To obtain such a map, you must enable X-Ray tracing in your application for all supported resources. This can be done in the AWS SAM template by enabling “tracing”:

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: function/
      Handler: app.lambda_handler
      Runtime: python3.8
      Tracing: Active
      
  MyApi:
    Type: AWS::Serverless::Api
    Properties:
      DefinitionBody:
        Fn::Transform:
          Name: "AWS::Include"
          Parameters:
            Location: "resources/openapi.yaml"
      EndpointConfiguration: REGIONAL
      StageName: prod
      TracingEnabled: true
      
  MyStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: statemachine/my_state_machine.asl.json
      Role: arn:aws:iam::123456123456:role/service-role/my-sample-role
      Tracing:
        Enabled: true

You may also need to instrument your Lambda function code using the X-Ray SDK, to retrieve and propagate traces when using services like Amazon SNS, Amazon SQS or Amazon EventBridge.
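The X-Ray SDK takes care of reading and forwarding the trace context for you. Purely to illustrate what is being retrieved and propagated, the sketch below parses a trace header of the form that Lambda exposes in the `_X_AMZN_TRACE_ID` environment variable. The helper names are hypothetical, the sample header value is illustrative, and this is not a substitute for the SDK.

```python
import os


def parse_trace_header(header):
    """Split a trace header such as 'Root=...;Parent=...;Sampled=1' into fields."""
    fields = {}
    for part in header.split(";"):
        key, separator, value = part.strip().partition("=")
        if separator:  # skip empty or malformed segments
            fields[key] = value
    return fields


def current_trace_context():
    """Read the trace context that Lambda sets for the current invocation."""
    return parse_trace_header(os.environ.get("_X_AMZN_TRACE_ID", ""))


if __name__ == "__main__":
    example = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
    print(parse_trace_header(example))
    # Outside Lambda the variable is normally unset, so this prints an empty dict.
    print(current_trace_context())
```

Propagating a trace amounts to passing this context along with the outgoing message or request so that the downstream component can attach its segments to the same root.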

When this is done, open the CloudWatch ServiceLens console to get the dynamic view of all your components. See their respective size (based on the number of requests they handle), their relations, and the services they use. As it is based on the real execution of your application, it is always up to date.

Conclusion

Having visibility on your AWS resources is the key to operating and growing successfully. In this first part of the series on serverless governance, I describe how you can get this visibility by using tags to organize and group your resources, which eases the search for and management of related resources. I also describe how AWS X-Ray, combined with CloudWatch ServiceLens, can provide a dynamic view of workloads and help you understand how serverless resources interact.

The second part will focus on provisioning and how to standardize deployments to improve consistency and compliance.

Read this whitepaper on tagging best practices to get more details on tags.

For more serverless learning resources, visit Serverless Land.

AWS Compute Blog