AWS Cloud Operations & Migrations Blog

Using Tag-Based Filtering to Manage AWS Health Monitoring and Alerting at Scale

AWS sends customers regular service notifications and notices of planned activities by e-mail, addressed to the root account owner or to the operational, security, and billing contacts. AWS also provides granular notifications through AWS Health, which lets customers fine-tune alerts to the issues that directly affect them. Alongside the monitoring capabilities of the AWS Health Dashboard, customers can use the API it is built on, the AWS Health API. With the AWS Health API, customers can collect all the notifications that affect their resources and customize those notifications to suit their business needs. An example of the AWS Health API in action is the AWS Health Aware framework, which allows customers to collect notifications and send them to multiple communication channels such as e-mail, Slack, and Amazon EventBridge.

AWS customers often have multiple accounts spanning their organization. Each of these accounts may generate alerts based on the AWS Health events, and as customers scale their organization to hundreds of accounts, it becomes important to redirect those alerts to the appropriate teams at scale.

In this blog post, we provide guidance on a framework for alerting on AWS infrastructure health events using AWS Health, and show how to fine-tune notifications to fit into your existing workflows. With the right tagging strategy, you can identify resources and direct alerts to the appropriate areas of responsibility. A similar framework is currently in production at Zoom Video Communications. According to Yasin Mohammed (Manager of Cloud Operations), “Setting up a mechanism to automatically direct AWS Health notifications using tag-based filtering on AWS resources has helped Zoom streamline its health monitoring and alerting mechanism across accounts and Regions”.

Pre-requisites

  1. Customers who want to take full advantage of the AWS Health API should first ensure they are enrolled in Business or Enterprise Support. Once enrolled, customers can write code that queries the AWS Health API, allowing them to customize AWS Health alerts. Customers who wish to deploy this solution across their organization to collect AWS Health alerts can use the AWS Health Aware framework, a free and open-source framework that integrates those alerts with EventBridge, SNS, e-mail, and other channels.
  2. Intermediate understanding of IAM permissions for Amazon Elastic Compute Cloud (EC2), Amazon EventBridge, and AWS Lambda
  3. Intermediate understanding of Amazon EC2 and Amazon EventBridge APIs
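
As a rough illustration of the first prerequisite, querying the AWS Health API from Python might look like the sketch below. It assumes a boto3 `health` client created in an account with an eligible support plan; the helper name `list_open_ec2_events` and the choice of filter values are ours, not part of the original solution.

```python
def list_open_ec2_events(health_client):
    """Return open or upcoming AWS Health events for EC2 (hypothetical helper).

    health_client is expected to behave like boto3.client("health").
    """
    event_filter = {
        "services": ["EC2"],
        "eventStatusCodes": ["open", "upcoming"],
    }
    kwargs = {"filter": event_filter}
    events = []
    while True:
        page = health_client.describe_events(**kwargs)
        events.extend(page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return events
        # Follow the pagination token until the API stops returning one.
        kwargs["nextToken"] = token
```

A call such as `list_open_ec2_events(boto3.client("health", region_name="us-east-1"))` would then return a list of event summaries. The client is passed in as a parameter so the helper can be exercised with a stub before it touches a real account.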

Solution Architecture

A common scenario for many enterprise customers is that they have many accounts for different business units, managed by different operations or development teams. These teams may also be organized around specific resources across accounts, reflecting areas of expertise or responsibility such as databases or security. You may find it necessary, for example, to notify the operations team of an upcoming change to some of their resources at the account level, or to alert a database administrator about notifications for specific databases.

In such scenarios, it is important to tag AWS resources according to AWS tagging best practices. Once resources are tagged, you can direct AWS Health alerts on the basis of tag information. When an AWS Health event is generated, it is sent to Amazon EventBridge. EventBridge rules can trigger Lambda functions that fine-tune alerts by obtaining tag information from the relevant AWS resources. The Lambda functions can also enrich the AWS Health event, for example by adding the resource environment or team name. You can create dedicated custom event buses to notify separate groups or teams. The Lambda function sends the enriched AWS Health event to a custom event bus, which delivers the message to Amazon SNS to notify the right people or applications. Note that AWS Health delivers events on a best-effort basis; delivery to EventBridge is not guaranteed. This framework also supports AWS Health Aware, so you can deploy it throughout your organization and ensure that the appropriate teams are alerted in a timely manner about the resources they are responsible for, using their preferred notification methods.

The figure shows architecture diagram for solution framework with health events filtered, enriched and directed using AWS Lambda

Figure 1: Solution Architecture

Example Use Case

In our example, we set up alerting for an EC2 instance in our DEV environment. We capture the environment of the EC2 instance using the environment tag. We also specify a dedicated event bus using the customEventBus tag. This dedicated custom event bus notifies the DEV environment admins through an SNS topic.

Figure shows EC2 instance with AWS Tags as custom event bus, environment and name

Figure 2: EC2 instance tags
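
The same tags can also be applied programmatically. The following is a minimal sketch that assumes a boto3 `ec2` client; the helper name `tag_instance` and the example instance ID are placeholders we introduce for illustration, while the tag keys follow the example above.

```python
def tag_instance(ec2_client, instance_id, environment, event_bus_name):
    """Attach the routing tags used in this walkthrough to one EC2 instance.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    tags = [
        {"Key": "environment", "Value": environment},
        {"Key": "customEventBus", "Value": event_bus_name},
    ]
    ec2_client.create_tags(Resources=[instance_id], Tags=tags)
    return tags
```

For instance, `tag_instance(boto3.client("ec2"), "i-0123456789abcdef0", "DEV", "health-events")` would tag a (hypothetical) instance so that its health events are routed to the bus created later in Step 2.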

In addition to EC2 instances, you can tag almost any AWS resource, such as AWS accounts and Amazon RDS resources. If you are using AWS Organizations, you can enforce tag policies on AWS resources to ensure your teams follow operational best practices.

Once the EC2 instance is tagged, we use EventBridge to receive AWS Health events for the instance. We deploy a Lambda function triggered by an EventBridge rule to inspect the JSON payload of an AWS Health event, enrich the AWS Health event payload with EC2 environment information, and redirect the alert to the custom event bus. The dedicated custom event bus will deliver the alert to the right channel using SNS.

Use Case Walkthrough

Step 1: Create an SNS topic that will alert the infrastructure team

  1. Navigate to the Amazon SNS console.
  2. From the left-hand panel, choose Topics, and then select Create Topic from the right-hand panel.
  3. On the Create Topic page, under Type, choose Standard and enter a Name, such as “health-alerts”.
  4. Keep the rest of the configuration as is and select Create Topic.
  5. Create a subscription for your e-mail address and confirm it using the confirmation message sent to your inbox.
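
If you prefer to script Step 1, a sketch along the following lines could work. It assumes a boto3 `sns` client; the helper name `create_alert_topic` is hypothetical, and the e-mail address is whatever address should receive the alerts.

```python
def create_alert_topic(sns_client, topic_name, email_address):
    """Create a standard SNS topic and subscribe one e-mail address to it.

    sns_client is expected to behave like boto3.client("sns").
    """
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]
    # The subscription stays pending until the recipient confirms it by e-mail.
    sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint=email_address,
    )
    return topic_arn
```

The returned topic ARN is what the EventBridge rule in Step 5 needs as its target.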

Step 2: Create a custom event bus, dedicated to the infrastructure team, that will hold our enriched event

  1. Navigate to the Amazon EventBridge console.
  2. From the left-hand panel, choose Event Buses. From the right-hand panel, choose Create Event bus.
  3. Enter a Name, such as “health-events”, and choose Create.
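
Step 2 can likewise be scripted. This is a minimal sketch assuming a boto3 `events` client; the helper name `create_health_event_bus` is ours.

```python
def create_health_event_bus(events_client, bus_name="health-events"):
    """Create the custom event bus and return its ARN.

    events_client is expected to behave like boto3.client("events").
    """
    response = events_client.create_event_bus(Name=bus_name)
    return response["EventBusArn"]
```

The bus name passed here should match the value of the customEventBus tag on the resources whose alerts belong on this bus.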

Step 3: Create an execution role that lets the Lambda function read from and write to the required services

Before creating the Lambda execution role, create an IAM policy for it that allows your Lambda function to read from and write to the required services:

  1. Navigate to the IAM console and, from the left-hand panel, select Policies.
  2. From the right-hand panel, choose Create Policy.
  3. Add the permissions Lambda needs to call services on your behalf, based on your use case. In our example, in addition to the basic Lambda execution policy (AWSLambdaBasicExecutionRole), the function needs to send events to the custom event bus created in Step 2 and read tags on EC2 instances. Refer to the IAM documentation for the respective services to customize this policy to your needs.
  4. Once you have finished adding permissions, choose Next.
  5. Give your policy a name, such as “EnrichHealthEventsPolicy”, and optionally provide a Description.
  6. Select Create policy.
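
One possible shape for such a policy is sketched below. The exact actions are an assumption on our part, chosen to cover reading EC2 tags and calling put_events on the custom bus; review them against your own use case. The helper names are hypothetical, and the IAM client is assumed to behave like `boto3.client("iam")`.

```python
import json


def build_enrich_policy(event_bus_arn):
    """Return a policy document for the enrichment function (illustrative only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadEc2Tags",
                "Effect": "Allow",
                # EC2 Describe* actions do not support resource-level scoping.
                "Action": ["ec2:DescribeInstances", "ec2:DescribeTags"],
                "Resource": "*",
            },
            {
                "Sid": "PutEnrichedEvents",
                "Effect": "Allow",
                "Action": "events:PutEvents",
                "Resource": event_bus_arn,
            },
        ],
    }


def create_enrich_policy(iam_client, event_bus_arn, name="EnrichHealthEventsPolicy"):
    """Create the managed policy and return its ARN."""
    response = iam_client.create_policy(
        PolicyName=name,
        PolicyDocument=json.dumps(build_enrich_policy(event_bus_arn)),
    )
    return response["Policy"]["Arn"]
```

Scoping events:PutEvents to the ARN of a single bus keeps the function from publishing to other buses in the account.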

Once the IAM policy is set up, create a Lambda execution role that uses it:

  1. Navigate to the IAM console and, from the left-hand panel, select Roles.
  2. From the right-hand panel, select Create Role.
  3. Choose AWS Service. Choose Lambda for Use Case and choose Next.
  4. Select the “EnrichHealthEventsPolicy” you just created, then choose Next.
  5. Give your role a name, such as “EnrichHealthEventsLambdaRole”, and choose Create Role.
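
For reference, the role the console builds in these steps corresponds roughly to the sketch below. It assumes a boto3 `iam` client and the ARN of the custom policy created earlier; the helper name `create_enrich_role` is hypothetical, and attaching the AWS managed AWSLambdaBasicExecutionRole policy here is our assumption about how the logging permissions are granted.

```python
import json

# Trust policy that lets the Lambda service assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policy that grants CloudWatch Logs access to Lambda functions.
BASIC_EXECUTION_POLICY_ARN = (
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
)


def create_enrich_role(iam_client, custom_policy_arn,
                       role_name="EnrichHealthEventsLambdaRole"):
    """Create the execution role, attach both policies, and return the role ARN."""
    role = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
    )
    for policy_arn in (BASIC_EXECUTION_POLICY_ARN, custom_policy_arn):
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role["Role"]["Arn"]
```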

Step 4: Add a Lambda function to get EC2 tags and enrich the AWS Health event

  1. Navigate to the Lambda console.
  2. Select Create Function from the right-hand panel, and then select Author from Scratch.
  3. Give your function a name, such as “EnrichHealthEvent”.
  4. Choose a Runtime (in our example, we use Python).
  5. Select Change Default Execution Role and choose the execution role created in Step 3.
  6. Select Create function. This creates a simple “hello world” function, which you can keep for now in order to proceed to the next steps.
  7. Select Deploy.
  8. Later, you can enhance, iterate on, customize, and test your Lambda function according to your needs.

A test AWS Health Event for AWS_EC2_MAINTENANCE_SCHEDULED for an EC2 instance has the following structure:

Figure shows event schema for health event AWS_EC2_MAINTENANCE_SCHEDULED

Figure 3: AWS Health Event schema
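
For readers without access to the figure, a test event of this kind might be written as the Python dictionary below. It follows the general shape of AWS Health events delivered through EventBridge; every identifier, timestamp, and description in it is a placeholder we made up, so treat the figure and the AWS Health documentation as authoritative.

```python
import json

# Illustrative test event; all IDs, ARNs, and times are placeholders.
SAMPLE_HEALTH_EVENT = {
    "version": "0",
    "id": "00000000-0000-0000-0000-000000000000",
    "detail-type": "AWS Health Event",
    "source": "aws.health",
    "account": "123456789012",
    "time": "2024-01-01T00:00:00Z",
    "region": "us-east-1",
    "resources": ["i-0123456789abcdef0"],
    "detail": {
        "eventArn": "arn:aws:health:us-east-1::event/EC2/AWS_EC2_MAINTENANCE_SCHEDULED/EXAMPLE",
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_MAINTENANCE_SCHEDULED",
        "eventTypeCategory": "scheduledChange",
        "startTime": "Mon, 01 Jan 2024 00:00:00 GMT",
        "endTime": "Mon, 01 Jan 2024 02:00:00 GMT",
        "eventDescription": [
            {"language": "en_US", "latestDescription": "Example description."}
        ],
        # The enrichment function reads the instance ID from this list.
        "affectedEntities": [{"entityValue": "i-0123456789abcdef0"}],
    },
}

# Serialized form, suitable for pasting into a Lambda test event.
SAMPLE_HEALTH_EVENT_JSON = json.dumps(SAMPLE_HEALTH_EVENT, indent=2)
```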

Tips for coding your Lambda function in Python:

    1. Referring to Figure 3, you can get the ID of the affected EC2 instance from affectedEntities using the Python snippet below:
      ec2InstanceId = event['detail']['affectedEntities'][0]['entityValue']
    2. Get the environment and customEventBus tags associated with the affected EC2 instance. To do this, filter instances by the EC2 instance ID and loop through the tag keys to get each tag value.
    3. Enrich the event by adding an environment field to it: event['environment'] = environment
    4. Finally, send the enriched event to the custom event bus created in Step 2 using the put_events API call:
import json
import boto3

# eventBusName holds the value of the customEventBus tag read in tip 2
cloudwatch_events = boto3.client('events')
response = cloudwatch_events.put_events(
    Entries=[
        {
            'Source': 'modifiedHealthEvent',
            'EventBusName': eventBusName,
            'DetailType': 'enrichedEvent',
            'Detail': json.dumps(event)
        }
    ]
)
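
Putting the four tips together, a complete handler might look like the following sketch. It is one possible arrangement rather than the authors' exact function: the optional `ec2_client` and `events_client` parameters, the helper `get_instance_tags`, the "unknown" fallback, and the returned dictionary are all our own additions, included so the logic can be tested with stub clients.

```python
import json


def get_instance_tags(ec2_client, instance_id):
    """Return the tags of one EC2 instance as a plain {key: value} dict."""
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    tags = {}
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            for tag in instance.get("Tags", []):
                tags[tag["Key"]] = tag["Value"]
    return tags


def lambda_handler(event, context, ec2_client=None, events_client=None):
    """Enrich an AWS Health event with tag data and forward it to a custom bus."""
    if ec2_client is None or events_client is None:
        import boto3  # bundled with the Lambda Python runtime
        ec2_client = ec2_client or boto3.client("ec2")
        events_client = events_client or boto3.client("events")

    instance_id = event["detail"]["affectedEntities"][0]["entityValue"]
    tags = get_instance_tags(ec2_client, instance_id)

    event_bus_name = tags.get("customEventBus")
    if not event_bus_name:
        # Without a routing tag there is nowhere to send the event.
        return {"forwarded": False, "eventBusName": None}

    event["environment"] = tags.get("environment", "unknown")
    response = events_client.put_events(
        Entries=[
            {
                "Source": "modifiedHealthEvent",
                "EventBusName": event_bus_name,
                "DetailType": "enrichedEvent",
                "Detail": json.dumps(event),
            }
        ]
    )
    if response.get("FailedEntryCount", 0):
        raise RuntimeError("put_events rejected the enriched event")
    return {"forwarded": True, "eventBusName": event_bus_name}
```

Lambda invokes the handler with only `event` and `context`, so the two client parameters default to real boto3 clients in production. Raising on a non-zero FailedEntryCount surfaces delivery problems in the function's error metrics instead of silently dropping the alert.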

Step 5: Create an EventBridge rule to send events from the custom event bus to SNS

  1. Navigate to the EventBridge console.
  2. From the right-hand panel, under Get started, select EventBridge Rule and choose Create Rule.
  3. On the Rule detail page, enter a Name (e.g. “send-enriched-events”). Under Event bus, choose the event bus created in Step 2 (e.g. “health-events”). Select Next.
  4. Under Event source, select All Events. Leave all other options as is, and choose Next.
  5. In Select target(s), choose AWS Service. Under Select a target, choose SNS topic, and then select the topic you created in Step 1 (i.e. “health-alerts”).
  6. Keep the defaults for Configure Tags and select Next.
  7. On the Review and Create page, select Create Rule.
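
A scripted version of Step 5 could look like the sketch below, assuming a boto3 `events` client. Note one deliberate difference from the console steps: instead of matching all events, the pattern here matches only the 'modifiedHealthEvent' source used by the enrichment function, which is a narrower choice of ours. The helper name and the target ID are hypothetical.

```python
import json


def create_sns_forwarding_rule(events_client, bus_name, topic_arn,
                               rule_name="send-enriched-events"):
    """Create a rule on the custom bus that forwards enriched events to SNS."""
    pattern = {"source": ["modifiedHealthEvent"]}
    events_client.put_rule(
        Name=rule_name,
        EventBusName=bus_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        EventBusName=bus_name,
        Targets=[{"Id": "health-alerts-topic", "Arn": topic_arn}],
    )
    return pattern
```

When a rule is created through the API rather than the console, the SNS topic's access policy generally also needs a statement allowing events.amazonaws.com to publish to it; the console adds that statement for you.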

Step 6: Create an EventBridge rule that will send AWS Health events to our Lambda function

  1. Navigate to the EventBridge console. From the right-hand panel, under Get started, select EventBridge Rule and choose Create Rule.
  2. On the Rule detail page, enter a Name (e.g. “health-events-rule”). Under Event bus, choose default. Select Next.
  3. On the Build event pattern page, navigate to Creation Method and choose Use pattern form.
  4. Under Event pattern, for Event source select AWS services, and for AWS service choose Health. For Event type, choose Specific Health events. Choose Specific service(s) and select EC2. Select Next.

Figure shows event pattern for filtering an EC2 health event in EventBridge

Figure 4: Event pattern for filtering an EC2 health event

  5. In Select target(s), choose AWS Service. Under Select a target, choose Lambda function. Under Function, provide the Lambda function created in Step 4 (e.g. “EnrichHealthEvent”). Select Next.
  6. Keep the defaults for Configure Tags and select Next.
  7. On the Review and Create page, select Create Rule.
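
The rule from Step 6 can be approximated in code as follows. The pattern reflects our understanding of what the console form generates for EC2 health events, so compare it with Figure 4 before relying on it. The sketch assumes boto3 `events` and `lambda` clients; the helper name, target ID, and statement ID are hypothetical.

```python
import json

# Matches AWS Health events for the EC2 service on the default event bus.
HEALTH_EC2_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"service": ["EC2"]},
}


def create_health_rule(events_client, lambda_client, function_arn,
                       rule_name="health-events-rule"):
    """Route EC2 health events from the default bus to the Lambda function."""
    # Omitting EventBusName places the rule on the default event bus.
    rule_arn = events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(HEALTH_EC2_PATTERN),
        State="ENABLED",
    )["RuleArn"]
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "enrich-health-event", "Arn": function_arn}],
    )
    # Allow EventBridge to invoke the function on behalf of this rule.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    return rule_arn
```

The add_permission call mirrors what the console does behind the scenes when you pick a Lambda function as a target.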

Testing the solution

To test your solution, consider using the Lambda test feature:

  1. Navigate to the Lambda console and select the Lambda function created in Step 4.
  2. Navigate to the Test tab and create a new test event by modifying the event structure shown in Figure 3.
  3. Navigate to Code and, under the Test dropdown, select the test event you just created. Choose Test.

This invokes the function with a test health event, and you should receive a notification at the e-mail address configured in Step 1.
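
The same test can be driven from a script instead of the console. The sketch below assumes a boto3 `lambda` client and a test event dictionary shaped like the one in Figure 3; the helper name `invoke_with_test_event` is hypothetical.

```python
import json


def invoke_with_test_event(lambda_client, test_event,
                           function_name="EnrichHealthEvent"):
    """Invoke the function synchronously and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(test_event).encode("utf-8"),
    )
    result = json.loads(response["Payload"].read())
    if response.get("FunctionError"):
        # The payload carries the error details when the function fails.
        raise RuntimeError(f"Lambda invocation failed: {result}")
    return result
```

Because the invocation is synchronous, a failure inside the function shows up immediately as an exception rather than only in the function's logs.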

You can now modify the walkthrough provided in our example to suit your business needs. Test the solution in your own environments against the resources and tags you use.

Conclusion

In this blog post, we demonstrated a framework that automates alert notifications by assigning relevant tags to your AWS resources, improving responses to AWS Health events while reducing notification noise. We showed how you can parse an AWS Health event and enrich it for the relevant teams. To learn more about AWS Health, please visit the AWS Health documentation.

About the authors

Pranjal Gururani

Pranjal Gururani is a Solutions Architect at AWS based out of Seattle. Pranjal works with various customers to architect cloud solutions that address their business challenges. He enjoys hiking, kayaking, skydiving, and spending time with family during his spare time.

John Bickle

John Bickle is a Senior Technical Account Manager and Enterprise Support Lead based in Montreal, Canada. John loves working closely with customers to achieve operational excellence, reduce complexity and eliminate downtime. In his spare time, John enjoys music, sailing and photography.

Ballu Singh

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.