AWS Cloud Operations Blog

Resolve your AWS incidents faster by automatically engaging AWS Managed Services

Modern enterprise environments are increasingly reliant on complex, interconnected IT systems to drive their business and operations. From unexpected application outages to infrastructure issues, the potential for disruptions that can impact business continuity and customer satisfaction is significant. Many organizations struggle with rapid incident resolution due to limited 24/7 AWS expertise. In this post you will learn how AWS Managed Services (AMS) customers can engage the AWS experts faster and reduce their issue resolution time.

To address the challenge of limited 24/7 AWS expertise, AMS offers round-the-clock incident response support for all AWS services used by your applications. AMS acts as an extension of an organization’s IT team, providing AWS experts and proven practices for incident management and other operational activities. AMS draws on its deep operational expertise and direct access to AWS Support and Service teams for efficient issue resolution. Furthermore, AMS offers response and restoration Service Level Agreements (SLAs), ensuring rapid action on critical issues. This reduces your Mean Time to Resolution (MTTR) for outages and incidents.

As a first step in incident lifecycle management, AMS configures comprehensive monitoring for a wide range of AWS services, delivering robust operational oversight. This monitoring includes AMS creating alarms based on Amazon CloudWatch metrics, enabling proactive issue detection and response. However, specific business needs often require monitoring of additional AWS services, custom CloudWatch metrics, or composite alarms. In these cases, customers should create their own alarms for metrics they want to monitor beyond AMS’s baseline coverage. When these custom alarms trigger, indicating potential threats to business applications, rapid response is paramount. Engaging AMS’s AWS experts promptly is crucial for leveraging specialized expertise and reducing incident response time. This rapid engagement enables comprehensive resolution of complex issues that may span multiple AWS services or interact with custom configurations.

In this post, you’ll learn how to engage AMS swiftly through automation that generates AWS Support incidents from your CloudWatch alarms. The guide provides a foundational implementation that can be extended for more complex incident management workflows and integrated into your response runbooks.

Prerequisites

  1. AMS Enrollment: Your AWS account must be onboarded to AMS to leverage AMS incident support. If you’re not currently an AMS customer, please consult with your AWS account team or visit the AWS Managed Services page for enrollment information.
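  2. Necessary Permissions: Ensure you have the appropriate permissions within your AWS environment to perform the steps in this guide: creating IAM roles, creating Lambda functions, and editing CloudWatch alarm actions.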
  3. AWS Console Access: Verify that you have access to the AWS Management Console, as this guide will involve interactions with various AWS services.
  4. Basic Understanding of AWS services: Familiarity with AWS services, particularly CloudWatch, Lambda, and IAM, is beneficial but not mandatory.

Please note that while these prerequisites cover the essentials for following this guide, your specific organizational policies or more complex implementations may require additional permissions or resources.

Solution Overview

This solution uses an AWS Lambda function that calls the AWS Support API through the AWS Software Development Kit (SDK) to automate the creation of AMS incidents. This combination enhances your incident response process in the following ways:

  1. Automated Incident Creation: Configure your CloudWatch alarms to trigger the AMS incident creator Lambda function when specific conditions are met.
  2. Customizable Alerting: Tailor the incident creation process to your specific needs by adjusting the Lambda function’s code and CloudWatch alarm configurations.
  3. Streamlined Workflow: Automatically generate support incidents for AMS, reducing manual intervention and accelerating response times.
  4. Scalable Architecture: Easily apply this solution across multiple CloudWatch alarms and AWS services, providing comprehensive coverage for your infrastructure.
  5. Integration with Existing Monitoring: Complement your current monitoring setup by adding this automated incident creation capability to any of your existing or new CloudWatch alarms.

This approach allows you to proactively manage potential issues by bridging the gap between detection (via CloudWatch) and resolution (via AMS), thus minimizing downtime and improving overall system reliability.

Getting started

Step 1: Create an IAM role for your Lambda function execution

To enable your Lambda function to create AWS support cases, you need to establish an IAM role with appropriate permissions. This role will grant Lambda the necessary access to interact with AWS Support and other required services.

Required Permissions:

  • Use the AWS managed policy ‘AWSSupportAccess’ to grant the essential permissions for creating support cases.

Creation Methods: You can create this IAM role using any of the following methods:

  1. AWS Management Console
  2. AWS Command Line Interface (CLI)
  3. AWS API
  4. AWS CloudFormation

Example: Creating an IAM Role using AWS CLI

Below is a step-by-step example of how to create the required IAM role using the AWS CLI:

  1. Create the IAM Role: Execute the following AWS CLI command to create a role named ‘LambdaSupportRole’. This role includes a trust relationship allowing Lambda to assume it:
    aws iam create-role --role-name LambdaSupportRole --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole","Condition":{"StringEquals":{"aws:SourceArn":"arn:aws:lambda:REGION:ACCOUNT_ID:function:AmsIncidentGenerator"}}}]}'
    Remember to replace ‘REGION’ and ‘ACCOUNT_ID’ with your specific details. ‘AmsIncidentGenerator’ is the name of the Lambda function you will create in Step 2.
  2. Now, attach the ‘AWSSupportAccess’ managed policy to the role. This grants the necessary permissions to interact with AWS Support:
    aws iam attach-role-policy --role-name LambdaSupportRole --policy-arn arn:aws:iam::aws:policy/AWSSupportAccess
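
The single-line trust policy in the command above is easy to mistype. As a minimal sketch, the same document can be generated in Python and checked before it is passed to the CLI. The helper name lambda_trust_policy and the example Region and account ID are illustrative assumptions, not part of the original solution.

```python
import json


def lambda_trust_policy(region: str, account_id: str, function_name: str) -> dict:
    """Build a trust policy that lets the Lambda service assume the role,
    restricted to one function through the aws:SourceArn condition."""
    function_arn = f"arn:aws:lambda:{region}:{account_id}:function:{function_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"aws:SourceArn": function_arn}},
            }
        ],
    }


# Placeholder Region and account ID; replace them with your own values.
policy_json = json.dumps(
    lambda_trust_policy("us-east-1", "111122223333", "AmsIncidentGenerator")
)
print(policy_json)
```

The printed string can be supplied to --assume-role-policy-document, or passed as AssumeRolePolicyDocument to the create_role call of the boto3 IAM client if you prefer to script the role creation.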

Step 2: Create the AMS incident creator Lambda function

In this step, you’ll create a Lambda function that generates AMS incidents automatically. This function will serve as the core of your automated incident creation system.

For a detailed guide on creating Lambda functions, you can refer to the AWS documentation “Create your first Lambda function”. This post will follow the “Create a Lambda function with the console” steps, with some specific configurations for our use case.

2.1 Creating the Lambda Function:

  1. Navigate to the AWS Lambda console.
  2. Click ‘Create function’.
  3. Choose ‘Author from scratch’.
  4. Set the following basic information:
    • Function name: ‘AmsIncidentGenerator’
    • Runtime: Python 3.12 (or the latest available Python version)
    • Architecture: x86_64 (default)
  5. Expand the “Change default execution role” section.
  6. Select “Use an existing role”.
  7. From the dropdown, choose the ‘LambdaSupportRole’ role created in Step 1.
  8. Click ‘Create function’.

2.2 Configuring the Function Code:

After creating the function, you’ll be taken to the function configuration page. Here, you’ll replace the default code with our AMS incident creation code.

  1. In the “Code source” section, you’ll see the default “Hello World” code.
  2. Replace this entirely with the following Python code. This code is similar to the “Create Case with an AWS SDK” Python example in AWS Support documentation, but adapted for our specific use case:
    import json
    import logging
    
    import boto3
    from botocore.exceptions import ClientError
    
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    
    
    def lambda_handler(event, context):
        """Create an AMS incident from the CloudWatch alarm event that invoked this function."""
        support_client = boto3.client("support")
        try:
            response = support_client.create_case(
                # The alarm name becomes the incident subject.
                subject=event["alarmData"]["alarmName"],
                serviceCode="service-ams-operations-report-incident",
                severityCode="low",
                categoryCode="other",
                # The full alarm event is attached as the incident description.
                communicationBody=json.dumps(event),
                language="en",
                issueType="customer-service",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "SubscriptionRequiredException":
                logger.error(
                    "You must have a Business, Enterprise On-Ramp, or Enterprise Support "
                    "plan to use the AWS Support API. Please upgrade your subscription "
                    "to generate a support ticket."
                )
            else:
                logger.error(
                    "Couldn't create case. Here's why: %s: %s",
                    err.response["Error"]["Code"],
                    err.response["Error"]["Message"],
                )
            # Re-raise so that a failed case creation shows up as a failed invocation.
            raise
    
        case_id = response["caseId"]
        logger.info("Created case %s", case_id)
        return {
            "statusCode": 200,
            "body": case_id,
        }
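
The function reads the alarm name from event['alarmData']['alarmName'] and sends the whole event as the incident body. The sketch below shows how those fields map onto the create_case arguments, using an abbreviated sample event. The sample is an assumption based on the general shape of the payload CloudWatch passes to a Lambda alarm action, with placeholder values throughout, and build_case_params is a hypothetical helper introduced only for illustration. Check the CloudWatch documentation for the full payload.

```python
import json

# Abbreviated, illustrative alarm event; every value here is a placeholder.
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "alarmArn": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:orders-api-5xx",
    "accountId": "111122223333",
    "time": "2024-01-01T12:00:00.000+0000",
    "region": "us-east-1",
    "alarmData": {
        "alarmName": "orders-api-5xx",
        "state": {"value": "ALARM", "reason": "Threshold Crossed"},
        "previousState": {"value": "OK"},
        "configuration": {"description": "5xx errors on the orders API"},
    },
}


def build_case_params(event: dict) -> dict:
    """Return the keyword arguments the handler passes to create_case."""
    return {
        "subject": event["alarmData"]["alarmName"],
        "serviceCode": "service-ams-operations-report-incident",
        "severityCode": "low",
        "categoryCode": "other",
        "communicationBody": json.dumps(event),
        "language": "en",
        "issueType": "customer-service",
    }


params = build_case_params(SAMPLE_EVENT)
print(params["subject"])
```

A sample event like this can also be pasted into a test event in the Lambda console. Keep in mind that running the real function against it creates an actual incident.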

2.3 Configuring Function Settings:

  1. In the ‘Configuration’ tab, select ‘General configuration’.
  2. Click ‘Edit’.
  3. Set the Timeout to 1 minute.
  4. Click ‘Save’.

You have now created the AmsIncidentGenerator Lambda function, which creates AMS incidents. In the next step, you will configure the alarm to invoke the Lambda function when the alarm is in the ALARM state.

Step 3: Configure CloudWatch alarm to invoke AmsIncidentGenerator Lambda when it is in “ALARM” state

From the CloudWatch console, select your alarm, then choose Edit from the Actions dropdown to start the modification workflow.

Figure 2: Edit the alarm to configure alarm actions

As you go through the workflow, add a Lambda action and leave the rest of the configuration unchanged. Set the alarm state trigger to In alarm, set the function type to a function from the account you are signed in to, and choose the AmsIncidentGenerator Lambda function you created in the previous step. Complete the workflow.

Figure 3: Configuring the Lambda Alarm action

Step 4: Configure your alarm with the permissions to invoke AmsIncidentGenerator Lambda

After configuring the alarm action, you need to grant CloudWatch permission to invoke your Lambda function. Refer to the CloudWatch Lambda Alarm Action documentation for detailed authorization instructions. For quick reference, here’s an example AWS CLI command to set up the necessary permissions:

aws lambda add-permission --function-name AmsIncidentGenerator --statement-id CloudWatchAlarmAction --action 'lambda:InvokeFunction' --principal lambda.alarms.cloudwatch.amazonaws.com --source-arn 'arn:aws:cloudwatch:region:account-id:alarm:alarm-name'

Remember to replace ‘region’, ‘account-id’, and ‘alarm-name’ with your specific details.
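
If several alarms should invoke the same function, each alarm needs its own permission statement. The following sketch builds one set of add_permission arguments per alarm. The helper names and the alarm names are illustrative assumptions, and the principal is left as a placeholder to be filled with the service principal from the CLI command above.

```python
def alarm_arn(region: str, account_id: str, alarm_name: str) -> str:
    """Build the ARN of a CloudWatch alarm."""
    return f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def permission_requests(function_name, region, account_id, alarm_names, principal):
    """Return one set of add_permission keyword arguments per alarm."""
    requests = []
    for index, name in enumerate(alarm_names):
        requests.append(
            {
                "FunctionName": function_name,
                # Statement IDs must be unique within the function's policy,
                # so an index is used instead of the free-form alarm name.
                "StatementId": f"CloudWatchAlarmAction{index}",
                "Action": "lambda:InvokeFunction",
                "Principal": principal,
                "SourceArn": alarm_arn(region, account_id, name),
            }
        )
    return requests


# Placeholder values; replace SERVICE_PRINCIPAL with the principal from the CLI command.
requests = permission_requests(
    "AmsIncidentGenerator",
    "us-east-1",
    "111122223333",
    ["orders-api-5xx", "orders-db-cpu"],
    "SERVICE_PRINCIPAL",
)
for request in requests:
    print(request["StatementId"], request["SourceArn"])
```

Each dictionary can then be passed to the boto3 Lambda client, for example boto3.client("lambda").add_permission(**request), which mirrors the CLI command shown earlier.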

You have now completed all the steps required to generate AMS incidents when custom alarms are triggered.

What’s Next

Now that you’ve successfully configured the solution to generate AMS support incidents automatically from CloudWatch alarms, use it for your applications to engage AMS quickly. Identify the applications and services where you have configured alarms, and configure the actions of those CloudWatch alarms to invoke this solution. For your business-critical applications, consider using AWS Incident Detection and Response (IDR), which is an entitlement for AMS customers.

To further optimize this solution and tailor it to your specific operational needs, consider implementing the following enhancements:

  • Enrich alarm descriptions with detailed metric data and threshold values for quicker troubleshooting.
  • Implement logic to set incident priorities based on alarm severity or resource criticality, using alarm tags or naming conventions.
  • Extend the Lambda function to fetch and include relevant tags from alarmed resources, providing context such as environment type, owner, or application name.
  • Implement a deduplication mechanism using Amazon DynamoDB or by querying the AWS Support API to avoid duplicate incidents. Note that AMS provides self-service weekly incident reports.
  • Include preliminary troubleshooting steps or links to runbooks in the incident description for known issues to speed up resolution times.
  • Add error handling, improved logging, and monitoring for the AmsIncidentGenerator Lambda function.
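
Two of these enhancements, severity selection and deduplication, can be sketched as small helper functions. The alarm-name prefixes below are a hypothetical naming convention, and the mapping to severity codes is an assumption. The severity codes available to you depend on your support plan, so verify them before relying on this mapping. The cases argument is assumed to be the "cases" list returned by the describe_cases operation of the AWS Support API.

```python
# Hypothetical naming convention: the alarm-name prefix signals how urgent the alarm is.
SEVERITY_BY_PREFIX = {
    "critical-": "urgent",
    "high-": "high",
    "warn-": "normal",
}


def severity_for_alarm(alarm_name: str, default: str = "low") -> str:
    """Pick a severityCode for create_case from the alarm name prefix."""
    lowered = alarm_name.lower()
    for prefix, severity in SEVERITY_BY_PREFIX.items():
        if lowered.startswith(prefix):
            return severity
    return default


def has_open_case(subject: str, cases: list) -> bool:
    """Return True if an unresolved case with the same subject already exists."""
    return any(
        case.get("subject") == subject and case.get("status") != "resolved"
        for case in cases
    )


# Example: decide whether to open a new incident for an alarm.
existing_cases = [{"caseId": "case-example", "subject": "critical-orders-db-cpu", "status": "opened"}]
alarm_name = "critical-orders-db-cpu"
if has_open_case(alarm_name, existing_cases):
    print("An incident already exists for", alarm_name)
else:
    print("Create an incident with severity", severity_for_alarm(alarm_name))
```

In the Lambda function, severity_for_alarm would replace the fixed severityCode value, and has_open_case would run before create_case is called.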

Conclusion

In this post, you walked through the process of creating an automated system that generates AMS support incidents from CloudWatch alarms, so that you can engage AMS faster and reduce your incident response time. By using AWS Lambda and the AWS SDK, you have built a solution that bridges the gap between monitoring and incident management.

You began by setting up the necessary IAM role, then created a Lambda function to interface with AWS Support, and finally configured CloudWatch alarms to trigger this function when specific conditions are met. This automation not only reduces the manual overhead of creating support tickets but also ensures that critical alerts are promptly addressed by the AMS team.

This solution provides a foundation for automated incident creation, but can be expanded to meet complex operational needs. Potential enhancements include enriching incident details with contextual information and implementing advanced incident management mechanisms.