AWS Cloud Operations & Migrations Blog

Reduce incident management response times for container workloads using AWS Chatbot

A key focus area for customers running mission-critical container workloads on AWS is the ability to analyze and act on operational events quickly. Real-time visibility into performance issues, traffic spikes, infrastructure events and security threats helps teams address issues quickly and prevent potential downtime.

AWS Chatbot helps teams collaborate and respond to events faster by letting them monitor, troubleshoot and operate AWS resources from their chat channels. AWS Chatbot is an interactive agent that makes it easy to set up ChatOps for AWS in Microsoft Teams, Slack channels or Amazon Chime chat rooms. With AWS Chatbot, customers can receive alerts, retrieve diagnostic information, configure AWS resources and resolve incidents from their chat channels, enabling them to reduce incident management response times for container workloads.

In this post, we will discuss how to monitor and operate container workloads running on Amazon Elastic Container Service (Amazon ECS) from Slack.

Introduction

Assume you are part of the DevOps team, rolling out a new containerized application that will run on Amazon ECS. You have configured Amazon ECS service auto-scaling based on initial traffic projections. You want to be alerted on any unexpected traffic spike impacting application performance and be able to remediate the issue quickly.

First, we will deploy a simple containerized web application to Amazon ECS. We will then set up an Amazon CloudWatch alarm to notify you in the Slack channel when the average CPU utilization exceeds a configured limit. Finally, we will remediate the issue by updating the autoscaling configuration with AWS CLI commands run from the Slack channel.

Architecture

Figure 1 below illustrates a high-level flow for receiving notifications:

  • Amazon CloudWatch Alarm triggers when average CPU utilization of Amazon ECS tasks exceeds 70%.
  • Amazon CloudWatch Alarm sends the notification to Amazon SNS.
  • AWS Chatbot processes the notification from Amazon SNS and sends it to Slack.

Diagram showing Notification flow from Amazon ECS workloads

Figure 1: Notification Flow
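
To make the notification flow more concrete, here is a minimal TypeScript sketch of the kind of JSON message a CloudWatch alarm publishes to Amazon SNS. The interface lists only a subset of the fields, the payload values are made up, and summarizeAlarm is a hypothetical helper for illustration only. AWS Chatbot does its own parsing and formatting before posting to Slack.

```typescript
// Subset of the JSON document that a CloudWatch alarm publishes to SNS.
interface AlarmMessage {
  AlarmName: string;
  NewStateValue: "OK" | "ALARM" | "INSUFFICIENT_DATA";
  NewStateReason: string;
  Region: string;
  Trigger: {
    MetricName: string;
    Namespace: string;
    Statistic: string;
    Period: number; // seconds
    Threshold: number;
  };
}

// Hypothetical helper: turn the raw SNS message body into a one-line summary.
function summarizeAlarm(rawMessage: string): string {
  const msg = JSON.parse(rawMessage) as AlarmMessage;
  const t = msg.Trigger;
  return `[${msg.NewStateValue}] ${msg.AlarmName}: ${t.Namespace}/${t.MetricName} ` +
    `(${t.Statistic}, ${t.Period}s) vs threshold ${t.Threshold}`;
}

// Example payload with made-up values.
const samplePayload = JSON.stringify({
  AlarmName: "ecs-high-cpu",
  NewStateValue: "ALARM",
  NewStateReason: "Threshold Crossed",
  Region: "US East (N. Virginia)",
  Trigger: {
    MetricName: "CPUUtilization",
    Namespace: "AWS/ECS",
    Statistic: "AVERAGE",
    Period: 300,
    Threshold: 70,
  },
});

console.log(summarizeAlarm(samplePayload));
```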

Figure 2 below illustrates how the remediation flow works:

  • User sends AWS CLI command from the Slack channel.
  • AWS Chatbot runs the command using the associated channel IAM role and returns the response to Slack.

Diagram showing remediation flow

Figure 2: Remediation Flow

Prerequisites

To follow along with the instructions, you need an AWS account. If you don’t have one, you can create a new AWS account here. You also need an AWS Identity and Access Management (IAM) user with appropriate permissions to create and manage Amazon ECS clusters, create Amazon CloudWatch alarms and dashboards, create Amazon SNS topics and configure the AWS Chatbot client.

You need access to a Slack workspace, where administrators have approved the use of the AWS Chatbot app. You also need to create a Slack channel to receive notifications from AWS Chatbot.

You also need to set up AWS Cloud Development Kit (AWS CDK) development environment. For instructions on how to set it up, see Getting Started With the AWS CDK – Prerequisites.

Deployment steps

Here are the high-level deployment steps:

  1. Deploy a Containerized Web Application: Create an Amazon ECS cluster and deploy a sample containerized application to it.
  2. Setup Notifications: Create an Amazon CloudWatch alarm which gets triggered when the average CPU utilization of the Amazon ECS tasks exceeds 70%, and sends notification to the Amazon SNS topic.
  3. Configure AWS Chatbot: Create an AWS Chatbot configuration for the Slack channel. This will allow AWS Chatbot to interact with the configured Slack channel.

For testing, you will generate traffic to the web application. This traffic spike will trigger a notification on the Slack channel. You will then troubleshoot and resolve the issue from within the Slack channel.

Step 1: Deploy a Containerized Web Application

In this step, you will deploy a containerized PHP web application to an Amazon ECS cluster using AWS CDK. Install AWS CDK and initialize the project by running the following commands. You will use TypeScript as the CDK language for this setup.

npm install -g aws-cdk
mkdir ecs-sample-app
cd ecs-sample-app
cdk init --language typescript

In the AWS CDK project you created, update lib/ecs-sample-app-stack.ts (the stack file that cdk init generates for the ecs-sample-app directory) so that it resembles the following.

This AWS CDK stack does the following:

  • creates the AWS resources required to run a containerized web application, including an Amazon ECS cluster, an Amazon VPC, an Application Load Balancer, security groups, and IAM roles and policies.
  • creates an Amazon ECS task definition and deploys a web application to Amazon ECS cluster. The containers run on AWS Fargate, which is a serverless compute engine for containers.
  • configures Amazon ECS service auto-scaling, with a target CPU utilization of 50%, minimum capacity of 1 and maximum capacity of 2.
  • creates an IAM role and an IAM managed policy required to configure AWS Chatbot.
  • creates an Amazon SNS topic, which will send notifications.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as ecs_patterns from "aws-cdk-lib/aws-ecs-patterns";
import * as sns from "aws-cdk-lib/aws-sns";
import * as iam from "aws-cdk-lib/aws-iam";

export class EcsSampleAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create a load-balanced Fargate service and make it public
    const fargateLoadbalancedservice =
      new ecs_patterns.ApplicationLoadBalancedFargateService(this, "MyFargateService", {
      taskImageOptions: { image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample")},
      publicLoadBalancer: true // Default is true
    });

    // Enable Service Autoscaling
    const autoscale = fargateLoadbalancedservice.service.autoScaleTaskCount(
     {minCapacity:1, maxCapacity: 2}
    )
    
    autoscale.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 50,
      scaleInCooldown: cdk.Duration.seconds(30),
      scaleOutCooldown: cdk.Duration.seconds(30)
    });    
    
    // Create an SNS Topic for Notification
    const cwTopic = new sns.Topic (this, 'notification', {
      topicName: 'aws-chatbot-notification-topic',
      displayName: 'aws-chatbot-notification-topic'
    })

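    // IAM role that AWS Chatbot assumes to run commands on behalf of the channel
    const cwRole = new iam.Role(this, 'chatbot-channel-role', {
      assumedBy: new iam.ServicePrincipal('chatbot.amazonaws.com'),
      description: 'AWS Chatbot Role',
    });

    // Managed policy attached to the channel role; also selected later as the channel guardrail
    const cwPolicy = new iam.ManagedPolicy(this, 'chatbot-channel-policy', {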
      statements: [
        new iam.PolicyStatement({
          resources: ['*'],
          actions: [
            "cloudwatch:ListDashboards",
            "cloudwatch:GetDashboard",
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:GetMetricWidgetImage"            
          ],
        }),
        new iam.PolicyStatement({
          resources: ['*'],
          actions: [
            "application-autoscaling:RegisterScalableTarget",
            "application-autoscaling:DescribeScalableTargets"
          ],
        })
      ],
      roles: [cwRole]
    });

    // Stack outputs referenced in the later steps
    new cdk.CfnOutput(this, 'ECS Cluster Name', { value: fargateLoadbalancedservice.cluster.clusterName });
    new cdk.CfnOutput(this, 'ECS Service Name', { value: fargateLoadbalancedservice.service.serviceName });
    new cdk.CfnOutput(this, 'SNS Topic', { value: cwTopic.topicName });
    new cdk.CfnOutput(this, 'Channel Role', { value: cwRole.roleName });
    new cdk.CfnOutput(this, 'Channel Guardrail', { value: cwPolicy.managedPolicyName });
  }
}

To deploy the services in your AWS account, run the below command in your application’s main directory. You’ll be asked to approve the IAM policies and security group changes that the AWS CDK generated.

cdk deploy 

Once the deployment is complete, note down the values of SNSTopic, ChannelRole, ChannelGuardrail, ECSClusterName, ECSServiceName and MyFargateServiceServiceURL from the output displayed.
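
If you would rather not copy the values by hand, one option is to run the deployment as cdk deploy --outputs-file outputs.json and read the values from that file. The sketch below is an illustration, not part of the original walkthrough: findOutput is a hypothetical helper, the stack name EcsSampleAppStack is assumed from the class name, and the sample values are made up. It matches output keys by prefix on the assumption that keys generated by the ECS pattern may carry a generated suffix.

```typescript
// Shape of the file written by `cdk deploy --outputs-file outputs.json`:
// { "<StackName>": { "<OutputKey>": "<value>" } }
type StackOutputs = Record<string, Record<string, string>>;

// Hypothetical helper: return the first output of a stack whose key starts
// with the given prefix, or undefined if the stack or key is not present.
function findOutput(outputs: StackOutputs, stackName: string, keyPrefix: string): string | undefined {
  const stack = outputs[stackName];
  if (stack === undefined) {
    return undefined;
  }
  const keys = Object.keys(stack);
  for (let i = 0; i < keys.length; i++) {
    if (keys[i].indexOf(keyPrefix) === 0) {
      return stack[keys[i]];
    }
  }
  return undefined;
}

// Example with made-up values; in practice, pass JSON.parse of the file contents.
const exampleOutputs: StackOutputs = {
  EcsSampleAppStack: {
    ECSClusterName: "demo-cluster",
    ECSServiceName: "demo-service",
    MyFargateServiceServiceURL4CF8398A: "http://demo-alb.example.com",
  },
};

console.log(findOutput(exampleOutputs, "EcsSampleAppStack", "MyFargateServiceServiceURL"));
```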

To verify that the deployment is successful, open MyFargateServiceServiceURL in a browser. You should see a page like Figure 3.

Diagram showing sample application page

Figure 3: Sample Web Application Page

Step 2: Setup Notifications

In this step, you will create an Amazon CloudWatch alarm for notifications and an Amazon CloudWatch dashboard for monitoring.

2.1 Create Amazon CloudWatch Alarm

You will set up an Amazon CloudWatch alarm to send notifications when the average CPU utilization of the Amazon ECS service exceeds 70%.

  • Choose CloudWatch service on AWS Console.
  • In the navigation pane, select All alarms under Alarms. Select Create Alarm.
  • In Specify metric and conditions page, choose Select metric.
  • On the Select metric page, in the Browse tab under AWS namespaces, choose the ECS namespace.
  • Choose ClusterName, ServiceName.
  • Choose the check box for CPUUtilization metric corresponding to the Amazon ECS cluster and service that you created in Step 1.
  • Choose Select metric.

The Specify metric and conditions page shows a graph and other information about the metric and statistic.

  • In the Metric section, make sure the Metric name is CPUUtilization and Statistic is Average with a Period of 5 minutes.
  • For Conditions (the circumstances under which the Amazon CloudWatch alarm fires and an action takes place), choose the following options:
    • For Threshold type, choose Static.
    • For Whenever CPUUtilization is, choose Greater > threshold.
    • For .. specify a threshold value of 70. This setting ensures that the notification is triggered whenever the average CPU utilization of the service goes above 70%.
  • Choose Next.
  • On the Configure actions screen, set the action to send an SNS notification when the metric exceeds the threshold.
  • For Notification, choose the following options.
    • For Whenever this alarm state is… choose In Alarm.
    • For Select an SNS topic choose Select an existing SNS topic.
    • For Send a notification to… choose your SNS topic that was created in Step 1.
  • Choose Next.
  • Enter a name and description for the alarm. The name must contain only ASCII characters. Then choose Next.
  • On the Preview and create page, confirm that the information and conditions are correct, then choose Create alarm.
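
If you prefer to keep the alarm in code, the console steps above can be approximated in the CDK stack from Step 1. The following is a sketch rather than part of the original walkthrough: the function name addCpuAlarm and the construct ID HighCpuAlarm are illustrative, and evaluationPeriods: 1 is an assumption, since the console steps do not state the number of evaluation periods.

```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cw_actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as sns from 'aws-cdk-lib/aws-sns';

// Illustrative helper: alarm on the service's average CPU utilization
// and publish state changes to ALARM to the given SNS topic.
export function addCpuAlarm(
  scope: Construct,
  service: ecs.FargateService,
  topic: sns.ITopic,
): cloudwatch.Alarm {
  const alarm = new cloudwatch.Alarm(scope, 'HighCpuAlarm', {
    // Average CPUUtilization of the ECS service over 5-minute periods
    metric: service.metricCpuUtilization({
      statistic: 'Average',
      period: cdk.Duration.minutes(5),
    }),
    comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    threshold: 70,
    evaluationPeriods: 1, // assumption: alarm after a single breaching period
    alarmDescription: 'Average ECS service CPU utilization is above 70%',
  });

  // Notify the SNS topic when the alarm enters the ALARM state
  alarm.addAlarmAction(new cw_actions.SnsAction(topic));
  return alarm;
}
```

Inside the stack constructor, this could be called as addCpuAlarm(this, fargateLoadbalancedservice.service, cwTopic). An alarm created this way belongs to the stack, so it is removed together with the other stack resources by cdk destroy.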

2.2 Create Amazon CloudWatch Dashboard

You will create a new Amazon CloudWatch Dashboard to monitor the CPU and memory utilization of the Amazon ECS service.

  • Choose Amazon ECS service on AWS Console.
  • Choose the cluster created in Step 1 to open the Cluster Overview.
  • From the Services section, choose the service name link to open the Service Overview.
  • Under Health and metrics, you will see the CPU utilization and Memory utilization widgets.
  • Select Add to dashboard.
  • In Add to dashboard page, select Create new.
  • Enter a name for the new dashboard. Note down the dashboard name. Select
  • Choose Add to dashboard, which will open the new Amazon CloudWatch dashboard (Figure 4).
  • Select Save. 

Diagram showing Amazon CloudWatch dashboards for Amazon ECS

Figure 4: Amazon CloudWatch dashboard

Step 3: Configure AWS Chatbot

In this step, you will create an AWS Chatbot configuration in Slack to get notified on Amazon CloudWatch alarms. You will also configure Slack channel member permissions and guardrail policies to allow channel members to issue CLI commands from chat channels.

3.1 Add AWS Chatbot to the Slack workspace

Adding AWS Chatbot App to your Slack workspace enables you to interact with AWS resources from your team’s channels.

  • In Slack, on the left navigation pane, choose Apps. If you do not see Apps in the left navigation pane, choose More, then choose Apps (Figure 5).
  • If AWS Chatbot is not listed, choose the Browse Apps Directory. Browse the directory for the AWS Chatbot app.
  • Choose Add to add AWS Chatbot to your workspace.

Choosing Apps page in Slack

Figure 5: Choosing Apps page in Slack

3.2 Configure chat client in AWS Chatbot

You will now need to configure a client for Slack in AWS Chatbot, so that the Chatbot can push notifications to Slack channels.

  • Open the AWS Chatbot console.
  • Under Configure a chat client, choose Slack, then choose Configure client.
  • From the dropdown list at the top right, choose the Slack workspace that you want to use with AWS Chatbot.
  • Choose Allow.

3.3 Configure Chatbot channel

  • On the Workspace details page, choose Configure new channel to create a new Chatbot configuration (Figure 6).

Diagram showing AWS Chatbot Client configuration

Figure 6: AWS Chatbot channel configuration

  • Under the Configuration details, enter a name for your configuration. The name must be unique across your account and can’t be edited later.
  • Choose Publish logs to Amazon CloudWatch Logs to enable logging for this configuration. For more information, see Amazon CloudWatch Logs for AWS Chatbot.
  • For Slack channel, choose the channel type as Public or Private, depending on the Slack channel configuration.
  • Enter the channel name.
    • If you are using a Public Slack channel for testing, you can choose the channel name from the list.
    • If you are using a Private channel, find the channel ID in Slack by right clicking on the channel in the channel list and copying the link. The channel ID is the string at the end of the URL.
    • If you configure a private Slack channel, run the /invite @AWS command in Slack to invite the AWS Chatbot to the chat room.
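
As a small illustration of "the string at the end of the URL", the sketch below pulls the last path segment out of a copied channel link. The helper name channelIdFromLink and the example link are hypothetical, and the check that the ID consists of uppercase letters and digits is an assumption about how Slack channel IDs look.

```typescript
// Hypothetical helper: extract the channel ID from a copied Slack channel link.
function channelIdFromLink(link: string): string | undefined {
  // Drop any query string or fragment, then any trailing slashes.
  const path = link.replace(/[?#].*$/, "").replace(/\/+$/, "");
  // Keep only the text after the last remaining slash.
  const lastSegment = path.substring(path.lastIndexOf("/") + 1);
  // Assumption: channel IDs are made of uppercase letters and digits.
  return /^[A-Z0-9]+$/.test(lastSegment) ? lastSegment : undefined;
}

// Example with a made-up workspace and channel ID.
console.log(channelIdFromLink("https://example.slack.com/archives/C0123ABCDEF"));
```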

Next, configure the permissions under Role setting. Your role setting dictates what permissions your channel members have. A channel role gives all members the same permissions. This is useful if your channel members typically perform the same actions in Slack. A user role requires your channel members to choose their own roles. As such, different users in your channels can have different permissions. This is useful if your channel members are diverse or you don’t want new channel members to perform actions as soon as they join the channel. For more information, see Role setting.

  • For this setup, you will choose Channel role. For Channel Role, choose Use an existing role. Select the IAM ChannelRole that was created in Step 1 from the drop-down. Following the principle of least privilege, this IAM role has permissions only to view Amazon CloudWatch dashboards and change the Amazon ECS service autoscaling configuration.

Channel guardrail policies provide detailed control over what actions your channel members can take. These guardrail policies are applied at runtime to both channel roles and user roles. For this example, you are going to select the same policy that is attached to the channel role.

  • For Channel guardrail policies, choose the IAM ChannelGuardrail policy that was created in Step 1 from the drop-down. For this example, the guardrail policy is given the same permissions as the channel role.
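
One way to picture how a guardrail interacts with a role is as an upper bound: an action is usable from the channel only if both the role and the guardrail allow it. The sketch below models this as a plain list intersection. It is a deliberate simplification that ignores wildcards, resource scoping and explicit denies, and effectiveActions and the narrower guardrail list are hypothetical.

```typescript
// Simplified model: keep only the role's actions that the guardrail also allows.
function effectiveActions(roleActions: string[], guardrailActions: string[]): string[] {
  return roleActions.filter((action) => guardrailActions.indexOf(action) !== -1);
}

// Actions granted by the channel role in the Step 1 stack.
const channelRoleActions = [
  "cloudwatch:ListDashboards",
  "cloudwatch:GetDashboard",
  "cloudwatch:GetMetricStatistics",
  "cloudwatch:GetMetricWidgetImage",
  "application-autoscaling:RegisterScalableTarget",
  "application-autoscaling:DescribeScalableTargets",
];

// Hypothetical narrower guardrail that does not allow scaling changes.
const readOnlyGuardrail = [
  "cloudwatch:ListDashboards",
  "cloudwatch:GetDashboard",
  "application-autoscaling:DescribeScalableTargets",
];

// With the same list as the guardrail (as in this post), nothing is filtered out.
console.log(effectiveActions(channelRoleActions, channelRoleActions).length);
// With the narrower guardrail, RegisterScalableTarget is no longer available.
console.log(effectiveActions(channelRoleActions, readOnlyGuardrail));
```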

Lastly, you will subscribe SNS topics to the AWS Chatbot configuration. AWS Chatbot monitors these topics and the messages from these topics are sent to the associated chat channel.

  • Under Notifications section, choose the AWS Region that you are using and then choose SNS topic that was created in Step 1 to create a topic subscription.
  • Choose Configure to complete the AWS Chatbot channel configuration.
  • Send a test message to the configured chat channel. From Configured clients, select the checkbox next to the newly created chatbot configuration and choose Send test message. You will receive a test message in the configured Slack channel.

Step 4: Load Test

Now that the setup is complete, you will simulate a traffic spike scenario and validate how monitoring and incident response actions can be performed from the Slack channel.

For generating traffic to the sample application, you will use hey, an open-source load generator tool. Install the tool following the instructions here.

After installation, run the below command, which runs 80 concurrent workers, each rate-limited to 100 queries per second, for 20 minutes. This should generate sufficient traffic for the Amazon ECS service CPU utilization to exceed the Amazon CloudWatch alarm threshold.

Replace <MyFargateServiceServiceURL> with the value of MyFargateServiceServiceURL from Step 1.

hey -z 20m -c 80 -q 100 <MyFargateServiceServiceURL>
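
For a sense of scale, the arithmetic behind these flags can be sketched as follows. The figures are an upper bound, because -q is a per-worker rate limit and the achieved rate will be lower if the service responds slowly. The function name loadUpperBound is illustrative.

```typescript
// Upper bound on the load generated by `hey -z <minutes>m -c <workers> -q <qpsPerWorker>`.
function loadUpperBound(workers: number, qpsPerWorker: number, minutes: number) {
  const requestsPerSecond = workers * qpsPerWorker;
  const totalRequests = requestsPerSecond * minutes * 60;
  return { requestsPerSecond, totalRequests };
}

// 80 workers x 100 queries per second each, for 20 minutes:
// at most 8,000 requests per second and 9,600,000 requests in total.
const bound = loadUpperBound(80, 100, 20);
console.log(bound.requestsPerSecond, bound.totalRequests);
```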

In a few minutes, you should receive an Amazon CloudWatch alarm notification.

  • Navigate to your Slack channel, which you configured to receive your Amazon CloudWatch alarm notifications.
  • Locate the CloudWatch notification for the alarm you created in Step 2 (Figure 7). Select the See more link to view the details of the CloudWatch alarm metric trend line.

Figure 7: CloudWatch Alarm Notification in Slack

As a first step in troubleshooting, you will retrieve the Amazon CloudWatch dashboard to get more details on the Amazon ECS service.

  • Choose List Dashboards in the Slack notification to retrieve dashboards.
  • Choose Show corresponding to the Amazon CloudWatch dashboard you created in Step 2.
  • Choose the CPU Utilization widget to retrieve the corresponding metric data. You can optionally filter metrics data further by choosing appropriate Start Time, Metric Period, and Statistics filters (Figure 8).

Diagram showing Amazon CloudWatch dashboards from Slack

Figure 8: CloudWatch Dashboard in Slack

With additional insights on the issue, you can move on to the remediation phase. Adding more Amazon ECS tasks can help address the traffic spike. Before doing that, you will need to check the current service autoscaling configuration. You can do this by running AWS CLI commands from the Slack channel itself. If you don’t remember the CLI command syntax, AWS Chatbot will guide you to complete the CLI command.

Enter the below command on Slack channel to view the service autoscaling configuration (Figure 9).

@aws application-autoscaling describe-scalable-targets --service-namespace ecs

Diagram showing running remediation tasks from Slack

Figure 9: Details on Autoscaling configuration in Slack

You can see that the maximum capacity value is currently set to 2.  You will change this to a higher value to make sure that enough Amazon ECS tasks are running to support the traffic.
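
To see why the maximum capacity matters here, consider a rough proportional model of target tracking: scale the task count by the ratio of observed to target CPU utilization, then clamp the result to the configured range. This is only an approximation for illustration, not the exact algorithm that Application Auto Scaling uses, and estimateDesiredCount is a hypothetical name.

```typescript
// Rough model: desired tasks grow in proportion to observed/target CPU,
// but never leave the [minCapacity, maxCapacity] range.
function estimateDesiredCount(
  currentTasks: number,
  observedCpuPercent: number,
  targetCpuPercent: number,
  minCapacity: number,
  maxCapacity: number,
): number {
  const proportional = Math.ceil((currentTasks * observedCpuPercent) / targetCpuPercent);
  return Math.min(maxCapacity, Math.max(minCapacity, proportional));
}

// Two tasks at 90% CPU with a 50% target would call for 4 tasks,
// but a maximum capacity of 2 holds the service at 2.
console.log(estimateDesiredCount(2, 90, 50, 1, 2));
// Raising the maximum capacity to 4 lets the service scale out to 4.
console.log(estimateDesiredCount(2, 90, 50, 1, 4));
```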

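  • Enter the below command on Slack channel to update the maximum capacity value to 4. Replace <ecs-service-id> with the resource ID of your service, which has the form service/<cluster name>/<service name> and also appears as ResourceId in the output of the previous command.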

@aws application-autoscaling register-scalable-target --service-namespace ecs --scalable-dimension ecs:service:DesiredCount --resource-id <ecs-service-id> --min-capacity 1 --max-capacity 4

  • When asked for confirmation, choose [Run] command.
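
The resource ID for an Amazon ECS service has the form service/<cluster name>/<service name>. As a convenience, the command above can be assembled from the ECSClusterName and ECSServiceName values noted in Step 1. The helper names below are hypothetical and the cluster and service names are placeholders.

```typescript
// Application Auto Scaling resource ID for an ECS service.
function ecsResourceId(clusterName: string, serviceName: string): string {
  return `service/${clusterName}/${serviceName}`;
}

// Hypothetical helper: build the Slack command that updates the capacity limits.
function registerScalableTargetCommand(
  clusterName: string,
  serviceName: string,
  minCapacity: number,
  maxCapacity: number,
): string {
  if (minCapacity > maxCapacity) {
    throw new Error("minCapacity must not exceed maxCapacity");
  }
  return [
    "@aws application-autoscaling register-scalable-target",
    "--service-namespace ecs",
    "--scalable-dimension ecs:service:DesiredCount",
    `--resource-id ${ecsResourceId(clusterName, serviceName)}`,
    `--min-capacity ${minCapacity}`,
    `--max-capacity ${maxCapacity}`,
  ].join(" ");
}

// Placeholder names; substitute the values from your stack output.
console.log(registerScalableTargetCommand("demo-cluster", "demo-service", 1, 4));
```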

Once the max capacity value is increased, Amazon ECS service auto scaling will increase the task count to meet scaling criteria. To make sure that the issue is resolved, you can monitor the CPU utilization using CloudWatch dashboard.

  • Enter the below command on Slack channel, replacing us-east-1 with the AWS Region you are using.

@aws cw list-dashboards --region us-east-1

  • Choose ECS-Service-Dashboard.
  • Choose the CPU Utilization widget to retrieve the corresponding metric data. You should see the CPU utilization trending down.

Cleanup

To clean up the resources created in this post, follow the instructions below.

  • Use AWS CDK to delete the resources created by the CDK stack.

cdk destroy

  • Delete Amazon CloudWatch Alarm
    • Choose CloudWatch service on AWS Console.
    • In the navigation pane, select All alarms under Alarms.
    • Select the Amazon CloudWatch Alarm you created in Step 2. Choose Actions, and then delete the alarm.
  • Delete Amazon CloudWatch Dashboard
    • Choose CloudWatch service on AWS Console.
    • In the navigation pane, select Dashboards.
    • Select the Dashboard you created in Step 2 from Custom dashboards.
    • Delete the dashboard.
  • Delete AWS Chatbot Client Configuration
    • Choose Chatbot service on AWS Console.
    • Select Slack – AWS under Configured Clients.
    • Select the configuration you created in Step 3 from Configured channels.
    • Delete the configuration.
  • Cleanup Slack resources
    • Delete the Slack channel you created for testing.

Conclusion

In this post, we walked through the steps to set up AWS Chatbot to monitor and operate container workloads from Slack. This approach provides near real-time visibility into operational events, and enables teams to collaborate on troubleshooting and resolving issues quickly.

You can learn more about AWS Chatbot here.

Hareesh Iyer

Hareesh Iyer is a Senior Solutions Architect at AWS. He helps customers build scalable, secure, resilient and cost-efficient architectures on AWS. He is passionate about cloud-native development, containers and microservices.