AWS Cloud Operations & Migrations Blog

Enhance Kubernetes Operational Visibility with AWS Chatbot

Many customers run their mission-critical container workloads on Amazon Web Services (AWS) using Amazon Elastic Kubernetes Service (Amazon EKS). One of their key focus areas is to analyze and act on operational events quickly. Real-time visibility into performance issues, traffic spikes and infrastructure events enables teams to address issues quickly and prevent potential downtime.

In this blog post, we describe how to monitor Amazon EKS workloads in near real-time from your chat channels using AWS Distro for OpenTelemetry (ADOT), Amazon CloudWatch and AWS Chatbot. We will specifically cover the integration with Microsoft Teams. The steps outlined in this blog can be extended to other chat platforms like Amazon Chime or Slack for monitoring, troubleshooting and remediation.

Introduction

AWS Chatbot is an interactive agent that makes it easy to set up ChatOps for AWS with Microsoft Teams, Slack or Amazon Chime chatrooms. With AWS Chatbot, customers can receive alerts, retrieve diagnostic information, configure AWS resources and resolve incidents from their chat channels, enabling them to reduce incident management response times for container workloads. In this solution, we will use AWS Chatbot to send customized monitoring alerts for Amazon EKS to a Microsoft Teams channel.

AWS Distro for OpenTelemetry (ADOT) is a secure, AWS-supported distribution of the OpenTelemetry project. Part of the Cloud Native Computing Foundation, OpenTelemetry provides open source APIs, libraries, and agents to collect distributed traces and metrics for application monitoring. Users can instrument their applications just once and, using ADOT, send correlated metrics and traces to multiple monitoring solutions. AWS Distro for OpenTelemetry also collects metadata from your AWS resources and managed services, so you can correlate application performance data with underlying infrastructure data, reducing the mean time to problem resolution. We will use the ADOT collector to collect metrics from workloads deployed to the Amazon EKS cluster and send them to Amazon CloudWatch.

Amazon CloudWatch is a monitoring and observability service that provides actionable insights for AWS, on-premises, hybrid, and other cloud applications and infrastructure resources. You can view the Amazon EKS cluster metrics in the Amazon CloudWatch console, create alarms and dashboards, and use various other CloudWatch features for monitoring and troubleshooting. We will configure an Amazon CloudWatch alarm that sets a threshold for a specific Amazon EKS metric and triggers a notification when the metric breaches that threshold.

Architecture

This architecture uses Amazon EKS, AWS Distro for OpenTelemetry, Amazon CloudWatch, Amazon Simple Notification Service (Amazon SNS), and AWS Chatbot. Figure 1 illustrates the high-level notification flow.

  • An AWS Distro for OpenTelemetry collector runs on the Amazon EKS cluster, scrapes AWS resource and application telemetry data, and ingests it into Amazon CloudWatch.
  • An Amazon CloudWatch alarm, configured based on a metric of interest, sends a notification to Amazon Simple Notification Service (Amazon SNS) when the threshold is breached.
  • AWS Chatbot receives the notifications from Amazon SNS and sends them to Microsoft Teams.

Figure 1: Notification flow

Pre-requisites

Ensure you have an active AWS account. If you haven’t already, create one here.

Setup Walk-through

Here are the high-level deployment steps:

  • Set up the AWS Distro for OpenTelemetry collector to collect metrics from the Amazon EKS cluster and send them to Amazon CloudWatch
  • Create an Amazon SNS topic to receive notifications from Amazon CloudWatch
  • Create an Amazon CloudWatch alarm based on the metric to be monitored, and configure the alarm action to send a notification to the Amazon SNS topic created in step 2
  • Set up the integration between AWS Chatbot and Microsoft Teams
  • Configure AWS Chatbot to send the notifications from the Amazon SNS topic to the Microsoft Teams channel

Step 1: Setup AWS Distro for OpenTelemetry to collect metrics from Amazon EKS cluster

In this step, you will deploy AWS Distro for OpenTelemetry collector to your Amazon EKS cluster. Amazon EKS supports AWS Distro for OpenTelemetry operator as an add-on, which simplifies the installation and management of the collector.

  • Follow the installation steps in the documentation to set up the ADOT add-on in your EKS cluster.
  • After the ADOT operator is running in your cluster, configure the ADOT collector to ingest metrics into Amazon CloudWatch by following the steps provided here.
  • Create an IAM role for the collector's service account with permissions to write to Amazon CloudWatch, for example with the eksctl command below.
  • Configure the role created above for use with the ADOT add-on in the AWS console.

eksctl create iamserviceaccount \
  --name adot-collector \
  --namespace default \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --attach-policy-arn arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve \
  --override-existing-serviceaccounts

Once the setup is completed, you should be able to see the telemetry data in Amazon CloudWatch metrics (under Custom namespaces section).
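One way to confirm that the metrics are arriving, without opening the console, is to list them through the AWS SDK. The following is a minimal sketch, assuming Python with boto3 installed and credentials configured. The helper name find_collector_metrics is hypothetical, the namespace and metric name are the ones this post uses in Step 3, and only the first page of results is read.

```python
def find_collector_metrics(cloudwatch=None,
                           namespace="ContainerInsights/Prometheus",
                           metric_name="otelcol_process_memory_rss"):
    """Return the dimension sets under which the metric has been published."""
    if cloudwatch is None:
        # Assumption: boto3 is installed and AWS credentials are configured.
        import boto3
        cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.list_metrics(Namespace=namespace, MetricName=metric_name)
    # Flatten each metric's dimension list into a plain dict for readability.
    return [
        {dim["Name"]: dim["Value"] for dim in metric["Dimensions"]}
        for metric in response["Metrics"]
    ]
```

Calling find_collector_metrics() with no arguments queries your default region. An empty list suggests the collector has not published the metric yet, or that it was published under a different namespace.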

Step 2: Create an Amazon SNS Topic for Customized Amazon CloudWatch Alarm Notification

Amazon CloudWatch can send notifications to the Amazon SNS topic you will create. This Amazon SNS topic will be the source for the notifications sent to your Microsoft Teams channel.

You can use the AWS Management Console or the AWS SDK to create a topic. Follow the steps in the documentation to create a topic, and choose Standard for the topic type. Note down the ARN of the SNS topic, as you will need it in the next step.
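As a sketch of the SDK route, assuming Python with boto3, the topic can be created with a single call. The helper name create_alert_topic and the topic name eks-adot-alerts are hypothetical. A topic created without FIFO attributes is a Standard topic.

```python
def create_alert_topic(topic_name, sns=None):
    """Create (or look up) a Standard SNS topic and return its ARN."""
    if sns is None:
        # Assumption: boto3 is installed and AWS credentials are configured.
        import boto3
        sns = boto3.client("sns")
    # create_topic returns the existing ARN when a topic with this name
    # and the same attributes already exists.
    response = sns.create_topic(Name=topic_name)
    return response["TopicArn"]
```

For example, topic_arn = create_alert_topic("eks-adot-alerts") returns the ARN to note down for Step 3.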

Step 3: Create an Amazon CloudWatch alarm and dashboard

Create a CloudWatch alarm based on your monitoring requirements. For this walk-through, we will monitor the metric otelcol_process_memory_rss and send an alert when the collector's memory usage crosses a threshold.

The otelcol_process_memory_rss metric represents the Resident Set Size (RSS) memory usage of the OpenTelemetry Collector (often referred to as “otelcol”). The RSS memory is a measure of the portion of a process’s memory that is held in RAM and is actively used by the process at a given time. This metric helps to monitor the memory usage of the OpenTelemetry Collector. If this metric starts to increase significantly, it may indicate a memory leak or increased resource usage, which could impact the collector’s performance and the overall system. It’s an important metric to watch to ensure the efficient and reliable operation of the OpenTelemetry Collector in your monitoring and observability stack.

Create Amazon CloudWatch Alarm

  • Assuming you completed the instructions in Step 1, you should see the metric otelcol_process_memory_rss in Amazon CloudWatch Metrics (under Custom namespaces → ContainerInsights/Prometheus → ClusterName).
  • Follow the instructions here to configure an Amazon CloudWatch alarm.
  • Set the Condition as otelcol_process_memory_rss greater than a desired threshold level. Under Notification, select the Amazon SNS topic that you created in Step 2.
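The same alarm can be sketched with the AWS SDK. This is an illustration under assumptions, not the exact configuration shown in the figure: the helper name build_alarm_params, the Average statistic, the 60-second period, the single evaluation period and the ClusterName-only dimension set are all assumed, and the dimensions must match what the collector actually publishes in your account.

```python
def build_alarm_params(alarm_name, cluster_name, topic_arn, threshold_bytes):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "ContainerInsights/Prometheus",
        "MetricName": "otelcol_process_memory_rss",
        # Assumption: the metric is published with a ClusterName dimension only.
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        # Notify the SNS topic from Step 2 when the alarm enters ALARM state.
        "AlarmActions": [topic_arn],
    }


def create_alarm(params, cloudwatch=None):
    """Create or update the alarm described by params."""
    if cloudwatch is None:
        # Assumption: boto3 is installed and AWS credentials are configured.
        import boto3
        cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**params)
```

For example, build_alarm_params("eks-adot", "my-cluster", topic_arn, 200 * 1024 * 1024) describes an alarm that fires when the collector's resident memory averages above 200 MiB over one minute.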

Figure 2: CloudWatch Alarm configured for metric otelcol_process_memory_rss

Setup Amazon CloudWatch Dashboard

Next, create an Amazon CloudWatch dashboard with customized views of the relevant metrics. You can access this dashboard from the Microsoft Teams channel for troubleshooting.

  • Follow the instructions in the documentation to set up an Amazon CloudWatch dashboard.
  • Add widgets to the dashboard to monitor relevant metrics like otelcol_process_memory_rss, otelcol_process_runtime_heap_alloc_bytes and otelcol_process_cpu_seconds.
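A dashboard is defined by a JSON body, so the three widgets can also be generated in code. The sketch below assumes Python, one line-graph widget per metric stacked vertically, and a ClusterName dimension. The helper name build_dashboard_body and the layout values are hypothetical.

```python
import json

COLLECTOR_METRICS = [
    "otelcol_process_memory_rss",
    "otelcol_process_runtime_heap_alloc_bytes",
    "otelcol_process_cpu_seconds",
]


def build_dashboard_body(cluster_name, region,
                         namespace="ContainerInsights/Prometheus"):
    """Return a CloudWatch dashboard body (JSON string) with one widget per metric."""
    widgets = []
    for index, metric_name in enumerate(COLLECTOR_METRICS):
        widgets.append({
            "type": "metric",
            # Stack the widgets vertically, each 12 columns wide and 6 rows tall.
            "x": 0, "y": index * 6, "width": 12, "height": 6,
            "properties": {
                "title": metric_name,
                "region": region,
                "stat": "Average",
                "period": 60,
                # Assumption: metrics carry a ClusterName dimension only.
                "metrics": [[namespace, metric_name, "ClusterName", cluster_name]],
            },
        })
    return json.dumps({"widgets": widgets})
```

The resulting string can be passed as DashboardBody to the CloudWatch put_dashboard call, together with a DashboardName of your choice.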

Figure 3: CloudWatch Alarm configured for metric otelcol_process_memory_rss

Step 4: Configure Integration between AWS Chatbot and Microsoft Teams

This is a two-step process:

First, follow the instructions here to create a team in Microsoft Teams, and create a public channel by following the steps here. The notifications from the Amazon CloudWatch alarm will be sent to this channel. Choose the options menu (three dots) next to the channel name and choose “Get link to channel” as shown in Figure 4. Note down the channel URL, as you will need it to configure AWS Chatbot in the next step.

Figure 4: Get link to a channel in Microsoft Teams

Next, add the AWS Chatbot app to the team you created earlier in this step by following the instructions here. Administrator privileges are required to perform this action. Search for “aws” in the Apps search bar and add the app to the channel you created earlier, as illustrated below.

Figure 5: Add AWS App to a team in Microsoft Teams

Step 5: Register Microsoft Teams Channel in AWS Chatbot

Follow the instructions here to create and configure a Microsoft Teams client channel in AWS Chatbot. A few considerations while configuring the client channel:

  • Use the Microsoft Teams channel URL from Step 4 when configuring the client. The channel needs to be publicly accessible.
  • For Role setting, choose Channel role. AWS Chatbot will assume the configured role to run the tasks in the channel.
  • For Channel role, choose Create new role. If you want to use an existing role instead, choose Use an existing role. To use an existing IAM role, you will need to modify it for use with AWS Chatbot. If you want your users to be able to use Amazon Q, attach the AmazonQFullAccess policy. For more information, see Configuring an IAM Role for AWS Chatbot.
  • For Role name, enter a name. Valid characters: a-z, A-Z, 0-9, and +=,.@-_
  • For Role policy template, choose the template you wish to use. For more information, see Role setting.
  • After configuring the role permissions, choose a Channel guardrails policy. A channel guardrail policy limits the actions that your channel members can take. The actions that channel members are allowed to take are based on the intersection of the guardrails and the IAM user or Channel role permissions. The guardrail policy is applied to both the Channel IAM role and User Roles role settings at runtime. For more information, see Channel guardrail policies.
  • Under Notifications section, choose the Amazon SNS topic that you created in Step 2.

Once the channel is configured, you can test the integration between AWS Chatbot and Microsoft Teams by sending a test message from the AWS Chatbot console. Follow the instructions for testing.

Testing

Now that the setup is complete, you can test the end-to-end flow.

  • Deploy a sample application to your Amazon EKS cluster, following the instructions in the documentation (steps 1 and 2).
  • Once the deployment is complete, scale the deployment to 100 replicas.

kubectl scale deployment -n <your_namespace> <your-deployment-name> --replicas=100

This should increase the memory usage of the OpenTelemetry collector significantly and trigger the Amazon CloudWatch alarm.

Within a few minutes, you should receive an Amazon CloudWatch alarm notification in your Teams channel.
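If the notification does not arrive, it helps to check whether the alarm itself changed state. The following is a minimal sketch, assuming Python with boto3; the helper name alarm_state is hypothetical, and eks-adot is the alarm name that appears in the event pattern later in this post.

```python
def alarm_state(alarm_name, cloudwatch=None):
    """Return the alarm's current state, or None if no such metric alarm exists."""
    if cloudwatch is None:
        # Assumption: boto3 is installed and AWS credentials are configured.
        import boto3
        cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    alarms = response["MetricAlarms"]
    # StateValue is one of "OK", "ALARM" or "INSUFFICIENT_DATA".
    return alarms[0]["StateValue"] if alarms else None
```

A result of "ALARM" from alarm_state("eks-adot") with no message in Teams points at the SNS topic or the AWS Chatbot channel configuration rather than at the metric or the threshold.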

  • Navigate to the Teams channel that you configured to review the Amazon CloudWatch alarm notification.
  • Locate the CloudWatch notification for the alarm. Select the See more link to expand the CloudWatch alarm notification.

Below is an example of the notification received in the Teams channel. The alarm notification displays the EKS cluster, pod names, alarm state, metric details and a metric graph.

Figure 6: CloudWatch Alarm Notification in Teams Channel

Customize AWS Chatbot Notification (Optional)

AWS Chatbot now lets you customize messages for your application events or customize default AWS service notifications using custom notifications. With custom notifications, you can include relevant contextual information in your chat channels, add URLs, include remediation steps, tag team members and more. This not only boosts your team’s visibility but also enables faster responses.

Architecture

To send custom notifications using AWS Chatbot, it is essential to align them with the event schema provided here. This involves having a data source that sends the event with all the necessary details adhering to the schema. In this architecture, Amazon EventBridge handles dispatching tailored content to an Amazon SNS Topic.

We will explore an architecture that sends custom notifications from the Amazon EKS cluster to a Microsoft Teams channel using Amazon CloudWatch, Amazon EventBridge, Amazon SNS and AWS Chatbot.

Figure 7: Customizable notification with AWS Chatbot using Amazon EventBridge

The process has three parts:

  • Decouple the SNS topic: Detach the SNS topic from the CloudWatch alarm configured in Step 3. This separation allows more flexibility to customize notifications.
  • Set up an Amazon EventBridge rule: Create an Amazon EventBridge rule to capture CloudWatch Alarm State Change events. The rule sends a customized notification to the Amazon SNS topic for each matching event.
  • Integrate the Microsoft Teams channel with AWS Chatbot: The Amazon SNS topic delivers the customized notifications to the Microsoft Teams channel through the AWS Chatbot configuration set up in Steps 4 and 5.

Follow these steps to set up the EventBridge rule:

  • Open the EventBridge console.
  • In the navigation pane, choose Rules.
  • Choose Create rule.
  • Enter a Name and, optionally, a Description for the rule.
  • For Event bus, select AWS default event bus.
  • For Rule type, choose Rule with an event pattern.
  • Choose Next.
  • For Creation method, select Custom pattern (JSON editor).
  • For Event pattern, copy the following event pattern and customize it as needed (for example, by alarm ARN, operation, or state). Replace the ARN with the ARN of the CloudWatch alarm that you set up in Step 3.

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "resources": ["arn:aws:cloudwatch:us-east-1:123456789123:alarm:eks-adot"]
}

  • Choose Next.
  • Under Select target(s), select the same SNS topic (created in Step 2) as the target, and then choose Next.
  • Under Additional settings, for Configure target input, select Constant (JSON text). Set the target input to custom JSON text that adheres to the custom notification event schema. Refer to the documentation here for sample custom notifications.
  • Review the rule configuration, and then choose Create rule.
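To see which events the pattern above selects, the following sketch re-implements a simplified version of the matching in Python. It is an illustration only: it handles exact-value matching and nested objects, not EventBridge content filters such as prefix or numeric matching, and the matches helper and ALARM_ONLY_PATTERN are hypothetical. It also shows one way to narrow the rule by state, since the pattern as written matches every state change of the alarm, including the return to OK.

```python
ALARM_ARN = "arn:aws:cloudwatch:us-east-1:123456789123:alarm:eks-adot"

# The event pattern from the steps above.
BASE_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "resources": [ALARM_ARN],
}

# Hypothetical refinement: only match transitions into the ALARM state.
ALARM_ONLY_PATTERN = dict(BASE_PATTERN, detail={"state": {"value": ["ALARM"]}})


def matches(pattern, event):
    """Simplified matcher: every pattern key must match the event."""
    for key, allowed in pattern.items():
        if key not in event:
            return False
        value = event[key]
        if isinstance(allowed, dict):
            # Nested pattern: descend into the nested event object.
            if not isinstance(value, dict) or not matches(allowed, value):
                return False
        else:
            # A list in the pattern means "any of these values". If the event
            # field is itself a list, one matching element is enough.
            candidates = value if isinstance(value, list) else [value]
            if not any(candidate in allowed for candidate in candidates):
                return False
    return True


recovery_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "resources": [ALARM_ARN],
    "detail": {"alarmName": "eks-adot", "state": {"value": "OK"}},
}

print(matches(BASE_PATTERN, recovery_event))        # the base pattern matches a recovery
print(matches(ALARM_ONLY_PATTERN, recovery_event))  # the narrowed pattern does not
```

Because the target input is a constant, a rule using the base pattern sends the same "high memory" message when the alarm recovers. Adding the state filter avoids that.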

Here is the custom event configured as the target input in the steps above.

{
  "version": "1.0",
  "source": "custom",
  "id": "c-weihfjdsf",
  "content": {
    "textType": "client-markdown",
    "title": ":fire: EKS ADOT Pods are consuming high memory!",
    "description": ":warning: EKS Pods are consuming high memory. OnCall team has been paged.",
    "nextSteps": [
      "Refer to <http://www.example.com|*diagnosis* runbook>",
      "@googlie: Page Jane if error persists over 30 minutes",
      "Check if instance i-04d231f25c18592ea needs to be scaled up"
    ],
    "keywords": [
      "EKS",
      "ADOT",
      "Critical",
      "SRE"
    ]
  },
  "metadata": {
    "threadId": "EKSADOTCloudWatch",
    "summary": "EKS Pods running on high memory",
    "eventType": "EKSADOTEvent",
    "relatedResources": [
      "i-04d231f25c18592ea",
      "i-0c8c31affab6078fb"
    ],
    "additionalContext": {
      "priority": "critical"
    }
  }
}
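When the notification text changes often, it can be easier to generate the JSON than to edit it by hand. The sketch below, in Python, builds a notification with the same top-level fields as the example above and wraps it as an EventBridge target with constant input. The helper names and the target Id are hypothetical, and treating description as the only required content field is an assumption to verify against the custom notification event schema.

```python
import json


def build_custom_notification(title, description, next_steps=(),
                              keywords=(), thread_id=None):
    """Build an AWS Chatbot custom notification as a Python dict."""
    if not description:
        # Assumption: the schema requires content.description.
        raise ValueError("description is required")
    content = {
        "textType": "client-markdown",
        "title": title,
        "description": description,
    }
    if next_steps:
        content["nextSteps"] = list(next_steps)
    if keywords:
        content["keywords"] = list(keywords)
    notification = {"version": "1.0", "source": "custom", "content": content}
    if thread_id:
        notification["metadata"] = {"threadId": thread_id}
    return notification


def build_sns_target(topic_arn, notification, target_id="chatbot-sns"):
    """Describe an EventBridge target that sends the notification as constant input."""
    return {
        "Id": target_id,
        "Arn": topic_arn,
        # EventBridge expects the constant input as a JSON string.
        "Input": json.dumps(notification),
    }
```

The dict returned by build_sns_target can be passed in the Targets list of the EventBridge put_targets call for the rule created above.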

Now, when the alarm state changes to ALARM, the custom event notification is sent to the Microsoft Teams channel.

The custom notification below now features a fire emoji. AWS Chatbot lets you include compatible emojis in your custom notifications, and there are no additional costs for using custom notifications.

Figure 8: Customized event Notification to the Teams Channel using AWS Chatbot

Troubleshoot Findings

AWS Chatbot allows customers to receive notifications and issue AWS CLI commands in the Teams channel, facilitating collaboration and decreasing response times.

Choose the List Dashboards button in the notification message to get the list of Amazon CloudWatch dashboards in the selected Region.

Figure 9: View Amazon CloudWatch Dashboards from the Teams Channel

Choose the Show button next to an Amazon CloudWatch dashboard to display its widgets and explore the metrics in the channel.

Figure 10: View Amazon CloudWatch Dashboard Widgets from Teams Channel

AWS Chatbot allows you to retrieve Amazon CloudWatch dashboards and run Amazon CloudWatch Logs Insights queries directly from your Microsoft Teams channel using the Query logs button in the notification.

Figure 11: CloudWatch Alarm Notification in Teams Channel

Choose the Query logs button to open a Logs Insights query window, where you can query the relevant log groups directly from your Microsoft Teams channel.

Figure 12: Search and analyze CloudWatch logs directly from Teams Channel

After you run the query, CloudWatch retrieves results from the selected log group, as illustrated below. You can view the results in your browser or download them as a CSV file. Ensure that the AWSChatbot-eksops role has the required permissions to fetch insights from the chosen log group.

Figure 13: Search and analyze CloudWatch logs directly from Teams Channel

Issue Remediation

AWS Chatbot also lets you quickly run common DevOps tasks by using the recommended and customer-defined action buttons on notifications. With these custom action buttons, you can diagnose issues, follow pre-defined runbooks and remediate issues by choosing a pre-configured button.

The custom action target can be a CLI command, a Lambda function or an AWS Systems Manager automation document. For example, you can specify a custom action button name and an action that reboots an EC2 instance or invokes a Lambda function to approve or deny an AWS CodePipeline action.

Let's create a custom action button by selecting the ellipsis button next to the “Query logs” button on the event notification. The custom action button is configured with the Lambda action type, so that it can perform escalation or remediation actions.
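The Lambda function behind such a button can be very small. The sketch below, in Python, reboots the EC2 instances named in the invocation payload, following the reboot example mentioned above. The payload shape (an instance_ids list) is hypothetical: it is whatever you configure the custom action to send, not a format defined by AWS Chatbot.

```python
def reboot_instances(event, ec2=None):
    """Reboot the instances listed under the hypothetical 'instance_ids' key."""
    instance_ids = event.get("instance_ids", [])
    if not instance_ids:
        # Nothing to do; report it instead of calling the API with an empty list.
        return {"status": "skipped", "reason": "no instance_ids supplied"}
    if ec2 is None:
        # boto3 is available in the AWS Lambda Python runtime.
        import boto3
        ec2 = boto3.client("ec2")
    ec2.reboot_instances(InstanceIds=instance_ids)
    return {"status": "rebooted", "instance_ids": instance_ids}


def lambda_handler(event, context):
    """Lambda entry point invoked by the custom action button."""
    return reboot_instances(event)
```

The function's execution role needs permission for ec2:RebootInstances, and the channel role in AWS Chatbot needs permission to invoke the function.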

Figure 14: Create custom action button “Execute Runbook”

Figure 15: Create custom action button select option to execute lambda function

Figure 16: Select the custom action button display criteria

When the alarm is in the ALARM state, the notification also displays the custom action button “Execute Runbook”, as shown below.

Figure 17: Custom action button displayed along with alarm notification

You can also respond to operational events by issuing Amazon EKS CLI commands directly from your Microsoft Teams channel. This way, you can retrieve additional telemetry data and resource information, execute runbooks to remediate issues, or even open a support case, as illustrated below using the @aws support create-case command.

Figure 18: Open a support case in response to a CloudWatch Alarm notification

In addition, you can use Amazon Q, a generative AI-powered assistant designed for work that can be tailored to your business, directly from the Microsoft Teams channel to get answers to your questions. Note that the AWSChatbot-eks role must hold the appropriate permissions to use the Amazon Q features.

Figure 19: Open a support case in response to a CloudWatch Alarm notification

Cleanup

To clean up the resources created in this post, follow the instructions below.

  • Delete the EKS Cluster
    • Choose Elastic Kubernetes Service on AWS Console.
    • Select the EKS cluster you created and delete all of its managed node groups.
    • Then choose the cluster and choose Delete.
  • Delete Amazon SNS Topic
    • Choose Simple Notification Service (SNS) on AWS Console.
    • In the navigation pane, select the topic you created in Step 2 and choose Delete.
  • Delete Amazon EventBridge Rule
    • Choose EventBridge on AWS Console.
    • In the left navigation pane, select Rules under the Buses section.
    • Select the rule you created for custom notifications and choose delete.
  • Delete Amazon CloudWatch Dashboard
    • Choose CloudWatch service on AWS Console.
    • In the navigation pane, select the Dashboard you created in Step 3 from Custom dashboards and choose delete.
  • Delete Amazon CloudWatch Alarm
    • Choose CloudWatch service on AWS Console.
    • In the navigation pane, select All alarms under Alarms.
    • Select the CloudWatch alarm you created in Step 3.
    • Click on Actions and choose delete.
  • Delete the Microsoft Teams Channel
    • Delete the Microsoft Teams Channel you created for Testing.
  • Delete AWS Chatbot Client Configuration
    • Choose Chatbot service on AWS Console.
    • Select Microsoft Teams, under Configured Clients on the left pane.
    • Select the configuration you created in Step 5 from Configured channels and choose delete.
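The AWS-side resources from Steps 2 and 3 and the EventBridge rule can also be removed with the SDK. The following is a minimal sketch, assuming Python with boto3; the cleanup helper and its arguments are hypothetical, and it leaves the EKS cluster, the Teams channel and the AWS Chatbot configuration to the console steps above.

```python
def cleanup(alarm_name, dashboard_name, rule_name, target_id, topic_arn,
            cloudwatch=None, events=None, sns=None):
    """Delete the alarm, dashboard, EventBridge rule and SNS topic."""
    if cloudwatch is None or events is None or sns is None:
        # Assumption: boto3 is installed and AWS credentials are configured.
        import boto3
        cloudwatch = cloudwatch or boto3.client("cloudwatch")
        events = events or boto3.client("events")
        sns = sns or boto3.client("sns")

    cloudwatch.delete_alarms(AlarmNames=[alarm_name])
    cloudwatch.delete_dashboards(DashboardNames=[dashboard_name])
    # A rule cannot be deleted while it still has targets, so remove them first.
    events.remove_targets(Rule=rule_name, Ids=[target_id])
    events.delete_rule(Name=rule_name)
    # Delete the topic last, once nothing publishes to it any more.
    sns.delete_topic(TopicArn=topic_arn)
```

The target_id is the Id given to the SNS target when it was attached to the rule. For rules created in the console, it can be looked up with the EventBridge list_targets_by_rule call.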

Conclusion

In this blog post, we showed you how to use AWS Chatbot to receive notifications in Microsoft Teams channels so that you can monitor and respond to operational events in your Amazon EKS cluster. Additionally, we showed you how to customize AWS Chatbot with custom notifications, custom actions and command aliases. This ChatOps approach improves visibility and empowers you to stay in control of your AWS resources by providing real-time notifications and robust monitoring capabilities.

Anil Chinnam

Anil Chinnam is a Solutions Architect with the Healthcare Life Sciences (HCLS) team at AWS. Anil brings to his role over 20 years of hands-on engineering and architecture experience. He enjoys working with customers to design highly scalable, innovative, and secure cloud solutions.

Hareesh Iyer

Hareesh Iyer is a Sr. Solutions Architect at AWS. He helps customers build scalable, secure, resilient and cost-efficient architectures on AWS. He is passionate about application modernization, containers and DevOps.