AWS Cloud Operations & Migrations Blog

Unlocking Insights: Turning Application Logs into Actionable Metrics

Modern software development teams understand the importance of observability as a critical aspect of building reliable and resilient applications. By implementing observability practices, teams can proactively identify issues, uncover performance bottlenecks, and enhance system reliability. However, observability is a relatively recent practice and still lacks industry-wide adoption.

As organizations standardize on containers, they often lift and shift legacy applications into containers. Such legacy applications provide insights mostly through logs. Unlike modern applications, they don’t emit metrics; as a result, teams find it challenging to observe these applications and extract meaningful insights from their logs.

In this post, we demonstrate how to improve the observability of applications that don’t generate metrics. We further demonstrate how you can apply the same capability to control plane logs, such as audit logs, and receive timely notifications when errors arise. We use Amazon CloudWatch metric filters to create metrics from log events. This post uses Amazon EKS to run the application, but you can use the same approach for applications running on Amazon ECS, Amazon EC2, or AWS Lambda, as well as applications running in your data centers using IAM Roles Anywhere. Finally, we dive into enhancing EKS security with automated anomaly detection and alerting using CloudWatch anomaly detection.

Solution Overview

For this demonstration, we use Amazon EKS Blueprints to create an Amazon EKS cluster with the AWS for Fluent Bit agent, which aggregates application logs produced in Common Log Format and ingests them into Amazon CloudWatch. The sample application is designed to inject failures at random. Using CloudWatch metric filters, we match terms in the application’s logs to convert log data into metrics. Next, we create CloudWatch alarms to detect an increased error rate. Whenever the sample application’s error rate breaches the set threshold, a notification is sent to a Slack channel.

Figure 1: Solution Architecture for Turning Application Logs into Actionable Metrics

Here is a functional flow of this solution:

  1. The AWS for Fluent Bit agent collects and processes application logs
  2. The Fluent Bit agent forwards the logs to Amazon CloudWatch Logs, where they are stored in log groups
  3. When activated, Amazon EKS control plane logs are delivered as vended logs to Amazon CloudWatch Logs
  4. Amazon CloudWatch pattern analysis surfaces emerging trends and identifies frequently occurring or high-cost log lines
  5. Amazon CloudWatch metric filters extract metric data points based on filter expressions and create metric time series
  6. A CloudWatch alarm sends a notification to Amazon SNS when a threshold is breached
  7. Amazon SNS invokes an AWS Lambda function, which in turn sends the CloudWatch alarm notification to Slack

Prerequisites

Install the following utilities on a Linux-based host machine, which can be an Amazon EC2 instance, an AWS Cloud9 instance, or a local machine with access to your AWS account:

  • AWS CLI version 2 or later to interact with AWS services using CLI commands
  • Node.js (v16.0.0 or later) and npm (8.10.0 or later)
  • AWS CDK v2.114.1 or later to build and deploy cloud infrastructure and Kubernetes resources programmatically
  • AWS SAM CLI to deploy the AWS Lambda function
  • Kubectl to communicate with the Kubernetes API server
  • Git to clone the required source repository from GitHub

Let’s start by setting environment variables:

export CAP_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export CAP_CLUSTER_REGION="us-west-2"
export AWS_REGION=$CAP_CLUSTER_REGION
export CAP_CLUSTER_NAME="demo-cluster"
export CAP_FUNCTION_NAME="cloudwatch-to-slack"

Clone the sample repository which contains the code for our solution:

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd ./containers-blog-maelstrom/aws-cdk-eks-app-alarms-to-slack

Bootstrap the Environment

As the solution uses Amazon EKS CDK Blueprints to provision an Amazon EKS cluster, you must bootstrap your environment in the required AWS Region of your AWS account.

Bootstrap your environment and install all Node.js dependencies:

bash ./bootstrap-env.sh
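
The bootstrap-env.sh script wraps the standard CDK setup steps. If you prefer to run them by hand, they amount to roughly the following (a sketch; the script in the repository is authoritative):

npm install                                                    # install the project's Node.js dependencies
cdk bootstrap "aws://${CAP_ACCOUNT_ID}/${CAP_CLUSTER_REGION}"  # provision the CDK bootstrap stack in the target account and Region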

Create EKS cluster

Once you’ve bootstrapped the environment, create the cluster:

cdk deploy "*" --require-approval never

Deployment will take approximately 20-30 minutes to complete. Upon completion, you will have a fully functioning EKS cluster deployed in your account.

Figure 2: Snapshot of output from cdk deployment

Please copy and run the aws eks update-kubeconfig command as shown in the screenshot to gain access to your Amazon EKS cluster using kubectl.
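
The command printed in the CDK output is authoritative and may include an additional --role-arn parameter; in its simplest form it looks like this:

aws eks update-kubeconfig --region ${CAP_CLUSTER_REGION} --name ${CAP_CLUSTER_NAME}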

Create an Incoming Webhook in Slack

Slack allows you to send messages from other applications using incoming webhooks. Refer to sending messages using incoming webhooks for more details. We use an incoming webhook to send notifications to a Slack channel whenever an alarm is triggered.

Follow these steps to configure incoming webhook in Slack:

  1. Create or pick a Slack channel to receive CloudWatch alarm notifications.
  2. Go to https://<your-team-domain>.slack.com/services/new, search for Incoming WebHooks, select it, and click Add to Slack.
  3. Under Post to Channel, choose the Slack channel where messages will be sent and click Add Incoming WebHooks Integration.
  4. Copy the webhook URL from the setup instructions and save it. You’ll use this URL in the Lambda function; you can verify it with the quick test shown below.
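
Before wiring the webhook into Lambda, you can optionally confirm it works with a quick test. The SLACK_WEBHOOK_URL variable below is just a convenience for this post; replace the placeholder with the URL you copied:

export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder; paste your webhook URL
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test message: Slack incoming webhook is working"}' \
  "${SLACK_WEBHOOK_URL}"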

Create a KMS Key

To strengthen the security posture, we encrypt the incoming webhook URL using an AWS KMS key. The deploy-sam-app.sh script creates a KMS key with the alias alias/${CAP_FUNCTION_NAME}-key.
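
The script handles key creation for you. For reference, the equivalent CLI calls look roughly like the following sketch (the key description and policy used by the script may differ):

# Create a customer managed key and attach the alias the solution expects
KEY_ID=$(aws kms create-key --region ${CAP_CLUSTER_REGION} \
  --description "Encrypts the Slack incoming webhook URL" \
  --query 'KeyMetadata.KeyId' --output text)
aws kms create-alias --region ${CAP_CLUSTER_REGION} \
  --alias-name "alias/${CAP_FUNCTION_NAME}-key" --target-key-id "${KEY_ID}"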

Create an AWS Lambda function

Next, create a Lambda function to send CloudWatch alarm notifications to Slack. The deploy-sam-app.sh script uses the AWS Serverless Application Model (SAM) to create:

  1. An Amazon SNS topic to which CloudWatch alarms publish notifications
  2. A Lambda execution role that grants the function basic execution permissions and permission to decrypt with the KMS key
  3. A Lambda function that sends notifications to Slack using the incoming webhook URL
  4. A Lambda permission that allows SNS to invoke the function

The deploy-sam-app.sh script takes the following two input values to deploy the SAM template:

  1. The Slack incoming webhook URL you created previously
  2. The Slack channel name (selected previously) to which notifications are sent

The deploy-sam-app.sh script encrypts the Slack incoming webhook URL client-side using the KMS key with a specific encryption context. The Lambda function decrypts it using the same encryption context, and the Lambda execution role is granted fine-grained access to use the KMS key only with that encryption context.
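
Conceptually, the client-side encryption performed by the script looks like the following sketch, reusing the SLACK_WEBHOOK_URL variable from the webhook test above. The encryption context key name LambdaFunctionName is an assumption for illustration; the script’s actual context may differ:

# AWS CLI v2 accepts base64-encoded input for binary (blob) parameters such as --plaintext
aws kms encrypt --region ${CAP_CLUSTER_REGION} \
  --key-id "alias/${CAP_FUNCTION_NAME}-key" \
  --plaintext "$(printf '%s' "${SLACK_WEBHOOK_URL}" | base64 -w0)" \
  --encryption-context "LambdaFunctionName=${CAP_FUNCTION_NAME}" \
  --query 'CiphertextBlob' --output text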

Run the following command to deploy the SAM template:

bash ./deploy-sam-app.sh

Test Lambda function

Let’s validate the Lambda function by pushing a test event using the payload available at templates/test-event.json:

aws lambda invoke --region ${CAP_CLUSTER_REGION} \
--function-name ${CAP_FUNCTION_NAME} \
--log-type Tail \
--query LogResult --output text \
--payload $(cat templates/test-event.json | base64 | tr -d '\n') - \
| base64 -d

A successful execution posts a test message to the Slack channel, as shown in the following image, and produces command output similar to this:

Figure 3: Incoming Webhook on Slack

[INFO]  2023-04-12T01:04:49.880Z        81699331-10e9-416f-b8ae-4fb7f44f1d29    Message posted to httphandler-cloudwatch-alarms
END RequestId: 81699331-10e9-416f-b8ae-4fb7f44f1d29
REPORT RequestId: 81699331-10e9-416f-b8ae-4fb7f44f1d29  Duration: 587.33 ms     Billed Duration: 588 ms Memory Size: 128 MB     Max Memory Used: 68 MB  Init Duration: 448.31 ms

AWS for Fluent Bit Agent

The AWS for Fluent Bit agent aggregates application logs and forwards them to CloudWatch. To forward logs to CloudWatch Logs, you need to provide an IAM role that grants the required permissions. Amazon EKS CDK Blueprints provides options to configure the AWS for Fluent Bit add-on as well as to create IAM policies with the required permissions.

Verify that Fluent-Bit is running in your cluster:

kubectl get po -n kube-system \
-l app.kubernetes.io/name=aws-for-fluent-bit
NAME                                        READY   STATUS    RESTARTS   AGE
blueprints-addon-aws-for-fluent-bit-9db6l   1/1     Running   0          30m

Deploy Sample Application

Next, let’s deploy the sample application HTTPHandler, an HTTP server that injects errors at random. We’ll also deploy a curl container to generate traffic.

Deploy the sample application httphandler:

kubectl apply -f ./templates/sample-app.yaml

Let’s send a few requests to the sample application to see the logs it produces:

kubectl exec -n sample-app -it \
curl -- sh -c 'for i in $(seq 1 15); do curl http://httphandler.sample-app.svc.cluster.local; sleep 1; echo $i; done'

Then check the logs and observe the response codes. For every HTTP request, the sample application randomly injects errors and logs the corresponding response code and response size:

kubectl -n sample-app logs -l app=httphandler
2023/04/14 21:40:35 Listening on :8080...
192.168.99.44 - - [14/Apr/2023:21:40:44 +0000] "GET / HTTP/1.1" 500 22
192.168.99.44 - - [14/Apr/2023:21:40:45 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:46 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:47 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:48 +0000] "GET / HTTP/1.1" 500 22
192.168.99.44 - - [14/Apr/2023:21:40:49 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:50 +0000] "GET / HTTP/1.1" 200 13

Create CloudWatch Metrics

The AWS for Fluent Bit agent is configured to forward application logs with the log key “log” to a CloudWatch log group named /aws/eks/fluentbit-cloudwatch/<CAP_CLUSTER_NAME>/workload/<NAMESPACE>. Hence, the sample application (httphandler) logs are forwarded to the log group /aws/eks/fluentbit-cloudwatch/demo-cluster/workload/sample-app, with one log stream per pod.
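
You can confirm that the workload log group exists now that the sample application is producing logs:

aws logs describe-log-groups --region ${CAP_CLUSTER_REGION} \
  --log-group-name-prefix "/aws/eks/fluentbit-cloudwatch/${CAP_CLUSTER_NAME}/workload" \
  --query 'logGroups[].logGroupName' --output table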

You can convert log data into numerical CloudWatch metrics using metric filters, which let you configure rules to extract metric data from log messages. Logs from httphandler are parsed using the filter pattern [host, logName, user, timestamp, request, statusCode>200, size], which the put-metric-filter command uses to create a metric filter with a dimension on the response status code.

Create a metric filter:

aws logs put-metric-filter --region ${CAP_CLUSTER_REGION} \
--log-group-name /aws/eks/fluentbit-cloudwatch/${CAP_CLUSTER_NAME}/workload/sample-app \
--cli-input-json file://templates/sample-app-metric-filter.json
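
The JSON template bundles the filter name, pattern, and metric transformation. For illustration only, an equivalent call with inline parameters might look like the following sketch (the filter name is an assumption; the repository’s template is authoritative and may additionally define a dimension on the status code):

aws logs put-metric-filter --region ${CAP_CLUSTER_REGION} \
  --log-group-name /aws/eks/fluentbit-cloudwatch/${CAP_CLUSTER_NAME}/workload/sample-app \
  --filter-name sample-app-error-count \
  --filter-pattern '[host, logName, user, timestamp, request, statusCode>200, size]' \
  --metric-transformations metricName=response_count,metricNamespace=SampleAppMetrics,metricValue=1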

Send CloudWatch alarms to Slack

At this point, application logs are being forwarded to CloudWatch, and CloudWatch is generating metrics from those logs. Now we’d like to send a Slack notification to our SRE channel so that when errors exceed a set threshold, the team is notified immediately.

Create CloudWatch Alarms

Next, we’ll create a CloudWatch alarm on the metric and configure it to send notifications to the SNS topic. The following commands create a CloudWatch alarm that monitors responses with statusCode>200; when the number of errors exceeds 10 in the last 5 minutes, a notification is sent to the SNS topic.

SNS_TOPIC=$(aws cloudformation --region ${CAP_CLUSTER_REGION} describe-stacks --stack-name ${CAP_FUNCTION_NAME}-app --query 'Stacks[0].Outputs[?OutputKey==`CloudwatchToSlackTopicArn`].OutputValue' --output text)
aws cloudwatch put-metric-alarm --region ${CAP_CLUSTER_REGION} \
--alarm-actions ${SNS_TOPIC} \
--cli-input-json file://templates/sample-app-400-alarm.json
aws cloudwatch describe-alarms --region ${CAP_CLUSTER_REGION} \
--alarm-names "400 errors from sample app"

Generate traffic to the sample application httphandler using the following command, which in turn generates metrics. Run this in a separate terminal:

kubectl exec -n sample-app -it curl -- sh -c 'for i in $(seq 1 5000); do curl http://httphandler.sample-app.svc.cluster.local; sleep 1; echo $i; done'

Let this run for about 10 minutes and check the CloudWatch alarm status. If the threshold is breached, you will get a notification in your Slack channel.
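
You can also check the alarm state from the CLI while the traffic generator runs:

aws cloudwatch describe-alarms --region ${CAP_CLUSTER_REGION} \
  --alarm-names "400 errors from sample app" \
  --query 'MetricAlarms[0].StateValue' --output text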

Open Amazon CloudWatch in the AWS Management Console and navigate to Metrics → All metrics → SampleAppMetrics → Metrics with no dimensions → response_count to visualize the metrics CloudWatch creates from the application logs.

Figure 4: CloudWatch console showing sample app metrics

You can select Alarms from the Metrics page, or view the alarm we created for 400 errors by going to CloudWatch → Alarms → All alarms → 400 errors from sample app.

Figure 5: CloudWatch alarms page

Whenever the threshold is breached, you’ll get a notification in Slack:

Figure 6: Notifications on Slack

With this setup, we are now notified whenever our application experiences issues.

Enhancing Amazon EKS Security with automated anomaly detection and alerting

As cyber threats surge, customers are looking for quick ways to identify anonymous requests targeting their infrastructure and applications. Amazon CloudWatch anomaly detection can help by detecting anomalies in a metric using statistical and machine learning algorithms. These algorithms continuously analyze system and application metrics, determine normal baselines, and surface anomalies with minimal user intervention. CloudWatch anomaly detection is available for any AWS service metric or custom CloudWatch metric that has a discernible trend or pattern.

We recommend enabling EKS control plane logging to collect and analyze audit logs, which are essential for root cause analysis and attribution, such as ascribing a change to a particular user. Once collected, these logs can be used to detect anomalous behavior. On EKS, when control plane logging is turned on, audit logs are sent to Amazon CloudWatch Logs under the log group /aws/eks/<CAP_CLUSTER_NAME>/cluster. To activate CloudWatch anomaly detection for the control plane logs, select the control plane log group, open the Anomaly detection tab, and choose Create anomaly detector. Review options such as the evaluation frequency and filter patterns, then choose Activate anomaly detection. Note that the detector takes up to 24 hours to train and start detecting anomalies.

Figure 7: CloudWatch Anomaly Detection Configuration
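
If audit logging is not already enabled on your cluster (the CDK blueprint used here may already turn it on), you can enable it from the CLI; a minimal sketch:

aws eks update-cluster-config --region ${CAP_CLUSTER_REGION} \
  --name ${CAP_CLUSTER_NAME} \
  --logging '{"clusterLogging":[{"types":["audit","authenticator"],"enabled":true}]}'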

We can create a filter pattern for unauthorized access errors and a custom metric filter to detect anonymous requests. Using CloudWatch alarms, alerts can also be sent to the same Slack channel we created earlier:

aws logs put-metric-filter --region ${CAP_CLUSTER_REGION} \
--log-group-name /aws/eks/${CAP_CLUSTER_NAME}/cluster \
--cli-input-json file://templates/cluster-403-metric-filter.json
aws cloudwatch put-metric-alarm --region ${CAP_CLUSTER_REGION} \
--alarm-actions ${SNS_TOPIC} \
--cli-input-json file://templates/cluster-403-alarm.json
aws cloudwatch describe-alarms --region ${CAP_CLUSTER_REGION} \
--alarm-names "403 errors from Cluster API Server"

We can generate anonymous requests to the cluster using the EKS cluster endpoint. Get the cluster endpoint from kubectl config view, or retrieve the API server endpoint from the AWS console. From the terminal, execute the following commands to generate unauthorized errors against an unavailable path:

export CAP_CLUSTER_API_ENDPOINT=$(kubectl config view --minify | grep server | cut -f 2- -d":" | tr -d " ")
for i in `seq 1 10`; do curl -k $CAP_CLUSTER_API_ENDPOINT/anomaly; done
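
Alternatively, you can retrieve the same API server endpoint with the AWS CLI instead of reading it from your kubeconfig:

export CAP_CLUSTER_API_ENDPOINT=$(aws eks describe-cluster --region ${CAP_CLUSTER_REGION} \
  --name ${CAP_CLUSTER_NAME} --query 'cluster.endpoint' --output text)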

Amazon CloudWatch Logs Insights can be used to extract information from logs, identify patterns, and gain deeper insights into your applications and infrastructure. We can identify common patterns in the EKS control plane logs using Logs Insights, and the same filter patterns can be used to detect anomalies or suspicious events and to create alarms. For the unauthorized anonymous requests, we can find the pattern in Logs Insights.
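
For example, a query along the following lines surfaces recent anonymous requests that were denied. This is a sketch: the field names assume the standard Kubernetes audit log schema, and you can also run the same query string interactively in the Logs Insights console:

# Start the query against the control plane log group (GNU date is assumed, per the Linux prerequisites)
QUERY_ID=$(aws logs start-query --region ${CAP_CLUSTER_REGION} \
  --log-group-name /aws/eks/${CAP_CLUSTER_NAME}/cluster \
  --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp, user.username, responseStatus.code, requestURI | filter user.username = "system:anonymous" and responseStatus.code = 403 | sort @timestamp desc | limit 20' \
  --query 'queryId' --output text)
sleep 10   # give the query a few seconds to complete
aws logs get-query-results --region ${CAP_CLUSTER_REGION} --query-id ${QUERY_ID}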

Figure 8: CloudWatch Log Insights

When the threshold for anonymous requests is breached, you’ll get a notification in Slack from the alarm we created based on the filter pattern.

Figure 9: Anonymous request notifications on Slack

AWS Systems Manager Incident Manager engages the right responders promptly, tracks incident updates, and automates remediation actions. It reduces Mean Time To Recover (MTTR) through response plans, which define who responds, the automated mitigation actions, and the collaboration tools used for responder communication and notifications. See our blog posts Creating contacts, escalation plans, and response plans in AWS Systems Manager Incident Manager and AWS Systems Manager Incident Manager integration with Amazon CloudWatch to take full advantage of the CloudWatch alarms generated here and trigger incident-specific response plans for streamlined, automated incident response.

Figure 10: CloudWatch alarm configurations for 400 errors

Figure 11: AWS Systems Manager action for CloudWatch alarms

Cleanup

Run the cleanup.sh script to remove all resources deployed as part of this post:

bash ./cleanup.sh

Conclusion

This post demonstrated a solution that turns errors captured in application logs into metrics you can track to improve the reliability of your systems. Using CloudWatch, you can create metrics from log events and monitor applications without changing their code, a technique that improves the reliability and observability of applications that don’t generate metrics. We also showed how to streamline monitoring by sending alarm notifications to Slack. Furthermore, we demonstrated monitoring of control plane logs to capture and report errors promptly.

Authors

Elamaran Shanmugam

Elamaran (Ela) Shanmugam is a Sr. Container Specialist Solutions Architect with AWS. Ela is a Container, Observability, and Multi-Account Architecture SME and helps AWS partners and customers design and build scalable, secure, and optimized container workloads on AWS. His passion is building and automating infrastructure to allow customers to focus more on their business. He is based out of Tampa, Florida, and you can reach him on Twitter @IamElaShan and on GitHub.

Re Alvarez-Parmar

In his role as Containers Specialist Solutions Architect at Amazon Web Services, Re advises engineering teams on modernizing and building distributed services in the cloud. Prior to joining AWS, he spent over 15 years as an Enterprise and Software Architect. He is based out of Seattle. You can connect with him on LinkedIn at linkedin.com/in/realvarez

Prakash Srinivasan

Prakash is a Solutions Architect at AWS, passionate about empowering customers to unlock the full potential of the Cloud. As a builder at heart, he guides businesses through modernizing applications and accelerating their cloud journey. When not immersed in cloud solutions, this Denver-based professional unwinds by watching movies and cherishing quality time with family. Connect with him on LinkedIn at linkedin.com/in/prakash-s

Hari Muthusamy

Hari is a Senior DevOps Consultant with Amazon Web Services. He is a DevOps evangelist focused on driving cloud adoption, open source integration, and infrastructure automation. In his spare time, he enjoys watching comedy shows and playing tennis. He is based out of Columbus, Georgia, and you can connect with him on LinkedIn at linkedin.com/in/hari-muthusamy