How Capgemini used AWS Systems Manager and AWS cloud native observability to provide self-service monitoring

This post was written in collaboration with David Wansell, an Enterprise Cloud Architect at Capgemini with over 20 years of experience across multiple enterprise domains. He designs and builds automation and solutions that enable customers to deliver on their desired outcomes in their cloud adoption journey.

Customers need a way to automatically create alarms that monitor metrics, send notifications in the AWS Cloud, and automatically remediate issues. Many customers rely on Managed Service Providers to manage their AWS accounts and are looking for AWS native solutions to their business problems.

As a certified AWS Managed Service Provider (MSP), an AWS Premier Consulting Partner with seven AWS Competencies, and a member of the AWS Well-Architected Partner Program, Capgemini has a proven record of creating solutions for the unique and evolving needs of customers.

Cloud Operation Services (COS) from Capgemini is a Managed Service offering for AWS Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) solutions, built on AWS best practices and tools. This post breaks down the components used to provide modern cloud managed services with cloud native tooling. It illustrates the relationship between the components and provides detailed information on each component’s use and process flow.

For its self-service monitoring solution on AWS, Capgemini leverages AWS Systems Manager. The Systems Manager agent allows Systems Manager to dynamically update, manage, and configure the Amazon CloudWatch agent installed on the resources, using Run Command and Parameter Store. Alarms are dynamically created using Amazon EventBridge rules and AWS Lambda functions to alert on the metrics provided by the CloudWatch agent. The solution also provides an optional incident management framework that uses Systems Manager Incident Manager to auto-remediate incidents by running runbooks based on the alarm type encountered.

AWS Services and Feature components used in this solution

Systems Manager is an AWS service that you can use to view and control your infrastructure on AWS, on-premises, or in a hybrid environment. Using the Systems Manager console, you can view operational data from multiple AWS services and automate operational tasks across your AWS resources. For more information on Systems Manager capabilities, refer to this link.

CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.

CloudWatch Agent is a software package that autonomously and continuously runs on your servers. Using CloudWatch Agent, you can collect metrics and logs from Amazon Elastic Compute Cloud (Amazon EC2) instances, on-premises servers running either Linux or Windows, and containerized applications and microservices. CloudWatch Agent provides access to more system-level and in-guest metrics, in addition to the host metrics already provided by Amazon EC2.

Using Run Command, a capability of Systems Manager, you can remotely and securely manage the configuration of your managed instances. Parameter Store, a capability of Systems Manager, provides secure, hierarchical storage for configuration data management and secrets management.

EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale using events generated from your applications, integrated Software-as-a-Service (SaaS) applications, and AWS services. To know more about EventBridge and how it works, refer to this link.

Lambda is a compute service that lets you run code without provisioning or managing servers. You can invoke your Lambda functions using the Lambda API, or Lambda can run your functions in response to events from other AWS services.

Monitoring prerequisites

The EC2 instances that should be managed by the solution must meet these prerequisites:

Tags: Instances must be tagged with the management tag key and value that the “COS-Lambda-Create-EC2-Instance-CloudWatch-Alarms” Lambda function is scanning for.
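As a rough illustration of the tag prerequisite, the check could look like the sketch below. The tag key and value are hypothetical placeholders, not the ones Capgemini uses; the comparison is an exact string match, so a tag that differs only in case does not qualify.

```python
# Hypothetical management tag; the real key and value are whatever the
# Lambda function is configured to scan for.
MANAGEMENT_TAG_KEY = "cos:managed"
MANAGEMENT_TAG_VALUE = "true"


def is_managed(tags, key=MANAGEMENT_TAG_KEY, value=MANAGEMENT_TAG_VALUE):
    """Return True if the tag list holds the exact (case-sensitive) key/value pair.

    `tags` uses the shape returned by the EC2 DescribeInstances API:
    [{"Key": "...", "Value": "..."}, ...]
    """
    return any(t.get("Key") == key and t.get("Value") == value for t in tags or [])
```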

Systems Manager: Make sure that the instances meet the Systems Manager prerequisites as described here. The instance profile role also needs the “CloudWatchAgentServerPolicy” policy attached to stream metrics to CloudWatch.

CloudWatch: Refer to this link for a list of supported operating systems for the CloudWatch Agent.

How Capgemini made it work

When an EC2 instance with the appropriate tag key and value pair enters the running state, the event is detected by the alarm creation EventBridge rule, which forwards the instance ID to the COS-Lambda-Create-EC2-Instance-CloudWatch-Alarms Lambda function. Once the Lambda function receives the instance ID of the newly provisioned EC2 instance, it does the following:
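A minimal sketch of the trigger side, assuming the rule matches the standard EC2 state-change event: the event pattern below selects instances entering the running state, and the helper pulls the instance ID out of the event that the Lambda function receives. The constant and function names are illustrative, not taken from the COS code.

```python
# Event pattern for a rule that fires when any EC2 instance enters "running".
ALARM_CREATION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["running"]},
}


def extract_instance_id(event):
    """Return the instance ID from an EC2 state-change event, or None if it does not match."""
    detail = event.get("detail", {})
    if event.get("source") != "aws.ec2" or detail.get("state") != "running":
        return None
    return detail.get("instance-id")
```

The state-change event carries the instance ID and state but not the instance tags, which fits with the Lambda function performing the tag check itself.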

1. Check that the instance is correctly tagged. If the correct management tag (with correct case) isn’t found, then the instance won’t be processed.
2. Install and update the CloudWatch agent via the Systems Manager “AWS-ConfigureAWSPackage” document.
3. Determine whether the EC2 instance is Windows or Linux.
  1. Windows instances will run the Systems Manager “AWS-RunPowerShellScript” commands, which will:
    1. Install the “AmazonCloudWatch-coswindowsmetricsconfiguration” CloudWatch Agent configuration file for Windows held in the Systems Manager Parameter Store.
    2. Start the CloudWatch Agent.
  2. Linux instances will run the Systems Manager “AWS-RunShellScript” commands, which will:
    1. Install collectd and epel.
    2. Install the “AmazonCloudWatch-coslinuxmetricsconfiguration” CloudWatch Agent configuration file for Linux held in the Systems Manager Parameter Store.
    3. Start the CloudWatch Agent.
4. Create either Linux or Windows EC2 alarms, depending on the platform of the instance, and set the alarm trigger to output to the SNS topic. All alarm names are prefixed with the instance ID.

Any issues or alerts raised by the monitoring are set up to point to the SNS topic, which is associated with the COS-Lambda-SNOW-Listener Lambda function. This function uses the SNS payload to create a ticket in ServiceNow for support users to triage and remediate.
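The agent installation and configuration steps above can be sketched as Run Command requests. This is an illustration under assumptions, not the COS implementation: the function names are invented, the Linux commands assume a yum-based distribution where collectd is provided by EPEL, and the agent control script paths are the CloudWatch agent defaults.

```python
LINUX_CONFIG = "AmazonCloudWatch-coslinuxmetricsconfiguration"
WINDOWS_CONFIG = "AmazonCloudWatch-coswindowsmetricsconfiguration"


def install_agent_request(instance_id):
    """Run Command request that installs or updates the CloudWatch agent package."""
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-ConfigureAWSPackage",
        "Parameters": {"action": ["Install"], "name": ["AmazonCloudWatchAgent"]},
    }


def configure_agent_request(instance_id, platform):
    """Run Command request that loads the Parameter Store config and starts the agent."""
    if platform.lower() == "windows":
        document = "AWS-RunPowerShellScript"
        commands = [
            '& "C:\\Program Files\\Amazon\\AmazonCloudWatchAgent\\amazon-cloudwatch-agent-ctl.ps1" '
            f"-a fetch-config -m ec2 -c ssm:{WINDOWS_CONFIG} -s"
        ]
    else:
        document = "AWS-RunShellScript"
        commands = [
            # Assumption: yum-based distribution; collectd is installed from EPEL.
            "yum install -y epel-release",
            "yum install -y collectd",
            "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl "
            f"-a fetch-config -m ec2 -c ssm:{LINUX_CONFIG} -s",
        ]
    return {
        "InstanceIds": [instance_id],
        "DocumentName": document,
        "Parameters": {"commands": commands},
    }
```

Each dictionary can be passed to the Systems Manager client, for example `boto3.client("ssm").send_command(**install_agent_request(instance_id))`. The `-c ssm:<name>` option tells the agent to read its configuration from the named Parameter Store parameter, and `-s` starts the agent afterwards.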

The following figure shows the entire architecture:

Figure 1. Self-Service monitoring for multi-platform instances using Systems Manager.

CloudWatch Alarms

The default alarms created for Windows and Linux EC2 instances are “CPUUtilization”, “MemoryUsed”, “DiskUsed”, and “StatusCheckFailed”. The COS-Lambda-Create-EC2-Instance-CloudWatch-Alarms Lambda function creates each alarm using Python code. The default alarms can be customized by editing this Lambda code or by defining custom dimensions or namespaces in the configuration stored in the Systems Manager Parameter Store.
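For illustration, one of the default alarms might be defined roughly as follows. The threshold, period, evaluation periods, and the exact alarm name format are assumptions; the only detail taken from the solution is that the alarm name is prefixed with the instance ID and that the alarm notifies an SNS topic.

```python
def cpu_alarm_request(instance_id, sns_topic_arn, threshold=90.0):
    """Build PutMetricAlarm arguments for a CPUUtilization alarm on one instance."""
    return {
        # The instance ID prefix lets the alarm be found again by name later.
        "AlarmName": f"{instance_id}-CPUUtilization",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds (assumed)
        "EvaluationPeriods": 2,     # assumed
        "Threshold": threshold,     # percent (assumed)
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

The dictionary can be passed to `boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_request(...))`. Alarms on agent metrics such as memory or disk usage would use the agent's namespace and dimensions instead of `AWS/EC2`.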

Each alarm is created with an output action, which can be set to an SNS topic that delivers to an email address. Alternatively, the output can be set to an SNS topic that forwards the alarm payload to the “COS-Lambda-SNOW-Listener” Lambda function. This function sends the payload to the MSP ServiceNow instance, where a support ticket is created for support staff to remediate.
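A sketch of how a listener function could unpack the alarm payload delivered by SNS. The SNS record layout and the alarm message keys (`AlarmName`, `NewStateValue`, `NewStateReason`) follow the standard CloudWatch alarm notification; the ticket field mapping is an assumption for illustration, and the call to ServiceNow itself is left out.

```python
import json


def alarm_from_sns_event(event):
    """Extract the CloudWatch alarm fields from the first SNS record of a Lambda event."""
    # The alarm notification arrives as a JSON string inside the SNS message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return {
        "alarm_name": message["AlarmName"],
        "state": message["NewStateValue"],
        "reason": message["NewStateReason"],
    }


def ticket_fields(alarm):
    """Map alarm fields to a ticket payload (field mapping is an assumption)."""
    return {
        "short_description": f"CloudWatch alarm {alarm['alarm_name']} is {alarm['state']}",
        "description": alarm["reason"],
    }
```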

CloudWatch Alarm Deletion

Normally, when an EC2 instance is terminated, the CloudWatch alarms associated with it are left behind. The solution deploys a method of cleaning these up.

When an EC2 instance is terminated, its associated CloudWatch alarms are removed. This is done by a terminate alarms EventBridge rule that triggers when the termination event occurs. The rule forwards the instance ID to the COS-Lambda-Terminate-EC2-Instances-CloudWatch-Alarms Lambda function, which scans for any CloudWatch alarms prefixed with the instance ID and removes them.
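The cleanup step could be sketched as below, assuming a boto3 CloudWatch client is passed in as `cloudwatch`. The function name is illustrative. It relies on the alarm naming convention described in this post (names prefixed with the instance ID) and looks alarms up by that prefix.

```python
def delete_instance_alarms(cloudwatch, instance_id, batch_size=100):
    """Delete every metric alarm whose name starts with the instance ID.

    Returns the list of alarm names that were deleted.
    """
    names = []
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate(AlarmNamePrefix=instance_id):
        names.extend(alarm["AlarmName"] for alarm in page.get("MetricAlarms", []))
    # DeleteAlarms accepts at most 100 alarm names per call.
    for start in range(0, len(names), batch_size):
        cloudwatch.delete_alarms(AlarmNames=names[start:start + batch_size])
    return names
```

In a Lambda handler this would be called as `delete_instance_alarms(boto3.client("cloudwatch"), instance_id)`, with the instance ID taken from the termination event.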

Systems Manager Governance Rule

The following rule for Systems Manager is enabled in the region that the solution is deployed into:

CloudWatch-Agent-Update-Association: All EC2 instances that are tagged with the management tag key and value are in scope for this rule. By default, the association runs automatically in Systems Manager every 30 days. It runs the “AWS-ConfigureAWSPackage” document, which updates the CloudWatch agent on all of the EC2 instances that meet the standard Systems Manager prerequisites.
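One way such an association could be expressed is as arguments to the Systems Manager CreateAssociation API, sketched below. The function name is illustrative, and the tag key and value are supplied by the caller because the real management tag is not given in this post.

```python
def agent_update_association(tag_key, tag_value):
    """Build CreateAssociation arguments for a 30-day CloudWatch agent update."""
    return {
        "AssociationName": "CloudWatch-Agent-Update-Association",
        "Name": "AWS-ConfigureAWSPackage",
        # Target every managed instance that carries the management tag.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "ScheduleExpression": "rate(30 days)",
        "Parameters": {"action": ["Install"], "name": ["AmazonCloudWatchAgent"]},
    }
```

The result can be passed to `boto3.client("ssm").create_association(**agent_update_association(key, value))`.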

Systems Manager Incident Manager Auto Remediation – Optional

The monitoring solution framework also has an automated remediation solution which can be deployed on an optional basis.

Incident Manager is a recent AWS service designed to help users mitigate and recover from incidents affecting their AWS-hosted applications. Because Incident Manager isn’t yet available in all regions, the function is disabled by default, but it can easily be enabled by changing the IncidentManagerEnabled conditional parameter.

A framework has been developed that enables CloudWatch alarms to interact with Incident Manager and trigger a runbook. Different runbooks can be used for different alarms. This enables automatic remediation or triaging of issues based on the alarm type encountered.

Figure 2. Incident management for monitoring and auto remediation using Systems Manager.

Whenever an EC2 CloudWatch alarm enters the ALARM state, an EventBridge rule triggers the COS-Lambda-SSMIncidentManager-Create-Incident Lambda function.
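A sketch of this trigger, assuming the standard CloudWatch alarm state-change event and the instance-ID alarm name prefix described earlier. The function names are illustrative, and the `StartIncident` arguments are a plausible shape for the Incident Manager API rather than the COS code; the response plan ARN is supplied by the caller.

```python
import re
from datetime import datetime, timezone

# Event pattern for a rule that fires when any alarm enters the ALARM state.
ALARM_STATE_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def instance_id_from_alarm(event):
    """Return the EC2 instance ID that prefixes the alarm name, or None."""
    name = event.get("detail", {}).get("alarmName", "")
    # EC2 instance IDs are "i-" followed by 17 (or, for older IDs, 8) hex characters.
    match = re.match(r"i-[0-9a-f]{17}|i-[0-9a-f]{8}", name)
    return match.group(0) if match else None


def start_incident_request(response_plan_arn, event):
    """Build StartIncident arguments from an alarm state-change event."""
    when = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "responsePlanArn": response_plan_arn,
        "title": f"Alarm {event['detail']['alarmName']} in ALARM state",
        "triggerDetails": {
            "source": "aws.cloudwatch",
            "timestamp": when,
            "triggerArn": event["resources"][0],  # ARN of the alarm
        },
    }
```

An alarm whose name does not start with an instance ID yields `None`, so the function can skip alarms that do not belong to a managed instance before creating an incident.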

This Lambda function verifies that the alarm is from an in-scope (managed) EC2 instance. If it is, the function creates an incident in Incident Manager. A response plan triggers the Incident Runbook Systems Manager document, which defines a sequence of steps. The first step triggers the “COS-Lambda-SSMIncidentManager-CloudwatchAlarm-DetailsExtractor” Lambda function, which retrieves information about the CloudWatch alarm and the EC2 instance, and then compares it to the open incident. Depending on the name of the alarm, a branching action occurs and the appropriate runbook is triggered. This runs a customized script on the EC2 instance. The runbooks can be predefined templates already provided by Systems Manager Automation, or operational engineers can create customized runbook templates and attach them to any alarm triggers that they define.
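The branching on alarm name can be pictured as a simple lookup. The mapping below is hypothetical: `AWS-RestartEC2Instance` is an AWS-provided Automation runbook, while the `COS-Runbook-*` names are placeholders invented for this sketch, and the actual pairing of alarms to runbooks is a design decision for the operations team.

```python
# Hypothetical mapping from alarm type (the alarm-name suffix) to a runbook.
RUNBOOKS = {
    "StatusCheckFailed": "AWS-RestartEC2Instance",   # AWS-provided runbook
    "DiskUsed": "COS-Runbook-DiskCleanup",           # placeholder name
    "MemoryUsed": "COS-Runbook-MemoryTriage",        # placeholder name
}
DEFAULT_RUNBOOK = "COS-Runbook-Triage"               # placeholder name


def select_runbook(alarm_name):
    """Pick the runbook for an alarm based on which metric its name ends with."""
    for metric, runbook in RUNBOOKS.items():
        if alarm_name.endswith(metric):
            return runbook
    return DEFAULT_RUNBOOK
```

Keeping the mapping in data rather than in code means a new alarm type only needs a new entry, which mirrors the idea that engineers can attach customized runbooks to the alarm triggers they define.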

Summary

Capgemini now offers a monitoring solution for EC2 instances which can raise alerts and incidents, send notifications for the alarm triggers, and perform automated remediation. To learn more about how Capgemini can assist with your business challenges related to management and governance, and to learn more about Capgemini AWS Cloud Operation Services, visit Capgemini Cloud Platform. To learn more about how AWS Systems Manager could be leveraged to manage instances in a hybrid environment, visit AWS Cloud Operation Services.


AWS Cloud Operations & Migrations Blog