Simplifying Hybrid Cloud Management Using AWS Systems Manager Run Command

By Brandon Pierce, Engineering Director at Onica
By Roy Kalamaro, Security Architect at Onica

Although a hybrid cloud environment is not always ideal, infrastructure needs and requirements can create situations requiring a hybrid approach. If this is the case, it’s even more important to ensure a strong architecture.

Although our team at Onica, an AWS Partner Network (APN) Premier Consulting Partner, utilizes Amazon Web Services (AWS) as our primary platform, customer needs sometimes require the usage of on-premises systems.

Enter AWS Systems Manager, a valuable resource for quickly accessing operational insights and taking action in both AWS and on-premises environments.

In this post, we’ll share several ways our team at Onica uses AWS Systems Manager. We’ll demonstrate how organizations can use Systems Manager to simplify hybrid environment operations and significantly reduce operational overhead and manual procedures.

Managing Hybrid Workloads with AWS Systems Manager

On-premises servers are a limiting factor in hybrid infrastructures, and are often unable to integrate with the capabilities of cloud services or communicate seamlessly with cloud counterparts.

AWS Systems Manager offers unprecedented insights and access with a unified user interface (UI) that includes information from a multitude of AWS services and on-premises servers. Systems Manager uses a lightweight agent installed on servers to provide visibility, eliminating communications challenges faced in most hybrid environments.

Our team at Onica uses Systems Manager to simplify resource grouping and to automate command execution. Previously, these actions would have been documented in a Standard Operating Procedure (SOP) or runbook and carried out by hand.

Through AWS Systems Manager, we can provide contextual information to Amazon CloudWatch Alarm notifications.

We have been able to influence hybrid cloud environments like never before. Below are some ways in which our team has found value in the automation provided by Systems Manager.

Remote Management of Hybrid Environments at Scale

When we have the need to manage systems on-premises for clients, the AWS Systems Manager Agent (SSM Agent) allows for seamless management using the same console, API, automation, and tooling that we would utilize within AWS.

One of the main challenges we have faced when working in a hybrid environment has been utilizing a single management tool for control and orchestration of Windows and Linux OS across multiple hosting platforms.

SSM Agent is able to monitor the heartbeat of Amazon Elastic Compute Cloud (Amazon EC2) instances, as well as that of remote on-premises servers. Additionally, it allows our team to run commands and verify output regardless of the OS, hypervisor, or platform.
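As a rough illustration of the heartbeat view described above, the sketch below summarizes a fleet from the output of the Systems Manager `DescribeInstanceInformation` API. The function names (`summarizeFleet`, `getFleetSummary`) are made up for this example, and the `ssm` argument is assumed to be an AWS SDK for JavaScript (v2) client such as `new AWS.SSM()`. It relies on the convention that hybrid (on-premises) managed nodes receive IDs prefixed with `mi-`.

```javascript
// Group managed nodes by where they run and flag those whose agent is not reporting in.
function summarizeFleet(instanceInformationList) {
  const summary = { ec2: 0, onPremises: 0, offline: [] };
  for (const info of instanceInformationList) {
    // Hybrid (on-premises) nodes get IDs prefixed with "mi-"; EC2 instances use "i-".
    if (info.InstanceId.startsWith('mi-')) {
      summary.onPremises += 1;
    } else {
      summary.ec2 += 1;
    }
    // PingStatus reflects the SSM Agent heartbeat (Online, ConnectionLost, Inactive).
    if (info.PingStatus !== 'Online') {
      summary.offline.push(info.InstanceId);
    }
  }
  return summary;
}

// Page through every managed node and summarize the fleet.
// `ssm` is assumed to be an AWS SDK v2 SSM client, e.g. new AWS.SSM().
async function getFleetSummary(ssm) {
  const all = [];
  let nextToken;
  do {
    const page = await ssm
      .describeInstanceInformation({ MaxResults: 50, NextToken: nextToken })
      .promise();
    all.push(...page.InstanceInformationList);
    nextToken = page.NextToken;
  } while (nextToken);
  return summarizeFleet(all);
}
```

Keeping the summarizing logic separate from the API call makes it easy to check against sample data before pointing it at a real account.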

In Figure 1, you can see orchestration on Windows and Linux servers across multiple hosting platforms and AWS Regions using Systems Manager.

Figure 1 – Orchestration on Windows and Linux servers using AWS Systems Manager.

At Onica, we capture instance-level metrics using Amazon CloudWatch. The CloudWatch agent collects standard and custom metrics from these instances and sends them to CloudWatch. We configured CloudWatch Alarms on specific metrics (CPUUtilization, RAM, DiskWriteOps, etc.) that are deemed critical for customer workloads to function effectively.
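A minimal sketch of how one such alarm could be defined with the CloudWatch `PutMetricAlarm` API follows. The alarm name, threshold, period, and helper names are illustrative assumptions, not the values Onica used. The `AWS/EC2` namespace applies to EC2 instances only; metrics published by the CloudWatch agent from on-premises servers live in a different namespace with different metric names.

```javascript
// Build PutMetricAlarm parameters for a CPU alarm on one EC2 instance.
// The threshold (90%) and evaluation window (2 x 5 minutes) are assumed values.
function buildCpuAlarmParams(instanceId, topicArn) {
  return {
    AlarmName: `high-cpu-${instanceId}`,
    Namespace: 'AWS/EC2',
    MetricName: 'CPUUtilization',
    Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
    Statistic: 'Average',
    Period: 300,
    EvaluationPeriods: 2,
    Threshold: 90,
    ComparisonOperator: 'GreaterThanThreshold',
    // Notify this SNS topic when the alarm enters the ALARM state.
    AlarmActions: [topicArn],
  };
}

// `cloudwatch` is assumed to be an AWS SDK v2 client, e.g. new AWS.CloudWatch().
async function createCpuAlarm(cloudwatch, instanceId, topicArn) {
  return cloudwatch.putMetricAlarm(buildCpuAlarmParams(instanceId, topicArn)).promise();
}
```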

Alarm Enrichment Solution

One challenge we faced was meeting customer requirements for real-time notification to stakeholders with relevant Windows/Linux OS and application-level health data when these CloudWatch Alarms are triggered. In response, our team at Onica developed the Alarm Enrichment solution that utilizes CloudWatch and Systems Manager services in a hybrid environment.

This solution provides extended information for customers and engineers when a CloudWatch metric crosses a given threshold and triggers a CloudWatch Alarm.

Figure 2 demonstrates a solution Onica designed and implemented for an enterprise customer that operates around 1,000 servers in a hybrid environment while utilizing Systems Manager APIs.

Figure 2 – Alarm Enrichment automated workflow.

Here is the sequence of steps capturing the Alarm Enrichment workflow:

Step 1: CloudWatch Alarms are placed on custom metrics that are tied to the health of customer workloads running on the instances managed by Systems Manager.

Step 2: CloudWatch Alarms watch a given metric and send Amazon Simple Notification Service (Amazon SNS) notifications when the value of the metric exceeds a configured threshold.

Step 3: The SNS topic that receives the alarm notification is configured to invoke an AWS Lambda function, which runs an AWS Systems Manager document (SSM document) on the affected instance. The SSM document runs OS-level commands using the Run Command feature, executing a PowerShell script that collects the top 10 processes by CPU, memory, disk usage, etc.

Here’s an example of an SSM document’s JSON snippet that gets the server processes information:

{
   "schemaVersion": "2.2",
   "description": "Windows Process and Memory Alert Details",
   "parameters": {
      "infoType": {
         "type": "String",
         "default": "CPU",
         "description": "Return CPU or RAM usage details",
         "allowedValues": [
            "CPU",
            "RAM",
            "Both"
         ]
      }
   },
   "mainSteps": [
      {
         "action": "aws:runPowerShellScript",
         "name": "getWindowsDetails",
         "precondition": {
            "StringEquals": [
               "platformType",
               "Windows"
            ]
         },
         "inputs": {
            "runCommand": [
               "if (\"{{ infoType }}\" -ne \"RAM\" ){ write-output \"Top CPU Usage:\"; Get-Process | Sort-Object 'CPU' -Descending | select -First 10 }",
               "if (\"{{ infoType }}\" -eq \"Both\" ){ write-output \"-----------------------------\" }",
               "if (\"{{ infoType }}\" -ne \"CPU\" ){ write-output \"Top Memory Usage:\";  Get-Process | Sort-Object 'WS' -Descending | select -First 10 }"
            ]
         }
      }
   ]
}
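The Lambda function from Step 3 is not shown in this post, so the following is only a sketch of what it might look like. It assumes the alarm is defined on a metric with an `InstanceId` dimension, and that the document above was registered under the hypothetical name `WindowsProcessDetails`. The `ssm` client is passed in (for example `new AWS.SSM()` from the AWS SDK v2) so the parsing logic can be exercised without AWS access.

```javascript
// Extract the affected instance ID from a CloudWatch Alarm notification delivered via SNS.
// In these notifications, the alarm's dimensions use lowercase "name" and "value" keys.
function instanceIdFromSnsEvent(event) {
  const alarm = JSON.parse(event.Records[0].Sns.Message);
  const dimensions = (alarm.Trigger && alarm.Trigger.Dimensions) || [];
  const match = dimensions.find((d) => d.name === 'InstanceId');
  return match ? match.value : null;
}

// Build SendCommand parameters for the process-details document.
// 'WindowsProcessDetails' is a hypothetical document name used for illustration.
function buildSendCommandParams(instanceId, infoType) {
  return {
    DocumentName: 'WindowsProcessDetails',
    InstanceIds: [instanceId],
    // SSM document parameter values are always passed as arrays of strings.
    Parameters: { infoType: [infoType] },
  };
}

// Create a Lambda handler bound to an SSM client, e.g. makeHandler(new AWS.SSM()).
function makeHandler(ssm) {
  return async function handler(event) {
    const instanceId = instanceIdFromSnsEvent(event);
    if (!instanceId) {
      throw new Error('Alarm notification did not include an InstanceId dimension');
    }
    const result = await ssm
      .sendCommand(buildSendCommandParams(instanceId, 'Both'))
      .promise();
    // The command ID is needed later to fetch the command output.
    return result.Command.CommandId;
  };
}
```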

Step 4: Next, we trigger another Lambda function that compiles the data collected in Step 3 and sends this detailed information to engineering teams via a PagerDuty endpoint.
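The post does not say which PagerDuty integration was used, so the sketch below simply assumes the PagerDuty Events API v2 event format. It fetches the Run Command output with the Systems Manager `GetCommandInvocation` API and shapes it into an event body. The routing key, the severity choice, and the function names are placeholders for illustration.

```javascript
// Fetch the output of one command invocation.
// `ssm` is assumed to be an AWS SDK v2 SSM client, e.g. new AWS.SSM().
async function fetchInvocation(ssm, commandId, instanceId) {
  return ssm
    .getCommandInvocation({ CommandId: commandId, InstanceId: instanceId })
    .promise();
}

// Shape the invocation result into a PagerDuty Events API v2 "trigger" event.
// `routingKey` is the integration key of the target PagerDuty service (placeholder here).
function buildPagerDutyEvent(routingKey, alarmName, invocation) {
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    payload: {
      summary: `${alarmName} on ${invocation.InstanceId}`,
      source: invocation.InstanceId,
      severity: 'critical', // assumed; choose per alarm in practice
      custom_details: {
        commandStatus: invocation.Status,
        output: invocation.StandardOutputContent,
      },
    },
  };
}
```

The resulting object would be sent as the JSON body of an HTTPS POST to the PagerDuty events endpoint.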

This is one example of how to automate the execution of runbooks that encapsulate a set of actions in hybrid infrastructures using CloudWatch, Systems Manager APIs, and Lambda.

It significantly reduces operational overhead and the business loss that slow manual actions can cause. These actions include logging in to each instance, checking the probed processes, analyzing the root cause that triggered the alarm, and finally attempting remediation.

It potentially saves a business the cost of a third-party infrastructure monitoring tool that wouldn’t necessarily support all of the above functionalities to perform remediation actions in an automated fashion.

Automated Runbook Execution

A typical operational challenge is the timely and proper execution of runbooks during an incident or maintenance exercise. Depending on the number of impacted systems, there may be a large number of engineers involved in remediation. More critically, there’s the risk of human error due to the improper following of SOPs.

Traditional solutions to these problems involve custom scripts or third-party orchestration software. These solutions often have large price tags or require separate efforts to maintain complex systems in and of themselves. They also don’t scale into the cloud very well, as they were not designed for such dynamic environments.

The goal for automated runbook execution is to reduce engineering effort and any downtime associated with customer application failures. This can be achieved in a cloud-native manner by using Systems Manager and Amazon CloudWatch. We monitor the CloudWatch Logs for specific values or patterns using CloudWatch Alarms to detect abnormal application or process-level errors.
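One way to turn a log pattern into something an alarm can watch is a CloudWatch Logs metric filter. The sketch below builds parameters for the `PutMetricFilter` and `PutMetricAlarm` APIs. The filter name, metric name, namespace, and helper names are assumptions made for this example.

```javascript
// Hypothetical names for the custom metric that counts matching log events.
const METRIC_NAMESPACE = 'Custom/App';
const METRIC_NAME = 'AppErrorCount';

// Build PutMetricFilter parameters: emit a value of 1 for each log event containing errorText.
function buildMetricFilterParams(logGroupName, errorText) {
  return {
    logGroupName,
    filterName: 'app-error-filter',
    // A double-quoted term matches log events that contain that exact phrase.
    filterPattern: `"${errorText}"`,
    metricTransformations: [
      { metricNamespace: METRIC_NAMESPACE, metricName: METRIC_NAME, metricValue: '1' },
    ],
  };
}

// Build PutMetricAlarm parameters: alarm as soon as one matching event appears in a minute.
function buildLogAlarmParams(topicArn) {
  return {
    AlarmName: 'app-error-detected',
    Namespace: METRIC_NAMESPACE,
    MetricName: METRIC_NAME,
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold',
    // No matching log events means no data points; treat that as healthy.
    TreatMissingData: 'notBreaching',
    AlarmActions: [topicArn],
  };
}
```

These objects would be passed to `putMetricFilter` on a CloudWatch Logs client and `putMetricAlarm` on a CloudWatch client, respectively.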

One of Onica’s customers had an in-house developed application that was passing all traditional endpoint and OS-level monitoring. Periodically, users would report issues in application functionality. We did some analysis and found the issue was attributed to a faulty OS process that would error out occasionally when a user tried to log in to the application’s web-portal. On these occasions, a specific error was logged in CloudWatch.

To remediate the issue while a long-term fix was worked on, we created an auto remediation solution to restart the process using Systems Manager and CloudWatch, as shown in Figure 3.

Figure 3 – Automated OS process remediation by using AWS Systems Manager.

In the illustration above, you’ll note the following steps:

Step 1: CloudWatch Agent sends application-level metrics to CloudWatch.

Step 2: A CloudWatch Alarm based on a log filter pattern is created. The alarm triggers SNS notifications on given topics when configured metrics exceed the threshold value.

Step 3: SNS topics have subscriptions to a Lambda function that gets executed when the alarm triggers.

Step 4: A Lambda function initiates Run Command to instantly restart an OS-level process using Systems Manager APIs.
The following code demonstrates how to target the instances based on certain tag values and execute a Shell Script command on those instances:

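var AWS = require('aws-sdk');

// Restart the service on every instance tagged environment=prod AND type=jsis.
function blind_run(){
  var ssm = new AWS.SSM();
  var params = {
    DocumentName: 'AWS-RunShellScript',
    Targets: [
      {
        Key: 'tag:environment',
        Values: [ 'prod' ]
      },
      {
        Key: 'tag:type',
        Values: [ 'jsis' ]
      }
    ],
    Parameters: {
      commands: ['service jsis-service restart']
    }
  };

  // Send the command; Systems Manager resolves the tag targets to instances.
  ssm.sendCommand(params, function (err, data) {
    if (err) {
      console.error('Run Command request failed:', err);
    } else {
      console.log('Command ID:', data.Command.CommandId);
    }
  });
}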

Step 5: After command execution (i.e. restarting of an OS-level process), the instances come back to normal operation and the issue is remediated.
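To confirm that the restart actually succeeded on every targeted instance, the command's per-instance results can be inspected. The sketch below uses the Systems Manager `ListCommandInvocations` API. The function names are illustrative, and `ssm` is assumed to be an AWS SDK v2 client such as `new AWS.SSM()`. For brevity it reads only the first page of results.

```javascript
// Invocation statuses that mean the command has not finished yet.
const PENDING_STATUSES = ['Pending', 'InProgress', 'Delayed'];

// Sort instance IDs into succeeded, still pending, and failed buckets.
function summarizeInvocations(commandInvocations) {
  const summary = { succeeded: [], pending: [], failed: [] };
  for (const invocation of commandInvocations) {
    if (invocation.Status === 'Success') {
      summary.succeeded.push(invocation.InstanceId);
    } else if (PENDING_STATUSES.includes(invocation.Status)) {
      summary.pending.push(invocation.InstanceId);
    } else {
      // Failed, TimedOut, Cancelled, and Cancelling all need attention.
      summary.failed.push(invocation.InstanceId);
    }
  }
  return summary;
}

// Fetch the first page of per-instance results for one command and summarize them.
async function checkRestart(ssm, commandId) {
  const result = await ssm.listCommandInvocations({ CommandId: commandId }).promise();
  return summarizeInvocations(result.CommandInvocations);
}
```

A non-empty `failed` list is a natural trigger for escalating to an engineer instead of retrying automatically.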

This sequence of automated runbook execution completes within a few seconds. The solution not only prevents a bad experience for end users, but also helps customers avoid the significant business loss they could incur if the issue were not remediated quickly enough to meet required application-level SLAs.

With a solution like this, there’s no need to physically log in to the instances or configure SSH and RDP for remote connection. This increases our security posture, as we don’t need to open up ports or manage keys in our infrastructure for remote management and troubleshooting.

Conclusion

When analyzing these use cases, it’s clear that AWS Systems Manager simplifies hybrid cloud management.

Systems Manager makes the oversight of thousands of instances and virtual machines running over eight different operating systems no more challenging than the management of a few instances running in a single Availability Zone.

For our team at Onica, this has resulted in deprecating previous hybrid management solutions from third parties that are costly to implement and maintain.

With AWS Systems Manager, we have also been able to reduce weekly helpdesk tickets and automated alerts by five percent, translating to a roughly 10 percent reduction in the human effort required to support the same number of resources. We foresee additional savings over time as operational efficiency continues to increase and new workloads are launched with these strategies.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.

Onica – APN Partner Spotlight

Onica is an APN Premier Consulting Partner. They provide cloud consulting, infrastructure, and managed services, ensuring customers have the best technical solutions to solve their business challenges and deliver value for their organization.
