AWS Cloud Operations Blog

How Datacom solved hybrid risk management with AWS Systems Manager

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

This post is from Chris Coombs at Datacom, and Samual Brown, Senior Technical Account Manager at AWS. Datacom is an AWS Premier Partner providing migration, transformation and managed services across Australia and New Zealand.

At Datacom our Cloud Ops team now uses AWS Systems Manager as the default task runner and preferred state configuration tool for all new managed services customers. Our transition from an on-premises solution to AWS Systems Manager emerged from a desire to focus more on the customer and less on the tooling. Now that we have migrated to AWS Systems Manager, we have found that it provides even more value thanks to its extensibility and ease of use.

Although AWS Systems Manager has many uses, this blog post focuses on our hybrid implementation and the risk dashboard we built on top of it.

Activating AWS Systems Manager

When setting up the AWS Systems Manager agent on Amazon EC2, you usually create an instance profile to allow the agent to run. An often overlooked feature of AWS Systems Manager is that it also runs outside of AWS. However, because your on-premises hypervisor doesn’t understand AWS Identity and Access Management (IAM), AWS provides another mechanism for configuring AWS Systems Manager: activation codes.

With activation codes, you can install the AWS Systems Manager agent prior to a cloud migration. What’s more, you can also use AWS Systems Manager activation codes for EC2 instances running in AWS itself. Doing so provides a standard setup for your entire fleet, whether it’s on AWS, on-premises, or within other commercial cloud platforms.

Activation at scale

On-premises instances are typically registered with AWS Systems Manager from the AWS Management Console or AWS Command Line Interface (AWS CLI). This poses a scaling problem, particularly in the cloud: we want to dynamically assign instances to AWS Systems Manager as the infrastructure scales out or is replaced over time. To do this we use an API backed by AWS Lambda, which we call from the instance user data script (or a similar bootstrap script on resources outside of AWS). The secure API we created returns a single-use activation code that is unique to the EC2 instance, virtual machine, or other resource requesting it. The activation code is then used to register the instance with the AWS Systems Manager commercial endpoint.

Naming instances

If you are already familiar with AWS Systems Manager, you may be asking why we don’t create a single activation code and use it to register multiple instances. The reason, as with all difficult things in computer science, is naming. If we use activation codes to add an instance to AWS Systems Manager, that instance appears in the AWS Management Console with a randomly generated ID (such as mi-1234). Don’t be fooled by the string after the m (i-1234); that isn’t the AWS instance ID! So how do we map AWS Systems Manager IDs to AWS instance IDs (or some other on-premises ID)? Simple: we give it a name!

We can’t give the instance a name during registration as you might think, so instead we specify the name when we create the activation code.

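# Look up this instance's ID from the EC2 instance metadata service.
Write-Host "Getting Instance ID"
$instanceId = Invoke-RestMethod -Method GET -Uri "http://169.254.169.254/latest/meta-data/instance-id"
# Request an activation code named after the instance. $url (the activation API host) must be set earlier in the script.
Write-Host "Getting Activation Code for $instanceId"
$activation = Invoke-RestMethod -Method GET -Uri "https://$url/latest/activate?name=$instanceId"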
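
On the API side, a Lambda function behind that endpoint could create the single-use, named activation. The following is a minimal sketch rather than Datacom's actual implementation: it is written in Python for illustration, the role name `SSMServiceRole`, the `ssm` injection parameter, and the response field names are assumptions, and only the `create_activation` call and its parameters come from the AWS Systems Manager API.

```python
import json


def build_activation_request(instance_name, iam_role):
    """Build parameters for ssm.create_activation for one single-use code."""
    return {
        "DefaultInstanceName": instance_name,  # name shown for the managed instance
        "IamRole": iam_role,                   # role the managed instance will use
        "RegistrationLimit": 1,                # the code can register one instance only
        "Description": f"Activation for {instance_name}",
    }


def handler(event, context, ssm=None, iam_role="SSMServiceRole"):
    """Hypothetical Lambda handler for GET /activate?name=<instance name>.

    `ssm` and `iam_role` are illustrative extras; Lambda only passes
    `event` and `context`, so the defaults apply when deployed.
    """
    if ssm is None:
        import boto3  # imported lazily so the sketch can be exercised without AWS
        ssm = boto3.client("ssm")

    name = (event.get("queryStringParameters") or {}).get("name")
    if not name:
        return {"statusCode": 400, "body": json.dumps({"error": "name is required"})}

    result = ssm.create_activation(**build_activation_request(name, iam_role))
    body = {
        "ActivationId": result["ActivationId"],
        "ActivationCode": result["ActivationCode"],
    }
    return {"statusCode": 200, "body": json.dumps(body)}
```

Setting `RegistrationLimit` to 1 is what makes the code single use, and `DefaultInstanceName` is where the name from the query string ends up attached to the managed instance.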

Registration

Now that we have the required registration information, the agent setup is pretty straightforward: we take the generated activation code and pass it to the AWS Systems Manager agent executable. That’s it!
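
As a rough illustration of that step, the sketch below builds the agent's register command from an activation response. It assumes the Linux agent binary is on the path as `amazon-ssm-agent`, that the activation response carries `ActivationCode` and `ActivationId` fields, and that the helper names are hypothetical; on Windows the executable path differs, and the agent typically needs elevated privileges and a restart after registering.

```python
import subprocess


def build_register_command(activation_code, activation_id, region,
                           agent_path="amazon-ssm-agent"):
    """Return the argument list that registers the agent with an activation."""
    return [
        agent_path, "-register",
        "-code", activation_code,
        "-id", activation_id,
        "-region", region,
    ]


def register(activation, region, run=subprocess.run):
    """Run the registration command; `run` is injectable for dry runs."""
    cmd = build_register_command(
        activation["ActivationCode"], activation["ActivationId"], region
    )
    return run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the activation code.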

Tagging

To automate tasks (more on this later), we also assign AWS Systems Manager tags (like “dev” or “prod”) to instances so that we can group them. As an aside, if you don’t want to dynamically generate activation codes for any reason, you can manually generate the code and update the Name tag after registration. Again, we use the API to set the tags on the instance, but this step is completely optional. It might seem odd not to use the native IAM integration with AWS Systems Manager for instances in AWS. But our method ensures that all instances are treated the same way, so that we have a single workflow for all instances regardless of their location.
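
Tagging a registered managed instance can be sketched with the `add_tags_to_resource` call from the AWS Systems Manager API. The tag keys (`Environment`, `Name`) and helper names below are assumptions for illustration; the post does not say which keys Datacom uses.

```python
def build_tag_request(managed_instance_id, tags):
    """Build parameters for ssm.add_tags_to_resource on a managed instance."""
    return {
        "ResourceType": "ManagedInstance",
        "ResourceId": managed_instance_id,  # the mi-... ID, not the EC2 instance ID
        "Tags": [{"Key": key, "Value": value} for key, value in sorted(tags.items())],
    }


def tag_instance(ssm, managed_instance_id, tags):
    """Apply tags using a boto3 SSM client (or any object with the same method)."""
    ssm.add_tags_to_resource(**build_tag_request(managed_instance_id, tags))
```

For example, `tag_instance(ssm, "mi-1234", {"Environment": "dev", "Name": "i-1234"})` would group the instance with other development machines and record its original ID.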

To help get you started with hybrid AWS Systems Manager instances, we’ve provided a quick start AWS CloudFormation template. This template contains the API Gateway and Lambda resources, plus an example invocation script.

Assessing risk

With a hybrid fleet now registered, AWS Systems Manager provides considerable power to operations teams for running scheduled and ad hoc commands. This is a time saver for Ops. Where AWS Systems Manager really excels is in its flexibility. For example, one of our use cases is to report on compliance and security risk in near real time, providing enormous customer value. To achieve this we use AWS Systems Manager to install third-party agents across the hybrid fleet to gather CIS and other compliance information. Below is an overview of our AWS Systems Manager compliance workflow.

This data can then be fed into the AWS Systems Manager compliance dashboard, or into your data visualization application of choice, to provide fleet-wide summaries and deep insights into individual instances that supplement AWS Systems Manager’s base inventory reporting. A sample of our risk dashboard can be found below.
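
One way to feed third-party results into the compliance dashboard is the `put_compliance_items` API with a custom compliance type. The sketch below is an assumption about how that could look, not a description of Datacom's pipeline: the `findings` structure, the `Custom:CIS` type name, and the helper name are all hypothetical.

```python
from datetime import datetime, timezone


def build_compliance_request(managed_instance_id, findings,
                             compliance_type="Custom:CIS"):
    """Build parameters for ssm.put_compliance_items from agent findings.

    Each finding is a dict with "id", "title", "passed", and an optional
    "severity" (this shape is illustrative only).
    """
    return {
        "ResourceId": managed_instance_id,
        "ResourceType": "ManagedInstance",
        "ComplianceType": compliance_type,  # custom types must start with "Custom:"
        "ExecutionSummary": {"ExecutionTime": datetime.now(timezone.utc)},
        "Items": [
            {
                "Id": finding["id"],
                "Title": finding["title"],
                "Severity": finding.get("severity", "UNSPECIFIED"),
                "Status": "COMPLIANT" if finding["passed"] else "NON_COMPLIANT",
            }
            for finding in findings
        ],
    }
```

The resulting dictionary can be passed to a boto3 SSM client as `ssm.put_compliance_items(**request)`.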

With AWS Systems Manager we can run State Manager in either of two modes. First we run in a report-only mode. This allows us to gather patching, anti-virus and CIS compliance information from the entire fleet without breaking anything. We can then discuss this data (using our risk dashboard) with the business, who may accept some risks (e.g. a legacy application, which the vendor won’t let you patch) but may mandate fixes for others (e.g. AV). With this information we can then move some or all workloads into enforcement mode, and it’s as simple as switching the AWS Systems Manager tag from report to enforce!
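
To make the tag-driven switch concrete, here is a hedged sketch for patching only, using the AWS-RunPatchBaseline document, whose `Operation` parameter accepts `Scan` (report) or `Install` (enforce). The tag key `Mode`, the association names, and the daily schedule are assumptions; the post only states that the tag value changes from report to enforce.

```python
def build_association(mode, tag_key="Mode"):
    """Build ssm.create_association parameters for one mode.

    Instances tagged <tag_key>=report are only scanned; instances tagged
    <tag_key>=enforce have missing patches installed.
    """
    operations = {"report": "Scan", "enforce": "Install"}
    if mode not in operations:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "Name": "AWS-RunPatchBaseline",
        "AssociationName": f"patching-{mode}",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [mode]}],
        "Parameters": {"Operation": [operations[mode]]},
        "ScheduleExpression": "rate(1 day)",
    }
```

With one association per mode, moving a workload into enforcement means changing its tag value; neither association needs to be edited.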

This is great for migrations. We can run the agent on-premises, analyse the results and remediate any gaps (e.g. missing AV) using Run Command prior to relocation, reducing both the risk of rollback and the duration of the migration window. It also has the benefit of providing real-time insight into born-in-the-cloud workloads, which disappear at night or scale massively during the day. What’s really powerful is that the business can see what the risk profile looks like at any point in time; they can set alerts and take action with their development teams as things change.

What’s Next?

The extensibility of AWS Systems Manager is one of its greatest features. With AWS Systems Manager you can build a solution using cutting edge AWS technology and run it anywhere, from AWS to traditional tin. What Datacom build next is up to you. The idea for our risk dashboard came from customer feedback, and we’d love to hear what challenges you’re facing and how we can help.