Centralizing configuration management using AWS Systems Manager

In this guest post, Kaitlyn Fedorak (Engineer) and contributors Cody Olsen (Senior Engineer), Will Scott (Engineer), and Samuel Raghunandan (Engineer) from Xero discuss their use of AWS Systems Manager Inventory and State Manager for configuration management of Amazon EC2 instances. Any team or company can use a design similar to the one described in this post to save on licensing costs, have more flexibility in how output data is used, increase their visibility into the configuration of their instances and software therein, and decrease incidents from misconfigurations.

Xero is a leader in cloud accounting across New Zealand, Australia, and the United Kingdom, serving over 3.5 million subscribers in more than 180 countries. Xero provides business owners with real-time visibility of their financial position and performance in a way that’s simple, smart and secure. Xero connects small businesses to their advisors and other services. For accountants, Xero forges a trusted relationship with clients through online collaboration and gives accountants the opportunity to extend their services.

Misconfigurations across a system can cause a lot of easily avoidable production incidents or compliance issues, but these can be nearly impossible to find in advance if you don’t have an easy way to view the current configuration settings. There are numerous prebuilt configuration management tools out there, but what if these prebuilt tools do not have the ability to check every area of a system that you need to manage?

In particular, finding a tool that has the ability to check a specific piece of software on an instance, like Microsoft SQL Server, on top of more universal checks like OS and hardware, can be near impossible. This post demonstrates how Xero built their own highly-customizable configuration monitoring solution capable of monitoring hardware settings, Microsoft SQL Server, and everything in between using several AWS services including AWS Systems Manager, Amazon Simple Storage Service (Amazon S3), AWS Glue, Amazon Athena, and AWS Lambda.

Overview

Many prebuilt tools on the market support a set of checks natively; however, there may be many checks outside these sets that you need to cover. Some of these checks may be prohibitively difficult to build due to permissions limitations, such as accessing multiple well-secured AWS accounts for CLI calls or granting permissions to run queries against a database service. There are also additional costs to consider: most available tools require an enterprise license, and you also need to manage an instance to host the tool itself.

Architecture

After evaluating several available prebuilt tools, Xero decided to build their own solution. This option was cheaper than paying for an enterprise license of a prebuilt tool, given that the majority of the cost came from storing the text files containing the configuration data in S3, followed by Athena usage; none of the Systems Manager pieces of the design incur any extra cost.

Additionally, since Xero already used other aspects of Systems Manager on their instances, this solution removed all overhead from giving a new tool permissions to access all of the configuration data. Finally, due to the way the scripts for data gathering are organized, this tool provides an easy way to adjust for new areas of configuration to check against, making it highly customizable for use with any team’s tool set.

Below is a diagram showing the architecture of Xero’s configuration management tool. It begins under Systems Manager in State Manager, with an Association that gathers the current state of the configuration. This data is then sent to Systems Manager Inventory, where a Resource Data Sync sends the configuration data to the target S3 bucket.

Inventory data is automatically separated by AWS account ID and Region, so you need to deploy resource data syncs in each AWS account and Region with managed nodes you wish to monitor. By sending inventory data collected from all of your managed nodes to a single S3 bucket, you gain a holistic view of the fleet.
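As a rough illustration of deploying one resource data sync per Region into a single central bucket, the following Python sketch builds the request for the Systems Manager CreateResourceDataSync API. The sync name, bucket name, and the `create_syncs` helper are hypothetical, and the client objects are assumed to be boto3 SSM clients created elsewhere, one per Region. This is not Xero's deployment code.

```python
from typing import Any, Dict, List


def build_sync_request(sync_name: str, bucket: str, bucket_region: str,
                       prefix: str = "inventory") -> Dict[str, Any]:
    """Build keyword arguments for the SSM CreateResourceDataSync call."""
    return {
        "SyncName": sync_name,
        "S3Destination": {
            "BucketName": bucket,       # the single central bucket (hypothetical name)
            "Prefix": prefix,           # optional key prefix inside the bucket
            "SyncFormat": "JsonSerDe",  # format accepted by the API
            "Region": bucket_region,    # Region where the bucket lives
        },
    }


def create_syncs(ssm_clients_by_region: Dict[str, Any], sync_name: str,
                 bucket: str, bucket_region: str) -> List[str]:
    """Create one sync per Region and return the Regions that were processed.

    Each value is assumed to be an SSM client bound to its Region, for example
    boto3.client("ssm", region_name=region), using credentials for the account
    whose managed nodes you want to monitor.
    """
    created = []
    for region, client in ssm_clients_by_region.items():
        request = build_sync_request(sync_name, bucket, bucket_region)
        client.create_resource_data_sync(**request)
        created.append(region)
    return created
```

The same loop would need to be repeated in each AWS account, and the central bucket's policy must allow Systems Manager to write to it.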

Xero also hosts self-managed JavaScript Object Notation (JSON) files that contain the desired ideal configuration state in an S3 bucket. These JSON files can be updated manually or added to your automated update/configuration processes.

Once the information is in an S3 bucket, a Glue crawler scans the bucket in order to create a database and table which you can then query using Athena. From there, Lambda uses the ideal state files to write Athena queries to find any instances with misconfigurations. This resulting output can be used for any number of solutions, such as auto-remediation or on-call alerting.

Figure 1. Architecture diagram for configuration management.
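To make the query step of this pipeline more concrete, here is a minimal Python sketch of how a Lambda function might run one Athena query against the table created by the Glue crawler and read back the rows. The `run_athena_query` helper, the database name, and the output location are assumptions for illustration; `athena` is assumed to be a boto3 Athena client. The sketch reads only the first page of results.

```python
import time
from typing import Any, Dict, List


def run_athena_query(athena: Any, sql: str, database: str, output_location: str,
                     poll_seconds: float = 1.0, max_polls: int = 60) -> List[Dict[str, str]]:
    """Start a query, wait for it to finish, and return rows as dictionaries."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Athena query {query_id} did not finish in time")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue", "") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue", "") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

For larger fleets, the results would need to be paginated rather than read from a single response.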


Create a custom solution based on your data

Xero was able to create a single solution to gather the configuration state of several components, most notably hardware, other settings within AWS, the operating system, and SQL Server. This solution is easily customizable to gather whatever types of configuration data you wish when given proper permissions. There are two key areas where the custom solution will vary based on what your own configuration data looks like, and those are in State Manager and the Lambda functions.

Prerequisites

Amazon Elastic Compute Cloud (EC2) instances, AWS IoT Greengrass core devices, on-premises servers, edge devices, and VMs must be Systems Manager managed nodes to have inventory data gathered. This means your nodes must meet certain prerequisites and be configured with the AWS Systems Manager Agent (SSM Agent). For more information, see Setting up AWS Systems Manager.

State Manager association

State Manager associations allow you to invoke scripts and commands within the operating system on your instances as long as the instance is registered with Systems Manager. The actions performed by the association are defined using Systems Manager Command Documents, which require a particular design for gathering your own data.

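This document must have at least two actions, in this order: a script action (aws:runShellScript or aws:runPowerShellScript) followed by aws:softwareInventory.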

An example minimal document that allows you to pass in PowerShell commands can be viewed on GitHub here:

https://github.com/aws-samples/aws-management-and-governance-samples/blob/master/AWSSystemsManager/ConfigurationManagement/exampleCommandDocument.yml

The first action, aws:runShellScript or aws:runPowerShellScript, allows you to run your own Shell or PowerShell scripts to collect any configuration settings that are not included in the standard metadata collected by inventory and add the configuration settings as custom inventory.

The second action, aws:softwareInventory, does two things: it gathers the standard metadata from your instance, and it allows your custom inventory to be indexed into Inventory, which is then synchronized to the S3 bucket using the resource data sync. It is important to make sure aws:softwareInventory runs after the custom script block; otherwise, the custom inventory data is not indexed.
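The ordering of the two actions can be sketched as follows. This Python snippet builds a Command document as a dictionary and checks that aws:softwareInventory is the last step. It is an assumed shape based on the schema version 2.2 document format, not a copy of the linked GitHub example; the step names, the description, and the `inventory_runs_last` helper are hypothetical.

```python
import json
from typing import Any, Dict


def build_inventory_document() -> Dict[str, Any]:
    """Return a minimal Command document: custom script first, inventory second."""
    return {
        "schemaVersion": "2.2",
        "description": "Gather custom configuration data, then index inventory.",
        "parameters": {
            "commands": {
                "type": "StringList",
                "description": "PowerShell commands that write custom inventory JSON.",
            }
        },
        "mainSteps": [
            {
                # Step 1: run your own script to produce the custom inventory files.
                "action": "aws:runPowerShellScript",
                "name": "gatherCustomInventory",
                "inputs": {"runCommand": "{{ commands }}"},
            },
            {
                # Step 2: collect standard metadata and index the custom inventory.
                "action": "aws:softwareInventory",
                "name": "collectSoftwareInventoryItems",
                "inputs": {"applications": "Enabled", "customInventory": "Enabled"},
            },
        ],
    }


def inventory_runs_last(document: Dict[str, Any]) -> bool:
    """Check that aws:softwareInventory appears exactly once, as the final step."""
    actions = [step["action"] for step in document["mainSteps"]]
    return (
        bool(actions)
        and actions[-1] == "aws:softwareInventory"
        and actions.count("aws:softwareInventory") == 1
    )


if __name__ == "__main__":
    document = build_inventory_document()
    print(inventory_runs_last(document))
    print(json.dumps(document, indent=2))
```

A check like `inventory_runs_last` could run in a deployment pipeline to catch a reordered document before it reaches State Manager.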

Additionally, a managed node should be associated with only a single inventory association to prevent unexpected behaviors. For more information, see Configuring inventory collection. If you would like to create and collect multiple custom inventory types, these must all be in the same association. Furthermore, while it is possible to create multiple custom inventory types in a single association, Systems Manager has a default service quota of 20 custom inventory types per AWS account and Region. This means that if multiple teams in an organization want to monitor the same instances, or are at risk of hitting the custom inventory limit, they will have to coordinate.

Defining the custom inventory data structure

The bulk of the work in this stage of the design goes into the custom code block in the Command document. This will vary based on the structure of your own configuration data. The Xero team had a large number of settings they wanted to check, so they began by grouping these into categories, ending with nearly 20 separate categories. For example, two of the settings gathered were related to the server’s Buffer Pool Extension (BPE) ratio and file size:

{
    "common": {
        "BPE_Ratio": {
            "{*}": "1:8"
        },
        "BPE_File_Size": {
            "i3.xlarge": "244",
            "i3.2xlarge": "488",
            "i3.4xlarge": "976",
            "i3.8xlarge": "1952",
            "i3en.xlarge": "256",
            "i3en.2xlarge": "512",
            "i3en.3xlarge": "768",
            "i3en.6xlarge": "1536",
            "i3en.12xlarge": "3072"
        }
    }
}

Once this was done, they designed how the final output JSON for each of these categories needed to be structured to fully understand which setting a JSON entry referred to. With this, they found that they could use two different JSON structures to fully understand every category of their configuration data.

In order to create custom inventory data, the JSON structure needs to be the same for each data point, so gathering all configuration data meant having two custom inventory types. If you have a great deal of categories to cover, or categories with similar information that would be difficult to write separate queries for, you may want to include a CategoryName key in each entry.

Gathering the custom inventory metadata

Once the overall data structure has been created, the Shell or PowerShell scripts that actually gather the data can be written. Depending on how many points you need to gather, it may make sense to split things up into one top-level script that calls one subscript per data category. These scripts programmatically gather each setting you want to check via in-built functions, API calls, queries, or whatever other methods are available to you.

Each setting is one data point that gets added to a list of JSON, with separate lists for each JSON structure. For easier debugging, you could store these as JSON text files in a temp folder. Once all of this information has been gathered, the final steps of this script must combine these data points into a single file per JSON structure. The SSM Agent, which sends this custom inventory to Inventory, requires the JSON file to be stored in a specific file path, which varies between Windows and Linux instances. Note that Inventory only accepts JSON entries containing strings, so be sure to convert any numerical entries.
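The final combining step might look like the following Python sketch, which converts every value to a string and writes one file per custom inventory type. The envelope keys (SchemaVersion, TypeName, Content) and the "Custom:" prefix follow the custom inventory format as documented by AWS; the type name, entry keys, and helper names are hypothetical. The target directory is passed in rather than hard-coded: at the time of writing, the documented locations are /var/lib/amazon/ssm/<instance-id>/inventory/custom on Linux and %SystemDrive%\ProgramData\Amazon\SSM\InstanceData\<instance-id>\inventory\custom on Windows, which is worth confirming against the current documentation.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def stringify(entry: Dict[str, Any]) -> Dict[str, str]:
    """Inventory accepts only string values, so convert every value."""
    return {key: str(value) for key, value in entry.items()}


def write_custom_inventory(type_name: str, entries: List[Dict[str, Any]],
                           inventory_dir: Path) -> Path:
    """Combine all data points of one JSON structure into a single inventory file."""
    if not type_name.startswith("Custom:"):
        raise ValueError("custom inventory type names must start with 'Custom:'")
    payload = {
        "SchemaVersion": "1.0",
        "TypeName": type_name,
        "Content": [stringify(entry) for entry in entries],
    }
    inventory_dir.mkdir(parents=True, exist_ok=True)
    # File name derived from the type name; this naming is an assumption.
    target = inventory_dir / f"{type_name.split(':', 1)[1]}.json"
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```

For example, an entry such as {"CategoryName": "SqlServer", "SettingName": "BPE_File_Size", "Value": 244} would be written with "Value" as the string "244".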

Lambda functions

The second key part to a custom solution that fits your own needs is the Lambda functions at the end of the pipeline. Depending on the structure of your configuration data and the number of custom inventory types created, it may make sense to create one comparison Lambda for each data category so as not to have a large, difficult-to-manage Lambda function.

These Lambda functions should pull the ideal configuration files from an S3 bucket and use them to construct Athena queries. An example of the items returned from the resulting Athena query is as follows:

Figure 2. Example Athena query results for custom inventory metadata


These Athena queries should find and make a list of any misconfigurations found in the items returned. If the formatting of your ideal configuration files is consistent enough, you may be able to make helper functions that can dynamically build a query based on the ideal file’s contents, else you can maintain hard-coded queries, which is easier for one-off configurations.
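A hedged sketch of such helper functions is shown below, using the ideal state format from the BPE example earlier in this post. It assumes that the "{*}" key means "applies to every instance type", and that the Athena table exposes columns named resourceid, instancetype, settingname, and value; both the wildcard interpretation and the column names are assumptions for illustration and are not taken from the linked example Lambda function.

```python
from typing import Dict, List, Optional

WILDCARD = "{*}"  # assumed to mean "any instance type"


def build_query(table: str, settings: List[str]) -> str:
    """Build a query that selects the gathered values for the given settings."""
    quoted = ", ".join("'" + name.replace("'", "''") + "'" for name in settings)
    return (
        "SELECT resourceid, instancetype, settingname, value "
        f'FROM "{table}" WHERE settingname IN ({quoted})'
    )


def expected_value(ideal: Dict[str, Dict[str, str]], setting: str,
                   instance_type: str) -> Optional[str]:
    """Look up the ideal value, preferring an exact instance type over the wildcard."""
    per_type = ideal.get(setting, {})
    if instance_type in per_type:
        return per_type[instance_type]
    return per_type.get(WILDCARD)


def find_misconfigurations(rows: List[Dict[str, str]],
                           ideal: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Return one finding per row whose value differs from the ideal state."""
    findings = []
    for row in rows:
        want = expected_value(ideal, row["settingname"], row["instancetype"])
        # Settings with no ideal value for this instance type are skipped.
        if want is not None and row["value"] != want:
            findings.append({
                "resourceid": row["resourceid"],
                "settingname": row["settingname"],
                "expected": want,
                "actual": row["value"],
            })
    return findings
```

Here the comparison happens in Python after a broad query; pushing the expected values into the WHERE clause instead would move the comparison into Athena at the cost of a more complex query builder.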

Once your Lambda function has given you the list of misconfigurations, the choice is yours as to what to do with them. You can design a system with auto-remediation, send urgent issues to an on-call engineer, or send a non-urgent report of misconfigurations to be dealt with during a scheduled maintenance window. Systems Manager capabilities such as Incident Manager and Change Manager can help you mitigate and recover from incidents, or establish an enterprise change management framework for operational changes to your application configuration and infrastructure.

An example Lambda function can be found on the following GitHub repo:

https://github.com/aws-samples/aws-management-and-governance-samples/blob/master/AWSSystemsManager/ConfigurationManagement/configurationManagementExampleLambda.py

Summary

This post describes how Xero was able to create a configuration management tool with a wide range of use and coverage. Any team or company can leverage a similar design to save on licensing costs, have more flexibility in how output data is used, increase their visibility into the configuration of their instances and software therein, and decrease incidents from misconfigurations.

AWS Cloud Operations & Migrations Blog