Automating Amazon EBS volume resizing with AWS Step Functions and AWS Systems Manager

In active applications, it’s possible for an Amazon EC2 instance’s Amazon EBS volume utilization to reach provisioned capacity. Depending on the application in use, this creates the risk of a customer-impacting application outage when that provisioned capacity is exhausted. One solution is to design a failover mechanism into the application. However, this can be a burden to orchestrate. An easier solution is to resize the EBS volume automatically.

At Infor, we manage thousands of EC2 instances (both Windows and Linux) in our production environments. We needed a proactive approach to prevent such outages, so we developed the solution described in this post to make sure that a volume automatically increases when it reaches a certain threshold. This approach has two benefits:

If an anomaly occurs and a volume becomes critically low on space, the automation expands it before a provisioned capacity exhaustion outage happens. This action buys precious time to investigate and resolve the root cause.
Volumes no longer have to be over-provisioned. This solution automatically increases volume capacity over time as more space is needed, thus reducing your EBS cost.

In this post, you walk through the automatic EBS volume-resizing process that we developed at Infor, see its architecture, and learn some best practices.

Overview

In 2017, AWS announced Amazon EBS Elastic Volumes. This feature offers the ability to increase a volume’s size, adjust performance, or change the volume type on the fly with no downtime and with a simple API call.

While modifying the volume is relatively easy, the tricky part is extending the file system to take advantage of the additional storage. Typically, this is done manually on the OS, but if AWS Systems Manager manages your instances, it’s possible to use AWS Lambda to send Systems Manager commands that run OS-level scripts.

The following list shows the steps in this workflow:

Monitor when a volume reaches 80% capacity and trigger the automation.
Run through a series of checks before proceeding, because certain systems were excluded. Some EC2 instances in our fleet had custom-built applications that required manual disk expansion. Other EC2 instances were managed by teams that had specific governance policies requiring tighter control of any infrastructure changes.
Take a snapshot of the volume as a safety measure in the unusual event that data corruption occurs.
Expand the volume by 20% on the AWS layer using Lambda.
Extend the file system on the OS using Systems Manager documents that download and execute scripts stored in Amazon S3.
Check the status and make sure that the volume expanded and extended correctly.
Use Amazon SES to send an email notification summarizing the action that the automation performed.

The following graphic shows these workflow steps diagrammatically.

Diagram of workflow for modifying storage volume and extending a file system

Walk-through

This section includes the solution’s architecture and steps.

Architecture

The overall architecture of this solution can be broken down into two main stages: triggering and execution.

Triggering

Monitors volumes on instances using the Amazon CloudWatch agent or similar third-party monitoring applications.
Checks whether the instance should be excluded, according to previously defined criteria.
Configures alerts that call Amazon API Gateway when a volume reaches a certain threshold.
Integrates API Gateway with a Lambda function that acts as the “invoker” Lambda function.

Execution

The “invoker” Lambda function starts an execution of an AWS Step Functions state machine, which is the “orchestrator” of the various tasks required as part of the automation.
Certain tasks use Systems Manager to download PowerShell and bash scripts from S3, and run them on the instance.
SES sends notifications when the automation succeeds (or fails).

This architecture is shown graphically in the following diagram.

The two main phases of the solution are the triggering and execution phases.

Triggering steps

The following triggering steps are involved in the solution:

Set up an alert when a volume reaches a certain threshold.
Configure API Gateway to receive the alert and trigger the invoker function.
Create the invoker Lambda function.

Set up an alert when a volume reaches a certain threshold

To expand a volume on an instance proactively, you must first monitor it. File system usage cannot be monitored at the EBS layer, because that information can be obtained only from the operating system itself. For example, this information might include how much of the C drive on a Windows instance is used and how much is free.

Standard CloudWatch metrics do not provide information on the provisioned capacity used by the OS. However, AWS offers a CloudWatch agent that can collect custom metrics for this purpose.

Alternatively, most third-party monitoring applications should be able to monitor the volume-provisioned capacity used and trigger the automation. In our case, we had LogicMonitor deployed to all instances in our fleet. We configured a LogicMonitor alert to be issued when a volume’s utilization reaches 80% of the provisioned capacity.
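
For the CloudWatch agent route mentioned above, the 80% threshold could be expressed as a CloudWatch alarm. The following is a minimal sketch, not the configuration we used at Infor (we used LogicMonitor). It assumes the agent publishes the Linux disk_used_percent metric in its default CWAgent namespace with the dimensions shown. The helper name, the alarm name format, and the alarm action ARN are hypothetical, and the dimensions depend on how your agent is configured.

```python
def build_disk_alarm_params(instance_id, mount_path, device, fstype,
                            alarm_action_arn, threshold_percent=80.0):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"disk-used-{instance_id}-{mount_path}",
        "Namespace": "CWAgent",             # default CloudWatch agent namespace
        "MetricName": "disk_used_percent",  # Linux disk metric from the agent
        # Assumption: these dimensions match what your agent publishes.
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "path", "Value": mount_path},
            {"Name": "device", "Value": device},
            {"Name": "fstype", "Value": fstype},
        ],
        "Statistic": "Average",
        "Period": 300,                      # seconds per evaluation period
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [alarm_action_arn],
    }
```

You would pass the result to the boto3 CloudWatch client, for example cloudwatch.put_metric_alarm(**params). How the alarm action then reaches API Gateway depends on your setup and is not covered here.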

Configure API Gateway to receive the alert and trigger the invoker function

We set up the alert described earlier in LogicMonitor to call the API that was integrated with a Lambda function (the invoker function). The job of the invoker function is to receive the request data and trigger the execution of the state machine, which contains the various task states of the automation. The ARN of the state machine is passed to this Lambda function as an environment variable.

Defining your API Gateway

In the Amazon API Gateway console, choose Create API.
Select REST and fill out the name and description fields.
Under Resources, create a new resource and provide the resource name.
Select your new API resource and add a POST method.
For Integration type, choose Lambda Function and select the check box next to Use Lambda Proxy integration.
Provide the Region and the name of your Lambda function, and then click Save.
After adding the POST method, you should see a summary of your method execution.
Deploy the API. Make a note of the invoke URL, as your monitoring application uses it to trigger this automation.
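
To illustrate what the monitoring application sends to the invoke URL, the following sketch builds the POST request using only the Python standard library. The URL and the alert fields are hypothetical placeholders. Your monitoring tool most likely has its own webhook mechanism that replaces this code.

```python
import json
import urllib.request


def build_alert_request(invoke_url, alert):
    """Build a POST request that carries the alert as a JSON body."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        invoke_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical invoke URL and alert fields, for illustration only.
request = build_alert_request(
    "https://example.execute-api.us-east-1.amazonaws.com/prod/resize",
    {
        "instance_id": "i-0bc3fb5a56fdc4e8e",
        "region": "us-east-1",
        "platform": "windows",
        "drive_letter": "D",
    },
)
# urllib.request.urlopen(request) would send it; it is left out here so that
# the sketch has no side effects.
```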

Create the invoker Lambda function

Using the boto3 AWS SDK for Python, you can easily write a short Lambda function that triggers the execution of a state machine:

import json
import boto3

sf = boto3.client('stepfunctions')

# Start one execution; the name must be unique for each execution
sf.start_execution(
    stateMachineArn='ARN_of_state_machine',
    name='unique_execution_id',
    input=json.dumps({'my_data': 'As a JSON string'})
)

The input must include the data required to process the automation. In our case, we had to know the instance ID, the drive letter (for Windows instances) or mount point (for Linux instances), the Region, and the account ID. You also need the volume ID. However, it might not be readily available from your monitoring system, so it has to be retrieved as part of the automation. More on this later in this post.
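
Putting these pieces together, an invoker function might look like the following sketch. It assumes the Lambda proxy integration configured earlier, where the POST body arrives as a JSON string in event["body"]. The environment variable name STATE_MACHINE_ARN, the required field names, and the helper name are assumptions for illustration, not the exact code we run.

```python
import json
import os
import uuid


def start_resize_execution(event, sf_client, state_machine_arn):
    """Validate the alert payload and start one state machine execution."""
    # With Lambda proxy integration, the POST body arrives as a JSON string.
    payload = json.loads(event.get("body") or "{}")

    required = ("instance_id", "region", "platform")  # assumed field names
    missing = [key for key in required if key not in payload]
    if missing:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "missing fields", "fields": missing}),
        }

    response = sf_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Execution names must be unique, so add a random suffix.
        name=f"{payload['instance_id']}-{uuid.uuid4()}",
        input=json.dumps(payload),
    )
    return {
        "statusCode": 202,
        "body": json.dumps({"executionArn": response["executionArn"]}),
    }


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    return start_resize_execution(
        event,
        boto3.client("stepfunctions"),
        os.environ["STATE_MACHINE_ARN"],  # assumed environment variable name
    )
```

Passing the client in as an argument keeps the validation logic easy to test with a stub.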

Execution steps

The following execution steps are involved in the solution:

Set up the Lambda function.
Orchestrate the Lambda functions with AWS Step Functions.

Set up the Lambda function

Modifying an instance’s volume requires several steps that you must execute in the correct order. Lambda is the natural choice here, as it allows you to break up this complex automation into smaller pieces.

The following is a list of the tasks that we included in the automation:

Check if an instance is eligible.
Map the volumes.
Retrieve the volume ID.
(Optional) Take a snapshot of the volume.
Expand the volume (on the AWS layer).
Extend the file system.
Check the status.
Send an email confirmation.

Check if an instance is eligible

This task verifies whether the instance meets certain conditions that are specific to Infor. If those conditions are met, the Lambda function returns the original input and appends a key-value pair that includes the eligibility status. For example, notice the last key-value pair in this return:

{  
  "instance_id": "i-0bc3fb5a56fdc4e8e",  
  "region": "us-east-1",  
  "platform": "windows",  
  "drive_letter": "D",  
  "eligible": true
}
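
The eligibility conditions themselves are specific to each organization, so the following is only a sketch of the shape of such a task. It assumes a hypothetical opt-out tag (AutoResize=disabled) and takes the instance tags as a plain dictionary. Neither the tag nor the function name comes from our actual implementation.

```python
def check_eligibility(event, instance_tags):
    """Return the original input with an 'eligible' flag appended."""
    # Hypothetical rule: a tag AutoResize=disabled opts an instance out.
    opted_out = instance_tags.get("AutoResize", "").lower() == "disabled"
    result = dict(event)  # copy so that the caller's input is left untouched
    result["eligible"] = not opted_out
    return result
```

Returning the full input with one extra key lets the next state in the workflow branch on the flag without losing any of the alert data.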

Map the volumes

Our monitoring software, like most monitoring applications, is a collector process that runs on the OS layer of the instance. As a result, the alert provides information on the affected Windows (or Linux) volume, but it does not provide the EBS volume ID. We therefore needed a way to retrieve the EBS volume ID so that the automation can expand it.

Thankfully, if an instance is managed by AWS Systems Manager, you can use SSM commands to download a PowerShell (or bash for Linux) script from S3 and then run the script. In this case, the script lists the volumes and the corresponding EBS volume IDs. AWS even provides a sample PowerShell script.

This task is a Lambda function that executes the SSM document and returns the execution ID to the output. The SSM document can be executed from Lambda using the Systems Manager boto3 client:

import boto3

ssm = boto3.client('ssm')

def lambda_handler(event, context):
    # Run the SSM document that maps OS volumes to EBS volume IDs
    response = ssm.send_command(
        InstanceIds=['the_id_of_the_instance'],
        DocumentName='the_name_of_the_ssm_document',
        Comment='Retrieve volumes mapped to this instance',
        Parameters={
            'driveletter': ['D']
        },
    )

    # Return the command ID so that a later task can fetch the output
    return response['Command']['CommandId']

Retrieve the volume ID

After mapping the volumes using Systems Manager in the previous step and returning a command ID, you can use the command ID to retrieve the output of the SSM document execution. Then you parse through the output and return the target EBS volume ID.

With the command ID, retrieving the output of an SSM command is easily accomplished by using the boto3 SSM client and the list_command_invocations() method.
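
As a rough sketch of this step, the following function calls list_command_invocations() with Details=True so that the plugin output is included, and then parses it. It assumes that the mapping script prints a JSON object such as {"D": "vol-..."}. That output format and the function name are assumptions, so adapt the parsing to whatever your script emits.

```python
import json


def get_volume_id(ssm_client, command_id, instance_id, drive_letter):
    """Return the EBS volume ID for a drive, or None if it is not available."""
    response = ssm_client.list_command_invocations(
        CommandId=command_id,
        InstanceId=instance_id,
        Details=True,  # include the per-plugin output in the response
    )
    invocations = response["CommandInvocations"]
    if not invocations or invocations[0]["Status"] != "Success":
        return None  # still running or failed; the caller decides what to do

    # Assumption: the mapping script prints JSON such as {"D": "vol-0123"}.
    output = invocations[0]["CommandPlugins"][0]["Output"]
    mapping = json.loads(output)
    return mapping.get(drive_letter)
```

Here ssm_client is a boto3 Systems Manager client, for example boto3.client('ssm').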

(Optional) Take a snapshot of the volume

This step is optional, but it’s good to always take a snapshot before modifying any volumes as a precautionary measure. You can also do this in Lambda using the boto3 EC2 client and the create_snapshot method.
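
A minimal sketch of the snapshot task follows. The function name, description text, and CreatedBy tag are hypothetical. Note that create_snapshot returns as soon as the snapshot is started, not when it completes.

```python
def snapshot_volume(ec2_client, volume_id, instance_id):
    """Start a snapshot of the volume and return the snapshot ID."""
    response = ec2_client.create_snapshot(
        VolumeId=volume_id,
        Description=f"Pre-resize snapshot of {volume_id} on {instance_id}",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            # Hypothetical tag so that these snapshots can be found later.
            "Tags": [{"Key": "CreatedBy", "Value": "ebs-auto-resize"}],
        }],
    )
    return response["SnapshotId"]
```

If you want the workflow to pause until the snapshot finishes, one option is the boto3 EC2 snapshot_completed waiter, or a wait state in the state machine.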

Expand the volume (on the AWS layer)

This step is the first one that actually modifies the volume. This Lambda function expands the EBS volume by 20% using boto3 (the following example shows how to expand a volume to 100 GiB):

import boto3

ec2 = boto3.client('ec2')

# Size is the new total size of the volume, in GiB
ec2.modify_volume(
    VolumeId='ebs_volume_id',
    Size=100
)
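
The example above hard-codes the target size. To grow a volume by 20% of its current size, the function first has to look that size up. The following sketch shows one way to do so. The function name and the rule of always adding at least 1 GiB are our assumptions for illustration.

```python
import math


def expand_volume(ec2_client, volume_id, growth_percent=20):
    """Grow the volume and return (old_size, new_size), both in GiB."""
    volume = ec2_client.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    old_size = volume["Size"]
    # Size must be a whole number of GiB, so round the increase up and
    # always add at least 1 GiB.
    increase = max(1, math.ceil(old_size * growth_percent / 100))
    new_size = old_size + increase
    ec2_client.modify_volume(VolumeId=volume_id, Size=new_size)
    return old_size, new_size
```

Returning both sizes makes it easy to include them later in the notification email.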

Extend the file system

The most critical (and challenging) part of this automation is the ability to “extend” the file system on the OS layer. It is relatively easy to expand an EBS volume on the AWS layer. However, this additional space is useless unless the file system on the OS is extended to take advantage of the newly provisioned space. This process requires a good understanding of the various scenarios that you might come across, for example, different operating systems and volume types. You can then write the appropriate scripts to extend the volume on the OS layer.

For Windows, we wrote a PowerShell script that used DiskPart, and for Linux our bash script used growpart. For more information about how to extend the file system after resizing the EBS volume, see the Amazon EBS documentation for Linux and Windows instances.

We followed the same approach used in earlier steps to deploy the scripts to the instance:

Write the script and store it in S3.
Write an SSM document that downloads the script from S3 and executes it.
In your Lambda function, use the SSM client and the send_command method to run the SSM document on the instance.
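
To give a feel for the Linux side, the following sketch builds the growpart and file system resize commands and sends them with the AWS-managed AWS-RunShellScript document. This is a simplification for brevity: our solution used custom SSM documents that download scripts from S3, and a production script has to handle many more cases (for example, LVM or volumes without partitions). The function names and the supported file system list are assumptions.

```python
def build_extend_commands(device, partition_number, mount_point, fstype):
    """Return shell commands that grow a partition and then its file system."""
    # NVMe device names put a "p" before the partition number,
    # for example /dev/nvme0n1p1.
    separator = "p" if device[-1].isdigit() else ""
    partition = f"{device}{separator}{partition_number}"
    commands = [f"growpart {device} {partition_number}"]
    if fstype == "xfs":
        commands.append(f"xfs_growfs -d {mount_point}")
    elif fstype in ("ext2", "ext3", "ext4"):
        commands.append(f"resize2fs {partition}")
    else:
        raise ValueError(f"unsupported file system type: {fstype}")
    return commands


def extend_file_system(ssm_client, instance_id, commands):
    """Run the commands on the instance and return the SSM command ID."""
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # AWS-managed document
        Comment="Extend the file system after an EBS resize",
        Parameters={"commands": commands},
    )
    return response["Command"]["CommandId"]
```

For example, build_extend_commands("/dev/xvda", 1, "/", "ext4") yields the growpart and resize2fs commands for the root partition.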

Check the status

An important step is to check the status of the volume on the instance to make sure that the file system was extended correctly. Because this is done on the OS layer of the instance, Systems Manager can be used to download and execute a script as in previous steps. A status of “success” or “fail” can be returned depending on whether the file system was extended correctly or not.
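
One way to turn the result of that script into a status is sketched below. It assumes that the check script exits with a non-zero code when the file system did not grow, so that the SSM invocation status reflects the outcome. The function name and the extra “pending” value are our additions for illustration.

```python
def check_command_status(ssm_client, command_id, instance_id):
    """Map an SSM command invocation to 'success', 'fail', or 'pending'."""
    invocation = ssm_client.get_command_invocation(
        CommandId=command_id,
        InstanceId=instance_id,
    )
    status = invocation["Status"]
    if status == "Success":
        return "success"
    if status in ("Pending", "InProgress", "Delayed"):
        return "pending"  # a wait state can retry the check later
    return "fail"  # Failed, TimedOut, Cancelled, or Cancelling
```

The “pending” value gives the state machine something to branch on when the script is still running.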

Send an email confirmation

The last task is sending an email notification to alert the team that the automation ran (it also summarizes the action taken).

In our case, due to the size of our infrastructure, we have multiple teams. Each instance has an “owner” tag with the value being that team’s distribution list email address. We wrote a Lambda function that extracted the owner tag and then used the SES client in boto3 to send a message after every execution of the automation. The message included information about the infrastructure that was affected (including instance ID, volume ID, drive letter, original size, and final size).
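
The following sketch shows the two pieces of that function: reading the owner tag with the EC2 client and building the arguments for the SES send_email call. The function names, subject line, and message layout are hypothetical. The sender address must be one that is verified in SES.

```python
def get_owner_address(ec2_client, instance_id):
    """Read the instance's 'owner' tag, or return None if it is missing."""
    response = ec2_client.describe_tags(Filters=[
        {"Name": "resource-id", "Values": [instance_id]},
        {"Name": "key", "Values": ["owner"]},
    ])
    tags = response["Tags"]
    return tags[0]["Value"] if tags else None


def build_summary_email(sender, owner_address, details):
    """Build keyword arguments for the SES send_email call."""
    lines = [f"{key}: {value}" for key, value in details.items()]
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [owner_address]},
        "Message": {
            "Subject": {"Data": f"EBS volume {details['volume_id']} was resized"},
            "Body": {"Text": {"Data": "\n".join(lines)}},
        },
    }
```

You would then send the message with ses_client.send_email(**build_summary_email(...)), where ses_client is boto3.client('ses').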

We also wrote another Lambda function responsible for sending “fail” emails. In some cases, the automation failed to modify the EBS volume. Some of these failures were by design (for example, certain systems were excluded from the automation). Some were due to other factors (for example, a drive that gets modified but then fills up again within the six-hour window during which an EBS volume cannot be modified again). Be sure to alert system administrators of any failures, so that they can be addressed.

Orchestrate the Lambda functions with AWS Step Functions

As mentioned earlier, you must execute the steps listed previously in a specific order. Because Lambda is stateless, you need an approach to pass states between the various Lambda functions. The natural solution is AWS Step Functions, which provides the ability to build a state machine to orchestrate several Lambda functions. Our solution used a complex state machine that starts executing the steps mentioned earlier once it is triggered by the invoker Lambda function.

The order and workflow of tasks can vary greatly depending on your particular environment or needs. Step Functions gives you the ability to implement choice states and wait states, as well as complex branching and error handling. The following diagram shows the visual workflow of a successful execution of the state machine.

Visual workflow of a successful execution of the state machine built with AWS Step Functions to orchestrate several Lambda functions.
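
To make the orchestration more concrete, the following is a heavily simplified state machine definition in the Amazon States Language, built as a Python dictionary. It is a sketch only: our actual state machine has more states, and the Lambda ARNs, state names, and 60-second wait are hypothetical. It shows a choice state that branches on the eligible flag, a wait state, and Catch rules that route errors to a failure notification.

```python
import json

# Hypothetical Lambda ARN prefix; replace it with functions in your account.
LAMBDA = "arn:aws:lambda:us-east-1:123456789012:function:"
ON_ERROR = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]

definition = {
    "Comment": "Simplified sketch of an EBS resize workflow",
    "StartAt": "CheckEligibility",
    "States": {
        "CheckEligibility": {
            "Type": "Task",
            "Resource": LAMBDA + "check-eligibility",
            "Next": "IsEligible",
        },
        "IsEligible": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.eligible",
                "BooleanEquals": True,
                "Next": "ExpandVolume",
            }],
            "Default": "NotifyFailure",
        },
        "ExpandVolume": {
            "Type": "Task",
            "Resource": LAMBDA + "expand-volume",
            "Catch": ON_ERROR,
            "Next": "WaitForResize",
        },
        "WaitForResize": {
            "Type": "Wait",
            "Seconds": 60,
            "Next": "ExtendFileSystem",
        },
        "ExtendFileSystem": {
            "Type": "Task",
            "Resource": LAMBDA + "extend-file-system",
            "Catch": ON_ERROR,
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": LAMBDA + "send-success-email",
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": LAMBDA + "send-fail-email",
            "End": True,
        },
    },
}

# The JSON string is what you supply as the state machine definition.
state_machine_json = json.dumps(definition, indent=2)
```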

Considerations

As mentioned earlier, the goal of this solution is to avoid outages, but note that it does not resolve the actual cause of high disk utilization of provisioned capacity. Also note that ignoring the root cause might lead to additional storage costs because your volumes continue to expand automatically.

After running this automation for a few months, we noted the following important considerations:

Make sure that notifications are correctly sent and received by system administrators.
Have a process in place to investigate high-volume utilization of provisioned capacity, particularly the extreme cases.
While expanding an EBS volume is possible, reducing the size of an EBS volume requires additional steps. Adding steps such as taking a snapshot of the volume and migrating the data to a new volume also adds considerable complexity.
An EBS volume cannot be modified more than once within a six-hour window. If something were to cause the automation to be triggered multiple times within that timeframe, it would fail in subsequent attempts.
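
One way to guard against the six-hour limit is to check the volume's most recent modification before attempting another one. The following sketch is not part of our original automation. It assumes the window can be measured from the modification's StartTime, which you should verify against the current EBS documentation, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(hours=6)


def can_modify_volume(ec2_client, volume_id, now=None):
    """Return True if the volume has no modification in the last six hours."""
    now = now or datetime.now(timezone.utc)
    response = ec2_client.describe_volumes_modifications(
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )
    for modification in response["VolumesModifications"]:
        # A modification that is still in progress blocks a new one.
        if modification["ModificationState"] in ("modifying", "optimizing"):
            return False
        # Assumption: the window is measured from the start time.
        if now - modification["StartTime"] < COOLDOWN:
            return False
    return True
```

Running this check early in the state machine lets the automation send a “fail” email right away instead of failing at the modify_volume call.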

Conclusion

In this post, you learned how to expand EBS volumes proactively, and the benefits gained from automating this process. You also viewed our Infor end-to-end approach using existing AWS tools and services. Remember to delete example resources created while following along if they are no longer needed, to avoid incurring any future costs.

Depending on how diverse your EC2 environment is, you might experience a few failures. Most of the failures we came across were during the file system extension step, where scripts failed to run on older operating systems or non-standard configurations. However, those failures were rare.

This automation allowed Infor to provision storage in a leaner way. Instead of over-provisioning, we can rely on this automation to add EBS capacity as needed.

While conducting a recent review, we discovered that the automation successfully ran about 450 times in one month in our fleet of about 15,000 EC2 instances. It takes anywhere from 15 to 30 minutes for an engineer to manually provision additional EBS storage and extend it on the OS. Therefore, we estimate that in one month alone, we saved between 100 and 200 staff-hours and avoided at least 5–10 potential customer outages.

Since implementing this automation at Infor, we have eliminated provisioned capacity exhaustion outages and performance issues. We have also been able to significantly reduce the time and effort that engineers and system administrators spend on provisioning additional EBS storage where it is legitimately needed.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.