Building a Multi-Region Solution for Auto Recovery of Amazon EC2 Instances Using AWS CDK and AWS Step Functions
By Rafal Krol, Cloud SRE – Chaos Gears
One of our customers, a startup from the medical industry, has gained a global reach and serves its clients by operating in multiple AWS regions (currently 10 with more scheduled to come) spanning many time zones.
At the center of each region, there’s an Amazon Elastic Compute Cloud (Amazon EC2) instance that, by design, occasionally maxes out on the CPU. When this happens, a member of the customer’s operations team runs a set of checks to determine whether the instance is reachable; if it’s not, it gets restarted.
Once restarted and back online, which takes a few minutes, the same round of checks recommences. More often than not, this approach proves sufficient and the on-call engineer handling the issue can get back to work.
Being a startup and lacking the resources to man a follow-the-sun operations team, the customer came to Chaos Gears requesting a simple, adjustable, and cost-effective solution that would relieve their engineers from such an operational burden.
This post looks at the multi-regional first-line of support solution Chaos Gears built for the customer. I will also discuss how we automated the incident response duties that would typically engage at least one of the first-line of support engineers.
Before launching this solution, the on-call engineers needed to identify the affected instances, spread among different AWS regions, manually run a set of pre-defined checks on each one, and, based on the outcome, either do nothing or restart the pertinent machines and rerun the checks.
Infrastructure as Code
In today’s world of agile software development, our team at Chaos Gears treats everything as code, or at least we should be doing that.
Hence, the first decision we made for the customer was to leverage the AWS Cloud Development Kit (AWS CDK), a multi-language software development framework for modelling cloud infrastructure as reusable components as our infrastructure as code (IaC) tool.
Our customer’s software engineers were already familiar with TypeScript, the language we chose to build out the infrastructure with, which meant they’d comprehend the final solution quickly.
Moreover, we avoided the steep learning curve of mastering a domain-specific language (DSL) and the additional burden of handling an unfamiliar codebase.
On top of all of that, we could reuse the existing software tooling like linters and apply the industry’s coding best practices.
The challenge at hand was perfect for an event-driven architecture, and we had already envisioned the subsequent steps of the verification process:
- URL health check
- Amazon Route 53 health check
- SSH check
We needed the glue, and with AWS Step Functions we effortlessly combined all of those pieces without worrying about server provisioning, maintenance, retries, and error handling.
We had the backbone figured out, but we still had to decide how to monitor the CPU usage on the Amazon EC2 instances and pass the knowledge of a breach to AWS Step Functions state machine.
When the ‘CPUUtilization’ metric for a given instance reaches 100%, a CloudWatch alarm enters the ‘alarm’ state. This change gets picked up by an EventBridge rule that triggers the AWS Step Functions state machine.
Upon receiving the event object from the EventBridge rule, the state machine orchestrates the following workflow:
- Three checks run, one after another—URL check, Route 53 check, and SSH check.
- If all checks succeed during the first run, the execution ends silently (the ‘All good’ step followed by the ‘End’ field).
- When a check fails, the EC2 instance is restarted and we recommence from the beginning with a second run.
- If all checks succeed during the second run, a Slack notification is sent and the execution ends (the ‘Slack’ step followed by the ‘End’ field).
- When a check fails during the second run, an OpsGenie alert is created and the execution ends (the ‘OpsGenie’ step followed by the ‘End’ field).
Here’s the diagram depicting the complete solution:
Figure 1 – State machine.
All of the above-mentioned resources, plus the Lambda functions, an Amazon Simple Storage Service (Amazon S3) bucket for the Lambda code packages, and the necessary AWS Identity and Access Management (IAM) roles and policies are created and managed by AWS CDK and AWS SAM.
Furthermore, this solution can be deployed effortlessly to multiple regions using AWS CDK environments.
A Peek at the Code
A public repository is available on GitHub with a full working solution and a detailed README. I won’t dissect all of the code here, but let me draw your attention to some of the more interesting elements.
These two configuration files serve the entire project since we use TypeScript for both the infrastructure and application layers.
Now, let’s take a closer look at the ‘bin/cpu-check-cdk.ts’ file, the point of entry to our CDK app, whence all stacks are instantiated.
We imported all of the necessary dependencies one library at a time, but in AWS CDK v2 all of the CDK libraries are consolidated in one package.
Next, we check whether all of the necessary environment variables have been set.
Then, we initialize the CDK app construct.
We grab the regions to which to deploy, along with corresponding instance IDs to monitor from AWS CDK’s context.
Next, we create a tags object with the app’s version and repo’s URL taken directly from the package.json file.
Finally, we loop through the map of regions and corresponding instance IDs we grabbed earlier.
In each region, we produce eight stacks: one for every Lambda function, one for the state machine, and one for the metric, alarm, and rule combo.
Nice and easy, and all in one go, regardless of the number of regions to which we would want to deploy, and mind you, there are 25 available (with more to come).
I won’t be going through all of the AWS CDK and Lambda files in this post, though I strongly encourage you to give the code a thorough review.
You’ll notice there is a helper function called `capitalizeAndRemoveDashes` amongst the CDK libs. Since AWS CDK uses a GPL, we can introduce any amount of custom logic, as we could do with a _regular_ application.
The `lib/helpers.ts` file looks as follows:
Next, we extend the default stack properties like description with ours, setting some as mandatory and some as optional.
We then create a resource name out of the mandatory name property that will be passed in during the class initialization.
We create an IAM role that the Lambda service can assume. We add the `service-role/AWSLambdaBasicExecutionRole` AWS-managed policy to it and, if provided, a custom user-managed policy.
Next, we initialize a construct of the Lambda function using the role defined earlier and stack properties, or arbitrary defaults if stack properties were not provided.
Finally, we expose the Lambda function object as the class’s read-only property we defined earlier. We’re also sure to close our brackets to avoid the implosion of the universe.
In this post, I showed how our team at Chaos Gears put together a serverless application running AWS Lambda under the baton of AWS Step Functions to relieve our customer’s engineers from some of their operational burdens. This enables them to focus more on evolving their business.
The approach described here can be adapted to serve other needs or cover different cases, as AWS Step Functions’ visual workflows allow for a quick translation of business requirements to technical ones.
By using AWS CDK as the infrastructure as code (IaC) tool, we were able to write all of the code in TypeScript, which puts us in an excellent position for future improvements.
We avoided the trap of introducing unnecessary complexity and kept things concise with codebase that was approachable and comprehensive to all team members.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
Chaos Gears – AWS Partner Spotlight
Chaos Gears is an AWS Partner that helps customers and companies of all sizes utilize AWS to its full potential so they can focus on evolving their business.
*Already worked with Chaos Gears? Rate the Partner
*To review an AWS Partner, you must be a customer that has worked with them directly on a project.