AWS Partner Network (APN) Blog

Building a Multi-Region Solution for Auto Recovery of Amazon EC2 Instances Using AWS CDK and AWS Step Functions

By Rafal Krol, Cloud SRE – Chaos Gears


At Chaos Gears, an AWS Advanced Consulting Partner, we help customers and companies of all sizes utilize Amazon Web Services (AWS) to its full potential so they can focus on evolving their business.

One of our customers, a startup from the medical industry, has gained a global reach and serves its clients by operating in multiple AWS regions (currently 10 with more scheduled to come) spanning many time zones.

At the center of each region, there’s an Amazon Elastic Compute Cloud (Amazon EC2) instance that, by design, occasionally maxes out on the CPU. When this happens, a member of the customer’s operations team runs a set of checks to determine whether the instance is reachable; if it’s not, it gets restarted.

Once restarted and back online, which takes a few minutes, the same round of checks recommences. More often than not, this approach proves sufficient and the on-call engineer handling the issue can get back to work.

Being a startup and lacking the resources to staff a follow-the-sun operations team, the customer came to Chaos Gears requesting a simple, adjustable, and cost-effective solution that would relieve their engineers of this operational burden.

This post looks at the multi-region first-line support solution Chaos Gears built for the customer. I will also discuss how we automated the incident response duties that would typically engage at least one first-line support engineer.

Before launching this solution, the on-call engineers needed to identify the affected instances, spread among different AWS regions, manually run a set of pre-defined checks on each one, and, based on the outcome, either do nothing or restart the pertinent machines and rerun the checks.

Infrastructure as Code

In today’s world of agile software development, our team at Chaos Gears treats everything as code, or at least we strive to.

Hence, the first decision we made for the customer was to leverage the AWS Cloud Development Kit (AWS CDK), a multi-language software development framework for modeling cloud infrastructure as reusable components, as our infrastructure as code (IaC) tool.

Our customer’s software engineers were already familiar with TypeScript, the language we chose to build out the infrastructure with, which meant they’d comprehend the final solution quickly.

Moreover, we avoided the steep learning curve of mastering a domain-specific language (DSL) and the additional burden of handling an unfamiliar codebase.

The recent introduction of AWS CDK integration with the AWS Serverless Application Model (AWS SAM) allows for developing serverless applications seamlessly within a CDK project.

On top of all of that, we could reuse the existing software tooling like linters and apply the industry’s coding best practices.

Serverless

The adage says that “no server is better than no server,” and with that in mind we turned our heads towards AWS Step Functions, a serverless orchestrator for AWS Lambda and other AWS services.

The challenge at hand was perfect for an event-driven architecture, and we had already envisioned the subsequent steps of the verification process:

  • URL health check
  • Amazon Route 53 health check
  • SSH check
  • Restart

We needed the glue, and with AWS Step Functions we effortlessly combined all of those pieces without worrying about server provisioning, maintenance, retries, and error handling.

Managed Services

We had the backbone figured out, but we still had to decide how to monitor the CPU usage on the Amazon EC2 instances and pass the knowledge of a breach to an AWS Step Functions state machine.

It screamed of Amazon CloudWatch alarms for the metric monitoring bit and Amazon EventBridge for creating a rule for routing the alarm event to the target (a state machine, in our case).

Business Logic

When the ‘CPUUtilization’ metric for a given instance reaches 100%, a CloudWatch alarm enters the ‘alarm’ state. This change gets picked up by an EventBridge rule that triggers the AWS Step Functions state machine.
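As a rough illustration of what such a rule filters on, the sketch below models a CloudWatch alarm state-change event as a plain object and checks the fields an EventBridge pattern would typically match (`source`, `detail-type`, and `detail.state.value`). The alarm name and the helper function are hypothetical; the repository defines its actual rule with CDK constructs and may differ.

```typescript
// Minimal shape of the "CloudWatch Alarm State Change" event delivered by EventBridge.
// Only the fields used below are modelled; real events carry more.
interface AlarmStateChangeEvent {
  source: string
  'detail-type': string
  region: string
  detail: {
    alarmName: string
    state: { value: 'OK' | 'ALARM' | 'INSUFFICIENT_DATA' }
  }
}

// Hypothetical helper: true when the event says the given alarm has entered ALARM.
function shouldTriggerStateMachine(event: AlarmStateChangeEvent, alarmName: string): boolean {
  return (
    event.source === 'aws.cloudwatch' &&
    event['detail-type'] === 'CloudWatch Alarm State Change' &&
    event.detail.alarmName === alarmName &&
    event.detail.state.value === 'ALARM'
  )
}

// Example event (the alarm name is made up for illustration).
const sampleEvent: AlarmStateChangeEvent = {
  source: 'aws.cloudwatch',
  'detail-type': 'CloudWatch Alarm State Change',
  region: 'eu-west-1',
  detail: { alarmName: 'cpu-check-alarm', state: { value: 'ALARM' } },
}

console.log(shouldTriggerStateMachine(sampleEvent, 'cpu-check-alarm')) // prints true
```

Filtering on the `ALARM` state means the transition back to `OK` after a successful restart does not start another execution.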

Upon receiving the event object from the EventBridge rule, the state machine orchestrates the following workflow:

  1. Three checks run, one after another—URL check, Route 53 check, and SSH check.
  2. If all checks succeed during the first run, the execution ends silently (the ‘All good’ step followed by the ‘End’ field).
  3. When a check fails, the EC2 instance is restarted and we recommence from the beginning with a second run.
  4. If all checks succeed during the second run, a Slack notification is sent and the execution ends (the ‘Slack’ step followed by the ‘End’ field).
  5. When a check fails during the second run, an OpsGenie alert is created and the execution ends (the ‘OpsGenie’ step followed by the ‘End’ field).
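The five steps above can be sketched as ordinary control flow. This is a simplified, synchronous model for illustration only: in the real solution each check is a Lambda function and AWS Step Functions does the sequencing. The function and type names are assumptions, and the sketch assumes a run stops at the first failing check.

```typescript
// Each check is modelled as a synchronous function returning true on success.
// In the real solution these are Lambda functions invoked by Step Functions.
type Check = () => boolean
type Outcome = 'all-good' | 'slack' | 'opsgenie'

// Run the checks one after another and stop at the first failure.
function runChecks(checks: Check[]): boolean {
  return checks.every((check) => check())
}

// Mirrors the workflow: first run, optional restart, second run, notification.
function firstLineOfSupport(checks: Check[], restart: () => void): Outcome {
  if (runChecks(checks)) {
    return 'all-good' // step 2: nothing to do, end silently
  }
  restart() // step 3: a check failed, restart the instance
  return runChecks(checks)
    ? 'slack' // step 4: recovered after the restart, notify Slack
    : 'opsgenie' // step 5: still failing, escalate via OpsGenie
}

// Usage: an instance that is unreachable until it has been restarted.
let restarted = false
const checks: Check[] = [
  () => restarted, // URL check
  () => restarted, // Route 53 check
  () => restarted, // SSH check
]
console.log(firstLineOfSupport(checks, () => { restarted = true })) // prints "slack"
```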

Here’s the diagram depicting the complete solution:


Figure 1 – State machine.

All of the above-mentioned resources, plus the Lambda functions, an Amazon Simple Storage Service (Amazon S3) bucket for the Lambda code packages, and the necessary AWS Identity and Access Management (IAM) roles and policies are created and managed by AWS CDK and AWS SAM.

Furthermore, this solution can be deployed effortlessly to multiple regions using AWS CDK environments.

A Peek at the Code

A public repository is available on GitHub with a full working solution and a detailed README. I won’t dissect all of the code here, but let me draw your attention to some of the more interesting elements.

In the project’s root directory, we keep a ‘tsconfig.json’ file responsible for configuring the TypeScript compiler, and an ‘.eslintrc.json’ file holding the configuration for ESLint, a popular JavaScript linter.

These two configuration files serve the entire project since we use TypeScript for both the infrastructure and application layers.

AWS CDK’s support for many popular general-purpose languages (TypeScript, JavaScript, Python, Java, C#, and Go, which is in developer preview) enables and encourages the DevOps culture by making the end-to-end development experience more uniform, as you can use familiar tools and frameworks across your whole stack.

Now, let’s take a closer look at the ‘bin/cpu-check-cdk.ts’ file, the point of entry to our CDK app, whence all stacks are instantiated.

We import the necessary dependencies one library at a time, as AWS CDK v1 requires; in AWS CDK v2, the stable CDK libraries are consolidated in a single package, `aws-cdk-lib`.

```typescript
#!/usr/bin/env node
import 'source-map-support/register'
import * as cdk from '@aws-cdk/core'
import * as iam from '@aws-cdk/aws-iam'
```

Next, we check whether all of the necessary environment variables have been set.

```typescript
import { SLACK_TOKEN, SLACK_CHANNEL_ID, TEAM_ID, API_KEY, R53_HEALTH_CHECK_ID } from '../config'

if (!SLACK_TOKEN) {
  throw new Error('SLACK_TOKEN must be set!')
}
if (!SLACK_CHANNEL_ID) {
  throw new Error('SLACK_CHANNEL_ID must be set!')
}
if (!TEAM_ID) {
  throw new Error('TEAM_ID must be set!')
}
if (!API_KEY) {
  throw new Error('API_KEY must be set!')
}
if (!R53_HEALTH_CHECK_ID) {
  throw new Error('R53_HEALTH_CHECK_ID must be set!')
}
```
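If the list of required variables grows, the same validation could be expressed as a loop. The following is a minimal sketch, not code from the repository: `assertAllSet` is a hypothetical helper and the values are made up.

```typescript
// Hypothetical helper: throws if any of the given values is missing or empty.
function assertAllSet(vars: Record<string, string | undefined>): void {
  const missing = Object.keys(vars).filter((name) => !vars[name])
  if (missing.length > 0) {
    throw new Error(`${missing.join(', ')} must be set!`)
  }
}

// Usage with made-up values; in the project these would come from '../config'.
const exampleConfig = { SLACK_TOKEN: 'xoxb-example', SLACK_CHANNEL_ID: undefined }
try {
  assertAllSet(exampleConfig)
} catch (error) {
  console.log((error as Error).message) // prints "SLACK_CHANNEL_ID must be set!"
}
```

One trade-off: the explicit `if` checks above also narrow each constant from `string | undefined` to `string` for the TypeScript compiler, which a generic loop like this one does not.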

Then, we initialize the CDK app construct.

```typescript
const app = new cdk.App()
```

We grab the regions to deploy to, along with the corresponding instance IDs to monitor, from AWS CDK’s context.

```typescript
// Context values are plain JSON, so this is a plain object keyed by region, not a Map.
const regionInstanceMap: Record<string, string> = app.node.tryGetContext('regionInstanceMap')
```
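Because context values are untyped at runtime, it can help to validate their shape before looping over them. The sketch below assumes the context holds a JSON object mapping region names to instance IDs. `parseRegionInstanceMap` is a hypothetical helper, and the regions and instance IDs are invented.

```typescript
// Hypothetical validator for the value returned by app.node.tryGetContext('regionInstanceMap').
function parseRegionInstanceMap(raw: unknown): Record<string, string> {
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) {
    throw new Error('regionInstanceMap must be an object mapping regions to instance IDs')
  }
  const source = raw as Record<string, unknown>
  const result: Record<string, string> = {}
  for (const region of Object.keys(source)) {
    const instanceId = source[region]
    if (typeof instanceId !== 'string' || instanceId.length === 0) {
      throw new Error(`Instance ID for region ${region} must be a non-empty string`)
    }
    result[region] = instanceId
  }
  return result
}

// Example context value, as it might appear under "context" in cdk.json (IDs are made up).
const exampleContext: unknown = {
  'eu-west-1': 'i-0123456789abcdef0',
  'us-east-1': 'i-0fedcba9876543210',
}
console.log(Object.keys(parseRegionInstanceMap(exampleContext))) // prints [ 'eu-west-1', 'us-east-1' ]
```

Failing early here turns a missing or malformed context entry into a clear error at synthesis time instead of a confusing failure further down.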

Next, we create a tags object with the app’s version and repo’s URL taken directly from the package.json file.

```typescript
import { repository, version } from '../package.json'

const tags = {
  version,
  repositoryUrl: repository.url,
}
```

Finally, we loop through the map of regions and corresponding instance IDs we grabbed earlier.

In each region, we produce eight stacks: one for every Lambda function, one for the state machine, and one for the metric, alarm, and rule combo.

```typescript
import { StateMachineStack } from '../lib/state-machine-stack'
import { LambdaStack } from '../lib/lambda-stack'
import { MetricAlarmRuleStack } from '../lib/metric-alarm-rule-stack'

for (const [region, instanceId] of Object.entries(regionInstanceMap)) {
  const env = {
    region,
    account: process.env.CDK_DEFAULT_ACCOUNT,
  }

  const lambdaStackUrlHealthCheck = new LambdaStack(app, `LambdaStackUrlHealthCheck-${region}`, {
    tags,
    env,
    name: 'url-health-check',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: ['*'],
      actions: ['ec2:DescribeInstances'],
    },
  })

  const lambdaStackR53Check = new LambdaStack(app, `LambdaStackR53Check-${region}`, {
    tags,
    env,
    name: 'r53-check',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: [`arn:aws:route53:::healthcheck/${R53_HEALTH_CHECK_ID}`],
      actions: ['route53:GetHealthCheckStatus'],
    },
    environment: {
      R53_HEALTH_CHECK_ID,
    },
  })

  const lambdaStackSshCheck = new LambdaStack(app, `LambdaStackSshCheck-${region}`, {
    tags,
    env,
    name: 'ssh-check',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: ['*'],
      actions: ['ec2:DescribeInstances'],
    },
  })

  const lambdaStackRestartServer = new LambdaStack(app, `LambdaStackRestartServer-${region}`, {
    tags,
    env,
    name: 'restart-server',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: [`arn:aws:ec2:${region}:${process.env.CDK_DEFAULT_ACCOUNT}:instance/${instanceId}`],
      actions: ['ec2:RebootInstances'],
    },
  })

  const lambdaStackSlackNotification = new LambdaStack(app, `LambdaStackSlackNotification-${region}`, {
    tags,
    env,
    name: 'slack-notification',
    environment: {
      SLACK_TOKEN,
      SLACK_CHANNEL_ID,
    },
  })

  const lambdaStackOpsGenieNotification = new LambdaStack(app, `LambdaStackOpsGenieNotification-${region}`, {
    tags,
    env,
    name: 'opsgenie-notification',
    environment: {
      TEAM_ID,
      API_KEY,
      EU: 'true',
    },
  })

  const stateMachineStack = new StateMachineStack(app, `StateMachineStack-${region}`, {
    tags,
    env,
    urlHealthCheck: lambdaStackUrlHealthCheck.lambdaFunction,
    r53Check: lambdaStackR53Check.lambdaFunction,
    sshCheck: lambdaStackSshCheck.lambdaFunction,
    restartServer: lambdaStackRestartServer.lambdaFunction,
    slackNotification: lambdaStackSlackNotification.lambdaFunction,
    opsGenieNotification: lambdaStackOpsGenieNotification.lambdaFunction,
  })

  new MetricAlarmRuleStack(app, `MetricAlarmRuleStack-${region}`, {
    tags,
    env,
    instanceId,
    stateMachine: stateMachineStack.stateMachine,
  })
}
```

Thanks to a basic programming concept, the for loop, we saved ourselves from unnecessary duplication by keeping things DRY (Don’t Repeat Yourself).

Nice and easy, and all in one go, regardless of the number of regions to which we would want to deploy, and mind you, there are 25 available (with more to come).
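To make the scaling concrete, here is a small sketch that lists the stack IDs the loop above would generate. The prefixes are taken from the loop; `listStackIds` is a hypothetical helper and no CDK code runs here.

```typescript
// Stack ID prefixes used in the loop; one stack of each kind is created per region.
const stackPrefixes = [
  'LambdaStackUrlHealthCheck',
  'LambdaStackR53Check',
  'LambdaStackSshCheck',
  'LambdaStackRestartServer',
  'LambdaStackSlackNotification',
  'LambdaStackOpsGenieNotification',
  'StateMachineStack',
  'MetricAlarmRuleStack',
]

// Hypothetical helper: lists the stack IDs generated for the given regions.
function listStackIds(regions: string[]): string[] {
  const ids: string[] = []
  for (const region of regions) {
    for (const prefix of stackPrefixes) {
      ids.push(`${prefix}-${region}`)
    }
  }
  return ids
}

const stackIds = listStackIds(['eu-west-1', 'us-east-1'])
console.log(stackIds.length) // prints 16
console.log(stackIds[0]) // prints "LambdaStackUrlHealthCheck-eu-west-1"
```

Suffixing every ID with the region keeps stack names unique within the app, so adding a region only means adding one entry to the context map.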

I won’t be going through all of the AWS CDK and Lambda files in this post, though I strongly encourage you to give the code a thorough review.

Nevertheless, let’s see how easy it is to define a stack class in AWS CDK by looking at the `lib/lambda-stack.ts` file. First, we import the dependencies:

```typescript
import * as cdk from '@aws-cdk/core'
import * as iam from '@aws-cdk/aws-iam'
import * as lambda from '@aws-cdk/aws-lambda'
import { capitalizeAndRemoveDashes } from './helpers'
```

You’ll notice there is a helper function called `capitalizeAndRemoveDashes` amongst the CDK libs. Since AWS CDK uses a general-purpose language (GPL), we can introduce any amount of custom logic, as we could do with a _regular_ application.

The `lib/helpers.ts` file looks as follows:

```typescript
/**
 * Take a kebab-case string and turn it into a PascalCase string, e.g.: my-cool-function -> MyCoolFunction
 * 
 * @param kebab
 * @returns string
 */
export function capitalizeAndRemoveDashes(kebab: string): string {
  // Capitalize the first letter of every dash-separated part, then glue the parts together.
  return kebab
    .split('-')
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join('')
}
```

Next, we extend the default stack properties (such as `description`) with our own, setting some as mandatory and some as optional.

```typescript
interface LambdaStackProps extends cdk.StackProps {
  name: string,
  runtime?: lambda.Runtime,
  handler?: string,
  timeout?: cdk.Duration,
  pathToFunction?: string,
  policyStatementProps?: iam.PolicyStatementProps,
  environment?: {
    [key: string]: string
  },
}
```

We start a declaration of the `LambdaStack` class with a `lambdaFunction` read-only property and a constructor.

```typescript
export class LambdaStack extends cdk.Stack {
  readonly lambdaFunction: lambda.Function

  constructor(scope: cdk.Construct, id: string, props: LambdaStackProps) {
    super(scope, id, props)
```

We then create a resource name out of the mandatory name property that will be passed in during the class initialization.

```typescript
const resourceName = capitalizeAndRemoveDashes(props.name)
```

We create an IAM role that the Lambda service can assume. We attach the `service-role/AWSLambdaBasicExecutionRole` AWS managed policy to it and, if one is provided, add a custom policy statement to the role’s inline policy.

```typescript
const role = new iam.Role(this, `Role${resourceName}`, { assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com') })
role.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSLambdaBasicExecutionRole'))
if (props.policyStatementProps) {
  role.addToPolicy(new iam.PolicyStatement(props.policyStatementProps))
}
```

Next, we initialize a construct of the Lambda function using the role defined earlier and stack properties, or arbitrary defaults if stack properties were not provided.

```typescript
const lambdaFunction = new lambda.Function(this, `LambdaFunction${resourceName}`, {
  role,
  runtime: props.runtime || lambda.Runtime.NODEJS_12_X,
  handler: props.handler || 'app.handler',
  timeout: props.timeout || cdk.Duration.seconds(10),
  code: lambda.Code.fromAsset(`${props.pathToFunction || 'src'}/${props.name}`),
  environment: props.environment,
})
```
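The effect of these fallbacks can be seen in isolation with plain data. This is a sketch under stated assumptions: `FunctionSettings` and `resolveSettings` are hypothetical stand-ins that mirror the `props.x || default` pattern, with the timeout expressed as a number of seconds instead of a `cdk.Duration`.

```typescript
// Plain-data stand-in for the optional LambdaStack properties (no CDK types involved).
interface FunctionSettings {
  name: string
  handler?: string
  timeoutSeconds?: number
  pathToFunction?: string
}

// Hypothetical helper mirroring the `props.x || default` pattern.
function resolveSettings(props: FunctionSettings) {
  return {
    handler: props.handler || 'app.handler',
    timeoutSeconds: props.timeoutSeconds || 10,
    assetPath: `${props.pathToFunction || 'src'}/${props.name}`,
  }
}

console.log(resolveSettings({ name: 'url-health-check' }))
// handler 'app.handler', timeoutSeconds 10, assetPath 'src/url-health-check'
console.log(resolveSettings({ name: 'ssh-check', timeoutSeconds: 30 }).timeoutSeconds) // prints 30
```

Note that `||` falls back on any falsy value, including `0` and the empty string; the nullish coalescing operator `??` would fall back only on `null` and `undefined`.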

Finally, we expose the Lambda function object as the class’s read-only property we defined earlier. We’re also sure to close our brackets to avoid the implosion of the universe.

```typescript
    this.lambdaFunction = lambdaFunction
  }
}
```

Conclusion

In this post, I showed how our team at Chaos Gears put together a serverless application running AWS Lambda under the baton of AWS Step Functions to relieve our customer’s engineers from some of their operational burdens. This enables them to focus more on evolving their business.

The approach described here can be adapted to serve other needs or cover different cases, as AWS Step Functions’ visual workflows allow for a quick translation of business requirements to technical ones.

By using AWS CDK as the infrastructure as code (IaC) tool, we were able to write all of the code in TypeScript, which puts us in an excellent position for future improvements.

We avoided the trap of introducing unnecessary complexity and kept things concise, with a codebase that is approachable and comprehensible to all team members.

Check out the GitHub repository and visit the Chaos Gears website to learn more about collaborating with us.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this post.



Chaos Gears – AWS Partner Spotlight

Chaos Gears is an AWS Partner that helps customers and companies of all sizes utilize AWS to its full potential so they can focus on evolving their business.
