AWS Developer Tools Blog

Testing infrastructure with the AWS Cloud Development Kit (CDK)

The AWS Cloud Development Kit (CDK) allows you to describe your application’s infrastructure using a general-purpose programming language, such as TypeScript, JavaScript or Python. This opens up familiar avenues for working with your infrastructure, such as using your favorite IDE, getting the benefit of autocomplete, creating abstractions in a familiar way, distributing them using your ecosystem’s standard package manager, and of course: writing tests for your infrastructure like you would write tests for your application.

In this blog post you will learn how to write tests for your infrastructure code in TypeScript using Jest. The code for JavaScript will be the same (sans the types), while the code for Python would follow the same testing patterns. Unfortunately, there are no ready-made Python libraries for you to use yet.

Approach

The pattern for writing tests for infrastructure is very similar to how you would write them for application code: you define a test case as you would normally do in the test framework of your choice. Inside that test case you instantiate constructs as you would do in your CDK app, and then you make assertions about the AWS CloudFormation template that the code you wrote would generate.

The one thing that differs from normal tests is the kind of assertion you write. The TypeScript CDK ships with an assertion library (@aws-cdk/assert) that makes it easy to make assertions on your infrastructure. In fact, all of the constructs in the AWS Construct Library that ship with the CDK are tested in this way, so we can make sure they do, and keep on doing, what they are supposed to do. Our assertion library is currently only available to TypeScript and JavaScript users, but will be made available to users of other languages eventually.

Broadly, there are a few classes of tests you will be writing:

  • Snapshot tests (also known as “golden master” tests). Using Jest, these are very convenient to write. They assert that the CloudFormation template the code generates is the same as it was when the test was written. If anything changes, the test framework will show you the changes in a diff. If the changes were accidental, you’ll go and update the code until the test passes again, and if the changes were intentional, you’ll have the option to accept the new template as the new “golden master”.
    • In the CDK itself, we also use snapshot tests as “integration tests”. Rather than individual unit tests that only look at the CloudFormation template output, we write a larger application using CDK constructs, deploy it and verify that it works as intended. We then make a snapshot of the CloudFormation template, which forces us to re-deploy and re-test the deployment if the generated template starts to deviate from the snapshot.
  • Fine-grained assertions about the template. Snapshot tests are convenient and fast to write, and provide a baseline level of assurance that your code changes did not change the generated template. The trouble starts when you purposely introduce changes. Let’s say you have a snapshot test to verify output for feature A, and you now add a feature B to your construct. This changes the generated template, and your snapshot test will break, even though feature A still works as intended. The snapshot can’t tell which part of the template is relevant to feature A and which part is relevant to feature B. To combat this, you can also write more fine-grained assertions, such as “this resource has this property” (and I don’t care about any of the others).
  • Validation tests. One of the advantages of general-purpose programming languages is that we can add additional validation checks and error out early, saving the construct user some trial-and-error time. You would test those by using the construct in an invalid way and asserting that an error is raised.
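
To make the snapshot idea concrete, here is a minimal sketch of what a golden-master comparison boils down to, written in plain TypeScript without Jest or the CDK. The names Template and matchesSnapshot and the sample templates are hypothetical and exist only for this illustration; Jest's real toMatchSnapshot() does considerably more, such as storing the snapshot on disk and printing a diff.

```typescript
// Hypothetical stand-in for a synthesized CloudFormation template.
interface Template {
  Resources: Record<string, { Type: string; Properties?: Record<string, unknown> }>;
}

// A snapshot test passes only if the serialized template is identical
// to the stored "golden master". Any difference at all makes it fail.
function matchesSnapshot(golden: string, current: Template): boolean {
  return golden === JSON.stringify(current, null, 2);
}

// The template as it looked when the snapshot was recorded.
const original: Template = {
  Resources: { Queue: { Type: 'AWS::SQS::Queue' } },
};
const golden = JSON.stringify(original, null, 2);

// The same template after an unrelated property was added.
const changed: Template = {
  Resources: {
    Queue: { Type: 'AWS::SQS::Queue', Properties: { MessageRetentionPeriod: 1209600 } },
  },
};

console.log(matchesSnapshot(golden, original)); // true
console.log(matchesSnapshot(golden, changed));  // false
```

The second comparison fails even though nothing about the queue's type changed, which is exactly the sensitivity that makes snapshots good at catching accidental change and noisy under intentional change.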

An example: a dead letter queue

Let’s say you want to write a DeadLetterQueue construct. A dead letter queue is used to hold another queue’s messages if they fail delivery too many times. It’s generally bad news if messages end up in the dead letter queue, because it indicates something is wrong with the queue processor. To that end, your DeadLetterQueue will come with an alarm that fires if there are any items in the queue. It is up to the user of the construct to attach any actions to the alarm firing, such as notifying an SNS topic.

Start by creating an empty construct library project using the CDK CLI and install some of the construct libraries we’ll need:

$ cdk init --language=typescript lib
$ npm install @aws-cdk/aws-sqs @aws-cdk/aws-cloudwatch

The CDK code might look like this (put this in a file called lib/dead-letter-queue.ts):

import cloudwatch = require('@aws-cdk/aws-cloudwatch');
import sqs = require('@aws-cdk/aws-sqs');
import { Construct, Duration } from '@aws-cdk/core';

export class DeadLetterQueue extends sqs.Queue {
  public readonly messagesInQueueAlarm: cloudwatch.IAlarm;

  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Add the alarm
    this.messagesInQueueAlarm = new cloudwatch.Alarm(this, 'Alarm', {
      alarmDescription: 'There are messages in the Dead Letter Queue',
      evaluationPeriods: 1,
      threshold: 1,
      metric: this.metricApproximateNumberOfMessagesVisible(),
    });
  }
}

Writing a test

You’re going to write a test for this construct. First, start off by installing Jest and the CDK assertion library:

$ npm install --save-dev jest @types/jest @aws-cdk/assert

You also have to edit the package.json file in your project to tell NPM to run Jest, and tell Jest what kind of files to collect:

{
  ...
  "scripts": {
    ...
    "test": "jest"
  },
  "devDependencies": {
    ...
    "@types/jest": "^24.0.18",
    "jest": "^24.9.0"
  },
  "jest": {
    "moduleFileExtensions": ["js"]
  }
}

You can now write a test. A good place to start is checking that the construct creates its alarm. The simplest kind of test you can write is a snapshot test, so start with that. Put the following in a file named test/dead-letter-queue.test.ts:

import { SynthUtils } from '@aws-cdk/assert';
import { Stack } from '@aws-cdk/core';

import dlq = require('../lib/dead-letter-queue');

test('dlq creates an alarm', () => {
  const stack = new Stack();
  new dlq.DeadLetterQueue(stack, 'DLQ');
  expect(SynthUtils.toCloudFormation(stack)).toMatchSnapshot();
});

You can now compile and run the test:

$ npm run build
$ npm test

Jest will run your test and tell you that it has recorded a snapshot from your test.

PASS  test/dead-letter-queue.test.js
 ✓ dlq creates an alarm (55ms)
 › 1 snapshot written.
Snapshot Summary
› 1 snapshot written

The snapshots are stored in a directory called __snapshots__. If you look at the snapshot, you’ll see it just contains a copy of the CloudFormation template that our stack would generate:

exports[`dlq creates an alarm 1`] = `
Object {
  "Resources": Object {
    "DLQ581697C4": Object {
      "Type": "AWS::SQS::Queue",
    },
    "DLQAlarm008FBE3A": Object {
      "Properties": Object {
        "AlarmDescription": "There are messages in the Dead Letter Queue",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
...

Congratulations! You’ve written and run your first test. Don’t forget to commit the snapshots directory to version control so that the snapshot gets stored and versioned with your code.

Using the snapshot

To confirm that the test works, break it deliberately and check that the breakage is detected. In your dead-letter-queue.ts file, change the cloudwatch.Alarm period to 1 minute (instead of the default of 5 minutes) by adding a period argument:

this.messagesInQueueAlarm = new cloudwatch.Alarm(this, 'Alarm', {
  // ...
  period: Duration.minutes(1),
});

If you now build and run the test again, Jest will tell you that the template changed:

$ npm run build && npm test

FAIL test/dead-letter-queue.test.js
✕ dlq creates an alarm (58ms)

● dlq creates an alarm

expect(received).toMatchSnapshot()

Snapshot name: `dlq creates an alarm 1`

- Snapshot
+ Received

@@ -19,11 +19,11 @@
               },
             ],
             "EvaluationPeriods": 1,
             "MetricName": "ApproximateNumberOfMessagesVisible",
             "Namespace": "AWS/SQS",
     -       "Period": 300,
     +       "Period": 60,
             "Statistic": "Maximum",
             "Threshold": 1,
           },
           "Type": "AWS::CloudWatch::Alarm",
         },

 › 1 snapshot failed.
Snapshot Summary
 › 1 snapshot failed from 1 test suite. Inspect your code changes or run `npm test -- -u` to update them.

Jest is telling you that the change you just made changed the emitted Period attribute from 300 to 60. You now have the choice of undoing your code change if this result was accidental, or committing to the new snapshot if you intended to make this change. To commit to the new snapshot, run:

$ npm test -- -u

Jest will tell you that it updated the snapshot. You’ve now locked in the new alarm period:

PASS  test/dead-letter-queue.test.js
 ✓ dlq creates an alarm (51ms)

 › 1 snapshot updated.
Snapshot Summary
 › 1 snapshot updated

Dealing with change

Let’s return to the DeadLetterQueue construct. Messages go to the dead letter queue when something is wrong with the primary queue processor, and you are notified via an alarm. After you fix the problem with the queue processor, you’ll usually want to redrive the messages from the dead letter queue, back to the primary queue, to have them processed as usual.

Messages only exist in a queue for a limited time though. To give yourself the greatest chance of recovering the messages from the dead letter queue, set the lifetime of messages in the dead letter queue (called the retention period) to the maximum time of 2 weeks. You make the following changes to your DeadLetterQueue construct:

export class DeadLetterQueue extends sqs.Queue {
  constructor(scope: Construct, id: string) {
    super(scope, id, {
      // Maximum retention period
      retentionPeriod: Duration.days(14)
    });
    // ...
  }
}

Now run the tests again:

$ npm run build && npm test
FAIL test/dead-letter-queue.test.js
✕ dlq creates an alarm (79ms)

    ● dlq creates an alarm

    expect(received).toMatchSnapshot()

    Snapshot name: `dlq creates an alarm 1`

    - Snapshot
    + Received

    @@ -1,8 +1,11 @@
      Object {
        "Resources": Object {
          "DLQ581697C4": Object {
    +       "Properties": Object {
    +         "MessageRetentionPeriod": 1209600,
    +       },
            "Type": "AWS::SQS::Queue",
          },
          "DLQAlarm008FBE3A": Object {
            "Properties": Object {
              "AlarmDescription": "There are messages in the Dead Letter Queue",

  › 1 snapshot failed.
Snapshot Summary
  › 1 snapshot failed from 1 test suite. Inspect your code changes or run `npm test -- -u` to update them.

The snapshot test broke again, because you added a retention period property. Even though the test was only intended to make sure that the DeadLetterQueue construct created an alarm, it was inadvertently also testing that the queue was created with default options.

Writing fine-grained assertions on resources

Snapshot tests are convenient to write and have their place for detecting accidental change. We use them in the CDK for our integration tests when validating larger bits of functionality all together. If a change causes an integration test’s template to deviate from its snapshot, we use that as a trigger to tell us we need to do extra validation, for example actually deploying the template through AWS CloudFormation and verifying our infrastructure still works.

In the CDK’s extensive suite of unit tests, we don’t want to revisit all the tests any time we make a change. To avoid this, we use the custom assertions in the @aws-cdk/assert/jest module to write fine-grained tests that verify only part of the construct’s behavior at a time, i.e. only the part we’re interested in for that particular test. For example, the test called “dlq creates an alarm” should assert that an alarm gets created with the appropriate metric, and it should not make any assertions on the properties of the queue that gets created as part of that test.

To write this test, look at the AWS::CloudWatch::Alarm resource specification in CloudFormation and decide which properties and values you want the assertion library to guarantee. In this case, you’re interested in the properties Namespace, MetricName and Dimensions. You can use the expect(stack).toHaveResource(...) assertion to make sure those have the values you want. To get access to that assertion, you’ll first need to import @aws-cdk/assert/jest, which extends the assertions that are available when you type expect(…). Putting this all together, your test should look like this:

import '@aws-cdk/assert/jest';

// ...
test('dlq creates an alarm', () => {
  const stack = new Stack();

  new dlq.DeadLetterQueue(stack, 'DLQ');

  expect(stack).toHaveResource('AWS::CloudWatch::Alarm', {
    MetricName: "ApproximateNumberOfMessagesVisible",
    Namespace: "AWS/SQS",
    Dimensions: [
      {
        Name: "QueueName",
        Value: { "Fn::GetAtt": [ "DLQ581697C4", "QueueName" ] }
      }
    ],
  });
});

This test asserts that an Alarm is created on the ApproximateNumberOfMessagesVisible metric of the dead letter queue (by means of the { Fn::GetAtt } intrinsic). If you run Jest now, it will warn you about an existing snapshot that your test no longer uses, so get rid of it by running npm test -- -u.
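
As a rough mental model of what a “has resource” assertion checks, the sketch below implements a simplified partial match in plain TypeScript. The helpers deepEqual and hasResource are hypothetical and written only for this illustration; they are not the implementation used by @aws-cdk/assert, whose matching rules may differ in detail. The sample template is abbreviated and assumed.

```typescript
type Props = Record<string, unknown>;
interface Resource { Type: string; Properties?: Props }
interface Template { Resources: Record<string, Resource> }

// Structural equality for JSON-like values (objects, arrays, primitives).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) { return true; }
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) { return false; }
  if (Array.isArray(a) !== Array.isArray(b)) { return false; }
  const aObj = a as Record<string, unknown>;
  const bObj = b as Record<string, unknown>;
  const aKeys = Object.keys(aObj);
  const bKeys = Object.keys(bObj);
  return aKeys.length === bKeys.length
    && aKeys.every((k) => k in bObj && deepEqual(aObj[k], bObj[k]));
}

// True if some resource of the given type has every expected property.
// Properties that are not listed in `expected` are ignored.
function hasResource(template: Template, type: string, expected: Props): boolean {
  return Object.keys(template.Resources).some((logicalId) => {
    const resource = template.Resources[logicalId];
    if (resource.Type !== type) { return false; }
    const actual: Props = resource.Properties || {};
    return Object.keys(expected).every((key) => deepEqual(actual[key], expected[key]));
  });
}

// Abbreviated, assumed template for illustration.
const template: Template = {
  Resources: {
    DLQ581697C4: { Type: 'AWS::SQS::Queue', Properties: { MessageRetentionPeriod: 1209600 } },
    DLQAlarm008FBE3A: {
      Type: 'AWS::CloudWatch::Alarm',
      Properties: { MetricName: 'ApproximateNumberOfMessagesVisible', Namespace: 'AWS/SQS', Period: 60 },
    },
  },
};

const alarmOk = hasResource(template, 'AWS::CloudWatch::Alarm', { Namespace: 'AWS/SQS' });
const wrongPeriod = hasResource(template, 'AWS::CloudWatch::Alarm', { Period: 300 });
console.log(alarmOk, wrongPeriod); // true false
```

Because only the listed properties are compared, adding or changing properties on the queue does not affect an assertion about the alarm.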

You can now add a second test for the retention period:

test('dlq has maximum retention period', () => {
  const stack = new Stack();

  new dlq.DeadLetterQueue(stack, 'DLQ');

  expect(stack).toHaveResource('AWS::SQS::Queue', {
    MessageRetentionPeriod: 1209600
  });
});
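
The expected value 1209600 is the 14-day retention period expressed in seconds, which is the unit the MessageRetentionPeriod property uses in the template. A small sketch of the arithmetic, using a hypothetical daysToSeconds helper that is not part of the CDK:

```typescript
// Hypothetical helper: converts whole days to seconds.
function daysToSeconds(days: number): number {
  const secondsPerDay = 24 * 60 * 60; // 86400
  return days * secondsPerDay;
}

console.log(daysToSeconds(14)); // 1209600
console.log(daysToSeconds(7));  // 604800
```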

Run the tests to make sure everything passes:

$ npm run build && npm test
 
PASS  test/dead-letter-queue.test.js
  ✓ dlq creates an alarm (48ms)
  ✓ dlq has maximum retention period (15ms)

Test Suites: 1 passed, 1 total
Tests:       2 passed, 2 total

It does!

Validating construct configuration

Maybe you want to make the retention period configurable, while validating that the user-provided value falls into an acceptable range. You’d create a Props interface for the construct and add a check on the allowed values that your construct will accept:

export interface DeadLetterQueueProps {
    /**
     * The number of days messages will live in the dead letter queue
     *
     * Cannot exceed 14 days.
     *
     * @default 14
     */
    retentionDays?: number;
}

export class DeadLetterQueue extends sqs.Queue {
  public readonly messagesInQueueAlarm: cloudwatch.IAlarm;

  constructor(scope: Construct, id: string, props: DeadLetterQueueProps = {}) {
    if (props.retentionDays !== undefined && props.retentionDays > 14) {
      throw new Error('retentionDays may not exceed 14 days');
    }

    super(scope, id, {
        // Given retention period or maximum
        retentionPeriod: Duration.days(props.retentionDays || 14)
    });
    // ...
  }
}
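
The validation and defaulting logic can be read in isolation from the CDK. The sketch below pulls it into a hypothetical standalone function, resolveRetentionDays, purely for illustration; the construct above keeps this logic inline in its constructor. The throwsFor helper is likewise an assumption added only to demonstrate the error case.

```typescript
interface DeadLetterQueueProps {
  retentionDays?: number;
}

// Hypothetical helper mirroring the constructor's check and default.
function resolveRetentionDays(props: DeadLetterQueueProps = {}): number {
  if (props.retentionDays !== undefined && props.retentionDays > 14) {
    throw new Error('retentionDays may not exceed 14 days');
  }
  // Note: `||` falls back to 14 for undefined and also for 0.
  return props.retentionDays || 14;
}

// Hypothetical helper: reports whether the given value is rejected.
function throwsFor(days: number): boolean {
  try {
    resolveRetentionDays({ retentionDays: days });
    return false;
  } catch (e) {
    return true;
  }
}

console.log(resolveRetentionDays());                     // 14
console.log(resolveRetentionDays({ retentionDays: 7 })); // 7
console.log(throwsFor(15));                              // true
```

Running the check before any resources are created means an invalid value fails fast, at construction time, rather than during deployment.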

To test that your new feature actually does what you expect, you’ll write two tests:

  • One that checks a configured value ends up in the template; and
  • One that supplies an incorrect value to the construct and checks that you get the error you’re expecting.

test('retention period can be configured', () => {
  const stack = new Stack();

  new dlq.DeadLetterQueue(stack, 'DLQ', {
    retentionDays: 7
  });

  expect(stack).toHaveResource('AWS::SQS::Queue', {
    MessageRetentionPeriod: 604800
  });
});

test('configurable retention period cannot exceed 14 days', () => {
  const stack = new Stack();

  expect(() => {
    new dlq.DeadLetterQueue(stack, 'DLQ', {
      retentionDays: 15
    });
  }).toThrowError(/retentionDays may not exceed 14 days/);
});

Run the tests to confirm:

$ npm run build && npm test

PASS  test/dead-letter-queue.test.js
  ✓ dlq creates an alarm (62ms)
  ✓ dlq has maximum retention period (14ms)
  ✓ retention period can be configured (18ms)
  ✓ configurable retention period cannot exceed 14 days (1ms)

Test Suites: 1 passed, 1 total
Tests:       4 passed, 4 total

You’ve confirmed that your feature works, and that you’re correctly validating the user’s input.

As a bonus: you know from your previous tests still passing that you didn’t change any of the behavior when the user does not specify any arguments, which is great news!

Conclusion

You’ve written a reusable construct, and covered its features with resource assertion and validation tests. Regardless of whether you’re planning on writing tests on your own infrastructure application, on your own reusable constructs, or whether you’re planning to contribute to the CDK on GitHub, I hope this blog post has given you some mental tools for thinking about testing your infrastructure code.

Finally, two values I’d like to instill in you when you are writing tests:

  • Treat test code like you would treat application code. Test code is going to have an equally long lifetime in your code as regular code, and is equally subject to change. Don’t copy/paste setup lines or common assertions all over the place; take some extra time to factor out commonalities into helper functions. Your future self will thank you.
  • Don’t assert too much in one test. Preferably, a test should test one and only one behavior. If you accidentally break that behavior, you would prefer exactly one test to fail, and the test name will tell you exactly what you broke. There’s nothing worse than changing something trivial and having dozens of tests fail and need to be updated because they were accidentally asserting some behavior other than what the test was for. This does mean that—regardless of how convenient they are—you should be using snapshot tests sparingly, as all snapshot tests are going to fail if literally anything about the construct behavior changes, and you’re going to have to go back and scrutinize all failures to make sure nothing accidentally slipped by.

Happy testing!