Architecting for resiliency on AWS App Runner

AWS App Runner is one of the simplest ways to run your containerized web applications and APIs on AWS. App Runner abstracts away the cloud resources needed for running your web application or API, including load balancers, TLS certificates, auto scaling, observability (logs, metrics, and tracing), and the underlying compute resources. With App Runner, you can start with source code or a container image.

App Runner is a regional service, which means that when you use it, you benefit from the availability and fault tolerance mechanisms that AWS offers. For many applications, using App Runner in a single region with its automatic use of multiple Availability Zones provides a good balance of availability, simplicity, and affordability. However, some use cases require expanding to additional regions in an active-active or active-standby disaster recovery configuration to meet resiliency requirements.

In this post, we show how to architect your web application on App Runner in two regions in an active-active configuration using Amazon Route 53 to manage the traffic across the regions. The goal of this architecture is to enable your application to continue serving its users, even in the rare event of an issue in a particular region.

Architecture

The following diagram shows the high-level architecture.

Diagram showing the high-level architecture of the solution described in the post

We use a sample application that is composed of a Next.js frontend and a Go API server hosted in an App Runner service backed by an Amazon DynamoDB table. App Runner runs containers, which means that you can package and run a large variety of types of HTTP (request/response) applications on App Runner, such as a REST API written in Node Express, a web app written in Python Flask, or a web app written in Java Spring Boot, just to name a few possibilities.

The most challenging component to architect in a multi-region configuration is typically the database tier. In our example, we’ll use DynamoDB global tables, which provide a fully managed, multi-region, multi-active database that works well for high performance and massive scale.

When building a solution in AWS, we recommend you use infrastructure as code (IaC), which provides many of the benefits mentioned in this whitepaper. In this post, we use Terraform, an open-source IaC tool; however, other tools, such as AWS CloudFormation or AWS Cloud Development Kit (AWS CDK), would also work for this project. Note that the common code is organized in a shared Terraform module that is referenced by both regional modules: us-east-1 (N. Virginia, Region A) and us-east-2 (Ohio, Region B).

App Runner

The first thing you need to provision for this architecture is a container repository that hosts the application container images that App Runner will run. You can use the registry settings of Amazon Elastic Container Registry (Amazon ECR) to configure private image cross-region replication at the registry level. This means that whenever you push an image to any repository in Region A, Amazon ECR will automatically replicate the image to Region B. The following is an example of this using Terraform:

resource "aws_ecr_repository" "main" {
  name                 = var.app
  image_tag_mutability = "IMMUTABLE"
  tags                 = var.tags

  image_scanning_configuration {
    scan_on_push = var.registry_scanning
  }
}

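# look up the current account ID, used below as the replication destination registry
data "aws_caller_identity" "current" {}

# cross-region replication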
resource "aws_ecr_replication_configuration" "main" {
  replication_configuration {
    rule {
      destination {
        registry_id = data.aws_caller_identity.current.account_id
        region      = var.ecr_replication_destination_region
      }
    }
  }
}

Next, provision two App Runner services, one in Region A, and one in Region B. These App Runner services will point to the Amazon ECR repository images in their respective regions. The following provides a Terraform code snippet from our shared Terraform module. Notice the simplicity of this App Runner service resource and just how few parameters are needed to get a secure and scalable web application up and running:

resource "aws_apprunner_service" "main" {
  service_name = var.app
  tags         = var.tags

  source_configuration {
    auto_deployments_enabled = false

    image_repository {
      image_repository_type = "ECR"
      image_identifier      = var.image
      image_configuration {
        port = var.port
        runtime_environment_variables = {
          DYNAMO_TABLE = var.dynamo_table
        }
      }
    }

    authentication_configuration {
      access_role_arn = aws_iam_role.access.arn
    }
  }

  instance_configuration {
    instance_role_arn = aws_iam_role.instance.arn
    cpu               = var.cpu
    memory            = var.memory
  }

  health_check_configuration {
    protocol = var.health_check_protocol
    path     = var.health_check_path
  }
}

Ingress

App Runner provides a default HTTPS endpoint on the awsapprunner.com domain, like https://{random}.{region}.awsapprunner.com. To route traffic across both regions, you can create an App Runner custom domain in both services that points to the same global domain name. This effectively adds the global service domain (such as app.example.com) as a Subject Alternative Name (SAN) on the certificate that App Runner manages for you, which allows web browsers to perform secure TLS negotiation (HTTPS) with the App Runner services in either region. With Terraform, this is as simple as declaring an App Runner custom domain association resource:

resource "aws_apprunner_custom_domain_association" "multi_region" {
  domain_name = "${var.app_sub_domain}.${data.aws_route53_zone.main.name}"
  service_arn = aws_apprunner_service.main.arn
}

Health

With the App Runner custom domains in place, you can create Route 53 health checks for both regional App Runner endpoints along with CNAME records using a Route 53 weighted routing policy. Weighted routing allows you to associate both regional domains (such as us-east-1.example.com and us-east-2.example.com) with your single domain name (app.example.com) and set how much traffic is routed to each endpoint. In the example, choose 128 as the weight so that about 50 percent of the traffic goes to Region A and 50 percent goes to Region B. There are also other types of Route 53 routing policies that you can choose, including Geolocation, which uses your users’ locations, and Latency, which lets you minimize round-trip time.

Here’s an example of how to set up your health check and routing policy using Terraform:

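# Note: aws_apprunner_custom_domain_association.main is assumed to be a second
# association for this region's own domain, declared elsewhere in the module.
resource "aws_route53_health_check" "main" {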
  fqdn              = aws_apprunner_custom_domain_association.main.domain_name
  resource_path     = var.health_check
  type              = "HTTPS"
  port              = 443
  failure_threshold = 5
  request_interval  = 30

  tags = {
    Name = aws_apprunner_custom_domain_association.main.domain_name
  }
}

# route 50% of the requests for the shared multi-region domain to this region's endpoint
resource "aws_route53_record" "weighted" {
  zone_id         = data.aws_route53_zone.main.zone_id
  name            = aws_apprunner_custom_domain_association.multi_region.domain_name
  records         = [aws_apprunner_custom_domain_association.main.domain_name]
  set_identifier  = var.region
  type            = "CNAME"
  ttl             = 60
  health_check_id = aws_route53_health_check.main.id

  weighted_routing_policy {
    weight = 128
  }
}
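
As a quick illustration of how the weights translate into traffic, Route 53 answers with each weighted record in proportion to its weight divided by the sum of the weights of all records that share the same name and type. The following is a minimal arithmetic sketch only; the function name traffic_share_percent is hypothetical and is not part of the sample repository.

```bash
# Hypothetical helper: approximate percentage of DNS responses that a
# weighted record receives, computed as weight / (sum of all weights).
# Uses integer arithmetic, so results are rounded down.
traffic_share_percent() {
  local weight="$1" total="$2"
  if [ "$total" -eq 0 ]; then
    echo 0   # guard against division by zero; the all-zero-weights case is not modeled here
    return
  fi
  echo $(( weight * 100 / total ))
}

# Two regions, each with a weight of 128:
traffic_share_percent 128 $(( 128 + 128 ))   # prints 50
```

Any pair of equal weights yields the same 50/50 split; unequal weights can be used to shift a larger share of traffic to one region.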

Behind the scenes, Route 53 health checks use 16 global health checkers that continuously verify that your regional App Runner services are healthy. At least 3 of the 16 checkers (just over 18 percent), located in different regions, must report the endpoint as healthy for Route 53 to consider it healthy. If one of your regional endpoints becomes unhealthy, Route 53 stops including it in its DNS query responses. As a result, traffic shifts to the other region, where App Runner automatically scales out the number of containers needed to handle the increase.

The health checks follow the constant work pattern, which can help increase resiliency. However, they also prevent App Runner from scaling active container instances down to zero, which means that you are charged for a small amount of vCPU time in addition to memory, rather than for memory alone. The following image shows what the health checks look like in the AWS console after provisioning the resources with Terraform:

Route 53 health checks in the AWS console after provisioning with Terraform
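
The same information can also be inspected from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function names are hypothetical, and two details are assumptions rather than something taken from this post: that successful observations carry a status string beginning with "Success", and that the healthy threshold is "more than 18 percent of checkers".

```bash
# Sketch: print "<healthy> <total>" for the checkers currently reporting
# on a Route 53 health check. The health check ID is supplied by the caller.
healthy_checker_counts() {
  local health_check_id="$1" counts
  counts=$(aws route53 get-health-check-status \
    --health-check-id "$health_check_id" \
    --query "[length(HealthCheckObservations[?starts_with(StatusReport.Status, 'Success')]), length(HealthCheckObservations)]" \
    --output text) || return 1
  # Normalize the tab-separated CLI output to "healthy total".
  set -- $counts
  echo "$1 $2"
}

# Assumed rule: the endpoint counts as healthy when more than 18 percent
# of the checkers report success (3 of 16 passes, 2 of 16 does not).
meets_healthy_threshold() {
  local healthy="$1" total="$2"
  [ "$total" -gt 0 ] && [ $(( healthy * 100 )) -gt $(( total * 18 )) ]
}
```

For example, passing the output of healthy_checker_counts to meets_healthy_threshold gives a rough local view of whether Route 53 would currently keep the endpoint in rotation.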

Route 53 data plane

It’s important to understand that the Route 53 health checks described above run in the globally distributed Route 53 data plane, which is designed for a 100 percent availability service-level agreement (SLA). When you are architecting for multi-region, avoid taking any dependencies on the Route 53 control plane, which runs in a single region. An example of such a dependency would be relying on a Route 53 API call (through the AWS console or a script) to make a DNS change during a failover. Because the health checks and weighted routing operate entirely in the data plane, Route 53 can still shift traffic to Region B on its own, even if an issue in Region A also affects the control plane. This provides the resiliency required by our application.

DynamoDB global tables

With the application container, which serves the frontend and API, now distributed across different regions, we turn to the database tier. To replicate the application data across regions, you can use DynamoDB global tables. A replica table is a single DynamoDB table that functions as part of a global table, and each replica stores the same set of data items. When your application writes data to a replica table in either Region A or Region B, DynamoDB automatically propagates the write to the replica table in the other region. By default, DynamoDB global tables resolve conflicting concurrent writes using a last writer wins strategy, which is acceptable for the sample application.

Note that our App Runner services, by default, run in an AWS managed Amazon Virtual Private Cloud (Amazon VPC) and communicate with DynamoDB over the public internet, encrypted using HTTPS. This is acceptable for many customers. However, if you need to keep the database traffic on the AWS network, you can associate the App Runner services with your own Amazon VPC and then configure an Amazon VPC endpoint to route the traffic to DynamoDB.

The following code snippet shows how to provision the DynamoDB global table using Terraform:

resource "aws_dynamodb_table" "main" {
  name             = var.app
  hash_key         = "ID"
  billing_mode     = "PAY_PER_REQUEST"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
  tags             = var.tags

  attribute {
    name = "ID"
    type = "S"
  }

  replica {
    region_name = var.region_alternate
  }
}

Provisioning

Now that we have reviewed the architectural components of the application, we describe how the automated provisioning mechanism works. We’ve provided a bash script that performs the following initial, one-time set of actions. After cloning the Git repository, you can simply run make init to run the script.

1. Use Terraform to provision an Amazon ECR repository and DynamoDB global table in Region A.
2. Use the Docker command line interface (CLI) to build and push our application image to Amazon ECR.
3. Use Terraform to provision the App Runner resources in Region A.
4. Wait a few seconds for Amazon ECR cross-region replication to replicate our image.
5. Use Terraform to provision the App Runner resources in Region B, pointed at the replicated Amazon ECR repo.

The following are some key snippets from the provisioning bash script. Note that the base Terraform module outputs the URLs of the Amazon ECR repositories in both regions, which are then used to assemble a full container image name including the region-specific registry hostname, the repository, and a version used as the tag:

echo "Provisioning ECR repositories and Dynamo tables"
cd iac/base
terraform init
terraform apply -auto-approve
 
# get the primary and replicated repo urls
repo=$(terraform output -raw ecr_repo_url)
repo_replicated=$(terraform output -raw ecr_repo_url_replicated)
 
echo "Provisioning resources in Region A"
cd ../region-a
terraform init
image=${repo}:${version}
terraform apply -var="image=${image}" -auto-approve
 
echo "Provisioning resources in Region B"
cd ../region-b
terraform init
image=${repo_replicated}:${version}
terraform apply -var="image=${image}" -auto-approve
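
The "wait a few seconds" step for Amazon ECR replication could be replaced with a polling loop instead of a fixed sleep. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function name, its parameters, and the default attempt count and delay are hypothetical and are not taken from the provided script.

```bash
# Sketch: poll the destination region until the replicated image tag is visible.
# Arguments: repository name, image tag, destination region,
#            optional max attempts (default 30), optional delay in seconds (default 5).
wait_for_replicated_image() {
  local repo_name="$1" tag="$2" region="$3"
  local max_attempts="${4:-30}" delay_seconds="${5:-5}" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # describe-images exits non-zero until the repository and tag exist in the region
    if aws ecr describe-images \
         --repository-name "$repo_name" \
         --image-ids "imageTag=$tag" \
         --region "$region" >/dev/null 2>&1; then
      echo "Image $repo_name:$tag is available in $region"
      return 0
    fi
    echo "Waiting for replication to $region (attempt $attempt of $max_attempts)"
    sleep "$delay_seconds"
    attempt=$(( attempt + 1 ))
  done
  echo "Timed out waiting for $repo_name:$tag in $region" >&2
  return 1
}
```

Calling this function between the Region A and Region B steps would let the script proceed as soon as the image has been replicated, and fail with a clear message if it never arrives.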

After the initial provisioning script completes, it can take time for the App Runner custom domains to validate. The console indicates that it can take 24–48 hours for the status to change; however, in my testing, it typically took only a few minutes. So go grab a cup of coffee, and when you come back, you should have an app that has been deployed to two different regions.
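
Rather than watching the console, the custom domain status can also be queried with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function name is hypothetical, and the status is lowercased because the exact casing of the returned value is an assumption here.

```bash
# Sketch: print the validation status of one custom domain on an App Runner service.
# Arguments: service ARN, custom domain name, region.
custom_domain_status() {
  local service_arn="$1" domain_name="$2" region="$3"
  aws apprunner describe-custom-domains \
    --service-arn "$service_arn" \
    --region "$region" \
    --query "CustomDomains[?DomainName=='$domain_name'].Status | [0]" \
    --output text | tr '[:upper:]' '[:lower:]'   # normalize casing for comparisons
}
```

A result of active would indicate that the domain is ready to serve traffic; other values suggest that validation is still in progress or has failed.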

Application deployment

Now that all the cloud resources have been provisioned and configured, you can focus on your application code. As you add features to your application and iterate on your code, you’ll need an automated mechanism to deploy your changes to both regions. To accomplish this, we’ve provided a deployment script that performs the following actions. After cloning the Git repository, you can run make deploy to run this script:

1. Build and push a new container image to our Amazon ECR repository in Region A.
2. Call the App Runner UpdateService API in Region A to run your new image.
3. Wait a few seconds for Amazon ECR to replicate your image to Region B.
4. Call the App Runner UpdateService API in Region B to run your new image.

The following are snippets that show how to do this using the Docker CLI and AWS Command Line Interface (AWS CLI):

image=${repo}:${version}
 
# login to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin ${repo}
 
# build and push our container image to ECR
docker buildx build --push -t ${image} --platform linux/amd64 .
 
# Create an App Runner rolling deployment
aws apprunner update-service \
  --cli-input-json '{ "ServiceArn": "'${service_arn}'", "SourceConfiguration": { "ImageRepository": { "ImageRepositoryType": "ECR", "ImageIdentifier": "'${image}'" } } }'
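
When deploying to two regions in sequence, it may be useful to confirm that the Region A service has settled before updating Region B. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function name, parameters, and default timings are hypothetical and are not part of the provided deployment script.

```bash
# Sketch: wait until an App Runner service leaves OPERATION_IN_PROGRESS.
# Arguments: service ARN, region,
#            optional max attempts (default 60), optional delay in seconds (default 10).
wait_for_service_running() {
  local service_arn="$1" region="$2"
  local max_attempts="${3:-60}" delay_seconds="${4:-10}" attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(aws apprunner describe-service \
      --service-arn "$service_arn" \
      --region "$region" \
      --query 'Service.Status' \
      --output text) || return 1
    case "$status" in
      RUNNING) return 0 ;;                              # deployment has settled
      OPERATION_IN_PROGRESS) sleep "$delay_seconds" ;;  # still deploying, keep polling
      *) echo "Unexpected service status: $status" >&2; return 1 ;;
    esac
    attempt=$(( attempt + 1 ))
  done
  echo "Timed out waiting for $service_arn" >&2
  return 1
}
```

Note that a RUNNING status on its own may not prove that the new image is the one being served, for example if a failed deployment was rolled back, so a stricter pipeline would likely also inspect the result of the most recent operation.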

You can run the deployment script on your development machine. However, we recommend setting up a continuous integration and continuous delivery (CI/CD) pipeline that automatically runs whenever you make code changes and git push them to your origin repo. There are dozens of options for implementing a CI/CD pipeline today. A couple of popular options include GitHub Actions and AWS CodePipeline.

A few minutes after running the deployment script, your code changes will be up and running in both regions in a highly available and resilient configuration.

Verification

To verify that your application is indeed being served from both regions, you can run the following command, which sends an HTTP GET request to the health check endpoint every few seconds. Note that the health check route simply returns the value of the App Runner built-in AWS_REGION environment variable:

watch curl -s https://apprunner.example.com/health

You should see the region returned from the health check periodically alternating between Region A and Region B:

Every 2.0s: curl -s https://apprunner.example.com/health
 
us-east-2
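
To get a rough sense of the split rather than eyeballing the watch output, the responses can be tallied. The following is a minimal sketch, assuming the health route returns only the region name as described above. The function name and the default request count are hypothetical.

```bash
# Sketch: send a number of requests to the health endpoint and count
# how many responses came from each region.
# Arguments: URL, optional number of requests (default 20).
count_regions() {
  local url="$1" requests="${2:-20}" i=1
  while [ "$i" -le "$requests" ]; do
    curl -s "$url"
    echo   # ensure one response per line even if the body has no trailing newline
    i=$(( i + 1 ))
  done | grep -v '^$' | sort | uniq -c
}
```

For example, count_regions https://apprunner.example.com/health 40 prints one line per region with its response count. Because DNS answers are cached for the record's TTL, requests sent in quick succession may all land in one region, so spreading the requests over a few minutes should give a more representative picture.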

Summary

The goal of this post was to show how to architect a highly available and resilient application using App Runner. The application architecture is able to continue serving its users, even in the event of a region-wide issue. We achieved this by building an application that uses Route 53 health checks and a weighted routing policy, Amazon ECR replication, an App Runner service in each region, and DynamoDB global tables for a multi-region database. Hopefully you were able to follow along and are now equipped to build highly resilient applications for your projects. Happy building!
