Improve resiliency and reduce your data transfer costs between Availability Zones
This Guidance demonstrates how to configure a cell-based architecture for Amazon Elastic Kubernetes Service (Amazon EKS). It moves away from typical multi-Availability Zone clusters to single Availability Zone clusters. Each single Availability Zone cluster is called a cell, and the aggregation of these cells in a Region is called a supercell. Cells help to ensure that a failure in one cell doesn't affect another, reducing data transfer costs and improving both the availability of Amazon EKS workloads and their resiliency against Availability Zone failures.
Architecture Diagram
Main Architecture
This architecture diagram shows how you can use a cell-based architecture to improve resiliency and reduce data transfer costs for Amazon EKS workloads. It shows what a cell consists of and how traffic is routed to those cells. For more details about supercells, see the Supercells diagram that follows.
Step 1
A cell consists of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with its worker nodes (workloads) deployed within a single Availability Zone (AZ). Each cell is an independent replica of the application and creates a fault isolation boundary that limits the scope of impact. There can be multiple cells per AZ, and multiple cells can be deployed across multiple AZs to provide high availability and resiliency against AZ failures.
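As a minimal, hypothetical sketch of this step (the cluster name, subnet ID, and role ARN are placeholders, not values from this Guidance), the snippet below creates a managed node group whose only subnet lives in one AZ, which is what pins a cell's worker nodes to that AZ.
```python
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# Pin the cell's worker nodes to one AZ by supplying only subnets that
# belong to that AZ (placeholder IDs and ARNs shown here).
eks.create_nodegroup(
    clusterName="cell-1-use1-az1",          # hypothetical cell cluster
    nodegroupName="cell-1-workers",
    subnets=["subnet-0aaa1111bbb2222cc"],   # a single-AZ subnet
    nodeRole="arn:aws:iam::111122223333:role/cell-node-role",
    instanceTypes=["m6i.large"],
    scalingConfig={"minSize": 2, "maxSize": 10, "desiredSize": 2},
)
```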
Step 2
Clients are routed to the Amazon EKS workloads in each cell by a cell-routing layer, which consists of Elastic Load Balancing (ELB), Amazon Route 53 routing records, and Amazon Route 53 Application Recovery Controller to provide readiness checks, routing control, and zonal shift capabilities. An Application Load Balancer distributes the traffic to the Kubernetes resources within each cell.
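To make the routing-record part of this layer concrete, the following sketch (not part of this Guidance's sample code) upserts weighted alias records that split traffic across two cells' Application Load Balancers. The hosted zone IDs, record name, and ALB DNS names are placeholders; readiness checks and routing controls would sit in front of these records in practice.
```python
import boto3

route53 = boto3.client("route53")

# One weighted alias record per cell, each pointing at that cell's ALB.
changes = []
for cell, alb_dns, weight in [
    ("cell-1", "cell-1-alb-1234567890.us-east-1.elb.amazonaws.com", 50),
    ("cell-2", "cell-2-alb-0987654321.us-east-1.elb.amazonaws.com", 50),
]:
    changes.append({
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": cell,   # distinguishes the weighted records
            "Weight": weight,
            "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",  # ALB's Regional canonical zone ID (placeholder)
                "DNSName": alb_dns,
                "EvaluateTargetHealth": True,      # skip cells whose ALB is unhealthy
            },
        },
    })

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",  # your public hosted zone (placeholder)
    ChangeBatch={"Changes": changes},
)
```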
Step 3
Once a request reaches a cell, all subsequent internal communication among the Kubernetes (k8s) workloads stays within that cell. This prevents cross-cell dependencies, making each cell statically stable and more resilient. Because inter-AZ communication is minimal, chatty workloads incur no inter-AZ data transfer costs, as traffic never leaves the AZ boundary. Amazon EKS workloads use Karpenter for compute autoscaling.
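The Karpenter configuration itself is not shown in this Guidance. As a hedged sketch, assuming Karpenter v1 CRDs and an existing EC2NodeClass named default, the snippet below registers a NodePool whose requirements restrict provisioned nodes to the cell's AZ, so autoscaled capacity never lands outside the cell boundary.
```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# Hypothetical NodePool constrained to the cell's AZ.
nodepool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "cell-1"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",  # assumes an existing EC2NodeClass
                },
                "requirements": [
                    {   # pin provisioned nodes to the cell's AZ
                        "key": "topology.kubernetes.io/zone",
                        "operator": "In",
                        "values": ["us-east-1a"],
                    },
                    {   # allow Spot and On-Demand capacity
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": ["spot", "on-demand"],
                    },
                ],
            }
        }
    },
}

api.create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=nodepool
)
```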
Step 4
Amazon EKS workloads that require data persistence can continue to use AWS managed data store services, such as Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon ElastiCache, which span multiple AZs for high availability.
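As a simple illustration of that pattern, a workload running inside a cell reaches a Regional, multi-AZ data store through its service endpoint; the table name and item below are purely hypothetical.
```python
import boto3

# Regional DynamoDB table shared by all cells (name is illustrative only).
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders = dynamodb.Table("orders")

orders.put_item(Item={"order_id": "o-123", "cell": "cell-1"})
item = orders.get_item(Key={"order_id": "o-123"}).get("Item")
```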
Supercells
This architecture diagram shows how multiple cells are aggregated to create a supercell. It also outlines how those supercells are routed. For more details about the main architecture, see the Main Architecture diagram above.
Step 1
A cell, as described in Step 1 of the Main Architecture, consists of an Amazon EKS cluster with its workloads deployed within a single AZ.
Step 2
An aggregation of multiple cells within a Region is called a supercell.
Step 3
Amazon EKS workloads in each AWS Region, or supercell, use ELB to load balance traffic to the Amazon EKS workloads within each cell.
Step 4
Clients are routed to a supercell using a Route 53 weighted routing policy, with Route 53 Application Recovery Controller providing routing control and zonal shift capabilities.
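As a hedged sketch of the routing-control piece (the Application Recovery Controller cluster endpoint and routing control ARN below are placeholders, and in practice you would call the ARC cluster's own Regional endpoints), turning a routing control off withdraws the associated Route 53 records so clients are steered to healthy cells.
```python
import boto3

# ARC routing control state changes are made against the ARC cluster's
# Regional endpoints; the endpoint URL and ARN here are placeholders.
arc = boto3.client(
    "route53-recovery-cluster",
    region_name="us-west-2",
    endpoint_url="https://11111111.route53-recovery-cluster.us-west-2.amazonaws.com/v1",
)

# Setting the routing control to "Off" stops Route 53 from returning the
# records for that cell, steering traffic to the remaining healthy cells.
arc.update_routing_control_state(
    RoutingControlArn="arn:aws:route53-recovery-control::111122223333:controlpanel/abc/routingcontrol/def",
    RoutingControlState="Off",
)
```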
Step 5
Multiple supercells can be deployed across AWS Regions for disaster recovery, or to satisfy data residency requirements.
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
When configuring this Guidance, Amazon EKS, ELB, and Route 53 work in tandem to isolate faults to individual cells. Traditionally, all users of a business system share a single failure domain. With this approach, users are spread across different failure domains, so workloads are more resilient in the rare, but possible, event of an Availability Zone (AZ) failure.
Security
All AWS services used in this Guidance use AWS Identity and Access Management (IAM) for authentication and authorization. These services use IAM roles to obtain short-term credentials for accessing other AWS resources. By scoping IAM policies to the minimum permissions required, you limit unauthorized access to resources. Additionally, applications running in Kubernetes clusters can use native Kubernetes authentication and authorization policies to interact with the Kubernetes API server.
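As a minimal sketch of a narrowly scoped policy, the snippet below grants read-only access to a single hypothetical DynamoDB table and nothing else; the table ARN and policy name are illustrative, and your workloads' permissions will differ.
```python
import json
import boto3

iam = boto3.client("iam")

# Least-privilege policy: the workload's role may only read one table.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/orders",
        }
    ],
}

iam.create_policy(
    PolicyName="cell-1-orders-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```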
Reliability
Because each Amazon EKS cell is deployed to a single AZ, multiple cells are deployed to maintain availability through AZ-wide failures, with ELB directing traffic to healthy cells. ELB provides synchronous loose coupling so that traffic is not directed to unhealthy Amazon EKS cells, reducing the chance of application failures. You can also use Route 53 Application Recovery Controller zonal shifts, along with its routing control component, to mitigate AZ-level or Region-wide impairments.
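A zonal shift can also be started programmatically. The sketch below is illustrative only; the load balancer ARN and AZ ID are placeholders, and the shift expires automatically after the requested duration.
```python
import boto3

zonal_shift = boto3.client("arc-zonal-shift", region_name="us-east-1")

# Temporarily shift traffic away from an impaired AZ for the load balancer
# fronting the affected cell (placeholder ARN and AZ ID).
zonal_shift.start_zonal_shift(
    resourceIdentifier="arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/cell-1-alb/abc123",
    awayFrom="use1-az1",   # the AZ ID to shift traffic away from
    expiresIn="2h",        # zonal shifts are temporary and expire on their own
    comment="Shift away from impaired AZ for cell-1",
)
```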
Performance Efficiency
Keeping intra-cell traffic within a single AZ improves network performance through lower latency. A cell-based, multi-cluster architecture provides resiliency against AZ-wide failures, improves application performance, and reduces data transfer charges by keeping intra-cell communication within a single AZ. For additional improvements, Kubernetes Horizontal Pod Autoscaling can automatically scale applications, and Karpenter can handle compute autoscaling within an Amazon EKS cell. Moreover, ELB distributes traffic across healthy cells, and additional Amazon EKS cells can be launched to match heavy demand.
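For instance, a minimal Horizontal Pod Autoscaler for a hypothetical Deployment could be created as follows; the names, replica bounds, and CPU target are illustrative and not part of this Guidance.
```python
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV2Api()

# Hypothetical HPA: scale the "orders-api" Deployment between 2 and 20
# replicas to hold average CPU utilization near 60%.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="orders-api"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="orders-api"
        ),
        min_replicas=2,
        max_replicas=20,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=60
                    ),
                ),
            )
        ],
    ),
)

autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)
```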
Cost Optimization
All of the services used in this Guidance, including Amazon EKS, Amazon Elastic Compute Cloud (Amazon EC2), and ELB, are managed services that offer pay-as-you-go pricing. The AWS Pricing Calculator can help you estimate the cost of these services. Kubernetes Horizontal Pod Autoscaling and Karpenter auto-scale the workloads to match application demand, helping to maintain high utilization of deployed resources. Karpenter can also provision capacity from multiple Amazon EC2 purchase options, such as Spot Instances and On-Demand Instances, to further reduce your overall compute spend.
Sustainability
To help minimize the environmental impact of running your cloud workloads, this Guidance uses Kubernetes Horizontal Pod Autoscaling policies to match the number of application pods to demand. Furthermore, Karpenter launches the required compute capacity in each Amazon EKS cell and maintains high utilization of the deployed compute resources. Karpenter can also provision capacity using more sustainable Amazon EC2 instance types, such as those powered by AWS Graviton processors, AWS Inferentia, and AWS Trainium.
Implementation Resources
A detailed implementation guide is provided for you to experiment with and use within your AWS account. It walks through each stage of the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Life360’s journey to a multi-cluster Amazon EKS architecture to improve resiliency
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.