Operating resilient workloads on Amazon EKS

Introduction

When the margin for error is razor thin, it is best to assume that anything that can go wrong will go wrong. AWS customers are increasingly building resilient workloads that continue to operate while tolerating faults in systems. When customers build mission-critical applications on AWS, they have to make sure that every piece in their system is designed in such a way that the system continues to work while things go wrong.

AWS customers have applied the principle of design for failure to build scalable mission-critical systems that meet the highest standards of reliability. The best practices established in the AWS Well-Architected Framework have allowed teams to improve systems continuously while minimizing business disruptions. Let’s look at a few key design principles we have seen customers use to operate workloads that cannot afford downtime:

High availability: Modern applications should have no single point of failure
Fault tolerance: The system should have resources to account for internal and external failures up to a reasonable level
Auto scaling: Components within the system should scale independently
Self-healing: The system should be able to recover from errors automatically
Safer upgrades: Developers should be able to push code with confidence by employing blue/green and canary deployment strategies
Rollbacks: It should be easy to recover from a failed deployment

A system is resilient when it can adapt its functioning in response to changing conditions. The National Academy of Sciences defines resilience as “the ability to prepare and plan for, absorb, recover from, or more successfully adapt to actual or potential events”.

Without a sophisticated orchestration system such as Kubernetes, practicing these principles was a heavy lift for most engineering teams. Kubernetes makes it easy to run, observe, scale, and upgrade workloads. It is designed to be a self-healing system. As a container orchestrator, it offers controls that allow you to operate highly available, fault-tolerant, and scalable applications. These are the reasons why customers are increasingly choosing Amazon Elastic Kubernetes Service (Amazon EKS) to operate reliable and resilient services.

Amazon EKS

One of the most arduous tasks for Kubernetes users is managing the Kubernetes cluster control plane. Amazon EKS makes it easy for you to operate a Kubernetes cluster by providing a production-grade Kubernetes control plane. It is designed to be highly available, fault-tolerant, and scalable. Following the recommendations of the AWS Well-Architected Framework, Amazon EKS runs the Kubernetes control plane across multiple AWS Availability Zones (AZs) to ensure high availability. The cluster control plane autoscales based on load, and any unhealthy control plane instances are replaced automatically.

Amazon EKS Architecture

Amazon EKS Cluster Topology

The Amazon EKS team has gained an immense amount of experience managing Kubernetes control planes at scale. This experience has been translated into engineering and operational practices that ensure customers can rely on Amazon EKS clusters for the security, performance, availability, reliability, and resilience of the Kubernetes control plane.

AWS assumes responsibility for the availability of your Amazon EKS cluster control plane. This means Amazon EKS customers can focus all their efforts on ensuring that their workload is deployed in a manner that’s resilient to common infrastructure failure events.

Kubernetes is self-healing

Software is ubiquitous. And so are bugs. Run any piece of code long enough and it eventually runs into issues. Modern application architectures anticipate failure and run multiple copies of the code to minimize disruptions. But we also need to ensure that not all copies crash at the same time. So, we run application replicas in multiple failure domains (nodes, AZs, Regions, and service providers) to isolate faults in the underlying infrastructure.

A typical method for increasing a workload’s reliability and resilience is spreading its replicas across multiple isolated failure domains. Each additional failure domain adds a layer of protection, which makes the workload more robust against potential failures. For example, if a failure occurs in one domain, then the other domains can continue to handle the workload, minimizing the impact on users or customers. This approach is commonly referred to as fault tolerance or redundancy.

However, it’s important to note that each additional layer of isolation also comes with increased costs. Setting up and maintaining separate failure domains often involves additional hardware, networking infrastructure, and operational overhead. Organizations need to strike a balance between the desired level of reliability and the associated costs.
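
As a rough illustration of why each extra failure domain helps, the following sketch estimates the availability of a workload that stays up as long as at least one domain is healthy. It assumes that domains fail independently, which is a simplification, and the 99% per-domain figure is a hypothetical input rather than a published number. Correlated failures would reduce the benefit shown here.

```python
def combined_availability(per_domain_availability: float, domains: int) -> float:
    """Availability of a workload that survives while any one domain is up.

    Assumes failure domains fail independently of each other.
    """
    if not 0.0 <= per_domain_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if domains < 1:
        raise ValueError("at least one failure domain is required")
    # The workload is down only when every domain is down at the same time.
    return 1.0 - (1.0 - per_domain_availability) ** domains


if __name__ == "__main__":
    # Hypothetical per-domain availability of 99%.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

The gain from each added domain shrinks while the cost of each added domain does not, which is the trade-off described above.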

Kubernetes brings powerful orchestration capabilities to enhance workload resilience:

Replication and scaling: Kubernetes allows you to define the desired number of replicas for your workloads, ensuring that multiple instances of the application are running concurrently. If a replica fails or becomes unresponsive, Kubernetes automatically replaces it with a healthy one.
Health checks and self-healing: Kubernetes continuously monitors the health of individual replicas by performing health checks. If a replica fails these checks, then Kubernetes automatically terminates it and starts a new one in its place. This self-healing feature ensures that the workload remains available and resilient even in the presence of failures.
Fault isolation: Kubernetes allows you to define and enforce resource limits and constraints for workloads. By setting resource quotas and limits, you can prevent a single workload from consuming excessive resources, which reduces the risk of resource exhaustion and isolates failures to specific workloads rather than affecting the entire cluster.
Rolling updates and rollbacks: Kubernetes facilitates seamless rolling updates, which allows you to update your application without incurring downtime. By rolling out updates in a controlled manner, Kubernetes ensures that a minimum number of replicas are available and operational at all times. Additionally, Kubernetes supports rollbacks that revert to a previous version of an application if issues arise during an update.
Multi-domain deployments: Kubernetes supports deploying workloads across multiple failure domains. By deploying replicas across multiple nodes, AZs, and clusters, you increase the resilience of your workload against infrastructure failure at various levels.

Layers of resilience in Kubernetes

Let’s start from the lowest unit of deployment in Kubernetes, a Pod, and work our way up the infrastructure stack to eliminate single points of failure and add resilience to each layer.

Resilience at the deployment layer

A Kubernetes workload deployed to be highly available

Pods are the smallest deployable unit of computing in Kubernetes. But Pods by themselves are not resilient. A workload deployed as a singleton Pod stops running if its Pod terminates for any reason.

It is a best practice to deploy Kubernetes workloads as Deployments instead of individual Pods. If a Pod that’s part of a Deployment (or a ReplicaSet) crashes, Kubernetes attempts to heal the workload by creating another Pod. Therefore, customers should prefer running workloads using Deployments to enable automatic recovery when a Pod fails.

Recovering from a Pod failure can take some time. What happens to the business in the meantime? The business can’t afford downtime while the application restarts. So, we run multiple replicas of our workload to make sure that if one Pod goes down, there are still other replicas to handle requests.

Running multiple replicas of a service minimizes business impact during application crashes. Customers should deploy multiple replicas of a workload to improve availability and use Horizontal Pod Autoscaling to scale replicas. Teams should also consider using probes to help Kubernetes detect failure in application code. Please see the Applications section of the EKS Best Practices Guide for more information about running highly available applications.
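
To make the combination of Deployments, replicas, and probes concrete, here is a minimal sketch that builds an apps/v1 Deployment manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The workload name, image, port 8080, the /healthz path, and the probe timings are all hypothetical values chosen for illustration, not recommendations from this post.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Return an apps/v1 Deployment manifest with probes and a rolling update policy."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Several replicas so one Pod failure does not stop the service.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Never drop below the desired count during an update.
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": 8080}],
                            # Readiness gates traffic; liveness triggers restarts.
                            "readinessProbe": {
                                "httpGet": {"path": "/healthz", "port": 8080},
                                "periodSeconds": 5,
                            },
                            "livenessProbe": {
                                "httpGet": {"path": "/healthz", "port": 8080},
                                "initialDelaySeconds": 10,
                                "periodSeconds": 10,
                            },
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # "example.com/web:1.0" is a placeholder image reference.
    print(json.dumps(build_deployment("web", "example.com/web:1.0"), indent=2))
```

Piping the output to `kubectl apply -f -` would be one way to try it against a test cluster.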

Generally, customers don’t need to implement fault tolerance at deployment level once horizontal scaling is operational. There is one exception. You may have to over-provision workloads that have a high startup time. If a workload’s Pods cannot scale fast enough due to high startup time, then consider overprovisioning replicas to compensate for Pod failures.
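
The scaling rule that the Horizontal Pod Autoscaler documents is a simple ratio, and the sketch below reproduces it together with a hypothetical headroom helper for slow-starting workloads. The real autoscaler also applies a tolerance band, minimum and maximum replica bounds, and stabilization windows, none of which are modeled here. The `spare_replicas` parameter is an assumption introduced for illustration and is not a Kubernetes setting.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


def with_headroom(desired_replicas: int, spare_replicas: int) -> int:
    """Add spare replicas to cover Pod failures while slow replacements start.

    spare_replicas is a hypothetical knob; in practice you might express it
    by raising the autoscaler's minimum replica count.
    """
    if spare_replicas < 0:
        raise ValueError("spare replicas cannot be negative")
    return desired_replicas + spare_replicas


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    desired = hpa_desired_replicas(4, 90, 60)
    print(desired, with_headroom(desired, 2))
```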

Resilience at the node level

A Kubernetes workload running on multiple worker nodes

In the previous section, we discussed Kubernetes features you can use to deploy your workload in a resilient manner. We can further improve a workload’s resilience by running replicas across multiple failure domains. Let’s shift our focus to the compute resources that run workloads, Nodes.

Once we’ve made our workload highly available and fault-tolerant, we need to ensure that the Kubernetes data plane is also highly available and scalable. For the data plane to be resilient, there should be a way to detect and replace unhealthy Amazon Elastic Compute Cloud (Amazon EC2) instances automatically and scale Nodes based on workload requirements. A resilient, scalable data plane ensures that our workloads remain operational even when Nodes become unavailable.

Adding a Kubernetes autoscaler, such as the Cluster Autoscaler or Karpenter, ensures that the data plane scales as workloads scale. To detect failures in Nodes, you can install node-problem-detector and couple it with Descheduler to drain nodes automatically when a failure occurs.

Should you over-provision capacity at the Node level to make your data plane more fault-tolerant? That depends on the impact to your workload when the cluster data plane needs to be scaled up and the time it takes to create a container on a new node. It is faster to scale Nodes if they have a short boot-up time. Similarly, containers with smaller images start faster than containers with larger images.

There are two things you can do to make sure your data plane scales faster and recovers quickly from failures in individual Nodes:

Over-provision capacity at the Node level
Reduce Node startup time by using Bottlerocket or vanilla Amazon EKS optimized Linux

To speed up container startup, you can reduce the size of the container image by using techniques such as multi-stage builds, or you can use lazy loading of images.

By making the data plane resilient, you prevent a Node from becoming a single point of failure. Once your data plane is resilient, you can spread workload across multiple nodes using anti-affinity rules, Topology Spread Constraints, and disruption budgets.
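
As a sketch of how the spreading and disruption controls mentioned above can be expressed, the following builds a Topology Spread Constraint that spreads Pods across nodes and a policy/v1 PodDisruptionBudget. The `app` label key, the `-pdb` name suffix, and the numeric values are hypothetical choices for illustration.

```python
import json


def spread_across_nodes(app_label: str, max_skew: int = 1) -> dict:
    """Topology Spread Constraint keyed on the node hostname label.

    Belongs under spec.template.spec.topologySpreadConstraints of a Deployment.
    """
    return {
        "maxSkew": max_skew,
        "topologyKey": "kubernetes.io/hostname",
        # Prefer spreading, but still schedule if it cannot be satisfied.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


def build_pdb(app_label: str, min_available: int) -> dict:
    """PodDisruptionBudget that keeps min_available Pods up during voluntary disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app_label}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(spread_across_nodes("web"), indent=2))
    print(json.dumps(build_pdb("web", min_available=2), indent=2))
```

A disruption budget only limits voluntary disruptions such as node drains; it does not protect against a node failing outright, which is why it is paired with spreading.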

Resilience at Availability Zone level (Single vs multi-AZ clusters)

A Kubernetes workload deployed across multiple Availability Zones

So far, our efforts have focused on making sure that the workloads and data plane are resilient to failure. Let’s zoom out and consider the scenario in which an entire data center becomes unhealthy.

Every AWS Region has multiple, isolated locations known as AZs, which are designed not to be simultaneously impacted by a shared fate scenario like utility power, water disruption, fiber isolation, earthquakes, fires, tornadoes, or floods. AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

It is an essential best practice to deploy nodes across multiple AZs to be resilient to failures in a single AZ. When using node groups, it is recommended that you create node groups in multiple AZs. If you provision nodes using Karpenter, you are already covered, as Karpenter’s default behavior is to provision nodes across AZs. Once capacity is available across AZs, you can use Pod Topology Spread Constraints to spread workloads across multiple AZs.
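
One way to reason about whether replicas are evenly spread is to compute the skew that a Topology Spread Constraint's `maxSkew` field bounds: the difference between the most and least populated zones. The helper below is a small sketch of that check, and the zone names are example values only. A constraint using the `topology.kubernetes.io/zone` topology key with `maxSkew: 1` would aim to keep this number at 1 or below.

```python
from collections import Counter
from typing import Iterable, Sequence


def zone_skew(pod_zones: Iterable[str], zones: Sequence[str]) -> int:
    """Difference between the fullest and emptiest zone for one workload.

    pod_zones lists the zone of each running replica; zones lists every
    zone that is eligible to run the workload, including empty ones.
    Replicas in zones not listed in `zones` are ignored.
    """
    if not zones:
        raise ValueError("at least one zone is required")
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone)


if __name__ == "__main__":
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
    balanced = ["us-east-1a", "us-east-1a", "us-east-1b", "us-east-1c"]
    lopsided = ["us-east-1a", "us-east-1a", "us-east-1a"]
    print(zone_skew(balanced, zones), zone_skew(lopsided, zones))
```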

Applications, such as relational databases, that don’t support horizontal scaling for high availability are single points of failure. Consider setting up automatic recovery for these types of applications. Wherever possible, use managed services like Amazon RDS that can automatically fail over to another AZ when a failure is detected. In an ideal multi-AZ architecture, you should be able to withstand infrastructure failure in one of the AZs without impairing production.

One downside of multi-AZ design is that there’s a cost associated with data transferred between AZs. Applications such as Cassandra and other distributed databases exchange large volumes of data between replicas. To avoid paying inter-AZ data transfer charges, some customers deploy workloads in a single AZ, thus compromising reliability for cost. Techniques such as Topology Aware Hints and service meshes can help keep data local to an AZ while maintaining high availability across AZs.

Resilience at cluster level (single cluster vs multi-cluster)

A Kubernetes cluster depends on add-ons for critical functions such as networking, security, and observability. A failure in one of these components has the potential to cause cluster-wide outages. In the past, we have seen customers break their cluster by misconfiguring the Container Network Interface (CNI) or installing a poorly written controller.

As a result, many customers deploy their workloads across multiple Amazon EKS clusters to prevent a single Kubernetes cluster from being a single point of failure.

When adopting a multi-cluster architecture for resiliency, it is essential to reduce the operational overhead of managing clusters individually. The idea is to treat clusters as a unit: any change made to one cluster is propagated to all clusters.

There are three key capabilities that make multi-cluster deployments successful:

The ability to deploy workloads across clusters consistently
The ability to distribute traffic across multiple clusters
The ability to remove a cluster from receiving traffic

Multi-cluster configuration management

When operating in a multi-cluster environment, having an automated method for configuring clusters and deploying workloads reduces your platform team’s operational burden. Manual changes or updates made directly to clusters without proper documentation or control lead to configuration drift. GitOps or similar centralized deployment techniques become critical for avoiding inconsistencies in multi-cluster environments.

The reason to avoid configuration drift across clusters is to make deployments, upgrades, and troubleshooting easier. Without clear visibility and control over configurations, managing updates, performing version control, and ensuring consistency across environments become increasingly complex. This complexity can lead to errors, increase the time required for management tasks, and hinder scalability or automation efforts.

Should an issue arise in the deployment, it’s easier to fix if all clusters share the same workload version and configuration. Once you identify the fix in one cluster, you can confidently deploy the fix to other clusters, without having to troubleshoot the issue in each cluster separately.
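
The idea of detecting configuration drift can be sketched as comparing a fingerprint of the desired manifest with a fingerprint of what each cluster is running. This is a simplified illustration and not how any particular GitOps tool works: the cluster names are hypothetical, and a real comparison would first need to strip server-populated fields such as status and resource versions from the live objects.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(manifest: dict) -> str:
    """Hash a manifest so that key order does not affect the result."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(desired: dict, live_by_cluster: Dict[str, dict]) -> List[str]:
    """Return the names of clusters whose manifest differs from the desired one."""
    want = fingerprint(desired)
    return sorted(
        name for name, live in live_by_cluster.items() if fingerprint(live) != want
    )


if __name__ == "__main__":
    desired = {"kind": "Deployment", "spec": {"replicas": 3}}
    live = {
        # Same content, different key order: not drift.
        "cluster-a": {"spec": {"replicas": 3}, "kind": "Deployment"},
        # Someone changed the replica count by hand: drift.
        "cluster-b": {"kind": "Deployment", "spec": {"replicas": 5}},
    }
    print(find_drift(desired, live))
```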

Multi-cluster traffic routing

Multi-cluster architectures also provide opportunities for testing, maintenance, and upgrades without disrupting production environments. By diverting traffic or workloads to a set of clusters during planned maintenance activities, organizations can ensure continuous service availability and achieve near-zero downtime.

Amazon EKS customers use Application Load Balancer (ALB) or Network Load Balancer (NLB) to distribute traffic to replicas of services running inside a cluster. They can also use load balancers to distribute traffic across multiple clusters.

When using ALB, customers can create a dedicated target group for each cluster. Using weighted target groups, customers can then control the percentage of traffic each cluster gets.
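
As an illustration of weighted target groups, the sketch below builds a forward action in the shape that the Elastic Load Balancing v2 API uses for listener actions, with one target group per cluster. The target group identifiers are placeholders rather than real ARNs, and rejecting an all-zero weight set is a design choice of this sketch. Setting one cluster's weight to 0 is one way to take it out of rotation.

```python
from typing import Dict


def weighted_forward_action(weights: Dict[str, int]) -> dict:
    """Build a forward action that splits traffic across target groups.

    weights maps a target group ARN to an integer weight from 0 to 999.
    """
    for arn, weight in weights.items():
        if not 0 <= weight <= 999:
            raise ValueError(f"weight for {arn} must be between 0 and 999")
    if not any(weights.values()):
        raise ValueError("at least one target group needs a non-zero weight")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": arn, "Weight": weight}
                for arn, weight in weights.items()
            ]
        },
    }


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of requests each target group receives for the given weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {arn: weight / total for arn, weight in weights.items()}


if __name__ == "__main__":
    # Placeholder identifiers standing in for per-cluster target group ARNs.
    weights = {"tg-cluster-a": 90, "tg-cluster-b": 10}
    print(weighted_forward_action(weights))
    print(traffic_share(weights))
```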

For workloads that use an NLB, customers can use AWS Global Accelerator to distribute traffic across multiple clusters.

Resilience at regional level

In some cases, mission-critical workloads with stringent availability requirements may operate across multiple AWS Regions. This approach protects against larger-scale disasters or regional outages. However, the cost and complexity of implementing such a setup can be significantly higher. This architecture pattern is typically reserved for disaster recovery and business continuity. Few AWS customers operate in multiple regions concurrently.

Multi-regional deployments fall under two categories: active/active and active/passive. Customers use recovery point objective (RPO) and recovery time objective (RTO) metrics to determine which multi-region deployment strategy to use. Cost and RPO/RTO are inversely proportional: the lower the RPO and RTO targets, the higher the cost. Therefore, customers use active/active deployments only when the cost of being down in one Region outweighs the cost of running infrastructure in multiple Regions simultaneously.
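
The cost comparison above can be written down as simple arithmetic. The sketch below is deliberately crude and every number in it is hypothetical; a real decision would also weigh RPO, data replication constraints, and risk tolerance.

```python
def expected_annual_downtime_cost(outage_hours_per_year: float,
                                  revenue_loss_per_hour: float) -> float:
    """Expected yearly loss if a regional outage takes the workload down."""
    return outage_hours_per_year * revenue_loss_per_hour


def prefer_active_active(outage_hours_per_year: float,
                         revenue_loss_per_hour: float,
                         extra_annual_infra_cost: float) -> bool:
    """True when expected downtime losses exceed the cost of a second active Region."""
    loss = expected_annual_downtime_cost(outage_hours_per_year, revenue_loss_per_hour)
    return loss > extra_annual_infra_cost


if __name__ == "__main__":
    # Hypothetical inputs: 4 outage hours a year, 120,000 extra yearly spend.
    print(prefer_active_active(4, 50_000, 120_000))
    print(prefer_active_active(4, 1_000, 120_000))
```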

In active/passive deployments, customers deploy workloads in a primary Region and configure a secondary Region as a fallback should the primary Region become unavailable.

The scale at which you deploy infrastructure in the secondary Region depends on your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) numbers. Most Kubernetes customers with multi-regional requirements deploy Kubernetes clusters in multiple Regions and use GitOps to keep workload deployment and cluster configuration in sync. In active/passive deployments, customers primarily operate in one Region and fail over to a secondary Region when a large-scale failure impacts production. In such cases, the worker nodes in the secondary Region are usually scaled down to save costs.

When the disaster recovery process is activated, the cluster in the secondary Region is scaled up and traffic is shifted from the primary Region to the secondary Region.

Kubernetes backup and recovery tools like Velero offer another relatively inexpensive option for implementing disaster recovery. These tools back up Kubernetes cluster resources and persistent volumes to Amazon S3. The backups are then replicated to multiple Regions using Amazon S3 cross-region replication. Should the deployment in the primary Region become unavailable, customers can use the backup to recreate the deployment in an Amazon EKS cluster in another Region.

Multi-regional deployments work best with stateless workloads because replicating data across regions is costly. Data replication can also have a lag depending on the method of replication, which often creates complexity at the application layer.

When data is stored in block storage, object storage, or managed file systems such as Amazon S3, Amazon EFS, and Amazon FSx, customers can use AWS Backup or replication methods provided by the system to copy data to another Region.

For data stored in databases, customers can use AWS Database services like Amazon DynamoDB global tables and Amazon Aurora global databases to replicate data with minimal lag. If the database layer in your architecture doesn’t facilitate near real-time data replication, you’ll have to account for database replication lag before implementing a multi-region active/active architecture.

Due to the inherent cost and complexity, active/active multi-region deployments should be reserved for mission-critical workloads. Applications that could have this availability goal include, for example, certain banking, investing, finance, government, and critical business applications that are core to an extremely large revenue-generating business. If your organization has a workload that, if unavailable, would lead to thousands of dollars in lost revenue, that workload can be a suitable candidate for multi-region deployment.

For more information on multi-region deployment considerations, please see the multi-region scenarios section in the AWS Well-Architected Framework.

Validate resiliency

Once you have set up your workload’s architecture to be resilient, it’s imperative to validate its resilience. AWS recommends testing your system’s resilience regularly to ensure that it can withstand and recover from failures, disruptions, and adverse conditions.

Many AWS customers use the practice of chaos engineering to uncover the weaknesses of their systems. Testing your system’s resilience proactively helps you avoid business disruption and customer impact. Chaos engineering exercises can also uncover the potential impact of failure scenarios.

Resilience testing provides an opportunity to simulate real-world scenarios that may occur in production. By mimicking different failure scenarios, such as network outages, hardware failures, or service disruptions, you can evaluate the system’s behavior and readiness in those situations. This preparation will enable you to refine incident response procedures, train personnel, and improve system architecture and design.

Conclusion

In this post, we showed you that by understanding and using failover mechanisms, you can build systems that are highly available and fault tolerant, and that withstand interruptions to maintain business continuity.

Resilience is implemented at different layers in Kubernetes. This post explained how to improve the resilience of a workload by deploying it across multiple failure domains. We covered patterns for multi-node, multi-AZ, multi-cluster, and multi-region deployments.

Ultimately, the decision to increase the number of failure domains for a workload depends on a comprehensive evaluation of factors such as the workload’s criticality, the impact of potential failures, the financial implications, and the organization’s overall risk tolerance.

AWS customers can use AWS Resilience Hub to receive assessments and recommendations for improving resilience. Resilience Hub can analyze the overall resilience of an Amazon EKS cluster and can examine Kubernetes deployments, replicas, ReplicationControllers, and pods.