Organizations using Amazon Elastic Kubernetes Service (Amazon EKS) need to establish that their clusters are built as designed, are production-ready, and follow Amazon EKS Best Practices. Although Amazon EKS manages the Kubernetes control plane, validating cluster configurations and establishing quality across infrastructure, applications, policies, and resilience remains a key responsibility for platform teams. This post details how platform engineering teams can build an assurance pipeline for Amazon EKS deployments, incorporating validation frameworks that verify configurations, test infrastructure as code (IaC), assess application resilience, and establish compliance with organizational standards.
This comprehensive validation approach complements the robust scalability capabilities of Amazon EKS, helping teams build confidence in their deployments and maintain high-quality Kubernetes environments that can handle the demands of large-scale operations.
Current pain points in validating EKS clusters
Organizations deploying applications on Amazon EKS face several validation challenges:
- Infrastructure validation gaps: Traditional testing often focuses on application code and neglects IaC validation, leading to misconfigurations and deployment failures.
- Siloed testing approaches: Teams often use disconnected testing methods across infrastructure, applications, and policies, creating blind spots in validation coverage.
- Limited policy enforcement testing: Organizations struggle to validate that their Kubernetes policies are correctly enforced, potentially exposing security vulnerabilities.
- Non-functional testing complexity: Load testing Kubernetes components such as CoreDNS requires specialized knowledge and tools that many teams lack.
- Resilience assessment challenges: Understanding how applications behave during infrastructure failures is difficult without thorough failure simulation and the frameworks and tools to support it.
- Manual and time-consuming processes: Without automated validation frameworks, teams resort to manual validation, which is error-prone, limited in scope, and often leads to inefficient practices.
Solution overview
To address cluster validation challenges, we’ve developed an assurance pipeline that systematically validates Amazon EKS environments through six distinct frameworks, each serving a specific purpose in our validation process.
- Infrastructure validation (Terraform test): Validates infrastructure before deployment by testing EKS cluster component modules and verifying compliance with Amazon Web Services (AWS) best practices. This early validation process helps detect and resolve infrastructure issues during the development phase rather than in production.
- Behavioral testing (Pytest BDD): Validates cluster behavior through readable test scenarios that verify core operations such as pod scheduling and service discovery. The framework establishes proper component interactions and confirms that Kubernetes API operations respond as expected.
- Package validation (Helm testing): Verifies Helm chart installations and cluster add-ons deployment while establishing proper resource creation. This validation step maintains consistency as code moves between different environments.
- Policy compliance (Chainsaw): Tests admission controls, security policies, and network policies to establish that clusters adhere to organizational standards and compliance requirements. This comprehensive policy validation safeguards cluster security configurations.
- Performance assessment (Locust): Evaluates cluster performance under various load conditions by measuring component response times and monitoring scaling behavior. This testing helps identify potential performance bottlenecks before they impact production workloads.
- Resilience testing (AWS Tools): Uses AWS Resilience Hub and AWS Fault Injection Service (AWS FIS) to test failure recovery procedures and validate availability configurations. These tools help identify reliability improvements and establish robust cluster operations.
This pipeline gives us a clear view of our Amazon EKS environments, helping us catch issues before they affect our applications. Each framework adds a layer of validation, creating a practical approach to testing our Kubernetes infrastructure.
Prerequisites
The following prerequisites are necessary before continuing:
Furthermore, navigate to your GitLab project and configure the following:
- Go to Settings > CI/CD > Variables.
- Add the following variables:
  - AWS_ACCESS_KEY_ID: Your AWS access key
  - AWS_SECRET_ACCESS_KEY: Your AWS secret key
  - AWS_REGION: Your preferred AWS Region
  - CLUSTER_NAME: Your EKS cluster name
Walkthrough
In this walkthrough, you integrate the Amazon EKS validation framework into a GitLab CI/CD pipeline. Create a .gitlab-ci.yml file in your repository root with the following structure:
stages:
  - validate-infrastructure
  - deploy-infrastructure
  - validate-policies
  - deploy-applications
  - functional-tests
  - non-functional-tests
  - resilience-assessment

variables:
  AWS_REGION: us-west-2
  CLUSTER_NAME: eks-validation-cluster
  TERRAFORM_DIR: terraform
  HELM_DIR: helm
  POLICY_DIR: policies
  FUNCTIONAL_TEST_DIR: tests/functional
  LOAD_TEST_DIR: tests/load

# Reusable templates
.aws-auth: &aws-auth
  before_script:
    # Using GitLab CI/CD environment variables for AWS credentials
    # These should be set as protected and masked variables in GitLab CI/CD settings
    # No need to explicitly configure credentials as AWS CLI will automatically use these variables
    - export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
    - export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - aws sts get-caller-identity # Verify AWS credentials are working

.k8s-auth: &k8s-auth
  before_script:
    - aws eks update-kubeconfig --name $CLUSTER_NAME --region $AWS_REGION

# Infrastructure validation and deployment
terraform-validate:
  stage: validate-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform validate
    - terraform test

terraform-plan:
  stage: validate-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - $TERRAFORM_DIR/tfplan
    expire_in: 1 day

terraform-apply:
  stage: deploy-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform apply -auto-approve tfplan
  dependencies:
    - terraform-plan
  when: manual
  environment:
    name: production
    url: https://console.aws.amazon.com/eks/home?region=$AWS_REGION#/clusters/$CLUSTER_NAME

# Policy validation
policy-test:
  stage: validate-policies
  image: ghcr.io/kyverno/chainsaw:latest
  <<: *k8s-auth
  script:
    - cd $POLICY_DIR
    - chainsaw test --report-format junit --report-path chainsaw-report.xml
  artifacts:
    reports:
      junit: $POLICY_DIR/chainsaw-report.xml

# Application deployment and Helm testing
helm-deploy-test:
  stage: deploy-applications
  image: alpine/helm:latest
  <<: *k8s-auth
  script:
    - cd $HELM_DIR
    - helm dependency update ./
    - helm upgrade --install app-release ./ --wait
    - helm test app-release --logs

# Functional testing
functional-test:
  stage: functional-tests
  image: python:3.9
  <<: *k8s-auth
  script:
    - cd $FUNCTIONAL_TEST_DIR
    - pip install -r requirements.txt
    - pytest --bdd-format=pretty --junitxml=pytest-report.xml
  artifacts:
    reports:
      junit: $FUNCTIONAL_TEST_DIR/pytest-report.xml

# Non-functional testing
load-test:
  stage: non-functional-tests
  image: locustio/locust:latest
  <<: *k8s-auth
  script:
    - cd $LOAD_TEST_DIR
    - locust -f coredns_locustfile.py --headless -u 20 -r 2 -t 5m --html=locust-report.html
  artifacts:
    paths:
      - $LOAD_TEST_DIR/locust-report.html
    expire_in: 1 week

# Resilience assessment
resilience-assessment:
  stage: resilience-assessment
  image: amazon/aws-cli:latest
  <<: *aws-auth
  script:
    - cd resilience
    - ./run-resilience-assessment.sh $CLUSTER_NAME $AWS_REGION
    - ./run-fault-injection.sh $CLUSTER_NAME $AWS_REGION
  artifacts:
    paths:
      - resilience/assessment-report.json
      - resilience/fis-results.json
    expire_in: 1 week
1. Unit testing with Terraform test
Unit testing your infrastructure code is crucial for catching configuration errors early in the development cycle. It helps:
- Validate that your infrastructure components are correctly defined
- Establish that resources have the expected properties and configurations
- Prevent costly mistakes before deploying to AWS
- Provide documentation of expected infrastructure behavior
- Enable refactoring with confidence
How to implement
To implement unit testing with Terraform’s native testing framework, you can use your existing Terraform repository and create a tests directory with an eks.tftest.hcl file. The eks.tftest.hcl file is a Terraform test configuration file used for unit testing Amazon EKS infrastructure code. It validates that your Amazon EKS infrastructure components are correctly defined before deployment to AWS, helping catch configuration errors early in the development cycle. In your existing Terraform project structure, create the following tests directory (an example test file follows the directory layout):
your-terraform-project/
├── main.tf
├── eks.tf
├── ...
├── karpenter.tf
├── variables.tf
├── outputs.tf
└── tests/
└── eks.tftest.hcl
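The following is a minimal sketch of what tests/eks.tftest.hcl could contain. The assertions are drawn from the sample output later in this section; the referenced names (module.eks, module.vpc, var.eks_cluster_version, local.azs) are assumed to match your existing Terraform configuration:

run "create_eks_cluster" {
  command = plan

  assert {
    condition     = module.eks.cluster_name != ""
    error_message = "EKS cluster name should not be empty"
  }

  assert {
    condition     = module.eks.cluster_version == var.eks_cluster_version
    error_message = "EKS cluster version should match the specified version"
  }

  assert {
    condition     = contains(keys(module.eks.eks_managed_node_groups), "karpenter")
    error_message = "Expected an EKS managed node group named karpenter"
  }

  assert {
    condition     = length(module.vpc.private_subnets) == length(local.azs)
    error_message = "Expected one private subnet per Availability Zone"
  }
}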
Sample run and expected output
When running the Terraform tests with the new native testing framework, the output looks like the following:
$ cd terraform
$ terraform test
Testing terraform/tests/eks.tftest.hcl...
run "create_eks_cluster"... pass
Success! 1 passed, 0 failed.
For more detailed output, add the -verbose flag:
$ terraform test -verbose
Testing terraform/tests/eks.tftest.hcl...
run "create_eks_cluster"...
module.eks.cluster_name != ""... pass
module.eks.cluster_version == var.eks_cluster_version... pass
length(module.eks.eks_managed_node_groups) == 1... pass
contains(keys(module.eks.eks_managed_node_groups), "karpenter")... pass
module.karpenter.node_iam_role_name == local.name... pass
helm_release.karpenter.namespace == "kube-system"... pass
helm_release.karpenter.chart == "karpenter"... pass
helm_release.karpenter.version == "0.37.0"... pass
module.vpc.name == local.name... pass
module.vpc.vpc_cidr_block == var.vpc_cidr... pass
length(module.vpc.private_subnets) == length(local.azs)... pass
length(module.vpc.public_subnets) == length(local.azs)... pass
length(module.vpc.intra_subnets) == length(local.azs)... pass
pass
Success! 1 passed, 0 failed.
If there are any failures, the detailed error message looks like the following:
$ terraform test
Testing terraform/tests/eks.tftest.hcl...
run "create_eks_cluster"...
module.eks.cluster_version == var.eks_cluster_version... fail
EKS cluster version should match the specified version
module.eks.cluster_version is "1.29"
var.eks_cluster_version is "1.30"
fail
Error: 1 test failed.
These outputs provide comprehensive validation that your infrastructure code is correctly defined and will create the expected resources when deployed.
2. Functional testing with Pytest BDD
Functional testing validates that your EKS cluster behaves as expected from an operational perspective. It’s essential because:
- It verifies that critical Kubernetes components are running correctly
- It establishes that cluster services are accessible and responding properly
- It validates that the cluster can perform its intended functions
- It catches integration issues that unit tests might miss
- It provides confidence that the cluster works for end users
How to implement
Create a tests/functional directory with your BDD tests:
tests/functional/
├── requirements.txt
├── conftest.py
├── features/
│ └── cluster_validation.feature
└── steps/
└── cluster_steps.py
Example requirements.txt (specifies the Python package dependencies needed to run the functional tests and establish consistent test environments across different systems):
pytest
pytest-bdd
kubernetes
boto3
Example cluster_validation.feature (behavior specifications written in Gherkin syntax that define test scenarios in plain, human-readable language):
Feature: EKS Cluster Validation

  Scenario: Verify critical components are running
    Given an EKS cluster is available
    When I check the kube-system namespace
    Then all critical pods should be in Running state

  Scenario: Check logs for errors
    Given an EKS cluster is available
    When I check pods in the kube-system namespace
    Then logs should not contain any errors
Example cluster_steps.py (the actual Python implementation of the test steps defined in the feature file):
from pytest_bdd import given, when, then, parsers
from kubernetes import client, config
import boto3

# The scenarios in features/cluster_validation.feature are assumed to be bound to these
# steps elsewhere in the test suite (for example with pytest_bdd.scenarios).

@given("an EKS cluster is available", target_fixture="eks_cluster")
def eks_cluster():
    config.load_kube_config()
    return client.CoreV1Api()

@when(parsers.parse("I check the {namespace} namespace"), target_fixture="check_namespace")
def check_namespace(eks_cluster, namespace):
    return eks_cluster.list_namespaced_pod(namespace)

@then("all critical pods should be in Running state")
def check_pods_running(check_namespace):
    for pod in check_namespace.items:
        assert pod.status.phase == "Running", f"Pod {pod.metadata.name} is not running"

@when(parsers.parse("I check pods in the {namespace} namespace"), target_fixture="check_pods_logs")
def check_pods_logs(eks_cluster, namespace):
    pods = eks_cluster.list_namespaced_pod(namespace)
    logs = {}
    for pod in pods.items:
        try:
            logs[pod.metadata.name] = eks_cluster.read_namespaced_pod_log(
                name=pod.metadata.name, namespace=namespace
            )
        except Exception:
            logs[pod.metadata.name] = ""
    return logs

@then("logs should not contain any errors")
def check_logs_for_errors(check_pods_logs):
    error_keywords = ["error", "exception", "fail", "critical"]
    for pod_name, log in check_pods_logs.items():
        for keyword in error_keywords:
            assert keyword.lower() not in log.lower(), f"Error found in {pod_name} logs"
Sample run and expected output
When running the functional tests, the output looks like the following:
$ cd tests/functional
$ pytest --bdd-format=pretty --junitxml=pytest-report.xml
============================= test session starts ==============================
platform linux -- Python 3.9.7, pytest-7.3.1, pluggy-1.0.0
rootdir: /repo/tests/functional
plugins: bdd-6.1.1
collected 2 items
Feature: EKS Cluster Validation # features/cluster_validation.feature:1
Scenario: Verify critical components are running # features/cluster_validation.feature:3
Given an EKS cluster is available # steps/cluster_steps.py:6
When I check the kube-system namespace # steps/cluster_steps.py:12
Then all critical pods should be in Running state # steps/cluster_steps.py:16
Scenario: Check logs for errors # features/cluster_validation.feature:8
Given an EKS cluster is available # steps/cluster_steps.py:6
When I check pods in the kube-system namespace # steps/cluster_steps.py:20
Then logs should not contain any errors # steps/cluster_steps.py:32
============================= 2 passed in 8.32s ===============================
The JUnit XML report (pytest-report.xml) contains structured test results like the following:
<?xml version="1.0" encoding="utf-8"?>
<testsuites>
  <testsuite name="features.cluster_validation" errors="0" failures="0" skipped="0" tests="2" time="8.320" timestamp="2025-06-04T10:20:15">
    <testcase classname="features.cluster_validation" name="Verify critical components are running" time="4.123"/>
    <testcase classname="features.cluster_validation" name="Check logs for errors" time="4.197"/>
  </testsuite>
</testsuites>
This output demonstrates:
- Successful execution of BDD scenarios
- Verification that all critical pods are running
- Confirmation that no errors are found in pod logs
- Test timing information
- Overall test summary showing all tests passed
The JUnit report can be integrated with CI/CD systems for reporting and tracking test results over time.
3. Helm testing
Helm testing establishes that your applications deploy correctly and function as expected within the Kubernetes environment. It’s important because:
- It validates that your Helm charts are correctly structured
- It establishes that deployed applications are accessible and functional
- It verifies that services can communicate with each other
- It catches configuration issues before they affect users
- It provides a standardized way to test application deployments
How to implement
Create a helm directory with your Helm charts and tests:
helm/
├── Chart.yaml
├── values.yaml
├── templates/
│ └── ...
└── tests/
├── test-connection.yaml
└── test-resources.yaml
Example test-connection.yaml. This is a Helm test manifest that creates a temporary Pod to verify that the application’s service is accessible within the Kubernetes cluster by running a wget command:
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "app.fullname" . }}-test-connection"
  labels:
    {{- include "app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['{{ include "app.fullname" . }}:{{ .Values.service.port }}']
  restartPolicy: Never
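The tests directory above also lists test-resources.yaml, which is not shown here. The following is a minimal sketch of what it could contain, assuming the same chart helpers (app.fullname, app.labels); it verifies that the chart’s Service is resolvable through cluster DNS:

apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "app.fullname" . }}-test-resources"
  labels:
    {{- include "app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  containers:
    - name: nslookup
      image: busybox
      # Check that the Service created by the chart resolves through cluster DNS
      command: ['sh', '-c']
      args: ['nslookup {{ include "app.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local']
  restartPolicy: Never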
Sample run and expected output
When running Helm tests, the output looks like the following:
$ helm test app-release --logs
NAME: app-release
LAST DEPLOYED: Wed Jun 4 10:15:22 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: app-release-test-connection
Last Started: Wed Jun 4 10:16:05 2025
Last Completed: Wed Jun 4 10:16:15 2025
Phase: Succeeded
NOTES:
Application successfully deployed and tested!
POD LOGS: app-release-test-connection
wget: download completed
This output confirms that:
- The app-release Helm chart was successfully deployed
- The test-connection pod could reach the application’s service (the wget download completed)
- All tests passed successfully
4. Kubernetes policy testing with Chainsaw
Policy testing establishes that your Kubernetes cluster enforces the security and compliance requirements that your organization needs. It’s critical because:
- It validates that security policies are correctly implemented
- It establishes that non-compliant resources are rejected
- It verifies that your governance controls are working
- It helps maintain compliance with industry standards and regulations
- It prevents security vulnerabilities from being introduced
How to implement
Create a policies directory with your Kyverno policies and Chainsaw tests:
policies/
├── kyverno-policies/
│ ├── require-labels.yaml
│ └── restrict-image-registries.yaml
└── tests/
├── test-require-labels.yaml
└── test-restrict-registries.yaml
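The Chainsaw tests apply the Kyverno policies from the kyverno-policies directory. The following is a minimal sketch of what require-labels.yaml could look like; the specific required label (app.kubernetes.io/name) is an assumption and should reflect your organization’s labeling standards:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: check-required-labels
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "validation error: required labels are not set"
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"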
Example test-require-labels.yaml. This is a Chainsaw test manifest that validates Kubernetes label policy enforcement by applying a policy, verifying that a valid deployment is accepted, and confirming that an invalid deployment without the required labels is properly rejected:
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: test-require-labels
spec:
  steps:
    - name: step-01-apply-policy
      apply:
        file: ../kyverno-policies/require-labels.yaml
    - name: step-02-apply-valid-deployment
      apply:
        file: resources/valid-deployment.yaml
    - name: step-03-apply-invalid-deployment
      apply:
        file: resources/invalid-deployment.yaml
      expect:
        reject: true
        message: "validation error: required labels are not set"
Sample run and expected output
When running policy tests with Chainsaw, the output looks like the following:
$ chainsaw test --report-format junit
=== RUN test-require-labels
=== RUN test-require-labels/step-01-apply-policy
INFO[0000] ✅ Successfully applied resource name=require-labels namespace=default resource=ClusterPolicy.kyverno.io/v1
=== RUN test-require-labels/step-02-apply-valid-deployment
INFO[0001] ✅ Successfully applied resource name=valid-deployment namespace=default resource=Deployment.apps/v1
=== RUN test-require-labels/step-03-apply-invalid-deployment
INFO[0002] ✅ Resource rejected as expected name=invalid-deployment namespace=default resource=Deployment.apps/v1
INFO[0002] ✅ Error message matched expected="validation error: required labels are not set" received="admission webhook \"validate.kyverno.svc\" denied the request: resource Deployment/default/invalid-deployment was blocked due to the following policies: require-labels: validation error: required labels are not set"
--- PASS: test-require-labels (3.45s)
--- PASS: test-require-labels/step-01-apply-policy (0.82s)
--- PASS: test-require-labels/step-02-apply-valid-deployment (1.21s)
--- PASS: test-require-labels/step-03-apply-invalid-deployment (1.42s)
PASS
=== RUN test-restrict-registries
=== RUN test-restrict-registries/step-01-apply-policy
INFO[0000] ✅ Successfully applied resource name=restrict-image-registries namespace=default resource=ClusterPolicy.kyverno.io/v1
=== RUN test-restrict-registries/step-02-apply-valid-deployment
INFO[0001] ✅ Successfully applied resource name=valid-registry-deployment namespace=default resource=Deployment.apps/v1
=== RUN test-restrict-registries/step-03-apply-invalid-deployment
INFO[0002] ✅ Resource rejected as expected name=invalid-registry-deployment namespace=default resource=Deployment.apps/v1
INFO[0002] ✅ Error message matched expected="validation error: image registry not allowed" received="admission webhook \"validate.kyverno.svc\" denied the request: resource Deployment/default/invalid-registry-deployment was blocked due to the following policies: restrict-image-registries: validation error: image registry not allowed"
--- PASS: test-restrict-registries (3.12s)
--- PASS: test-restrict-registries/step-01-apply-policy (0.75s)
--- PASS: test-restrict-registries/step-02-apply-valid-deployment (1.15s)
--- PASS: test-restrict-registries/step-03-apply-invalid-deployment (1.22s)
PASS
Ran 2 test(s) in 6.57s
Tests succeeded: 2, Failed: 0
This output demonstrates:
- Successful application of both Kyverno policies
- Compliant deployments are admitted as expected
- Non-compliant deployments are rejected with the expected validation messages
- Overall test summary showing all tests passed
5. Non-functional testing with Locust
Non-functional testing evaluates the performance, scalability, and reliability of your EKS cluster under various conditions. It’s vital because:
- It identifies performance bottlenecks before they impact users
- It determines the maximum capacity of your cluster
- It validates that your cluster can handle expected load
- It helps optimize resource allocation and scaling configurations
- It establishes that critical services remain responsive under stress
How to implement
Create a tests/load directory with your Locust tests:
tests/load/
├── coredns_locustfile.py
└── karpenter_locustfile.py
Example coredns_locustfile.py. This is a Locust load testing script that simulates DNS resolution stress on CoreDNS by dynamically creating Kubernetes services, querying their DNS records, and deleting them to measure DNS performance under load:
from locust import User, task, between, TaskSet
import kubernetes as k8s
import random
import string

# Load Kubernetes configuration
k8s.config.load_kube_config()
v1 = k8s.client.CoreV1Api()
namespace_name = "locust-test"

def generate_service_name(length=10):
    return ''.join(random.choices(string.ascii_lowercase, k=length))

def create_service(name, namespace):
    service = k8s.client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=k8s.client.V1ObjectMeta(name=name, namespace=namespace),
        spec=k8s.client.V1ServiceSpec(
            ports=[k8s.client.V1ServicePort(port=80, target_port=80)],
            selector={"app": name}
        )
    )
    return v1.create_namespaced_service(namespace=namespace, body=service)

def generate_and_create_services(namespace, count=5):
    service_names = []
    for _ in range(count):
        name = generate_service_name()
        create_service(name, namespace)
        service_names.append(name)
    return service_names

def query_coredns(service_names, namespace):
    import dns.resolver  # requires the dnspython package
    resolver = dns.resolver.Resolver()
    resolver.nameservers = ['10.100.0.10']  # CoreDNS service IP
    for name in service_names:
        try:
            dns_name = f"{name}.{namespace}.svc.cluster.local"
            answers = resolver.resolve(dns_name, 'A')
            for rdata in answers:
                ip = rdata.address
        except Exception as e:
            print(f"DNS query failed: {e}")

def delete_services(service_names, namespace):
    for name in service_names:
        v1.delete_namespaced_service(name=name, namespace=namespace)

# The user makes no HTTP calls, so it extends User rather than HttpUser
# (HttpUser requires a base host to be configured)
class CoreDNSUser(User):
    wait_time = between(1, 3)

    @task
    class CoreDNSTaskSet(TaskSet):
        @task
        def create_query_delete_services(self):
            # Create services
            service_names = generate_and_create_services(namespace_name, 5)
            # Query CoreDNS
            query_coredns(service_names, namespace_name)
            # Delete services
            delete_services(service_names, namespace_name)
Sample run and expected output
When running the Locust load tests, the output looks like the following:
$ locust -f coredns_locustfile.py --headless -u 20 -r 2 -t 5m
[2025-06-04 10:30:12,345] INFO/MainProcess: Starting Locust 2.15.1
[2025-06-04 10:30:12,352] INFO/MainProcess: Starting 20 users at a rate of 2 users/s
[2025-06-04 10:35:12,456] INFO/MainProcess: Test finished
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|------------|-------|-------|-------|-------|--------|-----------
Aggregated 542 0(0.00%) | 345 78 1245 320 | 1.8 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 320 380 450 510 680 820 980 1100 1230 1240 1245 542
Test completed successfully.
This output demonstrates how your CoreDNS service performs under load, showing metrics like the following:
- Average response time (345 ms)
- Minimum and maximum response times (78 ms to 1245 ms)
- Request throughput (1.8 requests per second)
- Error rate (0% in this example)
You can use these metrics to identify potential bottlenecks and establish that your cluster can handle the expected load before deploying to production.
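The tests/load directory also lists karpenter_locustfile.py, which is not shown above. The following is a minimal sketch of what it could contain; the namespace (locust-test) and Deployment (scale-target) names are hypothetical, and the Deployment is assumed to already exist. It exercises Karpenter by repeatedly scaling the Deployment out, waiting for capacity to be provisioned, and scaling back in:

from locust import User, task, between
import kubernetes as k8s
import time

# Load Kubernetes configuration and create a client for Deployments
k8s.config.load_kube_config()
apps = k8s.client.AppsV1Api()

NAMESPACE = "locust-test"     # hypothetical namespace; must already exist
DEPLOYMENT = "scale-target"   # hypothetical Deployment used as the scaling target

def set_replicas(count):
    # Patch the Deployment's replica count to request (or release) node capacity
    apps.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": count}},
    )

class KarpenterUser(User):
    wait_time = between(60, 120)

    @task
    def scale_up_and_down(self):
        set_replicas(50)   # scale out: forces Karpenter to launch additional nodes
        time.sleep(180)    # give Karpenter time to provision capacity
        set_replicas(1)    # scale back in: lets Karpenter consolidate and remove nodes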
6. Resilience assessment with Resilience Hub
Resilience assessment evaluates how well your EKS cluster can withstand and recover from failures. It’s essential because:
- It identifies single points of failure in your architecture
- It validates that your recovery mechanisms work as expected
- It establishes business continuity during disruptions
- It helps meet availability SLAs and compliance requirements
- It provides confidence that your cluster can handle real-world incidents
How to implement
Create a resilience directory with scripts for resilience assessment:
resilience/
├── run-resilience-assessment.sh
└── run-fault-injection.sh
Example run-resilience-assessment.sh. This is a shell script that creates a Resilience Hub application for an EKS cluster, runs a resilience assessment to evaluate its disaster recovery capabilities, and saves the results to a JSON file:
#!/bin/bash
set -e
CLUSTER_NAME=$1
REGION=$2
APP_NAME="${CLUSTER_NAME}-app"
# Create Resilience Hub application if it doesn't exist
APP_ARN=$(aws resiliencehub list-apps --query "appSummaries[?name=='${APP_NAME}'].arn" --output text)
if [ -z "$APP_ARN" ]; then
  echo "Creating Resilience Hub application..."
  APP_ARN=$(aws resiliencehub create-app \
    --name "${APP_NAME}" \
    --description "EKS cluster resilience assessment" \
    --app-template-body "{\"resources\":[{\"logicalResourceId\":{\"identifier\":\"${CLUSTER_NAME}\"},\"resourceType\":\"AWS::EKS::Cluster\",\"type\":\"AWS::EKS::Cluster\"}]}" \
    --query "app.appArn" \
    --output text)
fi
# Run assessment
echo "Running resilience assessment..."
ASSESSMENT_ARN=$(aws resiliencehub start-app-assessment \
--app-arn "${APP_ARN}" \
--assessment-name "pipeline-assessment-$(date +%Y%m%d-%H%M%S)" \
--query "assessment.assessmentArn" \
--output text)
# Wait for assessment to complete
echo "Waiting for assessment to complete..."
aws resiliencehub wait assessment-executed --assessment-arn "${ASSESSMENT_ARN}"
# Get assessment results
echo "Getting assessment results..."
aws resiliencehub describe-app-assessment \
--assessment-arn "${ASSESSMENT_ARN}" > assessment-report.json
echo "Assessment complete. Results saved to assessment-report.json"
Example run-fault-injection.sh. This is a shell script that creates and runs an AWS FIS experiment to test the resilience of an EKS cluster by simulating an availability zone outage and capturing the results:
#!/bin/bash
set -e
CLUSTER_NAME=$1
REGION=$2
# Create FIS experiment template
TEMPLATE_ID=$(aws fis create-experiment-template \
--targets "eks-cluster={resourceType=aws:eks:cluster,resourceArns=[arn:aws:eks:${REGION}:$(aws sts get-caller-identity --query Account --output text):cluster/${CLUSTER_NAME}]}" \
--actions "az-outage={actionId=aws:eks:inject-availability-zone-failure,targets={eks-cluster=eks-cluster},parameters={completionMode=forced}}" \
--stop-conditions "duration={source=none,value=10m}" \
--role-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/FISExperimentRole" \
--description "Test EKS cluster resilience to AZ failure" \
--query "experimentTemplate.id" \
--output text)
# Start FIS experiment
EXPERIMENT_ID=$(aws fis start-experiment \
--experiment-template-id "${TEMPLATE_ID}" \
--query "experiment.id" \
--output text)
echo "Started FIS experiment ${EXPERIMENT_ID}"
# Wait for experiment to complete
echo "Waiting for experiment to complete..."
aws fis wait experiment-completed --id "${EXPERIMENT_ID}"
# Get experiment results
echo "Getting experiment results..."
aws fis get-experiment \
--id "${EXPERIMENT_ID}" > fis-results.json
echo "Experiment complete. Results saved to fis-results.json"
Sample run and expected output
When running the resilience assessment scripts, the output looks like the following:
$ ./run-resilience-assessment.sh eks-validation-cluster us-west-2
Creating Resilience Hub application...
Running resilience assessment...
Waiting for assessment to complete...
Getting assessment results...
Assessment complete. Results saved to assessment-report.json
$ cat assessment-report.json
{
"assessment": {
"appArn": "arn:aws:resiliencehub:us-west-2:123456789012:app/eks-validation-cluster-app/1a2b3c4d",
"assessmentArn": "arn:aws:resiliencehub:us-west-2:123456789012:app-assessment/5e6f7g8h",
"assessmentName": "pipeline-assessment-20250604-103015",
"assessmentStatus": "SUCCEEDED",
"complianceStatus": "POLICY_COMPLIANT",
"resiliencyScore": 85.0,
"driftStatus": "NOT_DRIFTED",
"invoker": "USER",
"appVersion": "1",
"assessmentTimeStamp": "2025-06-04T10:30:15.000Z"
}
}
$ ./run-fault-injection.sh eks-validation-cluster us-west-2
Started FIS experiment fis-12345678abcdef01
Waiting for experiment to complete...
Getting experiment results...
Experiment complete. Results saved to fis-results.json
$ cat fis-results.json
{
"experiment": {
"id": "fis-12345678abcdef01",
"experimentTemplateId": "fit-12345678abcdef01",
"state": {
"status": "COMPLETED",
"reason": "Experiment completed successfully"
},
"targets": {
"eks-cluster": {
"resourceType": "aws:eks:cluster",
"resourceArns": [
"arn:aws:eks:us-west-2:123456789012:cluster/eks-validation-cluster"
]
}
},
"actions": {
"az-outage": {
"actionId": "aws:eks:inject-availability-zone-failure",
"state": {
"status": "COMPLETED"
},
"startTime": "2025-06-04T10:35:00.000Z",
"endTime": "2025-06-04T10:45:00.000Z"
}
},
"startTime": "2025-06-04T10:35:00.000Z",
"endTime": "2025-06-04T10:45:00.000Z"
}
}
These outputs show a successful resilience assessment with a score of 85.0 and a completed fault injection experiment that simulated an Availability Zone failure. The assessment indicates that the cluster is policy compliant, and the fault injection experiment completed successfully, helping you identify how your cluster responds to failures.
Pipeline monitoring and visualization
You can view the pipeline execution in GitLab’s CI/CD interface, which provides a visual representation of each stage and its status. The pipeline generates reports and artifacts that can be reviewed to assess the quality of your Amazon EKS deployment. This implementation creates a complete quality assurance pipeline that validates all aspects of your EKS clusters throughout the development lifecycle.
Benefits of the Amazon EKS validation framework
Our Amazon EKS validation framework brings practical value to our Kubernetes operations through several key benefits. We test each part of our Amazon EKS setup to catch and fix issues before they reach production, leading to more stable services for our users. Our policy tests verify that security measures work as planned, giving us confidence in our cluster protection. Through load testing, we understand how our applications and infrastructure handle increased traffic, helping us prepare for busy periods and plan for growth. Tools such as Resilience Hub and AWS FIS teach us how our system reacts to failures so that we can improve recovery plans and reduce potential downtime.
Moreover, the automation in our framework cuts down manual testing time so that we can focus more on building new features and responding quickly to changes. Finding and fixing issues early in development costs less than addressing them in production. Our testing process also establishes that our Amazon EKS environment meets both regulatory standards and internal rules, streamlining audits and reviews. The framework is a practical tool that helps us build and maintain reliable Kubernetes infrastructure that serves our needs today and supports our growth tomorrow.
Conclusion
A comprehensive quality assurance pipeline for Amazon EKS clusters helps establish that your Kubernetes environments are properly configured, secure, and ready for production workloads. You can implement the six validation components outlined in this post to verify correct infrastructure provisioning, establish functional correctness and security of applications, test the environment’s ability to handle expected load, and validate recovery from failures. This structured approach to validation builds confidence in your Amazon EKS deployments, reduces the risk of production issues, and maintains high-quality, production-ready Kubernetes environments. As containerized applications become increasingly critical to business operations, investing in comprehensive validation frameworks is essential: it lets organizations maximize the benefits of Kubernetes while minimizing operational risks. You can adopt these validation frameworks to accelerate your journey toward reliable and performant Kubernetes deployments on Amazon EKS.
About the authors
Niall Thomson is a Principal Specialist Solutions Architect, Containers, at AWS where he helps customers who are building modern application platforms on AWS container services.
Ramesh Mathikumar is a Principal Consultant within the Global Financial Services practice. He has been working with Financial services customers over the last 25 years. At AWS, he helps customers succeed in their cloud journey by implementing AWS technologies every day.
Sundar Shanmugam is a Sr. Cloud Infrastructure Architect at AWS, specializing in solution architecture, workload migrations, and modernization. He focuses on developing innovative generative AI solutions and helps customers drive digital transformation while maximizing their AWS investment to achieve business objectives.