Containers

Implementing an assurance pipeline for an Amazon EKS platform

Organizations using Amazon Elastic Kubernetes Service (Amazon EKS) need to establish that their clusters are built as designed, are production-ready, and follow Amazon EKS Best Practices. Although Amazon EKS manages the Kubernetes control plane, validating cluster configurations and establishing quality across infrastructure, applications, policies, and resilience remains a key responsibility for platform teams. This post details how platform engineering teams can build an assurance pipeline for Amazon EKS deployments, incorporating validation frameworks that verify configurations, test infrastructure as code (IaC), assess application resilience, and establish compliance with organizational standards.

This comprehensive validation approach complements the robust scalability capabilities of Amazon EKS, helping teams build confidence in their deployments and maintain high-quality Kubernetes environments that can handle the demands of large-scale operations.

Current pain points in validating EKS clusters

Organizations deploying applications on Amazon EKS face several validation challenges:

  • Infrastructure validation gaps: Traditional testing often focuses on application code and neglects IaC validation, leading to misconfigurations and deployment failures.
  • Siloed testing approaches: Teams often use disconnected testing methods across infrastructure, applications, and policies, creating blind spots in validation coverage.
  • Limited policy enforcement testing: Organizations struggle to validate that their Kubernetes policies are correctly enforced, potentially exposing security vulnerabilities.
  • Non-functional testing complexity: Load testing Kubernetes components such as CoreDNS requires specialized knowledge and tools that many teams lack.
  • Resilience assessment challenges: Understanding how applications behave during infrastructure failures is difficult without thorough failure simulation and the frameworks and tools to support it.
  • Manual and time-consuming processes: Without automated validation frameworks, teams resort to manual validation, which is error-prone, limited in scope, and often leads to inefficient practices.

Solution overview

To address cluster validation challenges, we’ve developed an assurance pipeline that systematically validates Amazon EKS environments through six distinct frameworks, each serving a specific purpose in our validation process.

  1. Infrastructure validation (Terraform test): Validates infrastructure before deployment by testing EKS cluster component modules and verifying compliance with Amazon Web Services (AWS) best practices. This early validation process helps detect and resolve infrastructure issues during the development phase rather than in production.
  2. Behavioral testing (Pytest BDD): Validates cluster behavior through readable test scenarios that verify core operations such as pod scheduling and service discovery. The framework establishes proper component interactions and confirms that Kubernetes API operations respond as expected.
  3. Package validation (Helm testing): Verifies Helm chart installations and cluster add-ons deployment while establishing proper resource creation. This validation step maintains consistency as code moves between different environments.
  4. Policy compliance (Chainsaw): Tests admission controls, security policies, and network policies to establish that clusters adhere to organizational standards and compliance requirements. This comprehensive policy validation safeguards cluster security configurations.
  5. Performance assessment (Locust): Evaluates cluster performance under various load conditions by measuring component response times and monitoring scaling behavior. This testing helps identify potential performance bottlenecks before they impact production workloads.
  6. Resilience testing (AWS Tools): Uses AWS Resilience Hub and AWS Fault Injection Service (AWS FIS) to test failure recovery procedures and validate availability configurations. These tools help identify reliability improvements and establish robust cluster operations.

This pipeline gives us a clear view of our Amazon EKS environments, helping us catch issues before they affect our applications. Each framework adds a layer of validation, creating a practical approach to testing our Kubernetes infrastructure.

Prerequisites

The following prerequisites are necessary before continuing:

Furthermore, navigate to your GitLab project and configure the following:

  1. Go to Settings > CI/CD > Variables.
  2. Add the following variables:
    • AWS_ACCESS_KEY_ID: Your AWS access key
    • AWS_SECRET_ACCESS_KEY: Your AWS secret key
    • AWS_REGION: Your preferred AWS Region
    • CLUSTER_NAME: Your EKS cluster name

Walkthrough

In this walkthrough, you integrate the Amazon EKS validation framework into a GitLab CI/CD pipeline. Create a .gitlab-ci.yml file in your repository root with the following structure:

stages:
  - validate-infrastructure
  - deploy-infrastructure
  - validate-policies
  - deploy-applications
  - functional-tests
  - non-functional-tests
  - resilience-assessment

variables:
  AWS_REGION: us-west-2
  CLUSTER_NAME: eks-validation-cluster
  TERRAFORM_DIR: terraform
  HELM_DIR: helm
  POLICY_DIR: policies
  FUNCTIONAL_TEST_DIR: tests/functional
  LOAD_TEST_DIR: tests/load

# Reusable templates
.aws-auth: &aws-auth
  before_script:
    # Using GitLab CI/CD environment variables for AWS credentials
    # These should be set as protected and masked variables in GitLab CI/CD settings
    # No need to explicitly configure credentials as AWS CLI will automatically use these variables
    - export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
    - export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - aws sts get-caller-identity # Verify AWS credentials are working

.k8s-auth: &k8s-auth
  before_script:
    - aws eks update-kubeconfig --name $CLUSTER_NAME --region $AWS_REGION

# Infrastructure validation and deployment
terraform-validate:
  stage: validate-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform validate
    - terraform test

terraform-plan:
  stage: validate-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - $TERRAFORM_DIR/tfplan
    expire_in: 1 day

terraform-apply:
  stage: deploy-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform apply -auto-approve tfplan
  dependencies:
    - terraform-plan
  when: manual
  environment:
    name: production
    url: https://console.aws.amazon.com/eks/home?region=$AWS_REGION#/clusters/$CLUSTER_NAME

# Policy validation
policy-test:
  stage: validate-policies
  image: ghcr.io/kyverno/chainsaw:latest
  <<: *k8s-auth
  script:
    - cd $POLICY_DIR
    - chainsaw test --report-format junit --report-path chainsaw-report.xml
  artifacts:
    reports:
      junit: $POLICY_DIR/chainsaw-report.xml

# Application deployment and Helm testing
helm-deploy-test:
  stage: deploy-applications
  image: alpine/helm:latest
  <<: *k8s-auth
  script:
    - cd $HELM_DIR
    - helm dependency update ./
    - helm upgrade --install app-release ./ --wait
    - helm test app-release --logs

# Functional testing
functional-test:
  stage: functional-tests
  image: python:3.9
  <<: *k8s-auth
  script:
    - cd $FUNCTIONAL_TEST_DIR
    - pip install -r requirements.txt
    - pytest --bdd-format=pretty --junitxml=pytest-report.xml
  artifacts:
    reports:
      junit: $FUNCTIONAL_TEST_DIR/pytest-report.xml

# Non-functional testing
load-test:
  stage: non-functional-tests
  image: locustio/locust:latest
  <<: *k8s-auth
  script:
    - cd $LOAD_TEST_DIR
    - locust -f coredns_locustfile.py --headless -u 20 -r 2 -t 5m --html=locust-report.html
  artifacts:
    paths:
      - $LOAD_TEST_DIR/locust-report.html
    expire_in: 1 week

# Resilience assessment
resilience-assessment:
  stage: resilience-assessment
  image: amazon/aws-cli:latest
  <<: *aws-auth
  script:
    - cd resilience
    - ./run-resilience-assessment.sh $CLUSTER_NAME $AWS_REGION
    - ./run-fault-injection.sh $CLUSTER_NAME $AWS_REGION
  artifacts:
    paths:
      - resilience/assessment-report.json
      - resilience/fis-results.json
    expire_in: 1 week

1. Unit testing with Terraform test

Unit testing your infrastructure code is crucial for catching configuration errors early in the development cycle. It helps:

  • Validate that your infrastructure components are correctly defined
  • Establish that resources have the expected properties and configurations
  • Prevent costly mistakes before deploying to AWS
  • Provide documentation of expected infrastructure behavior
  • Enable refactoring with confidence

How to implement

To implement unit testing with Terraform’s native testing framework, you can use your existing Terraform repository and create a tests directory with an eks.tftest.hcl file. The eks.tftest.hcl file is a Terraform test configuration file used for unit testing Amazon EKS infrastructure code. It validates that your Amazon EKS infrastructure components are correctly defined before actual deployment to AWS, helping catch configuration errors early in the development cycle.

In your existing Terraform project structure, create the following tests directory:

your-terraform-project/
├── main.tf
├── eks.tf
├── ...
├── karpenter.tf
├── variables.tf
├── outputs.tf
└── tests/
    └── eks.tftest.hcl
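
The exact contents of eks.tftest.hcl depend on the modules and variables defined in your project. The following is a minimal sketch, assuming the module and variable names that appear in the sample output below (module.eks, module.vpc, var.eks_cluster_version, var.vpc_cidr); each assert block corresponds to one assertion line in the verbose test output:

run "create_eks_cluster" {
  # Evaluate assertions against a plan so nothing is deployed during unit testing
  command = plan

  assert {
    condition     = module.eks.cluster_version == var.eks_cluster_version
    error_message = "EKS cluster version should match the specified version"
  }

  assert {
    condition     = contains(keys(module.eks.eks_managed_node_groups), "karpenter")
    error_message = "A managed node group named karpenter should be defined"
  }

  assert {
    condition     = module.vpc.vpc_cidr_block == var.vpc_cidr
    error_message = "VPC CIDR block should match the vpc_cidr variable"
  }
}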

Sample run and expected output

When running the Terraform tests with the new native testing framework, the output looks like the following:

$ cd terraform
$ terraform test

Testing terraform/tests/eks.tftest.hcl...

run "create_eks_cluster"... pass

Success! 1 passed, 0 failed.

For more detailed output, use the -verbose flag:

$ terraform test -verbose

Testing terraform/tests/eks.tftest.hcl...

run "create_eks_cluster"...
  module.eks.cluster_name != ""... pass
  module.eks.cluster_version == var.eks_cluster_version... pass
  length(module.eks.eks_managed_node_groups) == 1... pass
  contains(keys(module.eks.eks_managed_node_groups), "karpenter")... pass
  module.karpenter.node_iam_role_name == local.name... pass
  helm_release.karpenter.namespace == "kube-system"... pass
  helm_release.karpenter.chart == "karpenter"... pass
  helm_release.karpenter.version == "0.37.0"... pass
  module.vpc.name == local.name... pass
  module.vpc.vpc_cidr_block == var.vpc_cidr... pass
  length(module.vpc.private_subnets) == length(local.azs)... pass
  length(module.vpc.public_subnets) == length(local.azs)... pass
  length(module.vpc.intra_subnets) == length(local.azs)... pass
  pass

Success! 1 passed, 0 failed.

If there are any failures, the detailed error message looks like the following:

$ terraform test

Testing terraform/tests/eks.tftest.hcl...

run "create_eks_cluster"...
  module.eks.cluster_version == var.eks_cluster_version... fail
    EKS cluster version should match the specified version
    module.eks.cluster_version is "1.29"
    var.eks_cluster_version is "1.30"
  fail

Error: 1 test failed.

These outputs provide comprehensive validation that your infrastructure code is correctly defined and will create the expected resources when deployed.

2. Functional testing with Pytest BDD

Functional testing validates that your EKS cluster behaves as expected from an operational perspective. It’s essential because:

  • It verifies that critical Kubernetes components are running correctly
  • It establishes that cluster services are accessible and responding properly
  • It validates that the cluster can perform its intended functions
  • It catches integration issues that unit tests might miss
  • It provides confidence that the cluster works for end users

How to implement

Create a tests/functional directory with your BDD tests:

tests/functional/
├── requirements.txt
├── conftest.py
├── features/
│   └── cluster_validation.feature
└── steps/
    └── cluster_steps.py

Example requirements.txt (specifies the Python package dependencies needed to run the functional tests and establish consistent test environments across different systems):

pytest
pytest-bdd
kubernetes
boto3

Example cluster_validation.feature (behavior specifications written in Gherkin syntax that define test scenarios in plain, human-readable language):

Feature: EKS Cluster Validation
  
  Scenario: Verify critical components are running
    Given an EKS cluster is available
    When I check the kube-system namespace
    Then all critical pods should be in Running state
    
  Scenario: Check logs for errors
    Given an EKS cluster is available
    When I check pods in the kube-system namespace
    Then logs should not contain any errors

Example cluster_steps.py (the actual Python implementation of the test steps defined in the feature file):

from pytest_bdd import given, when, then, parsers, scenarios
from kubernetes import client, config
import boto3

# Bind the scenarios defined in the feature file to this test module
scenarios("../features/cluster_validation.feature")

@given("an EKS cluster is available", target_fixture="eks_cluster")
def eks_cluster():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    return v1

@when(parsers.parse("I check the {namespace} namespace"), target_fixture="check_namespace")
def check_namespace(eks_cluster, namespace):
    return eks_cluster.list_namespaced_pod(namespace)

@then("all critical pods should be in Running state")
def check_pods_running(check_namespace):
    for pod in check_namespace.items:
        assert pod.status.phase == "Running", f"Pod {pod.metadata.name} is not running"

@when(parsers.parse("I check pods in the {namespace} namespace"))
def check_pods_logs(eks_cluster, namespace):
    pods = eks_cluster.list_namespaced_pod(namespace)
    logs = {}
    for pod in pods.items:
        try:
            logs[pod.metadata.name] = eks_cluster.read_namespaced_pod_log(
                name=pod.metadata.name, namespace=namespace
            )
        except Exception:
            logs[pod.metadata.name] = ""
    return logs

@then("logs should not contain any errors")
def check_logs_for_errors(check_pods_logs):
    error_keywords = ["error", "exception", "fail", "critical"]
    for pod_name, log in check_pods_logs.items():
        for keyword in error_keywords:
            assert keyword.lower() not in log.lower(), f"Error found in {pod_name} logs"
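
The conftest.py file listed in the directory structure above is not reproduced here. A minimal sketch, assuming you only want to load the kubeconfig once per test session (the fixture name and scope are assumptions about how you structure it), could be:

import pytest
from kubernetes import config

@pytest.fixture(scope="session", autouse=True)
def kubeconfig():
    # Load the kubeconfig written by `aws eks update-kubeconfig` in the
    # pipeline's k8s-auth before_script so every test targets the cluster
    config.load_kube_config()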

Sample run and expected output

When running the functional tests, the output looks like the following:

$ cd tests/functional
$ pytest --bdd-format=pretty --junitxml=pytest-report.xml
============================= test session starts ==============================
platform linux -- Python 3.9.7, pytest-7.3.1, pluggy-1.0.0
rootdir: /repo/tests/functional
plugins: bdd-6.1.1
collected 2 items

Feature: EKS Cluster Validation # features/cluster_validation.feature:1
  Scenario: Verify critical components are running # features/cluster_validation.feature:3
    Given an EKS cluster is available                 # steps/cluster_steps.py:6
    When I check the kube-system namespace            # steps/cluster_steps.py:12
    Then all critical pods should be in Running state # steps/cluster_steps.py:16
  Scenario: Check logs for errors                  # features/cluster_validation.feature:8
    Given an EKS cluster is available              # steps/cluster_steps.py:6
    When I check pods in the kube-system namespace # steps/cluster_steps.py:20
    Then logs should not contain any errors        # steps/cluster_steps.py:32

============================= 2 passed in 8.32s ===============================

The JUnit XML report (pytest-report.xml) contains structured test results like the following:

<?xml version="1.0" encoding="utf-8"?>
<testsuites>
<testsuite name="features.cluster_validation" errors="0" failures="0" skipped="0" tests="2" time="8.320" timestamp="2025-06-04T10:20:15">
<testcase classname="features.cluster_validation" name="Verify critical components are running" time="4.123">
</testcase>
<testcase classname="features.cluster_validation" name="Check logs for errors" time="4.197">
</testcase>
</testsuite>
</testsuites>

This output demonstrates:

  • Successful execution of BDD scenarios
  • Verification that all critical pods are running
  • Confirmation that no errors are found in pod logs
  • Test timing information
  • Overall test summary showing all tests passed

The JUnit report can be integrated with CI/CD systems for reporting and tracking test results over time.

3. Helm testing

Helm testing establishes that your applications deploy correctly and function as expected within the Kubernetes environment. It’s important because:

  • It validates that your Helm charts are correctly structured
  • It establishes that deployed applications are accessible and functional
  • It verifies that services can communicate with each other
  • It catches configuration issues before they affect users
  • It provides a standardized way to test application deployments

How to implement

Create a helm directory with your Helm charts and tests:

helm/
├── Chart.yaml
├── values.yaml
├── templates/
│   └── ...
└── tests/
    ├── test-connection.yaml
    └── test-resources.yaml

Example test-connection.yaml. This is a Helm test manifest that creates a temporary Pod to verify that the application’s service is accessible within the Kubernetes cluster by running a wget command:

apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "app.fullname" . }}-test-connection"
  labels:
    {{- include "app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['{{ include "app.fullname" . }}:{{ .Values.service.port }}']
  restartPolicy: Never
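
The test-resources.yaml file listed under tests/ follows the same Helm test hook pattern. A possible sketch, assuming the chart creates a Deployment named after the release and that the chart’s service account has permission to read Deployments (both are assumptions about your chart), is the following:

apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "app.fullname" . }}-test-resources"
  labels:
    {{- include "app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  serviceAccountName: "{{ include "app.fullname" . }}"
  containers:
    - name: kubectl
      image: bitnami/kubectl
      command: ['kubectl']
      args: ['get', 'deployment', '{{ include "app.fullname" . }}', '-n', '{{ .Release.Namespace }}']
  restartPolicy: Never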

Sample run and expected output

When running Helm tests, the output looks like the following:

$ helm test app-release --logs
NAME: app-release
LAST DEPLOYED: Wed Jun 4 10:15:22 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE:     app-release-test-connection
Last Started:   Wed Jun 4 10:16:05 2025
Last Completed: Wed Jun 4 10:16:15 2025
Phase:          Succeeded
NOTES:
Application successfully deployed and tested!

POD LOGS: app-release-test-connection
wget: download completed

This output confirms that:

  • The application Helm chart was successfully deployed
  • The test connection Pod completed successfully, confirming that the service is reachable on its configured port
  • All tests passed successfully

4. Kubernetes policy testing with Chainsaw

Policy testing establishes that your Kubernetes cluster enforces the security and compliance requirements that your organization needs. It’s critical because:

  • It validates that security policies are correctly implemented
  • It establishes that non-compliant resources are rejected
  • It verifies that your governance controls are working
  • It helps maintain compliance with industry standards and regulations
  • It prevents security vulnerabilities from being introduced

How to implement

Create a policies directory with your Kyverno policies and Chainsaw tests:

policies/
├── kyverno-policies/
│   ├── require-labels.yaml
│   └── restrict-image-registries.yaml
└── tests/
    ├── test-require-labels.yaml
    └── test-restrict-registries.yaml

Example test-require-labels.yaml. This is a Chainsaw test manifest that validates Kubernetes label policy enforcement by applying a policy, testing a valid deployment passes, and confirming that an invalid deployment without the required labels is properly rejected:

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: test-require-labels
spec:
  steps:
  - name: step-01-apply-policy
    apply:
      file: ../kyverno-policies/require-labels.yaml
  - name: step-02-apply-valid-deployment
    apply:
      file: resources/valid-deployment.yaml
  - name: step-03-apply-invalid-deployment
    apply:
      file: resources/invalid-deployment.yaml
    expect:
      reject: true
      message: "validation error: required labels are not set"

Sample run and expected output

When running policy tests with Chainsaw, the output looks like the following:

$ chainsaw test --report-format junit
=== RUN   test-require-labels
=== RUN   test-require-labels/step-01-apply-policy
INFO[0000] ✅ Successfully applied resource               name=require-labels namespace=default resource=ClusterPolicy.kyverno.io/v1
=== RUN   test-require-labels/step-02-apply-valid-deployment
INFO[0001] ✅ Successfully applied resource               name=valid-deployment namespace=default resource=Deployment.apps/v1
=== RUN   test-require-labels/step-03-apply-invalid-deployment
INFO[0002] ✅ Resource rejected as expected               name=invalid-deployment namespace=default resource=Deployment.apps/v1
INFO[0002] ✅ Error message matched                       expected="validation error: required labels are not set" received="admission webhook \"validate.kyverno.svc\" denied the request: resource Deployment/default/invalid-deployment was blocked due to the following policies: require-labels: validation error: required labels are not set"
--- PASS: test-require-labels (3.45s)
    --- PASS: test-require-labels/step-01-apply-policy (0.82s)
    --- PASS: test-require-labels/step-02-apply-valid-deployment (1.21s)
    --- PASS: test-require-labels/step-03-apply-invalid-deployment (1.42s)
PASS

=== RUN   test-restrict-registries
=== RUN   test-restrict-registries/step-01-apply-policy
INFO[0000] ✅ Successfully applied resource               name=restrict-image-registries namespace=default resource=ClusterPolicy.kyverno.io/v1
=== RUN   test-restrict-registries/step-02-apply-valid-deployment
INFO[0001] ✅ Successfully applied resource               name=valid-registry-deployment namespace=default resource=Deployment.apps/v1
=== RUN   test-restrict-registries/step-03-apply-invalid-deployment
INFO[0002] ✅ Resource rejected as expected               name=invalid-registry-deployment namespace=default resource=Deployment.apps/v1
INFO[0002] ✅ Error message matched                       expected="validation error: image registry not allowed" received="admission webhook \"validate.kyverno.svc\" denied the request: resource Deployment/default/invalid-registry-deployment was blocked due to the following policies: restrict-image-registries: validation error: image registry not allowed"
--- PASS: test-restrict-registries (3.12s)
    --- PASS: test-restrict-registries/step-01-apply-policy (0.75s)
    --- PASS: test-restrict-registries/step-02-apply-valid-deployment (1.15s)
    --- PASS: test-restrict-registries/step-03-apply-invalid-deployment (1.22s)
PASS

Ran 2 test(s) in 6.57s
Tests succeeded: 2, Failed: 0

This output demonstrates:

  • Successful application of both Kyverno policies
  • Compliant deployments are admitted by the admission controller as expected
  • Non-compliant deployments are rejected with the expected validation error messages
  • Overall test summary showing all tests passed

5. Non-functional testing with Locust

Non-functional testing evaluates the performance, scalability, and reliability of your EKS cluster under various conditions. It’s vital because:

  • It identifies performance bottlenecks before they impact users
  • It determines the maximum capacity of your cluster
  • It validates that your cluster can handle expected load
  • It helps optimize resource allocation and scaling configurations
  • It establishes that critical services remain responsive under stress

How to implement

Create a tests/load directory with your Locust tests:

tests/load/
├── coredns_locustfile.py
└── karpenter_locustfile.py

Example coredns_locustfile.py. This is a Locust load testing script that simulates DNS resolution stress on CoreDNS by dynamically creating Kubernetes services, querying their DNS records, and deleting them to measure DNS performance under load:

from locust import User, task, between, TaskSet
import kubernetes as k8s
import random
import string
import time

# Load Kubernetes configuration
k8s.config.load_kube_config()
v1 = k8s.client.CoreV1Api()
namespace_name = "locust-test"

def generate_service_name(length=10):
    return ''.join(random.choices(string.ascii_lowercase, k=length))

def create_service(name, namespace):
    service = k8s.client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=k8s.client.V1ObjectMeta(name=name, namespace=namespace),
        spec=k8s.client.V1ServiceSpec(
            ports=[k8s.client.V1ServicePort(port=80, target_port=80)],
            selector={"app": name}
        )
    )
    return v1.create_namespaced_service(namespace=namespace, body=service)

def generate_and_create_services(namespace, count=5):
    service_names = []
    for _ in range(count):
        name = generate_service_name()
        create_service(name, namespace)
        service_names.append(name)
    return service_names

def query_coredns(service_names, namespace):
    import dns.resolver
    resolver = dns.resolver.Resolver()
    resolver.nameservers = ['10.100.0.10']  # CoreDNS service IP
    
    for name in service_names:
        try:
            dns_name = f"{name}.{namespace}.svc.cluster.local"
            answers = resolver.resolve(dns_name, 'A')
            for rdata in answers:
                ip = rdata.address
        except Exception as e:
            print(f"DNS query failed: {e}")

def delete_services(service_names, namespace):
    for name in service_names:
        v1.delete_namespaced_service(name=name, namespace=namespace)

# Use the base User class because this test drives the Kubernetes API and DNS
# directly instead of issuing HTTP requests through self.client
class CoreDNSUser(User):
    wait_time = between(1, 3)
    
    @task
    class CoreDNSTaskSet(TaskSet):
        @task
        def create_query_delete_services(self):
            # Create services
            service_names = generate_and_create_services(namespace_name, 5)
            # Query CoreDNS
            query_coredns(service_names, namespace_name)
            # Delete services
            delete_services(service_names, namespace_name)

Sample run and expected output

When running the Locust load tests, the output looks like the following:

$ locust -f coredns_locustfile.py --headless -u 20 -r 2 -t 5m
[2025-06-04 10:30:12,345] INFO/MainProcess: Starting Locust 2.15.1
[2025-06-04 10:30:12,352] INFO/MainProcess: Starting 20 users at a rate of 2 users/s
[2025-06-04 10:35:12,456] INFO/MainProcess: Test finished

Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                        542     0(0.00%) |    345      78    1245    320 |    1.8        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             320    380    450    510    680    820    980   1100   1230   1240   1245    542

Test completed successfully.

This output demonstrates how your CoreDNS service performs under load, showing metrics like the following:

  • Average response time (345 ms)
  • Minimum and maximum response times (78 ms to 1245 ms)
  • Request throughput (1.8 requests per second)
  • Error rate (0% in this example)

You can use these metrics to identify potential bottlenecks and establish that your cluster can handle the expected load before deploying to production.
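
The karpenter_locustfile.py listed in the tests/load directory follows the same pattern. A rough sketch, assuming a pre-created deployment named inflate in the locust-test namespace whose replica count drives Karpenter node provisioning (the deployment name and replica counts are assumptions), might scale it up and back down in each task:

from locust import User, task, between
import kubernetes as k8s

# Load Kubernetes configuration and create an apps/v1 client
k8s.config.load_kube_config()
apps_v1 = k8s.client.AppsV1Api()

NAMESPACE = "locust-test"
DEPLOYMENT = "inflate"  # hypothetical deployment used to drive node scaling

def scale_deployment(replicas):
    # Patch only the replica count of the target deployment
    apps_v1.patch_namespaced_deployment_scale(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": replicas}},
    )

class KarpenterUser(User):
    wait_time = between(5, 10)

    @task
    def scale_up_and_down(self):
        # Scaling up forces Karpenter to provision capacity; scaling
        # back down lets it consolidate and remove nodes
        scale_deployment(20)
        scale_deployment(0)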

6. Resilience assessment with Resilience Hub

Resilience assessment evaluates how well your EKS cluster can withstand and recover from failures. It’s essential because:

  • It identifies single points of failure in your architecture
  • It validates that your recovery mechanisms work as expected
  • It establishes business continuity during disruptions
  • It helps meet availability SLAs and compliance requirements
  • It provides confidence that your cluster can handle real-world incidents

How to implement

Create a resilience directory with scripts for resilience assessment:

resilience/
├── run-resilience-assessment.sh
└── run-fault-injection.sh

Example run-resilience-assessment.sh. This is a shell script that creates a Resilience Hub application for an EKS cluster, runs a resilience assessment to evaluate its disaster recovery capabilities, and saves the results to a JSON file:

#!/bin/bash
set -e

CLUSTER_NAME=$1
REGION=$2
APP_NAME="${CLUSTER_NAME}-app"
# Create Resilience Hub application if it doesn't exist
APP_ARN=$(aws resiliencehub list-apps --query "appSummaries[?name=='${APP_NAME}'].appArn" --output text)
if [ -z "$APP_ARN" ]; then
  echo "Creating Resilience Hub application..."
  APP_ARN=$(aws resiliencehub create-app \
    --name "${APP_NAME}" \
    --description "EKS cluster resilience assessment" \
    --app-template-body "{\"resources\":[{\"logicalResourceId\":{\"identifier\":\"${CLUSTER_NAME}\"},\"resourceType\":\"AWS::EKS::Cluster\",\"type\":\"AWS::EKS::Cluster\"}]}" \
    --query "app.appArn" \
    --output text)
fi
# Run assessment
echo "Running resilience assessment..."
ASSESSMENT_ARN=$(aws resiliencehub start-app-assessment \
  --app-arn "${APP_ARN}" \
  --assessment-name "pipeline-assessment-$(date +%Y%m%d-%H%M%S)" \
  --query "assessment.assessmentArn" \
  --output text)
# Wait for assessment to complete
echo "Waiting for assessment to complete..."
aws resiliencehub wait assessment-executed --assessment-arn "${ASSESSMENT_ARN}"
# Get assessment results
echo "Getting assessment results..."
aws resiliencehub describe-app-assessment \
  --assessment-arn "${ASSESSMENT_ARN}" > assessment-report.json
echo "Assessment complete. Results saved to assessment-report.json"

Example run-fault-injection.sh. This is a shell script that creates and runs an AWS FIS experiment to test the resilience of an EKS cluster by simulating an availability zone outage and capturing the results:

#!/bin/bash
set -e

CLUSTER_NAME=$1
REGION=$2
# Create FIS experiment template
TEMPLATE_ID=$(aws fis create-experiment-template \
  --targets "eks-cluster={resourceType=aws:eks:cluster,resourceArns=[arn:aws:eks:${REGION}:$(aws sts get-caller-identity --query Account --output text):cluster/${CLUSTER_NAME}]}" \
  --actions "az-outage={actionId=aws:eks:inject-availability-zone-failure,targets={eks-cluster=eks-cluster},parameters={completionMode=forced}}" \
  --stop-conditions "duration={source=none,value=10m}" \
  --role-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/FISExperimentRole" \
  --description "Test EKS cluster resilience to AZ failure" \
  --query "experimentTemplate.id" \
  --output text)
# Start FIS experiment
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "${TEMPLATE_ID}" \
  --query "experiment.id" \
  --output text)
echo "Started FIS experiment ${EXPERIMENT_ID}"
# Wait for experiment to complete
echo "Waiting for experiment to complete..."
aws fis wait experiment-completed --id "${EXPERIMENT_ID}"
# Get experiment results
echo "Getting experiment results..."
aws fis get-experiment \
  --id "${EXPERIMENT_ID}" > fis-results.json
echo "Experiment complete. Results saved to fis-results.json"

Sample run and expected output

When running the resilience assessment scripts, the output looks like the following:

$ ./run-resilience-assessment.sh eks-validation-cluster us-west-2
Creating Resilience Hub application...
Running resilience assessment...
Waiting for assessment to complete...
Getting assessment results...
Assessment complete. Results saved to assessment-report.json

$ cat assessment-report.json
{
  "assessment": {
    "appArn": "arn:aws:resiliencehub:us-west-2:123456789012:app/eks-validation-cluster-app/1a2b3c4d",
    "assessmentArn": "arn:aws:resiliencehub:us-west-2:123456789012:app-assessment/5e6f7g8h",
    "assessmentName": "pipeline-assessment-20250604-103015",
    "assessmentStatus": "SUCCEEDED",
    "complianceStatus": "POLICY_COMPLIANT",
    "resiliencyScore": 85.0,
    "driftStatus": "NOT_DRIFTED",
    "invoker": "USER",
    "appVersion": "1",
    "assessmentTimeStamp": "2025-06-04T10:30:15.000Z"
  }
}

$ ./run-fault-injection.sh eks-validation-cluster us-west-2
Started FIS experiment fis-12345678abcdef01
Waiting for experiment to complete...
Getting experiment results...
Experiment complete. Results saved to fis-results.json

$ cat fis-results.json
{
  "experiment": {
    "id": "fis-12345678abcdef01",
    "experimentTemplateId": "fit-12345678abcdef01",
    "state": {
      "status": "COMPLETED",
      "reason": "Experiment completed successfully"
    },
    "targets": {
      "eks-cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-west-2:123456789012:cluster/eks-validation-cluster"
        ]
      }
    },
    "actions": {
      "az-outage": {
        "actionId": "aws:eks:inject-availability-zone-failure",
        "state": {
          "status": "COMPLETED"
        },
        "startTime": "2025-06-04T10:35:00.000Z",
        "endTime": "2025-06-04T10:45:00.000Z"
      }
    },
    "startTime": "2025-06-04T10:35:00.000Z",
    "endTime": "2025-06-04T10:45:00.000Z"
  }
}

These outputs show a successful resilience assessment with a score of 85.0 and a completed fault injection experiment that simulated an Availability Zone failure. The assessment indicates that the cluster is policy compliant, and the fault injection experiment completed successfully, helping you identify how your cluster responds to failures.

Pipeline monitoring and visualization

You can view the pipeline execution in GitLab’s CI/CD interface, which provides a visual representation of each stage and its status. The pipeline generates reports and artifacts that can be reviewed to assess the quality of your Amazon EKS deployment. This implementation creates a complete quality assurance pipeline that validates all aspects of your EKS clusters throughout the development lifecycle.

Benefits of the Amazon EKS validation framework

Our Amazon EKS validation framework brings practical value to our Kubernetes operations through several key benefits. We test each part of our Amazon EKS setup to catch and fix issues before they reach production, leading to more stable services for our users. Our policy tests verify that security measures work as planned, giving us confidence in our cluster protection. Through load testing, we understand how our applications and infrastructure handle increased traffic, helping us prepare for busy periods and plan for growth. Tools such as Resilience Hub and AWS FIS teach us how our system reacts to failures so that we can improve recovery plans and reduce potential downtime.

Moreover, the automation in our framework cuts down manual testing time so that we can focus more on building new features and responding quickly to changes. This approach of finding and fixing issues early in development saves costs when compared to addressing them in production. Our testing process also establishes that our Amazon EKS environment meets both regulatory standards and internal rules, simplifying audits and reviews. The framework is a practical tool that helps us build and maintain reliable Kubernetes infrastructure that serves our needs today and supports our growth tomorrow.

Conclusion

A comprehensive quality assurance pipeline for Amazon EKS clusters helps establish that your Kubernetes environments are properly configured, secure, and ready for production workloads. You can implement the six validation components outlined in this post to verify correct infrastructure provisioning, establish functional correctness and security of applications, test the environment’s ability to handle expected load, and validate recovery from failures. This structured approach to validation builds confidence in your Amazon EKS deployments, reduces the risk of production issues, and maintains high-quality, production-ready Kubernetes environments. As containerized applications become increasingly critical to business operations, investing in comprehensive validation frameworks is essential, enabling organizations to maximize the benefits of Kubernetes while minimizing operational risks. You can adopt these validation frameworks to accelerate your journey toward reliable and performant Kubernetes deployments on Amazon EKS.


About the authors

Niall Thomson is a Principal Specialist Solutions Architect, Containers, at AWS where he helps customers who are building modern application platforms on AWS container services.

Ramesh Mathikumar is a Principal Consultant within the Global Financial Services practice. He has been working with Financial services customers over the last 25 years. At AWS, he helps customers succeed in their cloud journey by implementing AWS technologies every day.

Sundar Shanmugam is a Sr. Cloud Infrastructure Architect at AWS, specializing in solution architecture, workload migrations, and modernization. He focuses on developing innovative generative AI solutions and helps customers drive digital transformation while maximizing their AWS investment to achieve business objectives.