AWS Partner Network (APN) Blog

Simplifying Management of Amazon EKS Environments with Automated Fleet Operations from Rafay

By Anirban Chatterjee, Head of Product Marketing – Rafay Systems
By Welly Siauw, Principal Partner Solutions Architect – AWS


Many organizations choose Amazon Elastic Kubernetes Service (Amazon EKS) to orchestrate and manage their containerized applications on Amazon Web Services (AWS) and on premises. Amazon EKS provides a managed, scalable Kubernetes control plane that integrates with AWS networking and security services.

Customers often tell us that managing the lifecycle of a fleet of Kubernetes clusters has become a critical yet complex operation. Well-run fleet-wide operations help ensure uniformity, scalability, and high availability across EKS clusters. Despite these advantages, fleet-wide operations come with their own challenges.

In this post, we will explain the key features of fleet operations and provide an overview of how Automated Fleet Operations from Rafay can help customers improve operational efficiency, governance, and high availability of EKS clusters.

Rafay Systems is an AWS Specialization Partner and AWS Marketplace Seller with the Containers Competency. Rafay helps customers manage the full lifecycle of Kubernetes infrastructure and modern applications in a single, easy-to-use, integrated platform.

What are Fleet Operations?

Enterprises typically manage sets of Amazon EKS clusters as a fleet. Depending on how the enterprise is structured and operates, the fleet could consist of groups of EKS clusters for development, test, and production. It may include EKS clusters running on AWS, on-premises, or edge locations such as factories, stores, or distribution centers.

Clusters in the fleet are grouped because of the need to apply similar operations, monitoring, and governance. Fleet operations is a process for managing, monitoring, and governing a heterogeneous fleet of EKS clusters.

Managing fleet operations at scale often comes with challenges such as:

  • Limited tooling: One of the most pressing issues in fleet operations is the lack of robust, dedicated tooling. While Kubernetes itself provides tools for managing individual clusters, many organizations find it difficult to extend these tools to manage fleets in a streamlined and automated manner.
  • Complexity in multi-cluster management: Managing multiple Kubernetes clusters can be challenging due to differences in configuration, policy, and security across various clusters and environments. This complexity often leads to inconsistencies and operational challenges, impacting the efficiency and reliability of applications.
  • Security and compliance concerns: As organizations scale up their Kubernetes deployments, ensuring security and compliance across all clusters becomes increasingly difficult. Centralized security policy enforcement across diverse clusters and environments can be a complex task at scale.
  • Manual pre-op and post-op checks: Another crucial gap lies in the end-to-end workflow of complex operations. Such a workflow generally requires a set of prerequisites, a set of core operations on the Kubernetes resources, and post-operation steps to validate workloads or reconfigure them for production. Without an automated end-to-end workflow, organizations struggle to build a full picture of their operations, and mistakes can have a significant impact.
  • Resource orchestration: Fleet-wide operations that apply to a broad scope of resources, such as a Kubernetes upgrade, should be staged across the environment to maintain stability and availability for developers and business functions. Staging them manually is inefficient and lengthens the overall process.
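
To illustrate the staging idea in the last point, here is a minimal Python sketch that splits a fleet into ordered waves so that lower-risk environments are processed first. The environment names, the `plan_waves` helper, and the batch size are hypothetical and chosen for illustration only; they are not part of any Rafay or AWS API.

```python
from itertools import islice

# Hypothetical ordering: lower-risk environments are processed first.
STAGE_ORDER = ["development", "test", "production"]


def plan_waves(clusters, batch_size=2):
    """Group (name, environment) pairs into ordered waves.

    Clusters are visited stage by stage, and each stage is split into
    batches of at most `batch_size` clusters.
    """
    waves = []
    for stage in STAGE_ORDER:
        names = sorted(name for name, env in clusters if env == stage)
        it = iter(names)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                break
            waves.append((stage, batch))
    return waves


# Made-up fleet for illustration.
fleet = [
    ("prod-a", "production"),
    ("dev-a", "development"),
    ("prod-b", "production"),
    ("prod-c", "production"),
    ("test-a", "test"),
]
waves = plan_waves(fleet)
for stage, batch in waves:
    print(stage, batch)
```

Each wave would be run to completion and checked before the next one starts, which is the part that becomes slow and error-prone when done by hand.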

Automated Fleet Operations from Rafay

We often hear from customers that they need help managing fleet operations without developing their own solutions. Rafay’s Kubernetes Operations Platform provides automation and governance capabilities that platform teams can use to standardize Kubernetes workflows.

Automated Fleet Operations is a new capability offered by Rafay that eliminates the inefficiencies of running fleet-wide Kubernetes operations manually.

Rafay’s Automated Fleet Operations consists of four components:

  • Fleet Plans: Collections of operations you can use to apply comprehensive tasks such as upgrades, backups, patches, and scaling in or out.
  • Operations: Each operation consists of a set of pre-hooks, an action, and a set of post-hooks. Hooks are executed as pods by either a cluster-type runner or an agent-type runner.
  • Fleet: A collection of target EKS clusters for the operations, selected by project and label.
  • Agents: Containers that run on a selected cluster to execute the hooks.
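
The relationship between these components can be sketched in a few lines of Python. This is a conceptual model only, not Rafay's implementation: the `Operation` class and `run_plan` function are hypothetical names, hooks are modeled as plain functions rather than pods, and the rule that the first failing step stops the operation for a cluster is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A hook or action is modeled as a callable that takes a cluster name and
# returns True on success. Plain functions stand in for pods here.
Step = Callable[[str], bool]


@dataclass
class Operation:
    name: str
    action: Step
    pre_hooks: List[Step] = field(default_factory=list)
    post_hooks: List[Step] = field(default_factory=list)

    def run(self, cluster: str) -> str:
        """Run pre-hooks, then the action, then post-hooks, in that order.

        Assumption: the first failing step stops the operation.
        """
        for hook in self.pre_hooks:
            if not hook(cluster):
                return "pre-hook failed"
        if not self.action(cluster):
            return "action failed"
        for hook in self.post_hooks:
            if not hook(cluster):
                return "post-hook failed"
        return "success"


def run_plan(operations: List[Operation], fleet: List[str]) -> dict:
    """Apply every operation of a plan to every cluster in the fleet."""
    results = {}
    for cluster in fleet:
        status = "success"
        for op in operations:
            status = op.run(cluster)
            if status != "success":
                break
        results[cluster] = status
    return results


calls = []  # records the order in which steps ran
upgrade = Operation(
    name="upgrade",
    pre_hooks=[lambda c: calls.append(("pre", c)) or True],
    # The action is made to fail on tenant2 to show the short-circuit.
    action=lambda c: calls.append(("action", c)) or c != "tenant2",
    post_hooks=[lambda c: calls.append(("post", c)) or True],
)
results = run_plan([upgrade], ["tenant1", "tenant2"])
print(results)
```

In this model the post-hook never runs for a cluster whose action failed, so a validation step cannot report success for an upgrade that did not happen.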

diagram showing Rafay controller with Fleet plan components, including pre-hooks, action, and post-hooks. Each fleet plan can target multiple EKS clusters. Optionally the EKS cluster can run agent as Kubernetes pod

Figure 1 – Rafay Fleet Operations components.

In the following section, we’ll demonstrate how Rafay’s Automated Fleet Operations can assist you in performing complex operations at scale, across multiple EKS clusters.

In this demo, we have pre-existing EKS clusters to upgrade. Before the upgrade, we want to run a pre-upgrade inspection for any Kubernetes API deprecations. As a post-upgrade step, we want to validate that the upgrade completed successfully and that the cluster nodes are in the Ready state.

First, log in to the Rafay Kubernetes Operations Platform to review the list of EKS clusters we provisioned using Rafay. We group the clusters by Rafay’s built-in label “rafay/k8sVersion” to allow fleet operations to target the cluster by its Kubernetes version.

screen showing two EKS clusters named tenant1 and tenant2, both clusters are managed by Rafay

Figure 2 – Rafay console for cluster.

Next, we navigate to the Fleet Plans to create a new plan. We can use a declarative YAML configuration or the form user interface (UI) to set up the plan. You can set up multiple operations within a single Fleet Plan; for example, to upgrade the control plane and node groups separately.

For this demo, we add an operation with a pre-hook, a post-hook, and a single action.

Figure 3 – Rafay Fleet Plans operation workflow.

Rafay provides built-in named actions for common operations such as control plane upgrade, node group update, node + control plane upgrade, and patch operations. We choose control plane upgrade and set the Kubernetes version as a parameter.

The pre-hook executes KubePug as a pod to perform deprecation checks against a specific version of Kubernetes. The post-hook runs kubectl to verify that all nodes are in the Ready state. The full YAML configuration for this Fleet Plan is shown below.

kind: FleetPlan
 name: upgrade-1-26
 project: defaultproject
   kind: clusters
   labels: aws-us-west-2 aws-eks '1.25'
     - name: defaultproject
     - name: upgrade
         - description: check K8s API deprecation
           name: pre-hook-upgrade
             runner: cluster
               - '--k8s-version=v1.26.6'
         type: controlPlaneUpgrade
         name: upgrade-to-1-26
           version: '1.26'
         - description: Check nodes status post upgrade
           name: post-hook-upgrade
             runner: cluster
             image: bitnami/kubectl:latest
               - '-c'
               - >-
                 kubectl get nodes -o wide | grep -v '^NAME' | grep -vw 'Ready'
                 | wc -l | tr -d ' '
               - /bin/sh
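
The post-hook's shell pipeline counts the nodes that are not Ready. The same check can be sketched in Python, which makes one detail explicit: the STATUS value has to be compared exactly, because the string NotReady also contains the text Ready. The `count_not_ready` helper and the sample output are made up for illustration.

```python
def count_not_ready(kubectl_output: str) -> int:
    """Count nodes whose STATUS column does not report Ready.

    Expects the text printed by `kubectl get nodes`, header line included.
    """
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    not_ready = 0
    for line in lines[1:]:  # skip the NAME/STATUS header row
        status = line.split()[1]
        # STATUS may be compound, for example "Ready,SchedulingDisabled".
        conditions = status.split(",")
        if "Ready" not in conditions:
            not_ready += 1
    return not_ready


# Made-up sample output for illustration.
sample = """\
NAME          STATUS                     ROLES    AGE   VERSION
ip-10-0-1-1   Ready                      <none>   12d   v1.26.6
ip-10-0-1-2   NotReady                   <none>   12d   v1.26.6
ip-10-0-1-3   Ready,SchedulingDisabled   <none>   12d   v1.26.6
"""
print(count_not_ready(sample))
```

A result of zero means every node reported Ready; any other value is the number of nodes that need attention before the upgrade can be considered complete.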

In the YAML configuration above, we select the cluster fleet with label and project selectors. Rafay also provides a UI to select and verify the target clusters.
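
As a rough mental model of how such a selector narrows down the fleet, consider the Python sketch below. It assumes that a cluster is selected only when it belongs to one of the listed projects and carries every listed label with the same value; Rafay's actual matching rules may differ. The `select_fleet` function, the label keys, and the `tenant3` cluster are hypothetical.

```python
def select_fleet(clusters, projects, labels):
    """Return names of clusters in `projects` that carry every label in `labels`."""
    selected = []
    for cluster in clusters:
        in_project = cluster["project"] in projects
        has_labels = all(
            cluster["labels"].get(key) == value for key, value in labels.items()
        )
        if in_project and has_labels:
            selected.append(cluster["name"])
    return selected


# Made-up inventory; label keys are illustrative placeholders.
inventory = [
    {"name": "tenant1", "project": "defaultproject",
     "labels": {"clusterType": "aws-eks", "k8sVersion": "1.25"}},
    {"name": "tenant2", "project": "defaultproject",
     "labels": {"clusterType": "aws-eks", "k8sVersion": "1.25"}},
    {"name": "tenant3", "project": "defaultproject",
     "labels": {"clusterType": "aws-eks", "k8sVersion": "1.26"}},
]

targets = select_fleet(
    inventory,
    projects={"defaultproject"},
    labels={"clusterType": "aws-eks", "k8sVersion": "1.25"},
)
print(targets)
```

Selecting on the Kubernetes version label means that clusters already on the target version fall out of the fleet and are not touched by the upgrade.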

screen showing Fleet Plan console, with Operation Workflow Fleet selected on the side panel, the center screen showing list of selected labels including clusterLocation, clusterType and defaultproject

Figure 4 – Rafay fleet selectors.

Finally, to execute the operation we run the Fleet Plan and watch the status from the dashboard.

screen showing pie chart diagram with 100% completion, two total resources, two success and zero failed

Figure 5 – Rafay Fleet Plan status.

Using Rafay’s Automated Fleet Operations, you can perform other operations consistently across multiple clusters, such as scaling worker nodes up or down.

     - name: group-management
         type: patch
         name: set-max-node-group
           - op: replace
             path: .spec.config.managedNodeGroups[0].desiredCapacity
             value: 10
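
To make the effect of a `replace` patch concrete, the Python sketch below applies one to a nested cluster specification. It is an illustration, not Rafay's patch engine: the `parse_path` and `apply_replace` helpers are hypothetical, the shape of the cluster document is inferred from the path in the example, and the rule that the target must already exist is borrowed from JSON Patch as an assumption.

```python
import re

# Matches either ".key" or "[index]" at the current position.
_TOKEN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def parse_path(path):
    """Split '.a.b[0].c' into ['a', 'b', 0, 'c']."""
    tokens = []
    pos = 0
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"cannot parse path at offset {pos}: {path!r}")
        key, index = match.groups()
        tokens.append(key if key is not None else int(index))
        pos = match.end()
    return tokens


def apply_replace(document, path, value):
    """Apply a 'replace' op in place; the target must already exist."""
    *parents, last = parse_path(path)
    node = document
    for token in parents:
        node = node[token]
    _ = node[last]  # raises KeyError or IndexError if the target is missing
    node[last] = value
    return document


# Made-up cluster specification for illustration.
cluster = {
    "spec": {
        "config": {
            "managedNodeGroups": [
                {"name": "ng-1", "desiredCapacity": 3},
            ]
        }
    }
}

apply_replace(cluster, ".spec.config.managedNodeGroups[0].desiredCapacity", 10)
print(cluster["spec"]["config"]["managedNodeGroups"][0])
```

Because the same path and value are applied to every cluster in the fleet, the node groups end up with a consistent capacity regardless of their starting state.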

You can also apply security hardening, such as restricting access to the cluster endpoint.

     - name: security-hardening
         type: patch
         name: disable-public-access
           - op: replace
             path: .spec.config.vpc.clusterEndpoints.privateAccess
             value: true
           - op: replace
             path: .spec.config.vpc.clusterEndpoints.publicAccess
             value: false

Everything we showed you earlier can be performed from Rafay’s web console, as well as from the command line interface (CLI) and API. Rafay’s CLI (RCTL) supports the full Fleet Plan lifecycle, so you can declare and execute fleet plans as part of your automation. You can also use Rafay’s API to perform fleet plan operations.

For more details on how to configure RCTL and the API, check out Rafay’s automation documentation.


In this post, we demonstrated how to use Rafay’s Automated Fleet Operations. By leveraging this solution with Amazon EKS environments, you can enhance visibility through centralized management of clusters, improve operational efficiency, and achieve compliance with organizational governance policies.

Automated Fleet Operations is available for customers using Rafay to manage their EKS fleets. To learn more about this feature, refer to the Rafay documentation and tutorial, or visit Rafay Systems in AWS Marketplace.


Rafay Systems – AWS Partner Spotlight

Rafay Systems is an AWS Partner that helps customers manage the full lifecycle of Kubernetes infrastructure and modern applications in a single, easy-to-use, integrated platform.
