Part 1: Multi-Cluster GitOps using Amazon EKS, Flux, and Crossplane

Introduction

GitOps is a way of managing application and infrastructure deployment so that the whole system is described declaratively in a Git repository. It’s an operational model that offers you the ability to manage the state of multiple Kubernetes clusters using the best practices of version control, immutable artifacts, and automation. Organizations have adopted GitOps to improve productivity, developer experience, stability, reliability, consistency, standardization, and security guarantees. Refer to the Guide to GitOps for more details about GitOps principles, patterns, and benefits.

Many AWS customers use multiple Amazon Elastic Kubernetes Service (Amazon EKS) clusters to segregate workloads that belong to different lines of business within their organizations, or to separate environments such as production and staging, in order to comply with governance rules related to division of responsibilities. Platform teams in these organizations face the challenge of managing the lifecycles of these clusters in a consistent manner. Customers that adopt Amazon EKS also use other services, such as messaging services, relational databases, and key-value stores, in conjunction with their containerized workloads. It’s typical for an application running on an Amazon EKS cluster to interact with managed services such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Amazon Simple Queue Service (Amazon SQS). Ideally, the lifecycles of these managed resources, including the Amazon EKS cluster itself, should be managed using the same GitOps declarative model used for managing applications.

Flux and Crossplane are both open-source Cloud Native Computing Foundation (CNCF) projects built on the foundation of Kubernetes to orchestrate application and infrastructure resources. Flux is a declarative, GitOps-based continuous delivery tool that can be integrated into any Continuous Integration (CI)/Continuous Delivery (CD) pipeline, and it gives users the flexibility of choosing their Git provider (GitHub, GitLab, or Bitbucket). The ability to manage deployments to multiple remote Kubernetes clusters from a central management cluster, support for progressive delivery, and multi-tenancy are some of the notable features of Flux. Crossplane is an open-source Kubernetes add-on that enables platform teams to assemble cloud infrastructure resources without writing any code. Through the use of Kubernetes-style APIs (Custom Resource Definitions), Crossplane allows users to manage the lifecycles of AWS-managed resources. Employing these two tools together, customers can effectively manage the lifecycles of both applications and infrastructure using the GitOps model: they can define their managed resources using Kubernetes-style declarative configurations and deploy those artifacts to an Amazon EKS cluster along with those that pertain to application workloads, thus unifying application and infrastructure configuration and deployment.

In this blog series, we demonstrate how to build an extensible and flexible multi-cluster GitOps system, based on a hub-and-spoke model, that addresses the platform and application teams’ requirements. The series covers use cases such as managing the lifecycle of Amazon EKS clusters, bootstrapping them with the various tools needed for Day Two operations, deploying application workloads to the newly provisioned clusters, and managing the lifecycle of associated managed resources such as Amazon SQS queues and DynamoDB tables. The topics covered in the series include:

  • Use a management cluster (hub) to provision, bootstrap, and manage workload clusters (spokes) with Crossplane. Crossplane-specific custom resources are used to define the complete infrastructure for setting up an Amazon EKS cluster. These declarative artifacts are deployed to the management cluster using Flux, which allows you to deploy a fleet of workload clusters and bootstrap them with consistent tooling.
  • Create a multi-repository Git structure that caters for the different personas involved in the Software Development Life Cycle (SDLC) process. This includes a platform repository (repo) that contains platform-related manifests, application repos that contain the application deployment manifests, and a separate repo that stitches the two worlds (i.e., platform and application) together by holding the manifests that instruct the system to deploy applications into various clusters.
  • Manage secrets using GitOps by leveraging open-source tools like External Secrets Operator and Sealed Secrets.

Here’s the outline of what is covered in each post in this three-part series:

  • Part 1 introduces the high-level architecture of the solution and its key components
  • Part 2 dives into the mechanics of how Flux and Crossplane are used for provisioning Amazon EKS clusters and bootstrapping them with the needed tools
  • Part 3 discusses the application onboarding flow and how to use Kubernetes role-based access control (RBAC) and AWS Identity and Access Management (AWS IAM) Roles for Service Accounts (IRSA) to address security and multi-tenancy requirements

Solution overview

Use cases

Let’s start with the use cases and personas. The key personas involved in SDLC include platform engineers and application developers. Platform engineers want to:

  • provision new clusters, based on pre-defined templates;
  • manage existing clusters (e.g., upgrade the Kubernetes version of the control plane or the data plane);
  • install tools on the clusters (e.g., logging agents, monitoring agents, ingress controllers);
  • update installed tools;
  • or even delete installed tools and replace them with new ones that prove to be a better fit for the organization

On the other side, application developers want to:

  • deploy an application into a cluster, which may involve provisioning cloud resources that the application depends on;
  • update an application (i.e., roll out a new version);
  • or un-deploy an application when it’s no longer needed

Some organizations run multi-tenant clusters, which means multiple applications that are developed by different teams share the same cluster. In this case, governance is needed — a central governance team needs to approve or reject application teams’ requests to onboard their application into one of these multi-tenant clusters.

Figure 1. Use cases

GitOps for cloud resources

Kubernetes controllers continuously watch the desired state defined in the control plane of the Kubernetes cluster (i.e., the etcd data store), compare it to the actual state, and request any required changes to make the actual state match the desired state. GitOps extends this concept to the source code management system (SCM), which is usually Git. The desired state is stored in a Git repo, and a GitOps controller is installed on the cluster to continuously watch Git, and change the desired state defined in the cluster control plane (through the API server) to match the desired state defined in Git. In turn, the Kubernetes controllers (e.g., Deployment controller, Job controller, etc.) request any changes that are needed to bring the actual state into alignment with the desired state, as defined in the cluster control plane, and by implication, as defined in Git. The GitOps controller effectively orchestrates other Kubernetes controllers to make the desired state as defined in Git a reality.
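
To make this concrete, here is a minimal sketch of the two Flux custom resources involved: a GitRepository that tells Flux which repo to watch, and a Kustomization that applies a folder from that repo to the cluster. The repo URL and path are placeholders, and the exact API versions depend on your Flux release.

```yaml
# Sketch: Flux watches a Git repo and applies a folder from it to the cluster.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitops-system
  namespace: flux-system
spec:
  interval: 1m                                       # how often to poll Git
  url: https://github.com/example-org/gitops-system  # placeholder repo URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: gitops-system
  path: ./clusters/mgmt  # placeholder path within the repo
  prune: true            # remove cluster objects that are deleted from Git
```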

Multiple GitOps controllers are available, with Flux CD and Argo CD among the most popular. Flux CD is the GitOps controller used in this solution.

The built-in Kubernetes controllers manage the lifecycle of the native Kubernetes resources (e.g., Deployment, Service, ConfigMap, Secret, etc.). However, as mentioned above, applications often have dependencies on cloud resources like databases and messaging systems, and organizations typically use AWS managed services to address these dependencies. This is where an infrastructure controller comes into the picture.

Kubernetes is an extensible container orchestration platform: in addition to the native Kubernetes resources like Deployment and Service, custom resource definitions (CRDs) can be added and combined with custom controllers to extend Kubernetes. Please check Custom Resources for more details.

AWS Controllers for Kubernetes (ACK) and Crossplane are two examples of infrastructure controllers that use the CRD mechanism to manage the lifecycle of AWS cloud resources through the Kubernetes APIs. Crossplane is the one used in this post. One of the key concepts of Crossplane is Providers: Crossplane packages that bundle a set of Managed Resources and their respective controllers, allowing Crossplane to provision the corresponding cloud resources. Crossplane has providers for AWS and for the other major clouds. Crossplane thus allows extending the desired state defined in Git to cover cloud resources like Amazon RDS instances, Amazon DynamoDB tables, Amazon SQS queues, etc.
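
As a hedged illustration, a Crossplane managed resource for an Amazon SQS queue, using the community AWS provider, might look like the following. The exact API group and version depend on the provider release, and the ProviderConfig name is an assumption.

```yaml
# Sketch: a Crossplane managed resource representing an Amazon SQS queue.
apiVersion: sqs.aws.crossplane.io/v1beta1  # version depends on the provider release
kind: Queue
metadata:
  name: orders-queue
spec:
  forProvider:
    region: eu-west-1
  providerConfigRef:
    name: aws-provider-config  # assumed ProviderConfig holding AWS credentials
```

Once a manifest like this is committed to Git, the GitOps controller applies it to the cluster, and the Crossplane AWS provider calls the SQS APIs to create the queue and keep it in sync with the declared state.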

Multi-cluster GitOps

Now, let’s put these concepts together to address the requirements explained at the beginning of the post. The following diagram depicts a hub-and-spoke model comprising a management cluster and multiple workload clusters.

Figure 2. Multi-cluster GitOps solution

The management cluster is used for provisioning, bootstrapping, and managing workload clusters. The workload clusters are used for running applications.

The management cluster is created and bootstrapped as part of the initial setup of the system. There are several ways to perform this activity. Please refer to Creating an Amazon EKS cluster for different options for creating an Amazon EKS cluster.
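
For example, one option is to create the management cluster with eksctl using a declarative configuration file, and then bootstrap Flux on it. The following ClusterConfig is an illustrative sketch; the cluster name, region, version, and node group sizing are assumptions you would adapt.

```yaml
# Sketch: an eksctl configuration for the management cluster.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mgmt          # illustrative cluster name
  region: eu-west-1   # illustrative region
  version: "1.25"
managedNodeGroups:
  - name: mgmt-nodes
    instanceType: m5.large
    desiredCapacity: 3
```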

Crossplane is used to manage the lifecycles of the workload clusters and of other cloud resources, such as DynamoDB tables, RDS databases, and Amazon SQS queues, that may be needed by applications deployed to these workload clusters. Crossplane Compositions are used to enable platform teams to provide opinionated definitions of cloud resources, including clusters. The gitops-system repo (i.e., the platform repo) contains the various artifacts related to Crossplane and its dependencies, which are synced to the management cluster by the Flux controller running on it.
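
With a composition in place, creating a cluster reduces to committing a short claim to the gitops-system repo. The sketch below is entirely hypothetical: the API group, kind, and parameters are defined by the platform team’s CompositeResourceDefinition, not by Crossplane itself.

```yaml
# Sketch: a claim against a platform-defined composition for an EKS cluster.
apiVersion: mycompany.example.com/v1alpha1  # hypothetical API group from the platform team's XRD
kind: EKSCluster                            # hypothetical claim kind exposed by the composition
metadata:
  name: workload-cluster-1
  namespace: flux-system
spec:
  parameters:
    region: eu-west-1
    version: "1.25"
    nodeCount: 3  # illustrative parameter surfaced by the composition
```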

After the workload clusters are provisioned, Flux needs to be installed on them. Flux in the management cluster is also used for deploying and bootstrapping Flux in the workload clusters. Once Flux is up and running in a workload cluster, it deploys Crossplane and any other tools needed on that cluster, as defined in the gitops-system repo.
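
One way to express this is through the kubeConfig field of a Flux Kustomization on the management cluster, which points at a Secret holding the workload cluster’s kubeconfig (assumed here to be written when the cluster is provisioned). A sketch, with placeholder names and paths:

```yaml
# Sketch: the management cluster's Flux applies Flux's own manifests to a remote workload cluster.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: workload-cluster-1-flux
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: gitops-system
  path: ./clusters/workload-cluster-1/flux-system  # placeholder path holding Flux manifests
  prune: true
  kubeConfig:
    secretRef:
      name: workload-cluster-1-kubeconfig  # assumed Secret containing the cluster's kubeconfig
```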

Flux on the workload cluster is responsible for reconciling the application manifests to the workload cluster. Those manifests contain native Kubernetes resources like Deployment, Service, ConfigMap, and Secret, and may also contain Crossplane resources for the cloud resources that the application depends on. Crossplane on the workload cluster manages the lifecycles of the cloud resources needed by the applications: it watches Crossplane resources applied to the workload cluster and invokes the corresponding AWS APIs to provision these resources.
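
For instance, an application repo might pair a Deployment with a Crossplane resource for the DynamoDB table the application depends on. The sketch below assumes the community AWS provider; all names are placeholders.

```yaml
# Sketch: application manifests mixing a Crossplane resource with a native Deployment.
apiVersion: dynamodb.aws.crossplane.io/v1alpha1  # version depends on the provider release
kind: Table
metadata:
  name: orders-table
spec:
  forProvider:
    region: eu-west-1
    attributeDefinitions:
      - attributeName: id
        attributeType: S
    keySchema:
      - attributeName: id
        keyType: HASH
    billingMode: PAY_PER_REQUEST
  providerConfigRef:
    name: aws-provider-config  # assumed ProviderConfig
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: example.com/orders-api:1.0.0  # placeholder image
          env:
            - name: TABLE_NAME
              value: orders-table  # the table managed by Crossplane above
```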

As depicted in the diagram above, Flux and Crossplane are installed on the management cluster and on each of the workload clusters (i.e., decentralized). This approach was chosen over a centralized deployment of Flux and Crossplane in the management cluster for the entire landscape. The rationale behind this is:

  1. Scalability — a separate instance of Flux and Crossplane on each cluster is a more scalable deployment model compared to a single instance that serves the entire landscape
  2. Reducing the blast radius — with the decentralized approach, management cluster unavailability impairs platform management activities (e.g., provisioning additional clusters, upgrading existing clusters, etc.), but doesn’t impact the availability of the workload clusters or the application deployment activities. With the centralized approach, on the other hand, management cluster unavailability impairs platform management activities and application deployment activities as well
  3. Aligning with separation of concerns and singular responsibility principles

The steps involved in provisioning and bootstrapping the management cluster and the workload clusters are explained in detail in Part 2. The steps involved in onboarding an application into a workload cluster are explained in detail in Part 3.

Git structure

The solution depicted in Figure 2 contains multiple repos: gitops-system, gitops-workloads, and one or more application repos.

The gitops-system repo contains the manifests that describe all the workload clusters. This repo is owned by the platform team, who use it for managing the workload clusters. This includes creating new clusters, updating existing clusters (e.g., upgrading the Kubernetes version from 1.24 to 1.25), or even deleting a cluster. It also allows for a self-service model. The repo has a template folder for workload clusters that contains the manifests for a template cluster aligned with the organization’s standards. When an application team needs a new cluster, they copy the template folder, rename it, replace a placeholder with the cluster name, make the necessary changes (e.g., changing the number of worker nodes, specifying the instance type they want to use, etc.), and then submit a pull request (PR) that contains the folder for the new cluster. The platform team reviews the PR. Once it’s approved and merged, the process described above kicks in, and the workload cluster is provisioned.
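
An illustrative layout of the gitops-system repo (the folder names below are placeholders; see the accompanying GitHub repository for the actual structure):

```
gitops-system/
├── clusters/
│   ├── mgmt/                # manifests for the management cluster
│   ├── cluster-template/    # template copied to create a new workload cluster
│   └── workload-cluster-1/  # one folder per workload cluster
└── ...
```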

Separate repos are assumed for the various applications in the organization. These repos contain the manifests for application deployment. The manifests can be a mixture of Kubernetes resources (e.g., Deployment, Service, ConfigMap, etc.) and Crossplane custom resources (e.g., resources for DynamoDB tables, Amazon SQS queues, etc.). Application teams are the owners of the application repos. Hence, they maintain full control over deployment into the workload clusters and aren’t dependent on the platform team.

The gitops-workloads repo is the one that maps applications to workload clusters: it contains the manifests that instruct Flux in the workload cluster to reconcile a specific folder in the application repo (i.e., onboard the application into the cluster). The gitops-workloads repo is owned by the governance team. The repo contains a template folder for onboarding an application into a workload cluster. When an application team needs to deploy their application into a workload cluster, they make a copy of the template folder under a folder dedicated to the target cluster, rename it, replace the placeholders, make the necessary changes (e.g., update the application repo URL, provide the credentials used by Flux to connect to the application repo, etc.), and then submit a pull request (PR) that contains the folder for the new application. The governance team reviews the PR. Once it’s approved and merged, Flux in the target cluster starts reconciling the application repo. From this point onwards, changes in the application repo are automatically reconciled into the target cluster.
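
Concretely, the copied template might expand into a GitRepository and Kustomization pair like the sketch below; the repo URL, path, and Secret name are placeholders. The Secret holds the Git credentials Flux uses to read the application repo.

```yaml
# Sketch: onboarding manifests that point Flux in a workload cluster at an application repo.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: app-1
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example-org/app-1  # placeholder application repo
  ref:
    branch: main
  secretRef:
    name: app-1-repo-creds  # Git credentials for the application repo
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app-1
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: app-1
  path: ./deploy  # placeholder folder inside the application repo
  prune: true
```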

Source code

The implementation of the solution outlined in this blog series is available in this GitHub repository.

Conclusion

In this post, we showed how you can combine Flux and Crossplane to build a GitOps solution that supports the creation and management of multiple workload clusters, the deployment of applications, and the provisioning of any additional cloud resources required by those applications.

The solution uses a central management cluster to provision and bootstrap new clusters, while at the same time addressing scalability and availability concerns by deploying a separate instance of Flux and Crossplane to each workload cluster.

In the next post, we show how you can provision and manage a fleet of Amazon EKS clusters using Flux and Crossplane.

Islam Mahgoub

Islam Mahgoub is a Solutions Architect at AWS with 15 years of experience in application, integration, and technology architecture. At AWS, he helps customers build new cloud native solutions and modernise their legacy applications leveraging AWS services. Outside of work, Islam enjoys walking, watching movies, and listening to music.

Mike Rizzo

Mike Rizzo is an AWS Principal Solutions Architect based in London UK, working with telecoms customers on media solutions. He has a keen interest in microservices architectures, and how containers and serverless technology can be used to realize these in the cloud. In his spare time, you will find him running and cycling around the Suffolk countryside, cooking Maltese food, and playing Fortnite!

Nicholas Thomson

Nicholas Thomson is a Software Development Engineer working on Kubernetes at AWS. He helps build the open-source AWS Controllers for K8s (ACK) project. In his free time, he enjoys snowboarding and attending pub trivia nights.

Sheetal Joshi

Sheetal Joshi is a Principal Developer Advocate on the Amazon EKS team. Sheetal worked for several software vendors before joining AWS, including HP, McAfee, Cisco, Riverbed, and Moogsoft. For about 20 years, she has specialized in building enterprise-scale, distributed software systems, virtualization technologies, and cloud architectures. At the moment, she is working on making it easier to get started with, adopt, and run Kubernetes clusters in the cloud, on-premises, and at the edge.