AWS Cloud Operations Blog
Migrate On-Premises Multi-Tenant Systems to Amazon Elastic Kubernetes Service
Managing the deployment of containers in a multi-tenant environment presents a number of new challenges for many of my customers. Some organizations have explored building and managing their own Kubernetes container orchestration environment, but the management challenges lead them to evaluate Amazon Elastic Kubernetes Service (Amazon EKS).
In particular, Independent Software Vendors (ISVs) are using a service such as Amazon EKS to focus on the value proposition of their solution rather than the infrastructure required to support it. With Amazon EKS, ISVs no longer need to build and operate multi-tenant systems in their own data centers, where dedicated virtual machines were typically assigned per tenant. The ISV may have chosen a combination of Commercial Off the Shelf (COTS) software, open-source software, and in-house developed software to create the business logic of a multi-tenant system. As an architect driving the evolution of these ISVs, you may have focused on the transition to Docker containers to improve the ISV’s service delivery, resource optimization, and service agility. In this blog, you will learn:
- Methods for migrating multi-tenant on-premises systems to Amazon EKS
- Key decision criteria for migrating systems to AWS services
- The tenant segmentation models available in Amazon EKS
- A recommended target architecture
One approach that a company may choose for optimization is to leverage Docker containers for the web code and business logic when providing multi-tenant services. The container deployment is built from source managed in a repository such as Git, published to a container image registry, and orchestrated with a container orchestration system such as Kubernetes. Amazon Elastic Kubernetes Service and Amazon Elastic Container Registry were designed to offload this management burden from the ISV, empowering them to operate much more efficiently and to innovate with new service capabilities for their tenants.
But migrating the containers to AWS is only part of the story. For each tenant in the multi-tenant on-premises system, there is a set of dedicated subsystems. A typical tenant deployment consists of the following subsystems:
- Virtual firewall context
- Virtual load balancing context
- Web application and business logic systems
- Storage systems
- Databases
Tenant Migration Methods
As the architect, you must analyze each subsystem dedicated to a tenant to determine the appropriate migration tool, method, and target infrastructure. We assume that the ISV has already determined that the web servers and business logic software components will be containerized. These containers will still need access to external resources, such as databases, storage resources, and identity systems. In addition, the containers must be externally accessible by the tenants. Next, the ISV’s architect will look at the migration of the subsystems in each layer to identify the target solution for that functional component. The architect might have an end-state goal of completely re-architecting the solution into a multi-tenant Software as a Service (SaaS) solution. This blog highlights the first incremental step that enables that transition and positions the ISV to evolve toward a complete SaaS offering.
Database Migration
The database system may be migrated to an equivalent system built on Amazon Relational Database Service (Amazon RDS) using the AWS Database Migration Service (AWS DMS). A key decision must be made regarding resource sharing. Each tenant may be assigned a dedicated Amazon RDS instance, where the application and business logic are configured with a database endpoint URL that is unique to each tenant system. The name of the database schema may be replicated for each tenant because the servers are independent. The application connection attributes for a MySQL database may be characterized as follows. Note that a unique Amazon RDS endpoint is identified per tenant, but the database schema name is the same.
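The following is an illustrative sketch only; the endpoints and schema name are placeholders, not values from an actual deployment:

```yaml
# Dedicated Amazon RDS instance per tenant (placeholder endpoints); the schema name is identical
tenant-a:
  DB_HOST: tenant-a.abc123example.us-east-1.rds.amazonaws.com
  DB_PORT: "3306"
  DB_NAME: appdb
tenant-b:
  DB_HOST: tenant-b.def456example.us-east-1.rds.amazonaws.com
  DB_PORT: "3306"
  DB_NAME: appdb
```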
This approach offers minimal resource efficiency, as each tenant has its own database instance and set of backups.
Alternatively, a shared Amazon RDS instance may be used where the application’s connection to the database server URL is modified by changing the path attribute, while the Amazon RDS endpoint remains the same for each tenant. The application connection attributes for a MySQL database may be characterized as follows. Note that a shared Amazon RDS endpoint is identified while the database schema name is unique per tenant.
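Again as an illustrative sketch (the endpoint and schema names are placeholders):

```yaml
# Shared Amazon RDS instance (placeholder endpoint); each tenant receives its own schema
shared:
  DB_HOST: shared-mysql.abc123example.us-east-1.rds.amazonaws.com
  DB_PORT: "3306"
tenant-a:
  DB_NAME: tenant_a_appdb
tenant-b:
  DB_NAME: tenant_b_appdb
```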
This method optimizes resources, because a single Amazon RDS instance may serve many tenants, each with a unique database. Of course, backups of the shared Amazon RDS instance are aggregated across all tenants, so recovering a single tenant’s database is more complicated.
In both cases, the server endpoint and database name should be injected into the container via tenant-specific attributes derived from variables populated during the deployment process. This approach should only be used if the organization supports a Continuous Integration/Continuous Deployment (CI/CD) environment where the per-tenant database connection endpoint and path are automatically configured from a template and applied to the application and business software components.
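As a sketch of how that injection might look (the namespace, image URI, and endpoint are placeholders), the pipeline could render a per-tenant ConfigMap and reference it from the Deployment:

```yaml
# Rendered per tenant by the CI/CD pipeline from a shared template
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-connection
  namespace: tenant-a            # placeholder tenant namespace
data:
  DB_HOST: tenant-a.abc123example.us-east-1.rds.amazonaws.com   # placeholder endpoint
  DB_NAME: appdb
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: tenant-a
spec:
  replicas: 2
  selector:
    matchLabels: { app: app }
  template:
    metadata:
      labels: { app: app }
    spec:
      containers:
        - name: app
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:1.0.0   # placeholder image
          envFrom:
            - configMapRef:
                name: db-connection    # injects DB_HOST and DB_NAME into the container
```

Database credentials would normally be injected from a Kubernetes Secret or retrieved from AWS Secrets Manager rather than stored in a ConfigMap.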
The third permutation includes a shared Amazon RDS instance as well as a shared database schema, which requires additional code updates in the application and business software components. For example, the shared database might have dedicated tables or rows per tenant, requiring more sophisticated changes to the application and business logic software components. Each software component must accommodate multiple tenant contexts in all of its transactions. This method is aligned with a SaaS deployment model and is outside of the scope of this document.
Storage Migration
The storage system chosen will depend on the application requirements. Storage may be evaluated in two contexts: ephemeral storage and persistent storage. Containers deployed in Kubernetes pods use ephemeral storage by default with the assumption that the code operating in the container is stateless.
A container is generally ephemeral in nature, meaning it exists temporarily and the data within the container may be destroyed with no repercussions to the system (e.g., local cache). Ephemeral storage is provisioned from either the compute node’s memory or from the node’s locally attached disk. When the container is destroyed, the data is lost as well. The node’s local storage resources can be presented through a Kubernetes Container Storage Interface (CSI) driver, such as the Amazon Elastic Block Store (Amazon EBS) CSI driver. There is no need to migrate this temporary tenant data.
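As a brief illustration (the image is a placeholder), a pod can request node-local scratch space through an emptyDir volume, which disappears along with the pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-example
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/nginx:stable   # placeholder image
      volumeMounts:
        - name: scratch
          mountPath: /var/cache/app
  volumes:
    - name: scratch
      emptyDir: {}                     # node-local scratch space, lost when the pod is removed
      # emptyDir: { medium: Memory }   # alternative: back the scratch space with node memory
```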
In many cases, the code requires or creates some stateful information that must persist outside of the pod. Persistent storage is enabled through the Kubernetes CSI driver using various plugins. An evaluation of the application and business software components will determine the type of storage appropriate for the container.
The requirement to maintain persistent data outside of the node on which the software runs often leads to storage cluster systems. On-premises persistent storage may use locally attached disks, Network Attached Storage (NAS), or Storage Area Networks (SAN). The application and business software components are written in a manner that attaches to the persistent storage system, and this data must be replicated to an appropriate storage service in AWS. The migration of the tenant data may leverage the AWS DataSync service to move data from the on-premises system to persistent data in the appropriate AWS storage service.
Application or business software components written in Microsoft .NET Framework may use Windows File Share methods based on Server Message Block (SMB). The application may be migrated to .NET Core containers while retaining the SMB file attachment methods to minimize code changes. The appropriate cluster storage method would be Amazon FSx for Windows File Server. The Amazon FSx CSI driver provides the necessary mechanism for the container to mount the FSx for Windows File Server storage using the SMB protocol.
Alternatively, an application may leverage a native Linux-based network-attached storage solution based on Network File System (NFS) protocols. In this case, the appropriate storage cluster system would be Amazon Elastic File System (Amazon EFS), where the containers can attach to the cluster using NFS mount points. The Amazon EFS CSI driver provides the mechanism for the container to mount the EFS file system using NFS protocols.
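A minimal sketch of statically provisioning such a volume with the Amazon EFS CSI driver follows; the file system ID and namespace are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tenant-a-efs-pv
spec:
  capacity:
    storage: 5Gi                       # required by the API, not enforced by EFS
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # placeholder EFS file system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tenant-a-efs-claim
  namespace: tenant-a                   # placeholder tenant namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
```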
Application and Business Software Migration
The application and business software components will be refactored into containers so that they may execute in a container orchestration environment, such as Kubernetes. The migration of the software components can be accomplished by leveraging a CI/CD pipeline to publish the container images in Amazon Elastic Container Registry (Amazon ECR). The Kubernetes environment pulls the images from Amazon ECR to create a tenant-specific representation of the system.
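As one illustration of that pipeline stage, assuming AWS CodeBuild is used for the build step (the blog does not prescribe a specific CI/CD tool, and the account ID and region are placeholders), a buildspec might build and push the image to Amazon ECR as follows:

```yaml
version: 0.2
phases:
  pre_build:
    commands:
      # Authenticate the Docker client to the ECR registry (placeholder account and region)
      - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
  build:
    commands:
      - docker build -t app:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker tag app:$CODEBUILD_RESOLVED_SOURCE_VERSION 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:$CODEBUILD_RESOLVED_SOURCE_VERSION
  post_build:
    commands:
      - docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:$CODEBUILD_RESOLVED_SOURCE_VERSION
```

The deployment stage would then update the tenant-specific manifests or templated values to reference the new image tag.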
The deployment of unique container instances of the images allows the ISV to specifically identify adjacent resources in other layers of the system that are dedicated to the tenant. This includes the database layer, storage layer, and the routing logic in the Domain Name System and load balancing systems.
Kubernetes provides the means of orchestrating the deployment of the containers. The Amazon Elastic Kubernetes Service (EKS) automates the deployment of Kubernetes and simplifies the management of resources available to execute the containers. Tenant contract requirements often guide the choice of the target architecture for the Kubernetes environment.
A multi-tenant architecture may leverage one of several architectural models: Kubernetes Cluster per Tenant, Kubernetes Node Group per Tenant, and Kubernetes Namespace per Tenant.
Note: A complete refactoring of the system components may enable the transition to a SaaS model deployed on AWS. SaaS requires refactoring all of the code into a shared, multi-tenant resource model. The operator may leverage programs such as AWS SaaS Factory; however, SaaS Factory assumes that the code has been fully transformed to preserve both tenant and user context. The SaaS Factory multi-tenant model may be the target end state, but it requires completely refactored code.
Each of these segmentation models has architectural trade-offs regarding availability and scalability, optimization of resources, segmentation and security, and operational complexity.
Cluster Segmentation
The deployment of a Kubernetes cluster per tenant provides dedicated and secured resource partitions where privileged containers can run without impacting other tenants. The system exposes a dedicated Kubernetes API for provisioning resources within the tenant subsystems. The drawbacks to this method are the complexity of managing multiple Kubernetes control planes, minimal efficiencies derived from dedicated resources, duplication of operator responsibilities, and fragmented visibility. This model should only be used for tenant systems that require strict security segmentation and where resource optimization is not a priority.
Node Group Segmentation
The second option is to deploy a shared Kubernetes cluster with dedicated node groups per tenant. This model exposes a shared Kubernetes API for operations that must be protected, as multiple tenants can be affected by a single API call. Security is imposed at the node level, which minimizes the impact of one tenant interfering with another; therefore, privileged containers may be allowed. This model still doesn’t provide significant resource optimization, as nodes are dedicated to each tenant while the control plane is shared. Inherent reachability between the nodes within the same VPC must be addressed through Kubernetes security mechanisms and AWS security mechanisms, such as Security Groups.
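A sketch of how a tenant’s workload might be pinned to its dedicated node group, assuming the node group was created with a tenant label and taint (the label, taint, and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-tenant-a
spec:
  replicas: 2
  selector:
    matchLabels: { app: app, tenant: tenant-a }
  template:
    metadata:
      labels: { app: app, tenant: tenant-a }
    spec:
      nodeSelector:
        tenant: tenant-a              # label applied to the tenant's dedicated node group
      tolerations:
        - key: tenant                 # taint applied to the tenant's dedicated node group
          operator: Equal
          value: tenant-a
          effect: NoSchedule
      containers:
        - name: app
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:1.0.0   # placeholder image
```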
Namespace Segmentation
The third option is to deploy a shared Kubernetes cluster with shared node groups while segmenting the tenant workloads using Kubernetes namespaces. This model also exposes a shared Kubernetes API for operations that must be protected. The security is enforced at the namespace and pod level as opposed to the node level. With the possibility of one tenant’s functions interfering with another, the operator must apply resource consumption limits such as CPU and memory constraints. Well-defined resource constraints allow significant resource optimization where nodes are shared by multiple tenants. This option does eliminate the possibility of using privileged containers. In addition, there is inherent reachability between the containers of different tenants. Reachability is constrained between containers in different namespaces by enforcing traffic filters using network policies applied through the Kubernetes Container Network Interface (CNI). A critical requirement is the ability to instantiate a network policy that restricts a tenant’s container communication to only those pods in the same namespace or pods in a shared service. The Amazon VPC Container Network Interface (CNI) plugin for Kubernetes allows AWS Security Groups to be applied to individual pods. Well-defined ingress and egress rules are applied to specific pods to limit the exposure of ports and reachability. Various third-party CNI plugins exist that provide different capabilities. However, the architect must evaluate the compatibility of the plugin with Amazon EKS and the plugin’s ability to meet the multi-tenant segmentation requirements.
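The following sketch illustrates both controls for a hypothetical tenant namespace: a network policy that limits ingress to pods in the same namespace, and a resource quota that caps the namespace’s CPU and memory consumption. The names and limits are placeholders, and the network policy only takes effect if the cluster’s CNI or policy engine enforces NetworkPolicy objects.

```yaml
# Restrict ingress for tenant-a pods to traffic originating from pods in the same namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: tenant-a
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}         # only pods within this same namespace
---
# Cap the compute that a single tenant's namespace can consume
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```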
The namespace segmentation model is the most efficient method of building a multi-tenant service in Amazon EKS. However, the operator must build a CI/CD pipeline, configure per-tenant version control, and enforce Kubernetes-specific mechanisms to isolate one tenant from another and deny access to external attackers.
The migration of a tenant’s on-premises deployment can be handled using the code pipeline. The DevOps team instantiates the tenant-specific mechanisms from templates associated with the main trunk of the code pipeline. When a tenant instance is created, a branch is taken from the templates associated with the main trunk and the templates are populated with the tenant-specific security attributes. This lets each tenant have a unique set of resources and security attributes using a common set of code and templates. In all likelihood, each tenant will have a unique set of application versions deployed on premises. Version drift is common in siloed multi-tenant systems: each tenant is built using the latest set of software components available at the time, but those components are rarely updated afterward. A key objective in managing a multi-tenant system is to simplify code management across tenants. The use of Amazon Elastic Container Registry and a per-tenant branch of templates allows unique tenant deployments while automating the version control of software components across multiple tenants.
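One way to picture the per-tenant branch, assuming a Helm-style values file rendered from the main-trunk template (all attribute names and values below are illustrative):

```yaml
# values-tenant-a.yaml -- rendered from the main-trunk template when the tenant branch is created
tenant:
  name: tenant-a
  namespace: tenant-a
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app   # placeholder registry
  tag: "1.4.2"                                                   # pinned per tenant, updated by the pipeline
database:
  host: tenant-a.abc123example.us-east-1.rds.amazonaws.com       # placeholder endpoint
  name: appdb
ingress:
  host: tenant-a.example.com                                     # placeholder tenant hostname
```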
Load Balancer
The Amazon EKS architecture enables the deployment of highly available containers across multiple Availability Zones. Kubernetes provides a means of routing incoming traffic flows to any number of front-end systems using ingress controllers. Many different ingress controllers are available for routing traffic to a specific container, but Transport Layer Security (TLS) termination is one of the most critical capabilities provided by an ingress controller. Most ingress controllers serve as a highly available proxy, but they can become a traffic bottleneck for bandwidth-intensive applications. The AWS Load Balancer Controller serves as a Kubernetes ingress controller that manages traffic routing on an Elastic Load Balancer (ELB), where traffic flows are directed to container pods without traversing the ingress controller. The ingress controller automatically updates the Target Groups on the load balancer as pods are created and destroyed.
The AWS Load Balancer Controller may be deployed in conjunction with two types of Elastic Load Balancer: the Application Load Balancer (ALB) or the Network Load Balancer (NLB). The NLB routes traffic based on Layer 4 attributes (source IP, destination IP, protocol, source port, destination port), while the ALB can inspect HTTP headers to apply routing rules.
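A sketch of a per-tenant Ingress managed by the AWS Load Balancer Controller follows; the hostname, certificate ARN, and Service name are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tenant-a-ingress
  namespace: tenant-a
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip            # register pod IPs directly in the target group
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/placeholder
spec:
  ingressClassName: alb
  rules:
    - host: tenant-a.example.com                          # placeholder tenant hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app                                 # placeholder front-end Service
                port:
                  number: 80
```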
The transition to the Amazon EKS tenant instance must be coordinated with the tenant. Updates to Domain Name System (DNS) records may take up to 48 hours to propagate across the internet. Therefore, the use of a load balancer allows for more control over the timing of the cut-over. The multi-tenant operator may continue to route traffic via the load balancer to a tenant’s existing IP destination on premises until all parties are ready to cut over to the containerized front-end servers associated with a tenant’s deployment. The customer may leverage Amazon Route 53 to gracefully transition a single tenant’s flows from on premises to cloud instances in a coordinated manner.
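One way to picture such a gradual cut-over, as a sketch using weighted Route 53 records in AWS CloudFormation (the hosted zone, hostnames, and targets are placeholders):

```yaml
Resources:
  TenantAOnPremRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: tenant-a.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: on-premises
      Weight: 90                          # most traffic stays on premises at the start
      ResourceRecords:
        - onprem-vip.example.com          # placeholder on-premises endpoint
  TenantAEksRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: tenant-a.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: amazon-eks
      Weight: 10                          # shift weight toward the ALB as the cut-over proceeds
      ResourceRecords:
        - tenant-a-alb-1234567890.us-east-1.elb.amazonaws.com   # placeholder ALB DNS name
```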
Firewall
The on-premises multi-tenant systems are often protected by a shared firewall system with a virtual context per tenant. Each context is bound to the respective tenant’s load balancer. The security rules applied to each context are typically the same because the service types are replicated across each tenant’s system. The customer has several options for migrating the firewall security rules to Amazon EKS. The first option is to use a virtual appliance or firewall container instantiated from the AWS Marketplace. A critical factor to consider is the ability to associate the firewall instance with one or more ALBs as described in the previous section. The recommendation is to create and associate a unique ALB or NLB per tenant. Therefore, the firewall service must be able to route traffic to the appropriate load balancer. AWS offers AWS WAF, a web application firewall, for tenant services that need inspection of HTTP(S) transactions. The AWS WAF web ACL is associated with the load balancer routing traffic to the tenant’s service instance.
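If the tenant’s ALB is managed by the AWS Load Balancer Controller, one way to make that association, sketched here with a placeholder web ACL ARN, is to add the WAFv2 annotation to the tenant’s Ingress shown earlier:

```yaml
# Added to the tenant's Ingress metadata to attach an AWS WAF web ACL to its ALB
metadata:
  annotations:
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:123456789012:regional/webacl/tenant-a-acl/11111111-2222-3333-4444-555555555555
```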
Conclusion
I showed you how customers (including ISVs) can use Amazon EKS to enable multi-tenant systems using containers to simplify their operation, optimize cost, improve segmentation and security, and increase availability and scalability. I covered how the transition of a single tenant must be conducted in a wave where all of the resources associated with a tenant are migrated at roughly the same time. The goal is to perform a graceful cut-over from on premises to the AWS Cloud with negligible downtime, using a highly automated transition model at minimal cost. The service requirements will likely dictate the segmentation approach used when containers are transitioned to Amazon EKS, but one of the objectives is to leverage the stateless nature of containers, allowing elasticity in supply to meet each tenant’s demand. Amazon EKS optimization techniques may save up to 80% of the compute infrastructure costs while enabling management automation. The results lead to an improved Total Cost of Ownership and better service for your tenants.