How webMethods iPaaS built a multi-tenant SaaS platform on Amazon EKS
This post was authored by Markus Kokott, Senior Solutions Architect, AWS and co-written with Balaji Balakrishnan, Head of Platform Services & DevOps, Santa Kumar Bethanapalli, Head of Cloud Operations & SRE, and Natarajan Ramani, Lead Platform Engineer, from webMethods iPaaS.
Introduction
In this post, we discuss webMethods’ journey in transitioning webMethods iPaaS into a successful Software as a Service (SaaS) offering on AWS, the challenges we faced, and how we overcame them. We provide insights into the AWS architecture and design approach.
webMethods iPaaS, part of the IBM portfolio, enables businesses around the globe to remove connectivity barriers, aggregate data, and thrive through its one-stop-shop Integration Platform as a Service (iPaaS) solution. The SaaS offering consists of the following products:
- Application Integration: Connecting cloud, on-premises, and hybrid applications
- Data Integration: Repeatable and adaptive data pipelines
- API Management: Clear and secure management of APIs throughout their lifecycle
- B2B: Seamless integration between business partners, suppliers, and customers
- Events: Real-time connectivity for event-driven architectures
Why webMethods iPaaS on Amazon Elastic Kubernetes Service (Amazon EKS)?
The product portfolio of webMethods dates back more than 20 years and was initially developed to run in our customers’ datacenters, operated by their own IT teams. When we designed our initial webMethods-based SaaS offering, we needed to solve problems that you typically don’t see in siloed deployments.
Customers are spread globally and expect low latency and high operational efficiency from SaaS. The global infrastructure of AWS and its consistent APIs allowed us to create Infrastructure-as-Code (IaC) for quick tenant onboarding and efficient multi-site operations, offering webMethods close to our customers’ locations.
We decided early to build the platform for our SaaS offering on Kubernetes. The vast Cloud Native Computing Foundation (CNCF) landscape allowed us to build a tailored platform for our offering, choosing from a wide variety of tools and products. Another important reason for choosing Kubernetes was its concept of namespaces, which is a common building block for tenant isolation in multi-tenant SaaS products. However, we wanted to avoid the burden of managing such a complex platform. Therefore, we chose Amazon EKS, a certified Kubernetes-conformant managed service on AWS.
Even though the managed Kubernetes control plane provided by Amazon EKS is a huge benefit for our operations team, there are other aspects of Amazon EKS that make operating webMethods even easier. For example, we rely heavily on managed node groups (MNGs) to reduce our maintenance efforts and the complexity of scaling our microservices-based product. By using node groups, we optimized the allocation of Amazon Elastic Compute Cloud (Amazon EC2) instance types based on workload requirements. We reduced our costs by approximately 10%, while our overall performance increased.
Architecting webMethods for scalability and high availability
webMethods is powered by a diverse array of applications and services operating behind the scenes. These approximately 50 applications encompass both stateless and stateful services, with varying release cadences ranging from rapid to extended cycles. Additionally, there is a distinction between shared services accessible to all customers and dedicated, customer-specific services. This heterogeneity in the service landscape necessitates accommodating different requirements. For example, certain zonal services rely on specific resources within an Availability Zone (AZ), while others adhere to distinct release policies. Moreover, some services demand maintenance windows aligned with customers’ schedules, and others have stringent data security requirements.
The implementation of multiple MNGs facilitated the creation, scaling, and maintenance of diverse node groups, each tailored to the specific deployment constraints of our services. We capitalized on the powerful scheduling capabilities of Amazon EKS to assign pods to nodes based on a comprehensive set of rules. Among the crucial services we operate is Elasticsearch, which mandates a highly available cluster configuration. To meet this requirement, we needed to distribute Elasticsearch pods across different nodes and AZs, ensuring resiliency and redundancy.
We accomplished this by employing pod affinity rules.
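The following is a simplified sketch of the relevant affinity section of the Elasticsearch pod spec; the label selector and the node group name follow the description in the next paragraphs, and the actual manifest may differ in detail:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # Prefer nodes that are not already running an Elasticsearch pod
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: elasticsearch
          topologyKey: kubernetes.io/hostname
      # Prefer AZs that are not already running an Elasticsearch pod
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: elasticsearch
          topologyKey: topology.kubernetes.io/zone
  nodeAffinity:
    # Hard requirement: only nodes belonging to the "cluster-ng" managed node group
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/nodegroup
              operator: In
              values:
                - cluster-ng
```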
The first two rules define a soft pod anti-affinity preference: the scheduler prefers to place an Elasticsearch pod on a node that does not already run a pod labeled app.kubernetes.io/name=elasticsearch and whose AZ does not already contain such a pod on another node. Each of these preferences is weighted at 50. The third rule is a hard node affinity requirement, making sure that Elasticsearch pods are only scheduled on nodes from the “cluster-ng” node group.
You might wonder why we opted for soft pod anti-affinity instead of a hard requirement for Elasticsearch. With hard requirements, scaling an Elasticsearch cluster to four or more instances is not possible in an AWS Region with three AZs, because the number of instances would be limited by the number of available AZs. Soft pod anti-affinity allows the scheduler to select a node from an AZ that already hosts an Elasticsearch pod.
However, the scheduler still prioritizes nodes in AZs without existing Elasticsearch pods due to the combined weights: a node in an AZ without an Elasticsearch pod scores 50+50=100, a node in an AZ that already runs an Elasticsearch pod scores 50, and a node that itself already hosts an Elasticsearch pod scores zero.
This scheduling strategy ensures that Elasticsearch pods are distributed across nodes and AZs for high availability, while still allowing us to scale beyond the number of available AZs.
Another reason for us to define dedicated node groups is special, more expensive infrastructure nodes that we want to reserve for the pods that benefit from them. Services with high memory demand, for example, are scheduled on memory optimized instances with an 8:1 ratio of memory to CPU. Here we use a different approach to assign pods to nodes: taints and tolerations.
While affinity is used to attract pods to nodes, taints are the opposite: they allow nodes to repel pods. If a node is tainted, only pods tolerating this taint are scheduled on that node. We make use of the fact that MNGs can apply taints to their nodes automatically.
Memory optimized instances carry a dedicated taint in our cluster.
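For illustration, assume a taint with the placeholder key workload-type; the actual key and value are specific to our platform. On a memory optimized node, such a taint would look like this:

```yaml
# Illustrative taint on memory optimized nodes (key and value are placeholders);
# the managed node group applies this taint to its nodes automatically.
spec:
  taints:
    - key: workload-type
      value: memory-optimized
      effect: NoSchedule
```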
If a pod needs to be scheduled on such an instance, it must add a matching toleration to its pod spec.
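Continuing the placeholder example from above, the toleration would look like this:

```yaml
# Matching toleration in the pod spec of memory-hungry services
# (key and value are the same placeholders as in the taint above)
tolerations:
  - key: workload-type
    operator: Equal
    value: memory-optimized
    effect: NoSchedule
```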
Now that we have discussed how services are scheduled, let’s look at how traffic is routed.
As shown in Figure 1, we built an NGINX-based central layer fronting our platform for inbound traffic and use NAT Gateways for outbound traffic.
![Figure 1 - Initial high-level design of webMethods iPaaS as SaaS](https://d2908q01vomqb2.cloudfront.net/fe2ef495a1152561572949784c16bf23abb28057/2025/01/09/HLD-1.png)
Figure 1 – Initial high-level design of webMethods iPaaS as SaaS
Reflecting on our initial design
One big lesson we learned is that you’re never done building a platform. With the growing success of our SaaS offering, we identified challenges with our design.
With over 20 MNGs, each containing up to hundreds of nodes, we saw impacts on the scalability of the Cluster Autoscaler and increased complexity when upgrading nodes in the data plane.
Even though Kubernetes namespaces are a great start to building a multi-tenant SaaS application, they do not cover all use cases (for example, where regulations require physical isolation).
SaaS providers need to analyze their costs in multiple dimensions, for example, to identify cost optimization opportunities or evaluate whether their pricing model meets their cost structure. Although Kubecost already provided a lot of insights into our cost structure, we still found blind spots, especially related to costs of our central inbound and outbound networking infrastructure.
And then there was a very popular feature request from our customers: they wanted to access their SaaS tenant and integrate their systems in a private and secure way.
Next generation application integration platform
We started to rearchitect our SaaS platform with the decision to distribute our workload across multiple clusters. Although this decision meant that our operational complexity increased and some resource efficiency was sacrificed, we got vital benefits for a SaaS offering: better isolation and a reduced impact surface.
We separated our workload into shared services used by multiple tenants and services specific to individual tenants. We deployed the latter into dedicated EKS clusters operated in their own virtual private clouds (VPCs), and shared services into multi-tenant clusters. This helped us achieve strong isolation on the control plane, data plane, and networking levels.
Certain services require communication with each other. Therefore, we designed a hub and spoke networking architecture. Tenant-specific services deployed to a customer-spoke VPC are allowed to communicate with other tenant-specific services within their own VPC as well as with shared services deployed to the hub VPC. No communication between spoke VPCs is allowed, effectively isolating each tenant’s compute and networking resources.
We moved the NGINX routing layer into our EKS clusters for increased maintainability of the proxy and its configuration, and extended it for more sophisticated management of service-to-service communication. With this, we can use a single Network Load Balancer (NLB), provisioned through the AWS Load Balancer Controller for Kubernetes, to expose the services of an EKS cluster.
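As a sketch, the Kubernetes Service for the NGINX routing layer can be annotated so that the AWS Load Balancer Controller provisions an internal NLB; the name, namespace, selector, and ports below are illustrative rather than our exact configuration:

```yaml
# Illustrative Service exposing the NGINX routing layer through an internal NLB,
# provisioned by the AWS Load Balancer Controller (names and ports are placeholders).
apiVersion: v1
kind: Service
metadata:
  name: nginx-routing-layer
  namespace: ingress
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: nginx
  ports:
    - name: https
      port: 443
      targetPort: 8443
```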
We use AWS PrivateLink to enable private connectivity between the EKS clusters in the hub VPC and each spoke VPC: the NGINX routing layer, exposed through an NLB in the shared services EKS cluster of the hub VPC, is advertised as a VPC Endpoint in the consumer’s spoke VPC. Traffic arriving at the VPC Endpoint in a spoke VPC is then encrypted and delivered over the AWS backbone to the NLB of the hub VPC. To allow communication to flow in the opposite direction, from shared services to tenant-specific services, we also expose the NLBs of the tenants’ spoke VPCs as VPC Endpoints in the hub VPC, as shown in Figure 2.
![Figure 2 - Hub and spoke VPCs connected with AWS PrivateLink](https://d2908q01vomqb2.cloudfront.net/fe2ef495a1152561572949784c16bf23abb28057/2025/01/09/Hub-Spoke-VPCs.png)
Figure 2 – Hub and spoke VPCs connected with AWS PrivateLink
This design change not only provided us with better isolation and visibility into our tenant-specific cost structure, but also prepared us for future growth. Onboarding a new tenant requires no changes to our existing infrastructure, and we do not have to plan our network design upfront, because AWS PrivateLink hides the underlying networking configuration and makes Classless Inter-Domain Routing (CIDR) range planning unnecessary.
Our infrastructure gets more complex as we add new customers to our platform. Setting up PrivateLink connections between environments manually is error-prone and doesn’t scale well. Therefore, we heavily invested in automating our onboarding and maintenance processes using Bitbucket pipelines for our continuous integration/continuous delivery (CI/CD) processes and Terraform for IaC.
Whenever a developer makes a change, an automated pipeline runs. This pipeline checks the quality of the change to identify potential problems. As part of this, it runs `terraform plan` to understand what infrastructure changes would happen if we deployed it.
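A simplified sketch of such a pipeline in bitbucket-pipelines.yml could look like the following; the image, step name, and commands are illustrative and omit details such as backend configuration and credentials:

```yaml
# bitbucket-pipelines.yml (simplified, illustrative)
image: hashicorp/terraform:1.5

pipelines:
  pull-requests:
    '**':
      - step:
          name: Validate and plan infrastructure changes
          script:
            - terraform init -input=false
            - terraform validate
            # Show what would change without applying anything
            - terraform plan -input=false -out=tfplan
          artifacts:
            - tfplan
```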
We have separate pipelines for each service in our product. Our central platform team builds templates that abstract away the complexity, so product teams edit configuration files in YAML format while the templates handle the underlying infrastructure details.
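For example, a tenant-level configuration edited by a product team could look roughly like this; the file layout and field names are hypothetical and only illustrate the split between team-facing configuration and platform-managed infrastructure:

```yaml
# Hypothetical tenant configuration consumed by the platform templates
tenant:
  name: example-tenant
  region: eu-central-1
  tier: dedicated          # dedicated spoke VPC and EKS cluster
  services:
    - application-integration
    - api-management
  nodeGroups:
    memoryOptimized:
      instanceType: r6i.2xlarge
      minSize: 2
      maxSize: 6
```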
A particularly interesting part of our infrastructure automation is the setup of PrivateLink communication between hub and spoke VPCs. Let’s first look at the overall process of establishing a PrivateLink connection:
- Service provider creates an NLB in the VPC of the service.
- Service provider creates a VPC Endpoint service and attaches it to the NLB.
- Service provider configures access control for the VPC Endpoint service: who can access the endpoint, and whether the service provider must approve each connection request.
- Service provider creates a private DNS name for the VPC Endpoint and verifies it in Amazon Route 53.
- Service consumer creates a VPC Endpoint specifying the service name: this is information that the service provider needs to provide to the service consumer.
- Service consumer must request access to the VPC Endpoint service, if the service provider configured approval for each request.
- Service provider validates and approves request from service consumer, if the service provider configured approval for each request.
- Service consumer configures applications to consume service provider’s service through the VPC Endpoint.
As you can see, setting up PrivateLink is not a single step: it requires creating resources across two separate VPCs, sharing mandatory information between the environments, and orchestrating the overall process. This is where Terraform helps us with two key resources:
- aws_vpc_endpoint_service: For creating the VPC Endpoint service in the service provider environment
- aws_vpc_endpoint: For creating the VPC Endpoint in the service consumer environment
The Terraform script that we execute in the service provider’s environment automates steps 1 through 4.
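A simplified sketch of the provider-side resources follows; resource names, variables, and the DNS domain are placeholders, and the NLB is referenced through a data source because it is provisioned by the AWS Load Balancer Controller:

```hcl
# Provider side (hub VPC) – sketch of steps 1 through 4.
# Variable references and names are illustrative placeholders.

# Step 1: reference the NLB fronting the NGINX routing layer
data "aws_lb" "routing_layer" {
  name = "nginx-routing-layer-nlb"
}

# Steps 2 and 3: create the VPC Endpoint service, attach it to the NLB,
# and restrict which principals may connect
resource "aws_vpc_endpoint_service" "routing_layer" {
  network_load_balancer_arns = [data.aws_lb.routing_layer.arn]

  # No manual approval needed between our hub and spoke VPCs
  acceptance_required = false

  # Only our own platform accounts may connect
  allowed_principals = var.consumer_account_arns

  # Step 4: request a private DNS name for the endpoint service
  private_dns_name = "hub.example.com"
}

# Step 4 (continued): TXT record in the public hosted zone of the
# private DNS domain, proving ownership of the name
resource "aws_route53_record" "private_dns_verification" {
  zone_id = var.public_zone_id
  name    = aws_vpc_endpoint_service.routing_layer.private_dns_name_configuration[0].name
  type    = "TXT"
  ttl     = 300
  records = [aws_vpc_endpoint_service.routing_layer.private_dns_name_configuration[0].value]
}
```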
As we don’t need request approval between the hub and spoke VPCs, we set `acceptance_required = false` on the aws_vpc_endpoint_service resource, which enables automation of the remaining steps in the service consumer’s environment.
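The relevant consumer-side resources look roughly like this; the VPC, subnet, CIDR, and service name references are illustrative variables supplied by our automation:

```hcl
# Consumer side (spoke VPC) – sketch of the remaining steps.
# The service name is shared by the provider environment.
resource "aws_vpc_endpoint" "hub_services" {
  vpc_id            = var.spoke_vpc_id
  service_name      = var.hub_endpoint_service_name   # from the provider's aws_vpc_endpoint_service
  vpc_endpoint_type = "Interface"

  subnet_ids         = var.spoke_private_subnet_ids
  security_group_ids = [aws_security_group.hub_services.id]

  # Resolve the provider's private DNS name to this endpoint inside the spoke VPC
  private_dns_enabled = true
}

# Security group allowing spoke workloads to reach the hub's NGINX routing layer
resource "aws_security_group" "hub_services" {
  name_prefix = "hub-services-endpoint-"
  vpc_id      = var.spoke_vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.spoke_vpc_cidr]
  }
}
```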