Containers

How webMethods iPaaS built a multi-tenant SaaS platform on Amazon EKS

This post was authored by Markus Kokott, Senior Solutions Architect, AWS and co-written with Balaji Balakrishnan, Head of Platform Services & DevOps, Santa Kumar Bethanapalli, Head of Cloud Operations & SRE, and Natarajan Ramani, Lead Platform Engineer, from webMethods iPaaS.

Introduction

In this post, we discuss webMethods’ journey in transitioning webMethods iPaaS into a successful Software as a Service (SaaS) offering on AWS, the challenges we faced, and how we overcame them. We provide insights into the AWS architecture and design approach.

webMethods iPaaS, part of the IBM portfolio, enables businesses globally to remove connectivity barriers, aggregate data, and thrive through its one-stop-shop Integration Platform as a Service (iPaaS) solution. The SaaS offering consists of the following products:

  • Application Integration: Connecting cloud, on-premises, and hybrid applications
  • Data Integration: Repeatable and adaptive data pipelines
  • API Management: Clear and secure management of APIs throughout their lifecycle
  • B2B: Seamless integration between business partners, suppliers, and customers
  • Events: Real-time connectivity for event-driven architectures

Why webMethods iPaaS on Amazon Elastic Kubernetes Service (Amazon EKS)?

The product portfolio of webMethods dates back more than 20 years and was initially developed to run in our customers’ datacenters, operated by their own IT teams. When we designed our initial webMethods-based SaaS offering, we needed to solve problems that you typically don’t see in siloed deployments.

Customers are spread globally and expect low latency and high operational efficiency from SaaS. The global infrastructure of AWS and its consistent API allowed us to create Infrastructure-as-Code (IaC) for quick tenant onboarding and efficient multi-site operations, offering webMethods close to our customers’ locations.

We decided early to build the platform for our SaaS offering on Kubernetes. The vast Cloud Native Computing Foundation (CNCF) landscape allowed us to build a tailored platform for our offering, choosing from a wide variety of tools and products. Another important reason for choosing Kubernetes was its concept of namespaces, which is a common building block for tenant isolation in multi-tenant SaaS products. However, we wanted to avoid the burden of managing such a complex platform. Therefore, we chose Amazon EKS, a certified Kubernetes-conformant managed service on AWS.
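
To illustrate the namespace-based building block, the following is a minimal sketch, assuming one namespace per tenant and a NetworkPolicy that only admits traffic from pods in that same namespace (the tenant name and policy are hypothetical, not our production configuration):

# Hypothetical example: one namespace per tenant, plus a NetworkPolicy that
# restricts ingress to pods in the same namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: tenant-a
spec:
  podSelector: {}        # applies to every pod in the tenant namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}    # only peers from the same namespace are allowed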

Even though the managed Kubernetes control plane provided by Amazon EKS is a huge benefit for our operations team, there are other aspects of Amazon EKS that make operating webMethods even easier. For example, we rely heavily on managed node groups (MNGs) to reduce our maintenance efforts and the complexity of scaling our microservices-based product. By using node groups, we optimized the allocation of Amazon Elastic Compute Cloud (Amazon EC2) instance types based on workload requirements. We reduced our costs by approximately 10%, while our overall performance increased.

Architecting webMethods for scalability and high availability

webMethods is powered by a diverse array of applications and services operating behind the scenes. These approximately 50 applications encompass both stateless and stateful services, with varying release cadences ranging from rapid to extended cycles. Additionally, there is a distinction between shared services accessible to all customers and dedicated, customer-specific services. This heterogeneity in the service landscape necessitates accommodating different requirements. For example, certain zonal services rely on specific resources within an Availability Zone (AZ), while others adhere to distinct release policies. Moreover, some services demand maintenance windows aligned with customers’ schedules, and others have stringent data security requirements.

The implementation of multiple MNGs facilitated the creation, scaling, and maintenance of diverse node groups, each tailored to the specific deployment constraints of our services. We capitalized on the powerful scheduling capabilities of Amazon EKS to assign pods to nodes based on a comprehensive set of rules. Among the crucial services we operate is Elasticsearch, which mandates a highly available cluster configuration. To meet this requirement, we needed to distribute Elasticsearch pods across different nodes and AZs, ensuring resiliency and redundancy.

We accomplished this by employing pod anti-affinity and node affinity rules:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - elasticsearch
        topologyKey: "kubernetes.io/hostname"
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - elasticsearch
        topologyKey: "topology.kubernetes.io/zone"
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nodegroup
          operator: In
          values:
          - cluster-ng

The first two rules define a soft pod anti-affinity preference: the scheduler prefers to place an Elasticsearch pod on a node that is not already running a pod labeled app.kubernetes.io/name=elasticsearch (first rule) and that is not in an AZ already hosting such a pod (second rule). Each preference carries a weight of 50. The third rule is a hard node affinity requirement, making sure that Elasticsearch pods are only scheduled on nodes from the “cluster-ng” node group.

You might wonder why we opted for soft pod anti-affinity instead of a hard requirement for Elasticsearch. With hard requirements, scaling an Elasticsearch cluster to four or more instances is not possible in an AWS Region with three AZs, because the number of Elasticsearch instances would be limited by the number of available AZs. Soft pod anti-affinity allows the scheduler to fall back to a node in an AZ that already hosts an Elasticsearch pod.

However, the scheduler still prioritizes nodes in AZs without existing Elasticsearch pods, due to the combined weights: a node in an AZ without an Elasticsearch pod scores 50+50=100, a node in an AZ that already hosts an Elasticsearch pod scores 50, and a node that is itself already running an Elasticsearch pod scores zero.

This scheduling strategy ensures that Elasticsearch pods are distributed across nodes and AZs for high availability, while still allowing the cluster to scale beyond the number of available AZs.

Another reason for us to define dedicated node groups is special, expensive infrastructure nodes that we want to reserve for pods that benefit from them. Services with high memory demand, for example, are scheduled on memory optimized instances with an 8:1 ratio of memory to CPU. Here we use a different approach to assign pods to nodes: taints and tolerations.

While affinity is used to attract pods to nodes, taints are the opposite: they allow nodes to repel pods. If a node is tainted, then only pods tolerating this taint are scheduled on that node. We make use of the way MNGs can apply taints to their nodes automatically.

Memory optimized instances have the following taint in our cluster:

{
  "key": "webmethods.io/resourceType",
  "effect": "NoSchedule",
  "value": "MemoryOptimized"
}
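
The following sketch shows how such a node group could be declared so that the taint is applied automatically when nodes join, using eksctl’s ClusterConfig format for illustration; the cluster name, Region, node group name, sizes, and instance type are assumptions, not our actual IaC:

# Illustrative eksctl definition: the managed node group taints every node it
# launches, so only pods with a matching toleration land on these instances.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: webmethods-example      # hypothetical cluster name
  region: eu-central-1          # hypothetical Region
managedNodeGroups:
  - name: memory-optimized-ng
    instanceType: r5.2xlarge    # 8:1 ratio of memory (GiB) to vCPU
    minSize: 2
    maxSize: 10
    labels:
      nodegroup: memory-optimized-ng
    taints:
      - key: webmethods.io/resourceType
        value: MemoryOptimized
        effect: NoSchedule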

If a pod needs to be scheduled on such an instance, then it must add a matching toleration to its pod spec:

tolerations:
- key: webmethods.io/resourceType
  operator: Equal
  effect: NoSchedule
  value: MemoryOptimized
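
Putting both mechanisms together, the pod spec of a memory-hungry service ends up looking roughly like the following sketch (the Deployment name, image, and node group label are placeholders): the toleration lets the pod onto the tainted nodes, and the nodeSelector makes sure it only lands there.

# Sketch only: a placeholder Deployment combining a toleration (to get past the
# taint) with a nodeSelector (to target the labeled node group).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: in-memory-cache                 # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: in-memory-cache
  template:
    metadata:
      labels:
        app.kubernetes.io/name: in-memory-cache
    spec:
      nodeSelector:
        nodegroup: memory-optimized-ng  # placeholder node group label
      tolerations:
      - key: webmethods.io/resourceType
        operator: Equal
        value: MemoryOptimized
        effect: NoSchedule
      containers:
      - name: cache
        image: public.ecr.aws/docker/library/redis:7   # placeholder image
        resources:
          requests:
            memory: "8Gi"
            cpu: "1"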

Now that we have discussed how services are scheduled, let’s look at how traffic is routed.

As shown in Figure 1, we built an NGINX-based central layer fronting our platform for inbound traffic and use NAT Gateways for outbound traffic.

Figure 1 – Initial high-level design of webMethods iPaaS as SaaS

Reflecting on our initial design

One big lesson we learned is that you’re never done building a platform. With the growing success of our SaaS offering, we identified challenges with our design.

With over 20 MNGs, each containing up to hundreds of nodes, we saw impacts on the scalability of the Cluster Autoscaler and increased complexity when upgrading nodes in the data plane.

Even though Kubernetes namespaces are a great start for building a multi-tenant SaaS application, they do not cover all use cases (for example, where regulations require physical isolation).

SaaS providers need to analyze their costs in multiple dimensions, for example, to identify cost optimization opportunities or evaluate whether their pricing model meets their cost structure. Although Kubecost already provided a lot of insights into our cost structure, we still found blind spots, especially related to costs of our central inbound and outbound networking infrastructure.

And then there was a very popular feature request from our customers: they wanted to access their SaaS tenant and integrate their systems in a private and secure way.

Next generation application integration platform

We started to rearchitect our SaaS platform with the decision to distribute our workload across multiple clusters. Although this decision meant that our operational complexity increased and some resource efficiency was sacrificed, we got vital benefits for a SaaS offering: better isolation and a reduced impact surface.

We separated our workload into shared services used by multiple tenants and services specific to individual tenants. We deployed the latter into dedicated EKS clusters operated in their own virtual private clouds (VPCs), and shared services into multi-tenant clusters. This helped us achieve strong isolation on the control plane, data plane, and networking levels.

Certain services require communication with each other. Therefore, we designed a hub and spoke networking architecture. Tenant-specific services deployed to a customer-spoke VPC are allowed to communicate with other tenant-specific services within their own VPC as well as with shared services deployed to the hub VPC. No communication between spoke VPCs is allowed, effectively isolating each tenant’s compute and networking resources.

We moved the NGINX routing layer into our EKS clusters for increased maintainability of the proxy and its configuration, and extended it for sophisticated management of service-to-service communication. With this, we can use a single Network Load Balancer (NLB) to expose the services from an EKS cluster using the AWS Load Balancer Controller for Kubernetes.
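
As a sketch of what this looks like (namespace, names, and ports are assumptions), the NGINX routing layer is exposed through a Service of type LoadBalancer, and the AWS Load Balancer Controller provisions an internal NLB for it:

# Illustrative Service: the AWS Load Balancer Controller watches it and creates
# an internal NLB with IP targets in front of the NGINX routing layer.
apiVersion: v1
kind: Service
metadata:
  name: nginx-routing-layer
  namespace: routing
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: nginx
  ports:
  - name: https
    protocol: TCP
    port: 443
    targetPort: 8443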

We use the AWS PrivateLink service to enable private connectivity between EKS clusters in the hub VPC and each spoke VPC: the NGINX routing layer, exposed through an NLB in the shared services EKS cluster of the hub VPC, is advertised as a VPC Endpoint in the consumer’s spoke VPC. Then, traffic arriving at the VPC Endpoint in a spoke VPC is encrypted and delivered through the AWS backbone to the NLB of the hub VPC. To allow communication flowing in the opposite direction, from shared services to tenant-specific services, we also expose the NLBs from the tenants’ spoke VPCs as VPC Endpoints in the hub VPC, as shown in Figure 2.

Figure 2 – Hub and spoke VPCs connected with AWS PrivateLink

This design change not only provided us with better isolation and visibility into our tenant-specific cost structure, but also prepared us for future growth. There are no changes to our existing infrastructure when we onboard new tenants, and we do not have to plan our network design upfront. This is because AWS PrivateLink hides the underlying networking configuration, which makes Classless Inter-Domain Routing (CIDR) range planning unnecessary.

Our infrastructure gets more complex as we add new customers to our platform. Setting up PrivateLink connections between environments manually is error-prone and doesn’t scale well. Therefore, we heavily invested in automating our onboarding and maintenance processes using Bitbucket pipelines for our continuous integration/continuous delivery (CI/CD) processes and Terraform for IaC.

Whenever a developer makes a change, an automated pipeline runs. This pipeline checks the quality of the change to identify potential problems. As part of this, it runs `terraform plan` to understand what infrastructure changes would happen if we deployed it.
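
A trimmed-down sketch of such a pipeline is shown below; the image tag, step name, and commands are illustrative assumptions rather than our actual pipeline definition:

# bitbucket-pipelines.yml (illustrative): every pull request is validated and a
# speculative plan is produced before anything gets applied.
image: hashicorp/terraform:1.6
pipelines:
  pull-requests:
    '**':
      - step:
          name: Validate and plan
          script:
            - terraform init -input=false
            - terraform fmt -check -recursive
            - terraform validate
            - terraform plan -input=false -out=tfplan
          artifacts:
            - tfplan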

We have separate pipelines for each service in our product. However, our central platform team builds templates that abstract away the complexity. This means that our product teams can edit configuration files in YAML format, and our templates handle the underlying infrastructure details.
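
For example, a product team might describe its deployment in a small configuration file similar to the following sketch; the schema and all values are hypothetical, because the real templates are owned by our platform team:

# Hypothetical service configuration: platform templates translate this into
# Terraform and Kubernetes resources, so product teams never touch HCL directly.
service:
  name: flow-editor            # hypothetical service name
  nodegroup: cluster-ng
  replicas: 3
  resources:
    memory: 2Gi
    cpu: 500m
  exposure:
    privatelink: true          # expose through the hub VPC routing layer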

A particularly interesting part of our infrastructure automation is the setup of PrivateLink communication between hub and spoke VPCs. Now, we look at the overall process of establishing a PrivateLink connection:

  1. Service provider creates an NLB in the VPC of the service.
  2. Service provider creates a VPC Endpoint service and attaches it to the NLB.
  3. Service provider configures access control for the VPC Endpoint service: who can access the endpoint and does the service provider need to approve each request?
  4. Service provider creates a private DNS name for the VPC Endpoint service and verifies it in Amazon Route 53.
  5. Service consumer creates a VPC Endpoint specifying the service name: this is information that the service provider needs to provide to the service consumer.
  6. Service consumer must request access to the VPC Endpoint service, if the service provider configured approval for each request.
  7. Service provider validates and approves request from service consumer, if the service provider configured approval for each request.
  8. Service consumer configures applications to consume service provider’s service through the VPC Endpoint.

As you can see, setting up PrivateLink is not a single step. It requires creating resources across two separate VPCs, sharing information between the environments, and orchestrating the overall process. This is where Terraform helps us, with two key resources: aws_vpc_endpoint_service on the service provider’s side and aws_vpc_endpoint on the service consumer’s side.

This is a snippet from the Terraform script that we execute in the service provider’s environment to automate steps 1 through 4:

resource "aws_vpc_endpoint_service" "producer" {
  acceptance_required        = false
  allowed_principals         = var.allowed_principals
  network_load_balancer_arns = [data.aws_lb.this.arn]
  private_dns_name           = var.private_dns_name
}
 
resource "aws_route53_record" "verification" {
  name            = aws_vpc_endpoint_service.producer.private_dns_name_configuration[0][ "name"]
  type            = "CNAME"
  ttl             = 300
  zone_id         = var.private_dns_zone_id
  records         = [aws_vpc_endpoint_service.producer.private_dns_name_configuration[0][ "value"]]
}

As we don’t need request approval between hub and spoke VPCs, we can set aws_vpc_endpoint_service.acceptance_required = false to enable automation of the remaining steps in the service consumer’s environment. Here are the relevant parts of our Terraform script:

resource "aws_vpc_endpoint" "consumer" {
  vpc_id              = var.vpc_id
  service_name        = data.terraform_remote_state.vpc_endpoint_service.outputs.service_name
  vpc_endpoint_type   = "Interface"
  auto_accept         = true
  private_dns_enabled = true
  subnet_ids = [var.subnet_ids]
}
 
resource "aws_route53_record" "internal" {
  name    = var.dns_record_name
  type    = "CNAME"
  ttl     = 300
  zone_id = var.private_dns_zone_id
  records = [
    aws_vpc_endpoint.consumer.dns_entry[0]["dns_name"]
  ]
}

We use Terraform’s terraform_remote_state data source to import information about the VPC Endpoint service created in the service provider’s environment when we create the service consumer’s VPC Endpoint.

Last but not least, we use the outputs from Terraform to create a Kubernetes service of the type ExternalName to advertise the VPC Endpoint in the service consumer’s cluster:

apiVersion: v1
kind: Service
metadata:
  name: producer-app
  namespace: consumer
  labels:
    app.kubernetes.io/instance: producer-app
    app.kubernetes.io/name: producer-app-ext-svc
spec:
  type: ExternalName
  externalName: producer-app.hub.my-domain
  sessionAffinity: "None"
  ports:
  - protocol: "TCP"
    port: 8080
    targetPort: 8080

Securely integrating hundreds of individual customer networks

There was one important reason to change our initial SaaS design for webMethods iPaaS that we have not addressed yet: how do we connect our SaaS product to the large number of heterogeneous customer networks in a scalable fashion?

When we drafted solutions for this problem, we decided on a number of tenets that are important to us:

  • Customers are free to use their own established connectivity standards for integrating with our products.
  • State-of-the-art security for exposed endpoints prevents unauthorized access to customer data.
  • Additional operational efforts need to be minimized. So, we prefer managed services offered by AWS.
  • We don’t compromise our fast tenant onboarding. So, the solution needs to be as automated as possible with minimal manual intervention.

The first tenet was especially important for our final decision. We did not want to become the bottleneck for customer onboarding, due to the complexity that comes with arbitrary technologies and processes to integrate our SaaS solution into networks of hundreds of organizations.

Therefore, we decided to offer our customers self-service for integrating our SaaS product into their IT landscape. We use the pattern of transit VPCs, set up in AWS accounts owned by our customers. We use these transit VPCs to expose services from the individual customers’ spoke VPCs. If a customer’s use case demands bidirectional communication, then relevant services on their side are exposed in their spoke VPC as well.

Given the good experience with PrivateLink to connect hub and spoke VPCs, we used the same pattern to integrate spoke and transit VPCs. This allows our customers to use any means to securely and privately integrate systems controlled by them with their tenant within our SaaS product. Figure 3 shows common examples:

Figure 3 – Integrating webMethods iPaaS into customer networks through transit VPCs powered by PrivateLink

Many of our customers set up site-to-site VPN connections to the transit VPC, using AWS Site-to-Site VPN or other third-party technologies already in use in their organization. Enterprise customers in particular often rely on their AWS Direct Connect connections to integrate their transit VPC. Latency-sensitive customers with workloads outside of the three AWS Regions in which our SaaS product is currently available can use the AWS backbone and route traffic from their preferred AWS Regions through their transit VPC using AWS Transit Gateway.

So, did we meet our four tenets? Customers can choose whatever means they want to integrate with us, because the integration happens in their own AWS accounts, following their own processes.

All communication between our and the customers’ network segments is encrypted, isolated from other customers’ traffic by dedicated PrivateLink connections, and routed through the AWS backbone. The decision to introduce spoke VPCs and dedicated EKS clusters further added physical isolation for tenant-specific parts of the workload.

Our operational efforts to maintain infrastructure for integration with our customers’ networks are restricted to the onboarding process. We are not involved in their internal processes, eliminating a potential bottleneck.

Establishing PrivateLink communication between spoke and transit VPCs conceptually follows the same process as before. But there is a major difference: part of the setup is not under our control, because depending on the direction of communication, our customers act as either the service consumer or the service provider. To keep more control over the onboarding process, we have also decided, for now, to enable request approval when the customer creates the VPC Endpoint as a service consumer in their account.

The Terraform snippets shown above can still be used for automating the majority of the process. We are currently exploring options to enhance our tenant onboarding process and offer self-service to customers for setting up their PrivateLink. Until then, we continue to work with our customers’ teams for the initial setup.

This design also improves our visibility into per-tenant costs. This is especially valuable on the networking level, because each customer’s traffic is now routed through dedicated NLBs. The adoption of AWS PrivateLink has also broadened our customer base, as it has attracted new customers looking for enhanced network isolation, lower latency, and stricter security postures.

Conclusion

In this post, we shared our SaaS journey. Our incremental approach enabled us to change architecture decisions when needed. This way, we scaled our release velocity from tens to hundreds of releases per year, leading to higher agility and faster time-to-market, and increasing developer productivity considerably. Additionally, incremental changes also improved resiliency: we reduced minor outages by 40%, resulting in more consistent uptime and a better user experience. We continue to iterate and look forward to building an even better product on AWS!

Request a live demo to see how the webMethods iPaaS offering addresses all of your integration needs! And start by watching this video if you’re considering building your own SaaS offering powered by Amazon EKS!