AWS Partner Network (APN) Blog

How CyberArk Built a Tenant Management Service for its SaaS Offering

By Ran Isenberg, Principal Software Architect – CyberArk
By Yossi Lagstein, Sr. Solutions Architect – AWS


Software-as-a-service (SaaS) applications can contain many services maintained by different teams using various technologies. Customers who sign up to use CyberArk’s SaaS applications are represented as tenants in the system. A tenant is the most fundamental construct of a SaaS environment.

A tenant management service manages the tenant’s provisioning and lifecycle. Tenant management is usually one of the first services SaaS providers build for their SaaS control plane, as the tenant onboarding experience must be simple and fast.

CyberArk is an AWS Partner and global leader in identity security, and its products have focused heavily on SaaS and cloud-native solutions in recent years. CyberArk has a rich portfolio of SaaS products for which customers can buy and use a subscription-based license. Each customer has access to various business services.

At CyberArk, the delivery team handles customer onboarding and tenant management. Besides creating tenants, the delivery team needs access to a unified view of tenants’ subscription configuration. This allows the team to add or remove products from a tenant per the customer’s needs. The team requires a simple and fast process to perform these actions.

In this post, you will learn how CyberArk built a serverless, simple, and scalable tenant management service. Its primary responsibility is adding new tenants and provisioning multiple CyberArk products to the tenants according to the customer’s subscription.

In addition, the service manages the lifecycle of the tenant and its products, from licenses and configurations to customer notifications and tenant deletion.

Design Considerations

The following key requirements played a significant role in shaping the final design of the service:

  • Each business service uses a dedicated Amazon Web Services (AWS) account, requiring cross-account access to other business services. Establishing the cross-account access requires integration with other teams, which increases coupling and complexity if not done correctly.
  • There are multiple development groups, each owning a business service. Each group can choose the best way to develop and deliver their service. They can select various technology stacks (like Kubernetes, Amazon EC2, and serverless) and programming languages. The tenant management service is required to orchestrate a tenant creation in different systems and technology stacks with the same generic process, regardless of the underlying service technology.
  • Some tenant onboarding processes are long-running. For example, tenant creation in a business service can vary from a few seconds to about an hour, depending on the service’s technology stack. As a result, the process needs to be asynchronous. The challenge is to handle tenant creation asynchronously and handle errors or timeouts in case the business services take too long to respond.
  • Tenant deployment models can vary between different CyberArk products. Some products use a silo model, meaning each tenant gets a dedicated set of resources, while others use a pool model where multiple tenants share infrastructure. Product provisioning is considered a black box and hidden from the tenant management service that serves as the orchestrator.
  • The tenant management service architecture includes a user interface and a CRUD (Create, Read, Update, Delete) API to view, create, delete and edit tenants, their configurations, and subscription licenses. The tenant management service is the source of truth for customers’ business services status and configuration.

To meet the requirements, CyberArk’s tenant management service team came up with a straightforward approach that minimizes the dependency between the tenant management service and the business services.

The architecture described in this post is a generalization of the CyberArk tenant management service. This post focuses on the “create tenant” action, but the architecture and event flow are the same for all other tenant management events.

Solution Overview

Below is a high-level architecture of the tenant management service the team built to meet the aforementioned requirements. It comprises two logical units: CRUD API and Orchestration.

Figure 1 – Tenant management service.

Below are the steps to create a tenant:

  1. A delivery team member triggers the CRUD API to create a tenant. This single API call will start the onboarding process.
  2. The CRUD API validates the request, calls the Orchestration logical unit to start the event-driven asynchronous tenant creation process, and returns a successful response. The tenant is neither created nor active at that stage but is in the “create in progress” state.
  3. The Orchestration logical unit fans out the “create tenant” request to the business services and waits for their responses.
  4. Once the tenant’s business services are ready, the Orchestration logical unit emails the customer an invitation to use the new tenant.
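Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not CyberArk’s implementation: the field names, the `create_tenant` function, and the in-memory `table` dictionary (standing in for the tenant database) are all assumptions made for the example.

```python
import json
import uuid

# Hypothetical request fields; the real request schema is not shown in this post.
REQUIRED_FIELDS = ("admin_email", "region", "services")


def create_tenant(request: dict, table: dict) -> dict:
    """Validate the request, record the tenant as in progress, and return at once."""
    missing = [field for field in REQUIRED_FIELDS if not request.get(field)]
    if missing:
        return {"statusCode": 400, "body": json.dumps({"missing": missing})}

    tenant_id = str(uuid.uuid4())
    # `table` stands in for the tenant database in this sketch.
    table[tenant_id] = {
        "tenant_id": tenant_id,
        "status": "create in progress",
        **{field: request[field] for field in REQUIRED_FIELDS},
    }
    # Provisioning continues asynchronously; the caller only gets the tenant ID back.
    return {"statusCode": 201, "body": json.dumps({"tenant_id": tenant_id})}
```

The point of the sketch is that the API call returns before any business service has done work, so the caller sees a tenant in the “create in progress” state rather than a finished tenant.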

CyberArk makes the cross-account access simple by implementing a producer-consumer pattern. The tenant management service uses an Amazon Simple Notification Service (Amazon SNS) topic to publish tenant lifecycle events. Each business service deploys an Amazon Simple Queue Service (Amazon SQS) queue in its account and subscribes to tenant lifecycle events.
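For an SNS topic in one account to deliver to an SQS queue in another, two resource policies are typically involved: the topic must let the subscriber account subscribe, and the queue must let the topic send messages. The sketch below builds both as Python dictionaries. The function names, statement IDs, and any ARNs passed in are hypothetical, and the post does not show the policies CyberArk actually uses.

```python
def topic_policy_for_subscriber(topic_arn: str, subscriber_account_id: str) -> dict:
    """Topic-side policy: let a business service account subscribe to the topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBusinessServiceSubscribe",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{subscriber_account_id}:root"},
                "Action": "sns:Subscribe",
                "Resource": topic_arn,
            }
        ],
    }


def queue_policy_for_topic(queue_arn: str, topic_arn: str) -> dict:
    """Queue-side policy: let SNS deliver messages, but only from this one topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTenantEventsFromTopic",
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
```

With this split, the tenant management service only needs to know which accounts may subscribe, and each business service owns its queue and everything behind it.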

The same pattern allows each business service to use a different technology stack for provisioning tenant resources. The business service SQS queue triggers an AWS Lambda function that starts the provisioning process.
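As a rough illustration of the consumer side, here is what a business service’s Lambda handler might look like. It assumes raw message delivery is disabled on the subscription, so each SQS record body is an SNS notification whose `Message` field holds the published payload. The payload fields (`tenant_id`, `action`) and the `start_provisioning` hook are hypothetical.

```python
import json


def start_provisioning(tenant_id: str, action: str) -> str:
    """Hypothetical hook: start this service's own provisioning process.

    A real service might start a container job, a pipeline, or another workflow here.
    """
    return f"{action}:{tenant_id}"


def handler(event: dict, context: object = None) -> list:
    """Unwrap each SQS record and hand the tenant request to the provisioning hook."""
    started = []
    for record in event["Records"]:
        # The SQS body is the SNS notification JSON (raw delivery assumed off).
        notification = json.loads(record["body"])
        # The payload published by the tenant management service is a JSON string.
        message = json.loads(notification["Message"])
        started.append(start_provisioning(message["tenant_id"], message["action"]))
    return started
```

Only `start_provisioning` differs between business services, which is what lets each team keep its own technology stack behind the same queue interface.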

To deal with the error-prone, long-running nature of the tenant creation process, CyberArk orchestrates the entire process using an AWS Step Functions state machine with a Wait state and a Fail state.

Decoupling tenant creation orchestration from its implementation by each business service allows each business service to design different deployment models.

Now that we have covered the high-level design and described how CyberArk addressed the design considerations, let’s dive into the design details of each logical unit.

Tenant Management CRUD API Logical Unit

We’ll start by looking into the tenant management CRUD API logical unit architecture.

Figure 2 – Tenant management CRUD API logical unit.

The steps of the tenant creation process are:

  1. A delivery team member authenticates to CyberArk’s identity provider (IdP). Once logged in, the team member gets a JSON Web Token (JWT) with claims that describe the team member’s permissions. The team member then sends a “create tenant” request to the Amazon API Gateway REST API. The request contains the relevant configuration, such as tenant admin contact details, the AWS region to deploy to, and the list of business services to create.
  2. API Gateway triggers a Lambda function, the “create” handler, which authenticates and authorizes the request based on the claims in the team member’s JWT.
  3. The “create” handler generates a unique tenant ID for the customer, stores the tenant configuration details in an Amazon DynamoDB table, and sets the tenant status to “create in progress.” The handler then returns an HTTP 201 status code.
  4. Adding tenant configuration details to the DynamoDB table creates a new stream record in Amazon DynamoDB Streams. The new stream record triggers the “dispatcher” AWS Lambda function.
  5. The “dispatcher” function parses the DynamoDB Streams record to retrieve the event action (“create tenant” in this flow), and calls the tenant management Orchestration logical unit with the event details.
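The “dispatcher” step can be sketched as follows. The mapping from stream event names to tenant actions, the attribute names, and the `start_workflow` callable are assumptions for illustration. The sketch also assumes the stream is configured to include item images, and it only flattens string and list-of-string attributes.

```python
from typing import Callable, Optional

# Assumed mapping from DynamoDB Streams event names to tenant lifecycle actions.
ACTIONS = {"INSERT": "create", "MODIFY": "update", "REMOVE": "delete"}


def plain(image: dict) -> dict:
    """Flatten DynamoDB-typed attributes; handles only S and lists of S."""
    out = {}
    for key, typed in image.items():
        if "S" in typed:
            out[key] = typed["S"]
        elif "L" in typed:
            out[key] = [item["S"] for item in typed["L"]]
    return out


def to_event(record: dict) -> Optional[dict]:
    """Turn one stream record into a tenant event, or None if it is not relevant."""
    action = ACTIONS.get(record.get("eventName"))
    if action is None:
        return None
    data = record["dynamodb"]
    # REMOVE records carry only the old image of the item.
    image = data.get("NewImage") or data.get("OldImage", {})
    return {"action": action, "tenant": plain(image)}


def handler(event: dict, context: object = None,
            start_workflow: Callable[[dict], None] = print) -> list:
    """Dispatch every relevant stream record to the orchestration entry point."""
    events = [e for e in map(to_event, event["Records"]) if e]
    for tenant_event in events:
        # Hypothetical hook; a real dispatcher would start the orchestration workflow.
        start_workflow(tenant_event)
    return events
```

Driving orchestration from the stream rather than from the “create” handler keeps the API response fast and means the database write is the single trigger for everything that follows.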

Tenant Management Orchestration Logical Unit

The Orchestration logical unit uses AWS Step Functions as the orchestration mechanism for its ease of use and support for timeout and wait functionality. Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.

On each tenant creation event, the Step Functions workflow performs a number of tasks. The state machine waits until all business services have finished tenant creation and responded with either a success or failure status.

Figure 3 – Tenant management Orchestration logical unit.

The Step Functions state machine workflow comprises the following states:

  1. The “provision identity provider” state invokes a Lambda function to create an Amazon Cognito user pool for the tenant. Each tenant requires its own isolated IdP to manage users or connect to the tenant’s external IdP.
  2. The “trigger tenant creation” state invokes a Lambda function using the Task Token integration pattern. The function:
    • Adds the initial state entry to the Workflow State Database. The state entry includes the following details: tenant ID, request ID, action type (create/delete/update), list of business services required to respond, their current creation status, and the task token.
    • Publishes the “create tenant request” to the SNS topic.
    • Amazon SNS fans out the request to the business services. Each business service uses Amazon SQS to subscribe to tenant management messages.
    • The business services read the SQS message using a Lambda function. The function parses the request, validates it, and triggers the corresponding tenant creation process. It can be a serverless or Kubernetes-based service, or any other technology stack. Once the business service completes the tenant creation, it sends the “create tenant response” message to the tenant management’s SQS queue with either a success or failure status.
    • The SQS queue triggers the “tenant creation status checker” Lambda function. It parses the business service response, validates it, and updates the service status in the Workflow State Database.
    • When all business services have responded, the “tenant creation status checker” function uses the Task Token from the Workflow State Database to resume the Step Functions workflow execution.
  3. The “update tenant status” state invokes a Lambda function that updates the Tenant Database table with the business services’ status. The Tenant Database is the source of truth for business services’ status and configuration.
  4. The “send welcome mail” state invokes a Lambda function that creates a tenant-specific welcome mail and login instructions for the admin user.
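The core of the “tenant creation status checker” is deciding when every business service has answered and then resuming the paused workflow. The sketch below models the Workflow State Database entry as a plain dictionary and takes a `resume` callable. The entry layout, the `"pending"` marker, and both names are assumptions. In a real function, `resume` would call the Step Functions SendTaskSuccess or SendTaskFailure API with the stored task token.

```python
from typing import Callable


def record_response(state: dict, service: str, status: str,
                    resume: Callable[[str, bool], None]) -> bool:
    """Store one service's result and resume the workflow once all have answered.

    Returns True if the workflow was resumed by this call.
    """
    if service not in state["services"]:
        raise KeyError(f"unexpected service: {service}")
    if status not in ("success", "failure"):
        raise ValueError(f"unexpected status: {status}")

    state["services"][service] = status
    statuses = state["services"].values()
    if any(s == "pending" for s in statuses):
        return False  # still waiting for at least one business service

    # Everyone has answered: report overall success only if no service failed.
    resume(state["task_token"], all(s == "success" for s in statuses))
    return True
```

Because several responses can arrive at nearly the same time, a production version would likely need the status update and the “all responded” check to be atomic in the database, which this in-memory sketch does not attempt.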

Tenant Management Orchestration Logical Unit Error Handling

The tenant creation process can be lengthy and complex. Let’s examine how the Orchestration logical unit handles errors and timeouts.

Figure 4 – Tenant management Orchestration error handling.

A timeout can occur if one or more business services do not respond and the Step Functions workflow is stuck at the “trigger tenant creation” state. When a timeout occurs, the workflow transitions to the Timeout state. The Timeout state invokes a Lambda function that creates an error event and sends it to the Dead Letter Queue (DLQ).

When any exception occurs, the workflow transitions to the Fail state, which invokes a Lambda function that creates an error event and sends it to the DLQ.

When the DLQ receives an event, it triggers the “DLQ error” handler. The handler sets the tenant and business service status to “failed to create” in the Tenant Database so the delivery team is aware of the failure and can take action.

In both cases, the error event describes the state that led to the error and contains the create tenant request details.
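In Amazon States Language, this kind of routing is usually expressed with `TimeoutSeconds` and `Catch` on the task state. The sketch below builds such a state as a Python dictionary, along with an error event of the shape described above. The state names, payload keys, and timeout value are assumptions, not CyberArk’s actual definition.

```python
def trigger_state(function_name: str, timeout_seconds: int) -> dict:
    """Build a task state that waits for a task token and routes errors."""
    return {
        "Type": "Task",
        # The .waitForTaskToken suffix pauses the workflow until the token is returned.
        "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
        "Parameters": {
            "FunctionName": function_name,
            "Payload": {"request.$": "$", "task_token.$": "$$.Task.Token"},
        },
        "TimeoutSeconds": timeout_seconds,
        "Catch": [
            # No response in time: go to the (hypothetically named) Timeout state.
            {"ErrorEquals": ["States.Timeout"], "Next": "Timeout", "ResultPath": "$.error"},
            # Any other error: go to the Fail state. States.ALL must come last.
            {"ErrorEquals": ["States.ALL"], "Next": "Fail", "ResultPath": "$.error"},
        ],
        "Next": "UpdateTenantStatus",
    }


def error_event(failed_state: str, request: dict, error: dict) -> dict:
    """Describe which state failed and carry the original create tenant request."""
    return {"failed_state": failed_state, "request": request, "error": error}
```

Since tenant creation in a business service can take up to about an hour, a timeout somewhat above that (for example, `trigger_state("trigger-tenant-creation", 2 * 3600)`) would be a plausible starting point, though the right value depends on the slowest service.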

Conclusion

Implementing tenant management is a challenge for many SaaS providers. This post describes how CyberArk built a tenant management service for its SaaS offering and shows how to decouple the tenant management service from the business services.

If your company is looking to build a SaaS offering, this post can be a helpful reference. If your company has an existing SaaS offering, use this post as a reference when looking into the evolution of your architecture.



CyberArk – AWS Partner Spotlight

CyberArk is an AWS Partner and global leader in identity security with a rich portfolio of SaaS products for which customers can buy and use a subscription-based license.
