AWS Big Data Blog

Stream multi-tenant data with Amazon MSK

Real-time data streaming has become prominent in today’s world of instantaneous digital experiences. Modern software as a service (SaaS) applications across all industries rely more and more on continuously generated data from different data sources such as web and mobile applications, Internet of Things (IoT) devices, social media platforms, and ecommerce sites. Processing these data streams in real time is key to delivering responsive and personalized solutions, and maximizes the value of data by processing it as close to the event time as possible.

AWS helps SaaS vendors by providing the building blocks needed to implement a streaming application with Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), and real-time processing applications with Amazon Managed Service for Apache Flink.

In this post, we look at implementation patterns a SaaS vendor can adopt when using a streaming platform as a means of integration between internal components, where streaming data is not directly exposed to third parties. In particular, we focus on Amazon MSK.

Streaming multi-tenancy patterns

When building streaming applications, you should take the following dimensions into account:

  • Data partitioning – Event streaming and storage needs to be isolated at the appropriate level, physical or logical, based on tenant ownership
  • Performance fairness – The performance coupling of applications processing streaming data for different tenants must be controlled and limited
  • Tenant isolation – A solid authorization strategy needs to be put in place to make sure tenants can access only their data

Underpinning all interactions with a multi-tenant system is the concept of SaaS identity. For more information, refer to SaaS Architecture Fundamentals.

SaaS deployment models

Tenant isolation is not optional for SaaS providers, and tenant isolation approaches will differ depending on your deployment model. The model is influenced by business requirements, and the models are not mutually exclusive. Trade-offs must be weighed across individual services to achieve a proper balance of isolation, complexity, and cost. There is no universal solution, and a SaaS vendor needs to carefully weigh their business and customer needs against three isolation strategies: silo, pool, and bridge (or combinations thereof).

In the following sections, we explore these deployment models across data isolation, performance fairness, and tenant isolation dimensions.

Silo model

The silo model represents the highest level of data segregation, but also the highest running cost. Having a dedicated MSK cluster per tenant increases the risk of overprovisioning and requires duplication of management and monitoring tooling.

Having a dedicated MSK cluster per tenant makes sure tenant data partitioning occurs at the disk level when using an Amazon MSK Provisioned model. Both Amazon MSK Provisioned and Serverless clusters support server-side encryption at rest. Amazon MSK Provisioned further allows you to use a customer managed AWS Key Management Service (AWS KMS) key (see Amazon MSK encryption).
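For illustration, here is a sketch of provisioning a tenant-dedicated cluster with a customer managed KMS key, using the AWS SDK for Java v2. The key ARN, subnets, Kafka version, and sizing are placeholders, and error handling is omitted:

```java
import software.amazon.awssdk.services.kafka.KafkaClient;
import software.amazon.awssdk.services.kafka.model.BrokerNodeGroupInfo;
import software.amazon.awssdk.services.kafka.model.CreateClusterRequest;
import software.amazon.awssdk.services.kafka.model.EncryptionAtRest;
import software.amazon.awssdk.services.kafka.model.EncryptionInfo;

public class SiloClusterProvisioner {
    public static void main(String[] args) {
        try (KafkaClient kafka = KafkaClient.create()) {
            kafka.createCluster(CreateClusterRequest.builder()
                .clusterName("tenant1-cluster") // one dedicated cluster per tenant
                .kafkaVersion("3.6.0")
                .numberOfBrokerNodes(3)
                .brokerNodeGroupInfo(BrokerNodeGroupInfo.builder()
                    .instanceType("kafka.m5.large")
                    .clientSubnets("subnet-aaaa", "subnet-bbbb", "subnet-cccc")
                    .build())
                .encryptionInfo(EncryptionInfo.builder()
                    .encryptionAtRest(EncryptionAtRest.builder()
                        // tenant-specific customer managed key for encryption at rest
                        .dataVolumeKMSKeyId("arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE")
                        .build())
                    .build())
                .build());
        }
    }
}
```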

In a silo model, Kafka ACLs and quotas are not strictly required unless your business demands them. Performance fairness is guaranteed because the resources of the entire MSK cluster are dedicated to applications producing and consuming events of a single tenant. This means traffic spikes from one tenant can't impact other tenants, and there is no risk of cross-tenant data access. As a drawback, having a provisioned cluster per tenant requires a right-sizing exercise per tenant, with a higher risk of overprovisioning than in the pool or bridge models.

You can implement tenant isolation at the MSK cluster level with AWS Identity and Access Management (IAM) policies or per-cluster credentials, depending on the authentication scheme in use.

Pool model

The pool model is the simplest model, where tenants share resources. A single MSK cluster is used for all tenants, with data split into topics based on the event type (for example, all events related to orders go to the topic orders), and all tenants' events are sent to the same topic. The following diagram illustrates this architecture.

Image showing a single streaming topic with multiple producers and consumers

This model maximizes operational simplicity, but reduces the tenant isolation options available: the SaaS provider can't differentiate per-tenant operational parameters, and all isolation responsibilities are delegated to the applications producing and consuming data from Kafka. The pool model also doesn't provide any mechanism for physical data partitioning or performance fairness. A SaaS provider with these requirements should consider either a bridge or silo model. If you don't have requirements such as per-tenant encryption keys or tenant-specific data operations, a pool model offers reduced complexity and can be a viable option. Let's dig deeper into the trade-offs.

A common strategy to implement consumer isolation is to identify the tenant within each event using a tenant ID. With Kafka, you can pass the tenant ID either as event metadata (a header) or as an explicit field in the payload itself. A robust approach is to standardize on a tenant ID field used across all applications in both the message payload and the event header. Carrying the tenant ID in the payload reduces the risk of semantic divergence when components process and forward messages, because event headers are handled differently by different processing frameworks and could be stripped when forwarded; the event body, conversely, is often forwarded as a single object, and no contained information is lost unless the event is explicitly transformed. Including the tenant ID in the event header as well can simplify the implementation of services that recover or migrate specific tenants, because they can filter by tenant without deserializing the message payload.
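As an illustration, the following minimal sketch shows a producer writing the tenant ID into both a record header and the JSON payload. The orders topic, field names, and string serialization are assumptions for the example, not a prescribed schema:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TenantAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_SERVERS"); // your MSK bootstrap string
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String tenantId = "tenant1"; // hypothetical tenant identifier
        // Tenant ID in the payload, so it survives forwarding even if headers are stripped
        String payload = "{\"tenant_id\":\"" + tenantId + "\",\"order_id\":\"42\",\"amount\":19.99}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("orders", "42", payload);
            // Tenant ID in a header as well, so services can filter without deserializing the body
            record.headers().add("tenant_id", tenantId.getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
```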

Whether the tenant ID travels in a header or as a field in the event, consumer applications can't selectively subscribe to the events of a specific tenant. With Kafka, a consumer subscribes to a topic and receives all events sent to that topic, for all tenants. Only after receiving an event can the consumer inspect the tenant ID and filter for the tenant of interest, making access segregation virtually impossible. This means sensitive data must be encrypted to make sure a tenant can't read another tenant's data when viewing these events. With Amazon MSK, server-side encryption can only be set at the cluster level, so all tenants sharing a cluster share the same server-side encryption key.
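Because the broker can't enforce this boundary, the filter lives in the consumer. A minimal sketch, assuming the same hypothetical tenant_id header as in the producer example:

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class TenantFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_SERVERS");
        props.put("group.id", "tenant1-order-processor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // the consumer receives ALL tenants' events
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    Header tenantHeader = record.headers().lastHeader("tenant_id");
                    // Client-side filtering only: every event is still delivered to this application
                    if (tenantHeader != null
                            && "tenant1".equals(new String(tenantHeader.value(), StandardCharsets.UTF_8))) {
                        processOrder(record.value());
                    }
                }
            }
        }
    }

    private static void processOrder(String payload) { /* tenant-specific processing */ }
}
```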

In Kafka, data retention can only be set at the topic level. In the pool model, events belonging to all tenants are sent to the same topic, so tenant-specific operations like deleting all data for one tenant are not possible. The immutable, append-only nature of Kafka allows an entire topic to be deleted, but not selected events belonging to a specific tenant. If specific customer data in the stream is subject to the right to be forgotten, such as under GDPR, a pool model will not work for that data, and a bridge or silo model should be considered for that specific data stream.

Bridge model

In the bridge model, a single Kafka cluster is used across all tenants, but events from different tenants are segregated into different topics. With this model, there is a topic for each group of related events per tenant. You can simplify operations by adopting a topic naming convention that includes the tenant ID in the topic name. This effectively creates a namespace per tenant, allows different administrators to manage different tenants with permissions set through a prefix ACL, and avoids naming clashes (for example, events related to orders for tenant 1 go to tenant1.orders and orders of tenant 2 go to tenant2.orders). The following diagram illustrates this architecture.

Image showing multiple producers and consumers each publishing to a stream-per-tenant
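For illustration, here is a minimal sketch of per-tenant topic provisioning with the Kafka AdminClient, following the tenantId.eventType naming convention above. Partition counts and replication factors are placeholder values:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TenantTopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_SERVERS");

        try (AdminClient admin = AdminClient.create(props)) {
            // One namespace per tenant: <tenantId>.<eventType>
            NewTopic tenant1Orders = new NewTopic("tenant1.orders", 6, (short) 3);
            NewTopic tenant2Orders = new NewTopic("tenant2.orders", 12, (short) 3); // busier tenant, more partitions
            admin.createTopics(List.of(tenant1Orders, tenant2Orders)).all().get();

            // Offboarding tenant1 later means deleting only its topics:
            // admin.deleteTopics(List.of("tenant1.orders")).all().get();
        }
    }
}
```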

With the bridge model, server-side encryption using a per-tenant key is not possible. Data from different tenants is stored in the same MSK cluster, and server-side encryption keys can be specified per cluster only. For the same reason, data segregation can only be achieved at file level, because separate topics are stored in separate files. Amazon MSK stores all topics within the same Amazon Elastic Block Store (Amazon EBS) volume.

The bridge model offers per-tenant customization, such as retention policy or max message size, because Kafka allows you to set these parameters per topic. The bridge model also simplifies segregating and decoupling event processing per tenant, allowing a stronger isolation between separate applications that process data of separate tenants.
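As a sketch of such per-tenant customization, the following sets a retention period and a maximum message size on a single tenant's topic. The topic name and values are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TenantTopicConfigurer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_SERVERS");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "tenant1.orders");
            List<AlterConfigOp> ops = List.of(
                // 7-day retention for this tenant only (milliseconds)
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),
                // 2 MB maximum message size for this tenant only
                new AlterConfigOp(new ConfigEntry("max.message.bytes", "2097152"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```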

To summarize, the bridge model offers the following capabilities:

  • Tenant processing segregation – A consumer application can selectively subscribe to the topics belonging to specific tenants and only receive events for those tenants. A SaaS provider can delete the data of a specific tenant by selectively deleting the topics belonging to that tenant.
  • Selective scaling of the processing – With Kafka, the maximum number of parallel consumers is determined by the number of partitions in a topic, which can be set per topic, and therefore per tenant.
  • Performance fairness – You can implement performance fairness using Kafka quotas, which are supported by Amazon MSK, preventing the services processing a particularly busy tenant from consuming too many cluster resources at the expense of other tenants (see the sketch after this list). Refer to the following two-part series for more details on Kafka quotas in Amazon MSK, and an example implementation for IAM authentication.
  • Tenant isolation – You can implement tenant isolation using IAM access control or Apache Kafka ACLs, depending on the authentication scheme that is used with Amazon MSK. Both IAM and Kafka ACLs allow you to control access per topic. You can authorize an application to access only the topics belonging to the tenant it is supposed to process.
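To make the last two items concrete, the following sketch sets per-tenant byte-rate quotas and a prefixed read ACL with the Kafka AdminClient. The client ID, principal name, and rates are illustrative; with IAM access control, you would express the topic-prefix permissions in an IAM policy instead of Kafka ACLs:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class TenantGuardrails {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "BOOTSTRAP_SERVERS");

        try (AdminClient admin = AdminClient.create(props)) {
            // Performance fairness: cap throughput for the clients processing tenant1
            ClientQuotaEntity tenant1Clients = new ClientQuotaEntity(
                Map.of(ClientQuotaEntity.CLIENT_ID, "tenant1-app"));
            ClientQuotaAlteration quota = new ClientQuotaAlteration(tenant1Clients, List.of(
                new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0),   // 1 MB/s produced
                new ClientQuotaAlteration.Op("consumer_byte_rate", 2_097_152.0))); // 2 MB/s consumed
            admin.alterClientQuotas(List.of(quota)).all().get();

            // Tenant isolation: tenant1's service may read only topics prefixed with tenant1.
            AclBinding readAcl = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "tenant1.", PatternType.PREFIXED),
                new AccessControlEntry("User:tenant1-service", "*",
                    AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(readAcl)).all().get();
        }
    }
}
```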

Trade-offs in a SaaS environment

Although each model provides different capabilities for data partitioning, performance fairness, and tenant isolation, they also come with different costs and complexities. During planning, it’s important to identify what trade-offs you are willing to make for typical customers, and provide a tier structure to your client subscriptions.

The following table summarizes the supported capabilities of the three models in a streaming application.

Capability | Pool | Bridge | Silo
Per-tenant encryption at rest | No | No | Yes
Can implement right to be forgotten for a single tenant | No | Yes | Yes
Per-tenant retention policies | No | Yes | Yes
Per-tenant event size limit | No | Yes | Yes
Per-tenant replayability | Yes (must implement with logic in consumers) | Yes | Yes

Anti-patterns

In the bridge model, we discussed tenant segregation by topic. An alternative would be segregating by partition, where all messages of a given type are sent to the same topic (for example, orders), but each tenant has a dedicated partition. This approach has many disadvantages, and we strongly discourage it. In Kafka, partitions are the unit of horizontal scaling and balancing of brokers and consumers. Assigning partitions to tenants can unbalance the cluster and introduce operational and performance issues that are hard to overcome.

Some level of data isolation, such as per-tenant encryption keys, could be achieved using client-side encryption, delegating encryption and decryption to the producer and consumer applications. This approach would allow you to use a separate encryption key per tenant. We don't recommend this approach because it introduces a higher level of complexity in both the consumer and producer applications. It may also prevent you from using standard programming libraries, Kafka tooling, and most Kafka ecosystem services, such as Kafka Connect or MSK Connect.

Conclusion

In this post, we explored three patterns that SaaS vendors can use when architecting multi-tenant streaming applications with Amazon MSK: the pool, bridge, and silo models. Each model presents different trade-offs between operational simplicity, tenant isolation level, and cost efficiency.

The silo model dedicates full MSK clusters per tenant, offering a straightforward tenant isolation approach but incurring a higher maintenance and cost per tenant. The pool model offers increased operational and cost-efficiencies by sharing all resources across tenants, but provides limited data partitioning, performance fairness, and tenant isolation capabilities. Finally, the bridge model offers a good compromise between operational and cost-efficiencies while providing a good range of options to create robust tenant isolation and performance fairness strategies.

When architecting your multi-tenant streaming solution, carefully evaluate your requirements around tenant isolation, data privacy, per-tenant customization, and performance guarantees to determine the appropriate model. Combine models if needed to find the right balance for your business. As you scale your application, reassess isolation needs and migrate across models accordingly.

As you’ve seen in this post, there is no one-size-fits-all pattern for streaming data in a multi-tenant architecture. Carefully weighing your streaming outcomes and customer needs will help determine the correct trade-offs you can make while making sure your customer data is secure and auditable. Continue your learning journey on SkillBuilder with our SaaS curriculum, get hands-on with an AWS Serverless SaaS workshop or Amazon EKS SaaS workshop, or dive deep with Amazon MSK Labs.


About the Authors

Emanuele Levi is a Solutions Architect in the Enterprise Software and SaaS team, based in London. Emanuele helps UK customers on their journey to refactor monolithic applications into modern microservices SaaS architectures. Emanuele is mainly interested in event-driven patterns and designs, especially when applied to analytics and AI, where he has expertise in the fraud-detection industry.

Lorenzo Nicora is a Senior Streaming Solution Architect helping customers across EMEA. He has been building cloud-native, data-intensive systems for over 25 years, working across industries, in consultancies and product companies. He has leveraged open-source technologies extensively and contributed to several projects, including Apache Flink.

Nicholas Tunney is a Senior Partner Solutions Architect for Worldwide Public Sector at AWS. He works with Global SI partners to develop architectures on AWS for clients in the government, nonprofit healthcare, utility, and education sectors. He is also a core member of the SaaS Technical Field Community where he gets to meet clients from all over the world who are building SaaS on AWS.