Simplifying Kafka operations with Amazon MSK Express brokers

In this post, we show you how Amazon Managed Streaming for Apache Kafka (Amazon MSK) Express brokers streamline the end-to-end activities for Kafka administration. Apache Kafka has become the de facto standard for real-time data streaming, powering mission-critical applications across industries worldwide. Its popularity stems from its ability to handle high-throughput, fault-tolerant data pipelines at scale. Given its central role in modern data architectures, managing Apache Kafka with high resilience and reliability is essential for business success.

To maintain this level of resilience, administrators need to handle several important operational tasks. Apache Kafka is a distributed stateful system, whose state management requires constant communication and data movement in dynamic cloud environments. Administrators need to carefully size clusters by calculating complex compute, storage, and network requirements. They must provision storage volumes upfront and monitor utilization constantly to avoid disruptions. When workloads grow, scaling the cluster requires hours or days of effort using multiple tools to provision capacity and rebalance load.

With these operational requirements in mind, many administrators ask: is there an easier way to manage Apache Kafka at scale while maintaining the high resilience their applications demand?

Amazon MSK Express addresses these challenges directly. In this post, we show you how MSK Express brokers streamline the end-to-end activities for Kafka administration, including:

Sizing Kafka clusters for optimal performance and cost
Scaling cluster storage up and down with workload changes
Scaling cluster compute in and out over time
Monitoring cluster health
Managing cluster security
Ensuring high availability with fast and automatic broker recovery

What are Amazon MSK Express brokers?

Amazon MSK Express brokers are a transformative breakthrough for customers needing high-throughput Kafka clusters that scale faster and cost less. Express brokers reimagine Kafka’s architecture by decoupling compute from storage, which unlocks performance and elasticity benefits. Express brokers deliver performance improvements that directly impact your operations:

Up to 3x more throughput per broker, allowing you to handle more data with fewer resources and lower costs
Rebalance partitions across brokers 180x faster, reducing scaling from hours to minutes
Scale up to 20x faster, enabling you to respond to demand spikes without lengthy planning cycles
Recover 90% quicker compared to standard Apache Kafka brokers, minimizing workload disruption and maintaining business continuity

To learn more about the technical details, see Express brokers for Amazon MSK: Turbo-charged Kafka scaling with up to 20 times faster performance. For a comprehensive overview of Express broker capabilities, see the MSK Express brokers documentation.

Let’s explore how MSK Express brokers simplify Apache Kafka management.

Sizing an Express cluster

Sizing a traditional Apache Kafka cluster is complex. Working backwards from your ingress and egress load, you need to consider every dimension of your cluster: compute, storage, and network limits. Each node must be carefully sized to handle:

Ingress and egress traffic from your clients
Internal Kafka operations like replication and rebalancing (the process of redistributing partitions across brokers to maintain balance)
High availability with node and Availability Zone failures
Client operations like backfill procedures when reading historical data

These activities impact your cluster storage I/O limits, network ingress/egress limits, and CPU and memory constraints. Beyond this, you need to consider the number of partitions required and determine whether your cluster can scale to handle partition management for your use case.

MSK Express brokers simplify this calculus. Rather than considering these complex variables, you can focus on what matters:

Your ingress throughput
Your egress throughput
Your partition needs

MSK documents the Express broker throughput throttle and partition limits by broker size. MSK pre-calculates these to consider all cluster limits. They include multi-Availability Zone high availability to handle rare events like node failures or AZ impairment.
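The three inputs above reduce sizing to a short calculation. The following is a minimal sketch, not an official sizing tool: the function name is hypothetical, and the per-broker limits in the usage example are placeholders rather than published values, so substitute the limits MSK documents for your broker size. It also assumes the broker count is kept at a multiple of the number of Availability Zones.

```python
import math


def brokers_needed(ingress_mbps, egress_mbps, partitions,
                   max_ingress_mbps, max_egress_mbps, max_partitions,
                   az_count=3):
    """Return the smallest broker count that satisfies all three limits.

    The max_* arguments are per-broker limits for the chosen broker size.
    They are inputs here, not values taken from the MSK documentation.
    """
    by_ingress = math.ceil(ingress_mbps / max_ingress_mbps)
    by_egress = math.ceil(egress_mbps / max_egress_mbps)
    by_partitions = math.ceil(partitions / max_partitions)
    needed = max(by_ingress, by_egress, by_partitions, 1)
    # Assumption: keep the broker count a multiple of the AZ count
    # so brokers are spread evenly across Availability Zones.
    return math.ceil(needed / az_count) * az_count


# Placeholder limits for illustration only.
count = brokers_needed(
    ingress_mbps=100, egress_mbps=200, partitions=5000,
    max_ingress_mbps=30, max_egress_mbps=60, max_partitions=1000,
)
print(count)
```

In this example the partition count is the binding constraint (5 brokers), which rounds up to 6 brokers across 3 Availability Zones.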

Notice we did not discuss storage in sizing an Express cluster. That is because storage in Express scales nearly infinitely. You pay for storage as you go rather than sizing storage up front.

Scaling Express cluster storage

With sizing simplified by focusing on throughput and partitions, storage management becomes the next operational consideration.

Normally, Apache Kafka clusters need storage volumes pre-provisioned to handle all retained data. You must allocate all storage up-front and pay for that storage no matter what your actual data retention is.

Example: If you store 7 days of data at 1 MB/sec ingress, that’s roughly 605 GB of storage (1 MB/sec × 604,800 seconds). This does not include data replication across nodes or buffers for growth and workload variability. Once you account for replicas and storage buffers, this workload requires over 3 TB of storage, allocated up-front.
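The arithmetic behind this example can be sketched as follows. The post does not state the replication factor or the buffer size it used, so the replication factor of 3 and the 70% headroom below are illustrative assumptions, chosen only to show how a roughly 605 GB retention requirement can grow past 3 TB of provisioned storage.

```python
SECONDS_PER_DAY = 86_400


def provisioned_storage_gb(ingress_mb_per_sec, retention_days,
                           replication_factor, headroom):
    """Estimate storage to pre-provision on a traditional Kafka cluster.

    headroom is a fraction (0.7 means 70% extra capacity for growth
    and workload variability). Uses 1 GB = 1000 MB.
    """
    retained_gb = ingress_mb_per_sec * SECONDS_PER_DAY * retention_days / 1000
    return retained_gb * replication_factor * (1 + headroom)


# Raw retained data: 1 MB/sec for 7 days, no replicas, no buffer.
raw_gb = provisioned_storage_gb(1, 7, replication_factor=1, headroom=0.0)

# Assumed replication factor and headroom (not stated in the post).
total_gb = provisioned_storage_gb(1, 7, replication_factor=3, headroom=0.7)

print(f"retained: {raw_gb:.1f} GB, provisioned: {total_gb:.1f} GB")
```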

As your workload evolves, careful monitoring of storage utilization becomes essential. Adding storage capacity prevents workload disruptions. Often, you cannot reclaim this storage. Once you increase the volume size, you continue paying for additional storage even if your workload scales down and no longer requires additional capacity.

With Express brokers, there is no need for sizing and provisioning storage volumes. You pay for what you use with no provisioning: the data ingested to the cluster and data stored in the cluster per-GB-per-hour. All data stored in the cluster is replicated across 3 Availability Zones for high availability. This pay-as-you-go model eliminates wasted capacity costs and reduces your total infrastructure spend.

As workloads scale up, the cluster uses more storage with no changes needed from you
When workloads scale down, the cluster uses less storage, reducing storage charges automatically

Storage management for Apache Kafka becomes simpler with Express. You focus on ensuring that your per-topic retention is right-sized for each use case. That is the only consideration. Once you set up topic retention, MSK Express automatically manages and cost-optimizes storage on your behalf.
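Per-topic retention is expressed through the Kafka topic configuration retention.ms, which takes a value in milliseconds. The sketch below converts human-readable retention periods into that setting. The topic names and the helper function are hypothetical, and how you apply the resulting configuration (console, CLI, SDK, or Kafka admin tooling) is up to you.

```python
from datetime import timedelta


def retention_configs(retention_by_topic):
    """Map each topic to a Kafka topic-config dict with retention.ms set."""
    return {
        topic: {"retention.ms": str(int(period.total_seconds() * 1000))}
        for topic, period in retention_by_topic.items()
    }


# Hypothetical topics with retention right-sized per use case.
configs = retention_configs({
    "orders": timedelta(days=7),
    "clickstream": timedelta(hours=6),
})
print(configs)
```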

Storage management in MSK Express brokers is far simpler than in a traditional Apache Kafka cluster. So is scaling the compute capacity for an Express-based cluster.

Scaling Express cluster compute

Just as storage scales automatically with your workload, compute capacity can also adapt to changing demands.

As your workload grows and changes, you may find that you exceed your initial sizing estimates. For a traditional Apache Kafka cluster, scaling the cluster capacity is a significant event. Scaling takes effort to provision capacity and rebalance load, and it requires multiple tools to manage the process (compute, storage, DNS, rebalancing, client configs, and more). The scaling process can take hours or days to complete, which can exacerbate application impact. This means you need to plan well ahead to ensure your Kafka cluster is prepared for any load changes.

With MSK Express clusters, this process becomes much simpler and requires little to no upfront planning. Scaling causes near-zero disruption to your existing workload, allowing your team to focus on building features rather than managing infrastructure.

To scale up an MSK Express cluster, you simply add brokers to the cluster. Once new brokers come online, Express Intelligent Rebalancing automatically rebalances topic partitions to the new nodes. Thanks to the Express storage architecture, the new nodes automatically have almost all the data they need. There is no significant inter-broker communication for rebalancing. This causes no disruption to existing brokers.
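Adding brokers can be scripted. The following is a minimal sketch, assuming a boto3 "kafka" client and its describe_cluster and update_broker_count operations; the scale_out helper is hypothetical, and the response fields it reads should be checked against the current SDK documentation before use. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def scale_out(kafka_client, cluster_arn, brokers_to_add):
    """Request additional brokers and return the new target broker count.

    kafka_client is expected to behave like boto3.client("kafka").
    """
    info = kafka_client.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    target = info["NumberOfBrokerNodes"] + brokers_to_add
    kafka_client.update_broker_count(
        ClusterArn=cluster_arn,
        # The current cluster version guards against concurrent updates.
        CurrentVersion=info["CurrentVersion"],
        TargetNumberOfBrokerNodes=target,
    )
    return target


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   scale_out(boto3.client("kafka"), "<your cluster ARN>", brokers_to_add=3)
```

The request is asynchronous, so the call returns before the new brokers are online. Add brokers in multiples of the number of Availability Zones the cluster uses.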

The cluster then elects new partition leaders on the new brokers, enabling producers to direct traffic to the new nodes. The same applies to consumer groups.

Express broker DNS design keeps this in mind. Express broker connection strings abstract away from the nodes themselves. Clients connect to the active broker nodes with one connection string. No changes to DNS, load balancing, or client configurations are needed.

Deciding when to scale in an Express cluster is also simpler than in a traditional Apache Kafka cluster. The simplified Express architecture means less to monitor and manage for long-term cluster operations.

Monitoring Express clusters

With simplified scaling decisions comes simplified monitoring. Express brokers reduce the number of metrics you need to track for cluster health. The following dashboard highlights the key metrics for monitoring MSK Express broker health.

Dashboard with key Amazon MSK Express brokers metrics

In a traditional Apache Kafka cluster, you need to consider dozens of metrics to understand overall cluster health. Express brokers simplify this operational process. They highlight ingress and egress throughput as two critical metrics for workload sizing and scaling. This streamlined monitoring approach reduces the expertise required to operate Kafka clusters and allows smaller teams to manage larger deployments effectively.

Other factors, like poorly designed clients, can incur additional overhead on a cluster. This can cause symptoms such as high CPU utilization without high ingress throughput. It is still important to monitor a variety of metrics with MSK Express brokers.

For Express brokers, the following table shows the critical metrics you must monitor and alert on for cluster health:

Metric Name	Description	Recommended Alarm
BytesInPerSec	Ingress throughput to the cluster	When > broker limit for > 5 minutes
BytesOutPerSec	Egress throughput from the cluster	When > broker limit for > 5 minutes
CpuUser + CpuSystem	CPU utilization percentage	When greater than 60% for 15 minutes
NetworkProcessorAvgIdlePercent	Network processor thread idle time	When less than 0.5 for > 5 minutes
RequestHandlerAvgIdlePercent	Request processor thread idle time	When less than 0.4 for > 15 minutes
FetchThrottleByteRate	Consumer fetch throttling rate	When > 0 for > 15 minutes
ProduceThrottleByteRate	Producer ingress throttling rate	When > 0 for > 15 minutes
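Alarms like the ones in the table can be defined in code. The sketch below builds the keyword arguments for the CloudWatch put_metric_alarm API for the two thread idle-time metrics. It assumes these metrics are published per broker in the AWS/Kafka namespace with the "Cluster Name" and "Broker ID" dimensions; the helper function, the alarm naming scheme, and the cluster name are hypothetical.

```python
PERIOD_SECONDS = 300  # evaluate in 5-minute periods


def idle_alarm(cluster_name, broker_id, metric_name, threshold, minutes):
    """Build put_metric_alarm kwargs that fire when idle time stays low."""
    return {
        "AlarmName": f"{cluster_name}-broker{broker_id}-{metric_name}",
        "Namespace": "AWS/Kafka",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster_name},
            {"Name": "Broker ID", "Value": str(broker_id)},
        ],
        "Statistic": "Average",
        "Period": PERIOD_SECONDS,
        # Number of consecutive periods that must breach the threshold.
        "EvaluationPeriods": max(1, (minutes * 60) // PERIOD_SECONDS),
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
    }


alarms = [
    idle_alarm("demo-cluster", 1, "NetworkProcessorAvgIdlePercent", 0.5, 5),
    idle_alarm("demo-cluster", 1, "RequestHandlerAvgIdlePercent", 0.4, 15),
]

# To create the alarms (requires boto3 and AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   for alarm in alarms:
#       cloudwatch.put_metric_alarm(**alarm)
```

The combined CpuUser + CpuSystem alarm needs a metric math expression rather than a single metric, so it is left out of this sketch.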

For more information on monitoring Amazon MSK, see Monitoring Amazon MSK with Amazon CloudWatch.

Managing Express cluster access

Beyond monitoring, cluster management is another area where MSK Express brokers reduce operational complexity.

Express brokers simplify the internal management of Kafka clusters. In a traditional Kafka environment, you use schemes like SASL/SCRAM (username and password-based authentication) or mutual TLS (certificate-based authentication) for client authentication. Once authenticated, you configure complex Kafka ACLs (Access Control Lists—permissions that define who can access which topics) inside the Kafka cluster to authorize client access to topics and data.

These paradigms require you to manage all topics, authentication, and authorization inside Apache Kafka. This includes credential management, rotation, and other operational activities surrounding cluster access.

MSK simplifies this process by integrating with AWS Identity and Access Management (IAM) for access control. Clients can use IAM Roles that clearly specify cluster access boundaries. They also provide topic-level authorization to read and write data to a cluster with Kafka APIs.
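To illustrate topic-level authorization, the sketch below generates an IAM policy document that lets a client connect to one cluster and read, and optionally write, a single topic. The kafka-cluster actions and the ARN layout follow the conventions of IAM access control for Amazon MSK, but the helper function and the account ID, cluster name, cluster UUID, and topic name are placeholders for illustration.

```python
import json


def topic_policy(region, account_id, cluster_name, cluster_uuid, topic,
                 can_write=False):
    """Build an IAM policy granting access to one topic on one cluster."""
    base = f"arn:aws:kafka:{region}:{account_id}"
    topic_actions = ["kafka-cluster:DescribeTopic", "kafka-cluster:ReadData"]
    if can_write:
        topic_actions.append("kafka-cluster:WriteData")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": f"{base}:cluster/{cluster_name}/{cluster_uuid}",
            },
            {
                "Effect": "Allow",
                "Action": topic_actions,
                "Resource": f"{base}:topic/{cluster_name}/{cluster_uuid}/{topic}",
            },
        ],
    }


# Placeholder identifiers, not real resources.
policy = topic_policy("us-east-1", "123456789012", "demo-cluster",
                      "example-uuid", "orders", can_write=True)
print(json.dumps(policy, indent=2))
```

Consumers that use consumer groups also need group-level permissions, such as kafka-cluster:DescribeGroup and kafka-cluster:AlterGroup on the corresponding group resource, which this sketch omits for brevity.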

Finally, clients can use MSK APIs to directly manage Kafka cluster configurations and Kafka topics, including creating new topics, updating topic configurations and partition counts, and deleting topics. Configurations and topics can be managed with the AWS Console, AWS CLI, and AWS SDK. For more information, refer to Amazon MSK simplifies Kafka topic management with new APIs and console integration.

You can focus only on your existing enterprise standards for IAM access controls, and your existing AWS CloudFormation and AWS CDK automation to manage your cluster with Infrastructure as Code (IaC). This integration reduces the operational overhead of cluster management and accelerates your time to production by leveraging existing security infrastructure.

MSK also supports using SASL/SCRAM and mutual TLS authentication modes alongside IAM access control. This gives you the flexibility to authorize applications outside of AWS. You can also provide access to legacy applications without the need for code changes.

For more information, see IAM access control for Amazon MSK and Security in Amazon MSK.

Building highly available Express brokers

With security simplified through IAM integration, high availability is the final piece of the operational puzzle.

Many of the same considerations we discussed in scaling Express cluster compute align with high availability considerations for MSK Express brokers.

Based on internal testing, MSK Express broker storage improvements enable faster recovery when broker nodes fail—90% faster than standard brokers. The new node can simply start up with almost no disruption to the rest of the cluster without needing to perform significant rebalancing. This contrasts with standard Kafka clusters, where the cluster needs to rebalance partitions to new nodes after recovery.

In addition to these improvements, MSK Express brokers are highly available by default. The service manages critical cluster and topic configurations for high availability and performance on your behalf. This eliminates the need for managing most cluster configurations.

Express fully manages configurations like min.insync.replicas, num.io.threads, and others described in Express brokers’ read-only configurations. This gives you a highly available and performant cluster out of the box.

You no longer need to worry about most cluster-level configurations of an Apache Kafka cluster. You can simply:

Start an MSK Express cluster
Configure topics and retention
Proceed without the fine tuning normally needed to ensure a highly available cluster

Conclusion

In this post, we showed how MSK Express brokers simplify cluster operations for Apache Kafka clusters. They lower the Total Cost of Ownership (TCO) of running an Apache Kafka cluster by simplifying sizing, storage management, compute management, high availability, and access control, while providing high performance, reliability, and cost-efficiency. These simplifications reduce the specialized expertise needed for cluster administration and accelerate your deployment timeline.

With this in mind, we recommend MSK Express brokers for almost all MSK workloads. If you are starting out with a new Kafka cluster or optimizing an existing one, MSK Express brokers provide a strong combination of simplicity, performance, and cost-efficiency.

Ready to simplify your Kafka operations? Get started using Amazon MSK to create your first Express cluster today. You can provision a fully managed, highly available Kafka cluster in minutes and start experiencing the operational benefits immediately. For pricing details, see Amazon MSK pricing.

For comprehensive information about Amazon MSK capabilities and features, visit the Amazon MSK product page and the Amazon MSK Developer Guide.

AWS Big Data Blog
