AWS Big Data Blog
Amazon MSK Replicator and MirrorMaker2: Choosing the right replication strategy for Apache Kafka disaster recovery and migrations
Customers need to replicate data from their Apache Kafka clusters for a variety of reasons, such as compliance requirements, cluster migrations, and disaster recovery (DR) implementations. However, the right replication strategy can vary depending on the application context. In this post, we walk through the different considerations for using Amazon MSK Replicator over Apache Kafka’s MirrorMaker 2, and help you choose the right replication solution for your use case. We also discuss how to make applications using Amazon Managed Streaming for Apache Kafka (Amazon MSK) resilient to disasters using a multi-Region Kafka architecture using MSK Replicator.
Challenges with choosing DR strategies
Customers create business continuity plans and DR strategies to maximize resiliency for their applications, because downtime or data loss can result in losing revenue or halting operations. DR planning helps the business continue running in the event of a disaster impacting a subset of their application architecture. For customers using Kafka as a core streaming and messaging service in their applications, planning for DR for their Kafka infrastructure is an essential part of meeting goals for their application Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Amazon MSK is a fully managed service that makes it straightforward to build and run Kafka to process streaming data. Amazon MSK provides high availability by offering multi-AZ configurations to distribute brokers across multiple Availability Zones within an AWS Region. A single MSK cluster deployment provides message durability through intra-cluster data replication. Data replication with a replication factor of 3 and min-ISR value of 2 along with the producer setting acks=all provides the strongest availability guarantees, because it makes sure other brokers in the cluster acknowledge receiving the data before the leader broker responds to the producer. This design provides robust protection against single broker failure as well as single-AZ failure.
For enhanced resilience within a single Region, Amazon MSK also offers Express brokers, which significantly improve Kafka cluster reliability, throughput, recovery times. Express brokers include pay-as-you-go storage, automatic best-practice reliability configurations, no maintenance windows, and faster broker scaling and recovery times. This architecture reduces recovery time, minimizes the chance of errors with misconfigurations, and increases throughput, making your Kafka clusters more resilient across Availability Zones.
However, if an unlikely issue is impacting your applications or infrastructure across more than one Availability Zone, the architecture outlined in this post can help you prepare, respond, and recover from it.
For companies that can withstand a longer RTO but require a lower RPO on Amazon MSK, backing up data to Amazon Simple Storage Service (Amazon S3) is sufficient as a DR plan. This approach requires you to think through how to handle restarting the application after a DR failover. In this approach, you build a system to recover the data from Amazon S3 to Kafka topics (as described in Back up and restore Kafka topic data using Amazon MSK Connect). Depending on the volume of data being restored, it might take a long time to recover in this scenario. Additionally, you must consider how to handle consumer group offsets, and whether to allow applications to consume from the latest offset in the restored Kafka topics. Due to the high RTO, as well as the complexity and challenges associated with this approach, most streaming use cases rely on the availability of the MSK cluster itself for their business continuity plan. In these cases, setting up MSK clusters in multiple Regions and configuring data replication between clusters provides the required business resilience and continuity.
Choosing the right replication solution: MSK Replicator vs MirrorMaker 2
AWS recommends two primary solutions for cross-Region Kafka replication: MSK Replicator and MirrorMaker 2. Understanding when to use each solution is crucial for designing an effective DR strategy.
MSK Replicator: For most MSK cluster replications in the same account
MSK Replicator is a fully managed, serverless Kafka replication service that makes it straightforward to reliably replicate data across MSK clusters in different Regions or within the same Region. MSK Replicator is the recommended solution for application scenarios replicating data within the same AWS account. MSK Replicator has the following benefits:
- Replication between MSK clusters – It supports replicating between MSK clusters in the same AWS account (including active-active or active-passive DR architectures for Amazon MSK)
- No infrastructure management – It’s fully serverless with automatic scaling and straightforward setup through the AWS Management Console, AWS Command Line Interface (AWS CLI), or APIs
- Built-in monitoring – It’s integrated with Amazon CloudWatch metrics and logging
- Built-in high availability – As a managed service, it offers built-in fault tolerance across Availability Zones
MirrorMaker 2: For migrations and complex and hybrid scenarios
MirrorMaker 2 (MM2) remains the preferred solution for specific use cases that require more flexibility or involve non-Amazon MSK environments. MM2 is a utility bundled as part of Kafka that helps replicate data between Kafka clusters using the Kafka Connect framework.
We recommend MirrorMaker 2 for the following use cases:
- Cross-account replication – Replicating data between MSK clusters in different AWS accounts
- Migrations to Amazon MSK – Migrating from existing Kafka clusters on premises, in other clouds, or on self-managed Amazon Elastic Compute Cloud (Amazon EC2) deployments
- Cross-cloud or hybrid cloud scenarios – Replicating between Kafka running on-premises or on different cloud providers and Amazon MSK for disaster recovery or data analytics use cases
- Using mTLS or SASL/SCRAM authentication – When you need mutual TLS certificate-based or SASL/SCRAM authentication and can’t enable AWS Identity and Access Management (IAM) authentication in your MSK cluster (for replication from one MSK cluster to another in these scenarios, you can still use MSK Replicator by enabling IAM authentication in addition to existing authentication methods)
- Custom replication policies – Advanced topic naming or transformation requirements
In the following sections, we discuss the architecture and deployment approaches for use cases where MSK Replicator and MirrorMaker 2 are the appropriate choices.
MSK Replicator solution overview
The following diagram illustrates the architecture for using MSK Replicator.
We create two MSK clusters – one in the primary Region, the other in the secondary Region as a standby cluster for disaster recovery. We deploy MSK Replicator in the secondary region to replicate topics, ACLs, data, and consumer group offsets from the primary cluster. In this solution, we showcase a single-direction replication for active-passive disaster recovery. This solution can also be extended for active-active disaster recovery scenarios. Our Kafka clients connect to the primary cluster and can be configured to connect to the secondary cluster in the event of a disaster recovery failover.
For details on implementation steps, refer to Introducing Amazon MSK Replicator – Fully Managed Replication across MSK Clusters in Same or Different AWS Regions. For details on disaster recovery scenarios, refer to Use replication to increase the resiliency of a Kafka streaming application across Regions. These resources provide the following benefits:
- Full deployment steps – Step by step deployment process for MSK Replicator between regions
- Comprehensive examples – Multiple deployment scenarios and configurations
- Failover process – Key steps in executing a disaster recovery failover when using MSK Replicator
MirrorMaker2 solution overview
The following diagram illustrates the architecture for using MirrorMaker 2.
We create an MSK cluster in the primary Region, with the existing Kafka cluster on premises. This Kafka cluster is analogous to Kafka clusters running in other clouds, or in self-managed Kafka clusters on Amazon EC2. In this solution, we showcase a single-direction replication for cluster migration scenarios. Our Kafka clients interact with the on-premises Kafka cluster and can be migrated to run on AWS to interact with the MSK cluster.
Rather than manually configuring each component, we recommend using the automated deployment resources available in the following GitHub repository. For a step-by-step walkthrough of deploying MirrorMaker 2 on Amazon ECS with Fargate using auto scaling, refer to Amazon MSK Migration Workshop: Modernizing with Express Brokers. These resources provide the following benefits:
- Infrastructure as code – Terraform for MSK clusters and supporting infrastructure
- Containerized Kafka Connect – Docker images optimized for AWS
- Amazon ECS with AWS Fargate deployment – Scalable, serverless container deployment using Amazon Elastic Container Service (Amazon ECS) with AWS Fargate
- Auto scaling configuration – Automatic scaling based on workload demands
- Comprehensive examples – Multiple deployment scenarios and configurations
- Migration process – Key steps in executing a Kafka migration using MM2
Conclusion
Choosing the right replication solution depends on your specific requirements. We recommend using MSK Replicator when replicating from one MSK cluster to another and you want a fully managed solution for disaster recovery. MirrorMaker 2 is recommended for migrations to Amazon MSK, hybrid environments, or when you need complex custom replication policies.
For MSK Replicator deployments, refer to Introducing Amazon MSK Replicator – Fully Managed Replication across MSK Clusters in Same or Different AWS Regions and Use replication to increase the resiliency of a Kafka streaming application across Regions.
For MirrorMaker 2 deployments, refer to the GitHub repository and Amazon MSK Migration Workshop to implement production-ready solutions with automated deployment, monitoring, and scaling capabilities.
These approaches provide a customizable set of options for data redundancy and business continuity capabilities needed to meet regulatory compliance and disaster recovery requirements, while minimizing operational overhead through automation and best practices.