Amazon MSK is a new AWS streaming data service that manages Apache Kafka infrastructure and operations, making it easy for developers and DevOps managers to run Apache Kafka applications on AWS without the need to become experts in operating Apache Kafka clusters. Amazon MSK is an ideal place to run existing or new Apache Kafka applications in AWS. Amazon MSK operates and maintains Apache Kafka clusters, provides enterprise-grade security features out of the box, and has built-in AWS integrations that accelerate development of streaming data applications. To get started, you can migrate existing Apache Kafka workloads into Amazon MSK, or with a few clicks, you can build new ones from scratch in minutes. There are no data transfer charges for in-cluster traffic, and no commitments or upfront payments required. You only pay for the resources that you use.
Apache Kafka is an open-source, high-performance, fault-tolerant, and scalable platform for building real-time streaming data pipelines and applications. Apache Kafka is a streaming data store that decouples the applications producing streaming data (producers) from the applications consuming that data (consumers). Organizations use Apache Kafka as a data source for applications that continuously analyze and react to streaming data. Learn more about Apache Kafka.
Streaming data is a continuous stream of small records or events (a record or event is typically a few kilobytes) generated by thousands of machines, devices, websites, and applications. Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors, geospatial services, and telemetry from connected devices or instrumentation in data centers. Streaming data services like Amazon MSK and Amazon Kinesis Data Streams make it easy for you to continuously collect, process, and deliver streaming data. Learn more about streaming data.
- Apache Kafka stores streaming data in a fault-tolerant way as a continuous series of records and preserves the order in which the records were produced.
- Apache Kafka acts as a buffer between data producers and data consumers. Apache Kafka allows many data producers (e.g. websites, IoT devices, Amazon EC2 instances) to continuously publish streaming data and categorize this data using Apache Kafka topics. Multiple data consumers (e.g. machine learning applications, Lambda functions) read from these topics at their own rate, similar to a message queue or enterprise messaging system.
- Data consumers process data from Apache Kafka topics on a first-in-first-out basis, preserving the order data was produced.
Apache Kafka is used to support real-time applications that transform, deliver, and react to streaming data, and for building real-time streaming data pipelines that reliably get data between multiple systems or applications.
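The buffer-and-replay behavior described above can be sketched with a toy in-memory model (an illustration of the concept only, not the Apache Kafka client API): producers only ever append to a topic's log, and each consumer tracks its own read offset, so every consumer sees records in first-in-first-out order at its own rate.

```python
from collections import defaultdict

class ToyTopicLog:
    """Toy in-memory stand-in for a single-partition Kafka topic:
    an append-only record log plus per-consumer read offsets."""
    def __init__(self):
        self.records = []                 # append-only log, order preserved
        self.offsets = defaultdict(int)   # each consumer tracks its own position

    def produce(self, record):
        self.records.append(record)       # producers only ever append

    def consume(self, consumer_id, max_records=10):
        start = self.offsets[consumer_id]
        batch = self.records[start:start + max_records]
        self.offsets[consumer_id] += len(batch)  # advance this consumer only
        return batch

topic = ToyTopicLog()
for event in ["click", "purchase", "click"]:
    topic.produce(event)

# Two consumers read the same topic independently, at their own rate.
fast = topic.consume("ml-app")                     # reads all three records
slow = topic.consume("lambda-fn", max_records=1)   # reads only the first
```

Because consuming a record only advances that consumer's offset, a slow consumer never blocks a fast one, and neither removes data from the log; that is what lets Kafka act as a buffer between many producers and many consumers.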
Amazon MSK makes it easy to get started and run open-source versions of Apache Kafka in AWS with high availability and security while providing integration with AWS services without the operational overhead of running an Apache Kafka cluster. Amazon MSK allows you to use and configure open-source versions of Apache Kafka while the service manages the setup, provisioning, AWS integrations, and on-going maintenance of Apache Kafka clusters.
For supported Kafka versions, see the Amazon MSK documentation.
Q: Are the Apache Kafka APIs supported by Amazon MSK?
Yes, all data plane and admin APIs are natively supported by Amazon MSK.
Data production and consumption
Migrating to Amazon MSK
Each cluster contains broker instances, provisioned storage, and Apache ZooKeeper nodes.
Some resources, like elastic network interfaces (ENIs), will show up in your Amazon EC2 account. Other Amazon MSK resources will not show up in your EC2 account as these are managed by the Amazon MSK service.
Q: What do I need to provision within an Amazon MSK cluster?
Q: Can I change the default broker configurations or upload a cluster configuration to Amazon MSK?
Yes, Amazon MSK allows you to create custom configurations and apply them to new and existing clusters. For more information on custom configurations, see the configuration documentation.
Q: What configuration properties am I able to customize?
The configuration properties that you can customize are documented here.
Q: What is the default configuration of a new topic?
Amazon MSK uses Apache Kafka’s default configuration unless otherwise specified here.
Connecting to the VPC
Q: How do I connect to my Amazon MSK cluster from outside the VPC?
There are several ways to connect to your Amazon MSK cluster from outside your VPC:
- VPN: https://docs.aws.amazon.com/vpc/latest/userguide/vpn-connections.html
- VPC Peering: https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html
- VPC Transit Gateway: https://docs.aws.amazon.com/vpc/latest/tgw/what-is-transit-gateway.html
- AWS Direct Connect: https://aws.amazon.com/directconnect/
- REST Proxy: A REST proxy can be installed on an instance running within your VPC. REST proxies allow your producers and consumers to communicate with the cluster through HTTP API requests.
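As an illustration of the REST proxy option, a produce request might look like the following (this assumes the Confluent REST Proxy v2 API; the hostname, port, and topic name are placeholders):

```shell
# Produce a JSON record to the "clickstream" topic via a REST proxy
# running inside the VPC (host and topic are hypothetical).
curl -X POST \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"value":{"event":"click"}}]}' \
  http://rest-proxy.internal:8082/topics/clickstream
```

The proxy translates the HTTP request into a native Kafka produce call, so clients outside the VPC never need direct broker connectivity.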
Q: Is data encrypted in-transit between brokers within an Amazon MSK cluster?
Yes, by default, new clusters have in-transit encryption enabled via TLS for inter-broker communication. You can opt out of in-transit encryption when a cluster is created.
Q: Is data encrypted in-transit between my Apache Kafka clients and the Amazon MSK service?
Yes, by default, in-transit encryption is set to TLS only for clusters created from the CLI or AWS Console. Additional configuration is required for clients to communicate with clusters using TLS encryption. You can change the default encryption setting by selecting the TLS/plaintext or plaintext setting. Read more: MSK Encryption
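The client-side configuration typically amounts to pointing the Kafka client at a truststore containing the certificate authorities that signed the brokers' certificates. A minimal sketch for a Java-based client (the truststore path and password below are placeholders):

```properties
# client.properties -- connect to the cluster's TLS bootstrap endpoint
security.protocol=SSL
ssl.truststore.location=/tmp/kafka.client.truststore.jks
ssl.truststore.password=changeit
```

Pass this file to your client (for example, `--command-config client.properties` with the Kafka CLI tools) when connecting to the TLS port.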
Q: Is data encrypted in-transit as it moves between brokers and Apache ZooKeeper nodes in an Amazon MSK cluster?
Yes, Amazon MSK clusters running Apache Kafka version 2.5.1 or greater support TLS in-transit encryption between Kafka brokers and ZooKeeper nodes.
Authentication and authorization
Q: How can I restrict the scope of connectivity to an Amazon MSK cluster across multiple clients in my VPC?
Amazon MSK supports TLS certificate authentication and SASL/SCRAM authentication. You can use either method to authenticate client connections to an Amazon MSK cluster. With TLS certificate authentication, you attach private certificate authorities (CAs) from the AWS Certificate Manager service to an MSK cluster. When TLS client authentication is enabled, only clients presenting TLS certificates issued by those private CAs can authenticate with the cluster. With SASL/SCRAM authentication, you create and store credentials in AWS Secrets Manager. When SASL/SCRAM authentication is enabled, only clients presenting the correct username and password stored in AWS Secrets Manager can authenticate with the cluster.
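On the client side, SASL/SCRAM is configured with standard Apache Kafka properties. A minimal sketch (the username and password are placeholders for the credential stored in AWS Secrets Manager):

```properties
# client.properties -- SASL/SCRAM over TLS
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="alice" password="alice-secret";
```

SCRAM credentials travel over the TLS connection (`SASL_SSL`), so the password is never sent in the clear.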
Q: How does authorization work in Amazon MSK?
Apache Kafka uses access control lists (ACLs) for authorization and Amazon MSK supports the use of ACLs. To enable ACLs you must enable client authentication using either TLS certificates or SASL/SCRAM.
Q: How can I authenticate and authorize a client at the same time?
If you are using TLS authentication, you can use the Dname (distinguished name) of a client's TLS certificate as the principal of the ACL to authorize that client's requests. If you are using SASL/SCRAM, you can use the username as the principal of the ACL to authorize client requests.
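For example, an ACL granting produce access to a TLS-authenticated client can be added with Apache Kafka's `kafka-acls.sh` tool (the broker address, Dname, and topic below are placeholders):

```shell
# Allow the client whose certificate Dname matches the principal
# to write to the "orders" topic (all names are hypothetical).
bin/kafka-acls.sh --bootstrap-server b-1.mycluster.example.com:9094 \
  --command-config client.properties \
  --add --allow-principal "User:CN=my-client,O=Example" \
  --operation Write --topic orders
```

With SASL/SCRAM the same command applies, with the principal written as `User:alice` instead of a certificate Dname.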
Monitoring, metrics, logging, tagging
Q: Can I tag Amazon MSK clusters?
Yes, you can tag Amazon MSK clusters from the AWS CLI or Console.
Q: How do I monitor consumer lag?
Consumer lag within your Amazon MSK cluster can be monitored using consumer lag tools like LinkedIn's Burrow: https://github.com/linkedin/Burrow
Q: Can I access the broker logs for my Amazon MSK cluster?
You can enable broker log delivery for new and existing Amazon MSK clusters. You can deliver broker logs to Amazon CloudWatch Logs, Amazon S3, and Kinesis Data Firehose. Kinesis Data Firehose supports Amazon Elasticsearch Service among other destinations. To learn how to enable this feature, see the Amazon MSK Logging Documentation. To learn about pricing, refer to the CloudWatch Logs and Kinesis Data Firehose pricing pages.
Amazon MSK provides INFO level logs for all brokers within a cluster.
- Amazon VPC for network isolation and security
- Amazon CloudWatch for metrics
- Amazon KMS for storage volume encryption
- AWS Identity and Access Management (IAM) for authentication of cluster APIs
- AWS CloudTrail for AWS API logs
- AWS Certificate Manager for Private CAs used for client TLS authentication
- AWS CloudFormation for describing and provisioning Amazon MSK clusters using code
- Amazon Kinesis Data Analytics for fully managed Apache Flink applications
- AWS Secrets Manager for client credentials used for SASL/SCRAM authentication
You can create an auto-scaling policy for storage using the AWS Management Console or by creating an AWS Application Auto Scaling policy using the AWS CLI or APIs.
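A CLI sketch of the second approach is shown below. It assumes Application Auto Scaling's `kafka` service namespace and the broker-storage scalable dimension; the cluster ARN, capacity limits, and target value are placeholders, so verify the names against the current MSK documentation before use:

```shell
# 1. Register broker storage (in GiB) as a scalable target.
aws application-autoscaling register-scalable-target \
  --service-namespace kafka \
  --resource-id "arn:aws:kafka:us-east-1:123456789012:cluster/demo/abc-123" \
  --scalable-dimension "kafka:broker-storage:VolumeSize" \
  --min-capacity 1000 --max-capacity 4000

# 2. Attach a target-tracking policy that expands storage when
#    broker disk utilization exceeds the target value.
aws application-autoscaling put-scaling-policy \
  --service-namespace kafka \
  --resource-id "arn:aws:kafka:us-east-1:123456789012:cluster/demo/abc-123" \
  --scalable-dimension "kafka:broker-storage:VolumeSize" \
  --policy-name demo-storage-policy \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
      "TargetValue": 60.0,
      "PredefinedMetricSpecification": {
          "PredefinedMetricType": "KafkaBrokerStorageUtilization"
      }
  }'
```

Target tracking only scales storage up, which matches how broker volumes work: they can grow but not shrink.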
No. Scaling the instance size of brokers in an existing cluster is not currently supported by Amazon MSK, but is on our roadmap.
Pricing and availability
Q: Do I pay for data transfer as a result of data replication?
No. As noted above, there are no data transfer charges for in-cluster traffic, which includes replication traffic between brokers.
Q: What compliance programs does Amazon MSK qualify for?
Amazon MSK is:
- HIPAA eligible
- SOC 1,2,3
For a complete list of AWS services and compliance programs, please see AWS Services in Scope by Compliance Program.