AWS Big Data Blog
Enhance operational insights for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an event streaming platform that you can use to build asynchronous applications by decoupling producers and consumers. Monitoring Amazon MSK metrics is critical for the efficient operation of production workloads. Amazon MSK gathers Apache Kafka metrics and sends them to Amazon CloudWatch, where you can view them. You can also monitor Amazon MSK with Prometheus, an open-source monitoring application. Many of our customers use open-source monitoring tools such as Prometheus and Grafana, but running them in a self-managed environment comes with its own challenges regarding manageability, availability, and security.
In this post, we show how you can build an AWS Cloud native monitoring platform for Amazon MSK using the fully managed, highly available, scalable, and secure services Amazon Managed Service for Prometheus and Amazon Managed Grafana for better operational insights.
Why is Kafka monitoring critical?
Amazon MSK clusters are a critical component of the IT infrastructure, so it is necessary to track their operations and efficiency. Amazon MSK metrics help you monitor critical tasks while operating applications. You can not only troubleshoot problems that have already occurred, but also discover anomalous behavior patterns and prevent problems from occurring in the first place.
Some customers currently use various third-party monitoring solutions like lenses.io, AppDynamics, Splunk, and others to monitor Amazon MSK operational metrics. In the context of cloud computing, customers are looking for an AWS Cloud native service that offers equivalent or better capabilities but with the added advantage of being highly scalable, available, secure, and fully managed.
Amazon MSK clusters emit a very large number of metrics via JMX, many of which can be useful for tuning the performance of your cluster, producers, and consumers. However, that large volume brings complexity with monitoring. By default, Amazon MSK clusters come with CloudWatch monitoring of your essential metrics. You can extend your monitoring capabilities by using open-source monitoring with Prometheus. This feature enables you to scrape a Prometheus friendly API to gather all the JMX metrics and work with the data in Prometheus.
This solution provides a simple observability platform for Amazon MSK, along with much-needed insights into various critical operational metrics that yield the following organizational benefits for your IT operations or application teams:
- You can quickly drill down to various Amazon MSK components (broker level, topic level, or cluster level) and identify issues that need investigation
- You can investigate Amazon MSK issues after the event using the historical data in Amazon Managed Service for Prometheus
- You can shorten or eliminate long calls that waste time questioning business users on Amazon MSK issues
In this post, we set up Amazon Managed Service for Prometheus, Amazon Managed Grafana, and a Prometheus server running as a container on Amazon Elastic Compute Cloud (Amazon EC2) to provide a fully managed monitoring solution for Amazon MSK.
The solution provides an easy-to-configure dashboard in Amazon Managed Grafana for various critical operation metrics, as demonstrated in the following video.
Solution overview
Amazon Managed Service for Prometheus reduces the heavy lifting required to get started with monitoring applications across Amazon MSK, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Fargate, as well as self-managed Kubernetes clusters. The service also seamlessly integrates with Amazon Managed Grafana to simplify data visualization, team management authentication, and authorization.
Grafana empowers you to create dashboards and alerts from multiple sources such as an Amazon Managed Prometheus workspace, CloudWatch, AWS X-Ray, Amazon OpenSearch Service, Amazon Redshift, and Amazon Timestream.
The following diagram demonstrates the solution architecture. This solution deploys a Prometheus server running as a container within Amazon EC2, which constantly scrapes metrics from the MSK brokers and remote writes them to an Amazon Managed Service for Prometheus workspace. As of this writing, Amazon Managed Service for Prometheus is not able to scrape the metrics directly, so a Prometheus server is necessary to do so. We use Amazon Managed Grafana to query and visualize the operational metrics for the Amazon MSK platform.
The following are the high-level steps to deploy the solution:
- Create an EC2 key pair.
- Configure your Amazon MSK cluster and associated resources. We demonstrate how to configure an existing Amazon MSK cluster or create a new one.
- Option A: Modify an existing Amazon MSK cluster
- Option B: Create a new Amazon MSK cluster
- Enable AWS IAM Identity Center (successor to AWS Single Sign-On), if not enabled.
- Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus.
- Configure Prometheus and start the service.
- Configure the data sources in Amazon Managed Grafana.
- Import the Grafana dashboard.
Prerequisites
- Clone the GitHub repository to download the AWS CloudFormation template files:
- You download three CloudFormation template files along with the Prometheus configuration file (prometheus.yml), the targets.json file (you need this to update the MSK broker DNS later on), and three JSON files for creating a dashboard within Amazon Managed Grafana.
- Make sure the Prometheus server has internet access so that it can download the Prometheus Docker image.
1. Create an EC2 key pair
To create your EC2 key pair, complete the following steps:
- On the Amazon EC2 console, under Network & Security in the navigation pane, choose Key Pairs.
- Choose Create key pair.
- For Name, enter DemoMSKKeyPair.
- For Key pair type, select RSA.
- For Private key file format, choose the format in which to save the private key:
- To save the private key in a format that can be used with OpenSSH, select .pem.
- To save the private key in a format that can be used with PuTTY, select .ppk.
The private key file is automatically downloaded by your browser. The base file name is the name that you specified as the name of your key pair, and the file name extension is determined by the file format that you chose.
- Save the private key file in a safe place.
2. Configure your Amazon MSK cluster and associated resources
Use the following options to configure an existing Amazon MSK cluster or create a new one.
2.a Modify an existing Amazon MSK cluster
If you want to create a new Amazon MSK cluster for this solution, skip to section 2.b Create a new Amazon MSK cluster. Otherwise, complete the steps in this section to modify an existing cluster.
Validate cluster monitoring settings
We must enable enhanced partition-level monitoring (available at an additional cost) and open monitoring with Prometheus. Note that open monitoring with Prometheus is only available for provisioned mode clusters.
- Sign in to the account that contains the Amazon MSK cluster you want to monitor.
- Open your Amazon MSK cluster.
- On the Properties tab, navigate to Monitoring metrics.
- Check the monitoring level for Amazon CloudWatch metrics for this cluster, and choose Edit to edit the cluster.
- Select Enhanced partition-level monitoring.
- Check the monitoring level for Open monitoring with Prometheus, and choose Edit to edit the cluster.
- Select Enable open monitoring with Prometheus.
- Under Prometheus exporters, select JMX Exporter and Node Exporter.
- Under Broker log delivery, select Deliver to Amazon CloudWatch Logs.
- For Log group, enter your log group for Amazon MSK.
- Choose Save changes.
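The same monitoring settings can also be expressed as a request to the Amazon MSK UpdateMonitoring API. The following is a minimal sketch, not part of the original solution: the helper name `build_update_monitoring_request` and the sample ARN, version, and log group values are hypothetical, and it assumes you would pass the result to the boto3 `kafka` client, which is not installed or called here.

```python
def build_update_monitoring_request(cluster_arn, current_version, log_group):
    """Build keyword arguments for the Amazon MSK UpdateMonitoring API.

    Mirrors the console choices above: partition-level enhanced monitoring,
    both Prometheus exporters, and broker logs to CloudWatch Logs.
    """
    return {
        "ClusterArn": cluster_arn,
        # The cluster's current version string, as returned by DescribeCluster.
        "CurrentVersion": current_version,
        "EnhancedMonitoring": "PER_TOPIC_PER_PARTITION",
        "OpenMonitoring": {
            "Prometheus": {
                "JmxExporter": {"EnabledInBroker": True},
                "NodeExporter": {"EnabledInBroker": True},
            }
        },
        "LoggingInfo": {
            "BrokerLogs": {
                "CloudWatchLogs": {"Enabled": True, "LogGroup": log_group}
            }
        },
    }


# Hypothetical values for illustration only.
request = build_update_monitoring_request(
    "arn:aws:kafka:us-east-1:111122223333:cluster/demo/EXAMPLE",
    "K3AEGXETSR30VB",
    "/msk/demo-cluster",
)
print(request["EnhancedMonitoring"])

# With boto3 installed and credentials configured (an assumption), the
# request could be sent like this:
#   import boto3
#   boto3.client("kafka").update_monitoring(**request)
```

Keeping the request as a plain dictionary makes it easy to review the settings before anything is changed on the cluster.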
Deploy CloudFormation stack
Now we deploy the CloudFormation stack Prometheus_Cloudformation.yml that we downloaded earlier.
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose Create stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template.
- Upload the Prometheus_Cloudformation.yml file, then choose Next.
- For Stack name, enter Prometheus.
- VPCID – Provide the VPC ID where your Amazon MSK cluster is deployed (mandatory)
- VPCCIdr – Provide the VPC CIDR where your Amazon MSK Cluster is deployed (mandatory)
- SubnetID – Provide any one of the subnets ID where your existing Amazon MSK cluster is deployed (mandatory)
- MSKClusterName – Provide the name of your existing Amazon MSK cluster
- Leave Cloud9InstanceType, KeyName, and LatestAmild as default.
- Choose Next.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESS. Wait until the status changes to CREATE_COMPLETE.
- On the stack’s Outputs tab, note the values for the following keys (if you don’t see anything under Outputs tab, click on refresh icon):
PrometheusInstancePrivateIP
PrometheusSecurityGroupId
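If you prefer to read the stack outputs programmatically rather than from the console, the CloudFormation DescribeStacks response carries them as a list of `OutputKey`/`OutputValue` pairs. The following sketch flattens that list into a dictionary. The helper name `summarize_stack` and the sample IP address and security group ID are made up for illustration; in practice the response would come from a call such as boto3's `describe_stacks`, which is assumed and not executed here.

```python
def summarize_stack(describe_stacks_response):
    """Return (status, outputs) for the first stack in a DescribeStacks response."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        raise ValueError("response contains no stacks")
    stack = stacks[0]
    # A stack that is still being created may not have an Outputs list yet.
    outputs = {
        item["OutputKey"]: item["OutputValue"]
        for item in stack.get("Outputs", [])
    }
    return stack["StackStatus"], outputs


# Sample response with hypothetical values, shaped like the API output.
sample_response = {
    "Stacks": [
        {
            "StackName": "Prometheus",
            "StackStatus": "CREATE_COMPLETE",
            "Outputs": [
                {"OutputKey": "PrometheusInstancePrivateIP", "OutputValue": "10.0.1.25"},
                {"OutputKey": "PrometheusSecurityGroupId", "OutputValue": "sg-0123456789abcdef0"},
            ],
        }
    ]
}

status, outputs = summarize_stack(sample_response)
print(status, outputs["PrometheusInstancePrivateIP"], outputs["PrometheusSecurityGroupId"])
```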
Update the Amazon MSK cluster security group
Complete the following steps to update the security group of the existing Amazon MSK cluster to allow communication from the Kafka client and Prometheus server:
- On the Amazon MSK console, navigate to your Amazon MSK cluster.
- On the Properties tab, under Network settings, open the security group.
- Choose Edit inbound rules.
- Choose Add rule and create your rule with the following parameters:
- Type – Custom TCP
- Port range – 11001–11002
- Source – The Prometheus server security group ID
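The inbound rule above can be sketched as the `IpPermissions` entry that the Amazon EC2 AuthorizeSecurityGroupIngress API expects. This is an illustration under assumptions: the helper name `prometheus_ingress_rule` and the sample security group ID are hypothetical, and the boto3 call in the trailing comment is not executed.

```python
def prometheus_ingress_rule(prometheus_sg_id):
    """Build one IpPermissions entry that opens the open-monitoring ports.

    Ports 11001-11002 match the rule in the steps above; the source is the
    Prometheus server's security group rather than a CIDR range.
    """
    return {
        "IpProtocol": "tcp",
        "FromPort": 11001,
        "ToPort": 11002,
        "UserIdGroupPairs": [
            {
                "GroupId": prometheus_sg_id,
                "Description": "Prometheus server scraping MSK exporters",
            }
        ],
    }


# Hypothetical ID; use the PrometheusSecurityGroupId stack output instead.
rule = prometheus_ingress_rule("sg-0123456789abcdef0")
print(rule["FromPort"], rule["ToPort"])

# With boto3 installed and credentials configured (an assumption), the rule
# could be applied to the MSK cluster's security group like this:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="<MSK cluster security group ID>", IpPermissions=[rule]
#   )
```

Referencing the Prometheus security group as the source keeps the rule valid even if the Prometheus instance's private IP changes.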
Set up your AWS Cloud9 environment
To configure your AWS Cloud9 environment, complete the following steps:
- On the AWS Cloud9 console, choose Environments in the navigation pane.
- Select Cloud9EC2Bastion and choose Open in Cloud9.
- Close the Welcome tab and open a new terminal tab.
- Create an SSH key file with the contents from the private key file DemoMSKKeyPair using the following command:
- Run the following command to list the newly created key file:
- Open the file, enter the contents of the private key file DemoMSKKeyPair, then save the file.
- Change the permissions of the file using the following command:
- Log in to the Prometheus server using this key file and the private IP noted earlier:
- Once you’re logged in, check if the Docker service is up and running using the following command:
- To exit the server, enter exit and press Enter.
2.b Create a new Amazon MSK cluster
If you don’t have an Amazon MSK cluster running in your environment, or you don’t want to use an existing cluster for this solution, complete the steps in this section.
As part of these steps, your cluster will have the following properties:
- An AWS Identity and Access Management (IAM) role used to control who can perform Amazon MSK operations on your cluster
- TLS encryption between the client and brokers
- TLS encryption within the cluster
- An AWS Key Management Service (AWS KMS) managed key for encryption at rest
Deploy CloudFormation stack
Complete the following steps to deploy the CloudFormation stack MSKResource_Cloudformation.yml:
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose Create stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template.
- Upload the MSKResource_Cloudformation.yml file, then choose Next.
- For Stack name, enter MSKDemo.
- Network Configuration – Generic (mandatory)
- Stack to be deployed in NEW VPC? (true/false) – if false, you MUST provide VPCCidr and other details under Existing VPC section (Default is true)
- VPCCidr – Default is 10.0.0.0/16 for a new VPC. You can use any valid value for your environment. If deploying in an existing VPC, provide the CIDR of that VPC
- Network Configuration – For New VPC
- PrivateSubnetMSKOneCidr (Default is 10.0.1.0/24)
- PrivateSubnetMSKTwoCidr (Default is 10.0.2.0/24)
- PrivateSubnetMSKThreeCidr (Default is 10.0.3.0/24)
- PublicOneCidr (Default is 10.0.0.0/24)
- Network Configuration – For Existing VPC (You need at least 4 subnets)
- VpcId – Provide the value if you are using an existing VPC to deploy the resources; otherwise, leave it blank (default)
- SubnetID1 – Any one of the existing subnets from the given VPCID
- SubnetID2 – Any one of the existing subnets from the given VPCID
- SubnetID3 – Any one of the existing subnets from the given VPCID
- PublicSubnetID – Any one of the existing subnets from the given VPCID
- Leave the remaining parameters as default and choose Next.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESS. Wait until the status changes to CREATE_COMPLETE.
- On the stack’s Outputs tab, note the values for the following (if you don’t see anything under Outputs tab, click on refresh icon):
KafkaClientPrivateIP
PrometheusInstancePrivateIP
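If you change the default CIDR values listed for the new VPC, it can help to confirm before deploying that every subnet still falls inside the VPC range and that no two subnets overlap. The following sketch uses Python's standard `ipaddress` module with the default values from the parameter list above. The helper name `check_subnets` is hypothetical, and the check is an illustration rather than something the CloudFormation template performs.

```python
import ipaddress
from itertools import combinations


def check_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of human-readable problems (empty if the layout is valid)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet must sit inside the VPC range.
    for name, net in subnets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")
    # No two subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(subnets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


# Default values from the stack parameters listed above.
DEFAULT_SUBNETS = {
    "PrivateSubnetMSKOneCidr": "10.0.1.0/24",
    "PrivateSubnetMSKTwoCidr": "10.0.2.0/24",
    "PrivateSubnetMSKThreeCidr": "10.0.3.0/24",
    "PublicOneCidr": "10.0.0.0/24",
}

print(check_subnets("10.0.0.0/16", DEFAULT_SUBNETS))
```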
Set up your AWS Cloud9 environment
Follow the steps as outlined in the previous section to configure your AWS Cloud9 environment.
Retrieve the cluster broker list
To get your MSK cluster broker list, complete the following steps:
- On the Amazon MSK console, navigate to your cluster.
- In the Cluster summary section, choose View client information.
- In the Bootstrap servers section, copy the private endpoint.
You need this value to perform some operations later, such as creating an MSK topic, producing sample messages, and consuming those sample messages.
- Choose Done.
- On the Properties tab, in the Brokers details section, note the endpoints listed.
These need to be updated in the targets.json file (used for Prometheus configuration in a later step).
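To illustrate how the broker endpoints end up in targets.json, the following sketch generates a Prometheus file-based service discovery document from a list of broker DNS names. It assumes the layout shown in the Amazon MSK open monitoring documentation, with the JMX Exporter on port 11001 and the Node Exporter on port 11002; the targets.json file in the GitHub repository may use different job labels. The helper name `build_targets` and the broker names are hypothetical.

```python
import json

JMX_EXPORTER_PORT = 11001   # Kafka JMX metrics
NODE_EXPORTER_PORT = 11002  # host-level CPU, memory, and disk metrics


def build_targets(broker_endpoints):
    """Build the file_sd target groups for the given broker DNS names."""
    hosts = [endpoint.strip() for endpoint in broker_endpoints if endpoint.strip()]
    return [
        {
            "labels": {"job": "jmx"},
            "targets": [f"{host}:{JMX_EXPORTER_PORT}" for host in hosts],
        },
        {
            "labels": {"job": "node"},
            "targets": [f"{host}:{NODE_EXPORTER_PORT}" for host in hosts],
        },
    ]


# Hypothetical broker DNS names; use the endpoints noted from the console.
brokers = [
    "b-1.example.kafka.us-east-1.amazonaws.com",
    "b-2.example.kafka.us-east-1.amazonaws.com",
    "b-3.example.kafka.us-east-1.amazonaws.com",
]

targets_json = json.dumps(build_targets(brokers), indent=2)
print(targets_json)
```

Use the broker endpoints without the client port that appears in the bootstrap server string, because Prometheus connects to the exporter ports instead.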
3. Enable IAM Identity Center
Before you deploy the CloudFormation stack for Amazon Managed Service for Prometheus and Amazon Managed Grafana, make sure to enable IAM Identity Center.
If you don’t use IAM Identity Center, alternatively, you can set up user authentication via SAML. For more information, refer to Using SAML with your Amazon Managed Grafana workspace.
If IAM Identity Center is already enabled and configured in another Region, you don’t need to enable it in your current Region.
Complete the following steps to enable IAM Identity Center:
- On the IAM Identity Center console, under Enable IAM Identity Center, choose Enable.
- Choose Create AWS organization.
4. Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus
Complete the steps in this section to set up Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Deploy CloudFormation template
Complete the following steps to deploy the CloudFormation stack AMG_AMP_Cloudformation.yml:
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose Create stack.
- For Prepare template, select Template is ready.
- For Template source, select Upload a template.
- Upload the AMG_AMP_Cloudformation.yml file, then choose Next.
- For Stack name, enter ManagedPrometheusAndGrafanaStack, then choose Next.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESS. Wait until the status changes to CREATE_COMPLETE.
- On the stack’s Outputs tab, note the values for the following (if you don’t see anything under Outputs tab, click on refresh icon):
- GrafanaWorkspaceURL – This is the Amazon Managed Grafana URL
- PrometheusEndpointWriteURL – This is the Amazon Managed Service for Prometheus write endpoint URL
Create a user for Amazon Managed Grafana
Complete the following steps to create a user for Amazon Managed Grafana:
- On the IAM Identity Center console, choose Users in the navigation pane.
- Choose Add user.
- For Username, enter grafana-admin.
- Enter and confirm your email address to receive a confirmation email.
- Skip the optional steps, then choose Add user.
A success message appears at the top of the console.
- In the confirmation email, choose Accept invitation and set your user password.
- On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
- Open the workspace Amazon-Managed-Grafana.
- Make a note of the Grafana workspace URL.
You use this URL to log in to view your Grafana dashboards.
- On the Authentication tab, choose Assign new user or group.
- Select the user you created earlier and choose Assign users and groups.
- On the Action menu, choose what kind of user to make it: admin, editor, or viewer.
Note that your Grafana workspace needs at least one admin user.
- Navigate to the Grafana URL you copied earlier in your browser.
- Choose Sign in with AWS IAM Identity Center.
- Log in with your IAM Identity Center credentials.
5. Configure Prometheus and start the service
When you cloned the GitHub repo, you downloaded two configuration files: prometheus.yml and targets.json. In this section, we configure these two files.
- Use any IDE (Visual Studio Code or Notepad++) to open prometheus.yml.
- In the remote_write section, update the remote write URL and Region.
- Use any IDE to open targets.json.
- Update the targets with the broker endpoints you obtained earlier.
- In your AWS Cloud9 environment, choose File, then Upload Local Files.
- Choose Select Files and upload targets.json and prometheus.yml from your local machine.
- In the AWS Cloud9 environment, run the following commands using the key file you created earlier:
- Copy targets.json to the Prometheus server:
- Copy prometheus.yml to the Prometheus server:
- SSH into the Prometheus server and start the container service for Prometheus:
- Start the Prometheus container:
- Check if the Docker service is running:
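For the remote_write update described earlier in this section, the following sketch shows roughly what the edited block of prometheus.yml looks like and guards against a common mistake: a Region that does not match the workspace URL. It assumes the standard Prometheus `remote_write` settings with SigV4 signing and the usual `aps-workspaces.<Region>.amazonaws.com` endpoint format. The prometheus.yml in the GitHub repository may contain additional settings, and the helper names and workspace ID are hypothetical.

```python
from urllib.parse import urlparse


def region_from_write_url(write_url):
    """Extract the Region from an Amazon Managed Service for Prometheus URL."""
    host = urlparse(write_url).hostname or ""
    parts = host.split(".")
    # Expected host shape: aps-workspaces.<region>.amazonaws.com
    if len(parts) < 4 or parts[0] != "aps-workspaces":
        raise ValueError(f"unexpected remote write host: {host!r}")
    return parts[1]


def render_remote_write(write_url, region):
    """Render a remote_write block, refusing mismatched URL and Region."""
    url_region = region_from_write_url(write_url)
    if url_region != region:
        raise ValueError(f"Region {region!r} does not match URL Region {url_region!r}")
    return (
        "remote_write:\n"
        f"  - url: {write_url}\n"
        "    sigv4:\n"
        f"      region: {region}\n"
    )


# Hypothetical workspace ID; use the PrometheusEndpointWriteURL stack output.
write_url = (
    "https://aps-workspaces.us-east-1.amazonaws.com"
    "/workspaces/ws-EXAMPLE/api/v1/remote_write"
)
print(render_remote_write(write_url, "us-east-1"))
```

SigV4 signing relies on AWS credentials being available to the Prometheus server, for example through the IAM role attached to its EC2 instance.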
6. Configure data sources in Amazon Managed Grafana
To configure your data sources, complete the following steps:
- Log in to the Amazon Managed Grafana URL.
- Choose AWS Data Services in the navigation pane, then choose Data Sources.
- For Service, choose Amazon Managed Service for Prometheus.
- For Region, choose your Region.
The correct resource ID is populated automatically.
- Select your resource ID and choose Add 1 data source.
- Choose Go to settings.
- For Name, enter Amazon Managed Prometheus and enable Default.
The URL is automatically populated.
- Leave everything else as default.
- Choose Save & Test.
If everything is correct, the message Data source is working appears.
Now we configure CloudWatch as a data source.
- Choose AWS Data Services, then choose Data Sources.
- For Service, choose CloudWatch.
- For Region, choose your correct Region.
- Choose Add data source.
- Select the CloudWatch data source and choose Go to settings.
- For Name, enter AmazonMSK-CloudWatch.
- Choose Save & Test.
7. Import the Grafana dashboard
You can use the following preconfigured dashboards, which are available to download from the GitHub repo:
- Kafka Metrics
- MSK Cluster Overview
- AWS MSK – Kafka Cluster-CloudWatch
To import your dashboard, complete the following steps:
- In Amazon Managed Grafana, choose the plus sign in the navigation pane.
- Choose Import.
- Choose Upload JSON file.
- Choose the dashboard you downloaded.
- Choose Load.
The following screenshot shows your loaded dashboard.
Generate sample data in Amazon MSK (Optional – when you create a new Amazon MSK Cluster)
To generate sample data in Amazon MSK, complete the following steps:
- In your AWS Cloud9 environment, log in to the Kafka client.
- Set the broker endpoint variable
- Run the following command to create a topic called TLSTestTopic60:
- Still logged in to the Kafka client, run the following command to start the producer service:
- Open a new terminal from within your AWS Cloud9 environment and log in to the Kafka client instance
- Set the broker endpoint variable
- Now you can start the consumer service and see the incoming messages
- Press CTRL+C to stop the producer/consumer service.
Kafka metrics dashboards on Amazon Managed Grafana
You can now view your Kafka metrics dashboards on Amazon Managed Grafana:
- Cluster overall health – Configured using Amazon Managed Service for Prometheus as the data source:
- Critical metrics
- Amazon MSK cluster overview – Configured using Amazon Managed Service for Prometheus as the data source:
- Critical metrics
- Cluster throughput (broker-level metrics)
- Cluster metrics (JVM)
- Kafka cluster operation metrics – Configured using CloudWatch as the data source:
- General overall stats
- CPU and Memory metrics
Clean up
You will continue to incur costs until you delete the infrastructure that you created for this post. Delete the CloudFormation stacks you used to create the respective resources.
If you used an existing cluster, make sure to remove the inbound rules you updated in the security group (otherwise the stack deletion will fail).
- On the Amazon MSK console, navigate to your existing cluster.
- On the Properties tab, in the Networking settings section, open the security group you applied.
- Choose Edit inbound rules.
- Choose Delete to remove the rules you added.
- Choose Save rules.
Now you can delete your CloudFormation stacks.
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Select ManagedPrometheusAndGrafanaStack and choose Delete.
- If you used an existing Amazon MSK cluster, delete the stack Prometheus.
- If you created a new Amazon MSK cluster, delete the stack MSKDemo.
Conclusion
This post showed how you can deploy a fully managed, highly available, scalable, and secure monitoring system for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana, and use Grafana dashboards to gain deep insights into various operational metrics. Although this post only discussed using Amazon Managed Service for Prometheus and CloudWatch as the data sources in Amazon Managed Grafana, you can enable various other data sources, such as AWS IoT SiteWise, AWS X-Ray, Amazon Redshift, and Amazon Athena, and build a dashboard on top of those metrics. You can use these managed services to monitor any number of Amazon MSK clusters. Metrics are available to query in Amazon Managed Grafana or Amazon Managed Service for Prometheus in near-real time.
You can use this post as prescriptive guidance to deploy an observability solution for a new or an existing Amazon MSK cluster, identify the metrics that are important for your applications, and then create a dashboard using Amazon Managed Grafana and Prometheus.
About the Authors
Anand Mandilwar is an Enterprise Solutions Architect at AWS. He works with enterprise customers, helping them innovate and transform their business on AWS. He is passionate about automation around cloud operations, infrastructure provisioning, and cloud optimization. He also likes Python programming. In his spare time, he enjoys honing his photography skills, especially in portrait and landscape photography.
Ajit Puthiyavettle is a Solutions Architect working with enterprise clients, architecting solutions to achieve business outcomes. He is passionate about solving customer challenges with innovative solutions. His experience is in leading DevOps and security teams for enterprise and SaaS (Software as a Service) companies. Recently, he has focused on helping customers with security, ML, and HCLS workloads.