Centralized monitoring has reduced incidents and now improves alerting and troubleshooting speed
What is our primary use case?
My main use case for Grafana is to create and design dashboards based on the metrics provided by different exporters via Prometheus.
We have different exporters, and we are creating different dashboards based on them. We have a set of dashboards related to Kafka, virtual machines, and instances. Inside Kafka, we have a broker dashboard, consumer dashboard, partition dashboard, and other ingestion and consumption rate dashboards. Apart from that, we have a dashboard for consumer lag and consumption by partition.
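As a rough illustration of what a consumer lag by partition panel reports, the sketch below computes lag as the latest broker offset minus the consumer group's committed offset. The function name and the sample offsets are hypothetical; in practice the exporter supplies these numbers and Grafana only plots them.

```python
from typing import Dict


def lag_by_partition(end_offsets: Dict[int, int],
                     committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return lag per partition: latest broker offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no commit yet is treated as offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end offset never yields negative lag.
        lag[partition] = max(end - committed, 0)
    return lag


# Hypothetical sample data for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200}

per_partition = lag_by_partition(end_offsets, committed)
total_lag = sum(per_partition.values())
print(per_partition, total_lag)
```

A partition whose lag keeps growing is usually the one worth opening in the partition dashboard first.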
We are collecting metrics from Prometheus and creating dashboards inside Grafana. Inside Grafana, we have different data sources, including Thanos and Prometheus. We are also using Grafana for alerting. We have set up alerts based on the exceptions we collect from Loki, and if any such exception occurs, Grafana creates an incident in Squadcast.
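A minimal sketch of the exception-to-incident flow described above, assuming log streams carry an `app` label and that exceptions contain the literal text "Exception". The LogQL query counts matching lines over a window; the incident dictionary is a hypothetical shape for illustration and is not Squadcast's actual payload format.

```python
from typing import Optional


def exception_count_query(app: str, window: str = "5m") -> str:
    """Build a LogQL query counting log lines that contain 'Exception'."""
    return f'sum(count_over_time({{app="{app}"}} |= "Exception" [{window}]))'


def build_incident(app: str, count: int, threshold: int = 0) -> Optional[dict]:
    """Return a hypothetical incident payload when the count exceeds the threshold."""
    if count <= threshold:
        return None
    return {
        "source": "Grafana",
        "title": f"{count} exceptions logged by {app}",
        "query": exception_count_query(app),
    }


print(exception_count_query("payments"))
print(build_incident("payments", 3))
```

Filtering incidents by the `source` field is what makes it possible to list only the Grafana-originated alerts in the incident tool.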
What is most valuable?
Grafana offers many features including the ability to create dashboards, add variables, and set up alerts, which also covers notifications via integration with incident management tools or by configuring your email ID to get the notifications.
You can directly configure alerts in Grafana by either creating a dashboard or using the explore icon in Grafana, where you can select Loki and set alerts based on your exceptions.
Dashboard creation is much easier than in other tools. You can configure multiple data sources such as Prometheus and Thanos, and you can link AWS CloudWatch and other tools directly to Grafana. For alerting, you can create alerts based on thresholds and exceptions, and Grafana has many plugins you can configure to build dashboards on additional data sources. Grafana also supports role-based access control, which lets you grant viewer, editor, or admin access.
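The viewer, editor, and admin roles can be thought of as an ordered hierarchy. The sketch below is a simplified model of that idea, not Grafana's implementation; the action names and the mapping to minimum roles are assumptions chosen for illustration.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    EDITOR = 2
    ADMIN = 3


# Hypothetical mapping of actions to the minimum role that may perform them.
REQUIRED_ROLE = {
    "view_dashboard": Role.VIEWER,
    "edit_dashboard": Role.EDITOR,
    "manage_data_sources": Role.ADMIN,
}


def is_allowed(role: Role, action: str) -> bool:
    """A role may perform an action if it ranks at or above the required role."""
    return role >= REQUIRED_ROLE[action]


print(is_allowed(Role.VIEWER, "edit_dashboard"))
print(is_allowed(Role.ADMIN, "edit_dashboard"))
```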
We have had very positive outcomes from Grafana because you can directly visualize the metrics based on past and current inputs and take timely actions based on the responses you are getting from the visualization dashboards. Apart from that, the alerts notify you through your incident management tool.
You can check those metrics in the incident management tool by filtering the alert source as Grafana, and it helps in reducing production incidents because you can acknowledge and visualize the metrics from Grafana on time.
What needs improvement?
Currently, I do not think that any improvement is required, but there are multiple use cases.
For how long have I used the solution?
I have been using Grafana for the last four years.
What do I think about the stability of the solution?
What do I think about the scalability of the solution?
Grafana has excellent scalability.
How are customer service and support?
The customer support for Grafana is excellent.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
This is the only solution we are currently using.
Before choosing Grafana, we evaluated other options including DataDog, but it was considerably more expensive, so we chose Grafana.
How was the initial setup?
I have seen a return on investment: we need fewer employees, and we can act on alerts promptly. Grafana also reduces MTTR, because notifications arrive through the incident management tool and you can troubleshoot by visualizing metrics and logs inside Grafana. Spotting issues earlier in the metrics lets you tackle them sooner.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing is that it is very reasonable and has excellent community support.
What other advice do I have?
You are able to detect issues faster because you can configure threshold-based alerts in Grafana and get notifications from a tool like Squadcast, which reduces MTTR. Grafana also gives you system visibility: you can visualize CPU, memory, disk usage, API latencies, and other ports inside a Grafana dashboard. Based on these metrics, you can troubleshoot issues very easily.
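To make threshold-based alerting concrete, here is a simplified sketch of how an alert rule with a pending period could be evaluated. It is a model for illustration, not Grafana's evaluation engine; the state names, the sample CPU values, and the 120-second pending period are assumptions.

```python
from typing import List, Tuple


def evaluate(samples: List[Tuple[int, float]],
             threshold: float,
             pending_seconds: int) -> List[Tuple[int, str]]:
    """Classify each (timestamp, value) sample as normal, pending, or firing."""
    states = []
    breach_start = None  # timestamp at which the current breach began
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            # Fire only once the breach has lasted the full pending period.
            breached_long_enough = ts - breach_start >= pending_seconds
            state = "firing" if breached_long_enough else "pending"
        else:
            breach_start = None
            state = "normal"
        states.append((ts, state))
    return states


# Hypothetical CPU usage samples, one per minute.
cpu = [(0, 70.0), (60, 92.0), (120, 95.0), (180, 97.0), (240, 60.0)]
states = evaluate(cpu, threshold=90.0, pending_seconds=120)
print(states)
```

The pending period is the usual way to avoid paging on a short spike; only the firing state would be forwarded to the incident tool.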
If you want a scalable solution, better visualization, optimization, centralized monitoring, and improved troubleshooting, then you can choose Grafana without any doubts in your mind. I would rate this product a ten out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Unified dashboards have empowered teams and have democratized real-time operational insights
What is our primary use case?
My main use case for Grafana involves operational dashboarding and data visualization, where I use it as a central pane of glass to pull in metrics from multiple sources like Prometheus, Elasticsearch, and SQL databases to visualize the overall health of our systems in one unified view.
For example, I have built a NOC dashboard that tracks CPU, memory usage, and network traffic across all the pods. If a specific service starts failing, the Grafana dashboard highlights the issue in red, allowing my on-call engineers to identify the failing cluster at a glance.
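The red highlighting comes from threshold steps on a panel. A minimal sketch of that mapping, assuming three hypothetical steps (green by default, orange from 70, red from 90):

```python
from typing import List, Tuple


def color_for(value: float, steps: List[Tuple[float, str]]) -> str:
    """Return the color of the highest step whose lower bound the value reaches.

    `steps` must be sorted by ascending lower bound.
    """
    color = steps[0][1]
    for lower_bound, step_color in steps:
        if value >= lower_bound:
            color = step_color
    return color


# Hypothetical threshold steps for a CPU usage panel, in percent.
STEPS = [(float("-inf"), "green"), (70.0, "orange"), (90.0, "red")]

for usage in (45.0, 75.0, 93.0):
    print(usage, color_for(usage, STEPS))
```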
What is most valuable?
Grafana's snapshot and dashboard sharing features are critical for our remote incident response. During production issues, I generate a public snapshot of a dashboard at a specific point and share the URL in our Slack war room so every engineer can see exactly what the metrics looked like when the error occurred. This helps significantly during the process of finding the root cause in those scenarios.
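Snapshots can also be created programmatically. The sketch below only builds the HTTP request and does not send it. It assumes Grafana's snapshot endpoint (`POST /api/snapshots` with `dashboard`, `name`, and `expires` fields); the base URL, token, and dashboard body are placeholders, and the exact fields should be checked against the API documentation for your Grafana version.

```python
import json
import urllib.request


def build_snapshot_request(base_url: str, api_token: str, dashboard: dict,
                           expires_seconds: int = 3600) -> urllib.request.Request:
    """Build (but do not send) a request that would create a dashboard snapshot."""
    payload = {
        "dashboard": dashboard,
        "name": dashboard.get("title", "snapshot"),
        "expires": expires_seconds,  # seconds until the snapshot expires
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/snapshots",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: replace with a real instance URL and token.
req = build_snapshot_request(
    "https://grafana.example.com",
    "HYPOTHETICAL_TOKEN",
    {"title": "NOC overview", "panels": []},
)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` should return a JSON body containing the snapshot URL, which is the link that gets pasted into the war room channel.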
The best features Grafana offers go beyond just pretty charts; it is an integration engine. The fact that I can join data from my SQL database with metrics from Prometheus in the same table is a feature I have not found performed as well elsewhere.
My team uses this feature to combine two different database tables into one single view. The resulting charts can be displayed on one dashboard, allowing end users who are not familiar with the technical details to extract valuable data from it.
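Conceptually, combining a SQL result with Prometheus metrics in one table is a join on a shared field. The sketch below mimics an inner join in plain Python; the field names and rows are hypothetical, and in Grafana itself this would be done with a join transformation in the panel editor, not with code.

```python
from typing import Dict, List


def join_by_field(left: List[Dict], right: List[Dict], key: str) -> List[Dict]:
    """Inner-join two lists of rows on a shared key field."""
    right_index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_index.get(row[key])
        if match is not None:
            # Merge the two rows; the key field appears once.
            joined.append({**row, **match})
    return joined


# Hypothetical rows: ownership data from SQL, error rates from Prometheus.
sql_rows = [
    {"service": "checkout", "owner": "payments-team"},
    {"service": "search", "owner": "discovery-team"},
]
prom_rows = [
    {"service": "checkout", "error_rate": 0.021},
    {"service": "search", "error_rate": 0.003},
    {"service": "cart", "error_rate": 0.0},
]

table = join_by_field(sql_rows, prom_rows, "service")
for row in table:
    print(row)
```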
Grafana has positively impacted our organization by democratizing data within our company. Before using Grafana, only developers could see the system health, but now our product managers and executives have their own high-level dashboards, which has improved cross-departmental transparency and alignment.
What needs improvement?
I find that the alerting UI in Grafana can be complex for new users. While it is very powerful, it takes time to learn the differences between contact points, notification policies, and silences.
The documentation can be improved to provide more detailed descriptions, allowing new users to understand more concepts before they come to knowledge transfer sessions with senior team members.
For how long have I used the solution?
I have been using Grafana for over four years to build real-time observability dashboards and monitor our complex infrastructure and application performance.
What do I think about the stability of the solution?
In my experience, Grafana is extremely stable. Even when handling millions of data points, the visualization layer remains responsive. Since it is decoupled from the actual data storage, the dashboard stays up even if one of our underlying data sources is temporarily slow.
What do I think about the scalability of the solution?
Grafana's scalability is impressive. It is highly scalable and built on a big data architecture capable of ingesting trillions of data points. For our on-premise instance, I use a high availability configuration with a shared database to manage growth.
How are customer service and support?
Customer support for Grafana is solid. The community support is massive, and the technical support team is very helpful with complex PromQL troubleshooting.
Which solution did I use previously and why did I switch?
Before Grafana, I relied solely on the native monitoring console of our cloud providers, like AWS CloudWatch. I switched to Grafana because I needed a way to see all my clouds in a single dashboard rather than switching between multiple tabs.
How was the initial setup?
Grafana's forever free tier for the cloud version allowed the initial setup cost to be zero. As I scaled, I moved to a paid tier based on my number of active series and users, which I found to be very fair compared to other observability vendors.
What was our ROI?
I identified over-provisioned servers and reduced my monthly AWS bill by 15%, which is a significant cost saving. Additionally, I saw a 25% improvement in MTTD after shifting from text-based logs to visualized dashboards.
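To put those percentages in concrete terms, here is a small worked example. The baseline figures (a 40,000 monthly bill and a 20-minute MTTD) are purely hypothetical and are not taken from the review; only the 15% and 25% reductions are.

```python
def after_reduction(baseline: float, percent: float) -> float:
    """Return the value left after reducing `baseline` by `percent` percent."""
    return baseline * (1 - percent / 100)


# Hypothetical baselines, for illustration only.
monthly_bill = 40_000.0
mttd_minutes = 20.0

new_bill = after_reduction(monthly_bill, 15)
new_mttd = after_reduction(mttd_minutes, 25)

print(f"Monthly bill: {monthly_bill:.0f} -> {new_bill:.0f}")
print(f"Monthly saving: {monthly_bill - new_bill:.0f}")
print(f"MTTD: {mttd_minutes:.0f} min -> {new_mttd:.0f} min")
```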
What's my experience with pricing, setup cost, and licensing?
I purchased my Grafana Cloud subscription through the AWS Marketplace, which simplified my procurement process and allowed me to apply the cost towards my AWS committed spend.
Which other solutions did I evaluate?
I looked at Kibana and Tableau before deciding on Grafana. I chose Grafana because Kibana is mostly limited to Elasticsearch, whereas Grafana can connect to almost any data source. Unlike Tableau, Grafana is specifically optimized for time series data and real-time monitoring.
What other advice do I have?
When Grafana highlights an issue, it triggers email alerts that engineers can rely on. As soon as they receive these alerts, they involve other support teams and open a bridge call to start troubleshooting.
For those looking into using Grafana, I advise starting with the Grafana play site to see what is possible and then using the pre-built dashboards from the Grafana dashboard gallery. There is likely already a perfect dashboard available for free tailored to your tech stack.
Grafana is unique in that I can join data from my SQL database with metrics from Prometheus in the same table, a feature I have not found performed as well elsewhere. My overall rating for this product is 10 out of 10.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)