
Datadog Enterprise
Datadog made monitoring easy and effective
Centralized monitoring has improved cloud observability and reduced manual debugging effort
What is our primary use case?
My main use case for Datadog is monitoring logs and capturing metrics such as CPU and memory usage, along with traces, across the different services of the cloud-based system where I initially worked. I use it mainly to monitor my servers in AWS and to debug systems that are failing or slow.
What is most valuable?
The best features of Datadog for me are the user-friendly real-time dashboard and its ability to easily integrate with AWS, Azure, Kubernetes, Kafka, and provide a centralized log management system, which gives me excellent visibility into the microservice architecture.
Datadog has impacted my organization by providing a centralized monitoring system so that each person can trace what is happening in the VM servers, and it has given us a centralized dashboard view.
Adopting Datadog has reduced manual effort by around seven to eight hours per week and made the process completely automated.
Datadog has improved collaboration within and across teams, allowing us to track down what is wrong quickly and easily.
What needs improvement?
If I could change one thing about Datadog, it would be the pricing. The functionality is extraordinary, but it is somewhat expensive, and the cost rises as we add servers and monitored services. A more predictable and flexible pricing structure would be beneficial, along with additional customization options and reporting features.
For how long have I used the solution?
I have been familiar with Datadog for more than two years.
What do I think about the stability of the solution?
I have not yet faced any frustration with Datadog.
Which solution did I use previously and why did I switch?
Before Datadog, I reviewed CloudWatch logs in AWS, and we initially used Checkmk for monitoring.
How was the initial setup?
When I first implemented Datadog, the basic setup took me around thirty to forty minutes because we had a very large application to monitor. After configuration, data appeared within three to four minutes.
What about the implementation team?
We did not have any formal training on Datadog. Instead, we relied on documentation we found through Google covering what Datadog is, how to set it up, and what its use cases are, and we did our initial setup based on that.
Which other solutions did I evaluate?
When evaluating options before choosing Datadog, I compared it with tools such as New Relic and Grafana Labs with Prometheus. The main reason I chose Datadog is that it is a single platform where I can see metrics, logs, traces, and alerts, and it easily integrates with Kubernetes and other services such as Kafka.
What other advice do I have?
Our workflow is both team-wide and individual: we check end-to-end observability and monitor our application, infrastructure, and cloud services both individually and as a team.
When I open Datadog, the first thing I look at is the home dashboard, which shows active alerts and system health status and lists all monitored resources, including servers, virtual machines, and Kubernetes pods and nodes. I also check CPU usage, memory usage, and disk utilization.
Datadog is used by the cloud infrastructure monitoring team and the application team within the company, and everyone uses it on the same level as I do.
I have not come across any features during the implementation of Datadog that I do not actually use in practice.
As of now, for my use case, I am satisfied with what Datadog offers, and I do not wish for any specific features that it currently lacks.
My advice to someone considering Datadog who has a similar workflow to mine is to read the documentation in full and practice with it. I would rate my overall experience with Datadog as an eight out of ten.
Unified monitoring has improved incident response and reduced root cause analysis time
What is our primary use case?
Datadog serves as my primary tool for infrastructure monitoring and log analysis in a cloud environment. From a network and security perspective, I use it to monitor server health, track network metrics like latencies and traffic patterns, and analyze logs for troubleshooting issues such as VPN instability and unexpected spikes. The ability to correlate metrics and logs in one place makes it much faster to identify the root cause instead of checking multiple tools.
One example where Datadog proved invaluable was during a sudden spike in application response time. We received alerts on increased latencies, and instead of checking multiple tools, I used Datadog's dashboard to quickly correlate metrics. I noticed that while the application CPU was normal, there was a spike in database response times. Using the logs and metrics together, I was able to confirm that the issue was coming from the database, not the application. This helped us quickly involve the right team and resolve the issue faster.
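The reasoning in that incident can be sketched in a few lines. This is a minimal illustration only, not Datadog's API: the metric names, the sample values, and the `spike_ratio` and `suspects` helpers are all hypothetical. It compares the most recent points of each series with its earlier baseline to see which metrics moved together with the latency alert.

```python
from statistics import mean

def spike_ratio(series, window=3):
    """Ratio of the mean of the last `window` points to the mean of the earlier points."""
    baseline = mean(series[:-window])
    recent = mean(series[-window:])
    return recent / baseline

# Hypothetical samples, one point per minute (illustrative values, not real data).
metrics = {
    "app.response_time_ms": [120, 118, 125, 122, 119, 380, 410, 395],
    "app.cpu_percent": [41, 43, 40, 42, 44, 43, 42, 41],
    "db.query_time_ms": [15, 14, 16, 15, 15, 210, 240, 225],
}

def suspects(metrics, threshold=2.0):
    """Return names of metrics whose recent mean is at least `threshold` times the baseline."""
    return sorted(name for name, s in metrics.items() if spike_ratio(s) >= threshold)

print(suspects(metrics))
```

With these sample values, response time and database query time are flagged while application CPU stays flat, which matches the conclusion that the database, not the application, was the source.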
What is most valuable?
The best features of Datadog are the correlation capabilities and unified visibility. The most useful aspect is that I can see metrics, logs, and service-level data in one place. During troubleshooting, I do not have to switch tools; I can directly correlate spikes in latencies with log error patterns, which saves considerable time. Another feature I find very useful is the dashboards, which are flexible, and I can create views based on what I actually need to monitor daily instead of relying on default setups. The integration with cloud services makes onboarding very easy, and once integrated, most of the data starts flowing automatically without much manual effort.
Datadog has had a positive impact, mainly by improving how quickly we detect and understand issues. Earlier, when something went wrong, considerable time went into figuring out where the problem actually was. Now, with better visibility across services and logs, we can quickly narrow down the source, whether it is application, infrastructure, or dependency-related. It has also helped in reducing the back and forth between teams because we can validate issues with the data before escalating, which has made incident handling smoother and more efficient overall.
What needs improvement?
One area where Datadog can be improved is around alert quality. In the beginning, it tends to generate many alerts, and without proper tuning, many of them are not actionable. It would help if there were more built-in guidance or smarter defaults to reduce noise. Another improvement area is cost visibility and control. As log and metric ingestion increases, it has not always been straightforward to track which data is driving the cost. More granular and real-time cost insights would make it easier to manage. Additionally, while the dashboards are flexible, navigating and organizing them at scale can become slightly difficult. Better structuring or management options would help in larger environments.
For how long have I used the solution?
I have been using Datadog for nearly two years.
What do I think about the stability of the solution?
Datadog has been stable overall in my experience. We have not seen any major platform outages, and metrics collection and alerting have been consistent in day-to-day use. Most issues we have faced were related to configuration or alert tuning rather than the platform itself.
What do I think about the scalability of the solution?
Datadog scales well as environments grow in my experience. As we add more servers and services, onboarding is straightforward with agents and integrations. We have not faced any major performance issues from the platform side; it handles increased metrics and monitoring loads smoothly. The primary consideration is managing log volume carefully, because data ingestion and costs go up as the scale increases.
How are customer service and support?
We do not rely on Datadog support for day-to-day issues. Most of the time, we are able to resolve things using the dashboards, logs, and documentation. We have only reached out in a few cases, mainly for configuration-related queries. In those situations support was helpful, though it sometimes took a few back-and-forth exchanges to get to the exact solution. Overall, support is decent, but we mostly depend on self-troubleshooting.
Which solution did I use previously and why did I switch?
Before Datadog, we were mainly using native cloud monitoring like Azure Monitor, along with a few basic tools. The main issue was that monitoring was fragmented. Metrics, logs, and alerts were spread across different places, and so during an incident, we had to switch between multiple tools to understand what was happening. We moved to Datadog to have everything in one place. The ability to correlate metrics and logs in a single platform made troubleshooting much faster and more efficient.
How was the initial setup?
Setting up dashboards and integrations in Datadog is relatively straightforward in my experience, especially for standard cloud services. For integrations, once we connect our cloud account, most of the metrics start coming in automatically, so the initial setup is not very complex. The documentation also helps considerably during this phase. For dashboards, basic ones are easy to create using existing templates, but to make them truly useful, we have to spend time customizing them based on our actual use cases, like adding specific metrics and refining the layout. Overall, the initial setup is easy, but making it truly effective takes practical tuning.
What was our ROI?
We have seen a clear return on investment with Datadog, mainly in terms of time saved and faster incident handling. For example, earlier when an issue occurred, it would take around thirty-five to forty-five minutes just to identify the root cause because we had to check multiple tools. With Datadog, we are usually able to narrow it down within ten to fifteen minutes using the centralized dashboard and logs. We have also reduced repeated troubleshooting efforts because we can identify patterns and fix the root cause instead of dealing with the same issues repeatedly. It has not reduced headcount, but it has definitely improved team efficiency and allowed us to handle more incidents with the same team.
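As a rough check on those figures, the snippet below works out the time saved per incident from the ranges quoted above. Taking the midpoint of each range is an assumption made purely for illustration; the review gives no per-incident data.

```python
def midpoint(low, high):
    """Midpoint of a quoted range, used here as a stand-in for a typical value."""
    return (low + high) / 2

before = midpoint(35, 45)  # minutes to find the root cause before Datadog
after = midpoint(10, 15)   # minutes to find the root cause with Datadog

saved = before - after
reduction_pct = saved / before * 100

print(f"{saved:.1f} min saved per incident ({reduction_pct:.0f}% less)")
```

Under that midpoint assumption, root cause identification takes roughly two thirds less time per incident.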
What's my experience with pricing, setup cost, and licensing?
My experience with pricing for Datadog has been mixed. The initial setup cost is relatively low since it is a SaaS model and does not require a heavy upfront investment. Getting started is quite quick with agent-based deployments. However, the ongoing cost is something that needs to be managed. Pricing is mainly based on data ingestion, such as logs, metrics, and traces, and it can increase quickly if everything is enabled by default. Licensing is flexible, but it requires continuous monitoring and optimization to keep costs under control.
What other advice do I have?
One additional point I can add is that with Datadog, I focused considerably on making alerts actionable and reducing noise. In the initial phases, we had too many alerts that were not very useful, so we spent time tuning thresholds, adding conditions, and correlating alerts with real impact. After that, alerts became much more meaningful and helpful in faster response. I also use it regularly for trend analysis, checking for recurring spikes or patterns over time, which helps in identifying potential issues before they become incidents.
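One common way to make a threshold alert less noisy is to require the condition to hold for several consecutive samples instead of firing on a single point. The sketch below illustrates that idea in plain Python. The function name, the threshold, and the sample values are hypothetical, and it does not use Datadog's monitor API.

```python
def should_alert(samples, threshold, min_consecutive=3):
    """Fire only if the last `min_consecutive` samples all exceed `threshold`."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])

# Hypothetical latency samples in milliseconds.
single_spike = [110, 105, 520, 115, 108]  # one-off spike: likely noise
sustained = [110, 480, 510, 530, 545]     # sustained breach: actionable

print(should_alert(single_spike, threshold=400))  # False
print(should_alert(sustained, threshold=400))     # True
```

The trade-off is a slightly later alert in exchange for fewer false positives, so `min_consecutive` is worth tuning per service.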
The features of Datadog become truly useful when you start combining them, not just using them separately. For example, just looking at the metrics alone does not always give the full picture, but when you combine metrics with logs and service-level data, it becomes much easier to understand what is actually happening during an incident. Features like tagging help considerably in filtering data across environments and services, especially when the setup grows. Without proper tagging, it can get difficult to navigate. Overall, the strength of Datadog is not just the individual features, but how well they work together in real scenarios.
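To show why consistent tagging matters for filtering, here is a small sketch in plain Python. The tag keys `env` and `service`, the host names, and the `filter_by_tags` helper are all hypothetical; the point is only that anything left untagged drops out of every tag-based view.

```python
def filter_by_tags(records, **wanted):
    """Keep records whose tags contain every wanted key/value pair."""
    return [
        r for r in records
        if all(r["tags"].get(key) == value for key, value in wanted.items())
    ]

# Hypothetical inventory of monitored hosts.
records = [
    {"host": "web-1", "tags": {"env": "prod", "service": "checkout"}},
    {"host": "web-2", "tags": {"env": "staging", "service": "checkout"}},
    {"host": "db-1", "tags": {"env": "prod", "service": "orders-db"}},
    {"host": "tmp-9", "tags": {}},  # untagged: never matches a tag filter
]

prod_checkout = filter_by_tags(records, env="prod", service="checkout")
print([r["host"] for r in prod_checkout])
```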
We have seen noticeable improvements after using Datadog, mainly in terms of time saved and faster incident handling. Earlier when an issue occurred, it could take around twenty to forty minutes just to understand where the problem was. Now, with the centralized visibility and correlation of metrics and logs, we are often able to narrow it down within fifteen to twenty-five minutes. We have also seen fewer repeated incidents because we can identify patterns and fix the root cause instead of just resolving symptoms. Incidents are getting resolved faster, and the time spent on troubleshooting has reduced significantly.
My advice for anyone considering Datadog is to be selective about what you monitor from day one. It is tempting to enable everything, but that usually leads to too much data and noisy alerts. Instead, start with critical services and key metrics, and then expand gradually. Invest time in tagging and structuring your data properly because it makes a considerable difference later when you need to filter, troubleshoot, or build dashboards. Finally, review your setup regularly because what works in the beginning may not stay relevant as the environment grows. Start small, avoid collecting all data, use proper tagging, and keep refining your setup over time. This review reflects an overall rating of eight.