Very useful Network Hosts
The user interface is intuitive, making it easy to manage domains, emails, and databases. The dashboard is well-organized, which is a plus for beginners who might feel overwhelmed by technical details.
Single pane of glass, easy to share dashboards, and good for monitoring
What is our primary use case?
We primarily use the solution for a variety of purposes, including:
- Watching RUM data for our frontend site, using LCP and INP metrics to compare the old and new architecture and inform rollout decisions.
- Watching APM data for backend services, observing how the backend server reacts (CPU util, memory, requests/second) to make sure the backend can handle the load.
- Using Datadog CCM during our free trial period to get visibility over our AWS spend across accounts and resources and looking at recommendations and acting on those.
- Browsing the service catalog to look at the current state of the services that are running and what resources they use.
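As a rough illustration of the RUM comparison in the first item, the sketch below computes the 75th percentile of LCP samples for two builds and checks the result against the Core Web Vitals "good" thresholds (2.5 s for LCP, 200 ms for INP). The sample values and the names `p75` and `compare` are hypothetical; in practice the numbers would come from RUM rather than hard-coded lists.

```python
import math

# Core Web Vitals "good" thresholds, in milliseconds.
GOOD_THRESHOLD_MS = {"lcp": 2500, "inp": 200}


def p75(samples):
    """Return the 75th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def compare(metric, old_samples, new_samples):
    """Summarize p75 for both architectures and whether the new one is 'good'."""
    old_p75, new_p75 = p75(old_samples), p75(new_samples)
    return {
        "metric": metric,
        "old_p75": old_p75,
        "new_p75": new_p75,
        "improved": new_p75 < old_p75,
        "new_is_good": new_p75 <= GOOD_THRESHOLD_MS[metric],
    }


# Hypothetical samples (ms); real values would be exported from RUM.
old_lcp = [1800, 2600, 3100, 2400, 2900, 3500, 2200, 2700]
new_lcp = [1500, 2100, 2300, 1900, 2600, 2000, 1700, 2400]

result = compare("lcp", old_lcp, new_lcp)
print(result)
```

The same `compare` call works for INP by passing `"inp"` and interaction latencies in milliseconds.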
How has it helped my organization?
This provides a single place to find monitoring data. Prior to DD, we had some metrics living in New Relic, some in Grafana, and some in Circonus, and it was very confusing to navigate across them. Understanding different query languages is challenging. Here, there's a single UI to get used to, and everything is so sharable.
DD has led to teams making more decisions based on data that they observe about their service metrics and RUM metrics. I've seen decisions get made based on what has been observed in DD, and less based on anecdotal data.
What is most valuable?
I really enjoyed using CCM since it showed cloud cost data easily next to other metrics, and I could correlate the two.
Across CCM and the rest of Datadog, I like how sharable everything is. It's so easy to share dashboards and links with my teammates so we can quickly get up to speed on debugging/solving an issue.
I have also really enjoyed the K8s view of pods and pod health. It's very visual, and even though I'm not a K8s platform owner at my company, I can still observe the overall health of the system. I can then drill in, and I have learned things about K8s by exploring that part of the product and talking with the team.
What needs improvement?
We've had some issues where we had Datadog automatically turned on in AWS regions that we weren't using, which incurred a small but steady cost that amounted to tens of thousands of dollars spent over a few weeks. I wish there was a global setting that lets an admin restrict which regions DD is turned on in as a default setup step.
Sometimes, the APM service dashboard link isn't sharable. I click something in the service catalog, and on that service's APM default view, I try to share a link to that with a teammate, and they reach a blank or error screen.
I wish there was more organization and detail in the suggestions when I use the query editor. I'm never quite sure when the autofill dropdown shows up if I'm seeing some custom tag or some default property, so I have to know exactly what I'm looking for in order to build a chart. It's hard to navigate and explore using the query autofill suggestions without knowing exactly what tag to look for.
It's been a bit hard to understand how data gets sampled or how many data points a particular dashboard value is using. We've had questions about the RUM metrics that we see, and we had to ask for help with how values are calculated, bin sizes, etc., to get confidence in our data.
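To make the bin-size question concrete, here is a minimal sketch showing how the same raw series yields different peak values depending on the averaging interval. The function name `rollup_avg` and the sample data are illustrative assumptions; this is not a description of how Datadog computes its values, only of the general effect of binning.

```python
def rollup_avg(points, bin_size):
    """Average consecutive points in fixed-size bins (the last bin may be shorter)."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    bins = [points[i:i + bin_size] for i in range(0, len(points), bin_size)]
    return [sum(b) / len(b) for b in bins]


# Hypothetical latency samples (ms) containing one short spike.
raw = [100, 100, 100, 900, 100, 100, 100, 100]

peak_fine = max(rollup_avg(raw, 1))    # every point is its own bin
peak_medium = max(rollup_avg(raw, 2))  # the spike is averaged with one neighbor
peak_coarse = max(rollup_avg(raw, 4))  # the spike is averaged with three neighbors

print(peak_fine, peak_medium, peak_coarse)
```

The wider the bin, the more a brief spike is diluted, which is one reason the same chart can show different values at different zoom levels.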
For how long have I used the solution?
I've used the solution for six months.
What do I think about the stability of the solution?
I've only been aware of a recent outage that affected the latency of data collection for one of our production tests. Outside of that, the solution seems stable.
What do I think about the scalability of the solution?
The solution seems like it can scale very well and beyond our needs.
How are customer service and support?
Technical support has been stellar. We love working with a team that responds fast, in great detail, and with great empathy. I trust what they say.
Which solution did I use previously and why did I switch?
We used New Relic, Grafana, and Circonus. Circonus was flaky; it was always having downtime, and we were always on the phone with them. With New Relic and Grafana, different metrics lived in each, and it was hard for consumers of the data to find what they needed. We also had licensing issues across the three, so not everybody could easily access all of them.
What's my experience with pricing, setup cost, and licensing?
I didn't do this portion of the product setup.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Good query filtering and dashboards to make finding data easier
What is our primary use case?
We use the solution for monitoring microservices in a complex AWS-based cloud service.
The system is comprised of about a dozen services. This involves processing real-time data from tens of thousands of internet-connected devices that are providing telemetry. Thousands of user interactions are processed along with real-time reporting of device data over transaction intervals that can last for hours or even days. The need to view and filter data over periods of several months is not uncommon.
Datadog is used for daily monitoring and R&D research as well as during incident response.
How has it helped my organization?
The query filtering and improved search abilities offered by Datadog are by far superior to other solutions we were using, such as AWS CloudWatch. We find that we can simply get at the data we need quicker and easier than before. This has made responding to incidents or investigating issues a much more productive endeavour. We simply have fewer roadblocks in the way when we need to "get at the data". It is also used occasionally to extract data while researching requirements for new features.
What is most valuable?
Datadog dashboards are used to provide a holistic view of the system across many services. Customizable views as well as the ability to "dive in" when we see something anomalous have improved the workflow for handling incidents.
Log filtering, pattern detection and grouping, and extracting values from logs for plotting on graphs all help to improve our ability to visualize what is going on in the system. The custom facets allow us to tailor the solution to fit our specific needs.
What needs improvement?
There are some areas on log filtering screens where the user interface can take some getting used to. Perhaps having the option for a simple vs advanced user interface would be helpful in making new or less experienced users comfortable with making their own custom queries.
Maybe it is just how our system is configured, yet finding the valid values for a key/value pair is not always intuitively obvious to me. While there is a pop-up window with historical or previously used values and saved views from previous query runs, I don't see a simple list or enumeration of the set of valid values for keys that have such a restriction.
For how long have I used the solution?
I've used the solution for one year.
What do I think about the stability of the solution?
The solution is very stable.
What do I think about the scalability of the solution?
The product is reasonably scalable, although costs can get out of hand if you aren't careful.
How are customer service and support?
I have not had the need to contact support.
Which solution did I use previously and why did I switch?
We did use AWS CloudWatch. It was too awkward to use effectively and simply didn't have the features.
How was the initial setup?
We had someone experienced do the initial setup. However, with a little training, it wasn't too bad for the rest of us.
What about the implementation team?
We handled the setup in-house.
What's my experience with pricing, setup cost, and licensing?
Take care with how you extract custom values from logs. It is easy to do things that make your life easier without thinking about it, and not realize how expensive they have become compared with where you started.
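One way to see why this gets expensive: custom metrics are generally counted per unique combination of metric name and tag values, so a high-cardinality value extracted from logs multiplies the number of series. The sketch below, with hypothetical log records and an assumed helper `count_series`, compares tagging by status alone against tagging by status and device ID.

```python
def count_series(records, tag_keys):
    """Count distinct tag-value combinations across the records."""
    return len({tuple(r[k] for k in tag_keys) for r in records})


# Hypothetical parsed log records: 1,000 devices, 2 status values.
records = [
    {"device_id": f"dev-{i}", "status": "ok" if i % 2 == 0 else "error"}
    for i in range(1000)
]

low = count_series(records, ["status"])                # one series per status
high = count_series(records, ["status", "device_id"])  # one series per device

print(low, high)
```

Adding a single per-device tag takes this toy example from 2 series to 1,000, which is the kind of jump worth checking before turning an extracted value into a metric tag.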
Which other solutions did I evaluate?
I'm not aware of evaluating other solutions.
What other advice do I have?
Overall I recommend the solution. Just be mindful of costs.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Helpful support, with centralized pipeline tracking and error logging
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
How has it helped my organization?
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
What is most valuable?
The centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing is great, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most. And the ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
What needs improvement?
While the documentation is very good, there are areas that need a lot of focus to pick up on the key details. In some cases the screenshots don't match the text when updates are made.
I spent longer than I should have trying to figure out how to correlate logs to traces, mostly because of issues related to environment variables.
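For readers hitting the same problem, here is a minimal sketch of what a correlated log line can look like. It assumes Datadog's unified service tagging variables (`DD_ENV`, `DD_SERVICE`, `DD_VERSION`) and the `dd.trace_id` / `dd.span_id` log attributes; the helper `correlated_log`, the IDs, and the service values are placeholders, and it is written in Python for illustration even though our own setup was .NET. A tracer would normally inject these fields automatically.

```python
import json
import os


def correlated_log(message, trace_id, span_id, environ=None):
    """Build a JSON log line carrying the fields used to link logs to traces."""
    env = os.environ if environ is None else environ
    return json.dumps({
        "message": message,
        "dd.env": env.get("DD_ENV", ""),
        "dd.service": env.get("DD_SERVICE", ""),
        "dd.version": env.get("DD_VERSION", ""),
        "dd.trace_id": str(trace_id),
        "dd.span_id": str(span_id),
    })


# Placeholder IDs and values; a tracer would normally supply the real ones.
line = correlated_log(
    "checkout failed",
    trace_id=1234567890,
    span_id=987654321,
    environ={"DD_ENV": "staging", "DD_SERVICE": "web-store", "DD_VERSION": "1.4.2"},
)
print(line)
```

If the environment, service, or version values differ between the application logs and the traces, the two may not line up, which is where a mismatch in environment variables tends to show.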
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime.
What do I think about the scalability of the solution?
It's scalable and customizable.
How are customer service and support?
Support is helpful. They help us tune our committed costs and alert us when we start spending out of the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility.
How was the initial setup?
Setup is generally simple. .NET Profiling of IIS and aligning logs to traces and profiles was a challenge.
What about the implementation team?
We implemented the solution in-house.
What was our ROI?
There has been significant time saved by the development team in terms of assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
I'd advise others to set up live trials to assess cost scaling. Small decisions around how monitors are used can have big impacts on cost scaling.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
We are excited to dig further into the new offerings around LLM and continue to grow our footprint in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Very good custom metrics, dashboards, and alerts
What is our primary use case?
Our primary use case for Datadog involves utilizing its dashboards, monitors, and alerts to monitor several key components of our infrastructure.
We track the performance of AWS-managed Airflow pipelines, focusing on metrics like data freshness, data volume, pipeline success rates, and overall performance.
In addition, we monitor Looker dashboard performance to ensure data is processed efficiently. Database performance is also closely tracked, allowing us to address any potential issues proactively. This setup provides comprehensive observability and ensures that our systems operate smoothly.
How has it helped my organization?
Datadog has significantly improved our organization by providing a centralized platform to monitor all our key metrics across various systems. This unified observability has streamlined our ability to oversee infrastructure, applications, and databases from a single location.
Furthermore, the ability to set custom alerts has been invaluable, allowing us to receive real-time notifications when any system degradation occurs. This proactive monitoring has enhanced our ability to respond swiftly to issues, reducing downtime and improving overall system reliability. As a result, Datadog has contributed to increased operational efficiency and minimized potential risks to our services.
What is most valuable?
The most valuable features we’ve found in Datadog are its custom metrics, dashboards, and alerts. The ability to create custom metrics allows us to track specific performance indicators that are critical to our operations, giving us greater control and insights into system behavior.
The dashboards provide a comprehensive and visually intuitive way to monitor all our key data points in real-time, making it easier to spot trends and potential issues. Additionally, the alerting system ensures we are promptly notified of any system anomalies or degradations, enabling us to take immediate action to prevent downtime.
Beyond the product features, Datadog’s customer support has been incredibly timely and helpful, resolving any issues quickly and ensuring minimal disruption to our workflow. This combination of features and support has made Datadog an essential tool in our environment.
What needs improvement?
One key improvement we would like to see in a future Datadog release is the inclusion of certain metrics that are currently unavailable. Specifically, the ability to monitor CPU and memory utilization of AWS-managed Airflow workers, schedulers, and web servers would be highly beneficial for our organization. These metrics are critical for understanding the performance and resource usage of our Airflow infrastructure, and having them directly in Datadog would provide a more comprehensive view of our system’s health. This would enable us to diagnose issues faster, optimize resource allocation, and improve overall system performance. Including these metrics in Datadog would greatly enhance its utility for teams working with AWS-managed Airflow.
For how long have I used the solution?
I've used the solution for four months.
What do I think about the stability of the solution?
The stability of Datadog has been excellent. We have not encountered any significant issues so far.
The platform performs reliably, and we have experienced minimal disruptions or downtime. This stability has been crucial for maintaining consistent monitoring and ensuring that our observability needs are met without interruption.
What do I think about the scalability of the solution?
Datadog is generally scalable, allowing us to handle and display thousands of custom metrics efficiently. However, we’ve encountered some limitations in the table visualization view, particularly when working with around 10,000 data points. In those cases, the search functionality doesn’t always return all valid results, which can hinder detailed analysis.
How are customer service and support?
Datadog's customer support plays a crucial role in easing the initial setup process. Their team is proactive in assisting with metric configuration, providing valuable examples, and helping us navigate the setup challenges effectively. This support significantly mitigates the complexity of the initial setup.
Which solution did I use previously and why did I switch?
We used New Relic before.
How was the initial setup?
The initial setup of Datadog can be somewhat complex, primarily due to the learning curve associated with configuring each metric field correctly for optimal data visualization. It often requires careful attention to detail and a good understanding of each option to achieve the desired graphs and insights.
What about the implementation team?
We implemented the solution in-house.