My main use case for Datadog is dashboards and monitoring.
We use dashboards and monitoring with Datadog to monitor the performance of our Nexus Artifactory system and make sure the services are running.
The best features Datadog offers are the dashboarding tools as well as the monitoring tools.
What I find most valuable about the dashboarding and monitoring tools in Datadog is the ease of use and simplicity of the interface.
Datadog has positively impacted our organization by allowing us to look at things such as Cloud Spend and make sure our services are running at an optimal performance level.
We have seen specific outcomes such as cost savings by utilizing the cost utilization dashboards to identify areas where we could trim our spend.
To improve Datadog, I suggest they keep doing what they're doing.
Newer features using AI to create monitors and dashboards would be helpful.
I have been using Datadog for six years.
Datadog is stable.
I am not sure about Datadog's scalability.
Customer support with Datadog has been great when we needed it.
I rate the customer support a nine on a scale of 1 to 10.
Positive
We did not previously use a different solution.
In terms of return on investment, there is a lot of time saved from using the platform.
I was not directly involved in the pricing, setup cost, and licensing details.
Before choosing Datadog, we evaluated other options such as Splunk and Grafana.
I rate Datadog an eight out of ten because the expense of using it keeps it from being a nine or ten.
My advice to others looking into using Datadog is to brush up on their API programming skills.
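That advice can be made concrete with a small example. Below is a hedged sketch of the JSON body for Datadog's create-monitor endpoint (POST /api/v1/monitor); the metric, service name, and threshold are invented for illustration, not taken from the reviewer's setup.

```python
import json

# Hypothetical illustration: building the JSON body for Datadog's
# create-monitor endpoint (POST /api/v1/monitor). The metric, service,
# and threshold below are made-up examples.
def build_cpu_monitor(service: str, threshold: float) -> str:
    payload = {
        "name": f"High CPU on {service}",
        "type": "metric alert",
        "query": f"avg(last_5m):avg:system.cpu.user{{service:{service}}} > {threshold}",
        "message": "CPU is high. Notify the on-call team.",
        "tags": [f"service:{service}"],
        "options": {"thresholds": {"critical": threshold}},
    }
    return json.dumps(payload)

body = build_cpu_monitor("nexus", 90)
```

Sending this body with an authenticated POST (API and application keys in the request headers) is all the "API programming" a basic monitor requires.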
My overall rating for Datadog is eight out of ten.
The primary purposes for which Datadog is used include infrastructure monitoring and application monitoring.
The main use case for Datadog's integration capabilities is monitoring workloads in the public cloud, and the integrations that pulled public cloud metrics natively were helpful, even critical, for us. We are not using Datadog for AI-driven data analysis tasks; at the moment we lean on cloud-native and vendor-native tools for that, and at my last employer we didn't use Datadog for the AI piece at all.
I find alerting and metrics to be the most effective features of Datadog for system monitoring. It was still cheaper to run Datadog than the alternatives; the running costs were lower because it is SaaS and quite easy to use.
Datadog is only available as SaaS.
The pricing nowadays is quite complex.
In future updates, I would like to see AI features included in Datadog for monitoring AI spend and usage to make the product more versatile and appealing for the customer.
I have been using Datadog since 2014.
There were no problems with the deployment of Datadog.
The deployment of Datadog just took a few hours.
The challenges I encountered while using Datadog were in the early days when the product was missing the ability to monitor Kubernetes and similar features, but they have since added those features. At the moment, I don't think there are too many challenges that I am worrying about.
One person is enough to do the installation.
I am not working with any of these solutions currently because I'm on sabbatical; I last worked with Datadog six months ago.
We were using the tools that AWS and Azure came with natively to monitor the AI workflows on their platforms.
I used to work as the CTO at Northcloud, but I no longer work there.
On a scale of one to ten, I rate Datadog an eight out of ten.
We primarily use the solution for a variety of purposes.
This provides a single place to find monitoring data. Prior to DD, we had some metrics living in New Relic, some in Grafana, and some in Circonus, and it was very confusing to navigate across them. Understanding different query languages is challenging. Here, there's a single UI to get used to, and everything is so sharable.
DD has led to teams making more decisions based on data that they observe about their service metrics and RUM metrics. I've seen decisions get made based on what has been observed in DD, and less based on anecdotal data.
I really enjoyed using CCM since it showed cloud cost data easily next to other metrics, and I could correlate the two.
Across CCM and the rest of Datadog, I like how sharable everything is. It's so easy to share dashboards and links with my teammates so we can quickly get up to speed on debugging/solving an issue.
I have also really enjoyed the K8s view of pods and pod health. It's very visual, and as a non-K8s platform owner at my company, I can still observe the overall health of the system. I can then drill in, and I have learned things about K8s by exploring that part of the product and talking with the team.
We've had some issues where we had Datadog automatically turned on in AWS regions that we weren't using, which incurred a small but steady cost that amounted to tens of thousands of dollars spent over a few weeks. I wish there was a global setting that lets an admin restrict which regions DD is turned on in as a default setup step.
Sometimes, the APM service dashboard link isn't sharable. I click something in the service catalog, and on that service's APM default view, I try to share a link to that with a teammate, and they reach a blank or error screen.
I wish there was more organization and detail in the suggestions when I use the query editor. I'm never quite sure when the autofill dropdown shows up if I'm seeing some custom tag or some default property, so I have to know exactly what I'm looking for in order to build a chart. It's hard to navigate and explore using the query autofill suggestions without knowing exactly what tag to look for.
It's been a bit hard to understand how data gets sampled or how many data points a particular dashboard value is using. We've had questions over the RUM metrics that we see, and we had to ask for help with how values are calculated, bin sizes, etc., to get confidence in our data.
I've used the solution for six months.
I've only been aware of a recent outage that affected the latency of data collection for one of our production tests. Outside of that, the solution seems stable.
The solution seems like it can scale very well and beyond our needs.
Technical support has been stellar. We love working with a team that responds fast, in great detail, and with great empathy. I trust what they say.
Positive
We used New Relic, Grafana, and Circonus. Circonus was flaky, always having downtime, and we were constantly on the phone with them. With New Relic and Grafana, different metrics lived in each, and it was hard for consumers of the data to find what they needed. We also had licensing issues across the three, so not everybody could easily access all of them.
I didn't do this portion of the product setup.
We use the solution for monitoring microservices in a complex AWS-based cloud service.
The system comprises about a dozen services. This involves processing real-time data from tens of thousands of internet-connected devices that provide telemetry. Thousands of user interactions are processed, along with real-time reporting of device data over transaction intervals that can last for hours or even days. The need to view and filter data over periods of several months is not uncommon.
Datadog is used for daily monitoring and R&D research as well as during incident response.
The query filtering and improved search abilities offered by Datadog are by far superior to those of other solutions we were using, such as AWS CloudWatch. We find that we can get at the data we need more quickly and easily than before. This has made responding to incidents or investigating issues a much more productive endeavour. We simply have fewer roadblocks in the way when we need to "get at the data". It is also used occasionally to extract data while researching requirements for new features.
Datadog dashboards are used to provide a holistic view of the system across many services. Customizable views, as well as the ability to "dive in" when we see something anomalous, have improved the workflow for handling incidents.
Log filtering, pattern detection and grouping, and extracting values from logs for plotting on graphs all help to improve our ability to visualize what is going on in the system. The custom facets allow us to tailor the solution to fit our specific needs.
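As a rough illustration of the "extracting values from logs for plotting" idea, done offline in Python rather than with Datadog's own pipeline tools, the sketch below pulls a hypothetical duration_ms field out of raw lines and averages it per minute; the log format is invented.

```python
import re
from collections import defaultdict

# Invented log format: lines start with "HH:MM:SS" and may contain a
# "duration_ms=<int>" field we want to graph.
DURATION_RE = re.compile(r"duration_ms=(\d+)")

def series_by_minute(lines):
    """Average the extracted duration_ms per minute, keyed by HH:MM."""
    buckets = defaultdict(list)
    for line in lines:
        m = DURATION_RE.search(line)
        if m:
            minute = line[:5]  # assumes lines start with "HH:MM"
            buckets[minute].append(int(m.group(1)))
    return {k: sum(v) / len(v) for k, v in buckets.items()}

logs = [
    "12:00:01 GET /api/items duration_ms=120",
    "12:00:45 GET /api/items duration_ms=80",
    "12:01:10 GET /api/users duration_ms=200",
]
points = series_by_minute(logs)  # one averaged data point per minute
```

In Datadog itself, the equivalent is a parsing rule in a log pipeline plus a measure facet, which is what makes the extracted value graphable.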
There are some areas on log filtering screens where the user interface can take some getting used to. Perhaps having the option for a simple vs advanced user interface would be helpful in making new or less experienced users comfortable with making their own custom queries.
Maybe it is just how our system is configured, yet finding the valid values for a key/value pair is not always intuitively obvious to me. While there is a pop-up window with historical or previously used values and saved views from previous query runs, I don't see a simple list or enumeration of the set of valid values for keys that have such a restriction.
I've used the solution for one year.
The solution is very stable.
The product is reasonably scalable, although costs can get out of hand if you aren't careful.
I have not had the need to contact support.
Neutral
We did use AWS CloudWatch. It was too awkward to use effectively and simply didn't have the features.
We had someone experienced do the initial setup. However, with a little training, it wasn't too bad for the rest of us.
We handled the setup in-house.
Take care with how you extract custom values from logs. It's easy to do things without much thought to make your life easier and not realize how expensive it has become compared to where you started.
I'm not aware of evaluating other solutions.
Overall I recommend the solution. Just be mindful of costs.
We utilize Datadog to monitor both some legacy products and a new PaaS solution, a microservice architecture, that we are building out here at Icario.
All of our infrastructure is in AWS, with very few legacy systems remaining on Rackspace. For the PaaS, we mainly utilize the K8s Orchestrator, which implements the APM libraries in services deployed there and gives us infrastructure info regarding the cluster.
For legacy products, we mainly utilize the Agent or the AWS integration, with APM in specific places. We monitor mainly production in legacy and the full scope in the PaaS for now.
Datadog has greatly improved the time needed to investigate issues by putting everything into a single pane of glass, allowing us to get ahead of infra/app-based issues before they affect the customer experience with our products.
Outside of that, the ease of management, deployment of agents, integrations, etc. has greatly helped the teams. There isn't much legwork needed by the devs to manage or deploy Datadog in their stacks, thanks to the use of Terraform, pipelines, and the orchestrator. All in all, it has been an improvement.
The two most valuable aspects are the Terraform provider for Datadog and the K8s Orchestrator. People don't take that into account when buying into a tooling product like Datadog in this age, where scalability, management, and ease of implementation are key. Other tools not having good IaC products or options is a ball drop. Orchestration for the tool's agent is good. Not having to use another tool to manage the agents and config files in multiple places/instances is a huge win!
A big problem with Datadog is the billing. They need to make the billing more user-friendly. I know it like the back of my hand at this point, yet explaining to the C-suite why costs went up, or are what they are, is many times more complicated than it needs to be. I often can't even say why, due to the lack of metadata tied to billing. For instance, with AWS Integration Host ingestion, I can't say, "This month these hosts got added, and that's what caused costs to go up." The billing visibility really needs to be resolved!
I've used the solution for more than four years.
Datadog has always been extremely stable, with outages really only ever creating delays, never actual downtime of the service, which is amazing and impressive.
The solution is very scalable if implemented right and not on top of complicated architecture.
Support is excellent. They are always looking for a resolution, and a ticket is never left unresolved unless the feature just can't exist or isn't currently possible.
Positive
We did have New Relic, Datadog, Sumo Logic, Pingdom, and some other custom or third-party tooling. We switched because we wanted everything to be in a single pane and because Datadog is a better solution than the competitors.
For us, setup is a mixed bag, as we support legacy apps and architectures as well as a new microservice architecture. That being said, legacy is somewhat complex just due to the nature of how those apps stack and the underlying infra, configuration, and setup. Microservice is a breeze and straightforward for most of the out-of-the-box stuff.
Our team of SRE engineers, platform engineers, and cloud engineers implemented the solution.
I can't really speak to ROI; however, from my perspective, we definitely get our money's worth from the product.
Users just really need to make sure they stay on top of costs and don't let all of the engineers do as they please. Billing with Datadog can get out of hand if you let it. Not everything needs to be monitored.
We didn't really need to evaluate other options.
We use Datadog as our main log ingestion source, and Datadog is one of the first places we go to for analyzing logs.
This is especially true for debugging, monitoring, and alerting on errors and incidents, as we send traffic logs from K8s, Amazon Web Services, and many other services at our company to Datadog. In addition, many products and teams at our company have dashboards for monitoring statistics (sometimes based on these logs directly; other times we set up queries for these metrics) to alert us if there are any errors or health issues.
Overall, at my company, Datadog has made it easy to search for and look up logs at an impressively quick search rate over a large amount of logs.
It seamlessly allows you to set up monitoring and alerting directly from log queries, which is convenient and makes for a good user experience. While there is a bit of a learning curve, given enough time, the majority of my company now uses Datadog as the first place to check when there are errors or bugs.
However, the cost aspect of Datadog is tricky to gauge because it's related to usage, and thus, it is hard to tell the relative value of Datadog year to year.
The feature I've found most valuable is the log search feature. It's set up with our ingestion to be a quick one-stop shop, is reliable and quick, and seamlessly integrates into building custom monitors and alerts based on log volume and timeframes.
As a result, it's easy to leverage this to triage bugs and errors, since we can pinpoint the logs around the time that they occur and get metadata/context around the issue. This is the main feature that I use the most in my workflow with Datadog to help debug and triage issues.
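As a sketch of the log-query-to-monitor flow described above, the helper below assembles a query string in Datadog's log alert monitor format; the search terms, window, and threshold are hypothetical examples, not our actual monitors.

```python
# Hedged sketch: composing a Datadog "log alert" monitor query from a log
# search. The service, window, and threshold here are invented.
def build_log_alert_query(search: str, window: str, threshold: int) -> str:
    return (
        f'logs("{search}").index("*").rollup("count")'
        f'.last("{window}") > {threshold}'
    )

query = build_log_alert_query("service:checkout status:error", "5m", 100)
```

The same string that drives the Log Explorer search becomes the body of the monitor, which is what makes the one-stop-shop workflow possible.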
More helpful log search keywords/tips would improve Datadog's log dashboard. I recently struggled a lot to parse text from raw log lines that didn't match directly with facets. There may be smart searching capabilities for this, but it's not intuitive to learn how to leverage them, and I instead had to resort to a Python script to do some simple regex parsing. (I was trying to parse "file:folder/*/*" from the logs yet didn't seem to be able to do this in Datadog; maybe I'm just not familiar enough with the logs, but I didn't seem to easily find resources on how to do this either.)
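For reference, a minimal version of that kind of Python regex fallback might look like the following; the log lines and the exact pattern are invented, mirroring the "file:folder/*/*" shape mentioned above:

```python
import re

# Invented log format: pull "file:folder/*/*"-style paths out of raw lines
# that don't map cleanly to facets.
FILE_RE = re.compile(r"file:(folder/[^/\s]+/[^/\s]+)")

def extract_paths(lines):
    """Return every folder/<x>/<y> path found after a file: prefix."""
    hits = []
    for line in lines:
        m = FILE_RE.search(line)
        if m:
            hits.append(m.group(1))
    return hits

logs = [
    "2024-05-01 INFO opened file:folder/reports/q1.csv",
    "2024-05-01 INFO heartbeat ok",
    "2024-05-01 WARN slow read file:folder/exports/all.json",
]
paths = extract_paths(logs)
```

Exporting the matching logs and running a script like this was quicker than working out whether the log search syntax supports that wildcard-within-a-value match.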
I've used the solution for 10 months.
Beware that the cost will fluctuate (and it often only gets more expensive very quickly).
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
The centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing is great, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most. And the ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
While the documentation is very good, there are areas that need a lot of focus to pick up on the key details. In some cases the screenshots don't match the text when updates are made.
I spent longer than I should have trying to figure out how to correlate logs to traces, mostly related to environment variables.
I've used the solution for about three years.
We have been impressed with the uptime.
It's scalable and customizable.
Support is helpful. They help us tune our committed costs and alert us when we start spending out of the on-demand budget.
We used a mix of SolarWinds, UptimeRobot, and GitHub Actions. We switched to find one platform that could give deep app visibility.
Setup is generally simple. .NET Profiling of IIS and aligning logs to traces and profiles was a challenge.
We implemented the solution in-house.
There has been significant time saved by the development team in terms of assessing bugs and performance issues.
I'd advise others to set up live trials to assess cost scaling. Small decisions around how monitors are used can have big impacts on cost scaling.
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
We are excited to dig further into the new offerings around LLM and continue to grow our footprint in Datadog.
Our primary use case for Datadog involves utilizing its dashboards, monitors, and alerts to monitor several key components of our infrastructure.
We track the performance of AWS-managed Airflow pipelines, focusing on metrics like data freshness, data volume, pipeline success rates, and overall performance.
In addition, we monitor Looker dashboard performance to ensure data is processed efficiently. Database performance is also closely tracked, allowing us to address any potential issues proactively. This setup provides comprehensive observability and ensures that our systems operate smoothly.
Datadog has significantly improved our organization by providing a centralized platform to monitor all our key metrics across various systems. This unified observability has streamlined our ability to oversee infrastructure, applications, and databases from a single location.
Furthermore, the ability to set custom alerts has been invaluable, allowing us to receive real-time notifications when any system degradation occurs. This proactive monitoring has enhanced our ability to respond swiftly to issues, reducing downtime and improving overall system reliability. As a result, Datadog has contributed to increased operational efficiency and minimized potential risks to our services.
The most valuable features we’ve found in Datadog are its custom metrics, dashboards, and alerts. The ability to create custom metrics allows us to track specific performance indicators that are critical to our operations, giving us greater control and insights into system behavior.
The dashboards provide a comprehensive and visually intuitive way to monitor all our key data points in real-time, making it easier to spot trends and potential issues. Additionally, the alerting system ensures we are promptly notified of any system anomalies or degradations, enabling us to take immediate action to prevent downtime.
Beyond the product features, Datadog’s customer support has been incredibly timely and helpful, resolving any issues quickly and ensuring minimal disruption to our workflow. This combination of features and support has made Datadog an essential tool in our environment.
One key improvement we would like to see in a future Datadog release is the inclusion of certain metrics that are currently unavailable. Specifically, the ability to monitor CPU and memory utilization of AWS-managed Airflow workers, schedulers, and web servers would be highly beneficial for our organization. These metrics are critical for understanding the performance and resource usage of our Airflow infrastructure, and having them directly in Datadog would provide a more comprehensive view of our system’s health. This would enable us to diagnose issues faster, optimize resource allocation, and improve overall system performance. Including these metrics in Datadog would greatly enhance its utility for teams working with AWS-managed Airflow.
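Until such metrics are built in, one hedged workaround is to submit them ourselves as custom gauges over the DogStatsD wire protocol (UDP, default port 8125), which the Datadog Agent accepts; the metric name and tag below are invented for illustration.

```python
import socket

# Sketch: format and send a custom gauge using the DogStatsD datagram
# format "name:value|g|#tags". The metric name and tag are hypothetical.
def dogstatsd_gauge(name: str, value: float, tags: list[str]) -> bytes:
    datagram = f"{name}:{value}|g|#{','.join(tags)}"
    return datagram.encode("utf-8")

def send_gauge(name, value, tags, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget, so this won't raise if no Agent is listening.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(dogstatsd_gauge(name, value, tags), (host, port))
    sock.close()

msg = dogstatsd_gauge("airflow.worker.cpu_pct", 72.5, ["component:worker"])
```

A sidecar or scheduled task reading worker resource usage and calling send_gauge would make these values graphable alongside the managed-Airflow metrics Datadog already collects, at the usual custom-metric cost.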
I've used the solution for four months.
The stability of Datadog has been excellent. We have not encountered any significant issues so far.
The platform performs reliably, and we have experienced minimal disruptions or downtime. This stability has been crucial for maintaining consistent monitoring and ensuring that our observability needs are met without interruption.
Datadog is generally scalable, allowing us to handle and display thousands of custom metrics efficiently. However, we’ve encountered some limitations in the table visualization view, particularly when working with around 10,000 data points. In those cases, the search functionality doesn’t always return all valid results, which can hinder detailed analysis.
Datadog's customer support plays a crucial role in easing the initial setup process. Their team is proactive in assisting with metric configuration, providing valuable examples, and helping us navigate the setup challenges effectively. This support significantly mitigates the complexity of the initial setup.
We used New Relic before.
The initial setup of Datadog can be somewhat complex, primarily due to the learning curve associated with configuring each metric field correctly for optimal data visualization. It often requires careful attention to detail and a good understanding of each option to achieve the desired graphs and insights.
We implemented the solution in-house.