Has resolved user errors faster by reviewing behavior with replay features
What is our primary use case?
My main use case for Datadog involves projects related to our sales reps registering new clients. While they beta test a product we're rolling out, I use Datadog to pull up their sessions to see where their errors are occurring and what their behavior was leading up to them.
I can't recall all of the specific details, but a sales rep was running into a particular error message during the sales registration process, and they weren't giving us many screenshots or other error information to help us troubleshoot. I went into Datadog, looked at the timestamp, reviewed the actual steps they took in our platform during registration, and was able to determine the cause of the error. If I remember correctly, it was user error; they were clicking something incorrectly.
One thing I've seen in my main use case for Datadog is an optional add-on: the ability to track behavior based on user ID. I'm not sure whether our team has turned that on, but I think it's a really valuable feature, especially with real-time user monitoring where you can watch the replay. Because so many users are on our platform, the ability to filter replay videos by user ID would be much more helpful. When we're testing a specific product that we're rolling out, we start with smaller beta tests, so being able to filter by the user IDs of the beta testers would be far more useful than looking at every interaction in Datadog as a whole.
What is most valuable?
The best features Datadog offers are the replay videos, which I find super helpful as someone who works in QA. So much of testing is looking at the UI, and being able to look back at the actual visual steps a user took is really valuable.
Datadog has impacted our organization positively in a major way, not just for me as a QA engineer with access to real-time replay, but for the whole team: all of us can access this data and see which parts of our system are causing the most errors or the most user frustration. I can't speak for everybody else because I don't know how each segment of the business is using it, but given how beneficial it's been to me, I can imagine it's beneficial to everybody else in seeing which areas of the system cause more frustration versus less.
What needs improvement?
I think Datadog can be improved, but I'm not totally sure how. Since my use case is pretty specific, I haven't used or even really explored all of the features Datadog offers, so I don't know where the gaps are in terms of features that should be there but aren't.
I will go back to the ability to filter based on user ID, which is an option that has to be set up by an organization; I would recommend presenting that as a first step in an organization's onboarding. As an organization gets bigger, or if it is already large when it starts using Datadog, troubleshooting specific scenarios becomes more difficult if you're sorting through such a large amount of data.
For how long have I used the solution?
I have been working in this role for a little over a year now.
What do I think about the stability of the solution?
As far as I can tell, Datadog has been stable.
What do I think about the scalability of the solution?
I believe we have about 500 or so employees in our organization using our platform, and Datadog seems to be able to handle that load sufficiently, as far as I can tell. So I think scalability is good.
How are customer service and support?
I haven't had an instance where I've reached out to customer support for Datadog, so I do not know.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
I do not believe we used a different solution previously for this.
What was our ROI?
I cannot say whether we have seen a return on investment; I'm not part of the leadership making that decision. Regarding time saved, in my specific use case as a QA engineer, Datadog probably didn't save me a ton of time, because there are so many replay videos to sort through to find the particular sales reps in our beta test group. That's why the ability to filter videos by user ID would be so much more helpful; features like that, which let you really narrow down and filter the type of frustration or user interaction you're looking for, would provide a lot of time savings. But I don't think I'm qualified to answer the ROI question directly.
Which other solutions did I evaluate?
I was not part of the decision-making process before choosing Datadog, so I cannot speak to whether we evaluated other options.
What other advice do I have?
Right now our users are in the middle of the beta test. At the beginning of the rollout, I probably used the replay videos more, as users were still getting familiar with the tool and running into more errors than they do now. So it ebbs and flows: at the beginning of a test I use it pretty frequently, and then less often as the test goes on.
It does help resolve issues faster, especially because our sales reps are used to working really quickly through sales registration. As they race through it, they're more likely to accidentally click something incorrectly without fully paying attention, because they're used to their flow. Being able to go back, watch the replay, and see that a person clicked one button when they intended to click another, or to identify the action that caused an error rather than relying on their memory, makes resolution faster.
I have not noticed measurable outcomes in terms of reduced support tickets or faster resolution times since I started using Datadog. For the users in our beta test group, none of the issues came through support tickets; they came from messages in Microsoft Teams with the people in the beta group. We have seen fewer messages related to the beta test as users have become more familiar with the tool. Now that they know their flow during the beta may differ from their usual flow, they send fewer messages, probably because they're being more careful or have figured out the points that would result in an error.
My biggest piece of advice for others looking into Datadog is to use the filters based on user ID; it will save so much time when troubleshooting specific error interactions or occurrences. I would also suggest a simpler UI for less technical people. For example, on logging into Datadog, the dashboard is pretty overwhelming with all of the bar charts and options; a simplified view for people who aren't looking for all of the data options, and a more technical view for people who want more granular data, would be helpful.
I rate Datadog 10 out of 10.
Which deployment model are you using for this solution?
Public Cloud
Has improved our ability to identify cloud application issues quickly using trace data and detailed log filtering
What is our primary use case?
My team and I primarily rely on Datadog for logs from our application to identify issues in our cloud-based solution. We take the requests and error information reported by our customers and use it to identify what the errors are within our back-end systems, allowing us to submit code fixes or configuration changes.
I had an error this morning when trying to submit an API request that just said "unspecified error" in the web interface. I took the request ID and filtered a facet of our logs on that request ID, which gave me the specific examples and let me look at the logged code stack to identify what specifically was failing to convert when uploading that data.
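The kind of facet filtering described above can be sketched as a log search query. Below is a minimal Python helper, assuming a hypothetical `@request_id` facet; real facet names depend on how an organization's log pipelines are configured, so treat the names here as placeholders:

```python
def build_log_query(request_id, service=None, status=None):
    """Build a Datadog-style log search query that filters on a
    request-ID facet, optionally narrowed by service and status.

    The @request_id facet name is illustrative; substitute whatever
    facet your log pipeline actually defines.
    """
    parts = [f"@request_id:{request_id}"]
    if service:
        parts.append(f"service:{service}")
    if status:
        parts.append(f"status:{status}")
    return " ".join(parts)

# Example: narrow the search to one failing request in one service.
query = build_log_query("abc-123", service="api-gateway", status="error")
print(query)  # @request_id:abc-123 service:api-gateway status:error
```

A query string like this can be pasted into the Log Explorer search bar; filtering on an indexed facet is typically much faster than a free-text search over the same logs.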
My team doesn't utilize Datadog logs very often, but we do have quite a few collections of dashboards and widgets that tell us the health of the various API requests coming through our application and identify known issues with some of our product integrations. It's useful information, but it's not necessarily something our team monitors directly, as we're more of a reactionary team.
What is most valuable?
The best features Datadog offers, in my experience, are the ability to filter down by facets very quickly to identify the problems we're experiencing with our individual customers using our cloud application. I really enjoy the trace option so that I can see all of the various components and how they communicate with each other to see where the failures are occurring.
The trace option helps us spot issues by giving access to see if the problem is occurring within our Java components or if it's a result of the SQL queries, allowing us to look at the SQL queries themselves to identify what information it's trying to pull. We can also look at other integrations, whether that's serverless Lambda functions or different components from our outreach.
Datadog has impacted our organization positively because the general feeling is that it's superior to the ELK stack that we used to use, being significantly faster in searching and filtering the information down, as well as providing links to our search criteria that our development teams and cloud operations teams can use to look at the same problems without having to set up their own search and filter criteria.
What needs improvement?
For the most part, the issues that we come across with Datadog are related to training for our organization. Our development and operations teams have done a really good job of getting our software components into Datadog, allowing us to identify them. However, we do have reduced logging in our Datadog environment due to the amount of information that's going through.
The hardest thing we experience is just training people on what to search for when identifying a problem in Datadog, and having some additional training that might be easily accessible would probably be a benefit.
At this point, I don't know what I don't know, so while there may be room for improvement, Datadog works very well for the things we currently use it for. More easily accessible training would be extremely helpful, perhaps something within the user interface itself that could point us to useful information or show how to tie different components together or build a good dashboard.
For how long have I used the solution?
I have worked for Calabrio for 13 years.
What do I think about the stability of the solution?
What do I think about the scalability of the solution?
Datadog's scalability is strong; we've continued to significantly grow our software, and there are processes in place to ensure that as new servers, realms, and environments are introduced, we're able to include them all in Datadog without noticing any performance issues. The reporting and search functionality remain just as good as when we had a much smaller implementation.
Which solution did I use previously and why did I switch?
Previously, we used the ELK stack—Elasticsearch, Logstash, and Kibana—to capture data. Our cloud operations team set that up because they were familiar with it from previous experiences. We stopped using it because as our environment continued to grow, the response times and the amount of data being kept reached a point where we couldn't effectively utilize it, and it lacked the capability to help us proactively identify issues.
What other advice do I have?
A general impression is that Datadog saves time because searching, even over the vast number of AWS realms and time spans we have, is significantly faster than the other solutions I've used for similar purposes.
I would advise others looking into using Datadog to identify various components within their organization that could benefit from pulling that information in and how to effectively parse and process all of it before getting involved in a task, so they know what to look for. Specifically, when searching for data, if a metric can be pulled out into an individual facet and used, the amount of filtering that can be done is significantly improved compared to a general text search.
I would love to figure out how to use Datadog more effectively in the organization work that I do, but that is a discussion I need to have with our operations and research and development teams to determine if it can benefit the customer or the specific implementation software that I work with.
On a scale of one to ten, I rate Datadog a ten out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Cross-functional teams have gained clearer insight into funding delays through simplified data dashboards
What is our primary use case?
My main use case for Datadog is to analyze data regarding instant funding.
A specific example of how I use Datadog for instant funding data is understanding how long it takes for an application to be processed, approved, and then instantly funded; how many applications there are; and whether there are any holdups on the applications.
We are identifying the reason behind a hold-up for instant funding and possibly why some applications do not get instantly funded. Datadog helps us identify those weak areas.
How has it helped my organization?
Datadog has significantly improved our organization’s visibility into system performance and application health. The real-time dashboards and alerting capabilities have helped our teams detect issues faster, reduce downtime, and improve response times. It’s also made collaboration between engineering and operations smoother by providing a shared view of metrics and logs in one place.
What is most valuable?
In my experience, the best features Datadog offers include the layout of the reporting, which is user-friendly; for those who are not familiar with data, the visual impact helps.
The layout and reporting are user-friendly; there is one dashboard in particular that I use the most.
Datadog has positively impacted my organization by allowing cross-functional teams who do not necessarily work directly with data to understand and take in the data points in simplified form.
Those cross-functional teams now use the data by reviewing these reports, and they are able to identify weak spots to improve the application process cross-functionally.
What needs improvement?
Areas for improvement:
Datadog could improve in dashboard usability and data correlation across products. While it’s powerful, the interface can feel cluttered and overwhelming for new users. Streamlining navigation and offering simpler default dashboards would help teams ramp up faster.
Additional features for next release:
It would be great to see stronger AI-driven anomaly detection and predictive analytics to help identify potential issues before they impact performance. Improved cost management insights or forecasting tools would also help teams monitor usage and control expenses more effectively.
For how long have I used the solution?
I have been using Datadog for roughly six months.
What do I think about the stability of the solution?
What do I think about the scalability of the solution?
Regarding Datadog's scalability, we have not scaled yet, but we are in the process of continuously scaling up, so we will find out in the near future.
How are customer service and support?
The customer support of Datadog is amazing.
I would rate the customer support a definite 10, as friendliness is top tier.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
I previously used a different solution, and we switched due to inconsistencies. The previous solution was also inaccurate and unreliable.
What was our ROI?
I have seen a return on investment in terms of time saved. I don't have metrics on hand for that answer, but there has been time saved due to the Datadog output.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing has been that all were fair.
Which other solutions did I evaluate?
Before choosing Datadog, I evaluated other options, but I'd prefer not to name them.
What other advice do I have?
I don't have anything else to mention about the features, including integrations, alerts, or ease of setup.
I am unsure what advice I would give to others looking into using Datadog.
On a scale of one to ten, I rate Datadog a 10.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Google
Has improved response times and streamlined daily threat monitoring across teams
What is our primary use case?
My main use case for Datadog is the security aspect of it, utilizing the SIEM and the cloud security features. I use it every day monitoring different types of logs and reports that come through, managing most of the alerts that populate from our different applications and software, and it's been a good ride.
How has it helped my organization?
Datadog has impacted my organization positively because it tracks all the logs and supports our security features. We use Datadog across basically all of our other teams, including engineering, code, and APIs, among many other available features, and my peers always say something good about it.
Datadog has helped my organization improve response times a lot because we get alerts the minute something happens, which is our main means of reducing incident response time. I also appreciate how it supports remediation efforts, allowing us to implement different playbooks while constantly updating with new threats and vulnerabilities, keeping us safe.
What is most valuable?
One of the best features I appreciate is the Cloud SIEM. I've used many SIEMs in my experience, but until I got to this company, I never had the chance to really see how Datadog works. This organization showed me how easy it was, and Datadog has a really good, easily navigable UI that helps us teach new team members quickly.
My experience with the Cloud SIEM specifically is that it works 24/7 and stands out for its easy UI, which helps onboard new members, who enjoy it and are able to pick it up quickly without any prior knowledge.
Datadog's cloud security features and alerts have helped us in situations where we get numerous account lockouts or compromised accounts. Datadog surfaced the alerts and triggered notifications to our team very promptly, before things got worse, allowing us to adjust and remediate the situation in time.
What needs improvement?
Something I would appreciate seeing from Datadog is more events focused on the networking aspect, which would let me see what others are using. I enjoy showing up to those events and exploring the new features they are releasing as well.
I think Datadog has been performing excellently with no areas that need improvement, as they've been doing great and I want them to keep striving to do better.
For how long have I used the solution?
I'm fairly new with Datadog, having used it for the past year and a half, almost two years now, and it's been going really well.
What do I think about the stability of the solution?
Datadog is very stable; there hasn't been any downtime or issues since I've been here, and it's always on time. I would appreciate seeing it as a mobile app for quicker issue tracking.
What do I think about the scalability of the solution?
Datadog has definitely kept up with our growth.
How are customer service and support?
I've had a couple of instances where I reached out to Datadog's support team, and they have been extremely helpful and very kind, even reaching back out after resolving my issues to check that everything's going well.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
I was not here during the time they onboarded Datadog or looked for different solutions, so I'm not aware of which solution we used before.
What was our ROI?
I cannot share any metrics regarding return on investment.
What's my experience with pricing, setup cost, and licensing?
Pricing is fairly affordable, and the setup cost has been good, while licensing has been well maintained, making it pretty great.
Which other solutions did I evaluate?
I'm certain they did their research and looked around at many different options, but I cannot speak on their behalf regarding which they chose or had competition with.
What other advice do I have?
My advice for others looking into using Datadog is to honestly give yourself a week or two to explore all of the application's features, as there are quite a lot of amazing features to learn and utilize, making it not just software to monitor threats but also a tool to enhance your knowledge in this industry. I rate Datadog 10 out of 10.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Has created intuitive dashboards and streamlined monitoring across teams
What is our primary use case?
Our main use case for Datadog is collecting metrics, specifically things such as latency metrics and error metrics for our services at Procore.
To give a specific example of how I use Datadog for those metrics in my daily work: I had to create a new service, an API, to solve a particular problem. I used Datadog to get metrics on successful requests, failed requests, and 400 responses. I then created dashboards showing those metrics along with some latency metrics from the API, and I also built a monitor that triggers and sends an alert whenever we're over a certain number of the failure metrics.
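A monitor of the kind described above can be sketched as a metric-alert definition. This is a hedged illustration: the metric name, tag values, and threshold below are placeholders, not the reviewer's actual service. A payload shaped like this could then be submitted through Datadog's Monitors API or its Python client:

```python
def metric_alert_payload(metric, threshold, env="prod", window="last_5m"):
    """Build a payload for a metric-alert monitor that fires when the
    count of a failure metric exceeds a threshold over a time window.

    The metric name (e.g. myapi.requests.failure) and the env tag are
    hypothetical; substitute the names your service actually emits.
    """
    query = (
        f"sum({window}):sum:{metric}{{env:{env}}}.as_count() > {threshold}"
    )
    return {
        "type": "metric alert",
        "name": f"High failure count on {metric}",
        "query": query,
        "message": "Failure count over threshold. Notify the on-call team.",
        "options": {"thresholds": {"critical": threshold}},
    }

payload = metric_alert_payload("myapi.requests.failure", 50)
print(payload["query"])
# sum(last_5m):sum:myapi.requests.failure{env:prod}.as_count() > 50
```

Building the query as a string first also makes it easy to preview the alert condition in the Metrics Explorer before creating the monitor.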
How has it helped my organization?
The single biggest improvement has been breaking down the silos between our teams. Before we adopted it, our developers, operations, and SRE teams all lived in separate tools. Ops had their infrastructure graphs, Devs had their log files, and no one had a complete picture.
Here’s where we’ve seen the most significant impact:
- We Find and Fix Problems Drastically Faster: The "single pane of glass" is a real thing for us. When an alert fires, our on-call engineer can see the infrastructure metric spike (like CPU), pivot directly to the application traces (APM) running on that host, and see the exact, correlated logs from the services causing the problem, all in one place. We've cut our Mean Time to Resolution (MTTR) significantly because we're no longer "swivel-chairing" between three different tools trying to manually line up timestamps.
- We Are More Proactive and Less Reactive: Features like Watchdog (its anomaly detection) have been crucial. We've been alerted to a slow-building memory leak and an abnormal spike in error rates on a specific API endpoint before they breached our static thresholds and caused a user-facing outage. It's helped us move from a "firefighting" culture to one where we can catch problems before they escalate.
What is most valuable?
The best features of Datadog include great dashboards, a super simple and easy-to-use Python library, and easy-to-build monitors, which together provide a really great UI experience.
What makes the dashboard and Python library stand out for me is that they save a lot of time, getting right to the point and being super intuitive.
Datadog has positively impacted my organization by allowing us to have a link to a dashboard for most services.
We have dashboards across the company, which can easily be passed around, making it super easy for everyone to understand the metrics they are looking at.
What needs improvement?
Oh, that's a great question. We actually have a running list of things we'd love to see. Even though we get a ton of value from it, no tool is perfect. Our feedback generally falls into two categories: making the current experience less painful and adding new capabilities we think are the logical next step.
Honestly, our biggest frustrations aren't about a lack of features, but about the management of the platform itself.
- Cost Predictability and Governance: This is, without a doubt, our number one issue. It's not just that Datadog is expensive; it's that the cost is incredibly complex and hard to predict. Our bill can fluctuate wildly based on custom metrics, log ingestion, and traces from a new service. We've had to dedicate engineering time just to managing our Datadog costs, creating exclusion filters, and sampling aggressively, which feels like we're being punished for using the product more.
- How to improve it: We need a "cost calculator" inside the platform. Before I enable monitoring on a new cluster or turn on a new integration, I want Datadog to give me a concrete estimate of what it will cost. We also need better built-in tools for attributing costs back to specific teams or services before the bill arrives.
- The Steep Learning Curve and UI Density: The UI is incredibly powerful, but it's dense. For a senior SRE who lives in the tool all day, it's fine. For a new engineer or a developer who only jumps in during an incident, it's overwhelming. We've seen people "click in circles" trying to find a simple stack trace that's buried three layers deep. Building a "perfect" dashboard is still too much of an art form.
For how long have I used the solution?
I have been using Datadog for about five years.
What do I think about the stability of the solution?
Which solution did I use previously and why did I switch?
I did not previously use a different solution.
How was the initial setup?
I did not deal with any of the pricing, setup cost, or licensing.
What about the implementation team?
I do not know if we purchased Datadog through the AWS Marketplace.
What other advice do I have?
My advice to others looking into using Datadog is to just try it and see how easy it is to use. On a scale of 1-10, I rate Datadog a 10.
Which deployment model are you using for this solution?
Private Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Debugs slow performance with good support and a straightforward setup
What is our primary use case?
We use Datadog for monitoring the performance of our infrastructure across multiple types of hosts in multiple environments. We also use APM to monitor our applications in production.
We have some Kubernetes clusters and multi-cloud hosts with Datadog agents installed. We have recently added RUM to monitor our application from the user side, including session replays, and we're hoping to use those to replace our existing error monitoring and session replay for debugging issues in the application.
How has it helped my organization?
We have been using Datadog since I started at the company ten years ago, and it has been used for many reasons over the years. Across our services, Datadog has helped us debug slow performance in specific parts of our application, which in turn lets us provide a snappier, more performant application for our customers.
The monitoring and alerting system has allowed our team to be aware of issues that come up in our production system and react faster, with more tools to debug and investigate, keeping the system online for our customers.
What is most valuable?
Datadog infrastructure monitoring has helped us identify health issues with our virtual machines, such as high load, CPU, and disk usage, as well as monitoring uptime and alerting when Kubernetes containers have trouble staying up. Our use of Datadog's Application Performance Monitoring (APM) over the last six years or so has been crucial to identifying performance and bottleneck issues, as well as alerting us when services see high error rates, which has made it easier to debug when specific services may be going down.
What needs improvement?
We have found that some of the options for log ingestion filtering, APM trace and span ingestion, and RUM session versus replay settings can be hard to discover and tough to tune within the console, both for optimal performance and monitoring and for billing.
It can sometimes be difficult to determine which information in the documentation is current, as we have found inconsistencies such as deprecated environment variables still being documented.
For how long have I used the solution?
I've been using the solution for ten years.
What do I think about the stability of the solution?
The solution seems pretty stable, as we've been using it for more than a decade.
What do I think about the scalability of the solution?
The solution seems quite scalable, especially within Kubernetes. Costs are a factor.
How are customer service and support?
Support has been very helpful whenever we need it.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
We had tried other APM monitoring in the past; however, it was too expensive. We then added APM in Datadog, since we were already using Datadog and it seemed like a good value-add.
How was the initial setup?
The solution is straightforward to set up. Sometimes, it is complex to find the correct documentation.
What about the implementation team?
We handled the setup in-house.
What was our ROI?
Our ROI is ease of mind with alerts and monitoring, as well as the ability to review and debug issues for our customers.
What's my experience with pricing, setup cost, and licensing?
Getting settled on pricing is something you want to keep an eye on, as things seem to change regularly.
Which other solutions did I evaluate?
We used New Relic previously.
What other advice do I have?
Datadog is a great service that is continually growing its solution for monitoring and security. It is easy to set up, and easy to turn features on and off once you have instrumented agents and tailored the solution to your needs.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Other
Easy to configure with synthetic testing and offers a consolidated approach to monitoring
What is our primary use case?
We use this solution for enterprise monitoring across a large number of applications in multiple environments like production, development, and testing. It helps us track application performance, uptime, and resource usage in real time, providing alerts for issues like downtime or performance bottlenecks.
Our hybrid environment includes cloud and on-premise infrastructure. The solution is crucial for ensuring reliability, compliance, and high availability across our diverse application landscape.
How has it helped my organization?
Datadog has greatly improved our organization by centralizing all monitoring into one platform, allowing us to consolidate data from a wide range of sources.
From infrastructure metrics and application logs to end-user experience and device monitoring, everything is now collected and displayed in one place. This has simplified our monitoring processes, improved visibility, and allowed for faster issue detection and resolution.
By streamlining these operations, Datadog has enhanced both efficiency and collaboration across teams.
What is most valuable?
Synthetic testing is by far the most valuable feature in our organization. It’s highly requested since the setup process is both quick and straightforward, allowing us to simulate user interactions across our applications with minimal effort.
The ease of configuring tests and interpreting the results makes it accessible even to non-technical team members. This feature provides valuable insights into user experience, helps identify performance bottlenecks, and ensures that our critical workflows are functioning as expected, enhancing reliability and uptime.
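To give a sense of how quick the setup can be, here is a hedged sketch of a synthetic HTTP test expressed as the JSON body Datadog's Synthetics API accepts; the test name, URL, and notification channel are hypothetical.

```python
import json

# Minimal synthetic API (HTTP) test definition, sketched for illustration.
api_test = {
    "name": "Checkout page uptime",      # hypothetical test name
    "type": "api",
    "subtype": "http",
    "config": {
        "request": {"method": "GET", "url": "https://example.com/checkout"},
        "assertions": [
            # Fail the run unless the page returns HTTP 200...
            {"type": "statusCode", "operator": "is", "target": 200},
            # ...and responds within 2 seconds.
            {"type": "responseTime", "operator": "lessThan", "target": 2000},
        ],
    },
    "locations": ["aws:us-east-1"],      # one managed run location
    "options": {"tick_every": 300},      # run every 5 minutes
    "message": "Checkout page check failed. @slack-alerts",  # hypothetical
}

print(json.dumps(api_test, indent=2))
```

The same few fields are what the UI wizard fills in behind the scenes, which is why non-technical team members can configure tests comfortably.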
What needs improvement?
One area where the product could be improved is Application Performance Monitoring (APM). While it's a powerful feature, many in our organization find it difficult to fully understand and utilize to its maximum potential.
The data provided is comprehensive, yet it can sometimes be overwhelming, especially for those who are less familiar with the intricacies of application performance metrics.
Simplifying the interface, offering clearer guidance, or providing more intuitive visualizations would make it easier for users to extract valuable insights quickly and efficiently.
For how long have I used the solution?
I've used the solution for four years.
What do I think about the stability of the solution?
The solution is very stable. Issues happen once or twice a year and are usually solved before they have any real impact on the service.
What do I think about the scalability of the solution?
Scalability has never been a bottleneck for us; we've never run into issues here.
How are customer service and support?
Support was slow at the beginning; however, they are much more responsive now.
Which solution did I use previously and why did I switch?
Datadog offered the most consolidated approach to our monitoring needs.
How was the initial setup?
This was a migration project, so it was rather complex.
What about the implementation team?
We implemented the solution with our in-house team.
What's my experience with pricing, setup cost, and licensing?
I'd recommend new users look down the road and decide on at least a three-year plan.
Improved response time and cost-efficiency with good monitoring
What is our primary use case?
We monitor our multiple platforms using Datadog and post alerts to Slack to notify us of server and end-user issues. We also monitor user sessions to help troubleshoot an issue being reported.
We monitor 3.5 platforms on our Datadog instance, and the team continuously monitors the trends and dashboards we set up. We have two instances spanning the 3.5 platforms and are currently looking to implement more platform monitoring over time. User session monitoring is consistent for one of these platforms.
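As a rough sketch of how alerts like ours reach Slack (an illustration, not our exact configuration): a Datadog monitor's notification message can @-mention a connected Slack channel, and Datadog routes the alert there. The metric name, threshold, and channel name below are hypothetical.

```python
# Sketch of a Datadog monitor definition that posts alerts to Slack.
# The "@slack-..." mention syntax in the message is how Datadog routes
# notifications once the Slack integration is connected.
monitor = {
    "name": "High error rate on API servers",   # hypothetical monitor
    "type": "metric alert",
    # Alert when the 5-minute error count exceeds 50 in production.
    "query": "sum(last_5m):sum:app.errors{env:prod}.as_count() > 50",
    "message": (
        "Error count is above threshold on {{host.name}}. "
        "@slack-platform-alerts"                # hypothetical channel
    ),
    "options": {"thresholds": {"critical": 50}, "notify_no_data": False},
}

print(monitor["query"])
```

The same definition shape works whether the monitor is created in the UI, via the API, or through infrastructure-as-code tooling.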
How has it helped my organization?
Datadog has improved our response time and cost-efficiency in bug reporting and server maintenance. We're able to track our servers more fluidly, allowing us to expand our outreach and decrease response time.
There are many different ways that Datadog is used, and we monitor three and a half platforms on the Datadog environment at this time. By monitoring all of these platforms in one easy-to-use instance, we're able to track the platform with the issue, the issue itself, and its impact on the end user.
What is most valuable?
The server monitoring, service monitoring, and user session monitoring are extremely helpful, as they alert us ahead of time to issues that users might experience. More often than not, an issue is not only identified but also fixed and released before an end user notices it.
We are currently using this as an investigative tool to notice trends, identify issues, and locate areas of our program that we can improve upon that haven't been identified as pain points yet. This is another effective use case.
What needs improvement?
I would like to see a longer retention time for user sessions, even if only by 24 to 48 hours, or at least the option to configure it. That would let us store user sessions for longer and identify issues that have remained invisible, including problems that people are quietly working around.
I would also like to see an improvement in the server's data extraction times, as sometimes it can take up to ten minutes to download a report for a critical issue that is costing us money. Regardless, I am very happy with Datadog and love the uses we have for the program so far.
For how long have I used the solution?
I've used the solution for more than four years.
Which solution did I use previously and why did I switch?
We did not previously use a different solution.
Improves monitoring and observability with actionable alerts
What is our primary use case?
We are using Datadog to improve our monitoring and observability so we can hopefully improve our customer experience and reliability.
I have been using Datadog to build better, actionable alerts to help teams across the enterprise. By using Datadog, we are hoping to improve observability into our apps, and we are also taking advantage of this process to improve our tagging strategy so teams can troubleshoot incidents faster and achieve a much-reduced mean time to resolution.
We use a lot of different resources, such as Kubernetes, App Gateway, and Cosmos DB, to name a few.
How has it helped my organization?
As soon as we started implementing Datadog in our cloud environment, people really liked how it looked and how easy it was to navigate. We could see more data in our Kubernetes environments than we ever could before.
Some people liked how the logs were color-coded, making it easy to see what kind of log you are looking at. The ease of making dashboards has also been well received.
People have commented that there is so much information that it takes time to digest it and to get used to finding what you are looking for.
What is most valuable?
The selection of monitors is a big feature I have been working with. Previously, with Azure Monitor, we couldn't do a whole lot with its alerts: log alerts could take a while to ingest, and we couldn't do any math on the metrics we derived from logs to build better alerts.
Azure's metric alerts are okay but still very limited. With Datadog, we can build a wide range of monitors and tweak them in real time, because a graph of the data is displayed as you create the alert, which is very beneficial. The ease of making dashboards has saved a lot of people a lot of time: there are no KQL queries to put together the information you are looking for, and the ability to pin anything you see to a dashboard is very convenient.
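To illustrate the kind of cross-metric arithmetic that was awkward in Azure Monitor, here is a sketch of a Datadog monitor query that divides one metric by another; the metric names and the 5% threshold are hypothetical.

```python
# Hypothetical request metrics; .as_count() keeps them as raw counts
# rather than time-normalized rates.
errors = "sum:http.requests.errors{env:prod}.as_count()"
total = "sum:http.requests.total{env:prod}.as_count()"

# Alert when errors exceed 5% of all requests over the last 5 minutes.
# The live graph in the monitor editor previews this expression as you
# tweak it, which is the real-time feedback described above.
query = f"sum(last_5m):({errors} / {total}) * 100 > 5"
print(query)
```

The same expression can be pinned to a dashboard as a query-value or timeseries widget without writing anything KQL-like.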
RUM is another feature we are looking forward to using this upcoming tax season, as we will have a front-row view into what frustrates customers or where things go wrong in their process of using our site.
What needs improvement?
The PagerDuty integration could be a little better. If there were a way to map monitor fields to different incident-management software, that would be awesome. As of right now, it takes a lot of manipulation in PagerDuty to get the monitors coming from Datadog to populate all the fields we want.
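One possible workaround, sketched here as an assumption rather than a documented recipe: drive PagerDuty through Datadog's generic Webhooks integration with a custom payload template shaped for the PagerDuty Events API v2, which gives direct control over the fields. The $-prefixed names are Datadog webhook template variables; the routing key is a placeholder.

```python
import json

# Custom webhook payload template targeting the PagerDuty Events API v2.
# Datadog expands the $-variables per alert before sending the webhook.
payload_template = {
    "routing_key": "YOUR_PAGERDUTY_ROUTING_KEY",  # placeholder
    "event_action": "trigger",
    "payload": {
        "summary": "$EVENT_TITLE",      # expands to the monitor title
        "source": "$HOSTNAME",          # host that triggered the alert
        "severity": "critical",
        "custom_details": {"status": "$ALERT_STATUS", "link": "$LINK"},
    },
}

print(json.dumps(payload_template, indent=2))
```

This trades the turnkey PagerDuty integration for full control over which fields get populated.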
I love the fact that you can query data without using something like KQL. However, it would also be helpful if there were a way to convert a complex KQL query into a Datadog query that retrieves the same data, especially for the very specific scenarios some app teams may want to look for.
For how long have I used the solution?
I've used the solution for about two years.
Which solution did I use previously and why did I switch?
We previously used Azure Monitor, App Insights, and Log Analytics. We switched because it was a lot for developers and SREs to jump between three screens to troubleshoot, and when you add in Azure's slow load times, it could take a while to get things done.
What's my experience with pricing, setup cost, and licensing?
I would advise taking a close look at logging costs, man-hours needed, and the amount of time it takes for people to get comfortable navigating Datadog because there is so much information that it can be overwhelming to narrow down what you need.
Which other solutions did I evaluate?
We did evaluate DynaTrace and looked into New Relic before settling on Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
Centralized pipeline with synthetic testing and a customized dashboard
What is our primary use case?
Our primary use case is custom and vendor-supplied web application log aggregation, performance tracing and alerting.
We run a mix of AWS EC2, Azure serverless, and colocated VMware servers to support higher-education web applications. Managing a hybrid multi-cloud solution across hundreds of applications is always a challenge.
Datadog agents on each web host, together with native integrations with GitHub, AWS, and Azure, get all of our instrumentation and error data in one place for easy analysis and monitoring.
How has it helped my organization?
Through the use of Datadog across all of our apps, we were able to consolidate a number of alerting and error-tracking apps, and Datadog ties them all together in cohesive dashboards.
Whether the app is vendor-supplied or we built it ourselves, the depth of tracing, profiling, and hooking into logs is obtainable and tunable. Both legacy .NET Framework with Windows Event Viewer and cutting-edge .NET Core with streaming logs work. The breadth of coverage for any app type or situation is really incredible. It feels like there's nothing we can't monitor.
What is most valuable?
Centralized pipeline tracking and error logging provide a comprehensive view of our development and deployment processes, making it much easier to identify and resolve issues quickly.
Synthetic testing has been a game-changer, allowing us to catch potential problems before they impact real users. Real user monitoring gives us invaluable insights into actual user experiences, helping us prioritize improvements where they matter most.
The ability to create custom dashboards has been incredibly useful, allowing us to visualize key metrics and KPIs in a way that makes sense for different teams and stakeholders.
These features form a powerful toolkit that helps us maintain high performance and reliability across our applications and infrastructure, ultimately leading to better user satisfaction and more efficient operations.
What needs improvement?
I'd like to see the Android and iOS apps expanded to include a simplified CI/CD pipeline history view.
I like the idea of monitoring on the go, yet the options still seem a bit limited out of the box. While the documentation is very good considering all the frameworks and technologies Datadog covers, there are areas, specifically .NET profiling and tracing of IIS-hosted apps, that require a lot of focus to pick up on the key details.
In some cases the screenshots don't match the text as updates are made. I spent longer than I should have figuring out how to correlate logs to traces, mostly related to environment variables.
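For reference, a minimal sketch of the environment variables involved in correlating logs with traces (the service name and version are hypothetical): the first three implement Datadog's unified service tagging, and DD_LOGS_INJECTION tells the tracer to stamp trace and span IDs into log records so logs line up with traces.

```python
# Standard Datadog tracer settings, sketched as a dict for reference.
# Values here are hypothetical examples.
dd_env = {
    "DD_ENV": "prod",                   # deployment environment tag
    "DD_SERVICE": "registration-web",   # hypothetical service name
    "DD_VERSION": "1.4.2",              # hypothetical release version
    "DD_LOGS_INJECTION": "true",        # inject trace/span IDs into logs
}

# Print in the KEY=value form you would place in a host or container
# environment (systemd unit, Dockerfile, app pool configuration, etc.).
for key, value in dd_env.items():
    print(f"{key}={value}")
```

With these consistent across the agent, tracer, and log pipeline, the Logs and APM views can link a log line to the trace that produced it.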
For how long have I used the solution?
I've used the solution for about three years.
What do I think about the stability of the solution?
We have been impressed with the uptime and clean and light resource usage of the agents.
What do I think about the scalability of the solution?
The solution has been very scalable and customizable.
How are customer service and support?
Sales service is always helpful in tuning our committed costs and alerting us when we start spending outside the on-demand budget.
Which solution did I use previously and why did I switch?
We used a mix of a custom error email system, SolarWinds, UptimeRobot, and GitHub actions. We switched to find one platform that could give deep app visibility regardless of whether it is Linux or Windows or Container, cloud or on-prem hosted.
How was the initial setup?
Generally simple, although .NET profiling of IIS and aligning logs to traces and profiles were a challenge.
What about the implementation team?
We implemented the solution in-house.
What was our ROI?
I'd count our ROI as significant time saved by the development team assessing bugs and performance issues.
What's my experience with pricing, setup cost, and licensing?
Set up live trials to assess cost scaling. Small decisions around how monitors are used can have big impacts on cost.
Which other solutions did I evaluate?
New Relic was considered. LogicMonitor was chosen over Datadog for our network and campus server management use cases.
What other advice do I have?
We are excited to dig further into the new LLM offerings and to continue growing our footprint in Datadog.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure