
PagerDuty Operations Cloud
Cloud platform has improved efficiency and time savings but still needs stronger AI integrations
What is our primary use case?
I have been using PagerDuty Operations Cloud for the past few months in my day-to-day applications. Specifically, I use it to host my agent. Hosting my agent on PagerDuty Operations Cloud makes my day-to-day work more efficient in terms of scalability and infrastructure management. It has been pretty helpful.
What is most valuable?
PagerDuty Operations Cloud's best features include scalability, managing infrastructure, and managing other services. It helps me manage other services comprehensively, and I think it is pretty good overall. PagerDuty Operations Cloud has positively impacted my organization by being effective in terms of managing the system and in terms of scalability. A specific outcome that shows how PagerDuty Operations Cloud has helped my organization is that it has improved efficiency and helped in saving a lot of time.
What needs improvement?
I think PagerDuty Operations Cloud can be improved in terms of services, such as integration with AI.
For how long have I used the solution?
I have been working in my current field for the past three years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud has been stable based on what we have used.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud's scalability has been pretty good because we are able to spin up different resources based on the use case and load.
How are customer service and support?
We have not used customer support explicitly.
Which solution did I use previously and why did I switch?
I did not previously use a different solution.
What was our ROI?
I have seen a return on investment, as I mentioned earlier; there has been a lot of improvement in terms of time and cost.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing has been quite reasonable and cost-effective.
Which other solutions did I evaluate?
Before choosing PagerDuty Operations Cloud, I did not evaluate other options.
What other advice do I have?
I would rate PagerDuty Operations Cloud six out of ten because I believe more features could be added to make the platform even better. Regarding PagerDuty Operations Cloud's AI capabilities, I think its governance and security are pretty good, and the applications are quite secure. As for the accuracy and reliability of its output, I think the accuracy is pretty high, and I believe it should be quite reliable, though I have not explored the recent AI capabilities much.
I would definitely suggest PagerDuty Operations Cloud as a good platform, but it depends on your use case and the amount of scalability that you are looking for. PagerDuty Operations Cloud is pretty good and quite helpful. My overall rating for PagerDuty Operations Cloud is six out of ten.
Centralized alerts have improved incident response and now support flexible on-call workflows
What is our primary use case?
My main use case for PagerDuty Operations Cloud is on-call staff. For instance, when we have sites go down, we need somebody to investigate, so we require a text SMS or a phone call alert.
What is most valuable?
PagerDuty Operations Cloud offers several best features including cloud-based hosting, reliable performance, and flexible expandability.
Regarding flexibility and expandability, you can scale the number of employees up and down, add different paths for contacting people, and use monitoring capabilities, all of which have greatly helped my team.
PagerDuty Operations Cloud has positively impacted my organization with its very good interface and centralized operation. Having a centralized interface has made things easier by giving us easy access to administration.
What needs improvement?
PagerDuty Operations Cloud could be improved with clearer instructions for beginners.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for a year.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable.
What do I think about the scalability of the solution?
The scalability of PagerDuty Operations Cloud is very good; when we need to add or reduce employees, it can adjust.
How are customer service and support?
Customer support has been very good, and I can reach somebody anytime. I would rate customer support an eight on a scale of one to ten.
Which solution did I use previously and why did I switch?
Previously, we used just a custom alerting solution.
How was the initial setup?
We are testing AI and automation through PagerDuty Operations Cloud for incident response right now, but not too much has changed yet.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing has been fairly reasonable and not expensive.
Which other solutions did I evaluate?
Before choosing PagerDuty Operations Cloud, I did not evaluate other options and only considered some standard custom operator solutions.
What other advice do I have?
I would rate PagerDuty Operations Cloud an eight out of ten because it is pretty good, but it is not perfect yet. Regarding PagerDuty Operations Cloud's AI capabilities, I think its governance and security are pretty good with no issues, and I find its output to be pretty accurate and pretty stable. My advice to others looking into using PagerDuty Operations Cloud is to work out how many users you need and license accordingly. My overall review rating for PagerDuty Operations Cloud is eight out of ten.
Incident workflows have transformed and now reduce downtime for critical gaming services
What is our primary use case?
My name is Dinesh Singh Negi and I currently work as a Lead DataOps Engineer in the online gaming industry. My primary responsibility is ensuring the reliability, availability, and performance of our data platform and complete production system. I work extensively with AWS services, Prometheus, Grafana, and PagerDuty for monitoring, alerting, and incident management. My team supports critical gaming workloads and data pipelines that require high uptime and quick incident response. A significant part of my role involves setting up monitoring strategies, managing on-call operations, handling production incidents, and performing root cause analysis. We drive operational improvements, and we use PagerDuty Operations Cloud as our central incident management platform to ensure alerts are routed to the right team and escalated appropriately. I have been working in operations and reliability for nine to ten years and have hands-on experience managing large-scale customer-facing environments where minimizing downtime and reducing mean time to resolution are key priorities. We use PagerDuty Operations Cloud to track the maximum time to acknowledgment and maximum time to resolution, and to derive meaningful analysis from the incidents triggered for different teams.
I have been working for nine to ten years in operation, production support, reliability engineering, and mixed roles during this time. I have worked extensively on monitoring, incident management, system reliability, and operational excellence while particularly supporting large-scale online platforms and data operations. For five to six years, my focus has been on ensuring high availability, managing production incidents, optimizing monitoring and alerting strategies, and improving operational processes. Throughout these years, I have gained hands-on experience with AWS Cloud, Prometheus, Grafana, and PagerDuty Operations Cloud, which are the core tools we use for monitoring, alerting, and incident responses.
What is most valuable?
The best features are those we have been using for incident management. We have been using PagerDuty Operations Cloud for on-call scheduling, escalation policies, and integration capabilities. Incident management is extremely valuable because it ensures critical alerts are delivered to the right people immediately. On-call scheduling and escalation policies are very helpful because we can define clear ownership for the services and automatically escalate incidents if they are not acknowledged within a specific timeframe. Another key strength is the integration ecosystem. We can integrate it with our monitoring stack including Prometheus, Grafana, and AWS services, which helps us automate alerts ingestion and incident creation without manual intervention. The most valuable features are automating alerts, escalations, on-call management, integrations, and incident analytics.
One example that stands out was a production incident where we experienced a sudden spike in database latency during peak gaming hours. This started impacting player transactions and causing delays in some backend services. Our Prometheus and Grafana monitoring detected the abnormal increase in latency and error rate once it crossed a threshold, and the alert was automatically routed to PagerDuty Operations Cloud. PagerDuty Operations Cloud immediately notified our team's on-call engineer and triggered the escalation workflow based on the incident severity. Since the issue occurred during peak traffic, a quick response was critical, and we achieved it. PagerDuty Operations Cloud helped us coordinate multiple teams, including DataOps, application, and other infrastructure teams. The platform helped ensure everyone was engaged quickly and that no critical notifications were missed. During the investigation, we identified a resource bottleneck in the database layer caused by an unexpected traffic surge. With the help of the database team, we scaled the required AWS resources and optimized a few long-running queries. This restored normal performance.
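The alert path in this example (a monitored latency value crossing a threshold and becoming a PagerDuty incident) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reviewer's actual configuration: it only builds a trigger event in the general shape used by the PagerDuty Events API v2, and the threshold, the routing key placeholder, the source name, the severity rule, and the function name are all hypothetical.

```python
import json

# Hypothetical values for illustration; none of these come from the review.
LATENCY_THRESHOLD_MS = 250.0
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder for a service integration key


def build_trigger_event(source, latency_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return an Events API v2 style trigger event, or None if under threshold."""
    if latency_ms <= threshold_ms:
        return None
    # Assumed policy: more than double the threshold counts as critical.
    severity = "critical" if latency_ms > 2 * threshold_ms else "warning"
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        # A stable dedup_key lets repeated firings map to one incident.
        "dedup_key": f"db-latency:{source}",
        "payload": {
            "summary": f"Database latency {latency_ms:.0f} ms exceeds {threshold_ms:.0f} ms",
            "source": source,
            "severity": severity,
        },
    }


event = build_trigger_event("orders-db", 620.0)
print(json.dumps(event, indent=2))
```

In a real setup the event would be sent to PagerDuty rather than printed, and with Prometheus that step is typically handled by Alertmanager's PagerDuty receiver instead of hand-written code.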
What needs improvement?
A significant positive impact is improved incident response efficiency and overall service reliability. Before we had a mature incident management process, coordinating responses during critical issues often required manual communication and follow-ups. PagerDuty Operations Cloud automated all of that, including alert ownership and escalation, ensuring that incidents are routed to the right team members immediately. One of the most measurable benefits is the reduction in mean time to acknowledge and mean time to resolve. Faster detection and response help minimize service disruptions and maintain a stable experience for our users, which is especially important in the online gaming industry where availability and performance directly affect customer satisfaction. The platform has helped us mature our operational practices by analyzing incident trends, alert volumes, and escalation patterns. We have been able to refine our monitoring, reduce alert fatigue, and proactively address recurring issues before they become major bottlenecks in production.
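For readers unfamiliar with the two metrics named here, the following is a small sketch of how mean time to acknowledge (MTTA) and mean time to resolve (MTTR) can be computed from incident timestamps. The incident records and field names are invented for illustration and do not reflect the reviewer's data or PagerDuty's export format.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the timestamps are invented for illustration.
incidents = [
    {"triggered": "2024-01-10T20:00:00", "acknowledged": "2024-01-10T20:02:00",
     "resolved": "2024-01-10T20:32:00"},
    {"triggered": "2024-01-11T09:00:00", "acknowledged": "2024-01-11T09:04:00",
     "resolved": "2024-01-11T09:24:00"},
]


def minutes_between(start, end):
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mtta_mttr(records):
    """Return (mean time to acknowledge, mean time to resolve) in minutes."""
    mtta = mean(minutes_between(r["triggered"], r["acknowledged"]) for r in records)
    mttr = mean(minutes_between(r["triggered"], r["resolved"]) for r in records)
    return mtta, mttr


mtta, mttr = mtta_mttr(incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Both metrics are measured from the trigger time in this sketch; some teams measure resolution time from acknowledgment instead, so the definition should be agreed before comparing numbers.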
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for more than five years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable.
What do I think about the scalability of the solution?
When you are using a tool for incident response, you need to trust that notifications and escalations work when a critical event occurs. PagerDuty Operations Cloud has been very dependable in that regard. Another aspect we have found valuable is the flexibility to support different teams and services as our environment grows. We have added new applications, data pipelines, and AWS service resources. We are able to extend our PagerDuty Operations Cloud configuration without major challenges or changes to our overall operational model.
Which solution did I use previously and why did I switch?
I have not used any solution previously. Since the beginning of 2021, I have been using PagerDuty Operations Cloud.
How was the initial setup?
The setup and customization process was relatively straightforward. The integrations were one of the easiest parts. PagerDuty Operations Cloud provides well-documented integrations for monitoring tools and cloud platforms. Connecting it with our Prometheus, Grafana, and AWS monitoring stack did not require significant development efforts. The initial setup involved configuring alert routing, defining service ownership, and mapping severity levels to appropriate escalation policies. Customizing on-call schedules and escalation workflows was also quite flexible. We were able to create different schedules for various teams, define escalation paths based on incident severity, and establish notification rules that match our operational requirements. As our team and environment grew, we refined the configuration further by tuning alert thresholds and reducing noise to avoid alert fatigue. It is important to ensure engineers receive only actionable alerts rather than excessive notifications.
What about the implementation team?
PagerDuty Operations Cloud's AI and automation capabilities are primarily used for alert correlation, event intelligence, noise reduction, incident prioritization, and providing operational context to responders. These capabilities help engineers identify and respond to issues more quickly while keeping humans in control of critical decisions. We see value in the direction of autonomous operations. If AI agents continue to improve in areas such as incident triage, root cause analysis, and automated remediation for well-understood scenarios, they could further reduce response times and operational overhead.
What was our ROI?
We have seen a positive return on investment from PagerDuty Operations Cloud through improved operational efficiency, faster incident response, and reduced downtime. I cannot share financial figures, but I can speak to operational outcomes we have observed. Since implementing PagerDuty Operations Cloud and integrating it with our AWS, Prometheus, and Grafana monitoring stack, we have seen measurable improvements in incident metrics such as MTTA and MTTR, and reduced alert fatigue through event correlation and alert deduplication. These improvements have helped us a great deal.
Which other solutions did I evaluate?
I did not get a chance to evaluate any other applications. When I was in the company, they were using PagerDuty Operations Cloud only, so I started with that.
What other advice do I have?
My advice would be to start with a clear incident management strategy rather than focusing only on the tool itself. PagerDuty Operations Cloud delivers the most value when you have well-defined service ownership, escalation policies, severity levels, and monitoring practices in place. The platform is very powerful, but its effectiveness depends on the quality of the alerts and operational processes behind it. I would also recommend investing time in alert tuning early on and integrating PagerDuty Operations Cloud with your monitoring stack, whether it is AWS, Prometheus, Grafana, or any other observability tool. Make sure the alerts being sent are actionable. Reducing noise from the beginning will help prevent alert fatigue and improve adoption among engineering teams. I would rate this product an eight out of ten.
AI-driven incident management has reduced downtime and improves focus on strategic work
What is our primary use case?
PagerDuty Operations Cloud is a multifunctional digital operations platform that meets my organization's needs.
I am impressed by this digital operations solution because it is the most appropriate tool for incident detection and alerting.
PagerDuty Operations Cloud is a very user-friendly tool, highly accurate, and an easy-to-customize digital operations management system that suits my organization's needs.
It has intelligent noise reduction capabilities that play a significant role in minimizing alert floods.
What is most valuable?
PagerDuty Operations Cloud offers top-tier features that enable real-time alerting and accelerate incident response.
The solution is reliable and effective when it comes to automating routine diagnostic tasks.
The real-time alerting and automation features have helped my team by automating problem-solving and making incident management less complex.
PagerDuty Operations Cloud has positively impacted my organization by enabling faster issue response, which helped reduce downtime, saved revenue by avoiding long outages, improved team accountability during incidents, reduced manual effort in handling alerts, and helped maintain a better customer experience.
The solution's alert reduction feature has had a major impact on preventing costly incidents in my organization. By grouping related alerts and de-duplicating noise, my team was able to spot real issues faster instead of getting buried in alerts, helping us prevent two to three potential outages because engineers responded to the root alert instead of missing it in noise.
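The grouping and de-duplication idea described above can be illustrated with a very simple sketch. This is not PagerDuty's algorithm, which the review does not describe; it only shows key-based grouping, and the alert fields and service names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw alerts; field names are assumptions for this sketch.
alerts = [
    {"service": "payments-db", "check": "latency", "host": "db-1"},
    {"service": "payments-db", "check": "latency", "host": "db-2"},
    {"service": "payments-db", "check": "latency", "host": "db-1"},
    {"service": "exam-api", "check": "http_5xx", "host": "api-1"},
]


def group_alerts(raw_alerts):
    """Group alerts by (service, check) and drop exact duplicates."""
    groups = defaultdict(list)
    for alert in raw_alerts:
        key = (alert["service"], alert["check"])
        if alert not in groups[key]:  # de-duplicate identical alerts
            groups[key].append(alert)
    return dict(groups)


grouped = group_alerts(alerts)
for key, members in grouped.items():
    print(key, "->", len(members), "unique alert(s)")
```

Four raw alerts collapse into two groups here, which is the effect the reviewer describes: responders look at one grouped item per underlying problem rather than every notification.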
What needs improvement?
The user interface could be made easier to customize and navigate so that users who are new to the platform have a smoother learning curve.
The pricing could be lower, especially for smaller organizations.
PagerDuty Operations Cloud's integrations also need improvement because they are sometimes not seamless and occasionally misbehave.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for about one year and a few months.
What other advice do I have?
PagerDuty Operations Cloud is a great operational efficiency tool, not just for paging.
It is very cost-effective, especially for organizations that are not limited by budgets.
PagerDuty Operations Cloud solves a lot of problems.
For example, if any issue arises during our online exam with our client, then PagerDuty Operations Cloud alerts the right team and the right people, and tasks are assigned so those problems can be resolved at the correct time and our real task does not get disrupted.
PagerDuty Operations Cloud's AI functionality has improved my team's ability to focus on core tasks rather than routine issues by removing routine alert triage.
The AI groups and de-duplicates alerts automatically, so our engineers are not manually sorting through twenty duplicate notifications for one root issue, allowing them to save a lot of time and focus on other strategic tasks, which improves productivity in my organization.
We are using PagerDuty Operations Cloud's autonomous AI agents for low-severity incidents, which automatically triage, correlate, and resolve known issues without human intervention, such as restarting services or acknowledging flapping alerts.
This has contributed to efficiency by cutting manual workload by thirty-five percent and also reducing MTTR for routine incidents.
PagerDuty Operations Cloud's generative AI is effective at providing insights for decision-making during incidents.
The AI provides clear insights through incident summaries and what-changed analysis, helping us decide where to start troubleshooting instead of guessing, enabling us to make data-driven decisions easily, and providing actionable insights that improve response decisions.
PagerDuty Operations Cloud's embedded AI has a positive impact on revenue protection: by reducing alert fatigue, it lowers downtime risk and the operational cost per incident.
I would rate PagerDuty Operations Cloud nine out of ten.
Automated incident paging has improved on-call response but reporting and pricing still need work
What is our primary use case?
The main use case for using PagerDuty Operations Cloud is that we get paged as and when required for all the issues and incidents which are happening, rather than requiring us to keep track of all of them. We are in the exploring phase for using AI around PagerDuty, but that is still in exploration and we haven't started with that.
When there is an incident, we get paged for an alert. We have escalation policies set up that are being followed, and if someone is not acknowledging the page or if someone is not available, then accordingly it will go to the next level of escalation. This ensures that none of the alerts are missed.
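The escalation behavior described here can be sketched as a walk through ordered levels, moving to the next level when the current responder does not acknowledge within a timeout. This is an illustrative simulation only; the policy levels, responder names, and timeouts are hypothetical and are not the reviewer's configuration.

```python
# Hypothetical escalation policy: each level has a responder and a timeout.
POLICY = [
    {"level": 1, "responder": "primary-on-call", "timeout_min": 5},
    {"level": 2, "responder": "secondary-on-call", "timeout_min": 10},
    {"level": 3, "responder": "engineering-manager", "timeout_min": 15},
]


def escalate(policy, acknowledges):
    """Walk the policy in order; return (responder, minutes elapsed) for the
    first level that acknowledges, or (None, total minutes) if nobody does."""
    elapsed = 0
    for step in policy:
        if acknowledges(step["responder"]):
            return step["responder"], elapsed
        elapsed += step["timeout_min"]  # wait out the timeout, then escalate
    return None, elapsed


# Simulate a primary responder who is unavailable.
who, waited = escalate(POLICY, lambda r: r != "primary-on-call")
print(who, waited)
```

The `acknowledges` callback stands in for a real person responding to a page; passing a function keeps the sketch testable without any notification system.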
PagerDuty Operations Cloud has multiple integrations. In our case, we use the Slack integration the most. The alert triggers from our SignalFx stack, goes to PagerDuty, follows the escalation policy, and reaches the user. Along with that, it is also sent to the Slack channels so that whatever triaging happens for that alert or incident happens over Slack in that particular thread where the alert is triggered from PagerDuty.
What is most valuable?
PagerDuty Operations Cloud is very easy to use and user-friendly.
Regarding the features that PagerDuty Operations Cloud offers, I have explored the automation area and it has a good number of integrations. For example, event intelligence and noise reduction are areas where PagerDuty is really powerful. It reduces and cleans up alerts through alert de-duplication and alert grouping. It has also recently gained machine learning capabilities, which should be helpful. There are also automations and runbooks which can auto-remediate issues or trigger scripts as per the runbooks. We haven't been using all of those things, but I know that they are available. Incident response and on-call management are very easy with PagerDuty. There are flexible on-call schedules, escalation policies, and the ability to set up overrides easily. There are multiple channels by which you can send alerts, including SMS, calls, and push notifications.
PagerDuty Operations Cloud also has war room features. Many emerging tools provide this as well, but since PagerDuty is a pretty established company, it has a very mature model with all of these features. The analytics and reporting are also decent.
PagerDuty Operations Cloud has improved our incident management process by ensuring that the right set of people are notified within time. The best part is that it has automated on-call schedules and escalation policies, so you don't have to set them again and again for every week or every month. Features including alert grouping, alert de-duplication, and good analytics and reporting are very helpful during incident management and also for post-incident activities.
What needs improvement?
The analytics and reporting have some scope for improvement. First, they should be more granular, and we should be able to query them in a more granular way. More advanced trend analysis and cross-team operational insights would also be helpful. Licensing is also a bit expensive, so there should be some cost optimization for large deployments. Since we are in the AI era, I know PagerDuty has been investing in a lot of AI capabilities, but there are enhancements we are looking forward to, such as automated root cause analysis and historical pattern matching. There could also be recommendations around runbook automation.
For how long have I used the solution?
I have been using PagerDuty for the past one and a half years at Splunk, and before that I was also an active user of PagerDuty in my last organization.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud is pretty scalable. I never had any issue where a large number of alerts impacted PagerDuty.
How are customer service and support?
The support is decent.
Which solution did I use previously and why did I switch?
Previously, we were using OpsGenie, but that was quite a long time ago. PagerDuty Operations Cloud already has all the things which OpsGenie had, as per my knowledge.
How was the initial setup?
The cost took two points off my overall rating. There are still a good number of areas for improvement, which took away one more point, resulting in a rating of seven out of ten.
What about the implementation team?
I am not in the position to select any tool. I am not the one who selected or chose PagerDuty or evaluated any tools before that. We are just end users.
What was our ROI?
The return on investment is nine out of ten.
Which other solutions did I evaluate?
We are still doing that evaluation and it is not yet complete. However, it is pretty helpful because PagerDuty itself has a good amount of data that AI can make use of. I am still in the experimenting phase, but the AI functionality of PagerDuty would definitely be a good way to analyze ongoing issues and how issues are handled right now, track the MTTR and MTTDI, and find the areas where improvement is needed.
What other advice do I have?
PagerDuty Operations Cloud already has a lot of integrations available, which is pretty good. The user experience is swift and smooth, which is a very good thing about PagerDuty Operations Cloud that I appreciate. I am not very aware of the governance and security aspects, but it has SSO as well, which is pretty good. Many organizations would be happy to adopt it, though I am not very aware of these features. The AI capabilities are not very reliable or accurate at the moment, but it is in the development phase and should improve over time.
I don't have the exact metrics available, but we have seen a significant improvement since onboarding to PagerDuty Operations Cloud. As a comparison, the organization I worked at before my previous one did not have PagerDuty Operations Cloud, and a good number of alerts were being missed there. With PagerDuty Operations Cloud, you have a good layer of notifications and notification policies. Even if you miss a page, you will get a push notification on your mobile. If you miss that, you will get a call on your mobile, which is pretty good.
The overall pricing, setup cost, and licensing are pretty expensive. The PagerDuty Operations Cloud licensing is a bit confusing because it is primarily based on users, not on the number of alerts or incidents which are triggered. If it is a small organization, it is good, but if it is a large organization, it is difficult because many people would need to use PagerDuty Operations Cloud. At the same time, to make it more efficient or to get the best out of it, we need to have an end-to-end setup on PagerDuty Operations Cloud, which does take time. There should be some flexible licensing options.
PagerDuty Operations Cloud is a pretty mature product. If you are a mid-scale organization that is trying to get the best out of PagerDuty Operations Cloud, I would recommend going for it. My overall rating for this product is seven out of ten.
Integrated incident workflows have improved on-call efficiency and automated critical alerts
What is our primary use case?
We are currently using PagerDuty Operations Cloud for incident management, escalations, on-call, and the status page, which represents our main product utilization.
As of now, we are not using any generative AI features in PagerDuty Operations Cloud. We are currently using it for on-call and other things.
What is most valuable?
What I like the most about it is that it has so many integrations like Azure integrations, AWS integrations, and Prometheus and Grafana integration for the alerting system, which makes it more convenient for us. We are using all kinds of tools like Grafana and others, which are easy to access and integrate with PagerDuty Operations Cloud. Our infrastructure is going to be more secured whenever incidents get triggered, and with the help of PagerDuty Operations Cloud, we are able to get incidents triggered automatically after alerts are triggered.
In terms of on-call, we are getting maximum use out of it. Previously, we did not have any alerting system for our client, and after implementing PagerDuty Operations Cloud, we started finding root causes, and other things became easier than before. With the help of PagerDuty Operations Cloud, we are able to fix most issues and reduce repetitive issues in our infrastructure.
What needs improvement?
There is nothing I dislike about PagerDuty Operations Cloud, but there is occasional latency, perhaps due to the network or the delivery medium. Usually, when an incident is triggered, the call comes within five to ten seconds, but sometimes, maybe due to latency or other factors, the call comes after two or three minutes. That is quite understandable, but critical production issues need to be addressed as early as possible, so PagerDuty Operations Cloud needs to reduce that latency. I think they need to work on that. Apart from that, they are doing most things well, and we are not facing any other issues. Everything is good.
For how long have I used the solution?
We have been using PagerDuty Operations Cloud for more than two years.
What do I think about the stability of the solution?
Except for the timing of the calls, we don't see any lagging, crashing, or downtime. In rare cases, we hear some noise on the call. Apart from that, the triggering latency is sometimes a bit slow, but not every time.
What do I think about the scalability of the solution?
Regarding scalability, I don't think there are any issues; it is going well.
How are customer service and support?
We have very good support with PagerDuty Operations Cloud.
In a few cases, though not frequently, we have had to contact technical support for clarification regarding an integration or for help creating escalations. Initially, we reached out to technical support, but now we are well-versed with the tool. The community is good, and I think we are able to get solutions within the community itself.
For the support of PagerDuty Operations Cloud, I would give them a score of nine to ten.
Which solution did I use previously and why did I switch?
There is currently one tool called Rootly. I think they are new to the industry, and we are also using it for one of our other clients. It is somewhat similar, and I think it has the potential to compete with PagerDuty Operations Cloud in the future.
How was the initial setup?
I don't think the deployment for PagerDuty Operations Cloud is difficult to handle. It is easy to handle, and the best thing is they have a very good support team that we can reach out to at any time.
What's my experience with pricing, setup cost, and licensing?
The pricing for PagerDuty Operations Cloud is a bit expensive, especially for startups like us, compared to the other platform I mentioned, Rootly. Rootly is not based on a per-user model. PagerDuty Operations Cloud costs fifty dollars per user for admins or other roles, whereas the other platform has no such charge; it is based on a pay-as-you-go model. I think that is one of the drawbacks of PagerDuty Operations Cloud regarding billing. Apart from that, the plans and the features for incident creation and call triggering are quite good.
Which other solutions did I evaluate?
There is currently one tool called Rootly. I think they are new to the industry, and we are also using it for one of our other clients. It is somewhat similar, and I think it has the potential to compete with PagerDuty Operations Cloud in the future.
Real‑time incident alerts have improved uptime and keep critical services continuously monitored
What is our primary use case?
I have worked with PagerDuty Operations Cloud for more than two years, though I have not used it within the past three months. I worked at an IT firm called Equifax, which provides credit monitoring where you can see your credit score and credit-related products and services. I am in the software engineering department building the application, and for the past two years I supported a few applications, integrating PagerDuty Operations Cloud with DataDog. We set alerts on our software systems so that if any software we serve on the cloud goes down, we receive an alert in DataDog, and we route it to PagerDuty Operations Cloud by providing our PagerDuty Operations Cloud credentials in DataDog. When any system we host in our cloud goes down, or any alert we have set up in DataDog fires, the event is sent to PagerDuty Operations Cloud, which delivers the alert through multiple channels such as SMS, phone call, and message. That is how I use PagerDuty Operations Cloud in my company and my work.
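To make the event flow concrete, here is a minimal sketch of the kind of trigger event a monitoring tool sends to PagerDuty through the Events API v2. The reviewer used DataDog's built-in integration rather than code like this; the integration key, service name, and source host below are hypothetical placeholders, and `send_event` is defined but deliberately not called.

```python
import json
import urllib.request

# Public endpoint of the PagerDuty Events API v2.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build the JSON body for an Events API v2 'trigger' event."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are attached to the same open alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical values, for illustration only.
event = build_trigger_event(
    routing_key="HYPOTHETICAL_INTEGRATION_KEY",
    summary="checkout-service is down",
    source="monitoring.example.com",
    dedup_key="checkout-service-down",
)
print(json.dumps(event, indent=2))
```

PagerDuty then decides which channels (SMS, phone call, and so on) to use based on each responder's notification rules, not on anything in the event itself.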
What is most valuable?
What I appreciate most about PagerDuty Operations Cloud is its real-time alert capability. If a critical system in the production environment goes down and we are notified only by SMS or email, we might miss it. Anyone who is on call or providing support twenty-four hours a day, seven days a week, such as the support team, needs something like PagerDuty Operations Cloud. Without it, it is difficult to say when an alert was triggered. PagerDuty Operations Cloud keeps the full history, including who acknowledged a particular page, the timeline of when it was triggered, and on which channel, which makes it easy to prepare a report for the past month on how many alerts we received for particular services, or to segregate them by team or by alert name. It is close to a perfect application, but I can suggest a few improvements to enhance the user experience.
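The monthly report described above amounts to a simple aggregation over incident history. The following is a minimal sketch, assuming the history has already been exported into a list of records; the field names and sample data are hypothetical and do not reflect PagerDuty's export format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical exported incident history (field names are assumptions).
incidents = [
    {"service": "payments-api", "triggered_at": "2024-05-03T02:14:00", "acknowledged_by": "alice"},
    {"service": "auth-service", "triggered_at": "2024-05-10T11:40:00", "acknowledged_by": "bob"},
    {"service": "payments-api", "triggered_at": "2024-05-21T23:05:00", "acknowledged_by": "bob"},
    {"service": "payments-api", "triggered_at": "2024-06-01T00:30:00", "acknowledged_by": "alice"},
]


def count_by_service(records, year, month):
    """Count incidents per service that were triggered in the given month."""
    counts = Counter()
    for record in records:
        triggered = datetime.fromisoformat(record["triggered_at"])
        if triggered.year == year and triggered.month == month:
            counts[record["service"]] += 1
    return counts


may_counts = count_by_service(incidents, 2024, 5)
for service, total in may_counts.most_common():
    print(f"{service}: {total}")
```

Grouping by team or alert name instead would only require changing the key used in the counter.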
PagerDuty Operations Cloud's main benefit to our organization is its alerting. Alerts are critical; we want our system to be one hundred percent available, but no system is one hundred percent reliable. We want to know whenever our system goes down, when we are experiencing latency in response time, or when a certificate for our DNS expires, as these are critical issues that can be handled through PagerDuty Operations Cloud. Even if we set an email notification, individuals working at the system twenty-four hours a day, seven days a week, may not always be available. If they go for tea, coffee, or lunch, they might miss a critical alert. However, if you have a pager, you will receive a call, which is much more reliable. While there may be instances when multiple PagerDuty Operations Cloud events trigger and result in one call, that is not the case all the time. Most of the time, we receive alerts through one of the three channels, and organizations configure calls so that someone checks the logs and addresses the problem promptly.
PagerDuty Operations Cloud's alert reduction feature has significantly impacted my organization, preventing approximately ninety to ninety-five percent of critical incidents from occurring. As I understand it, when a system is down, people see the alert and take the necessary resolution steps. If the fix does not involve actual engineering work, for example restarting a service, it can be handled by following up on the PagerDuty Operations Cloud alert, so resolution happens as soon as possible.
What needs improvement?
PagerDuty Operations Cloud is close to perfect, but no application is perfect. When several alerts are created with the same name and I search for an alert by relevance, the results do not come back as expected. That is one thing I would like to see improved. The user interface is good, and I cannot think of other improvements right now. The integration side also looks good; PagerDuty Operations Cloud can be integrated with multiple platforms, including applications such as DataDog, so I wonder why it cannot be integrated directly with a cloud such as Google Cloud without a third layer such as DataDog. We should be able to integrate directly with PagerDuty Operations Cloud without any dependency.
What I would like to see included in PagerDuty Operations Cloud is some AI functionality; most users rely on PagerDuty Operations Cloud for alerting on critical issues. For each alert, I would like to see the resolution steps, the root cause analysis, or a runbook that says what to do when that alert fires. Most of the alerts we receive are repeats; if a system goes down, it is usually a known error, since no system is one hundred percent reliable. If we get a similar alert, I would like to see the root cause. If we receive an alert that the team has handled before, we should be able to see what they did, or preconfigure which runbook to follow for that alert. Since I am primarily talking about IT, integration with vendors and systems such as Confluence or Jira is also functionality I would like to see included in PagerDuty Operations Cloud.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for more than two years, though recently, within the past three months, I have not been using PagerDuty Operations Cloud.
What do I think about the stability of the solution?
When it comes to stability, I would not say there are performance issues with PagerDuty Operations Cloud. For example, if the same alert ID is generated multiple times, such as if a namespace goes down where multiple services are deployed, all alerts trigger at the same time, and we generally receive only one call for that. While it could be regarded as a performance issue, I think it is understandable given the situation.
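The "one call for many alerts" behavior can be pictured as grouping events by a shared key. Below is a simplified sketch of that idea, not PagerDuty's actual implementation; the event fields and service names are hypothetical, and the `dedup_key` name is borrowed from the Events API v2.

```python
def group_into_incidents(events):
    """Group events by dedup_key; each distinct key stands for one incident (one call)."""
    incidents = {}
    for event in events:
        incidents.setdefault(event["dedup_key"], []).append(event["service"])
    return incidents


# Hypothetical scenario: a namespace outage takes down three services at once,
# and every resulting event carries the same dedup_key.
events = [
    {"service": "orders", "dedup_key": "namespace-prod-down"},
    {"service": "billing", "dedup_key": "namespace-prod-down"},
    {"service": "search", "dedup_key": "namespace-prod-down"},
    {"service": "reports", "dedup_key": "reports-disk-full"},
]

incidents = group_into_incidents(events)
for key, services in incidents.items():
    print(f"{key}: 1 call covering {len(services)} event(s) -> {services}")
```

Four events produce two incidents here, which is why the on-call engineer hears one call for the namespace outage rather than three.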
What do I think about the scalability of the solution?
I do not find any limitations or issues regarding scalability with PagerDuty Operations Cloud; it appears to be scalable.
How are customer service and support?
Regarding PagerDuty Operations Cloud tech support, I personally have not escalated any issues, but my team escalated one issue related to reporting. When we created reports from PagerDuty Operations Cloud, report generation had a limitation: for instance, reports generated in the morning only showed data up to four a.m. Those reports might lack some recent alerts or events, though we could still see them manually. The report data simply seems to lag behind.
How was the initial setup?
Regarding the initial setup process for PagerDuty Operations Cloud, I do not think there were challenges. I followed a runbook, and the process is straightforward, requiring just primary and secondary contact information, email, and phone numbers. It is a straightforward process, and I do not see any issues with that.
Which other solutions did I evaluate?
Prior to adopting PagerDuty Operations Cloud, I evaluated alternative options, and the alternatives I mentioned earlier do exist, but PagerDuty Operations Cloud is the only application I have used because it can trigger phone calls for alerts. I have seen multiple tools that send alerts by email, but I have not come across other applications that can place phone call alerts the way PagerDuty Operations Cloud does.
What other advice do I have?
We do have alerting systems, as I mentioned, using DataDog, which can send only emails and other channels, but I do not think there are any other applications we are using for alerting apart from that. We also use Grafana, DataDog, and Chaos Search for the alert system along with PagerDuty Operations Cloud.
I have not used PagerDuty Operations Cloud's autonomous AI agents or generative AI yet. PagerDuty Operations Cloud introduced those features, and my organization adopted them only recently. I did not get a chance to use them because I moved to another team and stopped using PagerDuty Operations Cloud after the AI integration. My colleagues mentioned that after the integration it was good; they could integrate with multiple teams and applications such as Slack, but I have no hands-on experience with that.
My advice for organizations considering PagerDuty Operations Cloud is that many organizations seem to already use it. If your system is large and you need to handle incidents, particularly critical applications driving revenue or something similar, you cannot afford for your system to go down for five minutes, as it may result in millions of dollars lost. To mitigate this, increasing reliability is essential. No system is entirely reliable, so we have to depend on products such as PagerDuty Operations Cloud to alert our engineers or the support team to reduce incident counts and impacts for monitoring purposes, system performance analysis, and other objectives. My overall rating for PagerDuty Operations Cloud is eight out of ten.
Seamless PagerDuty API Integration and a Streamlined Incident Management UI
Integrated alerts have improved on-call response and enabled proactive incident management
What is our primary use case?
PagerDuty Operations Cloud's main use case for my organization is the integration of alerts by receiving them via mobile, email, and SMS.
For example, we integrated AWS CloudWatch alarms. For the age-of-oldest-message alarm, which applies to SQS, we set up an SNS integration by creating a topic and a subscription and giving the subscription the PagerDuty Operations Cloud integration URL. If the age-of-oldest-message threshold is breached, we receive an alert via PagerDuty Operations Cloud.
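As a rough illustration of that setup, the sketch below builds the parameters for a CloudWatch alarm on the SQS `ApproximateAgeOfOldestMessage` metric that notifies an SNS topic. The queue name, topic ARN, and five-minute threshold are hypothetical, and the sketch assumes the SNS topic already has an HTTPS subscription pointing at the PagerDuty integration URL. The dictionary uses the keyword names accepted by boto3's `put_metric_alarm`; the actual call is left as a comment so the example runs without AWS credentials.

```python
def build_queue_age_alarm(queue_name, sns_topic_arn, max_age_seconds=300):
    """Build put_metric_alarm parameters for the age of the oldest SQS message."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,             # evaluate the metric over one-minute windows
        "EvaluationPeriods": 1,   # alarm after a single breaching window
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic subscribed by PagerDuty
    }


# Hypothetical names, for illustration only.
params = build_queue_age_alarm(
    queue_name="orders-queue",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:pagerduty-alerts",
)
print(params["AlarmName"], params["Threshold"])

# With boto3 installed and AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```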
As part of the same use case, we have also integrated other tools such as Prometheus and Grafana, along with Alertmanager. In Grafana, we have connected our dashboard metrics and created alerts that fire when a threshold is breached. Based on those metrics, a PagerDuty Operations Cloud alert is triggered.
What is most valuable?
PagerDuty Operations Cloud offers great features, including service directories, Slack integrations, incident reports, alert suppression, orchestration, team management, and permission handling such as read or write. These are the best features according to my daily experience.
The integration part stands out for us, as we have utilized integrations like AWS, Azure, Jenkins for pipeline breaches, Slack, and New Relic, along with a number of plugins that are helpful.
In the market, when comparing PagerDuty Operations Cloud to VictorOps and other services, PagerDuty Operations Cloud offers great features, particularly the simplicity of its integration plugins, which is a significant advantage.
If there are issues in our production environment, we immediately get alerts and can take action. Even if we are busy with other issues and we cannot fully monitor our dashboards, the integration of alerts is fantastic because it notifies us via phone at any time. We can react immediately based on the alert's priority and take necessary action, leveraging our understanding of the infrastructure and the type of issues.
What needs improvement?
In terms of improvements, I do not have any specific suggestions. The alert suppression and merging features look good, and overall, I see no issues.
No improvements are necessary from what I have heard, and everything seems fine. I manage PagerDuty Operations Cloud operations for my team, including on-call management and schedules, and I am happy with everything from both my colleagues' and my perspective.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for eight years.
What do I think about the stability of the solution?
The accuracy and reliability of PagerDuty Operations Cloud are good, and I have no doubts about it.
What do I think about the scalability of the solution?
Currently, PagerDuty Operations Cloud is deployed in my organization on the public cloud.
What other advice do I have?
As for measurable outcomes like reduced downtime or faster response times, sometimes we have to acknowledge even low-priority alerts. One instance involved a few users on my team who did not accept a low-priority alert.
From my perspective, PagerDuty Operations Cloud is good, with user-friendly features that anyone can quickly learn, including integration processes, on-call management, and escalation policies. It is a valuable asset for my organization.
Recently we integrated our AI capabilities into PagerDuty Operations Cloud, which helps us get alerts integrated with Slack, providing information on the service directory and AWS CloudWatch URL.
We have integrated our AI agents, which provide comprehensive details such as tasks, responses, acknowledgments, resolves, and links, making it beneficial for viewing and acknowledgment processes.
I need to dive deeper into this, but as of now, the resources we are using have all the necessary functionality and everything is working well.
I provide a review rating of 8.5 for PagerDuty Operations Cloud.
Integrated alerts have simplified incident response and reduced resolution time for our teams
What is our primary use case?
I mostly use PagerDuty Operations Cloud for monitoring and alerting through its Jira and Slack integrations. We create a ticket, and when we want to close it, we use the PagerDuty Operations Cloud dashboard itself instead of going to the Jira dashboard and closing it manually, so the integrations have helped a lot. With the Slack integration, instead of closing the incident directly in PagerDuty Operations Cloud, we can close it from the Slack message itself.
I think the API usage for PagerDuty Operations Cloud is also helpful.
What is most valuable?
Regarding team management in PagerDuty Operations Cloud, I appreciate how it shows which people are API users, who can manage the API, and who can manage the entire system.
PagerDuty Operations Cloud has positively impacted my organization as I think the mean time to resolve has reduced significantly.
The solution's alert reduction feature has had a positive impact on preventing costly incidents in our organization as we have reached our SLAs due to that, by reducing alert noise.
What needs improvement?
I would like to add that everyone is integrating AI into their tools, and I am unsure whether PagerDuty Operations Cloud has that at the moment besides the cost.
I choose nine out of ten because I think there should be some edge cases including integration with AI. Every tool is evolving, so I think PagerDuty Operations Cloud still needs some more advanced features to compete with other services.
I do not have much idea about PagerDuty Operations Cloud's AI capabilities when it comes to governance and security, but we need to set guardrails for it on its capabilities.
Regarding PagerDuty Operations Cloud's AI capabilities, I think it is mostly reliable, as far as I know, when it comes to its accuracy and reliability of output.
I have not implemented AI and automation through PagerDuty Operations Cloud for incident response, and I am not aware of how it has changed our operational efficiency because it was not really needed during that time.
I am not aware of any ways PagerDuty Operations Cloud's AI functionality has improved my team's ability to focus on core tasks rather than routine issues.
I would assess the effectiveness of PagerDuty Operations Cloud's generative AI in providing insights for decision-making as something that would be good if it can reach SLAs and eliminate false alarms.
Which solution did I use previously and why did I switch?
We previously used a different solution, CloudWatch alerting, but we had to write our own Lambdas, which did not alert on phones or route alerts based on timing or geographic location. That is why we switched to the enterprise-level PagerDuty Operations Cloud, where we can create schedules and alert based on alert policies.