My primary use case for PagerDuty Operations Cloud is the on-call support that my team provides. We work in two phases: we build and architect things, and then we support them. Every week, two engineers from my team serve as on-call support engineers for any escalations or emergencies in our production environment. We have monitoring and observability enabled for our services, applications, and servers. Whoever is on call for a particular week follows an escalation matrix with five people: the primary and secondary on-call engineers, the primary and secondary management contacts, and, at the highest level, the CVP. If something is not resolved by the primary engineer, it escalates to the secondary engineer, then to primary management, secondary management, and so on. We have a predefined roster scheduled for approximately the next year, in which each engineer is assigned a week as either the primary or the secondary on-call engineer. Any escalations during that period are managed by PagerDuty Operations Cloud and communicated to the primary engineer, who then acknowledges, resolves, or escalates them, or takes whatever other action is needed.
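The weekly rotation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how PagerDuty builds schedules internally; the engineer names, the start date, and the helpers `build_roster` and `on_call_for` are all hypothetical.

```python
from datetime import date, timedelta


def build_roster(engineers, start, weeks):
    """Assign a primary and a secondary engineer to each week.

    Assumption: the secondary is simply the next engineer in the list,
    so everyone rotates through both roles.
    """
    roster = []
    n = len(engineers)
    for week in range(weeks):
        roster.append({
            "week_start": start + timedelta(weeks=week),
            "primary": engineers[week % n],
            "secondary": engineers[(week + 1) % n],
        })
    return roster


def on_call_for(roster, day):
    """Return the roster entry that covers the given day, or None."""
    for entry in roster:
        week_end = entry["week_start"] + timedelta(weeks=1)
        if entry["week_start"] <= day < week_end:
            return entry
    return None


# Hypothetical team and start date; 52 weeks approximates "the next year".
engineers = ["alice", "bob", "carol", "dave"]
roster = build_roster(engineers, date(2025, 1, 6), weeks=52)
current = on_call_for(roster, date(2025, 1, 15))
```

Looking up any date in the roster returns that week's primary and secondary, which is the information an escalation matrix needs for its first two levels.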
PagerDuty Operations Cloud
PagerDuty
External reviews
External reviews are not included in the AWS star rating for the product.
Boringly Reliable Incident Response Glue That Gets the Right Person Paged Fast
It pulls alerts in from everywhere, routes them intelligently, and makes sure the right person gets paged quickly. The mobile app is solid, the on-call scheduling is flexible, and having runbooks/notes right in the incident makes handoffs a lot less painful.
It’s boringly reliable, which is exactly what I want from an incident tool.
PagerDuty helps cut MTTA/MTTR, keeps our on‑call rotation sane, and makes it much easier to coordinate during outages. Clear ownership, sensible escalation, and postmortems replace Slack chaos and guesswork.
PagerDuty Brings Order to Production Incidents with Smart Escalations
For example, at a fintech firm, our payment reconciliation process would sometimes fail during settlement periods because of delays in queue processing between our bank partners and internal ledgers. Before we implemented structured incident management, alerts were handled chaotically across Slack channels and email. After we introduced PagerDuty, we had a defined, service-based process for escalating alerts. Whenever latency increased, the relevant backend, infra, and database engineering teams were brought in through the incident process to minimize any delay in resolving the issue.
Another benefit of PagerDuty was its ability to cut down on noise and prioritize critical alerts. In fast-growing systems and event-driven architectures, it is easy for engineers to become desensitized by too many alerts.
For example, in a fintech environment running real-time fraud detection and transaction monitoring applications, we originally set up too many infrastructure and application-level alerts in PagerDuty. Over time, engineers began receiving numerous low-information alerts during peak transaction periods, coming from various dependent microservices whose performance had degraded without affecting customers. As a result, engineers spent more time dealing with notifications than solving the underlying problem. PagerDuty itself was in no way the problem here, but the platform requires precise incident design and management in large-scale distributed environments.
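One way to picture the kind of incident design that avoids this is a simple rule that pages someone only when an alert is critical or affects customers. The sketch below is a generic illustration, not PagerDuty's rule engine; the `Alert` fields, the `customer_impact` flag, and the service names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: str          # "critical", "error", "warning", or "info"
    customer_impact: bool  # hypothetical flag set by the monitoring tool


def urgency(alert):
    """Page only for alerts that are critical or that affect customers."""
    if alert.customer_impact or alert.severity == "critical":
        return "page"
    return "low_urgency"


# Made-up alerts from a peak transaction period.
alerts = [
    Alert("fraud-scoring", "warning", customer_impact=False),
    Alert("ledger-writer", "critical", customer_impact=False),
    Alert("payments-api", "error", customer_impact=True),
]
paged = [a.service for a in alerts if urgency(a) == "page"]
```

Here the degraded but harmless `fraud-scoring` warning is kept out of the paging path, while the other two alerts still reach an engineer.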
The second point I'd like to make is that maintaining escalation policies becomes more challenging in terms of operational costs as the team grows larger. Engineering organizations often undergo frequent changes in terms of responsibilities and service ownership, so keeping escalation structures up-to-date is important for proper incident resolution.
In a financial technology setting with transaction processing and settlement flows, even small delays can result in downstream operational problems such as payment failures, accounting discrepancies, customer complaints, and regulatory risks. In the absence of formal incident management protocols, alerts would be scattered throughout various monitoring services, emails, and communication channels, sometimes resulting in ownership disputes when outages occurred. The PagerDuty suite unified this entire workflow process, routing all incidents according to their respective ownership and priority automatically.
For example, there was an actual operational situation where we received latency alerts from our banking integration partner due to API requests being slow during peak payout periods. The initial challenge for us was to determine whether the problem was in our infrastructure, in our database, or if it was coming from our integration partner. However, with PagerDuty integrated with our monitoring systems and escalation processes, the backend engineers, infrastructure responders, and platform leads could be notified simultaneously and collaborate to resolve the bottleneck much faster than before.
Peace of mind for on-call teams, though setup takes time
For me, the biggest benefit is a lower Mean Time to Repair (MTTR) and a clear, automated escalation path. It gives me peace of mind that if something breaks at 3 AM, the right person is notified immediately. I also have the context I need to fix the issue quickly, which ultimately helps us maintain our service level agreements (SLAs) with our customers.
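For readers who want to track this themselves, mean time to acknowledge (MTTA) and MTTR are just averages over incident timestamps. The sketch below uses made-up incidents and hypothetical field names (`triggered`, `acknowledged`, `resolved`); it does not read anything from PagerDuty.

```python
from datetime import datetime, timedelta


def mean_duration(pairs):
    """Average the elapsed time over a list of (start, end) pairs."""
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


# Made-up incidents for illustration only.
incidents = [
    {
        "triggered": datetime(2025, 1, 1, 3, 0),
        "acknowledged": datetime(2025, 1, 1, 3, 4),
        "resolved": datetime(2025, 1, 1, 3, 40),
    },
    {
        "triggered": datetime(2025, 1, 2, 14, 0),
        "acknowledged": datetime(2025, 1, 2, 14, 2),
        "resolved": datetime(2025, 1, 2, 14, 20),
    },
]

mtta = mean_duration([(i["triggered"], i["acknowledged"]) for i in incidents])
mttr = mean_duration([(i["triggered"], i["resolved"]) for i in incidents])
```

With these two sample incidents, `mtta` works out to 3 minutes and `mttr` to 30 minutes.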
Essential for Incident Management with Room for Improvement
On-call workflows have been streamlined and critical alerts are now managed without being missed
What is our primary use case?
What is most valuable?
The best feature of PagerDuty Operations Cloud is its alerting. Here is how the process works: suppose there is an XYZ server in my environment hosting a production or development application, and a primary on-call engineer has been assigned for that particular week. We have set up monitoring and observability for that node so that if the node is not reachable, an alert is triggered and a notification is sent to the Slack channels we have integrated with PagerDuty Operations Cloud. If the engineer is available, they can acknowledge the alert. If they fail to acknowledge it, the system calls them on the number they provided. If the call is not acknowledged either, it sends a text message. If none of those are acknowledged, it sends an alert to the secondary engineer and calls them as well. This multi-channel approach makes it very difficult to miss an important alert or update. PagerDuty Operations Cloud handles this process perfectly, and we do not miss any alerts because of it.
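The sequence above can be simulated in a few lines. This is a sketch of the flow as I have described it, not PagerDuty's implementation; the channel names, the `notify` helper, and the assumption that the secondary engineer goes through the same three channels are mine.

```python
# Channels tried for each responder, in the order described above.
CHANNELS = ["slack", "phone_call", "sms"]


def notify(responders, acknowledges):
    """Walk responders and channels in order until someone acknowledges.

    `acknowledges` is a set of (responder, channel) pairs that would be
    answered; it stands in for real human behaviour in this simulation.
    Returns the list of attempts made and the acknowledging responder
    (or None if nobody answered).
    """
    attempts = []
    for responder in responders:
        for channel in CHANNELS:
            attempts.append((responder, channel))
            if (responder, channel) in acknowledges:
                return attempts, responder
    return attempts, None


# The primary misses everything; the secondary answers the phone call.
attempts, owner = notify(
    ["primary", "secondary"],
    acknowledges={("secondary", "phone_call")},
)
```

In this run the alert goes through all three channels for the primary before the secondary picks it up on the second attempt made to them.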
Regarding the stability of PagerDuty Operations Cloud, I cannot recall an incident where it was not available. I can say that it is 100 percent reliable for my needs.
What needs improvement?
PagerDuty Operations Cloud itself functions well, but our setup sometimes feels irritating. The calls come very early in the morning, and even after we acknowledge them, numerous calls from PagerDuty Operations Cloud pop up before we have fully woken up. We try to snooze them, but this is a result of how we have configured our alerting mechanisms rather than a PagerDuty Operations Cloud issue.
Another piece of feedback is that there should be more options for changing the automated voice that calls us. The automated voice could be better as it is not very interesting and feels outdated. I have not seen updates to it during the time I have been using PagerDuty Operations Cloud. I do not see many updates made to PagerDuty Operations Cloud overall. The UI is simple, but it should be refreshed periodically to keep up with current times. Everything needs a fresh appearance periodically.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for the past four years.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud's scalability as a solution is fairly straightforward and has maintained its effectiveness. With new integrations being launched periodically, such as Slack and Datadog, the platform has blended itself well from an integration perspective. Whatever top-notch tools we are using as an enterprise solution, PagerDuty Operations Cloud has kept itself current and integrates nicely with all the tools we use these days. Even with upgrades we are making, such as adopting AI agents and using the latest AWS services like Bedrock and SageMaker, it is fairly easy to integrate with PagerDuty Operations Cloud. The platform also provides the option of changing rosters. PagerDuty Operations Cloud has a mobile application that I can use on my mobile phone, and it is fairly easy to use from mobile as well. I can change my rosters and reach out to my primary management and other contacts through the mobile app. It has kept up with recent times and continues to evolve.
What other advice do I have?
I was not involved in the deployment of PagerDuty Operations Cloud because it was already in my organization when I joined. However, after using it for the past four years, I can say it does not need much complexity. The architecture is straightforward, as PagerDuty Operations Cloud is integrated with my Datadog and Slack systems. The integrations are easy, and it is now being integrated with AI agents as well. In the UI, I can see who is on the schedule. Although I was not involved in the deployment, I know it is fairly easy to use.
I would rate PagerDuty Operations Cloud around a nine out of ten. I deduct one point for the lack of updates to the UI, as the platform has not made many updates to its interface. Despite this, it does the job that it needs to do, and I would rate it a nine.
Proactive alerts and clear incident documentation have improved our outage response times
What is our primary use case?
I handle level two operations. Whenever a major incident or outage occurs, I am informed first via Splunk, Datadog, and PagerDuty. PagerDuty Operations Cloud is used for alerting, and we have configured threshold values within it. I have the mobile application installed on my phone, so I receive information about any outage as soon as it occurs.
I work at Vodafone Intelligence Services, which is a subsidiary of Vodafone. We are a UK-based company that performs level one, level two, and level three operations for all European countries and some countries in Africa, including South Africa, Ghana, Spain, Egypt, Hungary, and the UK. These are our major customers. As part of operations, we have a team of about 15,000 people who manage the different markets and customers. PagerDuty Operations Cloud is used everywhere across our organization, along with Splunk and AppDynamics.
What is most valuable?
PagerDuty Operations Cloud is used for monitoring, and we upload detailed documentation for major incidents such as those of P2 or P1 severity. We prepare documentation about the incident, including what caused it, what the resolution time was, and what the impact was, and then put it on PagerDuty Operations Cloud. Apart from this, we do not use it for any other applications; it is used exclusively for monitoring and for setting up alerting.
We receive many benefits as part of L1 or L2 operations running 24 hours a day. As soon as there is an issue, if I am the first point of contact and I do not receive the call, it goes to the second person, my line manager. If my line manager does not pick up the phone, it goes to the third person, the skip-level manager. This is beneficial for us; even if it is a minor outage lasting 5 or 10 minutes, we receive an alert about it. If there is a major incident, we still receive the alert. Even if we are away from the system and not actively monitoring, we get the alert as soon as there is an outage.
We have the TIBCO integration layer, which is integrated with DataDog, and DataDog is integrated with PagerDuty Operations Cloud. When we ask PagerDuty Operations Cloud how many incidents are recurring with a specific service, it provides historical data showing how many times that service was down.
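The kind of historical summary I mean boils down to counting incidents per service. The sketch below does this over a made-up incident list; the service names and dates are hypothetical, and it does not call PagerDuty or Datadog.

```python
from collections import Counter

# Made-up (service, date) incident records for illustration.
incidents = [
    ("integration-layer", "2025-01-03"),
    ("billing-api", "2025-01-05"),
    ("integration-layer", "2025-01-09"),
    ("integration-layer", "2025-01-20"),
]

# Count how many times each service had an incident.
counts = Counter(service for service, _ in incidents)

# Keep only the services that went down more than once.
recurring = {service: n for service, n in counts.items() if n > 1}
```

Only services with more than one incident end up in `recurring`, which is usually the list worth reviewing first.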
What needs improvement?
I do not see any improvements needed in how I use PagerDuty Operations Cloud; it is still good. We receive phone calls and emails, but the use case is limited. It needs to be integrated with some other applications. I expect it to be one platform for all operations; it should not depend upon Splunk, DataDog, or other applications or tools. Everything should be in one place to make things easier and reduce complexity. Otherwise, we have to manage different tools. I expect monitoring tools to be consolidated together for better results and less complexity.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for almost three and a half years now.
What do I think about the stability of the solution?
I have not seen any crashes in PagerDuty Operations Cloud; it is a good tool. The user interface is what I appreciate most. Spending time in PagerDuty Operations Cloud is not tedious: it runs smoothly, it is always available, and the interface is very friendly. I have used other tools like Datadog, which is a little more complex, but PagerDuty Operations Cloud is a good tool with a friendly UI.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud will grow; we do not have any concerns for this product. We need to put the system together for alerts, and it is good that PagerDuty Operations Cloud has the availability and will definitely grow over time.
How are customer service and support?
For technical support, we raise tickets most of the time and do not get in touch with them directly. However, we receive resolutions in a timely manner. The technical support team has expertise and answers our questions on the first attempt, keeping interactions short and simple, which makes a huge impact. If they sent us back and forth, it would make for lengthy discussions. When we raise an issue with PagerDuty Operations Cloud technical team, they respond effectively and keep it concise. We do not have to raise multiple tickets for the same issue; it is the best experience we have had.
Which solution did I use previously and why did I switch?
I have been with the organization for three and a half years, and I have used PagerDuty Operations Cloud from day one. I do not know how the team managed incidents before that, but I believe the organization has been using PagerDuty Operations Cloud for longer than I have been here. PagerDuty Operations Cloud helps us monitor systems effectively, and we have not had any escalations to date. We handle outages within 10 to 15 minutes. Customers may panic, but major escalations are managed effectively.
What other advice do I have?
PagerDuty Operations Cloud is used for monitoring, and detailed documentation is uploaded for major incidents such as those of P2 or P1 severity. The documentation covers what caused the incident, what the resolution time was, and what the impact was, and it is then put on PagerDuty Operations Cloud. Apart from this, it is not used for any other applications; it is used exclusively for monitoring and for setting up alerting. I would rate this product 8 out of 10.
Flexible Alerting Options That Keep Us on Top of Incidents
PagerDuty Simplifies Major Incident Management and Escalations
Reliable Alerting and Strong On-Call Management
Incident response has become faster and on-call alerts stay reliable for critical operations
What is our primary use case?
I am an end user of PagerDuty Operations Cloud in my organization, with a background in incident management. I primarily use it for managing on-call schedules, triggering and handling incidents, and monitoring alerts. It helps ensure timely responses, efficient escalation, and better coordination during incidents, making it a key tool for maintaining operational reliability.
How has it helped my organization?
PagerDuty Operations Cloud has improved our incident response by ensuring reliable alerting and faster escalation to the right teams. It has significantly reduced alert fatigue through better alert filtering and deduplication. The platform has also lowered our mean time to resolve (MTTR) with runbook automation and streamlined on-call management, leading to fewer disruptions and improved overall operational efficiency.
What is most valuable?
The features of PagerDuty Operations Cloud that I have found most valuable are its alerting, which is very reliable with minimal delays, and its flexible escalation policies and routing rules. Additionally, the on-call scheduling capabilities are great, and it integrates well with cloud platforms such as AWS, GCP, and Azure, and with observability tools such as Datadog and New Relic for reviewing logs.
I have noticed that PagerDuty Operations Cloud influences revenue protection by reducing alert fatigue and incident costs. AIOps has helped recently in reducing noise and alert duplications, and runbook automations aid in lowering the mean time to resolve by integrating triggers to Slack and updating runbooks.
I see PagerDuty Operations Cloud as a very good incident management and on-call platform, mostly used by large-scale organizations because it comes with premium pricing, but it is very reliable with alerting and on-call scheduling, triggering incidents, escalation policies, rules, and runbooks.
What needs improvement?
I think an area of PagerDuty Operations Cloud that could be improved is their premium pricing, as it compares unfavorably with competitors such as Atlassian's Opsgenie and ServiceNow, which offer bundle deals, plus DataDog now has incident management capabilities. Overall, the premium pricing makes it less accessible for small to medium businesses.
I think the pricing of PagerDuty Operations Cloud is a bit too high. The UI can also feel a bit confusing for new users, and the learning curve might be steep for them. The initial setup is straightforward, but event orchestration can be complex, and the automation workflows definitely require considerable expertise.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for approximately four years.
What do I think about the stability of the solution?
I would rate the stability and reliability of PagerDuty Operations Cloud 9.5 out of 10. The platform is highly stable and dependable in production environments, especially for critical incident management workflows. We have experienced consistent alert delivery, reliable on-call scheduling, and minimal downtime or disruptions.
That said, no system is completely perfect, so I cannot say it is 100% flawless. However, overall it has proven to be very reliable for mission-critical operations, where even small delays or failures would have significant impact.
What do I think about the scalability of the solution?
I would rate the scalability of PagerDuty Operations Cloud 8 out of 10. From a technical perspective, the platform scales very well and can support large, distributed teams with complex incident management needs. It handles high volumes of alerts, multiple services, and integrations across cloud platforms efficiently.
However, the main limitation to scalability is its premium pricing. As organizations grow and onboard more users or services, the cost increases significantly, which can be a challenge for small to mid-sized teams. So while it is technically highly scalable, cost can be a limiting factor for broader adoption.
How are customer service and support?
I have had regular interactions with PagerDuty Operations Cloud’s technical support, and my overall experience has been positive. The support team is responsive and helpful in addressing queries.
For example, during a user audit, I requested specific data on active users and those who had not accepted invitations. The support team responded quickly and provided the required information without delays. Overall, the support experience has been efficient and reliable when assistance is needed.
Which solution did I use previously and why did I switch?
I have only used PagerDuty Operations Cloud; recently, at my new organization, I have also started using FireHydrant.
How was the initial setup?
I was not directly involved in the initial setup of PagerDuty Operations Cloud, as it was handled by senior team members. However, from my observations, the setup process appears to be straightforward at a basic level for core features like alerting and on-call scheduling.
That said, advanced configurations such as event orchestration and automation can become complex. If rules are not configured properly, they may lead to alert storms or missed incidents. Additionally, runbook automation is not plug-and-play and typically requires scripting knowledge and careful setup to function effectively.
What was our ROI?
From an ROI perspective, I do not have direct visibility into financial metrics, so I cannot quantify exact cost savings. However, I have seen strong operational ROI from PagerDuty Operations Cloud.
It has improved incident response efficiency by reducing alert fatigue, ensuring faster escalation, and lowering mean time to resolve (MTTR) through runbook automation. These improvements have helped prevent prolonged outages and reduced the impact of incidents, which indirectly contributes to cost savings and better service reliability at an operational level.
What other advice do I have?
Although I have some exposure to its autonomous AI agents, I have not extensively used its AIOps or generative AI capabilities. Despite that, the platform has had a strong positive impact on our operations.
By properly configuring alerting rules, we have been able to significantly reduce alert fatigue and shift focus toward more critical issues rather than routine noise. PagerDuty has also helped in reducing the number of duplicate alerts through intelligent pattern recognition.
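To show what collapsing duplicates means in practice, here is a much simplified sketch that groups alerts by a key built from the service and the failing check. It is not PagerDuty's pattern recognition, which I have no visibility into; the `dedup_key` format, the alert fields, and the sample data are assumptions for the example.

```python
def dedup_key(alert):
    """Build a grouping key; alerts with the same key are one incident."""
    return f'{alert["service"]}:{alert["check"]}'


def group_alerts(alerts):
    """Collapse a stream of alerts into incidents keyed by dedup_key."""
    incidents = {}
    for alert in alerts:
        incidents.setdefault(dedup_key(alert), []).append(alert)
    return incidents


# Made-up alert stream: the same disk check fires twice.
alerts = [
    {"service": "orders-db", "check": "disk_full", "host": "db-1"},
    {"service": "orders-db", "check": "disk_full", "host": "db-2"},
    {"service": "checkout", "check": "latency_high", "host": "web-3"},
]
incidents = group_alerts(alerts)
```

Three raw alerts become two incidents, so the on-call engineer is notified about two problems rather than three.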
Additionally, runbook automation has contributed to lowering our mean time to resolve (MTTR), enabling faster and more efficient incident handling. Overall, it has helped prevent costly incidents and improved operational efficiency across the team.
I would rate PagerDuty Operations Cloud 9.5 out of 10.