
Overview
High customer expectations and increasingly distributed systems mean disruptions to digital services can have catastrophic effects on sales, brand loyalty, and operating costs. The PagerDuty Operations Cloud deflects unnecessary work from teams and subject matter experts so they can focus on delivering business value. Urgent work is escalated to the right teams and routine work is made self-service. Teams can automate and accelerate issue resolution with minimal human interruption, and improve system resilience and team capacity while reducing the strain of operational complexity and the unexpected.
With more than 700 integrations, APIs, and apps for customer service, the PagerDuty Operations Cloud empowers rapid responses in any environment. And thanks to more than 10 years of data ingestion, its machine learning-powered AIOps functionality can reduce alert noise by up to 98% and drive down MTTR with critical context for faster triage and effective automation.
PagerDuty integrates with various AWS services, including Amazon CloudWatch, Amazon GuardDuty, AWS CloudTrail, AWS Personal Health Dashboard, Amazon EventBridge, AWS Security Hub, Amazon DevOps Guru, AWS Control Tower, AWS Outposts, and Amazon S3 Storage Lens.
AIOps
PagerDuty AIOps helps teams reduce noise, triage efficiently to drive the right actions towards resolution, and remove manual, repetitive work from the incident response process. Noise reduction baked in with an ML model that learns and adapts based on user behavior means teams see fewer incidents overall. And automating toil from manual event processing results in greater efficiency, saving teams valuable time for innovating.
Process Automation
PagerDuty Runbook Automation is a managed cloud service that enables DevOps teams and SREs to create and delegate operational tasks in automated runbooks to other stakeholders such as developers, NOC personnel, and incident responders. Runbook Automation provides automated workflows and task automation focused on IT and developer process automation. Examples include service provisioning, CI/CD, configuration management, incident diagnosis and remediation, and more. With PagerDuty Runbook Automation, you can resolve requests in minutes, rather than days, optimize security and compliance, and give your engineers more time to spend on innovation rather than firefighting.
Incident Response
PagerDuty helps you save time and money by bringing together the right teams with the right information to resolve incidents faster. Replace manual processes with automation to streamline incident response, freeing up time and resources for more innovation. Orchestrate end-to-end incident response with a service ownership model that only brings in the teams you need. Over 21K organizations trust PagerDuty to help them adopt DevOps best practices and build more resilient operational practices to minimize costly downtime and protect the customer experience.
Custom Private Offer
We can create a custom offer tailored to your needs. Please contact us at aws-sales@pagerduty.com
Highlights
- Incident Response - Manage incidents end-to-end
- Process Automation - Automate and delegate business and IT processes
- AIOps - Maximize IT capacity with fewer incidents and faster resolution
Pricing
| Dimension | Description | Cost/12 months |
|---|---|---|
| Professional | On-call and incident response for growing teams | $252.00 |
| Business | Streamlined incident response for the enterprise | $492.00 |
| CustomerServProfessional | Bi-directional comms between CS & Dev, protect SLAs, & lower MTTR | $252.00 |
| CustomerService Business | Bi-directional comms between CS & Dev, protect SLAs, & lower MTTR | $492.00 |
| Runbook Automation | Automate manual procedures in runbooks | $1,500.00 |
| Automation Actions | Add-on: Automate steps to diagnose & remediate incidents | $240.00 |
| Live Call Routing | Add-on: For on-call schedules & escalations (by line) | $1,890.00 |
| Runbook Auto Job Runner | Add-on: For Runbook Automation | $750.00 |
| Stakeholder Users | Bundle of 50 Stakeholder users | $1,800.00 |
| PagerDuty Status Pages | 1000 User Pack | $1,068.00 |
The following dimensions are not included in the contract terms and are charged based on your usage.
| Dimension | Cost/unit |
|---|---|
| Additional events over contracted value | $0.06 |
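As a rough illustration of how the contracted and usage-based charges combine, the sketch below adds up a few dimensions from the tables above. It assumes each dimension is billed per unit purchased, which the listing does not state explicitly, and the function and variable names are hypothetical.

```python
from decimal import Decimal

# Annual contract prices copied from the pricing table (USD per 12 months).
ANNUAL_PRICES = {
    "Professional": Decimal("252.00"),
    "Business": Decimal("492.00"),
    "Runbook Automation": Decimal("1500.00"),
    "Automation Actions": Decimal("240.00"),
}

# Usage-based rate from the second table (USD per additional event).
OVERAGE_PER_EVENT = Decimal("0.06")


def estimate_annual_cost(quantities, overage_events=0):
    """Sum contracted dimensions plus usage-based overage.

    Assumption: every dimension is charged per unit purchased.
    """
    contracted = sum(
        ANNUAL_PRICES[name] * qty for name, qty in quantities.items()
    )
    return contracted + OVERAGE_PER_EVENT * overage_events


# Ten Professional units plus 1,000 events over the contracted value:
# 10 * 252.00 + 1000 * 0.06 = 2580.00
total = estimate_annual_cost({"Professional": 10}, overage_events=1000)
```

Using `Decimal` rather than floats keeps the cents exact when many line items are summed.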
Vendor refund policy
All fees are non-cancellable and non-refundable except as required by law.
Delivery details
Software as a Service (SaaS)
SaaS delivers cloud-based software applications directly to customers over the internet. You can access these applications through a subscription model. You will pay recurring monthly usage fees through your AWS bill, while AWS handles deployment and infrastructure management, ensuring scalability, reliability, and seamless integration with other AWS services.
Support
Vendor support
Our team provides multiple resources for customers to find answers to questions and get help with our product. Users may browse our integration guides (pagerduty.com/integrations) to integrate with partner tools, our knowledge base (support.pagerduty.com) to learn more about using PagerDuty, and our developer docs (developer.pagerduty.com) to use our APIs. Additionally, anyone can interact with other PagerDuty users and PagerDuty employees via the PagerDuty Community (community.pagerduty.com). Our Support team is available during regular business hours around the globe, Monday through Friday, and can be contacted by email at support@pagerduty.com or via a ticket submitted at tickets.pagerduty.com.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.


Customer reviews
Incident workflows have transformed and now reduce downtime for critical gaming services
What is our primary use case?
My name is Dinesh Singh Negi and I currently work as a Lead DataOps Engineer in the online gaming industry. My primary responsibility is ensuring the reliability, availability, and performance of our data platform and complete production system. I work extensively with AWS services, Prometheus, Grafana, and PagerDuty for monitoring, alerting, and incident management. My team supports critical gaming workloads and data pipelines that require high uptime and quick incident response. A significant part of my role involves setting up monitoring strategies, managing on-call operations, handling production incidents, and performing root cause analysis. We drive operational improvements, and we use PagerDuty Operations Cloud as our central incident management platform to ensure alerts are routed to the right team and escalated appropriately. I have been working in operations and reliability for nine to ten years and have hands-on experience managing large-scale customer-facing environments where minimizing downtime and reducing mean time to resolution are key priorities. We also use PagerDuty Operations Cloud to track the maximum time to acknowledgment and maximum time to resolution, so we can derive meaningful analysis from the incidents triggered for different teams.
I have worked for nine to ten years in operations, production support, reliability engineering, and mixed roles. During this time I have worked extensively on monitoring, incident management, system reliability, and operational excellence, particularly supporting large-scale online platforms and data operations. For five to six years, my focus has been on ensuring high availability, managing production incidents, optimizing monitoring and alerting strategies, and improving operational processes. Throughout these years, I have gained hands-on experience with AWS Cloud, Prometheus, Grafana, and PagerDuty Operations Cloud, which are the core tools we use for monitoring, alerting, and incident response.
What is most valuable?
The best features are those we have been using for incident management. We have been using PagerDuty Operations Cloud for on-call scheduling, escalation policies, and integration capabilities. Incident management is extremely valuable because it ensures critical alerts are delivered to the right people immediately. On-call scheduling and escalation policies are very helpful because we can define clear ownership for the services and automatically escalate incidents if they are not acknowledged within a specific timeframe. Another key strength is the integration ecosystem. We can integrate it with our monitoring stack including Prometheus, Grafana, and AWS services, which helps us automate alerts ingestion and incident creation without manual intervention. The most valuable features are automating alerts, escalations, on-call management, integrations, and incident analytics.
One example that stands out was a production incident where we experienced a sudden spike in database latency during peak gaming hours. This started impacting player transactions and causing delays in some backend services. Our Prometheus and Grafana monitoring detected this abnormal latency and error rate increase, which went beyond a threshold, and the alert was automatically routed to PagerDuty Operations Cloud. PagerDuty Operations Cloud immediately notified the on-call engineer of our team and triggered the escalation workflow based on the incident severity. Since the issue occurred during peak traffic, quick response was critical, which was maintained. PagerDuty Operations Cloud helped us coordinate multiple teams, including DataOps, application, and other infrastructure teams. The platform helped ensure everyone was engaged quickly and that no critical notifications were missed. While we were under investigation, we identified a resource bottleneck in the database layer caused by an unexpected traffic surge. With the help of the database team, we scaled the required AWS resource and optimized a few long-running queries. This restored normal performance.
What needs improvement?
A significant positive impact is improved incident response efficiency and overall service reliability. Before we had a mature incident management process, coordinating responses during critical issues often required manual communication and follow-ups. PagerDuty Operations Cloud automated all of those things, including alert ownership and escalation, ensuring that incidents are routed to the right team members immediately. One of the most measurable benefits is the reduction in mean time to acknowledge and mean time to resolve. Faster detection and response help minimize service disruptions and maintain a stable experience for our users, which is especially important in the online gaming industry where availability and performance directly affect customer satisfaction. The platform has helped us mature our operational practices by analyzing incident trends, alert volumes, and escalation patterns. We have been able to refine our monitoring, reduce alert fatigue, and proactively address recurring issues before they become major bottlenecks in production.
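For readers unfamiliar with the two metrics mentioned above, here is a minimal sketch of how mean time to acknowledge (MTTA) and mean time to resolve (MTTR) can be computed from incident timestamps. The incident records and field names are hypothetical and do not reflect PagerDuty's export format.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average gap in minutes between two timestamps across incidents."""
    gaps = [
        (inc[end_field] - inc[start_field]).total_seconds() / 60
        for inc in incidents
    ]
    return mean(gaps)


# Hypothetical sample data: two incidents triggered at the same time.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {
        "triggered": t0,
        "acknowledged": t0 + timedelta(minutes=2),
        "resolved": t0 + timedelta(minutes=30),
    },
    {
        "triggered": t0,
        "acknowledged": t0 + timedelta(minutes=4),
        "resolved": t0 + timedelta(minutes=50),
    },
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")  # 3.0 minutes
mttr = mean_minutes(incidents, "triggered", "resolved")      # 40.0 minutes
```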
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for more than five years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable.
What do I think about the scalability of the solution?
When you are using a tool for incident response, you need to trust that notifications and escalations work when a critical event occurs. PagerDuty Operations Cloud has been very dependable in that regard. Another aspect we have found valuable is the flexibility to support different teams and services as our environment grows. We have added new applications, data pipelines, and AWS service resources. We are able to extend our PagerDuty Operations Cloud configuration without major challenges or changes to our overall operational model.
Which solution did I use previously and why did I switch?
I have not used any solution previously. Since the beginning of 2021, I have been using PagerDuty Operations Cloud.
How was the initial setup?
The setup and customization process was relatively straightforward. The integrations were one of the easiest parts. PagerDuty Operations Cloud provides well-documented integrations for monitoring tools and cloud platforms. Connecting it with our Prometheus, Grafana, and AWS monitoring stack did not require significant development efforts. The initial setup involved configuring alert routing, defining service ownership, and mapping severity levels to appropriate escalation policies. Customizing on-call schedules and escalation workflows was also quite flexible. We were able to create different schedules for various teams, define escalation paths based on incident severity, and establish notification rules that match our operational requirements. As our team and environment grew, we refined the configuration further by tuning alert thresholds and reducing noise to avoid alert fatigue. It is important to ensure engineers receive only actionable alerts rather than excessive notifications.
What about the implementation team?
PagerDuty Operations Cloud's AI and automation capabilities are primarily used for alert correlation, event intelligence, noise reduction, incident prioritization, and providing operational context to responders. These capabilities help engineers identify and respond to issues more quickly while keeping humans in control of critical decisions. We see value in the direction of autonomous operations. If AI agents continue to improve in areas such as incident triage, root cause analysis, and automated remediation for well-understood scenarios, they could further reduce response times and operational overhead.
What was our ROI?
We have seen a positive return on investment from PagerDuty Operations Cloud through improved operational efficiency, faster incident response, and reduced downtime. I cannot share financial figures, but I can speak to operational outcomes we have observed. Since implementing PagerDuty Operations Cloud and integrating it with our AWS, Prometheus, and Grafana monitoring stack, we have seen measurable improvements in incident metrics such as MTTA and MTTR, and reduced alert fatigue through event correlation and alert deduplication. These improvements have helped us a great deal.
Which other solutions did I evaluate?
I did not get a chance to evaluate any other applications. When I was in the company, they were using PagerDuty Operations Cloud only, so I started with that.
What other advice do I have?
My advice would be to start with a clear incident management strategy rather than focusing only on the tool itself. PagerDuty Operations Cloud delivers the most value when you have well-defined service ownership, escalation policies, severity levels, and monitoring practices in place. The platform is very powerful, but its effectiveness depends on the quality of the alerts and operational processes behind it. I would also recommend investing time in alert tuning early on and integrating PagerDuty Operations Cloud with your monitoring stack, whether it is AWS, Prometheus, Grafana, or any other observability tool. Make sure the alerts being sent are actionable. Reducing noise from the beginning will help prevent alert fatigue and improve adoption among engineering teams. I would rate this product an eight out of ten.
AI-driven incident management has reduced downtime and improves focus on strategic work
What is our primary use case?
PagerDuty Operations Cloud is a multifunctional digital operations platform that meets my organization's needs.
I am impressed by this digital operations solution because it is the most appropriate tool for incident detection and alerting.
PagerDuty Operations Cloud is a very user-friendly tool, highly accurate, and an easy-to-customize digital operations management system that suits my organization's needs.
It has intelligent noise reduction capabilities that play a significant role in minimizing alert floods.
What is most valuable?
PagerDuty Operations Cloud offers top-tier features that enable real-time alerting and accelerate incident response.
The solution is reliable and effective when it comes to automating routine diagnostic tasks.
The real-time alerting and automation features have helped my team: problem-solving has become largely automatic, and incident management has become less complex.
PagerDuty Operations Cloud has positively impacted my organization by enabling faster issue response, which helped reduce downtime, saved revenue by avoiding long outages, improved team accountability during incidents, reduced manual effort in handling alerts, and helped maintain a better customer experience.
The solution's alert reduction feature has had a major impact on preventing costly incidents in my organization. By grouping related alerts and de-duplicating noise, my team was able to spot real issues faster instead of getting buried in alerts, helping us prevent two to three potential outages because engineers responded to the root alert instead of missing it in noise.
What needs improvement?
The user interface should be easier to customize and use.
The pricing could be less expensive, especially for smaller organizations.
The user interface could be made easier to customize and navigate so that users who are new to this platform find the learning curve smoother.
PagerDuty Operations Cloud needs improvements because sometimes integrations are not very seamless and misbehave.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for about one year and a few months.
What other advice do I have?
PagerDuty Operations Cloud is a great operational efficiency tool, not just for paging.
It is very cost-effective, especially for organizations that are not limited by budgets.
PagerDuty Operations Cloud solves a lot of problems.
For example, if any issue arises during our online exam with our client, then PagerDuty Operations Cloud alerts the right team and the right people, and tasks are assigned so those problems can be resolved at the correct time and our real task does not get disrupted.
PagerDuty Operations Cloud's AI functionality has improved my team's ability to focus on core tasks rather than routine issues by removing routine alert triage.
The AI groups and de-duplicates alerts automatically, so our engineers are not manually sorting through twenty duplicate notifications for one root issue, allowing them to save a lot of time and focus on other strategic tasks, which improves productivity in my organization.
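The effect described above can be pictured with a simple key-based grouping. This is only a sketch of de-duplication by a shared key, not PagerDuty's grouping algorithm, and the alert fields are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse alerts that share a dedup key into one incident each."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["dedup_key"]].append(alert)
    return dict(incidents)


# Twenty duplicate notifications for one root issue, plus one unrelated alert.
alerts = [{"dedup_key": "db-latency", "host": f"db-{i}"} for i in range(20)]
alerts.append({"dedup_key": "disk-full", "host": "web-1"})

incidents = group_alerts(alerts)
# 21 alerts collapse into 2 incidents for responders to look at.
```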
We are using PagerDuty Operations Cloud's autonomous AI agents for low-severity incidents, which automatically triage, correlate, and resolve known issues without human intervention, such as restarting services or acknowledging flapping alerts.
This has contributed to efficiency by cutting manual workload by thirty-five percent and also reducing MTTR for routine incidents.
PagerDuty Operations Cloud's generative AI is effective at providing insights for decision-making during incidents.
The AI provides clear insights through incident summaries and what-changed analysis, helping us decide where to start troubleshooting instead of guessing, enabling us to make data-driven decisions easily, and providing actionable insights that improve response decisions.
PagerDuty Operations Cloud's embedded AI has had a positive impact on revenue protection: by reducing alert fatigue, it lowers downtime risk and the operational cost per incident.
I would rate this product nine out of ten.
Automated incident paging has improved on-call response but reporting and pricing still need work
What is our primary use case?
The main use case for using PagerDuty Operations Cloud is that we get paged as and when required for all the issues and incidents which are happening, rather than requiring us to keep track of all of them. We are in the exploring phase for using AI around PagerDuty, but that is still in exploration and we haven't started with that.
When there is an incident, we get paged for an alert. We have escalation policies set up that are being followed, and if someone is not acknowledging the page or if someone is not available, then accordingly it will go to the next level of escalation. This ensures that none of the alerts are missed.
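The escalation behavior described above can be sketched as a small simulation. The policy structure, names, and timeouts here are hypothetical and only illustrate the idea of moving to the next level when a page goes unacknowledged.

```python
def notified_responders(levels, acked_by=None):
    """Return (responder, minutes_elapsed) for each page sent, in order.

    levels: list of (responder, timeout_minutes). The timeout is how long
    to wait for an acknowledgment before paging the next level.
    acked_by: the responder who acknowledges, or None if nobody does.
    """
    paged = []
    elapsed = 0
    for responder, timeout in levels:
        paged.append((responder, elapsed))
        if responder == acked_by:
            break  # Acknowledged: stop escalating.
        elapsed += timeout
    return paged


policy = [("primary", 5), ("secondary", 10), ("manager", 15)]

# The primary misses the page, so the secondary is paged after 5 minutes.
result = notified_responders(policy, acked_by="secondary")
# [("primary", 0), ("secondary", 5)]
```

If nobody acknowledges, every level in the policy is paged in turn, which is what keeps an alert from being silently missed.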
PagerDuty Operations Cloud has multiple integrations. In our case, we use the Slack integration the most. The alert triggers from our SignalFx stack, goes to PagerDuty, follows the escalation policy, and reaches the user. Along with that, it is also sent to the Slack channels so that whatever triaging happens for that alert or incident happens over Slack in that particular thread where the alert is triggered from PagerDuty.
What is most valuable?
PagerDuty Operations Cloud is very easy to use and user-friendly.
Regarding the features that PagerDuty Operations Cloud offers, I have explored the automation area and it has a good amount of integrations. For example, the event intelligence and the noise reduction are areas where PagerDuty is really powerful. It reduces and cleans up alerts by doing alert de-duplication and alert grouping. It has also recently got machine learning capabilities, which would surely be helpful. We also have automations and runbooks in place which can help to do auto-remediation of issues or trigger scripts as per the runbooks. We haven't been using all of those things, but I know that these things are present. The incident response on-call management is very easy to use with PagerDuty. There are flexible on-call schedules, escalation policies, and the ability to set up overrides easily. There are multiple channels by which you can send alerts including SMS, calls, and notification pushes.
PagerDuty Operations Cloud also has war room features. Many emerging tools provide this as well, but since PagerDuty is a pretty established company, it has a very mature model with all of these features. The analytics and reporting are also decent.
PagerDuty Operations Cloud has improved our incident management process by ensuring that the right set of people are notified within time. The best part is that it has automated on-call schedules and escalation policies, so you don't have to set them again and again for every week or every month. Features including alert grouping, alert de-duplication, and good analytics and reporting are very helpful during incident management and also for post-incident activities.
What needs improvement?
The analytics and reporting have some scope for improvement. First, it should have more granular capabilities and we should be able to query it in a more granular way. There should also be more advanced trend analysis or cross-team operational insights available. That would be helpful. Licensing is also a bit expensive, so there should be some cost optimization for large deployments to take care of licensing cost optimization. Since we are in the AI era, I know PagerDuty has been investing in a lot of AI capabilities, but there should be good enhancements which we are looking forward to, such as automated root cause analysis or doing historical pattern matching. There could also be recommendations around runbook automation.
For how long have I used the solution?
I have been using PagerDuty for the last one and a half years at Splunk, and before that I was also an active user of PagerDuty in my last organization.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud is pretty scalable. I never had any issue where a large number of alerts impacted PagerDuty.
How are customer service and support?
The support is decent.
Which solution did I use previously and why did I switch?
Previously, we were using OpsGenie, but that was quite a long time ago. PagerDuty Operations Cloud already has all the things which OpsGenie had, as per my knowledge.
How was the initial setup?
The pricing took two points off my overall rating. The remaining areas for improvement took off one more point, resulting in a rating of seven out of ten.
What about the implementation team?
I am not in the position to select any tool. I am not the one who selected or chose PagerDuty or evaluated any tools before that. We are just end users.
What was our ROI?
The return on investment is nine out of ten.
Which other solutions did I evaluate?
We are still in the phase of doing that evaluation and it is not yet completed. However, it is pretty helpful because PagerDuty itself has a good amount of data which can be used with AI to make the best use of it. I am still in the experimenting phase, but the AI functionality of PagerDuty would definitely be a good way to analyze the ongoing issues and how issues are handled right now, tracking the MTTR and MTTDI, and finding spots where there are a lot of areas of improvements which are needed.
What other advice do I have?
PagerDuty Operations Cloud already has a lot of integrations available, which is pretty good. The user experience is swift and smooth, which is a very good thing about PagerDuty Operations Cloud that I appreciate. I am not very aware of the governance and security aspects, but it has SSO as well, which is pretty good. Many organizations would be happy to adopt it, though I am not very aware of these features. The AI capabilities are not very reliable or accurate at the moment, but it is in the development phase and should improve over time.
I don't have exact metrics available, but we saw a significant improvement after onboarding to PagerDuty Operations Cloud. For comparison, the organization before my previous one did not have PagerDuty Operations Cloud, and a good number of alerts were getting missed there. With PagerDuty Operations Cloud, there is a solid layer of notifications and notification policies. If you miss a page, you get a push notification on your mobile. If you miss that, you get a call on your mobile, which is pretty good.
The overall pricing, setup cost, and licensing are pretty expensive. The PagerDuty Operations Cloud licensing is a bit confusing because it is primarily based on users, not on the number of alerts or incidents which are triggered. If it is a small organization, it is good, but if it is a large organization, it is difficult because many people would need to use PagerDuty Operations Cloud. At the same time, to make it more efficient or to get the best out of it, we need to have an end-to-end setup on PagerDuty Operations Cloud, which does take time. There should be some flexible licensing options.
PagerDuty Operations Cloud is a pretty mature product. If you are a mid-scale organization who is trying to get the best out of PagerDuty Operations Cloud, I would recommend going for it. My overall rating for this product is seven out of ten.
Integrated incident workflows have improved on-call efficiency and automated critical alerts
What is our primary use case?
We are currently using PagerDuty Operations Cloud for incident management, escalations, on-call, and the status page, which represents our main product utilization.
What I like the most about it is that it has so many integrations, such as Azure integrations, AWS integrations, and Prometheus and Grafana integration for the alerting system, which makes it more convenient for us. We use all kinds of tools like Grafana and others, which are easy to access and integrate with PagerDuty Operations Cloud. Our infrastructure is more secure because incidents get raised promptly: with the help of PagerDuty Operations Cloud, incidents are triggered automatically after alerts fire.
Currently, there is one tool called Rootly. I think they are new to the industry and we are also using that for one of our other clients. It's somewhat similar, but I think they have the potential to compete with PagerDuty Operations Cloud in the future as well.
As of now, we are not using any generative AI features in PagerDuty Operations Cloud. We are currently using it for on-call and other things.
What is most valuable?
The benefits in terms of on-call are that we are getting maximum utilization of it. Previously, we were not having any alerting system for our client, and after implementing PagerDuty Operations Cloud, we started finding out the root cause and made other things easier compared to earlier. With the help of PagerDuty Operations Cloud, we are able to fix most of the issues and reduce repetitive issues in our infrastructure.
What needs improvement?
There is nothing I dislike about PagerDuty Operations Cloud, though there is one issue that may be due to the network or the delivery medium. Usually, when an incident is triggered, the call arrives within five to ten seconds, but sometimes, maybe due to latency or other factors, the call arrives after two or three minutes. That is understandable, but critical production issues need to be addressed as early as possible, so PagerDuty Operations Cloud needs to reduce that latency. I think they need to work on that. Apart from that, they do most things well, and we are not facing any other issues. Everything is good.
Except for the timing of the calls, we don't see any lagging, crashing, or downtime. In rare cases, we hear some noise on the call. Apart from that, triggering is sometimes a bit slow, but not every time.
For how long have I used the solution?
We have been using PagerDuty Operations Cloud for more than two years.
What do I think about the stability of the solution?
Except for the timing of the calls, we don't see any lagging, crashing, or downtime. In rare cases, we hear some noise on the call. Apart from that, triggering is sometimes a bit slow, but not every time.
What do I think about the scalability of the solution?
Regarding scalability, I don't think there are any issues; it is going well.
How are customer service and support?
We have very good support with PagerDuty Operations Cloud.
In a few cases, not frequently, we have had to contact technical support for clarification regarding an integration or for creating escalation policies. Initially, we reached out to technical support, but now we are well-versed with the tool. The community is good, and I think we are able to get solutions within the community itself.
For the support of PagerDuty Operations Cloud, I would give them a score of nine to ten.
Which solution did I use previously and why did I switch?
Currently, there is one tool called Rootly. I think they are new to the industry and we are also using that for one of our other clients. It is somewhat similar, but I think they have the potential to compete with PagerDuty Operations Cloud in the future as well.
How was the initial setup?
I don't think the deployment for PagerDuty Operations Cloud is difficult to handle. It is easy to handle, and the best thing is they have a very good support team that we can reach out to at any time.
What's my experience with pricing, setup cost, and licensing?
The pricing for PagerDuty Operations Cloud is a bit expensive, especially for startups like us, compared to Rootly, the other platform I mentioned. PagerDuty Operations Cloud costs fifty dollars per user for admins or other roles, whereas Rootly is not priced per user; it is based on a pay-as-you-go model. I think that is one of the drawbacks of PagerDuty Operations Cloud regarding billing. Apart from that, the plans and the features for incident creation and call triggering are quite good.
Which other solutions did I evaluate?
Currently, there is one tool called Rootly. I think they are new to the industry and we are also using that for one of our other clients. It is somewhat similar, but I think they have the potential to compete with PagerDuty Operations Cloud in the future as well.
Real‑time incident alerts have improved uptime and keep critical services continuously monitored
What is our primary use case?
I have been working with PagerDuty Operations Cloud for more than two years, though I have not used it within the past three months. I worked at Equifax, a credit monitoring company where you can see your credit score and credit-related products and services. I am in the software engineering department building the application, and for the past two years I supported a few applications, integrating PagerDuty Operations Cloud with DataDog. We set alerts on our software systems so that if any software we serve on the cloud goes down, we receive an alert in DataDog, and we route that alert to PagerDuty Operations Cloud by providing our PagerDuty Operations Cloud credentials in DataDog. PagerDuty Operations Cloud then delivers the alert through multiple channels such as SMS, phone call, and message. In short, if any system we host in our cloud goes down, or any alert we have set up in DataDog triggers an event, that event is sent to PagerDuty Operations Cloud, which notifies us through those channels. That is how I am using PagerDuty Operations Cloud in my company and my work.
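The reviewer relied on DataDog's built-in integration, so the exact configuration is not shown. As a rough illustration of what a monitoring tool sends underneath, here is a minimal sketch of a trigger event for PagerDuty's Events API v2. The routing key, summary, and source values are placeholders, and `build_trigger_event` and `send_event` are hypothetical helper names, not part of any PagerDuty library.

```python
import json
import urllib.request

# Public endpoint of the PagerDuty Events API v2.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body for a 'trigger' event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,       # short description shown to responders
            "source": source,         # the system that raised the alert
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are treated as the same open alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `send_event(build_trigger_event("YOUR_ROUTING_KEY", "Checkout service is down", "datadog-monitor"))` would submit the event; which channels (SMS, phone call, and so on) are then used depends on each responder's notification settings in PagerDuty, not on the event itself.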
What is most valuable?
What I appreciate most about PagerDuty Operations Cloud is its real-time alert capability. If a critical system in the production environment goes down and we are notified only by SMS or email, we might miss it. Anyone who is on call or providing support twenty-four hours a day, seven days a week, such as the support team, needs PagerDuty Operations Cloud. Without it, it is difficult to say when an alert was triggered. PagerDuty Operations Cloud keeps the full history, including who acknowledged a particular page, the timeline of when it was triggered, and on which channel it was triggered. This makes it easy to prepare a report for the past month on how many alerts we received for particular services, or to break the numbers down by team or by alert name. It is close to a perfect application, but I can suggest a few improvements to enhance the user experience.
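The kind of monthly breakdown described above can be sketched in a few lines once the alert history has been exported. The records and field names below (`service`, `team`, `alert_name`) are made-up examples, not PagerDuty's export format.

```python
from collections import Counter

# Hypothetical exported alert history; the field names are assumptions.
alert_history = [
    {"service": "checkout", "team": "payments", "alert_name": "HTTP 5xx spike"},
    {"service": "checkout", "team": "payments", "alert_name": "High latency"},
    {"service": "login", "team": "identity", "alert_name": "HTTP 5xx spike"},
]


def count_by(alerts, field):
    """Count alerts grouped by one field, e.g. service, team, or alert name."""
    return Counter(alert[field] for alert in alerts)


per_service = count_by(alert_history, "service")
per_alert_name = count_by(alert_history, "alert_name")
```

With the sample data, `per_service["checkout"]` is 2 and `per_alert_name["HTTP 5xx spike"]` is 2. The same helper works for `"team"`.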
The main benefit of PagerDuty Operations Cloud is the alerting for our organization. Alerts are critical; we want our system to be one hundred percent available, but no system is one hundred percent reliable. We want to know whenever our system goes down, when response times degrade, or when a certificate for our DNS expires, as these are critical issues that can be handled through PagerDuty Operations Cloud. Even if we set up email notifications, the people sitting in front of the system twenty-four hours a day, seven days a week, may not always be available. If they go for tea, coffee, or lunch, they might miss a critical alert. With a pager, however, you receive a call, which is much more reliable. There may be instances when multiple PagerDuty Operations Cloud events trigger and result in a single call, but that is not the case all the time. Most of the time, we receive alerts through one of the three channels, and organizations configure calls so that someone checks the logs and addresses the problem promptly.
PagerDuty Operations Cloud's alert reduction feature has significantly impacted my organization, preventing approximately ninety to ninety-five percent of critical incidents from occurring. From my understanding, if the system is down, people see the alert and take the necessary resolution steps. If the fix does not involve actual engineering work, such as restarting a service, it can be handled directly from the PagerDuty Operations Cloud alert, allowing the resolution to happen as soon as possible.
What needs improvement?
PagerDuty Operations Cloud is close to perfect, but no application is perfect. When several alerts share the same name and I search for an alert by relevance, the results do not come back as expected. That is one of the things I would like to see improved. The user interface is good, and I cannot think of other improvements I want to see there right now. The integration side also looks good; PagerDuty Operations Cloud can be integrated with multiple platforms and applications such as DataDog. I do wonder why it cannot be integrated directly with a cloud provider such as Google Cloud without a third layer such as DataDog. We should be able to integrate directly with PagerDuty Operations Cloud without any dependency.
What I would like to see included in PagerDuty Operations Cloud is some AI functionality; most users rely on PagerDuty Operations Cloud for alerting on critical issues. For each alert, I would like to see the resolution steps, a root cause analysis, or a runbook that tells us what to do when that alert fires. Most of the alerts we receive are repeats; if a system goes down, it is usually a known error, since no system is one hundred percent reliable. If we get an alert similar to a previous one, I would like to see the root cause. If we receive an alert that the team has handled before, we should be able to see what they did, or preconfigure which runbook to follow for that alert. Integrations with vendors such as Confluence or systems such as Jira, since I am primarily talking about IT, are also functionalities I would like to see included in PagerDuty Operations Cloud.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for more than two years, though recently, within the past three months, I have not been using PagerDuty Operations Cloud.
What do I think about the stability of the solution?
When it comes to stability, I would not say there are performance issues with PagerDuty Operations Cloud. For example, if the same alert ID is generated multiple times, such as if a namespace goes down where multiple services are deployed, all alerts trigger at the same time, and we generally receive only one call for that. While it could be regarded as a performance issue, I think it is understandable given the situation.
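The single call for many simultaneous alerts is consistent with deduplication: in PagerDuty's Events API v2, trigger events that share a `dedup_key` are attached to the same open alert instead of opening a new one each time. The sketch below only illustrates that idea with made-up events; it is not PagerDuty's actual grouping logic, and `group_by_dedup_key` is a hypothetical helper.

```python
def group_by_dedup_key(events):
    """Group events by dedup_key; each group would lead to one notification."""
    groups = {}
    for event in events:
        groups.setdefault(event["dedup_key"], []).append(event)
    return groups


# Hypothetical events: three services in one namespace fail at the same time
# and report under the same key, while an unrelated alert uses its own key.
events = [
    {"dedup_key": "namespace-a-down", "source": "service-1"},
    {"dedup_key": "namespace-a-down", "source": "service-2"},
    {"dedup_key": "namespace-a-down", "source": "service-3"},
    {"dedup_key": "cert-expiry", "source": "gateway"},
]

groups = group_by_dedup_key(events)
calls_placed = len(groups)  # one call per distinct key, not one per event
```

Here four events collapse into two groups, so only two calls would go out, which matches the behavior the reviewer describes for a namespace-wide outage.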
What do I think about the scalability of the solution?
I do not find any limitations or issues regarding scalability with PagerDuty Operations Cloud; it appears to be scalable.
How are customer service and support?
Regarding PagerDuty Operations Cloud tech support, I personally have not escalated any issues, but my team has escalated one issue related to reporting. They ran into a limitation when creating a report from PagerDuty Operations Cloud; for instance, reports generated in the morning only showed data until four a.m. These reports might lack some recent alerts or events, though we can still see them manually. It is just that the generated reports seem to lag behind the live data.
How was the initial setup?
Regarding the initial setup process for PagerDuty Operations Cloud, I do not think there were challenges. I followed a runbook, and the process is straightforward, requiring just primary and secondary contact information, email, and phone numbers. It is a straightforward process, and I do not see any issues with that.
Which other solutions did I evaluate?
Prior to adopting PagerDuty Operations Cloud, I evaluated alternative options, and the alternatives I mentioned earlier do exist, but PagerDuty Operations Cloud is the only application that I have used because it has the capability to trigger phone calls for alerts. For other channels, while I can see multiple tools that trigger events via email notifications, I have not come across other applications that can do phone call alerts as PagerDuty Operations Cloud does.
What other advice do I have?
We do have alerting systems, as I mentioned, using DataDog, which can send alerts only by email and similar channels, but I do not think we are using any other applications for alerting apart from that. We also use Grafana, DataDog, and Chaos Search for the alert system along with PagerDuty Operations Cloud.
I have not used PagerDuty Operations Cloud's autonomous AI agents or generative AI yet. PagerDuty introduced these features, and my organization adopted them only recently. I did not get a chance to use them because I moved to another team and stopped using PagerDuty Operations Cloud before the AI integration. My colleagues mentioned that the integration was good and that they could integrate with multiple teams and applications such as Slack, but I did not have hands-on experience with that.
My advice for organizations considering PagerDuty Operations Cloud is that many organizations already seem to use it. If your system is large and you need to handle incidents, particularly for critical applications that drive revenue, you cannot afford for your system to go down even for five minutes, as that may result in millions of dollars lost. To mitigate this, increasing reliability is essential. No system is entirely reliable, so we have to depend on products such as PagerDuty Operations Cloud to alert our engineers or the support team, to reduce incident counts and their impact, and to support monitoring and system performance analysis. My overall rating for PagerDuty Operations Cloud is eight out of ten.