
PagerDuty Operations Cloud
Alert de-duplication has reduced noise and now improves response time and root cause analysis
What is our primary use case?
I have integrated monitoring tools such as LogicMonitor so that their alerts flow into PagerDuty Operations Cloud, where we acknowledge them and work on the issues. I have created multiple services that send alerts to out-of-hours groups for the on-call engineers. We detect issues faster, and this helps in root cause analysis. Alert noise reduction is a major use case for us because the platform groups duplicate alerts, which is very useful. The mobile application is also excellent.
What is most valuable?
I appreciate the event de-duplication feature in PagerDuty Operations Cloud because my company has many alerts for similar devices or servers, and it groups them together. This helps us see when a particular server's CPU and memory are both spiking, which aids significantly in root cause analysis.
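A minimal sketch of how this kind of grouping can be driven from the sending side, assuming events are sent through the Events API v2, where events that share a `dedup_key` are folded into the same alert. The helper name, the routing key, and the choice to key on the host name alone are illustrative assumptions, not the reviewer's actual setup.

```python
import hashlib


def build_trigger_event(routing_key: str, host: str, check: str, summary: str) -> dict:
    """Build an Events API v2 trigger payload. Nothing is sent here."""
    # Assumption: key on the host only, so CPU and memory alerts for the
    # same server share one dedup_key and collapse into a single alert.
    dedup_key = hashlib.sha256(host.encode("utf-8")).hexdigest()[:32]
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": host,
            "severity": "critical",
            "component": check,
        },
    }


# "HYPOTHETICAL_KEY" and the host names are placeholders.
cpu = build_trigger_event("HYPOTHETICAL_KEY", "web-01", "cpu", "CPU above 95% on web-01")
mem = build_trigger_event("HYPOTHETICAL_KEY", "web-01", "memory", "Memory above 90% on web-01")
other = build_trigger_event("HYPOTHETICAL_KEY", "web-02", "cpu", "CPU above 95% on web-02")
print(cpu["dedup_key"] == mem["dedup_key"])    # same server, same key
print(cpu["dedup_key"] == other["dedup_key"])  # different server, different key
```

Keying on the host trades detail for noise reduction: a second, unrelated problem on the same server would also be folded into the open alert, so a finer key (host plus check group) may suit some teams better.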
Another feature I value is push notifications. We receive calls, SMS messages, and emails for the same alert, so we do not miss any notifications.
My organization has reduced noise by approximately 20% because of the de-duplication feature in PagerDuty Operations Cloud and the report feature. The report feature sends us a weekly report showing how many similar alerts occurred that week, and we work on reducing those alerts. By following this policy for three months, we reduced noise by 20%, which is a huge achievement for us.
PagerDuty Operations Cloud has improved our response time and mean time to resolution in my organization. We have integrated many monitoring tools through PagerDuty Operations Cloud, and the integration feature is excellent. It integrates very well with other monitoring tools via API and through email. I recommend other organizations use this integration feature.
The platform generates weekly reports showing how many alerts we received and the response time for each service and alert. I now pull daily reports via API. Since my company operates from 7:30 AM to 4:30 PM, with on-calls after hours, I need to know how many alerts occur outside business hours. Using a report scheduled through PagerDuty Operations Cloud API, the system sends me the alerts. I then analyze how many alerts came that night and work with the application team to reduce noise and resolve incidents. I value the report feature completely.
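The after-hours count described above can be sketched as follows. This assumes the incidents have already been fetched and that each record carries an ISO 8601 `created_at` timestamp; the sample records and the UTC+1 office offset are made up for illustration, and weekends are not treated specially.

```python
from datetime import datetime, time, timedelta, timezone

BUSINESS_START = time(7, 30)   # 7:30 AM, from the review
BUSINESS_END = time(16, 30)    # 4:30 PM, from the review


def is_after_hours(created_at: str, local_tz: timezone) -> bool:
    """True if the timestamp falls outside business hours in local_tz."""
    ts = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    local_time = ts.astimezone(local_tz).time()
    return not (BUSINESS_START <= local_time < BUSINESS_END)


def count_after_hours(incidents: list[dict], local_tz: timezone) -> int:
    """Count incidents created outside business hours."""
    return sum(1 for inc in incidents if is_after_hours(inc["created_at"], local_tz))


# Hypothetical sample data standing in for an API response.
sample = [
    {"id": "P1", "created_at": "2024-05-01T02:15:00Z"},
    {"id": "P2", "created_at": "2024-05-01T09:00:00Z"},
    {"id": "P3", "created_at": "2024-05-01T16:30:00Z"},
]
office_tz = timezone(timedelta(hours=1))  # assumed office offset
print(count_after_hours(sample, office_tz))  # 2
```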
As a technical engineer, I observe that noise is being reduced and platform stability is increasing. My company is product-based with many products, and they are becoming more stable because we receive alert notifications faster. PagerDuty Operations Cloud is helping my organization tremendously.
What needs improvement?
Overall, I have positive feedback about PagerDuty Operations Cloud, but as an enhancement, I would suggest the reporting feature could be improved. I generate reports based on the service, but it has a limitation where it cannot send all alerts. The limitation is that it can only send 1,000 incidents using the API. If that capacity could be enhanced to send 2,000 alerts in one report, that would be beneficial.
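One possible workaround for a per-report cap, sketched under assumptions: split the reporting period into smaller time windows so that each request stays under the limit, then merge the results. `fetch_window` is a hypothetical callable that would wrap the real incident query for one `since`/`until` range; here it is replaced by a stand-in so the example runs offline.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator


def time_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[tuple[datetime, datetime]]:
    """Yield consecutive (since, until) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def collect_incidents(
    start: datetime,
    end: datetime,
    step: timedelta,
    fetch_window: Callable[[datetime, datetime], list[dict]],
) -> list[dict]:
    """Fetch each window separately and merge, de-duplicating by incident id."""
    seen: dict[str, dict] = {}
    for since, until in time_windows(start, end, step):
        for incident in fetch_window(since, until):
            seen[incident["id"]] = incident
    return list(seen.values())


def fake_fetch(since: datetime, until: datetime) -> list[dict]:
    """Stand-in for a real API call: one made-up incident per window."""
    return [{"id": f"INC-{since:%Y%m%d}"}]


week = collect_incidents(datetime(2024, 1, 1), datetime(2024, 1, 8), timedelta(days=1), fake_fetch)
print(len(week))  # 7
```

The window size has to be chosen so that the busiest window still stays under the cap; a day is only an example.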
Currently, we have not applied any automation through PagerDuty Operations Cloud. It does help indirectly, though: when we receive repeated alerts for a similar issue or for a single server, we know that server's health is poor. We then find the root cause and apply automation directly on the server, not through PagerDuty Operations Cloud. The feature would be useful, but my company does not have the automation feature enabled. It shows as "request trial", so I think we need to try that.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for one year and two months.
What do I think about the stability of the solution?
We do not experience downtime. However, I have observed one issue: we integrated with LogicMonitor, which is a monitoring tool, and alerts come from there to PagerDuty Operations Cloud. When alerts are resolved in LogicMonitor, they should also resolve in PagerDuty Operations Cloud, but sometimes they do not. I think this is an API issue that needs to be addressed. I am not certain whether other customers of PagerDuty Operations Cloud are experiencing the same issue.
How are customer service and support?
I have no complaints about customer service because PagerDuty Operations Cloud is an incident management tool and it performs that function very well.
Which solution did I use previously and why did I switch?
We preferred PagerDuty Operations Cloud over ServiceNow, which we used previously for the same purpose. When an alert came, we would call engineers, and ServiceNow has that feature as well. However, PagerDuty Operations Cloud is much more advanced in terms of notifying users and reducing the time to respond. We are satisfied with it and are not planning to move to other tools currently.
How was the initial setup?
I joined this organization one year and two months ago, and the initial setup was already done. I only enhanced that setup and created new integrations and new event orchestrations. I cannot comment on the initial setup itself, but I am confident it would have been easy.
Which other solutions did I evaluate?
Overall, I can say PagerDuty Operations Cloud is a critical part of our incident management process. Reliability and alert delivery are strong compared to other tools such as ServiceNow. The area where we see the biggest opportunity is AI-driven event correlation, richer alert context, and improved analytics. I do not think any other tool is near that level. We tried ServiceNow because we have it as well, but it does not match PagerDuty Operations Cloud. The overall feedback is positive.
What other advice do I have?
I would recommend that organizations with high alert noise, whether similar to my company or larger companies, should try PagerDuty Operations Cloud. They should use its event and alert de-duplication features and integration with other tools, which are excellent. The calling notification feature is also very good. Overall, it is a strong solution. I rate PagerDuty Operations Cloud as nine out of ten because I do not see any gaps in what I use on a daily basis.
Integration workflows have become seamless and now power AI-driven incident management
What is our primary use case?
I started as a user working in an operations team where we handled the AWS infrastructure deployed for a particular company. Whenever any issue occurred, we received pages through PagerDuty Operations Cloud. I gradually learned the product and started integrating it into different workflows. I began by building Slack bots, and right now I am using PagerDuty Operations Cloud API endpoints to build AI agents as well. I think of myself as an integration engineer who works extensively on integrating different services, one of which is PagerDuty Operations Cloud.
What is most valuable?
I appreciate the overall API toolkit very much. It is one of the simplest API toolkits I have seen that lets me do literally anything via API calls, which I can essentially do via the browser. I do not even need to log into my browser to do anything if I have a CLI or any tools integrated with it.
I have built an AI agent that detects when a page arrives in PagerDuty Operations Cloud. PagerDuty Operations Cloud has webhooks, which is great. When anything comes in, I pull every detail of that page, perform some incident resolution, and act on the infrastructure according to the page I received. I add comments to the pages via the PagerDuty Operations Cloud API and then drive the whole incident life cycle through the APIs. There is also a very good Python library called pdpyras, which I use extensively; it wraps the PagerDuty Operations Cloud APIs as an SDK. I have also developed my own CLI toolkit on top of the PagerDuty Operations Cloud APIs, which is on my GitHub.
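The first step of such an agent might look like the sketch below: take the raw webhook body, react only to newly triggered incidents, and prepare the REST request that adds a comment (a note) to the incident. The payload shape follows the v3 webhook format as I understand it (`event.event_type`, `event.data.id`); the function name, token, email address, and note text are placeholders, and no request is actually sent.

```python
import json


def note_request_from_webhook(raw_body: str, api_token: str, from_email: str):
    """Return (method, url, headers, body) for adding a note, or None."""
    event = json.loads(raw_body).get("event", {})
    if event.get("event_type") != "incident.triggered":
        return None  # ignore acknowledgements, resolutions, and so on
    incident = event.get("data", {})
    incident_id = incident["id"]
    note = f"Automated triage started for: {incident.get('title', 'unknown incident')}"
    headers = {
        "Authorization": f"Token token={api_token}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": from_email,  # notes are attributed to a user
    }
    url = f"https://api.pagerduty.com/incidents/{incident_id}/notes"
    return "POST", url, headers, json.dumps({"note": {"content": note}})


# Hypothetical webhook body for illustration.
sample = json.dumps({
    "event": {
        "id": "EVT1",
        "event_type": "incident.triggered",
        "resource_type": "incident",
        "data": {"id": "PABC123", "type": "incident", "title": "Disk full on db-01"},
    }
})
request = note_request_from_webhook(sample, "HYPOTHETICAL_TOKEN", "oncall@example.com")
print(request[1])
```

Returning the request instead of sending it keeps the decision logic testable; the actual call could then go through any HTTP client or through pdpyras.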
What needs improvement?
I believe you really need to work on your UI. The UI is good and looks good, and it works flawlessly even when pages come frequently, such as ten to fifteen pages every five to ten seconds. However, when you tie your PagerDuty Operations Cloud instance to very large-scale infrastructure where you have millions of instances and get at least five to ten pages per second, the UI starts to hang. The API works flawlessly even then, but anyone who does not know how to use the integrations that PagerDuty Operations Cloud provides has no choice but to log into the UI to check the pages, and then they have to face the lag. This is the area where I feel improvements are needed.
PagerDuty Operations Cloud does require a fair amount of maintenance. Many incidents go into triage and need to be cleaned up regularly. If an incident arrives for which we have not set rules to auto-acknowledge and close, it stays in triage and we have to clean it up manually. I think an auto-detection mechanism could be provided out of the box. Because it is not there currently, we have to develop modules around that. An alternative would be for you to provide webhooks that we could use to build this ourselves.
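The kind of clean-up module mentioned above could start from a filter like this one. It is only a sketch: it assumes incident records with `id`, `status`, and an ISO 8601 `created_at`, treats "stuck in triage" as still being in the `triggered` status, and uses an arbitrary age threshold. What to do with the stale incidents (acknowledge, resolve, reassign) is left to the caller.

```python
from datetime import datetime, timedelta, timezone


def stale_incident_ids(
    incidents: list[dict],
    now: datetime,
    max_age: timedelta,
    statuses: tuple[str, ...] = ("triggered",),
) -> list[str]:
    """Return ids of incidents in the given statuses that are older than max_age."""
    stale = []
    for inc in incidents:
        if inc["status"] not in statuses:
            continue
        created = datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00"))
        if now - created > max_age:
            stale.append(inc["id"])
    return stale


# Hypothetical records for illustration.
incidents = [
    {"id": "P1", "status": "triggered", "created_at": "2024-05-01T00:00:00Z"},
    {"id": "P2", "status": "resolved", "created_at": "2024-05-01T00:00:00Z"},
    {"id": "P3", "status": "triggered", "created_at": "2024-05-02T11:00:00Z"},
]
now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
print(stale_incident_ids(incidents, now, timedelta(hours=24)))  # ['P1']
```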
The only lag I have experienced is the UI lag that I have already described.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for more than five years now.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud is scalable. I have not faced any issues with scalability. It is pretty good when it comes to scalability.
How are customer service and support?
I have contacted support multiple times.
There are different ticket severity levels we can raise. I have not called the support team on the phone, but I have emailed them and raised tickets. There was an instance where I had to integrate a very old AIX server framework with PagerDuty Operations Cloud, for which no modules were available. I am not blaming them for this; I am blaming the company that is still using AIX servers. However, they did not have the module, so I had to raise a ticket at around Severity 3. The response was not super fast, which is expected. However, as soon as it escalated to Severity 2, I received an immediate email from the PagerDuty Operations Cloud team. I think the support is fine. I have faced inconvenience only once: for reasons I do not know, I did not receive any response for two days. Thankfully it was not a Severity 1 incident for us, but two days was unacceptable at that point, so we started looking at other products, including xMatters. It happened only once, but at the company I work with, even one such instance can cause a lot of friction and make people start looking at other options.
How was the initial setup?
Integration is very easy. I have completed entire integrations, deployments, and testing within six hours. It is just so easy.
What about the implementation team?
One person can do everything end to end on their own. I have done it multiple times, and I have seen other people doing it multiple times as well. Everything is very seamless. I should also appreciate the official documentation that you have. Usually official documentation is not that good, but yours feels like someone has taken time to write those documents. The commands which you have written for back-end integration are straightforward. I literally have to just copy and paste after setting environment variables. Everything is very easy.
Which other solutions did I evaluate?
I use xMatters alongside PagerDuty Operations Cloud. Speaking from an integration engineer's perspective, I have not integrated xMatters heavily, but I have been using it more lately. The major difference I observed is workflow management, which is better in xMatters than in PagerDuty Operations Cloud. Let me explain what I mean by workflow management. If you have a company with ten teams working on a particular product, every member of those teams may or may not receive a page. Every member should have an orchestrator where they can define custom rules, such as that a service should page X team directly, or that a page coming from ABC issue should go directly to Y team, without my having to manually enter the team name or details about where to page. This capability is lacking in PagerDuty Operations Cloud, while in xMatters it is flawlessly integrated and you can add custom rules and custom rule sets. In PagerDuty Operations Cloud, we have to create separate pages for that functionality. However, when it comes to integration, the xMatters API toolkit is confusing and disorganized. It lacks the tree structure that I appreciate in PagerDuty Operations Cloud. Integration-wise, PagerDuty Operations Cloud is flawless. I love PagerDuty Operations Cloud from an integration perspective, and my life becomes difficult when someone wants me to integrate xMatters.
What other advice do I have?
Regarding pricing, I do not remember the current prices, but I used the first tier about two years ago at a startup I was working at. That startup did not work out, but I integrated PagerDuty Operations Cloud with a lot of things there.
For large-scale enterprises, the pricing is good. I will not even say it is fine; it is good. However, for startups and small businesses, which are still at a very nascent stage, the pricing is a little on the higher side. There is no custom module that I can simply add to my cart to get custom pricing. It is just one bucket. For small-scale operations, I think the pricing is a bit high.
I do not use PagerDuty Operations Cloud's AI assistant or its default agents, so I do not know how good or bad they are. Instead, I integrate PagerDuty Operations Cloud modules, APIs, and SDKs on the back end to develop my own AI agents, rather than using anything that comes out of the box.
My overall review rating for PagerDuty Operations Cloud is eight out of ten.
Centralized alerting has streamlined on-call workflows and reduced incident response times
What is our primary use case?
PagerDuty Operations Cloud is part of our daily operational workflow. It sits between monitoring tools and response teams, ensuring alerts reach the right people without delay. We use it for on-call scheduling, incident escalations, and coordinating responses across teams. Having everything centralized has reduced alert fatigue and helped us respond to issues more consistently, especially during off-hours and high-priority incidents.
What is most valuable?
PagerDuty Operations Cloud offers intelligent alerting, on-call scheduling, automated escalations, and incident management as its best features. The platform makes it easy to ensure alerts reach the right person, and escalation policies prevent critical issues from being missed. We also rely heavily on its integration with monitoring and collaboration tools and its real-time visibility into operations. Together, these features help our team respond faster, stay organized during an incident, and reduce service disruptions for our customers.
What needs improvement?
While PagerDuty Operations Cloud is strong overall, there are a few areas for improvement. The initial setup and configuration can be complex, especially for teams managing multiple services, escalation policies, and integrations. Some reporting and analytics features could offer more customization without requiring additional configurations. The mobile app works well for alerting, but managing more advanced settings is generally easier from the web interface. It would also be helpful to have more out-of-the-box workflow templates and automation recommendations to simplify onboarding for new teams.
To make the daily workflow smoother, simplifying the user interface for certain administrative tasks would be a significant improvement. Sometimes, navigating the settings to adjust on-call schedules or escalation policies can take a few extra steps, particularly for large environments. More customizable dashboards and easier reporting for non-technical stakeholders, along with additional guided recommendations for alert tuning, could help teams get even more value from the platform. These are relatively minor points, but addressing them would make an already great tool even more user-friendly.
I did not give PagerDuty Operations Cloud a perfect rating because there is still room for improvement in areas such as reporting flexibility, dashboard customization, and simplifying certain administrative tasks. Overall, it is a mature and dependable platform that positively impacts our work.
Intelligent alerts have protected revenue and now drive faster incident triage with AI guidance
What is our primary use case?
We work directly with merchants and need to trigger immediate alerts whenever there are 5xx errors, business errors such as 4xx issues, or payment failures. We have configured every alert in Datadog and some other monitoring tools that are integrated with PagerDuty. We receive alerts immediately, and they trigger calls and Slack notifications. We integrate everything with PagerDuty and get notified instantly, after which we start our triage process.
One use case I can mention is when we have an auth rate dip. Whenever there is an auth rate dip, we run into revenue losses with the merchants or partners that PayPal currently works with. Since everything is integrated, PagerDuty Operations Cloud catches when there is an auth rate dip for particular merchants and immediately triggers a notification for us. We then immediately dive into what the problem is and figure out how to fix the issue with the help of engineering teams.
What is most valuable?
PagerDuty Operations Cloud is one of the best tools we have seen because it is already integrated with AI. We use it as a barrier tool, meaning it is the top tool that we consider and we get notified when there is an issue.
The best features include integrating with any tool and analyzing all previous alerts that have been stored. When an alert occurred on a particular day, we can immediately be notified on Slack with historical data and, since it is integrated with AI, we receive suggestions on how it can be resolved, how it was resolved earlier, and who resolved it. These are the very best features we have seen on PagerDuty Operations Cloud.
Since we have historical data showing when an alert has triggered on a particular day, we can turn it into a problem incident and work with the relevant teams to get it fixed completely so it does not reoccur. We are recording these kinds of repetitive issues using that feature.
It is very helpful that we can integrate with numerous monitoring tools such as Datadog, Splunk, and Kibana. Since we have integrated many other tools, I feel this is one of the features that PagerDuty Operations Cloud offers that makes it great.
What needs improvement?
Since PagerDuty Operations Cloud is already equipped with the latest technologies and is already connected with AI, I do not feel that anything more needs to be added; even content summarization is already available. I do not have a concrete answer right now, because the product already offers a large number of features and is in a highly improved state.
While PagerDuty has comment functionality, a chat option would be a potential addition.
For how long have I used the solution?
I have been using PagerDuty for the last nine years, but PagerDuty Operations Cloud for over one and a half years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is highly accurate and highly reliable in terms of alert triggering. We see almost no false alarms, and the few we do get stem from our own internal signals. We do not have any complaints about PagerDuty Operations Cloud.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud definitely increases efficiency for us. Since we do not have much manual work with workflows and everything is automated, it definitely helps.
Which solution did I use previously and why did I switch?
We are using only PagerDuty and do not have any other tool in use. There is no other tool that can match PagerDuty Operations Cloud.
What was our ROI?
We definitely see an ROI. Earlier, this work required multiple employees. Since we are now using AI, we have reduced our staffing needs and save a lot of time and money as well.
Which other solutions did I evaluate?
There is no other tool that can match PagerDuty Operations Cloud.
What other advice do I have?
Earlier, PagerDuty Operations Cloud was just notifying incidents, but now it is showing historical data and we can see how it was resolved earlier and quickly get notes from that to resolve issues with the historical data and suggestions.
Earlier, when there was an auth rate dip or different signals that we received through Datadog or different platforms, we used to have some false alarms. Now, everything we are using is AI-based with agents that were configured with those signals. We have very accurately configured the AI using factors such as holiday seasons that will have high traffic, and everything was configured with historical data. We are getting very solid results and signals.
Since PagerDuty Operations Cloud has all the data and provides forward-looking resolution steps and information about which team was involved, PagerDuty AI helps us tremendously.
We definitely do not have any revenue loss since we are getting accurate signals and alerts and have a solution for all configured alerts.
Since it has all advanced features integrated with AI, I am really impressed with the ability to integrate with numerous monitoring tools very easily and the ease of onboarding any member to PagerDuty Operations Cloud. Setting up the alerts and everything is very easy with a number of monitoring tools. That is why I rated this product a nine out of ten. There is no other tool that can match PagerDuty Operations Cloud right now.
We have a number of layers in terms of governance and security since we are a payment gateway. PagerDuty Operations Cloud has its own governance and security at a great level, so we do not need to think about any security concerns from PagerDuty Operations Cloud governance.
Since it already has AI features, I am going to recommend others to use PagerDuty Operations Cloud. I rate this solution a nine out of ten.
Unified alerting has improved incident response and enabled proactive, multi‑channel notifications
What is our primary use case?
I primarily use PagerDuty Operations Cloud for alert management and incident call rotations. In my earlier firm, we managed rotation shifts across three time zones: EMEA, APAC, and New York time. All rotation and shift management was handled through PagerDuty Operations Cloud schedules. Application monitoring was also updated through PagerDuty Operations Cloud. According to the schedule, we updated people's contact information so that in case of any issues, the contact would be transferred to the respective shift member. We also managed escalations with five layers of escalation. If a first team member missed an alert, it would go to the second team member after 10 minutes, then to the next person after five minutes, continuing according to the priority of the service.
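The escalation timing described above can be worked through with a small sketch. It assumes five layers, a 10-minute wait before the first hand-off, and 5-minute waits for every later hand-off; the review only states the first two delays, so the remaining values are an assumption.

```python
from itertools import accumulate


def escalation_level(minutes_unacknowledged: float, delays: list[int]) -> int:
    """Return the 1-based escalation layer reached after the given wait.

    delays[i] is how many minutes layer i+1 has to respond before the
    alert is handed to the next layer.
    """
    thresholds = accumulate(delays)  # e.g. 10, 15, 20, 25
    level = 1
    for threshold in thresholds:
        if minutes_unacknowledged >= threshold:
            level += 1
    return level


# Five layers means four hand-offs; values after the first two are assumed.
delays = [10, 5, 5, 5]
for minutes in (0, 10, 14, 15, 60):
    print(minutes, escalation_level(minutes, delays))
```

With these delays, an alert that nobody acknowledges reaches the fifth layer 25 minutes after it fires.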
What is most valuable?
The most valuable features I found were the integration capabilities and the notification system. We used the open-source tool Alertmanager, which fires alerts based on health metrics from Prometheus and Splunk. PagerDuty Operations Cloud allowed us to integrate alerting seamlessly and notify users effectively, which helped the business significantly. Early detection of issues leads to better service provision. PagerDuty Operations Cloud provides multiple notification channels including SMS, phone calls, and email, which I found to be the best part of the platform.
Regarding the autonomous AI agents, I have not explored them because the AI trend started recently and I have been out of touch for the last seven or eight months. However, I have read about how AI integrates with the scheduling part. Previously, we had to manually update schedules every week, but with AI integration, we can write a prompt and build MCPs. Some firms I read about integrated an MCP they built in-house, and with the MCP, they can provide an Excel sheet or image, and PagerDuty Operations Cloud API can update everything without needing to manually access the platform.
We implemented automation through PagerDuty Operations Cloud for incident response. Previously, we had to manually update service level details, SLAs, notification mechanisms, and API keys. Now we can submit an Excel sheet or CSV file and make an API call using Python, which updates everything automatically. PagerDuty Operations Cloud also helps with analytics by showing how many alerts were triggered, how many were resolved, and which person handled which alert. This visualization helps us demonstrate to clients that we managed a certain number of alerts and reduced the alert count.
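A minimal sketch of the CSV-driven update described above, assuming each row describes one service. The column names (`service_id`, `name`, `escalation_policy_id`, `ack_timeout_seconds`) are invented for illustration, and the function only builds the request paths and bodies; sending them with an authenticated client is left out.

```python
import csv
import io


def service_updates_from_csv(csv_text: str) -> list[tuple[str, dict]]:
    """Turn CSV rows into (path, body) pairs for service update requests."""
    updates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        body = {
            "service": {
                "type": "service",
                "name": row["name"].strip(),
                "escalation_policy": {
                    "id": row["escalation_policy_id"].strip(),
                    "type": "escalation_policy_reference",
                },
                # Seconds before an acknowledged incident re-triggers.
                "acknowledgement_timeout": int(row["ack_timeout_seconds"]),
            }
        }
        updates.append((f"/services/{row['service_id'].strip()}", body))
    return updates


# Hypothetical sheet contents.
sheet = """service_id,name,escalation_policy_id,ack_timeout_seconds
PSVC001,Payments API,PEP0001,1800
PSVC002,Checkout Web,PEP0002,600
"""
for path, body in service_updates_from_csv(sheet):
    print(path, body["service"]["name"])
```

Building the payloads separately from sending them makes it easy to review or dry-run a sheet before anything is changed.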
What needs improvement?
PagerDuty Operations Cloud has been excellent so far. The one area where it could do more is generative AI, which has advanced over the last six months. Some organizations are using their own MCP engines, but if PagerDuty Operations Cloud provided in-house MCP tools integrated with GenAI, it would be better for end users. Having this provided by PagerDuty Operations Cloud itself, while still integrating with in-house tools, would make a difference. I am not certain whether this has been explored in the last six months, but it is an area PagerDuty Operations Cloud could improve.
For how long have I used the solution?
I used PagerDuty Operations Cloud for approximately three and a half to four years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud provided notifications early when issues occurred. We used PagerDuty Operations Cloud's status page as our first source of truth to check for existing or ongoing issues. If no issues were listed there, we reached out to a dedicated account manager who would connect us with the concerned team. We rarely encountered any operational issues with PagerDuty Operations Cloud because it was always working. We experienced only one or two latency issues, which were due to underlying cloud infrastructure issues rather than PagerDuty Operations Cloud itself.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud maintained good scalability levels. We started with a beta phase with approximately 50 to 60 members, then moved to a development stage where we increased to 150 people. The platform remained stable as we expanded. We ran some instances on-premises, where high security was required, and others in the cloud for client-facing deployments. We experienced no issues with scalability in either on-premises or cloud deployments, and integration was seamless in both cases.
How are customer service and support?
When making on-premises installations, we connected with PagerDuty Operations Cloud's technical support. They guided us on setup and what to take care of during installation. We had two or three calls with them, and they were very helpful throughout the process.
Which solution did I use previously and why did I switch?
Before PagerDuty Operations Cloud, we used Alertmanager, which triggered only email notifications and not calls or SMS. PagerDuty Operations Cloud introduced the calling mechanism and SMS capability, which was innovative compared to what we had seen with open-source tools.
How was the initial setup?
The initial setup process involved starting with PagerDuty Operations Cloud's cloud offering. We purchased a plan and set up our account. During actual deployment, we purchased a license with our own DNS, meaning instead of using pagerduty.com, we mapped our own subdomain to our environment. We then created licenses for individual users, starting with approximately 150 members from our technical support team and L1 engineers. We gradually increased our user count rather than immediately granting licenses to thousands of people because they would have received spam calls. We started with 50 to 60 members for a trial to understand how the system should behave and how we could optimize it.
What about the implementation team?
We handled the initial setup and installation of PagerDuty Operations Cloud ourselves, although we received support. When we signed up, the PagerDuty Operations Cloud team called to offer assistance. They set up a demo for our team, but we proceeded with the installation ourselves since we had prior knowledge before starting.
Which other solutions did I evaluate?
We evaluated other options before choosing PagerDuty Operations Cloud. We attempted to build our own solution using an existing open-source tool, but latency issues made it inefficient in both time and cost. Since a stable product like PagerDuty Operations Cloud already existed, investing two to three years in building our own solution did not make sense. We also explored building a Python solution using Alertmanager before deciding on PagerDuty Operations Cloud.
What other advice do I have?
I have not explored the generative AI capabilities of PagerDuty Operations Cloud. PagerDuty Operations Cloud delivers very high performance when notifying users, especially in high-frequency trading environments where even a second of delay can result in billion or trillion dollar transaction losses. The notification service and seamless integration across different team layers provide significant value. Although open-source tools are available, they are not as effective as PagerDuty Operations Cloud.
Regarding alert fatigue and incident costs, when onboarding new clients in my previous project, I demonstrated our capabilities using incident management charts to showcase our skills. We showed clients how many alerts triggered daily, weekly, or monthly before PagerDuty Operations Cloud, and how we reduced them to monthly or bi-weekly intervals based on specific conditions. This data helped us acquire deals.
PagerDuty Operations Cloud improved my team's ability to focus on core tasks rather than routine issues. Initially, my team was exploring multiple notification and monitoring options and building their own tools. With PagerDuty Operations Cloud as an organization-level mandate, instead of managing different tools across ten teams, we now use one standard tool. This has allowed the team to focus on other important tasks since this major challenge has been resolved.
Regarding preventing costly incidents, I would emphasize business trust more than direct cost savings. We earned significant client trust by detecting issues early and informing clients promptly, allowing them to manage their side of any issues. On multiple occasions, we caught issues before business hours and clients were appreciative of our proactive approach.
I am not aware of the specific pricing and licensing details of PagerDuty Operations Cloud as that is managed by our management. From what I have heard, the business plan is not very expensive. I have not explored individual pricing since our organization was large with dedicated departments handling such decisions. My review rating for PagerDuty Operations Cloud is eight out of ten.
On-call alerts have ensured critical issues are addressed faster and teams focus on core work
What is our primary use case?
I usually use PagerDuty Operations Cloud for the notification of high-priority incidents within the infrastructure.
I also use it for escalating to the on-call members, scheduling the priority of incidents or issues within the infrastructure, and creating scheduled rotations for team members.
What is most valuable?
The most valuable feature of PagerDuty Operations Cloud is that even when my device is on silent, it still rings and lets me know that something has happened in the organization.
On-call schedules for team members are very helpful for finding out who is currently on call, either to get help with incidents or to route tickets to them. At the same time, it sends me push notifications, calls my mobile number, and sends emails to my email address, so the multi-channel notification service of PagerDuty Operations Cloud is excellent.
From a user perspective, the most valuable part of PagerDuty Operations Cloud is the notification feature that keeps contacting me until I acknowledge the alert. Being notified of high and critical incidents is extremely valuable for the organization, because something is failing and I need to fix it as a priority so that we do not lose business.
PagerDuty Operations Cloud improved my team's ability to focus on core tasks rather than routine issues, because the notification feature monitors for high and critical issues and sends them directly to team members.
What needs improvement?
I am not using PagerDuty Operations Cloud's autonomous AI agents now because we have not gotten into that yet.
I have not used generative AI yet.
The integration with ServiceNow is very good: when I add notes in PagerDuty Operations Cloud, it sends them by email and also posts them on the ServiceNow tickets.
PagerDuty Operations Cloud also tells me how many incidents with the same errors I have encountered, because it runs an analysis engine on incoming tickets.
There is still alert fatigue, and root cause analysis could be more granular. Reducing false positive alerts and showing the real number of issues would be beneficial.
I believe there is always room for improvement, and since technology is changing day by day, I will rate PagerDuty Operations Cloud as a nine.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for two-plus years, and I am still actively using it.
What do I think about the stability of the solution?
PagerDuty Operations Cloud is stable, but I did have one issue where services were down for about ten to twelve minutes. I consider it highly stable and reliable overall.
What do I think about the scalability of the solution?
The scalability of PagerDuty Operations Cloud is good and I have not encountered any problems with it.
How are customer service and support?
I did not have to reach customer service because the product has been stable and reliable, and I can say it is really good.
Which solution did I use previously and why did I switch?
I found PagerDuty Operations Cloud to be more stable than other solutions, so I directly went with PagerDuty Operations Cloud.
How was the initial setup?
Another team integrated PagerDuty Operations Cloud into the system and set it up.
We did refer to the PagerDuty Operations Cloud documents for setting up teams and creating schedules.
What about the implementation team?
Another team integrated PagerDuty Operations Cloud into the system and set it up.
What was our ROI?
PagerDuty Operations Cloud improved my team's ability to focus on core tasks rather than routine issues, because the notification feature monitors for high and critical issues and sends them directly to team members.
Regarding cost saving, PagerDuty Operations Cloud provides the feature but is not really reducing the cost of other operations.
What's my experience with pricing, setup cost, and licensing?
I do not usually focus on pricing for PagerDuty Operations Cloud at the moment, but for smaller teams, I believe it is costlier, while for multi-million dollar companies, it is still affordable. For smaller teams who want to improve their operations, the cost is an issue.
Which other solutions did I evaluate?
I found PagerDuty Operations Cloud to be more stable than other solutions, so I directly went with PagerDuty Operations Cloud.
What other advice do I have?
I am satisfied with PagerDuty Operations Cloud and really appreciate the product, so I do not have any questions at the moment, but I do have interest in whether PagerDuty Operations Cloud has implemented agents to help with any issues that happen. I rate this product a nine overall.
On-call automation has reduced critical incident impact and ensures faster production responses
What is our primary use case?
As part of the cloud operations team, I was a user who set up the alerts. Important incidents or anomalies that needed immediate attention were routed through the APM tools that we integrated with PagerDuty Operations Cloud. Our cloud operations team supported the platform in rotational shifts. My role involved assigning the on-call person for each shift according to the shift roster, so that whenever an incident triggered, that person would get the call. The primary use was supporting production operations and cloud activities.
Our environment spans AWS infrastructure, Linux servers, Kubernetes clusters, and customer-facing applications. PagerDuty Operations Cloud was mainly used for incident management and alerting. We integrated it with AppDynamics, Instana, and CloudWatch, which monitored the platform and its patterns, and PagerDuty Operations Cloud then generated critical alerts so that the support team working the current shift was notified immediately. This platform really helped us manage production incidents beyond service outages, mostly high CPU utilization where we set alerts, application failures, pod issues in Kubernetes, and infrastructure-related alerts. We configured all kinds of alerts and ensured they were routed to the correct on-call person, which helped us reduce response time in critical situations.
What is most valuable?
One of the best features I would mention about PagerDuty Operations Cloud is its on-call rotational scheduling support and escalation management. If an engineer did not acknowledge the alert within a defined time frame, the incident was automatically escalated to the next person, support team, or manager of that specific team. Another useful feature was its integration capability. We were able to integrate PagerDuty Operations Cloud with monitoring and observability tools so that alerts were generated automatically, within moments, whenever issues were detected in the environment. The mobile application was also very helpful because engineers could receive calls and notifications, acknowledge incidents, and track updates even when they were away from their laptops.
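The escalation behavior described above can be illustrated with a small sketch. This is not PagerDuty's implementation or API; the level names, the timeouts, and the `current_target` helper are all hypothetical and only show the idea of handing an unacknowledged incident to the next level once a timeout elapses.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EscalationLevel:
    target: str           # hypothetical responder, team, or manager name
    timeout_minutes: int  # how long this level has to acknowledge


def current_target(levels: Sequence[EscalationLevel],
                   minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be notified after the given unacknowledged time.

    Returns None once every level's timeout has elapsed.
    """
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.target
    return None


# Assumed example policy; real timeouts are configured per team.
policy = [
    EscalationLevel("primary on-call engineer", 15),
    EscalationLevel("secondary on-call engineer", 15),
    EscalationLevel("team manager", 30),
]
```

With this assumed policy, an incident that has gone 20 minutes without acknowledgment would be with the secondary on-call engineer, and after 30 minutes it would reach the team manager.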
I also valued the centralized incident management dashboard that provides visibility into active incidents, response status, escalation history, and overall operational health. I used to get all the data accumulated there through the dashboard.
PagerDuty Operations Cloud helps us manage production incidents beyond service outages, mostly high CPU utilization where we set alerts, application failures, pod issues in Kubernetes, and infrastructure-related alerts.
What needs improvement?
My experience with PagerDuty Operations Cloud has been positive overall. One area where I believe improvement can be made is reporting and dashboard customization to make it more user-friendly. The operations team often requires different views compared to the management team. Having more flexibility in generating custom reports would be helpful. Another improvement could be providing more advanced AI-driven collaboration capabilities to reduce unnecessary noise alerts and help the team focus on the most critical issues. Apart from these areas, the platform is very reliable and effective for managing production incidents and on-call operations.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for almost five to six years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud has been stable and performing well wherever our incident management or alerting was configured for production support. Timely notifications and incident responses were critical. PagerDuty Operations Cloud delivers alerts immediately through multiple channels which we configured, including mobile on-call notifications, email, SMS, and phone calls. Since PagerDuty Operations Cloud was integrated with our monitoring and observability tools, it helped ensure that critical incidents were captured and routed to the appropriate on-call team. During my usage, I did not encounter any significant outages or stability issues that impacted our operations due to PagerDuty Operations Cloud.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud is highly scalable and works well with both small and large environments. The project I worked on was integrated with multiple application servers and cloud resources for monitoring. PagerDuty Operations Cloud handles all the alerts from different resources and routes them to the appropriate teams. As the infrastructure grows, new services, escalation policies, schedules, and teams can be added easily without requiring major changes to our existing setup. This makes it suitable for organizations that manage large cloud infrastructure and support multiple teams.
Which solution did I use previously and why did I switch?
When I joined this project, PagerDuty Operations Cloud had already been implemented, and the SOPs and testing were already in progress. By the time I was fully onboarded a few days later, many of the alerts had been configured in PagerDuty Operations Cloud. I did not get the chance to work with other tools besides PagerDuty Operations Cloud.
How was the initial setup?
During the initial setup of PagerDuty Operations Cloud, when I joined the project, I got a Jira ticket listing a few of the servers where I needed to install PagerDuty agents so it could trigger any alerts or integrate with the server. I was mostly involved in the configuration part.
The setup was straightforward. PagerDuty Operations Cloud also helped us in this process. It was not integrated directly on the individual servers; instead, we integrated our monitoring and observability tools with PagerDuty Operations Cloud. The servers and applications were monitored through application monitoring tools such as Instana, Zabbix, and Splunk. Whenever critical alerts were generated, they were automatically forwarded to PagerDuty Operations Cloud through the integrations we had configured with each application. PagerDuty Operations Cloud would notify the on-call engineers and follow the escalation policies if the alerts were not acknowledged within a specific time. Our flow was that we had EC2 instances and CloudWatch alarms, and if any alarm triggered, it was sent through AWS Simple Notification Service (SNS) to PagerDuty Operations Cloud and then to the on-call engineer.
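As a rough illustration of the CloudWatch to SNS to PagerDuty flow, the sketch below translates a CloudWatch alarm notification (the JSON that SNS delivers) into an event body in the shape used by the PagerDuty Events API v2. This is an assumption for illustration only: the review does not say how the integration was wired, the function name and routing key are hypothetical, and the sketch only builds the event without sending it.

```python
import json
from typing import Optional


def cloudwatch_to_pagerduty_event(sns_message: str,
                                  routing_key: str) -> Optional[dict]:
    """Build an Events API v2 style body from a CloudWatch alarm message.

    ALARM becomes a "trigger" event and OK becomes a "resolve" event.
    Any other state (for example INSUFFICIENT_DATA) is ignored here,
    which is a design choice of this sketch.
    """
    alarm = json.loads(sns_message)
    state = alarm["NewStateValue"]
    if state not in ("ALARM", "OK"):
        return None

    event = {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger" if state == "ALARM" else "resolve",
        # Reusing the alarm name lets repeated alarms update one incident.
        "dedup_key": alarm["AlarmName"],
    }
    if state == "ALARM":
        summary = f'{alarm["AlarmName"]}: {alarm["NewStateReason"]}'
        event["payload"] = {
            "summary": summary[:1024],  # summary length is capped
            "source": alarm.get("Region", "unknown"),
            "severity": "critical",  # assumed; could depend on the alarm
        }
    return event
```

Using the alarm name as the `dedup_key` means the later OK notification resolves the same incident that the ALARM notification opened.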
What about the implementation team?
We followed the documentation provided by PagerDuty Operations Cloud for the configuration part.
The documentation is thorough, with proper details on how to configure the product for each application monitoring tool integration. It specifies the steps that need to be followed. For server integrations, it states which type of server is covered, whether Windows or Linux, and provides the corresponding documents. The documentation is comprehensive and easy to understand, to the point that even a layperson could do the configuration by following it.
What other advice do I have?
We are mostly not focused on utilizing PagerDuty's autonomous AI agents because we work on cloud infrastructure where we do the deployments. We have not implemented AI in our cloud to that extent. Going forward, if our infrastructure becomes AI-based, we will definitely explore where PagerDuty Operations Cloud can help.
As of now, we do not use the generative AI capabilities of PagerDuty Operations Cloud. Our infrastructure is huge, and there is a dedicated developer team working on AI-related things. They are still running two POCs, which are being evaluated. Only if the results look good can we roll this out into production, because my application is customer-facing and we do not want anything to go wrong, such as an alert triggering unnecessarily or an AI-handled alert failing to notify us. That would ultimately cause us to miss our SLAs and SLOs, and all the escalation matrices would come into the picture. That is why we are still in POCs, as it is critical.
That part is taken care of by a different team or mostly the clients themselves. My main role is to keep the environment always up and running, and all alerts should be properly centralized and customized accordingly.
PagerDuty Operations Cloud is basically where we receive alerts, and it integrates with Slack and with our on-call rotational shifts so that alerts reach engineers' cell phones. Prior to this, we relied only on application monitoring tools, emails, and Slack notifications. If the on-call person is not at their desk when an alert triggers, and no one acknowledges it or takes the necessary action, there will ultimately be customer impact. That is why we implemented PagerDuty Operations Cloud. Even if the on-call person is not near their laptop, they will get the call and can immediately acknowledge it and report to the team that we have received a P1 call for a specific environment or that the alert concerns a production issue. Another team member will then take action immediately, so no alert is missed.
I did not encounter any issues that required contacting support for PagerDuty Operations Cloud. This review represents an overall rating of 9 out of 10.
Automation has improved incident workflows and response times but still offers unexplored features
What is our primary use case?
My main use case for PagerDuty Operations Cloud comes from working in the AIOps team, which is an operations team. We have a monitoring tool called Checkmk, and we have integrated it with PagerDuty for incident management. We monitor many servers across different teams, including the Linux team, network team, Windows team, and database team. All of these servers are monitored in Checkmk, which tracks live CPU, memory, and file system usage. When certain thresholds are reached, Checkmk generates events, which we send to the PagerDuty Operations Cloud console. There, we set conditions so that if an event is critical or a warning, it is converted into an incident. We then route the incidents to the respective teams, who handle the details.
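The condition-and-routing step described above might look like the following sketch. It is not how PagerDuty evaluates rules internally; the event fields (`severity`, `host_group`), the team mapping, and the fallback to the AIOps team are assumptions made for illustration.

```python
from typing import Optional

# Only these severities are converted into incidents, as described above.
INCIDENT_SEVERITIES = {"critical", "warning"}

# Hypothetical mapping from a host group on the event to the owning team.
TEAM_BY_HOST_GROUP = {
    "linux": "Linux team",
    "network": "Network team",
    "windows": "Windows team",
    "database": "Database team",
}


def route_event(event: dict) -> Optional[str]:
    """Return the team that should receive an incident, or None.

    None means the event does not meet the conditions and stays an event.
    """
    severity = str(event.get("severity", "")).lower()
    if severity not in INCIDENT_SEVERITIES:
        return None
    # Assumed fallback: unknown host groups go to the AIOps team.
    return TEAM_BY_HOST_GROUP.get(event.get("host_group", ""), "AIOps team")
```

For example, a critical event tagged with the `database` host group would be routed to the Database team, while an informational event would not open an incident at all.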
A specific example of an incident where PagerDuty Operations Cloud played a key role involves automations we created within PagerDuty. Rundeck is a job workflow tool where we can implement scripts or schedule jobs. If a server meets its threshold, it triggers PagerDuty Operations Cloud. We create scripts in Rundeck to handle issues, such as clearing a full file system. We utilize a feature called Automation Actions in PagerDuty Operations Cloud, and whenever an incident comes that matches specific conditions, that job will automatically run in Rundeck. This incident management cycle is effectively managed in PagerDuty Operations Cloud, allowing jobs to run and resolve incidents automatically, ensuring the server is healthy again.
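To make the idea of condition-based automation concrete, here is a minimal sketch of matching an incident against rules and selecting a remediation job. It does not use the PagerDuty or Rundeck APIs; the rule structure, the incident fields (`title`, `urgency`), and the job name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class AutomationRule:
    job_name: str                    # hypothetical Rundeck job name
    matches: Callable[[dict], bool]  # condition evaluated on the incident


def select_job(incident: dict,
               rules: Sequence[AutomationRule]) -> Optional[str]:
    """Return the first job whose condition matches, or None."""
    for rule in rules:
        if rule.matches(incident):
            return rule.job_name
    return None


# Assumed rule: a high-urgency incident about a full file system
# selects a cleanup job.
rules = [
    AutomationRule(
        job_name="clear-full-filesystem",
        matches=lambda inc: (
            "filesystem" in inc.get("title", "").lower()
            and inc.get("urgency") == "high"
        ),
    ),
]
```

In a real setup, the selected job would be run by the automation tooling and the incident resolved once the server is healthy again; this sketch stops at choosing the job.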
What is most valuable?
The best features PagerDuty Operations Cloud offers include Incident Workflows, which we use frequently to ease our team's work. These workflows trigger jobs in Rundeck based on certain conditions when incidents occur. We can create flows in Incident Workflow features and utilize Automation Actions, which allow us to run individual jobs in Rundeck. Additionally, Event Orchestration enables us to integrate various tools using integration keys with multiple applications. These features significantly simplify our daily operations within the team.
Integrations with other tools have been beneficial for our team as we receive requests from different teams to integrate their tools with PagerDuty Operations Cloud, enabling them to manage incidents. We have integrated AWS CloudWatch and Azure for monitoring, as well as CyberArk and Guardicore. If teams have specific requirements for integrating their tools, they approach us to create the necessary flows.
PagerDuty Operations Cloud has positively impacted our organization significantly. The response time has improved, and the team responds more quickly now. The PagerDuty Operations Cloud mobile application allows team members to acknowledge incidents via their mobile devices, where they can also receive calls when incidents trigger. The response time has become very quick.
I do not have precise numbers regarding the improvement in response time since using PagerDuty Operations Cloud, but I can share a story about a major incident with Checkmk. After we upgraded our Checkmk console, everything crashed, causing random events to be sent to PagerDuty Operations Cloud. We fixed the event flow from Checkmk using PagerDuty Operations Cloud's features. Furthermore, we have automated the restart of systems through PagerDuty Operations Cloud. If any server requires a restart, we trigger that job with just one click using Ansible, completing the task efficiently.
What needs improvement?
I do not see immediate improvements for PagerDuty Operations Cloud because there are numerous features we have yet to explore. As a product, it is continually upgrading its features, so we are focusing on how we can incorporate those into our use case.
Concerning PagerDuty Operations Cloud's AI capabilities, I am not certain as we currently do not use advanced AI-related features since our package offers limited access in that area. However, regarding governance and security, it appears very secure.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for three months.
What do I think about the stability of the solution?
PagerDuty Operations Cloud appears to be quite stable. I have not encountered any downtime or reliability issues, and it has consistently operated successfully.
What do I think about the scalability of the solution?
Regarding scalability, I think there is demand for PagerDuty Operations Cloud as we have been striving to mitigate manual tasks within our organization. We conduct demonstrations to illustrate how we can reduce manual work through our automations.
How are customer service and support?
The customer support for PagerDuty Operations Cloud is excellent. They have been very responsive. We have standing weekly calls to discuss any doubts, and there is a dedicated team, including an engineer and a PagerDuty Relations Manager, assigned to support us. They have been excellent in following up on the features we need to use.
What was our ROI?
I believe a return on investment is occurring because we are promoting PagerDuty Operations Cloud within our organization, aiming to involve more people and teams in using it. We continuously explore new features to facilitate ease of use among many people.
What other advice do I have?
There are many new features introduced in PagerDuty Operations Cloud. AI has been included, and specific features including Incident Workflows and Event Orchestrations have been implemented. One recent implementation in Incident Workflows is SLA tagging for incidents. We created a workflow to notify managers if an SLA has been breached beyond a certain time. This planning has helped us manage incidents more effectively.
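The SLA check behind such a workflow can be sketched as follows. This is only an illustration of the logic, not the Incident Workflows configuration itself; the incident fields, the four-hour SLA, and the manager list are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def sla_breached(created_at: datetime, now: datetime, sla: timedelta,
                 resolved_at: Optional[datetime] = None) -> bool:
    """Return True if the incident stayed open longer than the SLA."""
    end = resolved_at if resolved_at is not None else now
    return end - created_at > sla


def managers_to_notify(incident: dict, now: datetime, sla: timedelta,
                       managers: List[str]) -> List[str]:
    """Return the managers to notify, or an empty list if within SLA."""
    if sla_breached(incident["created_at"], now, sla,
                    incident.get("resolved_at")):
        return list(managers)
    return []


# Assumed example values.
opened = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
four_hours = timedelta(hours=4)
```

Under these assumptions, an incident opened at 09:00 and still unresolved at 14:00 has breached a four-hour SLA, so the listed managers would be notified.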
I have not utilized PagerDuty Operations Cloud's AI agents to address routine issues. However, for team productivity, we leverage escalation policies in PagerDuty Operations Cloud, assigning individual service directories to teams. Consequently, team members receive calls and messages based on their escalation hierarchy.
We have not utilized PagerDuty Operations Cloud's generative AI for decision-making, but the event analytics and operations console provide valuable insights. I can observe real-time data on incidents and alerts, which helps us address the inflow of events from integration keys. This information allows us to refine our planning and reduce event volumes from Checkmk.
I would highly recommend PagerDuty Operations Cloud as a reliable product. I do not have any negative experiences using PagerDuty Operations Cloud, and I believe it adds significant value to our environment if used properly.
When it comes to the accuracy and reliability of PagerDuty Operations Cloud's output, I find it quite reliable. It presents us with extensive data and analytics. The event flow we get from Checkmk provides much useful information, and we rely heavily on PagerDuty Operations Cloud for this analytics format.
I do not believe we have a business relationship with PagerDuty Operations Cloud beyond being a customer. We purchase memberships based on their plans and use them within our organization. I think we are not partners; rather, we simply resell their services internally.
My overall rating for PagerDuty Operations Cloud is seven out of ten.