AWS Cloud Operations Blog
Detect and remediate issues faster with AWS Systems Manager OpsCenter and Moogsoft AIOps
AWS Systems Manager, the operational hub for AWS and hybrid cloud deployments, recently announced the launch of OpsCenter to help you view, investigate, and resolve operational issues related to your environment from a central location. OpsCenter presents operational issues in a standardized view, along with contextually relevant data, and associated Systems Manager Automation documents, enabling easier diagnosis and remediation.
This post focuses on the new native integration between OpsCenter and Moogsoft AIOps. Moogsoft, an AWS partner, is doing pioneering work in building out an AI platform for IT operations. They’re focusing on reducing noise to improve the signal-to-noise ratio, suggesting root causes, and enabling cross-team collaboration to solve incidents faster.
You can now take advantage of this native integration between OpsCenter and Moogsoft AIOps to further enhance productivity for your DevOps engineers. The benefits include:
● Noise reduction using OpsItems deduplication logic and AI to filter out OpsItem noise automatically and cluster related OpsItems into Moogsoft Situations.
● Faster remediation of OpsItems using contextual investigative data and Moogsoft’s root cause and collaboration features.
● Reduction in incidents by automating remediation workflows and using machine learning. In many cases, Moogsoft customers have reduced incidents by up to 60%.
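To make the deduplication idea concrete, here is a minimal sketch in Python. It is not the OpsCenter or Moogsoft algorithm. It only illustrates the general technique of collapsing repeated events that share an identifying key. The field names (`source`, `resource`, `title`) and the sample events are assumptions chosen for illustration.

```python
from typing import Dict, List


def dedup_key(event: dict) -> str:
    """Join the fields that identify 'the same' issue into one string."""
    return "|".join([event["source"], event["resource"], event["title"]])


def deduplicate(events: List[dict]) -> Dict[str, dict]:
    """Collapse repeated events into one record per key and count repeats."""
    merged: Dict[str, dict] = {}
    for event in events:
        key = dedup_key(event)
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**event, "count": 1}
    return merged


# Hypothetical sample events: two identical launch failures and one CPU alarm.
events = [
    {"source": "EC2 Auto Scaling", "resource": "web-asg", "title": "Launch failure"},
    {"source": "EC2 Auto Scaling", "resource": "web-asg", "title": "Launch failure"},
    {"source": "CloudWatch", "resource": "web-asg", "title": "High CPU"},
]
unique = deduplicate(events)
```

Here the two identical launch failures collapse into a single record with a repeat count, which is the kind of noise reduction the first benefit describes.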
OpsCenter and Moogsoft AIOps integration
Moogsoft AIOps together with OpsCenter enables you to reduce mean time to resolution (MTTR) and stay focused on innovation projects instead of ongoing operational firefighting.
OpsCenter provides an open API to ingest any event data across the entire AWS service stack. This removes the burden from end users of gathering disparate event data across compute, storage, network, and other critical AWS services.
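As a rough sketch of what ingesting an event through that API can look like, the Python below builds the keyword arguments for the Systems Manager `CreateOpsItem` operation (exposed in boto3 as `create_ops_item`). The helper name `build_ops_item_request`, the source name, and all sample values are assumptions for illustration. The `/aws/dedup` operational data key is the one OpsCenter uses to suppress duplicate OpsItems.

```python
import json


def build_ops_item_request(title, description, source, severity,
                           dedup_string, extra=None):
    """Return keyword arguments shaped for the SSM CreateOpsItem operation."""
    operational_data = {
        # OpsCenter reads the /aws/dedup key to avoid creating duplicates.
        "/aws/dedup": {
            "Value": json.dumps({"dedupString": dedup_string}),
            "Type": "SearchableString",
        }
    }
    # Custom keys carry extra context for whoever investigates the OpsItem.
    for key, value in (extra or {}).items():
        operational_data[key] = {"Value": str(value), "Type": "String"}
    return {
        "Title": title,
        "Description": description,
        "Source": source,
        "Severity": str(severity),  # the API expects severity as a string
        "OperationalData": operational_data,
    }


# Hypothetical values for the Auto Scaling example used later in this post.
request = build_ops_item_request(
    title="Auto Scaling launch failure",
    description="Auto Scaling group could not launch new instances.",
    source="example-monitoring-tool",
    severity=2,
    dedup_string="web-asg-launch-failure",
    extra={"autoScalingGroup": "web-asg"},
)
# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("ssm").create_ops_item(**request)
```

Building the request separately from sending it keeps the payload easy to inspect and test without AWS credentials.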
OpsCenter together with Moogsoft AIOps reduces MTTR by automating the event-to-resolution workflow using AI and machine learning. The workflow reduces event noise, clusters similar alerts into Situations, provides probable root cause analysis, and enables collaboration by integrating with ITSM, notification, orchestration, and remediation systems.
The value of combining OpsCenter and Moogsoft is multifaceted. OpsCenter is the aggregation point for operational issues across various AWS services, and it enables contextual investigation and remediation actions. Moogsoft industrializes the ingestion of OpsItem data and then uses Moogsoft AIOps to surface critical IT incidents, cluster related OpsItems, and offer teams a collaboration workflow to remediate what’s wrong. The following diagram shows the data flow and points of integration.
“At Moogsoft, we are impressed with the AWS Systems Manager OpsCenter feature set and open framework. It’s easy for our customers to combine it with our Moogsoft AIOps data science approach, and achieve a powerful, modern combined approach to Service Assurance, both on premises and in the cloud,” says Dave Casper, CTO, Moogsoft.
Auto Scaling failure: An OpsCenter and Moogsoft AIOps use case
The following use case describes how OpsCenter and Moogsoft AIOps help reduce MTTR and provide an event-to-resolution workflow.
Imagine this scenario on Black Friday, the biggest shopping day of the year. An online retailer, a digital native company, ran its critical services in the AWS Cloud. The retailer had built an application that used AWS services for high availability, redundancy, and performance. Disaster loomed when Auto Scaling failed to expand the compute layer while demand quickly spiked during the holiday rush. It turned out that Auto Scaling failed due to human error!
This failure in the customer account created an OpsItem in OpsCenter. Moogsoft industrialized the ingestion of all incoming events, including those from OpsCenter, and created a new Situation. Moogsoft's AI analytics algorithms identified the new Situation as similar to a prior Situation. An operator checked the past Situation and learned that the Amazon Machine Image (AMI) name referenced by the Auto Scaling group had not been found.
The remediation step for the past similar OpsItem included running an AWS Systems Manager Automation document to correct the configuration error in the Auto Scaling group and update the AMI name.
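A remediation step like this one could be started programmatically through the Systems Manager `StartAutomationExecution` operation (`start_automation_execution` in boto3). The sketch below only builds the request. The document name `Example-UpdateAutoScalingGroupAmi` and its parameter names are hypothetical, because the post does not name the Automation document that was used.

```python
def build_automation_request(document_name, parameters):
    """Return keyword arguments shaped for SSM StartAutomationExecution."""
    # Automation parameters are passed as a map of name -> list of strings.
    return {
        "DocumentName": document_name,
        "Parameters": {
            name: [str(value)] for name, value in parameters.items()
        },
    }


# Hypothetical document and parameter names, for illustration only.
automation_request = build_automation_request(
    "Example-UpdateAutoScalingGroupAmi",
    {"AutoScalingGroupName": "web-asg", "AmiName": "web-server-golden-image"},
)
# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("ssm").start_automation_execution(**automation_request)
```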
As part of the initial analysis, the Probable Root Cause (PRC) was presented to the operator. While several alerts were clustered together, the Moogsoft Situation Room console displayed a mixture of root cause and symptomatic alerts.
The PRC for this specific Situation was identified as an alert on an Auto Scaling failure.
The information about similar Situations and the PRC helped the operator spend less time troubleshooting the problem. Further, OpsCenter recommended running a Systems Manager Automation document that was previously used to resolve the issue. With this recommended remediation action in OpsCenter, the operator reduced the overall mean time to resolution.
The Timeline view in Moogsoft AIOps showed the evolution of how the Situation’s alerts arrived, including the order, frequency, and severity.
The Timeline view showed the story of how the compute stack started having degraded resource performance. At the same time, and at regular intervals, Amazon CloudWatch was informing Moogsoft of Auto Scaling launch failures. As time progressed, the severity of the compute stack resource usage increased from minor to major and then to critical. The last few messages originated from the application monitoring tool and reported slow application response times and refused application connections.
Two-way integration
As illustrated in the data flow diagram, OpsCenter and Moogsoft AIOps have a two-way integration. When a Situation is created, Moogsoft AIOps creates a new OpsItem in OpsCenter. The new Situation-based OpsItem contains all the relevant information about related OpsItems that are clustered together. The following screenshot shows the new Situation-based OpsItem and its description detail.
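The Situation-to-OpsItem direction of this integration can be pictured with a small sketch. This is not Moogsoft's implementation. It assumes a simplified Situation dictionary (the keys `id`, `title`, `summary`, and `ops_item_ids` are hypothetical) and maps it onto `CreateOpsItem` keyword arguments, using the `RelatedOpsItems` parameter to link the clustered OpsItems.

```python
def situation_to_ops_item(situation):
    """Map a Situation-like dict onto CreateOpsItem keyword arguments."""
    related_ids = situation["ops_item_ids"]
    # Summarize the cluster in the description so it is readable at a glance.
    description = "{}\nClustered OpsItems: {}".format(
        situation["summary"], ", ".join(related_ids)
    )
    return {
        "Title": "Situation {}: {}".format(situation["id"], situation["title"]),
        "Description": description,
        "Source": "moogsoft-example",  # hypothetical source name
        # Link the new OpsItem to the OpsItems that were clustered together.
        "RelatedOpsItems": [{"OpsItemId": item_id} for item_id in related_ids],
    }


# Hypothetical Situation with two clustered OpsItems.
situation = {
    "id": 42,
    "title": "Auto Scaling launch failures",
    "summary": "Compute layer degraded while launches were failing.",
    "ops_item_ids": ["oi-111111111111", "oi-222222222222"],
}
ops_item_request = situation_to_ops_item(situation)
```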
Summary
The combination of AWS Systems Manager OpsCenter and Moogsoft AIOps brings economic value to your operations by reducing time to resolution and improving service delivery of critical applications.
About the authors
Craig Yenni is a Strategic Architect at Moogsoft, focused on the joint success of channels and alliances. He has been engaged in technology for over 20 years. This journey has taken him down the path to application development, operations, architecture, engineering, sales, and consulting.
Sid Arora leads product development on OpsCenter at AWS Managed Services, working cross-functionally with AWS Systems Manager. Sid has led building multiple products at Amazon.com and Amazon Web Services over the last 7+ years. His passions include leveraging machine learning and artificial intelligence to simplify and personalize user experiences across consumer, enterprise, and cloud operations products.