Operational insights in Systems Manager OpsCenter help you identify duplicate issues and noisy event sources
If you use AWS Systems Manager OpsCenter, you might be familiar with the challenges of large numbers of OpsItems. When the same problem causes the creation of a significant number of OpsItems, it can be hard to see that these OpsItems are in fact the result of a single issue. It can also be difficult to see other unique issues in the noise, which can cause you to miss critical issues. Although it’s good practice, it takes time to close a lot of related OpsItems. If you overlook them and leave them open, you might waste time later when you’re troubleshooting an issue.
Operational insights, a new OpsCenter feature, can help you improve operational efficiency by:
- Identifying noisy and duplicate OpsItems.
- Providing recommendations and suggested automations to reduce the creation of unnecessary OpsItems.
- Resolving OpsItems in bulk.
The feature currently provides two insights:
- Duplicate OpsItems: This insight helps you identify OpsItems that might have the same root cause and are duplicates. We identify these by collecting OpsItems with the same title and resource.
- Sources generating most OpsItems: This insight helps you identify sources that are generating more than the expected number of OpsItems. We identify these by collecting OpsItems with the same title, but no resource.
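The grouping behind the two insight types can be sketched in a few lines of Python. This is only an illustration of the criteria described above, not the OpsCenter implementation: the OpsItem record, the group_insights function, and the min_group_size threshold are all hypothetical names and values chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpsItem:
    # Hypothetical, simplified record; real OpsItems carry many more fields.
    ops_item_id: str
    title: str
    resource: Optional[str] = None


def group_insights(ops_items, min_group_size=2):
    """Group OpsItems according to the two insight descriptions.

    min_group_size is an assumed threshold used for illustration only.
    """
    duplicates = defaultdict(list)     # keyed by (title, resource)
    noisy_sources = defaultdict(list)  # keyed by title when no resource is set
    for item in ops_items:
        if item.resource:
            duplicates[(item.title, item.resource)].append(item.ops_item_id)
        else:
            noisy_sources[item.title].append(item.ops_item_id)
    return (
        {k: v for k, v in duplicates.items() if len(v) >= min_group_size},
        {k: v for k, v in noisy_sources.items() if len(v) >= min_group_size},
    )
```

OpsItems that share a title and a resource land in the first dictionary, and OpsItems that share a title but have no resource land in the second.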
Note: These insights are not generated in real time, but through a scheduled batch process. This means you must wait for that process to run before you see any insights. If you follow along with the example in this post, generate a few events and come back later to explore the generated insights.
In this blog post, we share an example of an Amazon EventBridge rule that creates an OpsItem when a virtual machine (VM) changes state (for example, goes from running to stopped). In our example, the VM is an Amazon Elastic Compute Cloud (Amazon EC2) instance. We’ll show you how to view an operational insight, how to reduce unnecessary duplicate and noisy OpsItems occurring in the future, and how to bulk-resolve the items identified by these insights.
To follow the steps in this post, you need to enable Systems Manager and Operational insights in your account. To enable Systems Manager, follow the steps in the Manage instances using AWS Systems Manager Quick Setup blog post. If you want to customize Systems Manager, see Setting up AWS Systems Manager in the AWS Systems Manager User Guide.
To enable Operational insights, from the left navigation pane of the Systems Manager console, choose OpsCenter. Under Operational insights, choose Enable, as shown in Figure 1.
Figure 1: Enable Operational insights
Create the EventBridge rule
In the EventBridge console, complete the fields as shown in Figure 2. In Define pattern, choose Event pattern. Under Event matching pattern, choose Pre-defined pattern by service. For Service provider, choose AWS. For Service name, choose EC2. For Event type, choose EC2 Instance State-change Notification. Under Select targets, for Target, choose SSM OpsItem.
Figure 2: EventBridge rule with event pattern and targets defined
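For reference, the console selections above correspond to an event pattern that matches on the EC2 source and the state-change detail type. The Python sketch below writes that pattern as a dictionary and checks it against a sample event. The matches function is a deliberately simplified stand-in for illustration, not how EventBridge evaluates patterns, and the instance ID is a placeholder.

```python
import json

# Event pattern equivalent to the pre-defined pattern chosen in the console.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
}


def matches(pattern, event):
    """Simplified matcher: each top-level pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Placeholder event; the instance ID is made up for this example.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

Because the pattern does not filter on the state value, every state change for every instance matches, which is why this rule produces so many OpsItems.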
After you have created the EventBridge rule, you can create an EC2 instance and stop and start it a few times to generate an operational insight. In our example, we created OpsItems for 12 state changes for one EC2 instance. Because each start-and-stop cycle passes through four states (Pending, Running, Stopping, and Stopped), 12 state changes represent three full cycles of starting and stopping the instance.
Figure 3 shows the number of open operational insights (in this case, one Duplicate OpsItems insight). A maximum of 25 insights can be open at any time, after which no new insights will be created.
To view the insight generated from our EventBridge rule, choose View all operational insights.
Figure 3: Operational insights
If you have lots of OpsItems, you can filter from the All insight types dropdown. You’ll see the EC2 instance has changed state a number of times, which created multiple OpsItems. It is identified as a duplicate because the OpsItems have the same title and resource (the EC2 instance).
Figure 4: Multiple OpsItems created with the same title “EC2 Instance State-change Notification”
Reduce duplicate OpsItems
We want to create a single OpsItem instead of multiple OpsItems.
Choose the insight ID to open the details page for the operational insight. Figure 5 shows the details (insight type, number of affected OpsItems, description, status, date created, and last updated date).
Figure 5: Insight Details
You can see recommended actions you can take in the form of runbooks. The runbooks vary. They are recommended to help you to reduce noise, not to resolve the underlying issue that triggered the OpsItem. You should do root cause analysis and take appropriate action to remediate the issue.
Figure 6: Recommended runbooks
When multiple runbooks are recommended, apply them in the order provided. In Figure 6, the recommendation is to apply the AWS-AddOpsItemDedupStringToEventBridgeRule runbook to add a deduplication string to reduce the number of duplicate OpsItems and to apply the AWS-BulkResolveOpsItems runbook to resolve all the OpsItems already created.
View runbook automations
The details page shows the automation history, which includes any runbooks you have executed. It also includes a Tips section. In Figure 7, the tip is to add a deduplication string. The AWS-AddOpsItemDedupStringToEventBridgeRule runbook will help us with that.
Figure 7: Automation execution history
Add a deduplication string
When you build an EventBridge rule to create an OpsItem, you have the option to specify a deduplication string. If you specify a deduplication string, an OpsItem is created only if there are no other open OpsItems with the same deduplication string. For more information, see Working with deduplication strings in the AWS Systems Manager User Guide.
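The deduplication rule can be modeled with a small in-memory sketch. The OpsItemStore class below is a toy written for this post, not the OpsCenter API, and it assumes that only items in the Open status count as open.

```python
class OpsItemStore:
    """Toy in-memory model of the deduplication rule; not the OpsCenter API."""

    def __init__(self):
        self.items = []  # each item: {"id", "title", "dedup", "status"}

    def create(self, title, dedup_string=None):
        """Create an OpsItem unless an open one already has the same string."""
        if dedup_string is not None:
            for item in self.items:
                if item["dedup"] == dedup_string and item["status"] == "Open":
                    return None  # suppressed as a duplicate
        item = {
            "id": len(self.items) + 1,
            "title": title,
            "dedup": dedup_string,
            "status": "Open",
        }
        self.items.append(item)
        return item["id"]

    def resolve(self, item_id):
        """Mark the given OpsItem as Resolved so the string can be reused."""
        for item in self.items:
            if item["id"] == item_id:
                item["status"] = "Resolved"
```

In this model, a second create call with the same string is suppressed while the first item is open, and a new item is created again only after the first one is resolved.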
In our example, the recommended deduplication string, EC2 Instance State-change Notification, is in the runbook description and the Tips section of the operational insight.
To execute the runbook, choose it in the list and then choose Execute.
Figure 8: Recommended runbooks with AWS-AddOpsItemDedupStringToEventBridgeRule selected
Figure 9 shows the required input parameters for this runbook. Recommendations from the operational insight are already populated. You do not need to change these values, but you can modify the value in DedupString, if you like. Choose Execute to execute the runbook.
Figure 9: Runbook input parameters
View the EventBridge rule
Go to your EventBridge rule. In Figure 10, you’ll see that the runbook has added input transformations to the target (previously empty). This is how you tell EventBridge to modify the event information before sending it to create an OpsItem.
This change means that every event the rule sends to OpsCenter, for every EC2 instance state change, will carry the same deduplication string. So, only the first EC2 state change event will create an OpsItem while that OpsItem remains open, which will reduce noise.
For more information, see Transforming Amazon EventBridge target input in the Amazon EventBridge User Guide.
Figure 10: EventBridge rule showing the addition of a deduplication string after executing the runbook
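To give a rough idea of how a deduplication string travels with an event, the Python sketch below builds an operational data entry under the /aws/dedup key, the format the Systems Manager User Guide uses for deduplication strings. Treat it as an assumption-laden illustration: the helper name build_dedup_operational_data is made up for this example, and the exact input transformer that the runbook writes to your rule may differ from this shape.

```python
import json


def build_dedup_operational_data(dedup_string):
    """Return an operationalData entry carrying a deduplication string.

    Assumption: the /aws/dedup key with a nested JSON value, as documented for
    OpsItem deduplication; the runbook's actual transformer may differ.
    """
    return {
        "/aws/dedup": {
            "type": "SearchableString",
            # The value is itself a JSON document serialized as a string.
            "value": json.dumps({"dedupString": dedup_string}),
        }
    }


operational_data = build_dedup_operational_data(
    "EC2 Instance State-change Notification"
)
print(json.dumps(operational_data, indent=2))
```

Compare the printed structure with the input transformer shown on your own rule to see where the deduplication string appears.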
Resolve OpsItems in bulk
We executed the runbook to add the deduplication string to reduce future noise, but we still have an operational insight with multiple OpsItems open. We can now run the second runbook to bulk-resolve all the OpsItems in this insight.
The Operational insights feature introduced a new runbook that resolves multiple OpsItems in a single operation. Choose AWS-BulkResolveOpsItems and then choose Execute.
Figure 11: Recommended runbooks with AWS-BulkResolveOpsItems selected
As with the previous runbook, the required parameters are already populated. Choose Execute.
On the OpsCenter summary page, the OpsItems are no longer open. They are set to Resolved.
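If you prefer to script a similar bulk resolution yourself, one possible approach is to call the Systems Manager UpdateOpsItem API once per OpsItem. The sketch below is not what the AWS-BulkResolveOpsItems runbook does internally. It assumes a client that behaves like boto3.client("ssm"), and it uses a hypothetical FakeSsmClient and placeholder OpsItem IDs so that it runs without AWS credentials.

```python
def resolve_ops_items(ssm_client, ops_item_ids):
    """Set each OpsItem to Resolved and return the IDs that were updated.

    ssm_client is expected to behave like boto3.client("ssm").
    """
    resolved = []
    for ops_item_id in ops_item_ids:
        ssm_client.update_ops_item(OpsItemId=ops_item_id, Status="Resolved")
        resolved.append(ops_item_id)
    return resolved


class FakeSsmClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def update_ops_item(self, **kwargs):
        self.calls.append(kwargs)
        return {}


client = FakeSsmClient()
print(resolve_ops_items(client, ["oi-example1", "oi-example2"]))
```

To run this against a real account, you would pass a real Systems Manager client and the IDs of the OpsItems listed in the insight.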
Resolve the insight
Your insight might still be visible after you complete these runbooks. That’s because operational insights are not generated in real time, but through a scheduled batch process. After this process runs again, if all the conditions for the insight are no longer satisfied, the insight will be resolved for you.
Change the state of the instance
After your insight has resolved itself, try changing the state of your EC2 instance a few times. Now that you have applied a static deduplication string, you should only see a single OpsItem, regardless of the number of state changes or EC2 instances.
Identify the sources generating the most OpsItems
Some OpsItems do not have a resource specified, but can still be noisy. These might be misconfigured rules, rules without sufficient useful information to act on, or noisy sources we want to adjust our rules for.
Consider our example rule: If we had created the OpsItem without a resource ID, how would we know which EC2 resources to troubleshoot? This is an example of a misconfigured rule that we would want to modify.
The second type of operational insight helps when we have multiple OpsItems with the same title, but no resource information. What’s different for these operational insights?
The only difference is the recommended resolution. In this case, the recommended runbooks (in order) are AWS-DisableEventBridgeRule to disable the EventBridge rule and AWS-BulkResolveOpsItems to resolve all the items already created.
The deduplication string runbook is not recommended, because it would have no impact in reducing the number of OpsItems created.
As with our previous example, run both runbooks in order.
Disabling the EventBridge rule is the only way to reduce the noise generated by these OpsItems. Before you disable the rule, consider whether you need to be aware of the event’s occurrence. If you do, you can use bulk resolution to save time.
The creation of an operational insight is charged at the same rate as the creation of an OpsItem. Each runbook carries a cost of two API calls for each resolved OpsItem. For more information, see AWS Systems Manager pricing.
To avoid charges in your account, delete the resources you created.
To disable operational insights, see Disabling operational insights. If you disable operational insights, you will stop new insights from being created, but will not remove existing insights.
In this post, we introduced you to the Operational insights feature and shared an example to help you see the insight details, use recommended runbooks to reduce noise, and bulk-resolve existing OpsItems.
By acting on operational insights, you can make sure OpsItems contain appropriate events and improve your visibility and the time to resolution of operational issues. For more information, see Working with operational insights in the AWS Systems Manager User Guide.