AWS Cloud Operations Blog

Improve the Visibility and Collaboration during Incident Handling in AWS Systems Manager Incident Manager

Today, AWS announces two new capabilities in AWS Systems Manager Incident Manager: Incident Notes and an extension of the Incident Status Banner. Incident Manager helps you resolve incidents faster, which supports high application availability. These new capabilities add visibility and collaboration features that help customers quickly bring the right people together with the right information during critical events, improving the overall experience and reducing downtime.

During a critical incident, it is important for customers to keep track of the latest information and have an up-to-date status of the incident resolution steps. The Incident Notes panel gives customers a place to organize and track the progress of mitigation activities, and to post updates and other relevant information. Incident Notes help operational teams coordinate and communicate when handling critical application issues, and capture this information as part of the incident record for later post-incident analysis.

The image below shows the Incident Notes section in the right pane, where users can post updates related to the incident so that other users can see the actions that have been taken and the investigation being pursued.

Pane displaying incident notes

Figure 1: Incident Notes pane to post updates related to incidents

Each note is timestamped so users can track when each post was made. Notes are also visible on the Timeline tab so users can see what occurred in relation to other important events for the incident. Users can edit or delete a note, as shown in the image below. Incident Notes remain part of the incident record and are later available in the incident timeline in the post-incident analysis.

Below is a sample view of the Incident Notes section for a specific incident.

Pane displaying incident notes

Figure 2: Close-up of Incident Notes pane with updates
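To illustrate how timestamped notes sit alongside other events on a timeline, here is a minimal Python sketch. It does not call Incident Manager. The `TimelineEntry` class, the `merge_timeline` function, the event labels, and the sample data are all hypothetical and exist only to show the idea of ordering notes with other incident events by time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One row on an incident timeline (hypothetical model, not the service schema)."""
    at: datetime
    kind: str  # illustrative labels such as "Note" or "Runbook started"
    text: str


def merge_timeline(notes, events):
    """Return notes and other events as one list, oldest first."""
    return sorted([*notes, *events], key=lambda entry: entry.at)


def ts(hour, minute):
    # Fixed, made-up UTC timestamps keep the example deterministic.
    return datetime(2023, 1, 1, hour, minute, tzinfo=timezone.utc)


notes = [
    TimelineEntry(ts(10, 5), "Note", "Rolled back the latest deployment"),
    TimelineEntry(ts(10, 20), "Note", "Error rate is back to normal"),
]
events = [
    TimelineEntry(ts(10, 0), "Incident started", "Alarm triggered"),
    TimelineEntry(ts(10, 10), "Runbook started", "Mitigation runbook"),
]

timeline = merge_timeline(notes, events)
for entry in timeline:
    print(entry.at.strftime("%H:%M"), entry.kind, "-", entry.text)
```

Sorting on the timestamp alone is enough here because each entry carries its own time, which is what lets a note be read in relation to the events around it.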

The extension of the incident status banner consists of two new sections, Runbooks and Engagements, as shown in the image below. They join the existing top banner, which already contains items such as Status, Impact, Chat channel, and Duration.

Banner with Status, Impact, Chat channel, Duration, Runbooks, and Engagements

Figure 3: Incident status banner showing new sections of Runbooks and Engagements

  1.  The Runbooks section in the banner provides a consolidated view of overall status, runbook progress, and any pending actions for a particular runbook. The banner currently supports the following statuses:
    • Waiting for input
    • Unsuccessful (will have a popover containing Runbooks in the timed out, cancelled or failed state)
    • Successful
    • In progress

If a runbook’s status is “Waiting for input”, you can choose the link to view the action details.
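The grouping of runbook states into the four banner statuses listed above can be sketched in a few lines of Python. This is an illustration only: the input strings are modeled on Systems Manager Automation execution status values, and the mapping itself is an assumption based on the list in this post, not the console's actual implementation. The names `BANNER_LABEL` and `summarize_runbooks` are hypothetical.

```python
from collections import Counter

# Assumed mapping from automation execution status strings to the four banner
# statuses named in this post. The console's real mapping may differ.
BANNER_LABEL = {
    "Waiting": "Waiting for input",
    "InProgress": "In progress",
    "Success": "Successful",
    "TimedOut": "Unsuccessful",
    "Cancelled": "Unsuccessful",
    "Failed": "Unsuccessful",
}


def summarize_runbooks(statuses):
    """Count runbooks per banner status; unrecognized statuses are skipped."""
    return Counter(BANNER_LABEL[s] for s in statuses if s in BANNER_LABEL)


summary = summarize_runbooks(["Waiting", "InProgress", "Failed", "TimedOut"])
for label, count in summary.items():
    print(f"{label}: {count}")
```

Folding the timed out, cancelled, and failed states into a single "Unsuccessful" count mirrors the way the banner shows one status with a popover for the details.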

As the screenshot below shows, the banner displays runbook status as soon as a runbook is started. In this incident, for example, one runbook is waiting for input and one is in progress.

Banner with detailed status for Runbooks

Figure 4: Banner showing status for Runbooks once the runbook has started

If a runbook has a manual step that needs human intervention, the banner also shows which step is waiting, with further information available in the Runbooks tab. This makes it easy for customers to see which steps and which runbooks are waiting for manual action, and this visibility aims to speed up the actions taken during critical incidents.

Banner with detailed status for Runbooks

Figure 5: Banner showing status for Runbooks, including which step is waiting

  2. The Engagements section provides a summary of the total number of engagements and how many of them have been acknowledged. Choosing the link takes you to the Engagements tab, where you can view engaged contacts or engage additional contacts in the incident.

The Engagements tab provides the resources that engaged contacts need to get up to speed quickly once they are involved in the incident. Once the requested contacts have joined the incident, they can perform relevant steps such as approving manual steps in runbooks or running another runbook.

In the following image, the contact “rising” was engaged but has not yet acknowledged the engagement. The Engagements tab shows additional details about this specific engagement, such as the escalation plan, if one was used for this engagement.

 Banner with status for Engagements

Figure 6: Banner showing status for Engagements

Both the banner and the Engagements tab are updated automatically as soon as an engagement is acknowledged, as shown in the following image. Here there are two engagements, and only one of them, “rising”, has been acknowledged so far.

Banner with detailed status for Engagements

Figure 7: Banner showing detailed status for Engagements
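The Engagements summary described above, total engagements and how many have been acknowledged, amounts to a simple count. The Python sketch below shows that count on sample data. The `Engagement` class, the `engagement_summary` function, the contact name "second-contact", and the summary wording are hypothetical and are not taken from the console or the service API.

```python
from dataclasses import dataclass


@dataclass
class Engagement:
    """A contact engaged in an incident (hypothetical model for illustration)."""
    contact: str
    acknowledged: bool = False


def engagement_summary(engagements):
    """Return a short 'acknowledged of total' summary string."""
    total = len(engagements)
    acknowledged = sum(1 for e in engagements if e.acknowledged)
    return f"{acknowledged} of {total} acknowledged"


# Two engagements, one of which has been acknowledged, as in the figure above.
engagements = [
    Engagement("rising", acknowledged=True),
    Engagement("second-contact"),  # made-up contact name
]
print(engagement_summary(engagements))
```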

These new capabilities are available in all AWS Regions where Incident Manager is currently available. To view information about Incident Manager Regions and quotas, see Incident Manager Endpoints and quotas in the Amazon Web Services General Reference guide.

There are no additional charges for using these newly launched features. For more information on Incident Manager pricing, visit the Systems Manager Incident Manager Pricing page.

Conclusion:

In this blog post, you learned how to use the new Incident Notes capability and the extension of the Incident Status Banner in AWS Systems Manager Incident Manager to resolve incidents more quickly. These capabilities help you bring together the right people with the right information during critical events, supporting high application availability and reducing downtime. Learn more in our Incident Manager documentation.

For a feature overview, visit the Incident Manager feature page. To get started, visit the Systems Manager console.

 

About the author:

Rishi Singla

Rishi is a Senior Partner Solutions Architect at AWS, where he specialises in CloudOps and Security services. He also works closely with the AWS reStack Partners in the APJ region, enabling them to achieve their strategic objectives. Rishi is a big cricket enthusiast and also loves playing social tennis tournaments.