Incident Manager, a capability of AWS Systems Manager, helps you resolve critical application availability and performance issues faster. It helps you prepare for incidents with automated response plans that bring the right people and information together. With Incident Manager, you can automatically take action when a critical issue is detected by an Amazon CloudWatch alarm or an Amazon EventBridge event. Incident Manager runs pre-configured response plans that engage responders via SMS and phone calls, link designated chat channels using AWS Chatbot, and run AWS Systems Manager Automation runbooks. Incident Manager also helps you improve service reliability by suggesting post-incident action items, such as automating a runbook step or adding a new alarm. These suggestions draw on practices developed at Amazon over decades of incident response and analysis.
Incident Manager automatically collects and tracks the metrics related to an incident through Amazon CloudWatch. You can also add metrics to an incident manually, in real time, from a chat channel or the Incident Manager incident dashboard. Investigate metrics further by using the built-in CloudWatch graphs. The Incident Manager incident timeline displays points of interest in chronological order. Responders can also add custom events to the timeline to describe what they did or what happened.
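The chronological behavior of the timeline can be illustrated with a small, self-contained sketch. It does not call Incident Manager. The TimelineEvent and Timeline names are hypothetical, and the "Custom Event" label is only an illustrative default. The sketch shows how events added out of order, including responder-written ones, end up sorted by time.

```python
import bisect
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(order=True)
class TimelineEvent:
    # Only the timestamp takes part in ordering comparisons.
    time: datetime
    description: str = field(compare=False)
    kind: str = field(compare=False, default="Custom Event")  # illustrative label


class Timeline:
    """Hypothetical in-memory timeline that keeps events sorted by time."""

    def __init__(self):
        self._events = []

    def add(self, event):
        # insort keeps the list sorted; events with equal times keep insertion order.
        bisect.insort(self._events, event)

    def events(self):
        return list(self._events)


timeline = Timeline()
timeline.add(TimelineEvent(datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc),
                           "Restarted the web tier"))
timeline.add(TimelineEvent(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
                           "Alarm fired", kind="Alarm"))
for event in timeline.events():
    print(event.time.isoformat(), event.kind, event.description)
```

Keeping the list sorted on insert means a late-arriving note from a responder still appears at the point in the incident where the action actually happened.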
Incident Manager brings incident responders together through contacts, escalation plans, and chat channels. Define contacts directly in Incident Manager with their preferred contact methods. Create escalation plans to engage the necessary responders at the right time during an incident. Bring together responders in connected chat channels where they can directly interact with the incident using AWS Chatbot clients. Incident Manager displays the real-time actions of incident responders in the chat channel, providing context to others.
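As a rough illustration of how an escalation plan engages responders over time, the sketch below models a plan as an ordered list of stages, each with a set of contacts and a wait time before the next stage begins. The Stage class, the engaged_contacts function, and the contact names are hypothetical. The sketch makes no AWS calls and does not model acknowledgment, so it only shows the timing idea under those assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    contacts: tuple      # contacts engaged when this stage starts
    wait_minutes: int    # time to wait before the next stage starts


def engaged_contacts(stages, minutes_elapsed):
    """Return the contacts engaged so far, in the order they were engaged."""
    engaged = []
    stage_start = 0  # minutes after the incident began
    for stage in stages:
        if minutes_elapsed < stage_start:
            break  # this stage, and every later one, has not started yet
        engaged.extend(stage.contacts)
        stage_start += stage.wait_minutes
    return engaged


plan = [
    Stage(contacts=("primary-on-call",), wait_minutes=10),
    Stage(contacts=("secondary-on-call", "team-manager"), wait_minutes=15),
]
print(engaged_contacts(plan, 5))
print(engaged_contacts(plan, 10))
```

With this plan, only the first stage is engaged during the first ten minutes, and the second stage joins once that wait has passed.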
Automate and improve
Incident Manager enables the use of runbooks, which detail and automate the repeatable steps needed to resolve an incident, saving responders time so they can focus on analyzing and responding to incidents. By using Incident Manager’s post-incident analysis, your team can develop more robust response plans and effect change across your applications to prevent future incidents and downtime. Post-incident analysis also supports iterative learning and improvement of runbooks, response plans, and metrics.
With Incident Manager, you can plan ahead for potential incidents and how best to detect and respond to specific incident types. Create response plans that define how to respond when an incident occurs, such as who to engage, the expected severity of the event, automatic runbooks to initiate, and metrics to monitor.
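The pieces of a response plan described above can be sketched as a plain data structure. The example below makes no AWS calls. The build_response_plan helper, the ARNs, and the runbook name are hypothetical, and the field names are modeled on the Incident Manager CreateResponsePlan API as an assumption that should be checked against the current API reference. Metrics are left out because this sketch does not assume where they would be configured.

```python
import json


def build_response_plan(name, title, impact, contact_arns, runbook_name, role_arn):
    """Assemble a response plan definition as a dict (hypothetical helper)."""
    # Impact is assumed to use a 1 to 5 scale, with 1 the most severe.
    if not 1 <= impact <= 5:
        raise ValueError("impact must be between 1 and 5")
    return {
        "name": name,
        # What the incident looks like when the plan is triggered.
        "incidentTemplate": {"title": title, "impact": impact},
        # Who to engage: contacts or escalation plans, identified by ARN.
        "engagements": list(contact_arns),
        # Which Automation runbook to start, and the role used to run it.
        "actions": [
            {"ssmAutomation": {"documentName": runbook_name, "roleArn": role_arn}}
        ],
    }


plan = build_response_plan(
    name="checkout-outage",
    title="Checkout service unavailable",
    impact=1,
    contact_arns=["arn:aws:ssm-contacts:us-east-1:111122223333:contact/example"],
    runbook_name="Example-RestartCheckout",  # hypothetical runbook name
    role_arn="arn:aws:iam::111122223333:role/ExampleAutomationRole",
)
print(json.dumps(plan, indent=2))
```

Building the definition separately from any API call makes it easy to review or version the plan before it is created. If the field names match the current API, a dict like this could be passed as keyword arguments to an SDK call that creates the response plan.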
You can use Incident Manager together with Systems Manager Automation to define runbooks, which detail and automate the repeatable steps needed to resolve an incident. Use runbooks to reduce response times and provide detailed steps to responders. When you connect a runbook to a response plan, Incident Manager records the progress of both manual and automated steps in the incident response.
Post-incident analysis provides the structure in which your team can formulate ways to improve response and customer experience. Use Incident Manager to guide you through identifying improvements to your incident response, including time to detection and mitigation. Incident Manager also creates recommended action items to improve your application, response plan, runbooks, or alerting.