AWS Cloud Operations & Migrations Blog

Creating a correction of errors document

This blog post walks you through an example of creating a Correction of Errors (COE) document. At Amazon, operational excellence is in our DNA, and one best practice we have learned is to have a standard mechanism for post-incident analysis. The COE process facilitates learning from an event to avoid recurrences in the future. This post is a follow-up to a previous blog post introducing the Correction of Errors process and why you should use one; if you are not familiar with the process, please read that post before proceeding with this one.

To demonstrate the Correction of Errors process, we will walk through a fictional scenario involving a failure event. Once the event was resolved and the workload returned to normal operations, it was essential to understand what went wrong and what would be done to prevent it from happening again. For this, the team decided to conduct a Correction of Errors review.

We will walk through the construction of a Correction of Errors document, one section at a time. For each section we will explain what information is required and then show an example of what it would look like in this particular scenario.

There are many ways to document the Correction of Errors process. The examples in this blog show the data as it would appear in document form, as well as screenshots of the data in Incident Manager, a capability of AWS Systems Manager. Incident Manager enables faster resolution of critical application availability and performance issues, and it helps you prepare for incidents with automated response plans that bring the right people and information together. Creating the data in Incident Manager is a three-step process: 1) Create a response plan, a template that prepares you for an incident and helps expedite engagement and mitigation. 2) Manually create an incident from the response plan, then document the specific details and metrics of the event. 3) Once the event has been resolved, select Create analysis; this is where you perform and document the post-incident analysis.
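If you prefer to script the first two steps, here is a minimal sketch using the AWS SDK for Python (boto3). The plan name, display name, incident title, and impact level are hypothetical placeholders, not values from the scenario.

```python
import boto3

# Incident Manager API client (service name: ssm-incidents).
client = boto3.client("ssm-incidents")

# Step 1: create a response plan, a reusable template for incidents.
plan = client.create_response_plan(
    name="file-transform-failures",  # hypothetical plan name
    displayName="File transform failures",
    incidentTemplate={
        "title": "Uploaded files not transformed",
        "impact": 3,  # 1 = critical ... 5 = no impact
    },
)

# Step 2: manually open an incident from that response plan.
incident = client.start_incident(
    responsePlanArn=plan["arn"],
    title="Transformer Lambda errors after production deployment",
)
print(incident["incidentRecordArn"])

# Step 3, the post-incident analysis, is created from the Incident
# Manager console after the incident is resolved.
```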

For the purpose of this exercise, you can assume all the data in the Correction of Errors document below was gathered from an incident. In a live scenario, you would be fully aware of the details of the incident. Because this is a fictional scenario, we have included the incident details as they were understood before the Correction of Errors process began:

The team tested the application. Everything seemed to work well, and they decided to push the app into production. Fifty minutes later, they received a call from the service team reporting that multiple customers were calling the contact center about problems with the application. The team checked the dashboard, and everything appeared to be working well. Troubleshooting began, the root cause was identified, and service was restored. Once the event was resolved and the workload returned to normal operations, the team followed the Correction of Errors process and created the document below.

Summary: The first section in the COE document is the Summary. Since the team has not compiled all the data yet, they are not ready to summarize the event. This section will be skipped for now and written at the end, when they have the necessary details.

Impact: The Impact section is a concise paragraph describing the impact of the event on your customers.

 

Example impact

Over 10,000 files that were processed were not successfully transformed. The customers received a message that their uploads were successful, but the data was never reflected in the application. The event started at 9:38:18 am (GMT-5) and was resolved at 11:38:24 am (GMT-5).


Figure 1: Impact section, within the overview tab, of Incident Manager

Timeline: We recommend that the timeline be a bulleted list that walks through what happened and when. It is important to compile a detailed timeline; it helps the author and the reviewers understand how the incident was managed.

 

Example timeline

    • 9:00:00 am (GMT-5) 5/1/2023 – Application pushed to production
    • 9:25:00 am (GMT-5) 5/1/2023 – Engineers verify that the dashboard metrics are as expected
    • 9:38:18 am (GMT-5) 5/1/2023 – Transformer Lambda errors increase
    • 9:40:00 am (GMT-5) 5/1/2023 – Call center customer complaints surge
    • 9:45:00 am (GMT-5) 5/1/2023 – Call center notifies the service team of customer complaints
    • 9:47:00 am (GMT-5) 5/1/2023 – Engineers review dashboards; metrics appear acceptable
    • 9:53:00 am (GMT-5) 5/1/2023 – Engineers broaden the search to all logs
    • 10:25:00 am (GMT-5) 5/1/2023 – Engineers notice an increased error rate in the Transformer Lambda logs
    • 10:45:00 am (GMT-5) 5/1/2023 – Engineers deploy a patch to the test environment
    • 10:55:00 am (GMT-5) 5/1/2023 – Test environment passes acceptance testing
    • 10:59:00 am (GMT-5) 5/1/2023 – Engineers deploy the patch to production
    • 11:25:00 am (GMT-5) 5/1/2023 – System starts to show recovery
    • 11:38:24 am (GMT-5) 5/1/2023 – System recovery is complete

Figure 2: Timeline tab of Incident Manager

 

Metrics: Next, provide the data that shows what happened and when; a sketch for publishing these metrics follows Figure 3.

Example metrics

    • Original dashboard metrics
    • Queue depth of documents to be processed
    • Percentage of documents to be processed

Figure 3: Amazon CloudWatch dashboards showing application metrics
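One way to emit these values is as custom Amazon CloudWatch metrics. The sketch below publishes the two dashboard metrics with boto3; the namespace, metric names, and sample values are hypothetical placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish the two dashboard values as custom CloudWatch metrics.
# Namespace, metric names, and values are hypothetical placeholders.
cloudwatch.put_metric_data(
    Namespace="FileProcessing",
    MetricData=[
        {"MetricName": "QueueDepth", "Value": 1200, "Unit": "Count"},
        {"MetricName": "PercentProcessed", "Value": 87.5, "Unit": "Percent"},
    ],
)
```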

 

Incident Questions: These questions help you understand the incident better and find opportunities to detect, diagnose, and resolve similar incidents in less time.

Example incident questions

    • Detection
      • When did you learn there was customer impact?
        • 9:45:00 am (GMT-5)
      • How did you learn there was customer impact?
        • The call center called the service team to report customer complaints
      • How can you reduce the time-to-detect by half?
        • Instrument a metric, and an associated alarm, that identifies failures (see the alarm sketch after Figure 4)
    • Diagnosis
      • What was the underlying cause of the customer impact?
        • Poor data validation and error handling in the application
      • Was an internal activity happening during the incident? (for example, a maintenance window)
        • Yes, a new application was deployed to production
      • How can you reduce the time-to-diagnose by half?
        • Instrument analytics correlating change log events with historical metrics
    • Mitigation
      • When did customer impact return to pre-incident levels?
        • 11:38:24 am (GMT-5)
      • How does the system owner know that the system is properly restored?
        • Through newly created metrics and improved analytics
      • How did you determine where and how to mitigate the problem?
        • By checking all the logs
      • How can you reduce the time-to-mitigate by half?
        • Implement runbooks for known mitigations

Figure 4: Incident detection section, within the Questions tab, of Incident Manager
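To illustrate the time-to-detect answer above, here is a minimal sketch, using boto3, of a CloudWatch alarm on the Transformer Lambda's Errors metric. The function name, threshold, evaluation window, and SNS topic ARN are hypothetical placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the Transformer Lambda reports errors in 3 consecutive
# one-minute periods. Names, threshold, and topic ARN are hypothetical.
cloudwatch.put_metric_alarm(
    AlarmName="transformer-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "transformer"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```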

 

5 (or more) Whys: Asking “why” multiple times clarifies the root cause of the incident and defines areas for improvement.

Example 5 Whys

    • Why did the application fail?
      • Because the uploaded files never showed up in the application.
    • Why didn’t the uploaded files show up in the application?
      • Because the Transformer Lambda didn’t update the database.
    • Why didn’t the Transformer Lambda update the database?
      • Because it threw an execution error.
    • Why did the Transformer Lambda throw an execution error?
      • Because there was an invalid data type.
    • Why didn’t the application gracefully handle invalid data types?
      • Because there were no error handlers within the application (see the handler sketch below).
    • Why didn’t the application development team add error handlers?
      • Because there was no development policy requiring error handling code.

Figure 5: Prevention section, within the Questions tab, of Incident Manager

(Note: After asking and answering “why” six times, the root cause was determined and multiple areas of improvement were identified.)
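To make the last two whys concrete, here is a minimal sketch of how the Transformer Lambda's handler might validate input and handle errors. The event schema, field name, and transform step are hypothetical, not the team's actual code.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    """Transform one uploaded record; schema and field names are hypothetical."""
    try:
        record = json.loads(event["body"])
        # Validate the data type before transforming, so a bad upload
        # fails loudly instead of silently skipping the database update.
        if not isinstance(record.get("amount"), (int, float)):
            raise ValueError(f"invalid data type for 'amount': {record.get('amount')!r}")
        # ... transform the record and write it to the database ...
        return {"statusCode": 200}
    except (KeyError, ValueError) as exc:
        # Log the failure so the error-rate metric and alarm can catch it.
        logger.error("transform failed: %s", exc)
        raise
```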

Action Items: The action items are the main result of the COE process. The goal is to identify actionable activities that improve the prevention, diagnosis, or resolution of the same problem in the future. Each action item must state its priority, its owner, and a due date for when it will be finished.

Example action items

    • Update the runbook for this incident – Service Team, 6/30/2023 (see the runbook sketch after Figure 6)
    • Propose additional metrics – Service Team, 6/30/2023
      • Transformer Lambda error rate
    • Update application awareness – Service Team, 6/30/2023
      • Add error handling into the application (as sketched above)
      • Add input validation into the application
    • Update the development policy to require error handling code – Service Team, 6/30/2023

Figure 6: Action items tab of Incident Manager
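One way to act on the runbook item above is to capture the known mitigation as a Systems Manager Automation runbook, which can later be attached to the response plan. This is a sketch, not the team's actual runbook; the document name, step, and function name are hypothetical.

```python
import boto3

ssm = boto3.client("ssm")

# A minimal Automation runbook capturing the known mitigation.
# The step and function name below are hypothetical placeholders.
runbook_content = """
schemaVersion: '0.3'
description: Redeploy the patched Transformer Lambda
mainSteps:
  - name: invokeRedeploy
    action: aws:invokeLambdaFunction
    inputs:
      FunctionName: deploy-transformer-patch
"""

ssm.create_document(
    Content=runbook_content,
    Name="MitigateTransformerErrors",  # hypothetical document name
    DocumentType="Automation",
    DocumentFormat="YAML",
)
```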

Summary: This section provides the context for the entire event. Include details on who was impacted, when, where, and how. Include how long it took to discover the problem, and summarize both how you mitigated it and how you plan to prevent recurrences. Do not try to fit all of the details here; instead, provide basic information about the incident. The summary should stand alone without requiring the reader to reference other sections. Write the summary as if it were going to travel as an email update to your company’s main stakeholder (such as the CEO).

Example summary

The application was launched on 5/1/2023, and the service team validated that all metrics were as expected. The service team then received notification that customers were reporting errors: over 10,000 files that were processed were not successfully transformed, and although customers received a message that their uploads were successful, the data was never reflected in the application. The service team began broader investigations into all the application logs to determine the cause and found that the Transformer Lambda was erroring due to an invalid data type. The service team developed and applied a patch to resolve the issue, and full system recovery occurred.


Figure 7: Incident summary section, within the overview tab, of Incident Manager

Conclusion

In this post, we walked through the construction of a Correction of Errors document, one section at a time. For each section, we explained what is needed and then showed an example of what it would look like in this particular scenario. To start implementing your own COE process, we recommend using this post as a reference, along with the Incident Manager COE template.

The best way to learn and improve is through practice, so we encourage you to choose an incident and start your first Correction of Errors document.

To learn more about Incident Manager, see What Is AWS Systems Manager Incident Manager in the AWS documentation.

Juan Ossa

Juan Ossa is currently a Senior Technical Account Manager. He has worked at AWS since 2020. Juan’s focus areas are EC2-Core and Cloud Operations. As part of the AWS team, he provides advocacy and strategic technical direction and enthusiastically keeps his customers’ AWS environments operationally healthy. You can follow Juan on LinkedIn.

 

Johnny Hanley

Johnny Hanley is a Solutions Architect at AWS. He has worked at AWS since 2015 in multiple roles. Johnny’s focus areas are Security and the Well-Architected Framework. As part of the AWS Well-Architected team, he works with customers and AWS Partner Network partners of all sizes to help them build secure, high-performing, resilient, and efficient infrastructure for their applications. Follow Johnny on X @johnnyhanley or on LinkedIn.