Streamlining the Correction of Errors process using Amazon Bedrock
Generative AI can streamline the Correction of Errors process, saving time and resources. By combining large language models with the Correction of Errors process, businesses can expedite the identification and documentation of the cause of errors.
Purpose and set-up
The purpose of this blog is to showcase where generative AI can provide the greatest impact on the Correction of Errors (CoE) process; it will not create a fully automated CoE application. Generative AI, and the applications that interact with it, can use tools to automate data gathering. Although this blog will not discuss how to build those mechanisms, it will call out the possibilities.
As we look at the anatomy of the Correction of Errors document, generative AI can streamline the creation of many of the sections within the document. For the purpose of this blog, we will rely on human input to provide the facts of the event. We will use those facts, and the general knowledge of a large language model (LLM), to generate the green highlighted sections in the diagram shown in Figure 1.
The sections in green, created by generative AI, include:
- Impact Statement
- 5 Whys
- Action Items
- Summary
Although automation could be created to gather the data for the Metrics, Incident Questions, and Related Items sections, generative AI did not meaningfully improve the current processes. Therefore, the sections in blue were not included in this blog.
For the purpose of this blog, we created a variable called “facts”. This variable will be populated from human input and will contain a general description of what happened. We will leverage generative AI to turn the “facts” into the first draft of your CoE.
Generative AI introduction
Generative AI uses large language models (LLMs) to respond to natural language. This means that the LLM can understand, and reply in, the conversational style humans use. We will not go into the science of LLMs, but will touch on how to use them more efficiently. The diagram in Figure 2 shows an example process flow of a generative AI application. The user provides input to the generative AI application. The application captures the human input and packages it with instructions for the LLM to create a prompt. The application sends the prompt to the LLM, and the LLM sends a response back to the application. The application then formats the response and sends it to the user.
There are many techniques to improve the response from the LLM. Prompt engineering is an important part of effectively using generative AI. We will show examples, but the topic is too lengthy a subject to address fully within the constraints of this blog.
Prerequisites
In order to understand the steps taken, we should first explain the technical environment we are using. We are using Jupyter notebooks, set up on a laptop, to create our prompts. (The figure images in the sections that follow are taken from those Jupyter notebooks.) The Jupyter notebook makes API calls, using Boto3, to Amazon Bedrock to invoke Anthropic Claude 3 Sonnet, the foundation model we used. Amazon Bedrock is a fully managed service that provides a single API to access and use high-performing foundation models (FMs) from leading AI companies. It offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI practices.
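The notebook cells appear only as figure images in this post. As a minimal sketch (the Region is an assumption; use a Region where you have Amazon Bedrock model access), creating the Boto3 client used throughout the rest of the walkthrough might look like this:

```python
import boto3

# Client for the Bedrock Runtime API, used to invoke the foundation model.
# region_name is an assumption; pick a Region where Claude 3 Sonnet is enabled for you.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
```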
Now that we understand the technologies being used, let’s start creating the Correction of Errors document. To get started, we need the facts from the incident.
Inputs – the facts
The variable “facts” has been set to the simulated human input (Figure 3). In a real-world application, you would use the variable to capture the input directly from the user.
Timeline
The timeline of events will also be provided by human input. To simulate this, the variable “timeline” has been set to the simulated human input. In a real-world application, you would use the variable to capture the input directly from the user.
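As an illustrative sketch, the two input variables might be populated as follows. The strings below are placeholders that paraphrase the scenario; the original notebook input is shown in the figures.

```python
# Simulated human input; a real application would capture these directly from the user.
facts = (
    "After an application deployment on 5/1/2023, roughly 10,000 customer files "
    "failed to process, even though customers received successful upload confirmations."
)

timeline = (
    "9:38 am GMT-5: deployment completed and Transformer Lambda errors began. "
    "11:25 am: a patch was deployed and recovery started. "
    "11:38 am: the event was resolved."
)
```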
General Correction of Errors document prompt build up
In order to interact with the LLM as effectively as possible, we need to add instructions to the prompt. There are some general instructions that apply to many prompts. For those, we created three variables called “task_context”, “task_tone”, and “task_rules”. We set the “task_context” variable to “You will be acting as an IT Executive.” We set the “task_tone” variable to “You should maintain a professional tone.” We set the “task_rules” variable to “You must avoid blaming. You must avoid punishing. Your answer will be part of the incident report.” We will use those variables in our prompts to the LLM. The following screenshot (Figure 5) shows the prompt engineering buildup of these general instruction variables.
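As a minimal sketch, the cell shown in Figure 5 sets the variables to the values quoted above:

```python
# General prompt engineering instructions reused by every CoE section.
task_context = "You will be acting as an IT Executive."
task_tone = "You should maintain a professional tone."
task_rules = (
    "You must avoid blaming. You must avoid punishing. "
    "Your answer will be part of the incident report."
)
```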
Impact section specific task rules and prompt
We added prompt engineering instructions specific to the impact section. We created a variable “impact_task_rules”. We set the variable to “Your answer should be concise. Your answer should be a paragraph. Your answer should include the impact analysis alone. Your answer should not have more than 200 words”. The last two lines of the screenshot consolidate the general and impact section specific prompt engineering statements into one variable, “impact_prompt”, which will be used in the API call to the LLM.
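A sketch of this build-up follows. The wording of the request string and the way the statements are concatenated are assumptions; the notebook's exact code is in the screenshot.

```python
# Impact section specific rules, as quoted above.
impact_task_rules = (
    "Your answer should be concise. Your answer should be a paragraph. "
    "Your answer should include the impact analysis alone. "
    "Your answer should not have more than 200 words."
)

# Hypothetical request wording; it embeds the facts captured earlier.
impact_request = f"Write the impact statement for the following incident: {facts}"

# Consolidate the general and impact section specific statements into one prompt.
impact_prompt = f"{task_context} {task_tone} {task_rules} {impact_task_rules} {impact_request}"
```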
In Figure 7, variables “max_tokens”, “temperature”, “model_id”, and “impact_body” are set to the configurations for the LLM. In the API request body, the “text” field is set to the variable “impact_prompt” that we defined in the previous paragraph.
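The sketch below approximates the cell described in Figure 7 together with the invocation itself. The parameter values and the response parsing are assumptions; the request body follows the Anthropic Messages format that Amazon Bedrock expects for Claude 3, with the “text” field carrying our prompt.

```python
import json

# LLM configuration; the specific values are illustrative assumptions.
max_tokens = 1000
temperature = 0.5
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

# Request body in the Anthropic Messages format; "text" carries the consolidated prompt.
impact_body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": max_tokens,
    "temperature": temperature,
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": impact_prompt}]}
    ],
})

# Invoke the model (bedrock_runtime was created earlier) and extract the generated text.
response = bedrock_runtime.invoke_model(modelId=model_id, body=impact_body)
response_body = json.loads(response["body"].read())
impact_result = response_body["content"][0]["text"]
```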
Now we are ready to make the API call to Amazon Bedrock and instruct the LLM to generate the output for the Impact section. Using the API call, the following output was created:

The incident resulted in a significant business impact. Approximately 10,000 customer files were not successfully processed, despite the customers receiving confirmation messages indicating successful uploads. This discrepancy led to incomplete data in the application, potentially causing customer dissatisfaction, loss of trust, and potential revenue implications. The duration of the event, lasting around two hours, further compounded the impact. During this period, customer inquiries and support requests likely increased, straining resources and affecting overall operational efficiency. The reputational damage and potential loss of confidence from customers could have long-term consequences on customer retention and acquisition efforts. A thorough investigation is necessary to identify the root cause and implement preventative measures to mitigate similar incidents in the future.
5 Whys section specific prompt rules
Each section of the Correction of Errors document follows a similar process: section specific prompts are added to the general prompts and captured in a variable, and that variable is then used in an API call. In the interest of brevity, this blog will not show each section specific API call.
The following screenshot (Figure 9) shows the “5 Whys” section specific prompts. The variable “five_whys_task_rules” is set to “Skip the preamble”. The variable “five_whys_request” is set to an instruction for the LLM. The instruction includes the “facts” variable we previously assigned. Finally, the general and 5 Whys section specific prompt engineering statements are consolidated into one variable, “five_whys_prompt”, which will be used in the API call to the LLM.
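Sketched in the same style, with the request wording as an assumption (the notebook's exact instruction is in Figure 9):

```python
five_whys_task_rules = "Skip the preamble."

# Hypothetical request wording; it embeds the facts captured earlier.
five_whys_request = f"Perform a 5 Whys analysis to identify the root cause of this incident: {facts}"

five_whys_prompt = f"{task_context} {task_tone} {task_rules} {five_whys_task_rules} {five_whys_request}"
```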
Using an API call similar to the one described earlier, the following output was created.
Why were over 10,000 files not successfully transformed?
Because there was an issue with the Transformer Lambda function.

Why was there an issue with the Transformer Lambda function?
Because there was an error or bug in the code that caused the Lambda function to fail when processing certain files.

Why did the error or bug in the code cause the Lambda function to fail?
Because the code was not thoroughly tested for all possible input scenarios before being pushed to production.

Why was the code not thoroughly tested for all possible input scenarios?
Because there was a lack of comprehensive test cases or test data that could have caught the issue during the testing phase.

Why was there a lack of comprehensive test cases or test data?
Because the testing process and procedures were not robust enough to identify edge cases or potential issues with different types of input data.

Root Cause: The root cause appears to be an inadequate testing process and procedures, which led to an unidentified bug or error in the Transformer Lambda function’s code. This bug caused the Lambda function to fail when processing certain types of files, resulting in over 10,000 files not being successfully transformed and the data not being reflected in the application, despite the customers receiving a successful upload message.
To prevent similar incidents in the future, it is essential to review and improve the testing processes, procedures, and test data coverage to ensure that all potential scenarios and edge cases are adequately tested before deploying code to production. Additionally, implementing robust monitoring and alerting mechanisms can help identify issues more quickly and facilitate a faster response and resolution.
Action items section specific prompt rules
Figure 10 shows the “Action items” section specific prompts. First, the variable “action_items_task_rules” is set to “Skip the preamble. You should provide a list of actions. You should focus on the application. The actions should be S.M.A.R.T. Limit the list to the 3 most impactful actions. Do not list the S.M.A.R.T explanation. Be concise. You should suggest a detailed list of subtasks for each task.” Next, we set the variable “action_items_request” to an instruction for the LLM. The instruction includes the “five_whys_result” variable captured from the 5 Whys output. Finally, the general and action items section specific prompt engineering statements are consolidated into one variable, “action_items_prompt”, which will be used in the API call to the LLM.
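A sketch of this build-up, again with the request wording as an assumption; it chains in the 5 Whys output captured from the previous API call.

```python
action_items_task_rules = (
    "Skip the preamble. You should provide a list of actions. "
    "You should focus on the application. The actions should be S.M.A.R.T. "
    "Limit the list to the 3 most impactful actions. "
    "Do not list the S.M.A.R.T explanation. Be concise. "
    "You should suggest a detailed list of subtasks for each task."
)

# Hypothetical request wording; five_whys_result holds the 5 Whys output from the LLM.
action_items_request = f"Suggest action items that address this 5 Whys analysis: {five_whys_result}"

action_items_prompt = f"{task_context} {task_tone} {task_rules} {action_items_task_rules} {action_items_request}"
```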
Using an API call similar to the one described earlier, the following output was created.

Enhance Testing and Verification Processes:
– Conduct a comprehensive review of existing test cases and test coverage
– Develop additional test cases to cover edge cases and data transformation scenarios
– Implement rigorous integration and end-to-end testing
– Automate testing processes and incorporate them into the CI/CD pipeline
– Establish a dedicated testing team or assign testing responsibilities to specific individuals
Improve Code Review and Quality Assurance Practices:
– Implement mandatory code reviews for all code changes
– Establish coding standards and best practices
– Conduct regular training sessions on coding practices and quality assurance
– Leverage static code analysis tools to identify potential issues
– Foster a culture of quality and continuous improvement
Enhance Monitoring and Observability:
– Review and improve application logging and error handling mechanisms
– Implement comprehensive monitoring and alerting systems
– Establish clear escalation procedures for incidents and issues
– Conduct regular monitoring and observability reviews
– Provide training on monitoring and observability tools and practices
Summary section specific prompt rules
Figure 11 shows the “Summary” section specific prompts. First, the variable “summary_task_rules” is set to “Be concise. Limit your answer to 200 words. Skip the preamble”. Next, the variable “summary_request” is set to an instruction for the LLM. The instruction includes the “facts” and “timeline” variables we previously assigned, as well as the “impact_result” and “five_whys_result” variables captured from the outputs of the LLM. Finally, the general and summary section specific prompt engineering statements are consolidated into one variable, “summary_prompt”, which will be used in the API call to the LLM.
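A sketch of the summary prompt build-up, with the request wording as an assumption; it combines the human input with the earlier LLM outputs.

```python
summary_task_rules = "Be concise. Limit your answer to 200 words. Skip the preamble."

# Hypothetical request wording; impact_result and five_whys_result were captured from earlier calls.
summary_request = (
    "Write a summary of this incident. "
    f"Facts: {facts} Timeline: {timeline} "
    f"Impact: {impact_result} Root cause analysis: {five_whys_result}"
)

summary_prompt = f"{task_context} {task_tone} {task_rules} {summary_task_rules} {summary_request}"
```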
Using an API call similar to the one described earlier, the following output was created.
The incident involved over 10,000 files failing to transform successfully after an application deployment, despite customers receiving confirmation of successful uploads. The event lasted 2 hours from 9:38 am to 11:38 am GMT-5 on 5/1/2023.
Engineers initially verified metrics were acceptable but later found increased Transformer Lambda errors after reviewing logs. A patch was deployed to test and production environments, with recovery starting at 11:25 am.
The root cause was insufficient integration and performance testing strategies, leading to an inadequate testing environment and data sets that did not represent production workloads and data characteristics. This allowed an issue with the deployment to go undetected before production release, resulting in the Transformer Lambda errors.
The impact was significant, with potential data loss or inaccuracies, compromised service integrity, and undermined customer trust due to the inconsistency between confirmations and actual data processing. Prompt resolution was crucial to restore data consistency and regain customer confidence.
Conclusion
In this blog, we walked through leveraging generative AI to streamline the Correction of Errors process, saving time and resources. For each section, we explained the implementation and showed the example prompts used for this particular scenario.
To start implementing generative AI with your own CoE process, we recommend using this blog as a reference and Amazon Bedrock to build generative AI applications with security, privacy, and responsible AI. While we used Claude 3 Sonnet from Anthropic, you should experiment with multiple models to find the one that works best for your company’s use case. We encourage you to start experimenting with generative AI in your Correction of Errors process today.
Contact an AWS Representative to learn how we can help accelerate your business.
Further Reading
- Why you should develop a correction of error (CoE)
- Creating a correction of errors document
- Prompt engineering techniques and best practices