Achieving Operational Excellence using automated playbook and runbook
An important aspect of operational readiness is having a well-defined process for performing activities in your workload under various scenarios, as indicated in Question 7 of the Operational Excellence pillar in the AWS Well-Architected Framework, which evaluates your workload’s readiness for operation from a process and personnel perspective.
In the case of incident response, a team needs a playbook that contains actions to guide them through the issue investigation. Additionally, for everyday actions with a known outcome, runbooks can be used. For example, when an incident occurs, an engineer can investigate by following the steps defined in a playbook to look into the relevant logs and metrics.
Once the cause is identified or a conclusion is made, they can then perform actions defined in a runbook to achieve the desired state in resolving or mitigating the situation.
These playbook and runbook activities can be automated or performed manually by engineers.
However, there are several common challenges in performing them manually:
- Hard to scale.
Humans have limited cognitive capacity. At a large scale, running these activities manually can lack consistency and become prone to errors.
For example, an engineer’s reliability in identifying error patterns across 100 lines of logs will be very different from their reliability across 10,000 or 1,000,000 lines.
- Visibility into activities is limited.
When activities are performed by following a written document, it is difficult to track what has been done and what the results are.
This makes it difficult to measure the impact and quality of the document, because records of the activities are dispersed across many locations, such as Amazon CloudWatch for logs or AWS CloudTrail for API calls.
- Difficult to ensure validity.
To ensure that playbooks and runbooks are valid and ready to be used during an incident, they need to be validated continuously.
To do this, an engineer has to dedicate time on a regular basis to test the playbook or runbook, which can be costly from an operational perspective.
Well-Architected Operational Excellence
This is why one of the design principles in the Operational Excellence pillar of the AWS Well-Architected Framework is to perform operations as code, which advocates defining your infrastructure as code as well as codifying your operational processes so that they can be performed automatically.
In the case of playbooks and runbooks, you want to be able to define the operational steps in a codified document that you can run directly to perform actions automatically.
By doing so, you unify the process definition with the actual activity, which makes it easier to validate, because you can test it as you would test your application code.
Having these activities codified also separates the human element from the process, giving you better consistency and scalability. You will also gain visibility into the activities performed by accessing the logs of your codified playbooks or runbooks. This gives you a foundation to learn from operational failures and refine operational processes reliably, which are two of the other design principles in the Operational Excellence pillar.
Now let’s take a look at how you can build codified playbooks and runbooks using AWS services.
How to build automated playbooks and runbooks on AWS
There are various tools you could use to codify automated playbooks and runbooks. On AWS, you can leverage AWS Systems Manager runbooks (previously called Automation documents). Using this capability in AWS Systems Manager, you can define the actions to perform in a document written in JSON or YAML. The document resource is created and centrally hosted in the AWS Systems Manager service, and you can then use it to orchestrate actions on your AWS resources.
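To make the document format concrete, here is a minimal sketch that assembles a runbook body as a Python dictionary and serializes it to JSON. The helper `build_runbook`, the step name `DescribeAlarm`, and the `AlarmName` parameter are hypothetical names chosen for illustration; treat the exact fields as assumptions to check against the Systems Manager documentation.

```python
import json


def build_runbook(description, steps, parameters=None):
    """Return an Automation runbook body as a JSON string (illustrative helper)."""
    document = {
        "schemaVersion": "0.3",  # schema version used by Automation runbooks
        "description": description,
        "parameters": parameters or {},
        "mainSteps": steps,
    }
    return json.dumps(document, indent=2)


# Hypothetical single-step playbook: look up the alarm that fired.
describe_alarm_step = {
    "name": "DescribeAlarm",
    "action": "aws:executeAwsApi",
    "inputs": {
        "Service": "cloudwatch",
        "Api": "DescribeAlarms",
        "AlarmNames": ["{{ AlarmName }}"],  # filled in from the runbook parameter
    },
}

runbook_json = build_runbook(
    "Example playbook: inspect the alarm that fired",
    [describe_alarm_step],
    parameters={"AlarmName": {"type": "String"}},
)
print(runbook_json)
```

A string like this could then be supplied as the content of a new document of type Automation when you create the document in Systems Manager.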
For example, in the diagram below, we have an architecture backed by several AWS services, such as Application Load Balancer, Amazon ECS, and Amazon RDS, and it is configured with an Amazon CloudWatch alarm to detect a certain failure or incident.
When the CloudWatch alarm is triggered, an engineer is notified via email. They can then perform investigation activities by running the playbook defined as an AWS Systems Manager runbook, in alignment with the “Use process for event, incident, and problem management” best practice in Operational Excellence question 10. The playbook will contain automated investigation activities, such as checking the canary monitor in CloudWatch Synthetics, querying the application logs for error patterns, or capturing configurations of the related services.
AWS Systems Manager runbook documents support several built-in actions that allow you to define the automation in various ways. One of them is to embed a Python or PowerShell script directly into the document using the ‘aws:executeScript’ action. This allows you to build your own custom logic into the playbook or runbook document, which is particularly useful if you need to perform complex logic in your code.
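As a rough sketch of what such an embedded script can look like, the handler below takes the `(events, context)` arguments that an ‘aws:executeScript’ Python handler receives and counts error patterns in log lines. The `LogLines` input key and the error pattern are assumptions made for illustration; a real playbook would more likely fetch the lines from your log store inside the handler.

```python
import re
from collections import Counter

# Hypothetical pattern for the kinds of errors we want to count.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|Exception)\b")


def script_handler(events, context):
    """Count error keywords in the log lines passed in the input payload."""
    lines = events.get("LogLines", [])  # "LogLines" is an assumed input key
    counts = Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    # The returned dictionary becomes the step's output payload.
    return {"TotalLines": len(lines), "ErrorCounts": dict(counts)}


sample_events = {
    "LogLines": [
        "INFO service started",
        "ERROR database timeout",
        "ERROR database timeout",
        "Unhandled Exception in worker",
    ]
}
result = script_handler(sample_events, None)
print(result)
```

Because the handler is a plain function, you can exercise it locally with sample input like this before embedding it in the document.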
For example, let’s say you would like to compile a list of all AWS resources that are related to a certain incident, and your starting reference is the ARN of the CloudWatch alarm. Using ‘aws:executeScript’, you can define Python logic that first locates the tag of the alarm and then queries other AWS APIs to find all related resources that have the same tag. From there, you can compile the list (click here for sample code). Alternatively, if you know that your application is defined using an infrastructure as code tool such as AWS CloudFormation, you can find the CloudFormation stack with the related tag and list the resources defined in the stack.
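The tag-based lookup described above could be sketched as follows. This is not the sample code referenced in the post; it is an illustration that assumes the alarm carries a tag with a known key, and it takes the CloudWatch and Resource Groups Tagging API clients as arguments (for example, clients created with boto3) so that the logic can be tested without AWS access. The function name is hypothetical.

```python
def find_related_resources(alarm_arn, tag_key, cloudwatch, tagging):
    """Return ARNs of resources sharing the alarm's value for tag_key."""
    # Step 1: read the alarm's tags and pick out the reference tag value.
    tags = cloudwatch.list_tags_for_resource(ResourceARN=alarm_arn)["Tags"]
    tag_value = next((t["Value"] for t in tags if t["Key"] == tag_key), None)
    if tag_value is None:
        return []  # the alarm is not tagged, so there is nothing to match on

    # Step 2: page through every resource that carries the same tag.
    arns = []
    token = ""
    while True:
        kwargs = {"TagFilters": [{"Key": tag_key, "Values": [tag_value]}]}
        if token:
            kwargs["PaginationToken"] = token
        page = tagging.get_resources(**kwargs)
        arns.extend(m["ResourceARN"] for m in page["ResourceTagMappingList"])
        token = page.get("PaginationToken", "")
        if not token:
            break

    # The alarm itself also matches the tag, so leave it out of the result.
    return [arn for arn in arns if arn != alarm_arn]
```

Inside an ‘aws:executeScript’ handler, you would pass in real clients for the `cloudwatch` and `resourcegroupstaggingapi` services and return the resulting list as the step output.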
Just as when you write code for your application, you want to avoid writing the same code multiple times. Duplicated code is wasted effort, and it also makes testing harder at a larger scale. Therefore, creating separate reusable modules when building playbook and runbook documents will give you better scalability in the long run. Whenever you see a series of actions repeated in different playbooks or runbooks, it is a good candidate to split into a separate, smaller document. To orchestrate these modular documents, you can use ‘aws:executeAutomation’, which allows you to define a step in the document that runs another document as part of the process flow.
For example, take our previous playbook that generates a list of resources related to an incident. Using the ‘aws:executeAutomation’ action, you can run that document as part of a larger playbook that takes the generated list and passes it to another document, which investigates the actual resources following a certain pattern (for example: check load balancer logs, check ECS configuration, check RDS logs). If you later want to follow a different investigation pattern, you can reuse the same playbook to generate the resource list and pair it with the new investigation playbook.
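The chaining described above might look like the following sketch, which builds the two steps of the larger playbook as Python dictionaries. The document names, step names, and parameter names are hypothetical, and the way the first step's output is referenced in the second step is an assumption to verify against the ‘aws:executeAutomation’ reference.

```python
import json


def execute_automation_step(name, document_name, runtime_parameters):
    """Build a step that runs another runbook as part of this one."""
    return {
        "name": name,
        "action": "aws:executeAutomation",
        "inputs": {
            "DocumentName": document_name,
            "RuntimeParameters": runtime_parameters,
        },
    }


main_steps = [
    # Reusable module: compile the list of resources related to the alarm.
    execute_automation_step(
        "GatherResources",
        "Playbook-Gather-Resources",  # hypothetical document name
        {"AlarmARN": "{{ AlarmARN }}"},
    ),
    # Investigation module: receives the list produced by the step above.
    execute_automation_step(
        "InvestigateResources",
        "Playbook-Investigate-Resources",  # hypothetical document name
        {"Resources": "{{ GatherResources.Output }}"},
    ),
]
print(json.dumps(main_steps, indent=2))
```

Swapping in a different investigation pattern then only means changing the document name in the second step, while the resource-gathering step stays untouched.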
An activity in your playbooks or runbooks can sometimes be as simple as invoking an API call to an AWS service. For example, you might like to send the investigation report generated by the previous playbook to an Amazon SNS topic, which delivers the report to your engineer via email. For this, you can use the ‘aws:executeAwsApi’ action to perform a direct AWS API call natively from the playbook or runbook document. The action lets you define which AWS service to call, specify which API action you would like to perform, and construct the parameter payload without writing any additional lines of code.
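As an illustration, a step that publishes the report to an SNS topic could be described like this. The helper name, the step name, and the `SNSTopicARN` and `Report` parameters are hypothetical; the step simply names the service and API and then lists the API's own parameters.

```python
import json


def sns_publish_step(name, topic_arn, subject, message):
    """Build an aws:executeAwsApi step that calls the SNS Publish API."""
    return {
        "name": name,
        "action": "aws:executeAwsApi",
        "inputs": {
            "Service": "sns",   # which AWS service to call
            "Api": "Publish",   # which API action to perform
            # Remaining keys are the parameters of the Publish API itself.
            "TopicArn": topic_arn,
            "Subject": subject,
            "Message": message,
        },
    }


send_report = sns_publish_step(
    "SendReport",
    "{{ SNSTopicARN }}",  # assumed runbook parameter
    "Investigation report",
    "{{ Report }}",       # assumed runbook parameter holding the report text
)
print(json.dumps(send_report, indent=2))
```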
When building your playbooks and runbooks, it is important to integrate them into your existing business processes. In some scenarios, this could simply mean creating a gated mechanism where you seek approval from business owners before continuing the action. For this, you can use the ‘aws:approve’ action and integrate it with an Amazon SNS topic. This action will in turn send a message to your business owner along with a URL they can use to either approve or deny the request.
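A gated step of this kind could be sketched as below. The helper name, step name, message text, and the approver role ARN are made up for illustration, and the small validation check is an addition of this sketch rather than something the service requires you to write.

```python
import json


def approval_step(name, topic_arn, message, approvers, min_approvals=1):
    """Build an aws:approve step that pauses until enough approvers respond."""
    if min_approvals > len(approvers):
        raise ValueError("min_approvals cannot exceed the number of approvers")
    return {
        "name": name,
        "action": "aws:approve",
        "onFailure": "Abort",  # stop the runbook if the request is denied
        "inputs": {
            "NotificationArn": topic_arn,  # SNS topic that notifies approvers
            "Message": message,
            "MinRequiredApprovals": min_approvals,
            "Approvers": approvers,        # IAM principals allowed to decide
        },
    }


gate = approval_step(
    "ApproveRemediation",
    "{{ SNSTopicARN }}",  # assumed runbook parameter
    "Please approve scaling up the service to mitigate the incident.",
    ["arn:aws:iam::111122223333:role/BusinessOwner"],  # hypothetical approver
)
print(json.dumps(gate, indent=2))
```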
Alternatively, in a more nuanced scenario, you may need to give business owners an opportunity to intercept the action within a certain time period. In this case, you can combine two actions: use ‘aws:executeScript’ to run a separate document asynchronously that waits for a certain period before automatically approving the request, and use the same ‘aws:approve’ action to give the business owner the opportunity to deny the request before the automatic approval kicks in.
For more detail on this example, you can follow the step-by-step instructions in the AWS Well-Architected Labs – Automating Operations with Playbooks and Runbooks.
In this post, you have learned about one of the key design principles for achieving Operational Excellence: “perform operations as code”. You can apply this principle when building your playbooks and runbooks by defining them as code using AWS Systems Manager runbooks. With this knowledge, you can now choose a playbook or runbook in your own organization and experiment with making it automated and scalable.
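The timer side of this pattern might be sketched as follows. This is a simplified illustration and not the implementation from the lab: it waits inside a single function instead of a separate document, takes the Systems Manager client as an argument, and assumes that an execution paused at the approval step reports the status `Waiting`. The function name and the comment text are hypothetical.

```python
import time


def auto_approve_after(ssm, execution_id, wait_seconds, sleep=time.sleep):
    """Approve a pending request once the interception window has passed.

    Returns True if an approval signal was sent, False if the execution was
    no longer waiting (for example, because the owner already denied it).
    """
    # Give the business owner time to deny the request first.
    sleep(wait_seconds)

    execution = ssm.get_automation_execution(AutomationExecutionId=execution_id)
    status = execution["AutomationExecution"]["AutomationExecutionStatus"]
    if status != "Waiting":
        return False  # already approved, denied, or otherwise finished

    ssm.send_automation_signal(
        AutomationExecutionId=execution_id,
        SignalType="Approve",
        Payload={"Comment": ["Automatically approved after the waiting period"]},
    )
    return True
```

Passing the `sleep` function as an argument keeps the waiting period easy to replace in tests, so the approval logic can be validated without actually waiting.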
“Use runbooks to perform procedures”, “Use playbooks to investigate issues”, and “Use process for event, incident, and problem management” are part of the Operational Excellence Pillar of the AWS Well-Architected Framework. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the AWS Well-Architected Tool to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow up questions or comments, join our growing community on AWS re:Post.