AWS Cloud Operations Blog
Use Atlassian Opsgenie with AWS Systems Manager to run the EC2Rescue tool
On-call engineers are responsible for responding to alerts, troubleshooting high-priority incidents, and taking action to remediate issues. Automation tools like AWS Systems Manager and Atlassian Opsgenie can help these engineers by reducing repetitive work so they can focus on the most important tasks. In this blog post, Merve Bolat, Associate Product Manager at Opsgenie, Atlassian, explains how Opsgenie can execute an automation document in AWS Systems Manager in response to an incoming alert.
Opsgenie and EC2Rescue tool
Opsgenie is a modern alert and incident management platform from Atlassian that centralizes alerts, reliably notifies the right people, and enables on-call responders to take rapid action. You can now integrate Opsgenie directly with AWS Systems Manager to execute automation documents without leaving the Opsgenie console or mobile app.
EC2Rescue is an AWS troubleshooting tool that you can run on your Amazon EC2 instances to resolve operating system-level issues and collect advanced logs and configuration files for further analysis. In the example in this blog post, Opsgenie will be monitoring for a Status Check Failure alert from Amazon CloudWatch, which is a sign that an EC2 instance needs attention. When this alert is received, an action policy will trigger EC2Rescue using the AWSSupport-ExecuteEC2Rescue automation document that comes standard in AWS Systems Manager. The following screenshots show Opsgenie receiving an EC2 StatusCheckFailed alert.
Configuring the EC2Rescue Action in Opsgenie
To create an automation action in Opsgenie, you need a corresponding action channel for AWS Systems Manager. In the Actions tab of your team’s dashboard in the Opsgenie console, create an AWS Systems Manager channel with your AWS account ID, AWS Region, and AWS Identity and Access Management (IAM) role. Multiple actions can share the same action channel as long as the account ID, Region, and role are compatible with the automation document.
After the action channel is created, you can add the related automation action from the Manage Actions section. Specify the name of the action, select AWS Systems Manager as the type, and choose the action channel you created in the previous step. Then, select the AWSSupport-ExecuteEC2Rescue document from the AWS Systems Manager documents (SSM Documents) section. You can pick the document from the drop-down list or type its name in the search box.
The next section lists the parameters that can be configured for the action. Opsgenie imports these parameters directly from the corresponding AWS Systems Manager automation document. Parameters that are marked as “required” are mandatory for execution. For the EC2Rescue tool, UnreachableInstanceId must be provided, whereas LogDestination, EC2RescueInstanceType, SubnetId, and AssumeRole are optional.
Note: We recommend that you provide an Amazon S3 bucket name as the LogDestination, so that diagnostic information and OS-level logs are uploaded to S3. These logs help if the AWS Systems Manager Automation document doesn’t fix the issue and manual investigation becomes necessary. The default value for SubnetId is CreateNewVPC, which creates a new VPC for the rescue operation, so take your account's VPC limit and any access restrictions into account before relying on the default.
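To make the parameter rules above concrete, here is a minimal Python sketch that assembles the parameter set for AWSSupport-ExecuteEC2Rescue in the shape Systems Manager Automation accepts (each value is a list of strings). The function name `build_ec2rescue_parameters` and the bucket name in the usage line are hypothetical, chosen only for illustration. Opsgenie collects the same values through its action form, so this is not a required step in the setup.

```python
from typing import Dict, List, Optional


def build_ec2rescue_parameters(
    unreachable_instance_id: str,
    log_destination: Optional[str] = None,
    ec2rescue_instance_type: Optional[str] = None,
    subnet_id: Optional[str] = None,
    assume_role: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Build the Parameters mapping for AWSSupport-ExecuteEC2Rescue.

    Only UnreachableInstanceId is mandatory. Optional values that are
    None or empty are left out, so the document's own defaults
    (for example SubnetId = CreateNewVPC) apply.
    """
    # EC2 instance IDs always begin with "i-"; reject anything else early.
    if not unreachable_instance_id.startswith("i-"):
        raise ValueError("UnreachableInstanceId must be an EC2 instance ID")

    parameters: Dict[str, List[str]] = {
        "UnreachableInstanceId": [unreachable_instance_id]
    }
    optional = {
        "LogDestination": log_destination,  # S3 bucket name for collected logs
        "EC2RescueInstanceType": ec2rescue_instance_type,
        "SubnetId": subnet_id,
        "AssumeRole": assume_role,
    }
    for name, value in optional.items():
        if value:
            parameters[name] = [value]
    return parameters


if __name__ == "__main__":
    # "my-rescue-logs" is a placeholder bucket name.
    print(build_ec2rescue_parameters("i-0123456789abcdef0",
                                     log_destination="my-rescue-logs"))
```

Leaving optional parameters out instead of sending empty strings keeps the document defaults in effect, which mirrors leaving a field blank in the Opsgenie form.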
To manually execute an automation action, the action must first be added to alerts using alert policies or integration rules (which you can configure in the Advanced Settings of the integration). Because Amazon CloudWatch is a convenient tool for tracking the status of EC2 instances, you can add the EC2Rescue action to the Amazon CloudWatch Events integration in Opsgenie. This way, whenever the Amazon CloudWatch Events integration creates an Amazon EC2-related alert in Opsgenie, you can execute the action from the alert itself.
To add the EC2Rescue action to the Amazon CloudWatch Events integration, switch to the Advanced Settings of the integration. In the Create Alert section, type EC2Rescue in the Actions field, then save the integration.
Configuring permissions in AWS
EC2Rescue needs specific permissions and trusted entities to perform the automation actions. You can either create a role during action channel creation by using an AWS CloudFormation template with the minimum required permission policies and trusted entities, or add an IAM role using the AWS Management Console. The name of the IAM role must start with the prefix opsgenie-automation-actions- to execute an action. If you have administrator access, you can set up the role and its trust policy yourself. Otherwise, you might need to contact your account admin to configure the necessary permissions and trusted entities. For further details, refer to the AWS Systems Manager User Guide. The following screenshot shows configuring a role in the AWS Management Console.
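As a small illustration of the naming rule, the following Python sketch checks that a role name carries the opsgenie-automation-actions- prefix and builds the role ARN from the account ID. The helper name `role_arn` and the sample values are hypothetical, and the ARN format assumes the standard `aws` partition. It does not reproduce the trust policy or the permission policies, which should come from the CloudFormation template or the AWS Systems Manager User Guide.

```python
import re

# Prefix required by Opsgenie for roles used by automation actions.
ROLE_PREFIX = "opsgenie-automation-actions-"


def role_arn(account_id: str, role_name: str) -> str:
    """Return the IAM role ARN after checking the inputs.

    Assumes the standard "aws" partition; other partitions use a
    different ARN prefix.
    """
    # AWS account IDs are exactly 12 digits.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account ID must be 12 digits")
    if not role_name.startswith(ROLE_PREFIX):
        raise ValueError(f"role name must start with {ROLE_PREFIX!r}")
    return f"arn:aws:iam::{account_id}:role/{role_name}"


if __name__ == "__main__":
    # Placeholder account ID and role name.
    print(role_arn("123456789012", "opsgenie-automation-actions-ec2rescue"))
```

Checking the prefix before creating the action channel can save a failed execution later, because a misnamed role only becomes visible when the action runs.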
Executing the EC2Rescue Action in Opsgenie
After the permissions are configured in AWS, the EC2Rescue action can be executed in the Opsgenie console. A window appears when you choose the action to execute on the related alert.
If all the required parameters have predefined values, you can choose the Execute button right away. Otherwise, you need to provide values for the parameters before executing.
After you choose Execute, the AWSSupport-ExecuteEC2Rescue automation document runs on the instance that you specified. If the alarm condition is cleared, Opsgenie can automatically resolve the alert. You can view the results of the EC2Rescue operation in the Opsgenie alert activity log and track the operation steps in the AWS Management Console using the execution ID provided.
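For readers who want to see what an execution and its execution ID look like at the API level, here is a hedged Python sketch of starting the same automation document and polling its status directly. It assumes an SSM client such as the one returned by `boto3.client("ssm")`. The function names, the polling interval, and the short list of terminal statuses are illustrative choices and are not exhaustive. This is not a description of how Opsgenie calls AWS internally.

```python
import time
from typing import Any, Dict, List

# Common final states of an automation execution (not an exhaustive list).
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def run_ec2rescue(ssm_client: Any, parameters: Dict[str, List[str]]) -> str:
    """Start AWSSupport-ExecuteEC2Rescue and return the execution ID."""
    response = ssm_client.start_automation_execution(
        DocumentName="AWSSupport-ExecuteEC2Rescue",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]


def wait_for_result(
    ssm_client: Any,
    execution_id: str,
    poll_seconds: float = 15.0,
    max_polls: int = 120,
) -> str:
    """Poll the execution until it reaches a terminal status."""
    for _ in range(max_polls):
        execution = ssm_client.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        status = execution["AutomationExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"execution {execution_id} did not finish in time")


# Usage sketch (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   ssm = boto3.client("ssm", region_name="us-east-1")
#   execution_id = run_ec2rescue(
#       ssm, {"UnreachableInstanceId": ["i-0123456789abcdef0"]})
#   print(wait_for_result(ssm, execution_id))
```

The execution ID returned by the first call is the same kind of identifier that Opsgenie reports in the alert activity log, so it can be used to look up the step-by-step progress in the AWS Management Console.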
Conclusion
EC2Rescue is just one example of how Opsgenie and AWS Systems Manager can help on-call engineers respond to alerts and resolve issues faster. By enabling alert responders to execute any automation document, the troubleshooting and remediation steps that are normally manual tasks can be automated and triggered during an incident. To try Opsgenie Actions for Incident Response with AWS Systems Manager, visit https://www.atlassian.com/software/opsgenie.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.