Troubleshooting automated pre- and post-scripts for AWS Backup

Customers can use event-driven architectures with decoupled tasks to automate and orchestrate custom scripts for backup jobs. In these architectures, troubleshooting depends on understanding failures at the component level, so that you can resolve issues as they arise and keep the automated workflow running smoothly.

In the first post in this two-part blog series, we showed how to deploy an automated solution leveraging AWS Backup, AWS Step Functions, and AWS Systems Manager to run custom scripts before or after a backup. The deployed solution lets you schedule and orchestrate backups across multiple AWS resources, manage their lifecycle, and protect them against unauthorized actions. If you haven’t deployed the solution in your target account yet, then refer to the first post to get started.

In this post, we focus on troubleshooting the Step Functions execution paths and on the components to inspect whenever an execution fails.

Understanding the Input JSON

The input follows a strict pattern, whether the state machine is started by a custom invocation, a scheduled invocation, or manually. Refer to the properties section in the README.md to see which properties you can pass in the input JSON. The current solution supports eight execution paths, selected by the properties that you specify in the input.

Everything enabled: Execute Pre Script | Stop EC2 Instance | Run Backup Job | Start EC2 Instance | Execute Post Script | Log Details in DynamoDB Table.
No Post Script: Execute Pre Script | Stop EC2 Instance | Run Backup Job | Start EC2 Instance | Log Details in DynamoDB Table.
No Pre Script: Stop EC2 Instance | Run Backup Job | Start EC2 Instance | Execute Post Script | Log Details in DynamoDB Table.
No Pre and Post Scripts: Stop EC2 Instance | Run Backup Job | Start EC2 Instance | Log Details in DynamoDB Table.
No instance stop and start; everything else enabled: Execute Pre Script | Run Backup Job | Execute Post Script | Log Details in DynamoDB Table.
No instance stop and start, no Post Script: Execute Pre Script | Run Backup Job | Log Details in DynamoDB Table.
No instance stop and start, no Pre Script: Run Backup Job | Execute Post Script | Log Details in DynamoDB Table.
No instance stop and start, no Pre and Post Scripts: Run Backup Job | Log Details in DynamoDB Table.

See this document for more information about enabling and disabling these configurations based on your use case.
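
The eight paths are the combinations of three independent choices: run a pre-script, run a post-script, and stop and start the instance around the backup. The following Python sketch makes that explicit. The three boolean flags are hypothetical stand-ins for the input properties documented in the README.md, not the real property names, and the sketch assumes that Start EC2 Instance always precedes Execute Post Script, as in the "Everything enabled" path.

```python
from itertools import product


def execution_path(pre_script: bool, post_script: bool, stop_start: bool) -> list[str]:
    """Return the ordered steps for one combination of options.

    The flags are illustrative only; the real input properties are
    described in the solution's README.md.
    """
    steps = []
    if pre_script:
        steps.append("Execute Pre Script")
    if stop_start:
        steps.append("Stop EC2 Instance")
    steps.append("Run Backup Job")  # always runs
    if stop_start:
        steps.append("Start EC2 Instance")
    if post_script:
        steps.append("Execute Post Script")
    steps.append("Log Details in DynamoDB Table")  # always runs
    return steps


def all_paths() -> list[list[str]]:
    """Enumerate every combination of the three flags."""
    return [execution_path(*flags) for flags in product([True, False], repeat=3)]
```

Calling all_paths() returns eight distinct step lists, which can be handy when you compare the graph view of a failed execution against the path you expected.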

Troubleshooting

In event-driven architectures with decoupled tasks, you must understand which components can fail and how to troubleshoot them. This section gives a brief overview of how to troubleshoot failures and where to look.

State machine

Every state machine execution has a graph view (Figure 1) in which you can visualize the workflow, with the different states color coded. If an error occurs in any step, then you can select the failed step, highlighted in red, to get additional details about it. Figure 1 shows an example of a successful step execution. The step view provides the following:

Input and output: Input sent to the step and the corresponding output. If any step fails, then the failure details will be here.
Details: You can see which resource is backing this step, in this case an AWS Lambda function, and the logs corresponding to the step.

Figure 1: Diagram showing StopEC2Instance step details in the state machine
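
The same failure details are available outside the console through the Step Functions GetExecutionHistory API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The function names are illustrative and not part of the solution. summarize_failures works on any list of history events, and fetch_failures pages through the history of one execution.

```python
FAILURE_SUFFIXES = ("Failed", "TimedOut", "Aborted")


def summarize_failures(events: list[dict]) -> list[dict]:
    """Extract error and cause from the failure events in a history."""
    failures = []
    for event in events:
        event_type = event.get("type", "")
        if not event_type.endswith(FAILURE_SUFFIXES):
            continue
        # Detail keys mirror the event type, for example
        # "TaskFailed" -> "taskFailedEventDetails".
        details_key = event_type[0].lower() + event_type[1:] + "EventDetails"
        details = event.get(details_key, {})
        failures.append(
            {
                "id": event.get("id"),
                "type": event_type,
                "error": details.get("error"),
                "cause": details.get("cause"),
            }
        )
    return failures


def fetch_failures(execution_arn: str) -> list[dict]:
    """Page through one execution's history and return its failure events."""
    import boto3  # imported lazily so summarize_failures also works offline

    client = boto3.client("stepfunctions")
    paginator = client.get_paginator("get_execution_history")
    events = []
    for page in paginator.paginate(executionArn=execution_arn):
        events.extend(page["events"])
    return summarize_failures(events)
```

Pass the execution ARN shown in the console (the same value stored as SM_Execution_ID) to fetch_failures to list the failing events in order.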

Amazon DynamoDB

Not every failure edge case is logged in the Amazon DynamoDB table; this could be a future enhancement. Here, the DynamoDB table is used primarily for housekeeping: an administrator can look back at the jobs that were run, the targets they affected, and the scripts that were executed. The table consists of four attributes.

SM_Execution_ID: The Amazon Resource Name (ARN) of the Step Functions execution.
StartTime: The start time of the execution.
EndTime: The end time of the execution.
Workflow_Overall_Status: A list of instances with the status of every operation for each one: Run Pre Script, Stop EC2 Instance, Run Backup, Start EC2 Instance, and Run Post Script.

Here is a sample Workflow_Overall_Status snippet for your reference:

[
    {
        "InstanceId": "i-0d7ac8ebf0957190a",
        "PreBackupScriptStatus": "Success",
        "StopEc2InstanceStatus": "skipped",
        "BackupJobStatus": "COMPLETED",
        "PostBackupScriptStatus": "Success",
        "StartEc2InstanceStatus": "skipped",
        "FailureMessage": ""
    }
]
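
When you review many historical runs, it can help to filter this attribute for instances where something went wrong. The following Python sketch assumes the Workflow_Overall_Status value has been exported as JSON text in the shape shown above. It also assumes that "Success", "skipped", and "COMPLETED" are the only healthy values, because those are the only ones that appear in the sample; adjust the set if your table contains others.

```python
import json

# Assumption: these values (compared case-insensitively) mean a step finished
# cleanly or was skipped on purpose. They are taken from the sample only.
OK_VALUES = {"success", "skipped", "completed"}


def find_problem_steps(workflow_status_json: str) -> dict[str, dict[str, str]]:
    """Map each instance ID to the steps whose status is not an OK value."""
    problems = {}
    for entry in json.loads(workflow_status_json):
        bad = {
            key: value
            for key, value in entry.items()
            if key.endswith("Status") and str(value).lower() not in OK_VALUES
        }
        # Report the instance if any step looks unhealthy or a message is set.
        if bad or entry.get("FailureMessage"):
            bad["FailureMessage"] = entry.get("FailureMessage", "")
            problems[entry["InstanceId"]] = bad
    return problems
```

For the sample above, the function returns an empty dictionary, because every step either succeeded or was skipped.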

Backup jobs

A backup job with a unique BackupJobId is created for every instance. If a backup job is taking too long, then you can copy this ID from the RunBackupJob step of the Step Functions execution and check the job status by using the AWS Command Line Interface (AWS CLI) or an AWS SDK. You can also check the status of the recovery point created in the backup vault that you specified in your input.
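
As one way to run both checks with the AWS SDK for Python, the following sketch calls the AWS Backup DescribeBackupJob and DescribeRecoveryPoint APIs through boto3. It assumes boto3 is installed and credentials are configured. The function names and the choice of which states count as finished are illustrative assumptions, not part of the solution.

```python
# Assumption: these job states are treated as "no longer in progress".
TERMINAL_STATES = {"COMPLETED", "FAILED", "ABORTED", "EXPIRED", "PARTIAL"}


def summarize_backup_job(job: dict) -> dict:
    """Reduce a DescribeBackupJob response to the troubleshooting fields."""
    state = job.get("State", "UNKNOWN")
    return {
        "BackupJobId": job.get("BackupJobId"),
        "State": state,
        "Finished": state in TERMINAL_STATES,
        "PercentDone": job.get("PercentDone"),
        "StatusMessage": job.get("StatusMessage"),
        "RecoveryPointArn": job.get("RecoveryPointArn"),
    }


def check_backup_job(backup_job_id: str, vault_name: str) -> dict:
    """Look up a backup job and, if present, its recovery point status."""
    import boto3  # imported lazily so summarize_backup_job works offline

    client = boto3.client("backup")
    summary = summarize_backup_job(client.describe_backup_job(BackupJobId=backup_job_id))
    if summary["RecoveryPointArn"]:
        point = client.describe_recovery_point(
            BackupVaultName=vault_name,
            RecoveryPointArn=summary["RecoveryPointArn"],
        )
        summary["RecoveryPointStatus"] = point.get("Status")
    return summary
```

A job that stays in a non-terminal state with no change in PercentDone is a reasonable starting point for deeper investigation, and StatusMessage usually carries the reason when a job fails.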

For more information on troubleshooting backup jobs, refer to the troubleshooting section of the AWS Backup documentation.

AWS Systems Manager

This solution uses Run Command, a capability of Systems Manager, to execute scripts on the target EC2 instances. To troubleshoot a Run Command failure, copy the CommandId from the input and output section of the state machine step and look up its details in the Systems Manager console. You can also find the execution logs in the output location that you provided for Run Command in the input properties “OutputS3BucketName” and “OutputS3KeyPrefix”.
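
The command details can also be retrieved programmatically with the Systems Manager GetCommandInvocation API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The function names are illustrative, and the summary keeps only the response fields that are typically useful when a script fails.

```python
def summarize_invocation(invocation: dict) -> dict:
    """Reduce a GetCommandInvocation response to the troubleshooting fields."""
    status = invocation.get("Status")
    return {
        "InstanceId": invocation.get("InstanceId"),
        "Status": status,
        "Succeeded": status == "Success",
        "StatusDetails": invocation.get("StatusDetails"),
        "ResponseCode": invocation.get("ResponseCode"),
        # Inline stderr is truncated by the service; the URL fields point to
        # the full logs when an S3 output location was configured.
        "StandardErrorContent": invocation.get("StandardErrorContent"),
        "StandardOutputUrl": invocation.get("StandardOutputUrl"),
        "StandardErrorUrl": invocation.get("StandardErrorUrl"),
    }


def fetch_invocation(command_id: str, instance_id: str) -> dict:
    """Fetch and summarize one command invocation on one instance."""
    import boto3  # imported lazily so summarize_invocation works offline

    client = boto3.client("ssm")
    response = client.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
    return summarize_invocation(response)
```

A non-zero ResponseCode together with StandardErrorContent is often enough to tell whether the script itself failed or whether the command never reached the instance.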

Verify the configuration as follows:

Make sure that the EC2 instances have the correct permissions and the Systems Manager agent installed, as mentioned in the prerequisites section. If either is missing, then Run Command can’t execute the script on the target.
Make sure that the tag key and value you pass in the input match the tags on the EC2 instances.
Make sure that all of the properties in the input file point to the right resources: BackupJobExecutionRoleArn, BackupVaultName, and the Amazon Simple Storage Service (Amazon S3) location where the scripts are hosted.
Verify that the right combination of inputs is being passed for the expected outcome. Refer to the properties documentation.
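
The first two checks in this list can be partly automated. The following Python sketch, assuming boto3 is installed and credentials are configured, finds the instances that carry a given tag and compares them with the instances that Systems Manager reports as online. The function names, the tag_key and tag_value parameters, and the use of PingStatus as the health signal are assumptions for illustration; they are not taken from the solution.

```python
def check_targets(tagged_ids: set[str], managed_ids: set[str]) -> dict:
    """Compare tagged instances with instances reachable by Systems Manager."""
    return {
        "matched": len(tagged_ids),
        # Tagged instances that Systems Manager cannot currently reach.
        "not_managed": sorted(tagged_ids - managed_ids),
    }


def find_unreachable_targets(tag_key: str, tag_value: str) -> dict:
    """Look up tagged instances and online managed instances, then compare."""
    import boto3  # imported lazily so check_targets works offline

    ec2 = boto3.client("ec2")
    ssm = boto3.client("ssm")

    tagged = set()
    tag_filter = [{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    for page in ec2.get_paginator("describe_instances").paginate(Filters=tag_filter):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tagged.add(instance["InstanceId"])

    managed = set()
    for page in ssm.get_paginator("describe_instance_information").paginate():
        for info in page["InstanceInformationList"]:
            if info.get("PingStatus") == "Online":
                managed.add(info["InstanceId"])

    return check_targets(tagged, managed)
```

If "matched" is 0, then the tag key and value in the input don’t match any instance. Any ID listed under "not_managed" points to a missing agent, a missing instance profile permission, or an instance that isn’t running.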

Cleaning up

To avoid incurring future charges, delete the example resources created in this solution. Refer to the cleaning up section of the first post for more details.

Conclusion

In this post, we showed how to troubleshoot the workflow that integrates AWS Backup with Amazon EC2 using Step Functions, Lambda, and Systems Manager to run scripts and actions before and after your backup job. In this serverless solution, observability is the most important aspect given the number of services that are involved in the orchestration. It is essential to understand how individual components work and employ necessary logging strategies to minimize troubleshooting time.

You can use this post as a guide if you run into issues with the Step Functions workflows. Finally, note that as you add more functionality to your Step Functions workflows, you need to build better traceability to identify bottlenecks within your orchestration. Enabling AWS X-Ray for Step Functions is a good way to do that.

For additional information on Step Functions, please refer to Best Practices for Step Functions and Getting started with event-driven architectures.

Thanks for reading! If you have any comments, feedback, or questions, leave them in the comment section.