AWS Storage Blog
Enhance logs for AWS Elastic Disaster Recovery with CloudWatch Log Insights
Operational teams play a crucial role in ensuring the readiness and reliability of a disaster recovery (DR) solution. When these teams don’t have direct access to monitor the resources and services that make up a solution, it can create significant challenges. Logs provide insights into system behaviors, performance, and potential anomalies. When operations teams don’t have access to this information, it can delay the identification and resolution of issues when they occur. This delay can ultimately compromise the effectiveness of a DR solution and put recovery objectives and business expectations at risk.
AWS Elastic Disaster Recovery streamlines and automates the process of DR, minimizing both downtime and data loss by enabling the recovery of physical, virtual, and cloud-based servers into AWS. Amazon CloudWatch is a service that monitors applications, responds to performance changes, helps optimize resource use, and provides insights into the operational health of applications. Amazon CloudWatch Logs Insights allows you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can run queries that help you respond to operational issues more efficiently.
This post provides guidance on how to improve operations for Elastic Disaster Recovery by centrally storing and analyzing replication logs from many source servers using CloudWatch Logs Insights. By adopting this approach, operational teams enhance their visibility and control of key information without needing direct access to the source servers.
Prerequisites
If you intend to follow this step-by-step guide, the following prerequisites must be met:
- Elastic Disaster Recovery service initialized in the target AWS Region.
- A source machine that has the AWS Replication Agent (the Elastic Disaster Recovery agent) and AWS Systems Manager Agent (SSM Agent) installed.
- Connectivity from the source machine to the Systems Manager and CloudWatch service endpoints in the target Region.
- An AWS Identity and Access Management (IAM) instance profile attached to the source machine, with the following AWS managed policies attached to its role:
- AmazonSSMManagedInstanceCore
- AWSElasticDisasterRecoveryEc2InstancePolicy
- CloudWatchAgentServerPolicy
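As a quick check of the policy prerequisite, the following sketch compares a list of attached policy names against the three managed policies listed above and prints any that are missing. The role name my-drs-source-role in the commented example is hypothetical, and the example assumes that the AWS CLI is configured with permission to list the policies attached to that role.

```shell
#!/usr/bin/env bash
# Sketch: print which of the three required managed policies are missing
# from a list of attached policy names (one name per line on stdin).
required_policies="AmazonSSMManagedInstanceCore
AWSElasticDisasterRecoveryEc2InstancePolicy
CloudWatchAgentServerPolicy"

missing_policies() {
  local attached policy
  attached="$(cat)"
  for policy in $required_policies; do
    # -x matches whole lines only, -F treats the name as a literal string
    if ! printf '%s\n' "$attached" | grep -qxF "$policy"; then
      printf '%s\n' "$policy"
    fi
  done
}

# Example (assumption: the instance profile role is named my-drs-source-role):
# aws iam list-attached-role-policies --role-name my-drs-source-role \
#   --query 'AttachedPolicies[].PolicyName' --output text \
#   | tr '\t' '\n' | missing_policies
```

If the function prints nothing, all three policies are present in the list it was given.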
Walkthrough
This section shows the step-by-step process of:
- Installing the CloudWatch agent
- Configuring the CloudWatch agent to send AWS Replication Agent logs from an Elastic Disaster Recovery source server to Amazon CloudWatch Logs
- Building a query using Logs Insights
1. Install CloudWatch agent
In this walkthrough, I am using an Amazon Linux 2 instance and installing from the command line using Session Manager. For more information on alternative installation approaches, refer to the CloudWatch documentation.
1.1. When the prerequisites have been met, connect to the Amazon Elastic Compute Cloud (Amazon EC2) instance where the AWS Replication Agent is installed, and enter the following command, choosing y when prompted.
sudo yum install amazon-cloudwatch-agent
If the installation is successful, then the console reports Complete!, as shown in the following figure.
Figure 1: CloudWatch Agent install complete
2. Configure CloudWatch agent
Before running the CloudWatch agent, you must create one or more CloudWatch agent configuration files. The agent configuration file is a JSON file that specifies the metrics, logs, and traces that the agent should collect, including custom metrics. For this walkthrough I have provided a configuration file for both Linux and Windows. However, the following steps assume that the source operating system is Linux-based.
Linux configuration
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/lib/aws-replication-agent/agent.log*",
            "log_group_name": "aws-replication-agent",
            "timezone": "UTC"
          }
        ]
      }
    },
    "log_stream_name": "{instance_id}",
    "force_flush_interval": 15
  }
}
Windows configuration
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "C:\\Program Files (x86)\\AWS Replication Agent\\agent.log.0",
            "log_group_name": "aws-replication-agent",
            "timezone": "UTC"
          }
        ]
      }
    },
    "log_stream_name": "{instance_id}"
  }
}
2.1. Enter the following command to create a local configuration file:
sudo touch /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/cloudwatch_agent_linux.json
2.2. Edit the configuration file using the following command and paste the appropriate configuration text from Step 2.
sudo vi /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/cloudwatch_agent_linux.json
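If you prefer not to use an interactive editor, the following sketch writes the Linux configuration from Step 2 to a file with a here-document. The function name write_cwagent_config is illustrative, and the commented example assumes that python3 is available on the source server for a basic JSON syntax check.

```shell
#!/usr/bin/env bash
# Sketch: write the Linux CloudWatch agent configuration from Step 2
# to the path given as the first argument.
write_cwagent_config() {
  local target="$1"
  cat > "$target" <<'EOF'
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/lib/aws-replication-agent/agent.log*",
            "log_group_name": "aws-replication-agent",
            "timezone": "UTC"
          }
        ]
      }
    },
    "log_stream_name": "{instance_id}",
    "force_flush_interval": 15
  }
}
EOF
}

# Example: write to a temporary file, check the JSON syntax, then copy the
# file into place as root (python3 availability is an assumption).
# tmp="$(mktemp)"
# write_cwagent_config "$tmp"
# python3 -m json.tool "$tmp" > /dev/null && sudo install -m 644 "$tmp" \
#   /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/cloudwatch_agent_linux.json
```

Writing to a temporary file first keeps a malformed configuration out of the agent directory if the syntax check fails.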
2.3. (Linux only) Add the cwagent user (the CloudWatch agent user) to the aws-replication group, and update the permissions on the agent.log.* files to give the aws-replication group read access. The chmod command runs inside a root shell so that the wildcard expands even if the current user cannot list the directory.
sudo usermod -a -G aws-replication cwagent
sudo sh -c 'chmod 640 /var/lib/aws-replication-agent/agent.log.*'
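Before starting the agent, it can help to confirm that the group membership and file permissions took effect. The following is a minimal sketch: the helper name user_in_group is illustrative, and the file name agent.log.0 in the commented example is an assumption about which rotated log file exists on your server.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the given user belongs to the given group.
user_in_group() {
  local user="$1" group="$2"
  # id -nG prints the user's group names separated by spaces
  id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qxF "$group"
}

# Example, run on the source server after step 2.3:
# user_in_group cwagent aws-replication && echo "cwagent is in aws-replication"
# sudo -u cwagent sh -c 'head -n 1 /var/lib/aws-replication-agent/agent.log.0 > /dev/null' \
#   && echo "cwagent can read the agent log"
```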
2.4. Start the CloudWatch Agent and specify the configuration file that you created.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/cloudwatch_agent_linux.json
Look for the output text Configuration validation succeeded, as shown in the following figure.
Figure 2: CloudWatch validation checks have succeeded
2.5. Finally, validate the status of the CloudWatch agent.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
Look for the output text to show the status as running and configstatus as configured, as shown in the following figure.
Figure 3: CloudWatch agent is running and configured
2.6. Go to the Amazon CloudWatch console and choose Log groups from the Logs submenu, as shown in the following figure. If the preceding steps were successful, then you should have a new log group called aws-replication-agent. Under Log streams there should be a stream named after the instance ID of the source server, which is the value that replaces the {instance_id} placeholder in the configuration. The stream contains log events that were received after the CloudWatch agent was installed and configured.
From here operational teams can access each individual source server log stream and filter the results by time and specific terms or phrases, such as Error or Warning.
Figure 4: CloudWatch log stream for an Elastic Disaster Recovery source server.
Repeat Steps 1 and 2 for each additional Elastic Disaster Recovery source server, and consider using AWS Systems Manager to deploy the preceding steps at scale.
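As one way to approach the at-scale deployment, the following sketch uses Systems Manager Run Command with the AWS-ConfigureAWSPackage document to install the CloudWatch agent on every managed node that carries a given tag. The function name and the tag key and value (drs-source, true) are hypothetical, and the sketch covers only the installation in Step 1. The configuration file and permission changes in Step 2 would still need to be distributed, for example with a separate Run Command invocation.

```shell
#!/usr/bin/env bash
# Sketch: install the CloudWatch agent on all managed nodes that have the
# given tag, using the AWS-ConfigureAWSPackage Systems Manager document.
install_cwagent_by_tag() {
  local tag_key="$1" tag_value="$2"
  aws ssm send-command \
    --document-name "AWS-ConfigureAWSPackage" \
    --targets "Key=tag:${tag_key},Values=${tag_value}" \
    --parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}' \
    --comment "Install CloudWatch agent on Elastic Disaster Recovery source servers"
}

# Example (assumption: source servers are tagged drs-source=true):
# install_cwagent_by_tag drs-source true
```

Targeting by tag rather than by instance ID means that newly tagged source servers are picked up the next time the command runs.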
3. Build a query using Logs Insights
Logs Insights lets you interactively search and analyze the log data within specific CloudWatch Logs log groups. Using its purpose-built query language, you can search log data generated by many Elastic Disaster Recovery source servers at the same time. Example use cases include identifying issues during mass agent installation and identifying agent communication errors to AWS service endpoints.
3.1. Go to the CloudWatch console and choose Logs Insights from the Logs sub menu.
3.2. Choose the aws-replication-agent log group, and then choose Run query.
CloudWatch Logs Insights automatically discovers fields for the log types present in the chosen log group, as shown in the following figure. By default, an example query is provided for you that shows the last 1000 log entries sorted by timestamp.
Figure 5: CloudWatch Logs Insights example query
After a few seconds, the query returns results that you can analyze further by opening a specific log entry or by exporting them to CSV, JSON, or XLSX, as shown in the following figure.
Figure 6: CloudWatch Logs Insights example query results
3.3. Now you can experiment with a different query. The following example query shows the 20 most recent log entries that contain the terms error or warning. The filter comes before sort and limit so that the limit applies only to matching entries.
fields @timestamp, @message | filter @message like /(?i)(error|warning)/ | sort @timestamp desc | limit 20
Upon reviewing the query output, you can observe that some records have been matched because of the presence of the word error, as shown in the following figure.
Figure 7: Logs Insights query output shows log messages with errors
By choosing the arrow next to the record number, you can expand the record details. In this example you can observe that instance i-0d5352167b56b2335 is experiencing timeouts connecting to the replication server.
Figure 8: Log record shows agent is unable to connect to replication server
In this example an issue was identified with the communication between an agent and its paired replication server. Operational teams can use this insight to troubleshoot the connectivity and resolve the issue.
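The same kind of query can also be run outside the console. The following sketch uses the AWS CLI to start a Logs Insights query over a recent time window and poll until it completes. The function name, the POLL_SECONDS variable, and the retry limit of 10 are illustrative choices, and the sketch assumes that the AWS CLI is configured with permission to query the log group. The query string places the filter before sort and limit so that the limit applies only to matching entries.

```shell
#!/usr/bin/env bash
# Sketch: run a Logs Insights query from the AWS CLI and print the results.
QUERY='fields @timestamp, @message | filter @message like /(?i)(error|warning)/ | sort @timestamp desc | limit 20'

run_insights_query() {
  local log_group="$1" minutes="$2"
  local now start query_id status attempt
  now="$(date +%s)"
  start=$(( now - minutes * 60 ))   # start and end times are epoch seconds

  query_id="$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" \
    --end-time "$now" \
    --query-string "$QUERY" \
    --query 'queryId' --output text)" || return 1

  # Queries run asynchronously, so poll until the status is Complete.
  for attempt in 1 2 3 4 5 6 7 8 9 10; do
    status="$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text)" || return 1
    if [ "$status" = "Complete" ]; then
      aws logs get-query-results --query-id "$query_id" --output json
      return 0
    fi
    sleep "${POLL_SECONDS:-2}"
  done
  echo "query $query_id did not complete in time (last status: $status)" >&2
  return 1
}

# Example: search the last 60 minutes of the aws-replication-agent log group.
# run_insights_query aws-replication-agent 60
```

To count matches per source server instead of listing individual entries, the sort and limit stages could be replaced with stats count(*) as matches by @logStream.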
Cleaning up
To avoid incurring unwanted AWS costs, delete any resources created for this walkthrough that you no longer need. These include the CloudWatch log group and the EC2 instances created for this exercise.
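For the log group, the cleanup can be sketched with the AWS CLI as follows. The function name is illustrative, and the default log group name matches the one used in the configuration files in this post. Deleting a log group permanently removes its log events, so confirm that the data is no longer needed first.

```shell
#!/usr/bin/env bash
# Sketch: delete the log group created in this walkthrough.
delete_replication_log_group() {
  local log_group="${1:-aws-replication-agent}"
  aws logs delete-log-group --log-group-name "$log_group"
}

# Example: stop the CloudWatch agent on each source server first, because a
# running agent may create the log group again when it next sends log events.
# sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a stop
# delete_replication_log_group
```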
Conclusion
In this post, I covered how to centrally store and analyze AWS Replication Agent logs from many AWS Elastic Disaster Recovery source servers using Amazon CloudWatch. By adopting this approach, operational teams can enhance their visibility and control of key information without needing direct access to the underlying source servers.
In addition to the centralized storage of replication logs, I also covered how users can use tools such as Logs Insights to query log data at scale and identify patterns. I created an example query to find log entries that contained the patterns error or warning, and used it to identify a source server that had communication issues between the agent and its replication server.
Thanks for reading this post. If you have comments or questions, then don’t hesitate to leave them in the comments section.