Microsoft Workloads on AWS

Exporting the Windows Failover Cluster log to CloudWatch

In this deep-dive blog post, we will go through a step-by-step guide on how to capture Windows Failover Cluster Event Viewer logs using Amazon CloudWatch agent and send alerts using Amazon Simple Notification Service (Amazon SNS).

Introduction

Windows Event Viewer logs are central to monitoring and troubleshooting Windows systems. However, manually reviewing these logs is time-consuming and error-prone. By using the Amazon CloudWatch agent and Amazon SNS, you can automate the capture and analysis of Event Viewer logs and receive near real-time alerts when critical system events occur.

Also take a look at Amazon CloudWatch Application Insights, which provides centralized monitoring, automated anomaly detection, and actionable insights for many enterprise workloads including SQL Server Always On Availability Groups (AOAG) and Failover Cluster instances (SQL FCI).

Solution overview

Figure 1. Solution overview

This solution includes the following steps:

  1. Create a Windows Server failover cluster (WSFC).
  2. Publish the Event Viewer WSFC logs to Amazon CloudWatch using Amazon CloudWatch agent.
  3. Create a filter pattern and an Amazon CloudWatch alarm based on the Windows failover events.
  4. Use Amazon SNS to send an email if a failover/error event occurs.

Prerequisites

Before you start, complete the following tasks:

  • Start at least two Windows Server instances using available Windows AMIs.
    Sample configuration used in this blog post includes:

  • Configure a Windows Server failover cluster between your nodes (Active/Passive).
  • Install and configure the AWS Command Line Interface (AWS CLI).
  • Create an AWS Identity and Access Management (IAM) role with permissions for Amazon CloudWatch Logs and Amazon SNS.
  • Create an Amazon SNS topic to be used with the alarm.
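
The last prerequisite can also be scripted. The following is a minimal sketch, assuming the AWS Tools for PowerShell are installed (the Cleanup section of this post relies on them as well) and that credentials and a default Region are already configured. The topic name and email address are hypothetical placeholders.

```powershell
# Assumes AWS.Tools.SimpleNotificationService (or the AWSPowerShell module) is installed
# and that default credentials and a default Region are configured.
$topicName = 'wsfc-failover-alerts'    # hypothetical topic name
$email     = 'ops-team@example.com'    # hypothetical recipient address

# New-SNSTopic returns the ARN of the topic.
$topicArn = New-SNSTopic -Name $topicName

# Subscribe the email address. The recipient must confirm the subscription
# from the confirmation email before any alarm notifications are delivered.
Connect-SNSNotification -TopicArn $topicArn -Protocol 'email' -Endpoint $email

# Keep the ARN; it is needed when the alarm is created.
$topicArn
```

Creating the topic ahead of time lets you choose Select an existing Amazon SNS topic when you configure the alarm notification later in the walkthrough.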

Walkthrough

Step 1: Install Amazon CloudWatch agent on the Windows instance

The first step is to install the Amazon CloudWatch agent on the Windows instance. The Amazon CloudWatch agent is a lightweight data collection agent that can collect logs, metrics, and custom data from Amazon Elastic Compute Cloud (Amazon EC2) instances and on-premises servers.

To install the Amazon CloudWatch agent, perform the following steps:

  1. Log in to the Windows instance with local administrator permissions.
  2. Download the Amazon CloudWatch agent installer from the AWS website using the following PowerShell command:
    Invoke-WebRequest https://s3.amazonaws.com/amazoncloudwatch-agent/windows/amd64/latest/amazon-cloudwatch-agent.msi -OutFile "$env:USERPROFILE\Desktop\amazon-cloudwatch-agent.msi"
  3. Install the agent by running the following PowerShell command, which starts an unattended setup with a reduced UI (/qr) and waits for it to finish:
    Start-Process "$env:USERPROFILE\Desktop\amazon-cloudwatch-agent.msi" -ArgumentList '/qr' -Wait
  4. Repeat steps 1-3 on all Windows Server cluster nodes.
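
Before moving on, it can help to confirm that setup completed on each node. The sketch below is one way to do that; it assumes the default installation directory used later in this post and assumes that the installer registers a Windows service named AmazonCloudWatchAgent.

```powershell
# Default installation directory used throughout this post.
$agentDir = Join-Path $Env:ProgramFiles 'Amazon\AmazonCloudWatchAgent'

# Returns True if the agent control script was installed.
Test-Path (Join-Path $agentDir 'amazon-cloudwatch-agent-ctl.ps1')

# 'AmazonCloudWatchAgent' is the assumed service name. The service is expected
# to be Stopped until the agent is configured and started in Step 2.
Get-Service -Name 'AmazonCloudWatchAgent' -ErrorAction SilentlyContinue |
    Select-Object Name, Status, StartType
```

If the service query returns nothing, the installation probably did not complete and the installer should be run again on that node.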

Step 2: Configure Amazon CloudWatch agent on the Windows instance

After the Amazon CloudWatch agent installation, the next step is to configure the Amazon CloudWatch agent on the Windows instance.

To configure the Amazon CloudWatch agent, perform the following steps:

    1. Log in to the Windows instance with local administrator permissions.
    2. Create the Amazon CloudWatch agent configuration file and specify the Windows event log names to collect. The sample below collects the System and Microsoft-Windows-FailoverClustering/Operational logs; the Microsoft-Windows-FailoverClustering/Diagnostic log can be added as another collect_list entry in the same way.
    3. Your C:\Program Files\Amazon\AmazonCloudWatchAgent\config.json file should look like this:
      {
          "logs": {
              "logs_collected": {
                  "windows_events": {
                      "collect_list": [
                          {
                              "event_format": "text",
                              "event_levels": [
                                  "WARNING",
                                  "ERROR",
                                  "CRITICAL"
                              ],
                              "event_name": "System",
                              "log_group_name": "System",
                              "log_stream_name": "{instance_id}",
                              "retention_in_days": 60
                          },
                          {
                              "event_format": "text",
                              "event_levels": [
                                  "INFORMATION",
                                  "WARNING",
                                  "ERROR",
                                  "CRITICAL"
                              ],
                              "event_name": "Microsoft-Windows-FailoverClustering/Operational",
                              "log_group_name": "Microsoft-Windows-FailoverClustering/Operational",
                              "log_stream_name": "{instance_id}",
                              "retention_in_days": 60
                          }
                      ]
                  }
              }
          }
      }
    4. Start the Amazon CloudWatch agent on the Windows instance by running the following PowerShell script with administrator permissions:
      Set-Location 'C:\Program Files\Amazon\AmazonCloudWatchAgent'
      & "C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1" -a fetch-config -m ec2 -s -c file:config.json

      You should receive an outcome similar to the one presented in Figure 2:
      Figure 2. Output of PowerShell script to start Amazon CloudWatch agent.

    5. Run the following PowerShell command to check the Amazon CloudWatch agent service status:
      & $Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a status

      You should receive an outcome similar to the one presented in Figure 3:
      Figure 3. Output of PowerShell script to check the Amazon CloudWatch agent status.
    6. Repeat steps 1-5 on all cluster nodes.
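
A malformed configuration file or a mistyped event log name is a common reason for logs not arriving. As a rough pre-flight check (a sketch, not part of the agent tooling), the following parses config.json at the path used in step 3 and confirms that each configured event log exists on the node.

```powershell
# Parse the agent configuration; ConvertFrom-Json throws if the file is not valid JSON.
$configPath = Join-Path $Env:ProgramFiles 'Amazon\AmazonCloudWatchAgent\config.json'
$config = Get-Content -Path $configPath -Raw | ConvertFrom-Json

# Report, for every collect_list entry, whether the event log exists on this node.
foreach ($entry in $config.logs.logs_collected.windows_events.collect_list) {
    $log = Get-WinEvent -ListLog $entry.event_name -ErrorAction SilentlyContinue
    [pscustomobject]@{
        EventLog = $entry.event_name
        LogGroup = $entry.log_group_name
        Levels   = $entry.event_levels -join ','
        Exists   = [bool]$log
    }
}
```

An entry with Exists set to False usually points to a typo in event_name, or to a node where the Failover Clustering feature is not installed yet.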

Once the service is started, it transmits logs to Amazon CloudWatch Logs. It may take a few minutes for Amazon CloudWatch Logs to include the initial data.

To locate the newly created log groups, navigate to the Amazon CloudWatch console in the region where your cluster is running, as shown in Figure 4:

Figure 4. Amazon CloudWatch log groups

The Microsoft-Windows-FailoverClustering/Operational log group contains one log stream for each instance ID of your cluster nodes, as shown in Figure 5:

Figure 5. Log streams for each Amazon EC2 instance id.
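
The same check can be made from PowerShell instead of the console. This is a minimal sketch that assumes the AWS Tools for PowerShell are installed and that the default Region matches the Region of your cluster nodes; the log group names come from the configuration file above.

```powershell
# Log group names as defined in config.json.
$logGroups = 'System', 'Microsoft-Windows-FailoverClustering/Operational'

foreach ($group in $logGroups) {
    # One log stream per cluster node is expected, named after the instance ID.
    Get-CWLLogStream -LogGroupName $group |
        Select-Object @{ Name = 'LogGroup'; Expression = { $group } },
                      LogStreamName, LastIngestionTime
}
```

If a node is missing from the output, check the agent status on that node and allow a few minutes for the first events to be delivered.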

Step 3: Create a filter pattern and Amazon CloudWatch alarm

You can create a filter for specific errors you want to monitor. Let’s create a filter to capture a failover event. The failover event will log the following event in the Event Viewer (FailoverClustering/Operational):

    Log Name: Microsoft-Windows-FailoverClustering/Operational
    Event ID: 1641
    Level: Information
    Description: Clustered role '<role name>' is moving from cluster node '<PreviousNodeName>' to cluster node '<DestinationNodeName>'.

Now perform a test failover on your clustered resource, as shown in Figure 6. The resulting event from Event Viewer should then appear in the Microsoft-Windows-FailoverClustering/Operational log group.

Figure 6. Performing a failover on Windows Failover Cluster resource.
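
The test failover can also be triggered from PowerShell on one of the cluster nodes. The following sketch assumes the FailoverClusters module is available (it ships with the Failover Clustering tools); the role name and node name are hypothetical and must be replaced with values from your cluster.

```powershell
# Requires the FailoverClusters module (part of the Failover Clustering tools).
Import-Module FailoverClusters

$roleName   = 'SQL Server (MSSQLSERVER)'   # hypothetical clustered role name
$targetNode = 'NODE2'                       # hypothetical destination node

# Show the current owner, move the role, then show the new owner.
Get-ClusterGroup -Name $roleName | Select-Object Name, OwnerNode, State
Move-ClusterGroup -Name $roleName -Node $targetNode
Get-ClusterGroup -Name $roleName | Select-Object Name, OwnerNode, State
```

Moving a role briefly interrupts the workload it hosts, so run this only in a test environment or a maintenance window.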

The failover event is logged in Event Viewer under Applications and Services Logs/Microsoft/Windows/FailoverClustering/Operational, as shown in Figure 7:

Figure 7. Event Viewer showing the log for failover operation.
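
To confirm locally that the failover produced event 1641 before looking for it in Amazon CloudWatch, you can query the Operational log on the node directly. This is a small sketch using the built-in Get-WinEvent cmdlet.

```powershell
# Return the five most recent failover events (ID 1641) from the Operational log.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-FailoverClustering/Operational'
    Id      = 1641
} -MaxEvents 5 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, LevelDisplayName, Message
```

If the query returns the event but it does not show up in the log group, the problem is likely on the agent side (configuration, IAM permissions, or service status) rather than in the cluster.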

  1. On the Amazon CloudWatch console, under Logs, choose Log groups, choose the Microsoft-Windows-FailoverClustering/Operational log group, and then choose Search log group, as shown in Figure 8:
    Figure 8. Selecting logs group to create an Amazon CloudWatch filter.
  2. Search for "[1641]" (including the quotation marks, so that the brackets are matched literally), which is the Event ID listed in Event Viewer after the failover. The console lists the matching failover events. To create a filter, choose Create metric filter, as shown in Figure 9:
    Figure 9. Creating a metric filter for the failover event ID.
  3. Under Create metric filter, do the following:
    1. For Filter name, enter failover.
    2. For Metric namespace, turn off Create new and choose CWAgent.
    3. For Metric name, enter failover.
    4. For Metric value, enter 1.
    5. For Unit, choose Count.
    6. Choose Create.
      Figure 10 shows your filter details:
      Figure 10. Metric filter details.
  4. Return to CloudWatch/Log groups/Microsoft-Windows-FailoverClustering/Operational. The filter you just created will be listed under Metric filters.
  5. With the failover filter in place, you can now create the alarm. Select the failover filter and choose Create alarm, as shown in Figure 11:
    Figure 11. Creating an Amazon CloudWatch alarm.
  6. In the Specify metric and conditions page, do the following:
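    1. For Metric name, confirm that failover is entered. Metric names are case-sensitive, so the name must match the metric created by the filter exactly.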
    2. For Statistic, choose Minimum.
    3. For Period, choose the time for the alarm, for example, 1 minute, as shown in Figure 12:
      Figure 12. Specifying metrics for the alarm.
    4. In the Conditions section, for Threshold type, choose Static.
    5. For Whenever failover is, choose Greater > threshold.
    6. For Than, enter 0.
    7. Choose Next, as shown in Figure 13:
      Figure 13. Specifying conditions for the alarm.
    8. In the Notification section, for Alarm state trigger, choose In alarm.
    9. For Send a notification to the following SNS topic, either:
      1. choose Select an existing Amazon SNS topic, then choose a topic;
      2. or choose Create new topic to create an Amazon SNS topic using the email address you want to receive alerts, as shown in Figure 14:
        Figure 14. Selecting the notification method for the alarm.
    10. Choose Next.
  7. In the Name and description section, enter a name and description for your alarm.
  8. Choose Next, as shown in Figure 15:
    Figure 15. Specifying the alarm name and description.
  9. In the Preview and create section, review your alarm configuration, then choose Create alarm.
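
The console steps above can be approximated in PowerShell, which is useful when the same filter and alarm are needed in several accounts or Regions. The sketch below mirrors the values chosen in the console (filter and metric name failover, namespace CWAgent, statistic Minimum, 1-minute period, threshold 0). It assumes the modular AWS Tools for PowerShell are installed; the alarm name and the topic ARN are hypothetical placeholders.

```powershell
# Assumes the modular AWS Tools for PowerShell; if you installed the monolithic
# AWSPowerShell module instead, import that module here.
Import-Module AWS.Tools.CloudWatchLogs, AWS.Tools.CloudWatch

$logGroup = 'Microsoft-Windows-FailoverClustering/Operational'
$topicArn = 'arn:aws:sns:us-east-1:111122223333:wsfc-failover-alerts'   # hypothetical ARN

# Emit a value of 1 to the CWAgent/failover metric for every matching event.
$transformation = New-Object Amazon.CloudWatchLogs.Model.MetricTransformation
$transformation.MetricNamespace = 'CWAgent'
$transformation.MetricName      = 'failover'
$transformation.MetricValue     = '1'
$transformation.Unit            = 'Count'

# The quoted pattern matches the literal term [1641] in the text-formatted events.
Write-CWLMetricFilter -LogGroupName $logGroup -FilterName 'failover' `
    -FilterPattern '"[1641]"' -MetricTransformation $transformation

# Alarm when the metric is greater than 0 within a single 1-minute period.
Write-CWMetricAlarm -AlarmName 'wsfc-failover' `
    -AlarmDescription 'A clustered role moved between WSFC nodes' `
    -Namespace 'CWAgent' -MetricName 'failover' `
    -Statistic Minimum -Period 60 -EvaluationPeriod 1 `
    -Threshold 0 -ComparisonOperator GreaterThanThreshold `
    -AlarmAction $topicArn
```

Because metric names are case-sensitive, the MetricName passed to the alarm has to match the name in the metric transformation exactly.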

Once you have completed the above steps, you can test the alerting workflow to ensure that it works as expected. To test the workflow, perform a failover on your cluster and check your email inbox. You will receive an email like the example in Figure 16:

Figure 16. Notification email after a failover on WSFC role.
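
If you want to verify the notification path without moving a clustered role, one option is to set the alarm state manually. This is a sketch that assumes the AWS Tools for PowerShell are installed; the alarm name is hypothetical and should be replaced with the name you entered for your alarm.

```powershell
# 'wsfc-failover' is a hypothetical alarm name; use the name of your own alarm.
# Forcing the ALARM state triggers the alarm actions, including the SNS notification.
Set-CWAlarmState -AlarmName 'wsfc-failover' `
    -StateValue ALARM `
    -StateReason 'Manual test of the failover notification workflow'
```

The forced state is temporary: the alarm returns to its evaluated state the next time Amazon CloudWatch evaluates the metric. This confirms the alarm-to-email path only; a real failover is still needed to test the agent and the metric filter.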

Monitoring additional failover cluster events

Now that you know how to create a filter and an alarm, you can create as many filters as you need. Here is a list of suggested events to monitor for a Windows Failover Cluster environment:

| Event Viewer log | EventID | Level | Message |
| --- | --- | --- | --- |
| System | 1205 | ERROR | The Cluster service failed to bring clustered role '<Role name>' completely online or offline. |
| System | 1069 | ERROR | Cluster resource '<Resource name>' of type '<Resource type>' in clustered role '<Role name>' failed. |
| System | 1254 | ERROR | Clustered role '<Role name>' has exceeded its failover threshold. |
| System | 7034 | ERROR | The SQL Server (MSSQLSERVER) service terminated unexpectedly. |
| System | 1045 | WARNING | No matching network interface found for resource '<Resource name>' IP address '<IP address>'. |
| Microsoft-Windows-FailoverClustering/Operational | 1641 | INFORMATION | Clustered role '<Role name>' is moving from cluster node '<Node name>' to cluster node '<Node name>'. |
| Microsoft-Windows-FailoverClustering/Operational | 1204 | INFORMATION | The Cluster service successfully brought the clustered role '<Role name>' offline. |
| Microsoft-Windows-FailoverClustering/Operational | 1637 | INFORMATION | Cluster resource '<Resource name>' in clustered role '<Role name>' has transitioned from state online to state ProcessingFailure. |
| Microsoft-Windows-FailoverClustering/Operational | 1674 | INFORMATION | Group '<Group name>' has transitioned from state '<Current state>' to state '<New state>'. |
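
Creating one metric filter per event in the console becomes repetitive, so a loop may be more practical. The following sketch creates filters for three of the System events listed above, using the same quoted Event ID pattern as the failover filter. It assumes the modular AWS Tools for PowerShell are installed; the filter and metric names are hypothetical.

```powershell
# Assumes the modular AWS Tools for PowerShell; if you installed the monolithic
# AWSPowerShell module instead, import that module here.
Import-Module AWS.Tools.CloudWatchLogs

# Event IDs come from the list above; the names are hypothetical.
$eventsToMonitor = @(
    @{ LogGroup = 'System'; EventId = 1205; Name = 'role-online-offline-failed' }
    @{ LogGroup = 'System'; EventId = 1069; Name = 'resource-failed' }
    @{ LogGroup = 'System'; EventId = 1254; Name = 'failover-threshold-exceeded' }
)

foreach ($item in $eventsToMonitor) {
    # Each match adds 1 to a metric named after the filter in the CWAgent namespace.
    $transformation = New-Object Amazon.CloudWatchLogs.Model.MetricTransformation
    $transformation.MetricNamespace = 'CWAgent'
    $transformation.MetricName      = $item.Name
    $transformation.MetricValue     = '1'
    $transformation.Unit            = 'Count'

    # Builds a pattern such as "[1205]" that matches the literal bracketed Event ID.
    Write-CWLMetricFilter -LogGroupName $item.LogGroup -FilterName $item.Name `
        -FilterPattern ('"[{0}]"' -f $item.EventId) -MetricTransformation $transformation
}
```

The System log receives events from many sources, so a pattern based only on the Event ID can match unrelated events that share the same ID. Review the matches in Search log group before attaching an alarm to any of these metrics.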

Cleanup

To clear the Amazon CloudWatch alarm and stop Event Viewer from streaming to Amazon CloudWatch, follow these steps:

  1. Stop streaming the Event Viewer logs (repeat this process on all cluster nodes):
    & $Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a stop
  2. Uninstall Amazon CloudWatch agent (repeat this process on all cluster nodes):
    $app = Get-WmiObject -Class Win32_Product -Filter "Name = 'Amazon CloudWatch Agent'"
    $app.Uninstall()
  3. Remove the failover metric filter. You can use the AWS Management Console or the following PowerShell command to remove the Amazon CloudWatch metric filter:
    Get-CWLMetricFilter -FilterNamePrefix "failover" | Remove-CWLMetricFilter
  4. Remove the log group streams. You can use the AWS Management Console or the following PowerShell script to remove the Amazon CloudWatch log group streams:
    Get-CWLLogGroup |
        Where-Object { $_.LogGroupName -eq "System" -or $_.LogGroupName -eq "Microsoft-Windows-FailoverClustering/Operational" } |
        ForEach-Object {
            $LogGroupName = $_.LogGroupName
            Get-CWLLogStream -LogGroupName $LogGroupName |
                ForEach-Object { Remove-CWLLogStream -LogGroupName $LogGroupName -LogStreamName $_.LogStreamName }
        }
  5. If you want, you can also delete the Amazon CloudWatch log group (make sure there are no logs other than the logs used in this blog). You can use the AWS Management Console or the following PowerShell script to remove the Amazon CloudWatch log group:
    Get-CWLLogGroup | ?{$_.LogGroupName -eq "System" -or $_.LogGroupName -eq "Microsoft-Windows-FailoverClustering/Operational" }| Remove-CWLLogGroup
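
The steps above do not remove the alarm itself or the Amazon SNS topic. The sketch below shows one way to delete them with the AWS Tools for PowerShell. The alarm name and topic ARN are hypothetical; replace them with the values you used, and delete the topic only if nothing else depends on it.

```powershell
# 'wsfc-failover' is a hypothetical name; use the alarm name you chose in Step 3.
Remove-CWAlarm -AlarmName 'wsfc-failover' -Force

# Optional: delete the SNS topic, but only if it was created solely for this walkthrough.
$topicArn = 'arn:aws:sns:us-east-1:111122223333:wsfc-failover-alerts'   # hypothetical ARN
Remove-SNSTopic -TopicArn $topicArn -Force
```

The -Force switch suppresses the confirmation prompt, so double-check the names before running these commands.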

Conclusion

In this blog post, we provided step-by-step instructions on how to capture Windows Event Viewer logs using the Amazon CloudWatch agent, create a metric based on an Event ID, and send alerts using Amazon SNS. By automating the capture and analysis of logs, and by receiving near real-time alerts when critical system events occur, you can save time and reduce the risk of human error.

