How do I troubleshoot Microsoft SQL issues on my EC2 Windows instance?

Last updated: 2021-06-25

I'm receiving errors on my Amazon Elastic Compute Cloud (Amazon EC2) instance running Microsoft SQL Server. Or, SQL Server is slow causing application logging time out errors. How do I troubleshoot SQL Server issues on my instance?

Short description

The following are common errors that might occur on your SQL Server.

  • The SQL Server instance failed over from a primary replica to a secondary.
  • Event Viewer logs event 1196 "Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason" every 15 minutes.
  • You're unable to create a SQL listener.
  • SQL Server is slow and causing applications to log timeout errors.
  • Adding a node fails with an invalid credentials error after detecting settings for a separate server on SQL setup.

Resolution

The SQL Server instance failed over from a primary replica to a secondary

Fail over often occurs with the underlying cluster group. To determine if the issue is at the cluster level, view the System Logs in the Event Viewer using the following steps:

  1. Open the Run application, and then enter eventvwr and select OK.
  2. Select Windows Logs, System. If you see events 1069, 1146, 1230, or 1135 in the logs, then begin investigating at the cluster level by generating cluster logs.
    Note: If you see event 1135, then cluster node failover occurred. For information on resolving this issue, see Troubleshooting cluster issue with Event ID 1135.

To generate and retrieve cluster logs:

  1. Generate the latest set of cluster logs by running the get-clusterlogs command in Windows PowerShell.
  2. Retrieve the cluster logs from the default location at C:\windows\cluster\reports.

If the cluster logs show that the cluster failed over due to SQL resources, review the Event Viewer application logs for SQL errors. For more information, see Mechanics and guidelines of lease, cluster, and health check timeouts for Always On availability groups.

Event Viewer logs event 1196 "Cluster network name resource 'Cluster Name' failed registration of one or more associated DNS name(s) for the following reason" every 15 minutes

By default, the Cluster Name Object (CNO) goes to DNS every 15 minutes to check if needed records still exist. If the check fails, then you receive the Cluster network name resource "Cluster Name" failed registration error.

This error often occurs when the CNO is missing permissions on the cluster DNS record for the network name in the cluster. To resolve this, grant full control to the CNO on the DNS record. By default, the cluster usually grants these permissions at the time of creation.

If you're using a static DNS record, or if you don't want the cluster to keep updating your DNS records, do the following:

  1. Navigate to ncpa.cpl, NIC card properties, Ipv4 properties, Advanced, DNS.
  2. Uncheck Register this connection's addresses in DNS.

You're unable to create a SQL listener

SQL Server is slow and causing applications to log timeout errors

Verify if your SQL server is experiencing high memory, high CPU, or a disk bottleneck:

Check for high memory consumption on your OS. Open Task Manager and then select the Performance tab. If you see high memory consumption, cap the SQL memory, if it isn't already capped. If SQL memory is already capped, access the Windows Performance Monitor (perfmon) to determine what application is using memory. You can set perfmon locally or remotely.

Check for high CPU on the OS. Open Task Manager and then select the Performance tab. Or, review AWS CloudWatch metrics. Access the Windows Performance Monitor to determine what application is using memory. High CPU is commonly caused by an outdated driver. Upgrade to the latest PV driver version. Or, upgrade the AWS NVMe drivers to the latest version.

Check for high disk consumption on the server, volume slowness, or high I/O. As with high CPU, outdated drivers might cause disk consumption. Upgrade to the latest PV driver version. Or, upgrade the AWS NVMe drivers to the latest version. If the issue persists after updating the drivers, configure perfmon to determine what is consuming disk I/O.

Adding a node fails with an invalid credentials error after detecting settings for a separate server on SQL setup

Verify if one of the following is true:

  • The SQL agent account name is blank in SQL Setup on the secondary node after detecting the second server detail.
  • The registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL2014\ConfigurationState has one or more keys with the value 2.

If one or both of the preceding are true, do the following:

1.    Open Microsoft Powershell and then run the following commands:

$ClusterName = "YourClusterName"
$FciClusterGroupName = "SQL Server (MSSQLSERVER)"
Add-ClusterResourceType -Name "SQL Server Agent" -Dll "sqagtres.dll"

Note: Modify the variable names, such as YourClusterName, in the preceding command examples to match your environment.

2.    Open the Failover Cluster Manager, and then select Roles, SQL role, Add a resource, More resources.

3.    Select New SQL Server Agent, Properties, Dependencies.

4.    In the Resource field, select SQL Server.

5.    Select the Properties tab and then enter the name of your SQL server instance in the InstanceName field. Then, enter the name of the SQL cluster in the VirtualServerName field. The new SQL Server agent starts.

6.    Make sure that you change all keys to value 1 in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL2014\ConfigurationState.

7.    Run the setup again. You can now add nodes without receiving credentials errors.