How do I use SSM Agent logs to troubleshoot issues with SSM Agent in my managed instance?
Last updated: 2021-06-08
AWS Systems Manager Agent (SSM Agent) fails to run successfully, but I don't know how to troubleshoot the issue using the SSM Agent logs. How do I access and interpret the SSM Agent log messages?
SSM Agent runs on your managed Amazon Elastic Compute Cloud (Amazon EC2) instance and processes requests from the AWS Systems Manager service. The following conditions must be met to use SSM Agent:
- SSM Agent must connect to the required service endpoints.
- SSM Agent requires AWS Identity and Access Management (IAM) permissions to make Systems Manager API calls.
SSM Agent fails to run successfully if either of these conditions isn't met.
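To identify the root cause of the SSM Agent failure, review SSM Agent logs in the following locations:
Linux
- /var/log/amazon/ssm/amazon-ssm-agent.log
- /var/log/amazon/ssm/errors.log
Windows
- %PROGRAMDATA%\Amazon\SSM\Logs\amazon-ssm-agent.log
- %PROGRAMDATA%\Amazon\SSM\Logs\errors.log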
Note: It's a best practice to configure automated updates for SSM Agent because SSM Agent is updated frequently with new capabilities.
After you review the logs and identify whether the SSM Agent issue is caused by missing endpoint connections or missing permissions, follow these troubleshooting steps:
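As a first pass over a large log file, the error signatures described in the sections that follow can be matched automatically. The following is a minimal sketch, not an official tool: the category names and the classify and summarize helpers are hypothetical, and the patterns are taken only from the sample messages in this article, so real log lines might need additional patterns.

```python
import re
from collections import Counter

# Patterns are checked in order and the first match wins. The metadata
# pattern comes first because that message also contains the generic
# "send request failed" text used for endpoint failures.
SIGNATURES = [
    ("metadata_unreachable", re.compile(r"Failed to fetch instance ID")),
    ("access_denied", re.compile(r"AccessDeniedException")),
    ("no_credentials", re.compile(r"NoCredentialProviders")),
    ("throttled", re.compile(r"ThrottlingException")),
    ("endpoint_unreachable",
     re.compile(r"send request failed|i/o timeout|Client\.Timeout exceeded")),
]


def classify(line):
    """Return the hypothetical category name for a log line, or None."""
    for category, pattern in SIGNATURES:
        if pattern.search(line):
            return category
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts
```

For example, on a Linux instance where the agent writes to its default log location, summarize(open("/var/log/amazon/ssm/amazon-ssm-agent.log", errors="replace")) returns a count per category. The path is an assumption about your installation.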
SSM Agent can't talk to the required endpoints
SSM Agent can't reach the metadata service
SSM Agent retrieves the Region information, IAM role, and instance ID from the instance metadata service. If SSM Agent can't reach the metadata service, it can't locate this information.
When SSM Agent can't reach the metadata service endpoints, you see an error message similar to the following in the SSM Agent logs:
INFO- Failed to fetch instance ID. Data from vault is empty. RequestError: send request failed caused by: Get http://169.254.169.254/latest/meta-data/instance-id
The most common cause of this error is that the instance uses a proxy for outbound internet connections, but SSM Agent isn't configured to use the proxy. Be sure to configure SSM Agent to use a proxy.
On Windows instances, the error can also be caused by a misconfigured persistent network route when you use a custom AMI to launch your instance. Verify that the route for the metadata service IP address points to the correct default gateway.
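To confirm whether the metadata service is reachable from the instance, you can request the instance ID directly. The following is a minimal sketch that assumes the token-based request flow of Instance Metadata Service Version 2 (IMDSv2). The fetch_instance_id helper is hypothetical. It uses an empty ProxyHandler so that proxy environment variables are ignored and the request goes straight to the metadata address.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def fetch_instance_id(opener=None, timeout=2.0):
    """Return the instance ID from the metadata service, or None on failure."""
    if opener is None:
        # An empty ProxyHandler ignores http_proxy/https_proxy settings,
        # so this tests the direct route to the metadata address.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        # Step 1: request a short-lived session token.
        token_request = urllib.request.Request(
            IMDS_BASE + "/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with opener.open(token_request, timeout=timeout) as response:
            token = response.read().decode()
        # Step 2: use the token to read the instance ID.
        id_request = urllib.request.Request(
            IMDS_BASE + "/meta-data/instance-id",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with opener.open(id_request, timeout=timeout) as response:
            return response.read().decode()
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
```

If fetch_instance_id() returns None when run on the instance, check the proxy configuration and the route to 169.254.169.254 as described above.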
SSM Agent can't reach Systems Manager service endpoints
SSM Agent must make outbound connections to the following Systems Manager service endpoints on port 443. SSM Agent fails if it can't make these connections.
- SSM endpoint: ssm.REGION.amazonaws.com
- EC2 messaging endpoint: ec2messages.REGION.amazonaws.com
- SSM messaging endpoint: ssmmessages.REGION.amazonaws.com
Note: SSM Agent uses the Region information retrieved by the instance metadata service to replace the REGION value in the above endpoints.
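To test these connections from the instance, you can build the three endpoint names for your Region and attempt a TCP connection to each on port 443. The following is a minimal sketch. The endpoints_for and can_connect helpers are hypothetical, and the check makes a direct connection, so it doesn't reflect a path that goes through a proxy.

```python
import socket

# Service prefixes taken from the endpoint list above.
SERVICES = ("ssm", "ec2messages", "ssmmessages")


def endpoints_for(region):
    """Build the three Systems Manager endpoint host names for a Region."""
    return [f"{service}.{region}.amazonaws.com" for service in SERVICES]


def can_connect(host, port=443, timeout=5.0):
    """Return True if a direct TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, calling can_connect(host) for each host in endpoints_for("ap-southeast-2") shows which of the three endpoints can't be reached.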
When SSM Agent can't connect with the Systems Manager endpoints, you see error messages similar to the following in the SSM Agent logs:
ERROR [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed caused by: Post https://ssm.ap-southeast-2.amazonaws.com/: dial tcp 172.31.24.65:443: i/o timeout
DEBUG [MessagingDeliveryService] RequestError: send request failed
caused by: Post https://ec2messages.ap-southeast-2.amazonaws.com/: net/http: request cancelled while waiting for connection (Client.Timeout exceeded while awaiting headers)
The following are some common reasons why SSM Agent can't connect with the Systems Manager API endpoints on port 443:
- Instance egress security group rules don't allow outgoing connections on port 443.
- Virtual private cloud (VPC) endpoint ingress and egress security group rules don't allow incoming and outgoing connections to the VPC interface endpoint on port 443.
- Route table rules aren't configured to direct traffic through an internet gateway when the instance is in a public subnet.
- Route table rules aren't configured to direct traffic through a NAT gateway or VPC endpoint when the instance is in a private subnet.
- SSM Agent isn't configured to use a proxy when route table rules direct all outgoing connections through a proxy.
SSM Agent doesn't have permissions to make the required Systems Manager API calls
SSM Agent fails to register itself as online with Systems Manager when it isn't authorized to make UpdateInstanceInformation API calls to the service.
The UpdateInstanceInformation API call is required to maintain a connection with SSM Agent so the service knows that SSM Agent is functioning as expected. SSM Agent calls the Systems Manager service in the cloud every five minutes to provide health check information.
If SSM Agent doesn't have the correct IAM permissions, you see an error message similar to the following in the SSM Agent logs.
If SSM Agent uses the incorrect IAM permissions:
ERROR [instanceID=i-XXXXX] [HealthCheck] error when calling AWS APIs. error details - AccessDeniedException: User: arn:aws:sts::XXX:assumed-role/XXX/i-XXXXXX is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-southeast-2:XXXXXXX:instance/i-XXXXXX
status code: 400, request id: XXXXXXXX-XXXX-XXXXXXX
INFO [instanceID=i-XXXX] [HealthCheck] increasing error count by 1
If SSM Agent doesn't have any IAM permissions:
ERROR [instanceID=i-XXXXXXX] [HealthCheck] error when calling AWS APIs. error details - NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2018-05-08 10:58:39 INFO [instanceID=i-XXXXXXX] [HealthCheck] increasing error count by 1
Verify that the IAM role attached to the instance contains the required permissions to allow an instance to use Systems Manager service core functionality. Or, if an instance profile role isn't already attached, attach an instance profile role and include AmazonSSMManagedInstanceCore permissions.
For more information about the required IAM permissions for Systems Manager, see About policies for a Systems Manager instance profile.
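When you review an AccessDeniedException message, it helps to pull out which role made the call and which action was denied, so that you know what to add to the role's policy. The following is a minimal sketch. The parse_access_denied helper is hypothetical, and the pattern is based only on the sample message shown in this article.

```python
import re

# Matches the "User: ... is not authorized to perform: ... on resource: ..."
# portion of the sample AccessDeniedException message.
DENIED_PATTERN = re.compile(
    r"User: (?P<principal>.+?) is not authorized to perform: "
    r"(?P<action>[\w:*]+)(?: on resource: (?P<resource>\S+))?"
)


def parse_access_denied(line):
    """Return a dict with principal, action, and resource, or None."""
    match = DENIED_PATTERN.search(line)
    if match is None:
        return None
    return match.groupdict()
```

The action field (for example, ssm:UpdateInstanceInformation) is the permission that the role named in the principal field is missing.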
Systems Manager API call throttling
API calls might get throttled if a high volume of managed instances running SSM Agent make concurrent UpdateInstanceInformation API calls.
If the UpdateInstanceInformation API call for your instance is throttled, you see error messages similar to the following in the SSM Agent logs:
INFO [HealthCheck] HealthCheck reporting agent health.
ERROR [HealthCheck] error when calling AWS APIs. error details - ThrottlingException: Rate exceeded
status code: 400, request id: XXXXX-XXXXX-XXXX
INFO [HealthCheck] increasing error count by 1
Try the following troubleshooting steps to prevent ThrottlingException errors:
- Reduce the frequency of API calls.
- Implement error retries and exponential backoffs when you make API calls.
- Stagger the intervals of API calls so that they don't all run at the same time.
- Request a throttling limit increase for UpdateInstanceInformation API calls.
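For scripts or automation of your own that call Systems Manager APIs, retries with exponential backoff and random jitter can be sketched as follows. This is a minimal illustration, not the agent's internal behavior. ThrottledError and call_with_backoff are hypothetical names, and the call argument stands in for whatever API request your code makes. The random jitter also staggers calls from many instances so that they don't all retry at the same moment.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by an API call."""


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Run call(), retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling.
            sleep(ceiling * rng())
```

The sleep and rng parameters default to the standard library functions and exist only so that the delays can be controlled in tests.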