How do I stop AWS OpsWorks Stacks from unexpectedly starting or restarting healthy instances?

Last updated: 2019-04-23

AWS OpsWorks Stacks restarts instances that it determines to be unhealthy, even if Amazon Elastic Compute Cloud (Amazon EC2) health checks are passing. How can I stop this from happening?

Short Description

Auto healing restarts unhealthy or failed instances in your stack, even if your instances pass Amazon EC2 health checks. Auto healing is enabled by default in the layer settings of your stack. The following events occur during auto healing:

  • The OpsWorks Stacks agent sends keepalives approximately every 30 seconds. If the service doesn't receive a keepalive after five minutes, then the service marks the instance as unhealthy. If auto healing is enabled on the layer of an instance that's launched by the AWS OpsWorks Stacks API, the API stops and starts the instance.
  • If the instance is backed by Amazon Elastic Block Store (Amazon EBS), then the AWS OpsWorks API stops and starts the underlying Amazon EC2 instance. If the instance is instance-store backed, the underlying Amazon EC2 instance is terminated on an instance stop. Then, the instance is recreated when OpsWorks Stacks starts the instance again. For more information, see Using Auto Healing to Replace Failed Instances.
  • If the instance is launched in Amazon EC2 and registered with an OpsWorks stack, the AWS OpsWorks API stops and restarts the instance. If a registered on-premises instance fails AWS OpsWorks health checks, then the instance is marked as connection-lost, but won't be restarted. For more information, see Managing Registered Instances.
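
The decision flow in the preceding list can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the service's actual implementation. The function names and the returned strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the description above (no keepalive for five minutes).
UNHEALTHY_AFTER = timedelta(minutes=5)


def is_unhealthy(last_keepalive, now):
    """Return True if no keepalive has arrived for more than five minutes."""
    return now - last_keepalive > UNHEALTHY_AFTER


def auto_healing_action(last_keepalive, now, auto_healing_enabled, on_premises=False):
    """Return a label (hypothetical) for what happens to the instance."""
    if not is_unhealthy(last_keepalive, now):
        return "none"
    if on_premises:
        # Registered on-premises instances are marked, but never restarted.
        return "mark-connection-lost"
    if auto_healing_enabled:
        return "stop-and-start"
    return "mark-unhealthy"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(auto_healing_action(now - timedelta(minutes=6), now, True))
```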

Resolution

Check the Amazon EC2 StopInstances API call output for signs of auto healing

1.    Open the AWS CloudTrail console.

2.    Choose Event history.

3.    For Filter, choose Event name. For more information, see Filtering CloudTrail Events.

4.    In the search box, choose StopInstances.

5.    For Filter, choose Resource name.

6.    In the search box, enter the EC2 Instance ID, and then note the timestamp.

If OpsWorks Stacks stopped the instance, then the Amazon EC2 StopInstances API shows the following output:

"invokedBy": "opsworks.amazonaws.com"

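If you export the event record as JSON, the same check can be scripted. The following is a minimal sketch that assumes the invokedBy field sits under userIdentity, as it does in the standard CloudTrail record format. The sample record is made up and heavily trimmed.

```python
import json

OPSWORKS_PRINCIPAL = "opsworks.amazonaws.com"


def stopped_by_opsworks(cloudtrail_event_json):
    """Return True if the record is an EC2 StopInstances call made by OpsWorks."""
    record = json.loads(cloudtrail_event_json)
    if record.get("eventName") != "StopInstances":
        return False
    identity = record.get("userIdentity") or {}
    return identity.get("invokedBy") == OPSWORKS_PRINCIPAL


# Hypothetical, trimmed record used only for illustration.
SAMPLE_RECORD = json.dumps({
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StopInstances",
    "userIdentity": {"invokedBy": "opsworks.amazonaws.com"},
})

if __name__ == "__main__":
    print(stopped_by_opsworks(SAMPLE_RECORD))
```
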
Check the AWS OpsWorks StopInstance API call output for signs of auto healing

1.    Open the CloudTrail console.

2.    Choose Event history.

3.    For Filter, choose Event name.

4.    In the search box, choose StopInstance.

5.    For Filter, choose Resource name.

6.    In the search box, enter the OpsWorks instance ID, and then note the timestamp.

7.    Search for the StopInstance API call at the time the instance was stopped in Amazon EC2, and then note the timestamp.

Note: If you can't find an OpsWorks StopInstance API call that matches the time of the Amazon EC2 StopInstances call, then auto healing was applied to the instance.
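
The timestamp comparison can also be done in code after you've noted the times from both searches. The sketch below assumes you have collected the event times as datetime values. The two-minute matching window is an arbitrary assumption, not a documented value.

```python
from datetime import datetime, timedelta


def unmatched_ec2_stops(ec2_stop_times, opsworks_stop_times,
                        window=timedelta(minutes=2)):
    """Return EC2 StopInstances times with no OpsWorks StopInstance call nearby.

    An EC2 stop without a matching OpsWorks StopInstance call suggests that
    auto healing, and not a user request, stopped the instance.
    """
    return [
        ec2_time
        for ec2_time in ec2_stop_times
        if not any(abs(ec2_time - ow_time) <= window
                   for ow_time in opsworks_stop_times)
    ]


if __name__ == "__main__":
    ec2 = [datetime(2019, 4, 23, 12, 0), datetime(2019, 4, 23, 14, 0)]
    opsworks = [datetime(2019, 4, 23, 11, 59, 30)]
    # Only the 14:00 stop has no user-initiated OpsWorks call next to it.
    print(unmatched_ec2_stops(ec2, opsworks))
```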

Keep the following in mind:

  • Configure OpsWorks Stacks managed instances through the AWS OpsWorks API only, and not through Amazon EC2.
  • Be sure that managed instances are in a consistent state with the OpsWorks Stacks service.
    Note: If an instance is stopped in Amazon EC2, then the instance is marked as unhealthy, because OpsWorks Stacks is expecting a signal from that instance’s agent. If auto healing is enabled, OpsWorks Stacks then tries to auto heal the instance. This can freeze the instance, because the status in OpsWorks Stacks doesn’t match the status in Amazon EC2. If this happens, run the stop-instance command with the --force flag to stop the instance.
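
If you script against the API instead of the AWS CLI, the boto3 counterpart of stop-instance with --force is the stop_instance call with Force=True. The sketch below assumes that boto3 is installed and that credentials are configured. The helper name is hypothetical, and the client is passed in so that the function is easy to test.

```python
def force_stop(opsworks_client, opsworks_instance_id):
    """Force-stop an instance whose OpsWorks and EC2 states are out of sync.

    opsworks_instance_id is the OpsWorks instance ID, not the EC2 instance ID.
    """
    opsworks_client.stop_instance(InstanceId=opsworks_instance_id, Force=True)


# Example usage (the region is an assumption; use your stack's API endpoint Region):
#   import boto3
#   force_stop(boto3.client("opsworks", region_name="us-east-1"),
#              "<opsworks-instance-id>")
```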

Create a CloudWatch rule to listen for auto healing events in your stack

1.    (Optional) To receive notifications when auto healing is applied to an instance, set up notifications.

2.    Create an SNS topic called OpsWorksAutoHealingNotifier, and then subscribe an endpoint to that topic (such as an email address or a phone number).

3.    Create a CloudWatch Events rule, and then set your SNS topic as the target.

4.    In your rule configuration, use the following event pattern to match auto healing events:

{
  "source": [
    "aws.opsworks"
  ],
  "detail": {
    "initiated_by": [
      "auto-healing"
    ]
  }
}

5.    To save your configurations, choose Create rule.
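
To sanity-check the pattern before you create the rule, you can test it locally against sample events. The matcher below is a deliberately simplified sketch that handles only exact-value matching on nested keys; the real service supports more operators. The rule name, target ID, and create_rule helper are hypothetical, and create_rule assumes a boto3 "events" client.

```python
import json

AUTO_HEALING_PATTERN = {
    "source": ["aws.opsworks"],
    "detail": {"initiated_by": ["auto-healing"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must exist and hold a listed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def create_rule(events_client, topic_arn, rule_name="OpsWorksAutoHealingRule"):
    """Create the rule and point it at the SNS topic (names are assumptions)."""
    events_client.put_rule(Name=rule_name,
                           EventPattern=json.dumps(AUTO_HEALING_PATTERN))
    events_client.put_targets(Rule=rule_name,
                              Targets=[{"Id": "sns-notifier", "Arn": topic_arn}])


if __name__ == "__main__":
    sample = {"source": "aws.opsworks", "detail": {"initiated_by": "auto-healing"}}
    print(matches(AUTO_HEALING_PATTERN, sample))
```

When you create the rule through the API and not the console, the SNS topic's access policy must also allow events.amazonaws.com to publish to the topic.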

Troubleshoot using the log files of the instance

1.    Connect to your instance. On Linux, connect using SSH and view the log files in /var/log/aws/opsworks. On Windows, connect using RDP and view the log files in C:\ProgramData\OpsWorksAgent\var\logs.

2.    Troubleshoot the following log files:

  • Check opsworks-agent.keep_alive.log for the successful and unsuccessful attempts of the agent to send the keepalive signals back to OpsWorks. For more information, see Running a Stack in a VPC.
  • Check opsworks-agent.statistics.log to see how the system is handling the CPU load and memory. You can review how much memory is used, and see if the CPU and load metrics are high.
  • Check opsworks-agent.log for reports on the overall health of the agent running on the instance, including when the agent was stopped or started.
  • Check opsworks-agent.process_command.log for reports on successful and unsuccessful commands made by the agent on the instances.

Note: These log files are retained only if the root device on the instance is backed by Amazon EBS. Instance store-backed instances are terminated on a StopInstance API call in OpsWorks, which causes logs to be lost.

3.    Check system-level logs to determine the overall health of the instance when auto healing is applied.

Note: It's possible for the logs to leave out system-level information (for example, "Out of Memory" errors) if the agent has issues caused by an auto healing event.
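
To review the end of each agent log quickly on a Linux instance, a small helper like the following can be used. It's a minimal sketch that relies only on the directory and file names listed above and makes no assumption about the log line format. The helper names are hypothetical, and reading the files might require elevated permissions.

```python
from collections import deque
from pathlib import Path

LOG_DIR = Path("/var/log/aws/opsworks")
LOG_NAMES = [
    "opsworks-agent.keep_alive.log",
    "opsworks-agent.statistics.log",
    "opsworks-agent.log",
    "opsworks-agent.process_command.log",
]


def tail(path, lines=20):
    """Return the last `lines` lines of a text file."""
    with Path(path).open(errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


def tail_agent_logs(log_dir=LOG_DIR, lines=20):
    """Map each agent log that exists in log_dir to its last lines."""
    result = {}
    for name in LOG_NAMES:
        path = Path(log_dir) / name
        if path.is_file():
            result[name] = tail(path, lines)
    return result


if __name__ == "__main__":
    for name, last_lines in tail_agent_logs().items():
        print("== " + name)
        print("\n".join(last_lines))
```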

Prevent OpsWorks Stacks from auto healing the instances that it manages

  • Be sure that your instances can reach the internet through either an internet gateway or a NAT device that's referenced in the VPC route tables.
  • Open port 443 at the instance, security group, and network ACL levels.
    Note: Changes to a VPC or an incorrectly created VPC can prevent an instance from communicating with OpsWorks Stacks over the internet.
  • Be sure that your application has enough resources (such as memory and CPU) at the instance level to function when the instance is under extra load.
    Note: It's a best practice to be prepared for extra load. For example, a lifecycle event can put unexpected load on the instance.
  • Use CloudWatch metrics and alarms to warn you if your instance has a high load of CPU, memory, or network traffic.
  • If auto healing doesn't work for your application, disable auto healing in your layer configurations.
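
The last two items can also be scripted. The sketch below assumes boto3 clients for OpsWorks and CloudWatch are passed in. The helper names, alarm name, and the 80 percent threshold over two five-minute periods are hypothetical values chosen for illustration.

```python
def disable_auto_healing(opsworks_client, layer_id):
    """Turn off auto healing for one layer."""
    opsworks_client.update_layer(LayerId=layer_id, EnableAutoHealing=False)


def create_high_cpu_alarm(cloudwatch_client, ec2_instance_id, topic_arn,
                          threshold=80.0):
    """Alarm when average CPU stays above the threshold for two 5-minute periods."""
    cloudwatch_client.put_metric_alarm(
        AlarmName="high-cpu-" + ec2_instance_id,  # hypothetical naming scheme
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": ec2_instance_id}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[topic_arn],
    )


# Example usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   disable_auto_healing(boto3.client("opsworks"), "<layer-id>")
#   create_high_cpu_alarm(boto3.client("cloudwatch"), "<ec2-instance-id>",
#                         "<sns-topic-arn>")
```

Disabling auto healing only stops the automatic restarts. It doesn't address the missed keepalives that trigger them, so review the connectivity and resource items above first.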
