AWS Cloud Operations & Migrations Blog

How to set up automatic failover for AWS OpsWorks for Chef Automate

Creating a resilient configuration management system comes with a variety of challenges. This post addresses one important piece of that challenge: failing over to a standby AWS OpsWorks for Chef Automate server when the primary server is unavailable.

With the procedure described in this post, if the main OpsWorks for Chef Automate server fails, nodes automatically use a standby OpsWorks for Chef Automate server for management. This prevents a failure from orphaning your nodes, and helps them continue to receive important cookbook updates.

Overview

This post uses a custom user data script to bootstrap your nodes. The script keeps each node aware of a main OpsWorks for Chef Automate server and any number of standby OpsWorks for Chef Automate servers. It runs when an instance starts up, configures the Chef client, and establishes the connection to your main and standby OpsWorks for Chef Automate servers. For more information, see Running Commands on Your Linux Instance at Launch.
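
For example, you can pass the script as user data when you launch a node with the AWS CLI. This is only an illustrative sketch: the file name, AMI ID, key pair, and instance profile name below are placeholders, and the instance profile is the one described in the Prerequisites section.

# Launch a node that bootstraps with the custom user data script.
# All identifiers below are placeholders for your own values.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.small \
  --key-name my-key-pair \
  --iam-instance-profile Name=myOpsWorksNodeProfile \
  --user-data file://userdata.sh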

This example uses two servers (mainserver and standbyserver), but you can expand this solution to include any number of OpsWorks for Chef Automate servers. This post covers two main scenarios:

  • Automatically registering a node to a healthy OpsWorks for Chef Automate server while booting, when one is available.
  • Automatically verifying the health of OpsWorks for Chef Automate servers on chef-client runs, so that if the mainserver fails at any point in the node’s lifecycle, it can fail over to standbyserver.

Prerequisites

For this solution to work, you need the following items:

  • Two or more OpsWorks for Chef Automate servers (this post uses mainserver and standbyserver) that you keep in sync with the same cookbooks and policies.
  • An IAM instance profile for your nodes whose role allows them to associate with those servers. At a minimum, the role needs the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "opsworks-cm:AssociateNode",
        "opsworks-cm:DescribeNodeAssociationStatus"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
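
As a rough sketch of that setup (the role, profile, and file names here are placeholders, and ec2-trust-policy.json stands for the standard EC2 trust policy, not shown), you could create the instance profile and attach the policy with the AWS CLI:

# Create a role and instance profile for the nodes and attach the
# policy above (saved locally as node-policy.json).
aws iam create-role --role-name myOpsWorksNodeRole \
  --assume-role-policy-document file://ec2-trust-policy.json
aws iam put-role-policy --role-name myOpsWorksNodeRole \
  --policy-name opsworks-cm-node-association \
  --policy-document file://node-policy.json
aws iam create-instance-profile --instance-profile-name myOpsWorksNodeProfile
aws iam add-role-to-instance-profile \
  --instance-profile-name myOpsWorksNodeProfile --role-name myOpsWorksNodeRole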

Implementation

After launching the OpsWorks for Chef Automate servers, you must launch nodes to manage. The OpsWorks for Chef Automate starter kit provides an automatic node association script for this purpose.

In this scenario, I modified the basic version of this script to cover the failover mechanisms, and I include a full copy of the modified script at the end of this post. The modified node association script first checks whether mainserver is healthy. If the server is healthy, the node associates with it. However, when mainserver is not healthy, the node associates with standbyserver.

The first change from the standard user data script is the introduction of a variable named CHEF_SERVER_NAMES to track all possible OpsWorks for Chef Automate servers. Populate CHEF_SERVER_NAMES with the server names for your environment, and populate CHEF_SERVER_ENDPOINTS with the corresponding endpoints.


# The names of your OpsWorks for Chef Automate servers
# This example uses two servers (mainserver and standbyserver)
CHEF_SERVER_NAMES=("mainserver" "standbyserver")
# The corresponding endpoints, in the same order as the names
CHEF_SERVER_ENDPOINTS=("mainserver-xxx.us-east-1.opsworks-cm.io" "standbyserver-xxx.us-east-1.opsworks-cm.io")
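
The rest of the association logic stays close to the starter kit script. As a simplified sketch only (associate_node here stands in for the starter kit's association helper and is not the exact code from the full script), the bootstrap loop registers the node with every reachable server and records the ones it could not reach:

# Simplified sketch: register with every reachable server, and record
# the servers that could not be reached so the Ruby health check
# ignores them later.
> /etc/chef/unassociated_servers.txt
for i in "${!CHEF_SERVER_NAMES[@]}"
do
  name="${CHEF_SERVER_NAMES[$i]}"
  endpoint="${CHEF_SERVER_ENDPOINTS[$i]}"
  if curl --silent --insecure --max-time 5 "https://${endpoint}" > /dev/null
  then
    associate_node "${name}"
  else
    echo "${name}" >> /etc/chef/unassociated_servers.txt
  fi
done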

A custom Ruby script evaluates the health of the servers by checking the connection to each of them on every chef-client run. The Chef client config (/etc/chef/client.rb) loads this script and uses it to select the first healthy server from the list.

This automatic failover covers the following two cases:

  • A server becoming unhealthy doesn’t orphan a node that bootstrapped to it.
  • On every chef-client run, nodes evaluate the connection to the servers in CHEF_SERVER_ENDPOINTS and connect to the first healthy server.

require 'net/https'

# Helper used by /etc/chef/client.rb to pick a reachable OpsWorks for
# Chef Automate server on every chef-client run.
class OpsWorksHealthCheck
  UNASSOCIATED_SERVERS_FILE = "/etc/chef/unassociated_servers.txt".freeze

  # A server counts as healthy if the node associated with it during
  # bootstrapping and its HTTPS endpoint currently responds.
  def self.healthy?(endpoint)
    return false if unassociated?(endpoint)

    uri = URI("https://#{endpoint}")
    Net::HTTP.start(uri.host, uri.port, use_ssl: true, open_timeout: 5,
                    read_timeout: 5, verify_mode: OpenSSL::SSL::VERIFY_NONE) do |https|
      https.get("/").is_a?(Net::HTTPSuccess)
    end
  rescue StandardError => e
    puts "Server not reachable: #{e.message}"
    false
  end

  # The user data records servers that the node could not associate with;
  # those servers are never considered healthy for this node.
  def self.unassociated?(endpoint)
    return false unless File.exist?(UNASSOCIATED_SERVERS_FILE)

    unassociated_servers = IO.readlines(UNASSOCIATED_SERVERS_FILE).map(&:chomp)
    unassociated_servers.any? { |servername| endpoint.start_with?(servername.downcase) }
  end

  # Returns the first healthy endpoint from the list (used by client.rb).
  def self.active_chef_server_endpoint(endpoints)
    active = endpoints.detect { |endpoint| healthy?(endpoint) }
    puts active
    active
  end

  # Returns every healthy endpoint from the list.
  def self.active_chef_server_endpoints(endpoints)
    endpoints.select { |endpoint| healthy?(endpoint) }
  end
end
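
The client configuration shown next loads this class with require_relative, so the user data has to place the script next to it. A minimal sketch, assuming the path /etc/chef/opsworks_health_check.rb (the full script from this post may name the file differently):

# Install the health check class where client.rb can require it.
# Paste the Ruby class shown above between the heredoc markers.
mkdir -p /etc/chef
cat > /etc/chef/opsworks_health_check.rb <<'HEALTHCHECK'
# ... OpsWorksHealthCheck class from above ...
HEALTHCHECK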

The user data then writes a custom client.rb that uses the preceding Ruby script:


# When the user data writes the Chef client.rb config, the health check
# script runs on every single chef-client run and the node uses the
# first server that is reachable.
write_chef_config() {
  (cat <<-RUBY
require_relative "opsworks_health_check"
# Run a health check every time chef-client executes
chef_server_url "https://#{OpsWorksHealthCheck.active_chef_server_endpoint(%w(${CHEF_SERVER_ENDPOINTS[@]}))}/organizations/${CHEF_ORGANIZATION}"
node_name "$NODE_NAME"
ssl_ca_file "$CHEF_CA_PATH"
RUBY
  ) > /etc/chef/client.rb # generic approach, evaluates health for all servers on each chef-client run
}
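
As a usage sketch (the exact ordering in the full user data script may differ), once the node has associated with a server and received its client key, the script writes the configuration and starts the first run:

# Generate /etc/chef/client.rb, then run chef-client against whichever
# server the health check selects as the first healthy endpoint.
write_chef_config
chef-client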

You can find a complete user data script for this resilient failover setup stored in S3.

Testing during bootstrapping

Before starting any testing, I recommend that you review the README.md file that comes with the starter kit.

For testing purposes, the following examples use a basic Policyfile.rb and cookbook pushed to the OpsWorks for Chef Automate servers. Perform these steps from each starter kit’s folder on your local workstation (a consolidated example follows the list):

  • Download and install the cookbooks listed in Policyfile.rb using chef install.
  • Push the policy opsworks-demo, as defined in Policyfile.rb, to your server using the command chef push opsworks-demo.
  • Verify the installation of your policy by running the command chef show-policy.
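
Because nodes can fail over between servers, push the same policy to every server so cookbook versions stay in sync. Here is a consolidated sketch, assuming you extracted each server's starter kit into its own folder (the folder names below are placeholders):

# Run the same push against each starter kit so mainserver and
# standbyserver serve identical cookbook versions.
for kit in mainserver-starter-kit standbyserver-starter-kit
do
  (
    cd "$kit"
    chef install Policyfile.rb
    chef push opsworks-demo Policyfile.rb
    chef show-policy
  )
done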

Test 1: When both servers are healthy

The output from /var/log/cloud-init-output.log demonstrates the behavior when a node bootstraps with the custom user data script. For more information about this log output, see Logging in the cloud-init documentation.


mainserver-xxxxxxxxxx.us-east-1.opsworks-cm.io
Starting Chef Client, version 14.11.21
Using policy 'opsworks-demo-webserver' at revision 'xxxxxxxxxxxxxxxxx'
resolving cookbooks for run list: ["chef-client::default@11.3.0 (3819072)"]
Synchronizing Cookbooks:
- chef-client (11.3.0)
- cron (6.2.1)
- logrotate (2.2.0)

Recipe: chef-client::systemd_service
* service[chef-client] action restart
- restart service service[chef-client]
Recipe: <Dynamically Defined Resource>
* service[automate-liveness-agent] action restart
- restart service service[automate-liveness-agent]

Running handlers:
Running handlers complete
Chef Client finished, 20/26 resources updated in 06 seconds
+ touch /tmp/userdata.done
+ eval
Cloud-init v. 18.2-72.amzn2.0.7 finished at Fri, 25 Oct 2019 14:44:30 +0000. Datasource DataSourceEc2.  Up 47.23 seconds

From the preceding log output, shortened here for brevity, you can see that chef-client completed successfully. The first line also shows the endpoint of the server that the node used for association (in this case, mainserver).

Test 2: When mainserver is not available

For testing purposes, you can intentionally make mainserver unhealthy by stopping the underlying EC2 instance. When a server is unhealthy, you will see “connection lost” warnings in the OpsWorks for Chef Automate console.
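
One way to stop the underlying instance is from the AWS CLI (the instance ID below is a placeholder for the EC2 instance behind mainserver):

# Simulate a mainserver failure by stopping its underlying EC2 instance.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0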

To test association when the primary server is unhealthy, you can launch a new node. You can check the user data output for this node from the cloud-init logs at /var/log/cloud-init-output.log: 

Server not reachable: execution expired
standbyserver-xxxxxxxxxx.us-east-1.opsworks-cm.io
Starting Chef Client, version 14.11.21
Using policy 'opsworks-demo-webserver' at revision 'xxxxxxxxxxxxxxxxxxx'
resolving cookbooks for run list: ["chef-client::default@11.3.0 (3819072)"]
Synchronizing Cookbooks:

You can see that the node couldn’t associate itself with mainserver because of its health status, so it associated itself with standbyserver instead.

During bootstrapping, the script records the servers that the node could not associate with in /etc/chef/unassociated_servers.txt, and the Ruby health check reads this file to skip those servers. In this test, the contents of that file are as follows:

[ec2-user@ip-172-31-17-219 chef]$ cat /etc/chef/unassociated_servers.txt
mainserver

Testing after bootstrapping using the dynamic health check Ruby function

This custom user data also sets up a mechanism so that the node automatically checks the health of its OpsWorks for Chef Automate servers every time chef-client runs.

In the following example, a node used mainserver until that server became unhealthy. Because of this, the dynamic health check script chose standbyserver automatically. Here’s what the output looks like:


[root@ip-172-31-4-45 ~]$ chef-client
Server not reachable: execution expired
standbyserver-xxxxxxxx.us-east-1.opsworks-cm.io
Starting Chef Client, version 14.11.21
resolving cookbooks for run list: []
Synchronizing Cookbooks:
Installing Cookbook Gems:
Compiling Cookbooks...

In this case, the health check reports “Server not reachable: execution expired” for mainserver, but the chef-client run still completes against standbyserver. This means that the node can get the latest version of cookbooks, despite mainserver being in an unhealthy state.

 

Conclusion

In this post, I demonstrated a solution for a resilient, fault-tolerant OpsWorks for Chef Automate architecture. By adjusting the user data, it is possible to associate the node dynamically based on server health. If the primary server becomes unhealthy, then the node associates with a secondary server. As long as all servers maintain parity in cookbook versions, this helps nodes consistently receive the most updated code, even in the event of an unreachable server. For reference, you can download the script in full.

 

About the Authors

Maggie O’Toole is a Solutions Architect for AWS in Germany, and has been with AWS since 2017. She enjoys helping customers on their cloud journey to the stars, and specializes in Containers and Configuration Management.

Ramesh Venkataraman is a Solutions Architect who supports customers using AWS CloudFormation, AWS OpsWorks, and Amazon ECS, among other AWS DevOps services. Outside of work, Ramesh enjoys following Stack Overflow questions and answering them in any way he can.