Making Application Failover Seamless by Failing Over Your Private Virtual IP Across Availability Zones
By Suney Sharma, Solutions Architect at AWS
One of the core architecture principles of building highly available applications on Amazon Web Services (AWS) is to work with a multi-Availability Zone (AZ) architecture. In the unlikely event an AZ fails, this architecture allows applications to continue running using resources in the other AZs.
Customers use different strategies to route user traffic to the components of their applications across Availability Zones. These include load balancers, Elastic IPs (EIPs), and Domain Name System (DNS) resolution.
Some customers move EIPs between AZs to redirect traffic to a standby Amazon Elastic Compute Cloud (Amazon EC2) instance. This approach works if you’re dealing with public IPs and have user traffic coming in from the internet. However, if your Amazon EC2 instances are in a private subnet, they use private IP addresses rather than EIPs, so a different mechanism is needed.
In this post, we present an approach to achieve failover of a private IP address across AZs. This private IP address acts as a virtual IP that fails over to a host running in another Availability Zone.
A Simple Application to Explain the Failover
Let’s look at the simple application architecture shown in Figure 1. We plan to demonstrate the failover of the private IP address (VIP: 10.1.5.5) that the application servers use to connect to the database.
On failover, the application servers continue connecting to the standby database node without any intervention, making the failover seamless. In our example, we have a PHP-based web application running on two webservers (web_host1 and web_host2) behind a load balancer across two Availability Zones (AZ1 and AZ2).
Figure 1 – Our sample application architecture.
The webservers run a simple query on a MySQL database running on two Amazon EC2 instances (DB_Host1 and DB_Host2).
For simplicity, both databases hold the same data. In real-world use cases, you can replicate data from one database host to the other to keep them in sync. The focus of this post is to show how the virtual IP (VIP: 10.1.5.5), configured as the database host (DB_Host) on the webservers, becomes available on DB_Host2 when we stop MySQL on DB_Host1 or shut down DB_Host1.
Let’s start by explaining the flow of traffic in our sample application. The application runs in an Amazon Virtual Private Cloud (VPC) with a CIDR range of 10.0.0.0/16. There are four subnets—two public and two private across two AZs, as indicated in Figure 1.
Users connect to the webservers via the Application Load Balancer. The webservers display the result of a query to list the names of the Major League Baseball teams along with their short names. The output of the query on the webpage looks like this:
Figure 2 – Query result.
The webservers are configured to query the database on the virtual IP (VIP: 10.1.5.5).
Figure 3 – Configuration showing DB_Host being used as the database host.
Let’s look at the IP address configured for the hostname DB_Host on the webservers.
Figure 4 – DB_Host is mapped to the VIP 10.1.5.5.
If you look carefully, our virtual IP is outside the CIDR range of the VPC, and yet we’re able to route traffic to it. This is accomplished by manually “plumbing” the VIP as a logical interface on both Amazon EC2 instances running the database.
A logical interface is a way of assigning an additional IP to an existing Ethernet interface. In our case, it’s eth0:1. Please refer to your OS documentation on how to create one.
Figure 5 – Logical interface configured on the Amazon EC2 instance.
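The claim that the VIP sits outside the VPC range can be verified with a few lines of Python. This is only an illustrative check using the standard `ipaddress` module and the two values quoted in this post (VPC CIDR 10.0.0.0/16, VIP 10.1.5.5); the helper name `is_outside_vpc` is made up for this sketch.

```python
import ipaddress

# Values taken from the post: the VPC CIDR range and the virtual IP.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
VIP = ipaddress.ip_address("10.1.5.5")


def is_outside_vpc(vip, vpc_cidr):
    """Return True when the address does not fall inside the VPC CIDR range."""
    return vip not in vpc_cidr


print(is_outside_vpc(VIP, VPC_CIDR))  # True
```

Because the address is not part of any subnet in the VPC, traffic to it is only delivered when a route table entry explicitly points at an instance that has the address configured.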
Since traffic addressed to the VIP still needs to reach the Amazon EC2 instance, we create an entry for the VIP in the routing table whose target is the instance ID of DB_Host1.
Figure 6 – Routing table entry to route traffic to the DB_Host1.
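A route like the one in Figure 6 could also be created programmatically. The sketch below assumes the boto3 EC2 client and its `create_route` call; the `/32` destination, the function name, and any route table or instance IDs you pass in are assumptions for illustration and are not taken from the post.

```python
# Assumption: the VIP is routed as a single-host (/32) destination.
VIP_CIDR = "10.1.5.5/32"


def route_vip_to_instance(ec2_client, route_table_id, instance_id, vip_cidr=VIP_CIDR):
    """Create a route that sends traffic for the VIP to the given instance.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    return ec2_client.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=vip_cidr,
        InstanceId=instance_id,
    )


# Hypothetical usage (IDs are placeholders):
#   import boto3
#   route_vip_to_instance(boto3.client("ec2"), "rtb-EXAMPLE", "i-EXAMPLE")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.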
If you look at the Amazon EC2 console, you can see the instance IDs of the two database servers.
Figure 7 – The instance_ids of the Amazon EC2 instances.
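As you can see, the routing table in Figure 6 forwards all traffic destined for the VIP to DB_Host1. An important point to note is that DB_Host1 and DB_Host2 need to have “Source/Destination Checks Disabled.” With the check enabled, an instance only accepts traffic addressed to the private IPs assigned to its network interface, so packets addressed to the VIP would be dropped.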
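The source/destination check can be turned off from the console or through the API. The following is a minimal sketch that assumes the boto3 EC2 client and its `modify_instance_attribute` call; the function name and the instance IDs are placeholders.

```python
def disable_source_dest_check(ec2_client, instance_ids):
    """Disable the source/destination check on each instance in instance_ids.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    for instance_id in instance_ids:
        ec2_client.modify_instance_attribute(
            InstanceId=instance_id,
            SourceDestCheck={"Value": False},
        )


# Hypothetical usage for both database hosts (IDs are placeholders):
#   import boto3
#   disable_source_dest_check(boto3.client("ec2"), ["i-DBHOST1", "i-DBHOST2"])
```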
To fail over the VIP to the second DB server (DB_Host2), all that’s required is to change the Target entry in the routing table to the instance ID of DB_Host2. We do this using a Lambda function that is triggered every minute by Amazon CloudWatch Events to check the availability of the current DB server.
Figure 8 – Checker_Fn is the Lambda function that checks the health of the DB_Host.
If the DB server is unavailable, or if the MySQL daemon is down, Checker_Fn invokes another Lambda function to fail over the VIP to the second server.
Figure 9 – VIP_Mover_Fn is the Lambda function that performs the failover.
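The post does not show how Checker_Fn tests the database, so the following is only one possible approach: a plain TCP probe against the MySQL port using the standard `socket` module. The function name, the default port 3306, and the timeout are assumptions, and the probe presumes the caller has network reachability to the database host.

```python
import socket


def mysql_port_is_open(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only shows that something is accepting connections on the port;
    it does not prove that MySQL can actually serve queries.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A stricter check could log in and run a trivial query, at the cost of bundling a MySQL client library with the function.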
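The core of the failover is a single route replacement. The sketch below is not the actual VIP_Mover_Fn code, which the post does not list; it assumes the boto3 EC2 client and its `replace_route` call, a `/32` destination for the VIP, and placeholder IDs.

```python
def move_vip(ec2_client, route_table_id, new_instance_id, vip_cidr="10.1.5.5/32"):
    """Repoint the existing VIP route at a different instance.

    ec2_client is expected to behave like boto3.client("ec2").
    The route for vip_cidr must already exist in the route table.
    """
    return ec2_client.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=vip_cidr,
        InstanceId=new_instance_id,
    )


# Hypothetical usage (IDs are placeholders):
#   import boto3
#   move_vip(boto3.client("ec2"), "rtb-EXAMPLE", "i-DBHOST2")
```

If traffic to the VIP passes through more than one route table, the same replacement would need to be applied to each of them.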
One of the challenges here is to ascertain which of the two Amazon EC2 instances is handling the VIP traffic. Our Lambda function uses this information to decide which instance ID to use as the target when performing a failover.
We essentially need a place to store which Amazon EC2 instance is currently handling VIP traffic. For this we use the Parameter Store, where the name of the primary database server is recorded. The Lambda function reads this parameter and, when a failover is required, updates it to the name of the new master DB server.
Figure 10 – The MasterDB_VIP parameter is set to DB_Host1 currently.
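Reading and updating this parameter might look like the sketch below. It assumes the boto3 SSM client with its `get_parameter` and `put_parameter` calls and a plain String parameter; the parameter name MasterDB_VIP and the host names come from the post, while the function names are invented for illustration.

```python
PARAM_NAME = "MasterDB_VIP"            # parameter name shown in Figure 10
DB_HOSTS = ("DB_Host1", "DB_Host2")    # the two database servers in the post


def current_master(ssm_client, name=PARAM_NAME):
    """Read the name of the server currently handling VIP traffic."""
    return ssm_client.get_parameter(Name=name)["Parameter"]["Value"]


def standby_for(master, hosts=DB_HOSTS):
    """Return the other host of the pair; reject unknown names."""
    if master not in hosts:
        raise ValueError(f"unknown master: {master!r}")
    return hosts[1] if master == hosts[0] else hosts[0]


def record_new_master(ssm_client, new_master, name=PARAM_NAME):
    """Overwrite the parameter with the new master's name."""
    ssm_client.put_parameter(
        Name=name, Value=new_master, Type="String", Overwrite=True
    )


# Hypothetical usage:
#   import boto3
#   ssm = boto3.client("ssm")
#   record_new_master(ssm, standby_for(current_master(ssm)))
```

Mapping the stored host name to an instance ID for the route update is left out here; the post does not say how that lookup is done.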
Now, let’s log in to the primary database server and stop the MySQL daemon. Notice that we use DB_Host as the host to connect.
Figure 11 – Stopping the mysqld service on DB_Host1, the primary database host.
Even after MySQL is shut down on DB_Host1, the query still returns a successful result. Let’s check what the routing table looks like.
Figure 12 – VIP_Mover_Fn has updated the routing table to point to instance_id of DB_Host2.
As you can see, the Lambda function changed the routing table entry after detecting that MySQL was down on the primary DB server.
Figure 13 – Traffic flow in the failed over scenario.
If you look at the parameter value in the Parameter Store, the Lambda function has changed it to DB_Host2. With this, we have successfully failed over the virtual IP (10.1.5.5) to the other Availability Zone.
Figure 14 – The MasterDB_VIP parameter has been changed to DB_Host2.
While we used AWS Lambda to make changes to the routing table, the same can be done using other methods. The central idea of this post is that a route table entry can direct traffic to a “virtual IP,” and that changing the target of that entry fails the private IP address over across Availability Zones.
Finally, this approach is not limited to the database use case; it can be used anywhere you need a virtual private IP to move between Availability Zones.