Achieve faster switchover for Amazon RDS Blue/Green Deployments with large number of connections

In this post, we show you a recent improvement for Amazon RDS Blue/Green Deployment switchovers to reduce your overall downtime when you have a large number of connections to your database.

Blue/Green Deployments enforce safety measures to make sure that the switchover from your blue environment to the green environment maintains data consistency.

For reference, to provide a safe switchover the following steps are taken:

1. Run guardrail checks to verify that the blue and green environments are ready for switchover.
2. Stop new write operations on the primary DB instance in both environments.
3. Drop connections to the DB instances in both environments and don’t allow new connections.
4. Wait for replication to catch up in the green environment so that the green environment is in sync with the blue environment.
5. Rename the DB instances in both environments.
6. Allow connections to databases in both environments.
7. Allow write operations on the primary DB instance in the new production environment.

One of these steps cleans up connections (3) in the blue environment so that user applications are forced to re-establish their connections to the new production cluster. This step runs after writes are blocked (2) on both blue and green, which puts it directly in the path of write downtime, as noted in switchover actions.

Previously, in Amazon Aurora MySQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for MySQL, this process could take up to 60 seconds for every 1,000 connections on a DB instance. With the recent updates, which use MySQL offline_mode, it now takes only a few seconds, even for 15,000 connections or more.
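
To put the earlier figure in perspective, the following minimal sketch works out the worst-case cleanup time it implies. It assumes the "up to 60 seconds for every 1,000 connections" bound scales linearly with the connection count; the function name estimate_old_cleanup_seconds is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Rough upper bound for the previous connection cleanup time.
# Assumption: "up to 60 seconds per 1,000 connections" scales linearly.
# estimate_old_cleanup_seconds is a hypothetical helper name.
estimate_old_cleanup_seconds() {
  local connections=$1
  # Integer arithmetic: 60 seconds for every 1,000 connections.
  echo $(( connections * 60 / 1000 ))
}

estimate_old_cleanup_seconds 15000   # prints 900, that is, up to 15 minutes
```

Under that assumption, 15,000 connections could previously have added up to 15 minutes to the write downtime.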

Offline mode

OFFLINE_MODE is an engine feature that allows quicker cleanup of existing connections. As noted in the MySQL documentation:

“In offline mode, the MySQL instance disconnects client users unless they have relevant privileges, and doesn’t allow them to initiate new connections. Clients that are refused access receive an ER_SERVER_OFFLINE_MODE error.”

Switchover tests

To demonstrate this improvement, we create an Aurora MySQL cluster running version 8.0.mysql_aurora.3.04.1 with a single DB instance of the db.r6i.4xlarge instance class. We attach a custom parameter group that sets max_connections to 16000. We then test the switchover and measure the total write downtime.

First, to measure write downtime, we prepare a simple heartbeat table that records a write timestamp every second during the switchover.

CREATE TABLE heartbeat (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, 
  ts INT UNSIGNED NOT NULL,
  h VARCHAR(255) NOT NULL
);

We also prepare a simple sysbench-based dataset to generate a small workload:

sysbench --mysql-host=<CLUSTER-ENDPOINT> --mysql-user=amazn --mysql-password=<PASSWORD> \
  --mysql-db=amazn --tables=64 --table-size=1000000 --threads=64 oltp_read_write prepare

Once the dataset is ready, we create a total of 15,360 connections by spawning 30 background sysbench processes with 512 connections each:

for n in {1..30} ; do 
  sleep 5 ; 
  screen -dm -S sysbench${n} sysbench --mysql-host=<CLUSTER-ENDPOINT> --mysql-user=amazn \
    --mysql-password=<PASSWORD> --mysql-db=amazn --tables=64 --table-size=1000000 \
    --threads=512 --rate=1 --events=0 --time=3600 --db-ps-mode=disable oltp_read_write run ;
done

We monitor the connections as they build up on the cluster using a simple while loop from another session:

while true ; do 
( 
  mysql -u amazn -p<PASSWORD> -h <CLUSTER-ENDPOINT> -BNe \
    'SELECT CURRENT_TIMESTAMP, @@hostname, @@aurora_version' ; 
  mysql -u amazn -p<PASSWORD> -h <CLUSTER-ENDPOINT> -BNe \
    "SHOW GLOBAL STATUS LIKE 'Threads_connected'" ) | xargs; 
  sleep 1 ; 
done

...
2023-11-21 04:42:36 ip-172-31-0-43 3.04.1 Threads_connected 15367
2023-11-21 04:42:38 ip-172-31-0-43 3.04.1 Threads_connected 15367
2023-11-21 04:42:39 ip-172-31-0-43 3.04.1 Threads_connected 15367
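
Instead of watching the output by eye, you could check each monitoring line against a target connection count. The following is a minimal sketch that assumes the line format shown above; the helper names threads_connected and reached_target are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the value that follows the Threads_connected token
# from monitoring lines read on standard input.
threads_connected() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "Threads_connected") print $(i + 1) }'
}

# Succeed (exit status 0) when the value on the line meets the target.
reached_target() {
  local line=$1 target=$2 current
  current=$(printf '%s\n' "$line" | threads_connected)
  [ -n "$current" ] && [ "$current" -ge "$target" ]
}

# Sample line copied from the monitoring output above.
line='2023-11-21 04:42:36 ip-172-31-0-43 3.04.1 Threads_connected 15367'
if reached_target "$line" 15360; then
  echo "target reached"
fi
```

In the monitoring loop, the same check could be used to break out of the loop once all 15,360 sysbench connections are established.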

After our connections have built up, we stop the monitoring process and start a continuous write to the heartbeat table:

while true ; do
  mysql --connect-timeout=1 -u amazn -p<PASSWORD> -h <CLUSTER-ENDPOINT> amazn \
    -e "INSERT INTO heartbeat VALUES (NULL, UNIX_TIMESTAMP(), @@hostname)" ;
  date ;
  sleep 1 ;
done

Now that our workload is running and heartbeat monitoring is in place, the next step is to trigger the switchover using the AWS CLI:

aws rds switchover-blue-green-deployment \
    --blue-green-deployment-identifier <BLUE-GREEN-IDENTIFIER>

When the switchover is complete, we can stop the heartbeat writing process and inspect the output from our monitoring:

Tue Nov 21 04:49:20 UTC 2023
Tue Nov 21 04:49:21 UTC 2023 
ERROR 1290 (HY000) at line 1: The MySQL server is running 
    with the --read-only option so it cannot execute this statement
Tue Nov 21 04:49:22 UTC 2023 
ERROR 2003 (HY000): Can't connect to MySQL server on 
    'mycluster.cluster-cn9t8nqv5vex.us-west-2.rds.amazonaws.com:3306' (110)
Tue Nov 21 04:49:24 UTC 2023
...
ERROR 2003 (HY000): Can't connect to MySQL server on 
    'mycluster.cluster-cn9t8nqv5vex.us-west-2.rds.amazonaws.com:3306' (110)
Tue Nov 21 04:49:44 UTC 2023 
ERROR 1290 (HY000) at line 1: The MySQL server is running 
    with the --read-only option so it cannot execute this statement
Tue Nov 21 04:49:45 UTC 2023
...
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only 
    option so it cannot execute this statement
Tue Nov 21 04:50:02 UTC 2023
Tue Nov 21 04:50:03 UTC 2023
Tue Nov 21 04:50:04 UTC 2023

The output tells us the following:

At Tue Nov 21 04:49:22 UTC 2023, blue becomes read-only
At Tue Nov 21 04:49:45 UTC 2023, the cluster DNS endpoint points to the green environment, which is still read-only
At Tue Nov 21 04:50:03 UTC 2023, green, which is now the production cluster, becomes writable
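
The reading above relies on the two client error codes that appear in the output. As a minimal sketch, the mapping can be written as a small classifier; the function name classify_heartbeat_line and the labels it prints are hypothetical, and only the codes 1290 and 2003 come from the output shown.

```shell
#!/usr/bin/env bash
# Map one line of heartbeat-loop output to a short label.
# The labels are illustrative; the codes are the ones seen in the output.
classify_heartbeat_line() {
  case $1 in
    "ERROR 1290 "*) echo "read-only" ;;     # connected, but writes are rejected
    "ERROR 2003 "*) echo "unreachable" ;;   # the connection attempt failed
    "ERROR "*)      echo "other-error" ;;
    *)              echo "no-error" ;;      # date lines and continuation lines
  esac
}

classify_heartbeat_line "ERROR 1290 (HY000) at line 1: The MySQL server is running"
classify_heartbeat_line "ERROR 2003 (HY000): Can't connect to MySQL server on"
classify_heartbeat_line "Tue Nov 21 04:50:03 UTC 2023"
```

Because the loop prints the date after each insert attempt, the timestamp that follows an error line is the approximate time of that error.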

In our heartbeat table, the gap between the last write on blue (04:49:21) and the first write on green (04:50:03) shows a total write downtime of around 42 seconds:

mysql -u amazn -p<PASSWORD> -h <CLUSTER-ENDPOINT> amazn -e \
  "select id, FROM_UNIXTIME(ts) AS ts, h from heartbeat"
+----+---------------------+-----------------+
| id | ts                  | h               |
+----+---------------------+-----------------+
...
| 8  | 2023-11-21 04:49:19 | ip-172-31-0-43  |
| 9  | 2023-11-21 04:49:21 | ip-172-31-0-43  |
| 10 | 2023-11-21 04:50:03 | ip-172-31-0-205 |
| 11 | 2023-11-21 04:50:04 | ip-172-31-0-205 |
| 12 | 2023-11-21 04:50:05 | ip-172-31-0-205 |
...
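
The downtime can also be computed rather than read off the table. The following is a minimal sketch that finds the largest gap between consecutive heartbeat timestamps; the function name max_write_gap is hypothetical, and the sample epoch values are the ts values from the rows above, assuming the session time zone is UTC.

```shell
#!/usr/bin/env bash
# Read epoch timestamps (one per line, in insert order) and print the
# largest gap in seconds, followed by the timestamps on either side of it.
max_write_gap() {
  awk 'NR > 1 && $1 - prev > max { max = $1 - prev; from = prev; to = $1 }
       { prev = $1 }
       END { if (NR > 1) print max + 0, from, to }'
}

# Epoch values for 04:49:19, 04:49:21, 04:50:03, 04:50:04, and 04:50:05 UTC
# on 2023-11-21 (assumes the table output above is displayed in UTC).
printf '%s\n' 1700542159 1700542161 1700542203 1700542204 1700542205 | max_write_gap
```

In practice, the input could come from a query such as SELECT ts FROM heartbeat ORDER BY id, run with the mysql client's -BN options so that only the raw values are printed.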

To get an idea of how long it actually took to clean up connections on blue during switchover, we can look at the DB instance events. In our test, the event information shows that it took about 8 seconds to clean up more than 15,000 connections:

aws rds describe-events --output json --source-type db-instance \
  --source-identifier mycluster-instance-1
{
  "Events": [
  ...
    {
      "SourceIdentifier": "mycluster-instance-1",
      "SourceType": "db-instance",
      "Message": "Starting to terminate connections and user processes in the blue environment at 2023-11-21T04:49:22.931Z",
      "EventCategories": [],
      "Date": "2023-11-21T04:49:22.931Z",
      "SourceArn": "arn:aws:rds:us-west-2:123456789012:db:mycluster-instance-1"
    },
    {
      "SourceIdentifier": "mycluster-instance-1",
      "SourceType": "db-instance",
      "Message": "Finished terminating connections and user processes in the blue environment at 2023-11-21T04:49:30.891Z",
      "EventCategories": [],
      "Date": "2023-11-21T04:49:30.892Z",
      "SourceArn": "arn:aws:rds:us-west-2:123456789012:db:mycluster-instance-1"
    }
  ]
}
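
The cleanup duration is the difference between the timestamps in the two event messages. The following minimal sketch computes it without relying on a particular date implementation. It assumes both timestamps fall on the same UTC date, and the helper names iso_time_to_ms and cleanup_duration_ms are hypothetical.

```shell
#!/usr/bin/env bash
# Convert the time-of-day part of an ISO 8601 UTC timestamp to milliseconds.
# Assumption: the timestamps being compared are on the same UTC date.
iso_time_to_ms() {
  printf '%s\n' "$1" |
    awk -F'[T:Z]' '{ printf "%d\n", ($2 * 3600 + $3 * 60 + $4) * 1000 + 0.5 }'
}

# Print the elapsed milliseconds between a start and an end timestamp.
cleanup_duration_ms() {
  echo $(( $(iso_time_to_ms "$2") - $(iso_time_to_ms "$1") ))
}

# Timestamps copied from the two event messages above.
cleanup_duration_ms "2023-11-21T04:49:22.931Z" "2023-11-21T04:49:30.891Z"
```

For the two events shown, the difference is 7,960 milliseconds.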

Conclusion

This post provided a quick demonstration of how OFFLINE_MODE helps reduce your overall write downtime during Blue/Green Deployment switchovers, such as those used for major version upgrades, when you have a large number of connections. This feature is available for Blue/Green Deployments using Amazon Aurora MySQL 2.x and above and Amazon RDS for MySQL 5.7 and above. For more information, refer to Best practices for Blue/Green Deployments and Switchover best practices.

About the Author

Jervin Real is a Senior Database Engineer at Amazon Web Services helping Amazon RDS for MySQL and MariaDB customers towards efficiency.

AWS Database Blog
