Why did I get a read only error after an Amazon Aurora DB cluster failed over?
Last updated: 2020-11-26
I'm using an Amazon Aurora DB cluster and I am receiving the following error after a failover:
"The MySQL server is running with the --read-only option so it cannot execute this statement"
How can I resolve this error?
When an Amazon Aurora DB cluster experiences a Multi-AZ failover, the cluster endpoints update automatically to reflect and point to the newly appointed roles of the Writer and Reader. The old Writer is rebooted, and then set into a read-only mode while an existing replica is promoted to become a Writer.
If you receive a read-only error message, this means that you are trying to perform a data definition language (DDL), data manipulation language (DML), or data control language (DCL) operation through an existing node that has the role of a Reader. In Aurora MySQL, you can confirm this by checking the innodb_read_only variable, as follows:
mysql> show variables where variable_name='innodb_read_only'; +------------------+-------+ | Variable_name | Value | +------------------+-------+ | innodb_read_only | ON | +------------------+-------+ 1 row in set (0.01 sec)
Utilize the Cluster Writer endpoint
Because the role of a DB instance in an Aurora cluster can change, it is a best practice to use the cluster Writer endpoint to make sure that you are always pointing to the latest Writer. If you use the DB instance endpoint or a direct IP Address, you might not be aware that a failover has occurred, and will get a read-only message if you re-connect to the same host. This will stop you from performing any DDL/DML changes, as intended.
Don't excessively cache DNS
If you are not using a smart driver, then you depend on the DNS record updates and propagation after a failover event occurs. Aurora DNS zones use a short time-to-live (TTL) of 5 seconds, so it is important that your network and client configurations don't further increase this. DNS caching can occur at multiple layers of an architecture, such as the operating system (OS), the network layer and the application container. It is important that you understand how each of these layers is configured. If there is unintended DNS caching beyond the TTL of 5 seconds, it is possible that you will re-connect to the old writer after a failover.
Java Virtual Machines (JVM) can excessively cache DNS, indefinitely. When the JVM resolves a hostname to an IP address, it caches the IP address for a specified period of time (TTL). On some configurations, the JVM default TTL is set to never refresh DNS entries until the JVM is restarted. This can lead to read-only errors after a failover. In this case, it is important to manually set a small TTL so that it will periodically refresh.
Use a Smart Driver
The Amazon Aurora DB cluster endpoints propagate DNS record updates automatically, but the process doesn't happen instantly. This can cause delays in responding to an event that occurred on the database and the event might be handled by the application. A Smart Driver uses the DB cluster topography through the INFORMATION_SCHEMA.REPLICA_HOST_STATUS metadata table, which is in near-real-time. This helps to route connections to the appropriate role, and helps load-balance across the existing replicas. MariaDB Connector/J is an example of a third party Smart Driver that has native support for Aurora MySQL.
Note: Even Smart Drivers might be affected by excessive DNS Caching.
Test which instance you are connected to
As mentioned in the best practices of the Aurora connection management handbook, when not using a smart driver, you should test and understand which instance that you are logged into after establishing a new connection. This can help you make sure that you are connected to the correct instance. You can test whether you are connected to the writer instance or an Aurora Reader using the @@innodb_read_only variable. This example shows a value of 0 which means you are connected to the writer.
mysql> select @@innodb_read_only; +--------------------+ | @@innodb_read_only | +--------------------+ | 0 | +--------------------+ 1 row in set (0.00 sec)