AWS Database Blog

Orchestrating database refreshes for Amazon RDS and Amazon Aurora

The database refresh process consists of recreating of a target database using a consistent data copy of a source database, usually done for test and development purposes. Fully-managed database solutions such as Amazon Relational Database Service (Amazon RDS) or Amazon Aurora make it incredibly easy to do that. However, database administrators may need to run some post-refresh activities such as data masking or password changes, or they may need to orchestrate multiple refreshes because they manage several databases, each of them with more than one environment. In some cases, refreshes have to be performed frequently, even daily.

In this post, we describe the features of a serverless solution that you can use to perform database refresh operations at scale, with a higher level of automation. This solution can be deployed and tested using the instructions available in the GitHub repo. In the next section, we go over what you’re going to build.

Potential use cases

The solution described in this post enables you to do the following:

  • Refresh an existing database (or create a new one) using one of the four options available:
    • Latestpoint – The data is aligned to the latest point in time.
    • torestorepoint – The data is aligned to a specified point in time.
    • fromsnapshot – The data is aligned at the snapshot creation time.
    • fast-cloning (only for Aurora) – The data is aligned to the latest point in time, but it’s cloned using the fast-cloning feature provided by Aurora.
  • Refresh an existing encrypted database (or create a new one). A cross-account use case has the following considerations:
    • The only options available are fromsnapshot or fast-cloning (only for Aurora).
    • The AWS Key Management Service (AWS KMS) primary key (managed by the source account) must be manually shared with the target AWS account before launching the refresh.
  • Perform a cross-account refresh of an existing database (or create a new one).
    • As a pre-requisite, the source account has to share with the target account the Amazon RDS or Aurora snapshot or the source Aurora cluster, before launching the refresh process.
  • Run post-refresh SQL scripts against the new refreshed database (only available for Amazon RDS for MariaDB, Amazon RDS for MySQL and Aurora MySQL) to perform the following:
    • Clearing, masking, or modifying sensitive data coming from the source production database.
    • Deleting unnecessary data or removing unnecessary objects coming from the source production database.
  • Customize the solution by adding or removing steps to orchestrate operations for those applications that have different requirements, using the same state machine.
  • Keep the history of all the database refresh operations of your applications, in order to answer questions such as:
    • When has my database last been refreshed?
    • Does my application have all its non-production databases refreshed?
    • Is the refresh that I launched yesterday complete?

Prerequisites

The solution implemented focuses on a time-consuming administrative task that DBAs have to deal with: the database refresh.

The process consists of recreating an existing database. Typically, this is a copy used for test and development purposes whose data has to be “refreshed”. You can use a backup or the last available image of the related production environment to refresh a database. The solution can also be applied to scenarios when you create a new environment from scratch.

The process can involve additional steps to apply different settings or configurations to the new refreshed database.

The following diagram illustrates the process.

The following diagram illustrates the process.

The backup can be either logical (a partial or full export the source dataset) or physical (a binary copy of the database, which can be full, incremental, whole, or partial). The solution described in this post allows you to use physical backups (Amazon RDS or Aurora snapshots) during the restore process or the Aurora cloning feature in order to copy your databases.

Solution overview

The solution uses several AWS services to orchestrate the refresh process:

  • Amazon Aurora – A MySQL-and PostgreSQL-compatible relational database built for the cloud. The solution uses Aurora snapshots or the fast cloning feature to restore Aurora database instances. Restores are performed using APIs provided by RDS and Aurora.
  • Amazon DynamoDB – A fully-managed key-value and document database that delivers single-digit millisecond performance at any scale. We use it to keep track of all the refresh operations run by this solution.
  • Amazon Elastic Compute Cloud – Amazon EC2 provides secure, resizable compute capacity in the cloud. The solution uses it in conjunction with AWS Systems Manager to run SQL scripts against your restored databases.
  • AWS Lambda – Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume. Lambda functions are used to implement all the steps of a database refresh.
  • AWS Step Functions – A serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications. This is the core service of the solution, used to orchestrate database refreshes.
  • Amazon RDS – A fully managed relational database solution that provides you with six familiar databases. The solution uses Amazon RDS snapshots to restore RDS databases instances. Restores are performed using APIs provided by RDS and Auroras.
  • Amazon Simple Notification Service – Amazon SNS is a fully managed messaging service for both systems-to-system and app-to-person communication. We use it to notify users about the completion of refresh operations.
  • Amazon Simple Storage Service – Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security and performance. We use it to store the SQL scripts that the solution allows you to run against new restored databases.
  • AWS Secrets Manager – Secrets Manager helps you to securely encrypt, store, and retrieve credentials for your database and other services. We use it to manage the access credentials of the databases involved in your refreshes.
  • AWS Systems Manager – Systems Manager organizes, monitors and automates management tasks on your AWS resources. With Systems Manager Run Command, you can optionally run SQL scripts stored on Amazon S3 against your restored databases.

This solution may incur costs; check the pricing pages related to the services you’re using.

Architecture

The architecture of the solution proposed is shown in the following diagram.

The architecture of the solution proposed is shown in the following diagram.

The pre-restore workflow has the following steps:

  1. The end user prepares the refresh file (described later in this post) by configuring which steps have to be performed (including the optional creation of a Secrets Manager secret).
  2. If necessary, the end user can also prepare SQL scripts, stored on Amazon S3, to run as post-refresh scripts. 

The restore workflow has the following steps:

  1. The end user initiates the refresh process by starting the Step Functions state machine (the refresh process could be initiated automatically, if needed).
  2. The state machine manages each step of the database restore by invoking Lambda functions that are part of this solution 

The post-restore workflow includes the following steps:

  1. When the restore is complete, the state machine runs the post-restore SQL scripts. It provides two options:
    1. The state machine can run the scripts, stored on Amazon S3, through a Lambda function. If configured, you can use Secrets Manager to store and manage the database credentials.
    2. The state machine can run the scripts, stored on Amazon S3, using an EC2 instance, through Systems Manager Run Command.
  2. The state machine uses a DynamoDB table to store information about the process and its status.
  3. The state machine notifies the end user about the process final status using Amazon SNS.

Steps of a database refresh

Before describing in more detail what the solution looks like and how it works, it’s important to understand at a high level the main steps that are part of a database refresh:

  1. A backup of the source database is created.
  2. If the target database already exists, it’s stopped or, in most of the cases, deleted.
  3. The target database is re-created through a database restore operation, using the backup from Step 1.
  4. Post-restore scripts are run against the new restored target database.

The Step Functions state machine implemented for this solution is composed by several states; most of them are related to specific steps of a database refresh operation. In particular, some states are required only for Amazon RDS, others only for Aurora, and others are required for both.

The following table lists the main steps related to a refresh of an RDS DB instance performed by our solution.

Step # Step Name Description
1 delete-replicas Deletes the existing read replicas of the target database
2 stop-old-database Stops the existing target database
3 perform-restore Performs the restore
4 delete-old-database Deletes the old target database
5 rename-database Renames the new target database
6 fix-tags Updates the tags of the new target database
7 create-read-replicas Re-creates the read replicas previously deleted
8 change-admin-pwd Changes the admin password of the new target database
9 rotate-admin-pwd Rotates the admin password within the secret for the new target database
10 runscripts Runs SQL scripts against the new target database
11 update-dynamodb Updates a DynamoDB table with some information about the refresh completed
12 send-msg Sends an SNS notification (e-mail) about the completion of the refresh

The following table lists the main steps related to a refresh of an Aurora cluster performed by our solution.

Step # Step Name Description
1 delete-replicas Deletes the existing read replicas of the target database
2 perform-restore Performs the restore (it only creates the cluster)
3 create-instance Creates a new instance within the cluster restored at Step 2
4 delete-old-database Deletes the old target DB instance
5 delete-old-cluster Deletes the old target cluster
6 rename-cluster Renames the new target cluster
7 rename-database Renames the new target database
8 fix-tags Updates the tags of the new target database
9 create-read-replicas Re-creates the read replicas previously deleted
10 change-admin-pwd Changes the admin password of the new target database
11 rotate-admin-pwd Rotates the admin password within the secret for the new target database
12 runscripts Runs SQL scripts against the new target database
13 update-dynamodb Updates a DynamoDB table with some information about the refresh completed
14 send-msg Sends an SNS notification (e-mail) about the completion of the refresh

The graphic representation of the Step Function state machine that contains all the states mentioned above is available on the GitHub repo. You can use it on RDS DB instances, Aurora clusters, or both.

The job poller strategy

One of the main challenges of implementing an orchestrator with serverless services is to manage their stateless nature. When a certain operation is performed by a Lambda function against a database, how can we know when the operation is complete? The job poller strategy is a good solution. The following image is an extract from the solution showing this mechanism:

The following image is an extract from the solution showing this mechanism:

For most of the steps that are part of a database refresh, we implement the same strategy:

  1. Step Functions invokes a Lambda function that performs a certain operation (such as restore a database).
  2. Step Functions waits a certain number of seconds (you configure) using the state of “Wait”.
  3. Step Functions invokes a Lambda function that checks if the operation has completed (if the database has been restored and its status is “available”).
  4. Step Functions verifies the results of the previous check using the state type “Choice”.
  5. Step Functions goes to the next state if the operation has completed; otherwise it waits again (returns to Step 2).

Configuring your database refresh

The steps of the database refresh are orchestrated by a Step Functions state machine based on an input file provided – the “refresh file”. It’s a JSON document containing all the input parameters for the state machine (in particular for the Lambda functions associated to the state machine states) which determines the characteristics of the refresh.

A refresh file contains information about a specific refresh, so ideally for a single production database with two different non-production environments (one for development and one for test), a DBA has to prepare two refresh files. After these files are defined, they’re ready to be used and the related refresh can be scheduled or automated.

The following code is the high-level structure of a refresh file:

{
  "comment": "<comment>",
  "<state_name>": {
    "<parameter_name>": "<value>",
    "<parameter_name>": "<value>",
    "<parameter_name>": "<value>",
    [..]
    "wait_time": <value>,
    "check": {
       "<parameter_name>": "<value>",
       "<parameter_name>": "<value>",
       "checktodo": "<value>",
       "torun": "true|false"
    },
    "torun": "true|false"
  },
  "<state_name>": {
    "<parameter_name>": "<value>",
    "<parameter_name>": "<value>",
    "<parameter_name>": "<value>",
    [..]
    "wait_time": <value>,
    "torun": "true|false"
  },
  [..]
}

The file contains an element for every state machine’s state that needs an input. For more information about defining it, see the GitHub repo.

Keep in mind the following about the refresh file:

  • Not all the elements are required; some of them are related to steps that you may not want to run during a database refresh.
  • Some elements are only related to Amazon RDS, others only to Aurora.
  • Each element has a “torun” attribute that you can set to “false” in case you want to skip the related step.
  • Each element has a “wait_time” attribute that determines for how long the state machine has to wait before checking whether the related operation or step is complete.
  • Some elements have a “check” section that contains the input parameters for the Lambda function that verify whether a certain step completed successfully. This section has a “torun” parameter as well.
  • Within an element, some parameters are required and others are optional.
  • Within an element, some parameters are related to each other; if one has a value, the other one will become is also required.

In this post, we show you three examples of elements related to three different steps of a database refresh.

The following code shows a refresh of an RDS DB instance to the latest point:

[..]
  "restore": {
    "dbservice": "rds",
    "source": "mysqlinstp",
    "target": "mysqlinstd",
    "restoretype": "latestpoint",
    "application": "app1",
    "environment": "development",
    "port": 3307,
    "subgrp": "db-sub-grp-virginia",
    "iamdbauth": "False",
    "cwalogs": "audit,error,general,slowquery",
    "copytagstosnap": "True",
    "dbparamgrp": "default.mysql5.7",
    "deletionprotection": "False",
    "secgrpids": "sg-03aa3aa1590daa4d8",
    "multiaz": "False",
    "dbname": "mysqlinstd",
    "dbclass": "db.t3.micro",
    "autominor": "False",
    "storagetype": "gp2",
    "wait_time": 60,
    "check": {
       "dbservice": "rds",
       "dbinstance": "mysqlinstdtemp",
       "checktodo": "checkrestore",
       "torun": "true"
    },
    "torun": "true"
  }
  [..]

The preceding section of the refresh file indicates that an RDS for MySQL DB instance “mysqlinstp” must be used as the source for the refresh to the latest point of the DB instance “mysqlinstd”. The section includes other information about the new database to be restored, including the security group ID, the storage type, the DB instance class. The state machine verifies every 60 seconds whether the restore operation is complete. In the “check” section, you can notice that a database is always restored with a name ending with the suffix “%temp”. This suffix is removed later with another step. 

The following code illustrates how to rename an RDS for MySQL DB instance once restored:

[..]
  "rename": {
    "dbservice": "rds",
    "dbinstance": "mysqlinstdtemp",
    "wait_time": 10,
    "check": {
       "dbservice": "rds",
       "dbinstance": "mysqlinstd",
       "checktodo": "checkrename",
       "torun": "true"
    },
    "torun": "true"
  }
  [..]

The preceding section of the refresh file indicates that the new restored RDS DB instance “mysqlinstdtemp” must be renamed to “mysqlinstd”. The state machine verifies every 10 seconds whether rename operation is complete.

The following code runs post-refresh SQL scripts against a new restored RDS DB instance:

[..]
  "runscripts": {
    "dbservice": "rds",
    "dbinstance": "mysqlinstd",
    "engine": "mysql",
    "access": "secret",
    "secretname": "/development/app1r/mysqlinstd",
    "method": "lambda",
    "bucketname": "awsolproj",
    "prefix": "rdsmysql/mysqlinstd",
    "keys": "00test.sql,01test.sql",
    "wait_time": 10,
    "check": {
       "dbservice": "rds",
       "bucketname": "awsolproj",
       "prefix": "rdsmysql/mysqlinstd",
       "checktodo": "runscripts",
       "torun": "true"
    },    
    "torun": "true"
  }
  [..]

The preceding section of the refresh file indicates that the scripts “00test.sql” and “01test.sql” stored on Amazon S3 in the bucket “awsolproj” must be run through Lambda against the RDS for MySQL DB instance “mysqlinstd”. Database credentials are retrieved using Secrets Manager, and the status of the operation is verified every 10 seconds.

Managing secrets

At the end of the restore, the new database has the same passwords for all the users within the database, including the primary user. This situation could represent a problem from a security standpoint, and for this reason the Step Functions state machine includes the following two states: change-admin-pwd and rotate-admin-pwd.

With change-admin-pwd, the password of the primary user is automatically changed with a new one specified in the refresh file. If a Secrets Manager secret is configured for that database, the secret can be synchronized as well. See the following code:

[..]
"changeadminpwd": {
  "dbservice": "rds",
  "dbinstance": "mysqlinstd",
  "temppwd": "temppwd123",
  "secret": "true",
  "secretname": "/development/app1/mysqlinstd",
  "wait_time": 15,
  "check": {
     "dbservice": "rds",
     "dbinstance": "mysqlinstd",
     "checktodo": "checkpwd",
     "torun": "true"
  },    
  "torun": "true"
}
[..]

With rotate-admin-pwd, if a Secrets Manager secret is configured and it has the rotation settings enabled, the secret containing the primary user password is rotated:

"rotateadminpwd": {
  "dbservice": "rds",
  "dbinstance": "mybetainstd",
  "secretname": "/development/gamma/mybetainstd",
  "wait_time": 15,
  "check": {
     "dbservice": "rds",
     "secretname": "/development/gamma/mybetainstd",
     "temppwd": "temppwd123",
     "checktodo": "rotatepwd",
     "torun": "true"
  },    
  "torun": "true"
}

The solution allows you to run post-refresh SQL scripts in two ways:

  • Using Lambda
  • Using Systems Manager Run Command and EC2

The first option is more suitable if you’re more familiar with Lambda and want to keep the solution’s infrastructure completely serverless. Otherwise, DBAs who are used to directly managing SQL scripts on a server can easily manage them through Systems Manager: scripts are downloaded from Amazon S3 to the EC2 instance that is part of the solution and run from there.

In both cases, you have to store the scripts on Amazon S3.

The following code is the section of the refresh file related to the “run-script-<engine>” state:

"runscripts": {
  "dbservice": "aurora|rds",
  "cluster": "<string>",
  "dbinstance": "<string>",
  "engine": "aurora-mysql|mysql|mariadb|oracle|aurora-postgresql|postgresql",
  "access": "pwd|secret",
  "temppwd": "<string>",
  "secretname": "<string>",
  "method": "lambda|ec2",
  "bucketname": "<string>",
  "prefix": "<string>/<string>",
  "keys": "<filename1>,<filename2>,<filenameN>",
  "wait_time": <integer>,
  "check": {
     "dbservice": "aurora|rds",
     "bucketname": "<string>",
     "prefix": "<string>/<string>",
     "checktodo": "runscripts",
     "torun": "true"
  },    
  "torun": "true"
}

Within a SQL script, you can run SELECT, DDL (Data Definition Language), DML (Data Manipulation Language), and DCL (Data Control Language) statements. As of this writing, this feature is available only for MySQL-related databases (Amazon RDS for MySQL, Amazon RDS for MariaDB, and Aurora MySQL)

Tracking and troubleshooting your database refresh

Keeping track of database refreshes is important especially when you have to manage hundreds of production databases plus the related non-production ones. This solution uses an encrypted DynamoDB table to record information about databases refreshes, giving you the ability to quickly answer questions like the following:

  • Which date is the data of this database aligned to?
  • When was the last time we refreshed this database?
  • From which source did this database copy?
  • Did the refresh of this database run successfully yesterday?
  • Considering the production database, what’s the status of the refreshes of its non-production databases?

The current structure of the DynamoDB table is the following:

Table name – dbalignement-awssol
Partition key – dbinstance
Sort key – restoredate
Additional attributes – appname,environment,recordingtime,restoretype,snapshot,source,status

As of this writing, the solution doesn’t provide any local secondary index (LSI) or global secondary index (GSI) for the table, but you can easily add new GSIs to increase the number of access patterns that can be satisfied based on your needs.

If a database refresh fails for any reason, you can use different services to understand the reasons.

You can easily monitor the runs of your state machines through the Step Functions API or through its dashboard.

The graph inspector can immediately tell you at which state there was a failure or at which state the state machine got stuck. If you choose a state, you can also take a look at the related input and output.

You can also monitor the output of the Lambda functions associated with the states of the solution.

Lambda logs information about its runs in Amazon CloudWatch Logs, from which you can get more details about what happened during a certain operation.

Get notified or verify the database refresh completion

The solution uses Amazon SNS to send emails about the success or failure of the database refreshes performed. In case of success, some details about the database just refreshed are included in the message sent.

The following code is the section of the refresh file related to the “sendmsg” state:

"sendmsg": {
  "dbservice": "aurora|rds",
  "application": "<string>",
  "environment": "<string>",
  "dbinstance": "<string>",
  "source": "<string>",
  "restoretype": "fromsnapshot|restorepoint|latestpoint|fastcloning",
  "topicarn": "<string>",
  "torun": "true|false"
}

This feature is optional.

What’s next

The solution could be improved in some aspects, especially in the submission of the information about the database refresh. As of this writing, the input to provide must be manually prepared, but in the future we’re thinking about providing a user interface through which you can create the related JSON files and immediately perform some pre-checks that can validate the information provided.

Notifications are sent to users via Amazon SNS but another option could be to use Amazon Simple E-mail Service (Amazon SES) to get more detailed information about the refreshes performed by sending formatted e-mails with additional information attached about the new database just restored.

As of this writing, the solution doesn’t support Amazon RDS for SQL Server, and running post-refresh SQL scripts is available only for MySQL-related engines. We’re working to include those features in the remaining engines.

Conclusion

In this post, we showed how you can automate database refresh operations using serverless technology. The solution described can help you increase the level of automation in your infrastructure; in particular it can help reduce the time spent for an important and critical maintenance activity such as database refreshes, allowing DBAs to focus more on what matters when they manage their Amazon RDS and Aurora databases on AWS.

We’d love to hear what you think!

If you have questions or suggestions, please leave a comment.


About the Authors

Paola Lorusso is a Specialist Database Solutions Architect based in Milan, Italy. She works with companies of all sizes to support their innovation initiatives in the database area. In her role she helps customers to discover database services and design solutions on AWS, based on data access patterns and business requirements. She brings her technical experience close to the customer supporting migration strategies and developing new solutions with Relational and NoSQL databases.

 

Marco Tamassia is a technical instructor based in Milan, Italy. He delivers a wide range of technical trainings to AWS customers across EMEA. He also collaborates in the creation of new courses such as “Planning & Designing Databases on AWS” and “AWS Certified Database – Specialty”. Marco has a deep background as a Database Administrator (DBA) for companies of all sizes (included AWS). This allows him to bring his database knowledge into classroom brining real world examples to his students.