Managed database backup and recovery in a multi-tenant SaaS application
Exporting data or restoring to an earlier point in time are core capabilities that customers look for in a software as a service (SaaS) product. Database backup and recovery is vital to business continuity and regulatory compliance. You should understand how your multi-tenant data partitioning model will affect the backup and restore capabilities of your SaaS solution.
In this post, we explore how SaaS multi-tenant database partitioning models influence backup and recovery, approaches for tenant data segregation during backup and recovery, and how to reduce cost when performing selective restoration of a tenant’s data.
Overview of database partitioning models
We refer to the silo, bridge, and pool SaaS architecture and database partitioning models throughout this post. We only provide a basic overview; for more information, refer to Multi-tenant SaaS partitioning models for PostgreSQL.
The silo model provides a database instance or cluster per tenant without resource sharing. The bridge and pool models have tenants sharing resources on a database instance or cluster. In the bridge model, each tenant has their own database or schema on an instance or cluster. In the pool model, all tenants share the same schema in the same database.
For further information on database partitioning models, refer to Silo, Pool, and Bridge Models.
How partitioning models influence backup and recovery
Multi-tenant solutions can complicate database backup and recovery. AWS offers generic tools to help with backup and recovery, but you will probably have to leverage the tools provided by your database vendor to achieve the best results.
Silo-based SaaS deployment models do not have unique database backup and recovery restrictions beyond non-SaaS workloads.
Pooled and Hybrid SaaS deployment models bring challenges and we can generically approach them either by segregating (partitioning) the tenant data during the backup (extract) phase, or we can segregate the tenant data during the recovery (load) phase.
The operational effort and complexity of performing this segregation varies depending on the database partitioning model. The greater the resource sharing, the more effort required to partition a single tenant.
SaaS providers often choose the pooled model because it can be more cost-effective and simpler to operate for many tenants, however tenant backup and recovery is significantly more complex because of other tenants in the database compared to the silo model, where each tenant has their own database.
The partitioning model also influences the tools you can use for segregation. AWS service features, such as those from Amazon RDS, Aurora, or AWS Backup, back up the entire database instance (Amazon RDS) or cluster (Aurora). Where tenants are sharing resources, such as in the bridge or pool models, you must use database-engine specific tools like pg_dump or mysqldump to segregate tenant data, introducing additional tooling into your application.
We prefer segregation during recovery to take advantage of AWS service features during backup, and only introducing operational overhead during the recovery process. However, there are use cases for segregation during backup. Let’s compare both approaches in more detail.
Segregating tenant data during backup (extract)
For SaaS builders, this approach offers simpler and quicker recovery, as well as per-tenant data retention. This incurs an operational overhead per-tenant cost during the backup cycle, which may make this approach unsuitable for SaaS providers with a large number of tenants.
Per-tenant data retention provides several product opportunities as a SaaS provider. You can offer customers the ability to manage their own retention policies, export historical data on demand, and encrypt their backups with their own encryption keys. Per-tenant data retention can also help improve your security and regulatory compliance postures.
From an operational perspective, per-tenant data retention provides simple mechanisms for managing a tenant’s data. You can restore directly from backup, and a SaaS provider can easily delete data if a tenant leaves or invokes a “right to be forgotten” law.
Per-tenant data retention also provides an easy way to measure per-tenant backup costs.
To segregate during backup, each tenant has their data segregated in the backup process and stored independently, as shown in the following diagram. The methods available to segregate a tenant’s data depend on the partitioning model, which we investigate later.
Segregating tenant data during backup adds operational overhead to each backup cycle. This may not be practical for SaaS providers with many tenants. You must manage each tenant backup over time, and the work needed to perform segregation brings additional performance and financial cost to each backup cycle.
To ensure data consistency in your backups, consult the best practices of your chosen relational database. This may require read only locks, using write ahead buffers, or service downtime. These choices impact your tenants’ experience and add further complexity to the backup process.
Segregating tenant data during recovery (load)
Segregation during recovery is attractive to SaaS builders because it is operationally efficient and you can incorporate it into an existing backup process.
This approach has the benefit that you can use AWS service features for backup. Aurora backups are continuous and incremental in nature, without causing a performance impact to your application or any interruption to the database service. This is important for a SaaS offering, which should be always available for tenants.
You introduce complexity from segregation only in the recovery process and do not require service downtime during backup. However, this can add performance and usability issues for your tenants during recovery.
In segregation during recovery, you backup the entire database containing all tenants. You restore the entire multi-tenant snapshot to a temporary database, where you then segregate the tenant and restore their data into the live database. You can then delete the temporary database. The following diagram illustrates this process.
The cost of the temporary database needs to be considered during the recovery process. You must take care to reduce costs, especially for large databases. We investigate these options later in this post.
Restoring tenant data to a live database
During the recovery process, you must think about how to handle existing tenant data and how to restore recovered data back into the multi-tenant database.
A complete restoration to an earlier point in time is the simplest method. Delete all existing tenant data before restoration. Complexity of recovery in this scenario depends on the partitioning model, which we explore later.
For selective recovery, you need to overwrite any existing data that is being restored. You can use database engine-specific tools to import the segregated tenant data into the live database, overwriting existing files.
Now that we’ve described some benefits and challenges to the two main approaches to backup and restore of multi-tenant data, let’s look at some examples of how you might implement them with PostgreSQL.
Backup and recovery in the silo model
The silo model requires no additional effort for segregation, because each tenant has their own instance or cluster.
During a complete point-in-time recovery (PITR), you restore a backup snapshot to a new database, and you configure the application to connect to the newly restored database instance or cluster. You can then clean up the old database instance or cluster. For more information, refer to Backing up and restoring an Amazon RDS DB instance.
Alternatively, you can use database engine-specific tools to restore to the original database cluster or instance. The application needs no configuration changes, but there is additional complexity. In the following example, note that the pg_restore must be able to connect to a database that is not the one being restored, because you delete the target database and recreate it during this process:
Backup and recovery in the bridge model
The schema-per-tenant and database-per-tenant bridge models bring additional complexity because multiple tenants exist on the same database instance or cluster.
Database engine-specific tools can segregate and restore a tenant’s database or schema. The following code shows examples of the segregation of a single database or schema:
Backup and recovery in the pooled model
Segregation in the pooled model is the most complex because you share tenant data within the same table and must first extract a tenant’s data from the shared table before you export it.
We prefer segregation during recovery in the pooled model to defer the cost of segregation until it is required. With pooled databases containing a large amount of tenants, the performance cost and management of multiple backup jobs reduces operational efficiency.
One approach to the challenge of segregation is to export the tenant’s data to a local file. For example:
During recovery, we delete the tenant’s existing data, and copy the restored data back into the multi-tenant table:
Most databases have multiple tables, and a mechanism is required to export tenant data from all tables. One approach is to iterate across all tables to export the tenant’s data to temporary tables, which are then exported. You can do this using reproducible database-specific functions, as shown in the following code:
Data consistency is critical when exporting multiple tables. Performing segregation during recovery ensures that you’re exporting from a backup snapshot that is already consistent.
A disadvantage to segregation during recovery is that you must restore the entire multi-tenant database, which may incur an appreciable cost with large databases. Take care to destroy any restored databases as soon as you export tenant data.
Selective restore in a pooled multi-tenant database
Allowing a tenant to perform a selective restoration of data can be a valuable product feature. A tenant may use your SaaS product to provide services for multiple customers and need the ability to recover a single customer to an earlier point in time, or need the ability to roll back unwanted changes.
The approach to allow a tenant to select records to restore depends on the dataset properties, the number of records available, and the time to restore from. Selectively restoring from a small dataset from a single backup is significantly less complex than allowing a tenant to restore from a large dataset across multiple backups.
A simple approach is to create a temporary database from a backup snapshot that remains online while the tenant selects their records. This incurs a cost during the selection period, and we investigate some techniques to minimize the financial impact.
If you are restoring a multi-tenant snapshot, then you can delete all other data except that belonging to the tenant you wish to recover. Aurora offers dynamic resizing of storage when underlying database files are deleted or reorganized to require less space.
You do this by copying the tenant data to a temporary table, deleting the original multi-tenant table, and then renaming the temporary table to the original multi-tenant table name. The tenant can then select which records to recover. See the following example code:
Choosing the lowest-cost database offering can reduce cost if recovery performance isn’t critical. If you expect the database to be idle for extended periods, then Amazon Aurora Serverless v2 offers the lowest idle cost (0.5 ACU at $0.06 per hour, versus $0.072 per hour for an on-demand db.t4g.medium instance. Current pricing in US East (Ohio) at the time of publishing).
You can reduce cost further, with an increase in complexity, by using a temporary cache of recovered records for the selection process.
For smaller datasets, you can store a complete copy of the recoverable tenant records, which you restore directly into the live database. This approach may not be workable for larger complex datasets spanning multiple tables.
Another approach is to use an inventory of record metadata, which the tenant uses to select which data to restore. Suitable metadata would be primary keys, foreign keys, and other columns relevant to the SaaS offering. You can generate a temporary inventory during the recovery process, or maintain a permanent inventory as part of our product offering, updated during our backup process.
Maintaining a backup inventory adds complexity and cost, but can provide a superior user experience if your product requires the ability to restore across any backup or a large dataset.
Tenant backup and recovery is critical for many SaaS applications. In this post, we looked at how database partitioning models influence how we perform backup and recovery, approaches to segregation in each model, and approaches to offer selective restore.
There is no single best-practice approach to backup and recovery for a managed database in a SaaS application, and the type of partitioning model will affect the complexity of this operation. We recommend segregating tenant data during recovery to reduce operational complexity during backup, but segregation during backup offers per-tenant data retention and reduces complexity during recovery.
You should thoroughly test your backup recovery solution once you have decided on your approach. Regardless of choice, testing of your backup solution should be a regular part of your operational footprint.
For further information, refer to Silo, Pool, and Bridge Models and Multi-tenant SaaS partitioning models for PostgreSQL.
About the Author
Dave Roberts is a Senior Solutions Architect and member of the SaaS Factory team. He helps guide and assist AWS partners with building their SaaS products on AWS. Dave has over 16 years of experience in IT working with large customers. Originally from New Zealand, he now lives in Germany with his wife and 2 children.