AWS Database Blog

Scale your relational database for SaaS, Part 2: Sharding and routing

This post is a continuation of our series on scaling your relational database for software as a service (SaaS). SaaS providers commonly use relational databases, such as Amazon Relational Database Service (Amazon RDS) and Amazon Aurora, in their solutions. In Part 1, we looked at some common ways to scale or optimize your relational database architecture. Those methods focused on scaling a set of physical resources with a finite limit, such as vertical scaling or horizontal scaling with read replicas. As a SaaS provider, you may need to grow beyond these limits, which is where database sharding can be a viable scaling mechanism.

Database sharding adds complexity and is often a one-way door once implemented. This is because sharding is incorporated explicitly into the data model, and the application has to be modified to account for this. Therefore, you should understand the implications of sharding your database before making a decision that can affect your SaaS operations. Additionally, as your database architecture increases in complexity, you may face the challenge of how to route tenant requests to the correct database.

In this post, we look at database sharding and how to handle the challenge of your application routing tenant requests to the correct database.

Scaling further

Vertical scaling improves read and write performance, but there is an upper limit to how large you can grow or optimize your database instance. Likewise, horizontal scaling with read replicas lets you scale out for read workloads, but there is a limit on the number of read replicas supported by your database solution. How will you continue to grow your application as you near or reach these limits?

In the pool and bridge models, one option is to migrate tenants to silo storage, particularly when you have a few tenants consuming more resources than the others. The tenants chosen are usually those with the highest performance requirements. You then need to introduce a mechanism to migrate existing tenants to the new environment. Tools such as AWS Database Migration Service (AWS DMS) can help automate this migration. You also need to map tenants to their storage, introducing additional complexity to your solution and operations.

Another option is to use database sharding. This will let you keep your partitioning model and scale your dataset as your customer base grows.

Database sharding in SaaS solutions

Database sharding is covered in depth in What Is Database Sharding? and Sharding with Amazon Relational Database Service. We recommend exploring these references if you’re not familiar with the concept, because this post focuses only on SaaS-specific aspects.

Your choice of shard key is important when designing your database sharding architecture. SaaS providers often use tenant_id as the shard key because it allows them to localize a tenant’s data on a single data shard. The following figure is an example of sharding the Order table in a relational database using tenant_id as the shard key.

Tenant Sharding Diagram

You can improve per-tenant performance when you localize all transactions for a tenant on a single data shard. For example, if a tenant’s data is spread across several databases, a query for that tenant may have to search rows in all of them. By confining the tenant to a single database, you reduce the number of rows searched, improving performance. It also provides the ability to use foreign keys and JOINs for a tenant. The following diagram shows sharding across multiple database tables, using tenant_id as the shard key.

Multi Table Sharding Diagram
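As a rough illustration of how tenant_id can act as the shard key, the following Python sketch places rows on shards so that every row for a tenant lands on the same shard. The hash-based placement, the shard count of 4, and the function names are assumptions made for this example only. The post itself describes an index-based mapping later on, which is generally easier to rebalance than a fixed hash.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for_tenant(tenant_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a tenant to a shard index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def place_rows(rows: list[dict], num_shards: int = NUM_SHARDS) -> dict[int, list[dict]]:
    """Group rows by the shard that owns each row's tenant_id."""
    shards: dict[int, list[dict]] = {i: [] for i in range(num_shards)}
    for row in rows:
        shards[shard_for_tenant(row["tenant_id"], num_shards)].append(row)
    return shards


orders = [
    {"order_id": 1, "tenant_id": "tenant-a", "total": 20},
    {"order_id": 2, "tenant_id": "tenant-b", "total": 35},
    {"order_id": 3, "tenant_id": "tenant-a", "total": 12},
]
placement = place_rows(orders)
```

Because the shard is derived only from tenant_id, both tenant-a orders end up together, so a query or JOIN for that tenant touches a single database.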

Because a data shard is a unique database, you can have multiple shards on a single database instance or cluster. Your shard size and number of shards per instance will depend on your use case. For example, you may opt for one large shard per instance to reduce the operational overhead of managing many shards. Alternatively, you may want multiple smaller shards per instance if you’re experiencing issues because of the size of your database, such as challenges with replication or backup lag, or performing database maintenance tasks, such as vacuuming.

Benefits of sharding

With sharding, you can scale your application beyond the performance of a single database. You can architect your SaaS application in a way that will allow resizing of existing shards or adding new shards in real time to address performance issues, or as a response to newly provisioned tenants. If a single tenant requires more performance than a single database can offer, then sharding is one potential solution.

Using the pool model with sharding can be more operationally efficient with a large number of tenants than the silo model. You manage fewer databases, and tenant onboarding is simplified because you don’t need to provision a new database for each new tenant.

Using tenant_id as the shard key brings other operational efficiencies. It aligns well with tenant isolation approaches, and it’s straightforward to rebalance tenants over new nodes to address performance issues because you store all of a tenant’s data on a single data shard. Similarly, per-tenant backup and restore doesn’t need to ensure consistency across multiple data shards.

Combining sharding with other database scaling approaches can further improve tenant performance when using tenant_id as the shard key, although this will increase complexity. For example, you can implement table partitioning based on tenant_id to target individual tenant performance, or introduce database caching to improve shard performance.

Sharding also provides a cellular architecture pattern. This limits blast radius from infrastructure failure and reduces the impact of a single tenant on the performance of others. The following diagram shows an architecture where each database instance contains three shards, each shard is a unique database, and each shard contains up to two tenants. The diagram doesn’t show high availability, which is implemented natively by Aurora or Amazon RDS. For more information about these options, refer to High availability and durability and Amazon RDS Multi-AZ.

Tenant Shuffle Sharding

You can improve this with shuffle sharding to further reduce the impact a single problematic tenant can have on others. However, this increases complexity because you need to introduce a solution to manage shard replication.
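To make the idea of shuffle sharding more concrete, here is a minimal sketch that assigns each tenant a stable, pseudo-random subset of shards. The function name, the subset size of 2, and the pool of 8 shards are all assumptions for illustration. The sketch covers assignment only and does not address the replication between a tenant's shards that the paragraph above mentions.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, shards: list[str], size: int = 2) -> list[str]:
    """Pick a stable pseudo-random subset of shards for one tenant."""
    # Seed a private generator from the tenant ID so the choice is repeatable.
    seed = int.from_bytes(hashlib.sha256(tenant_id.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(shards, size))


SHARDS = [f"shard-{i}" for i in range(8)]  # hypothetical shard names
assignments = {t: shuffle_shard(t, SHARDS) for t in ("tenant-a", "tenant-b", "tenant-c")}
```

With 8 shards and subsets of size 2 there are 28 possible combinations, so two tenants are unlikely to share their entire subset. A problematic tenant then affects only tenants that overlap with it on at least one shard, and those tenants still have another shard to fall back on.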

Challenges of sharding

Sharding introduces significant complexity into your SaaS application. The application must handle the mapping and routing of data across all shards. Introducing a helper service can help hide this complexity from developers. We discuss this concept later in the post.

Queries that require data from multiple shards require additional application-level engineering. This usually results in higher latency than queries using a single shard. Some workloads are unsuitable for sharding, such as online analytical processing (OLAP), where you typically perform data analytics on the entire dataset. In these cases, it’s a common practice to create a copy of the dataset on an OLAP database so that you can serve both transactional and analytical workloads.
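The following sketch shows the kind of application-level engineering a cross-shard query can involve: the same query is sent to every shard and the partial results are combined in the application. In-memory SQLite databases stand in for the shards, and the table layout and function names are assumptions for this example.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory SQLite database standing in for one shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, tenant_id TEXT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


def total_revenue(shards):
    """Fan the same query out to every shard, then combine the partial sums."""
    grand_total = 0.0
    for conn in shards:
        # COALESCE keeps an empty shard from returning NULL.
        (partial,) = conn.execute("SELECT COALESCE(SUM(total), 0) FROM orders").fetchone()
        grand_total += partial
    return grand_total


shards = [
    make_shard([(1, "tenant-a", 20.0), (3, "tenant-a", 12.0)]),
    make_shard([(2, "tenant-b", 35.0)]),
]
print(total_revenue(shards))  # prints 67.0
```

This loop visits the shards one after another, so latency grows with the number of shards. Querying shards in parallel reduces that cost but adds its own error-handling and timeout concerns.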

You introduce several operational challenges with sharding. Support can be harder because the dataset footprint is more complicated. The distributed nature of a sharded dataset can make monitoring more difficult, requiring tenant-aware context, logging, and metering. Migration of tenants is non-trivial and can require more thought regarding data retention.
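One way to approach tenant-aware logging is to attach the tenant and shard to every log record. The sketch below uses the standard-library LoggerAdapter for this. The logger name saas.data, the field names, and the helper function are assumptions, not part of any AWS tooling.

```python
import logging

# Every record emitted through this logger is expected to carry
# tenant_id and shard, which the adapter below supplies.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s tenant_id=%(tenant_id)s shard=%(shard)s %(message)s")
)
logger = logging.getLogger("saas.data")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False


def tenant_logger(tenant_id: str, shard: str) -> logging.LoggerAdapter:
    """Return a logger that stamps each record with tenant and shard context."""
    return logging.LoggerAdapter(logger, {"tenant_id": tenant_id, "shard": shard})


log = tenant_logger("tenant-a", "shard-3")
log.warning("slow query: %d ms", 840)
```

With the context on every line, you can filter or aggregate logs by tenant or by shard when investigating an issue that affects only part of the dataset.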

A sharded dataset can become unbalanced over time, which can introduce database hotspots. Performance differences across shards can lead to inconsistent customer performance. Maintaining an even data distribution requires rebalancing tenants across the dataset on an ongoing basis, so the ability to migrate tenants must be part of the core architecture.
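As a sketch of what planning a rebalance might look like, the following function greedily proposes tenant moves from the most-loaded shard to the least-loaded shard until the difference is within a tolerance. The load metric (an integer size per tenant), the tolerance, and the function name are assumptions. It only produces a plan and does not perform any migration.

```python
def plan_rebalance(shards, tolerance=10, max_moves=100):
    """Return a list of (tenant, source_shard, target_shard) moves."""
    # Work on a copy so the caller's view of the current layout is untouched.
    layout = {name: dict(tenants) for name, tenants in shards.items()}
    moves = []
    for _ in range(max_moves):
        loads = {name: sum(tenants.values()) for name, tenants in layout.items()}
        src = max(loads, key=loads.get)
        dst = min(loads, key=loads.get)
        gap = loads[src] - loads[dst]
        if gap <= tolerance:
            break
        # Only tenants smaller than the gap make the pair more even when moved.
        candidates = [t for t, size in layout[src].items() if 0 < size < gap]
        if not candidates:
            break
        # Prefer the tenant whose size is closest to half the gap.
        tenant = min(candidates, key=lambda t: abs(layout[src][t] - gap / 2))
        layout[dst][tenant] = layout[src].pop(tenant)
        moves.append((tenant, src, dst))
    return moves


current = {
    "shard-1": {"t1": 50, "t2": 30, "t3": 20},
    "shard-2": {"t4": 10},
}
print(plan_rebalance(current))  # prints [('t1', 'shard-1', 'shard-2')]
```

In this example the loads start at 100 and 10. Moving t1 leaves them at 50 and 60, which is within the tolerance, so the plan stops after one move.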

It is essential to have observability into your sharded architecture. Tools like Amazon RDS Performance Insights, Enhanced Monitoring, and Amazon DevOps Guru for RDS can provide visibility into your database performance and help you identify issues that may be affecting your solution.

Finally, reverting a shard-based architecture back to an un-sharded architecture is difficult and requires considerable technical expertise, engineering resources, and program management. You should consider your decision to shard as a one-way door.

Many of these challenges are solved by Amazon Aurora Limitless Database (currently in preview). Aurora Limitless Database is a serverless deployment of Aurora that scales beyond the limits of a single instance. Unlike implementing your own application-level sharding, Aurora Limitless Database presents a single interface, so your application uses it in much the same way that it uses a single database. With Aurora Limitless Database, there is no need for the application to handle tenant routing or be aware of the topology of the cluster. Aurora Limitless Database knows the schema and key range placement, so it can route queries to the correct data access shards and aggregate the results before returning them to the application.

When to shard and how to approach sharding

The primary benefit of sharding is that it lets you scale beyond a single physical database. With modern databases continuing to grow in resources, you generally consider sharding only when other scaling approaches are no longer viable. One use case could be a tenant in a silo model who has reached the physical limits of their database and can’t scale in other ways. Alternatively, you may reach database engine performance limits and want to keep your existing partitioning model. We recommend reading Part 1 of this series to discover other ways you can improve your SaaS application’s relational database performance.

Another use case for sharding is operational efficiency at scale. When you have many tenants, you may find that managing these tenants in the silo model is as operationally complex as a sharded pool model, without the cost-efficiency afforded by the pool model. For example, managing 400 tenants across 16 database shards may be easier than managing 400 individual databases.

You may also investigate sharding for resiliency. Your database architecture may be capable of supporting many tenants. However, the impact of a database failure may be too great for your business to risk. The physical isolation of shards reduces the blast radius of any database failure, and replication can enable data availability from secondary shards during a failure event.

If you have decided to shard your dataset, you need to evaluate how and what you will shard. You may be able to break your dataset into several sets and shard only the part that has to be sharded. Assess your SaaS application functionality and tenant usage patterns to determine which pieces of your dataset make sense to shard, then separate those pieces from the rest of your dataset and shard them.

Routing database requests in a complex dataset

When implementing your own sharding solution, the application needs to know where to route requests for your database. This can be challenging when you spread your dataset across multiple database shards or partitioning models, such as a premium tier in a silo model and a standard tier in a pool model.

The application needs an index, mapping each tenant to their database instance, to track where tenant data lives. When you onboard new tenants or migrate existing tenants between shards or partitioning models, you need to update this index. You may also want a mechanism to decide which shard you will place a new tenant on.
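A minimal sketch of such an index follows, including one possible placement rule for new tenants: choose the shard with the fewest tenants that still has spare capacity. The class name, the per-shard tenant capacity, and the least-loaded rule are assumptions for illustration. A real placement decision might also weigh tenant tier or expected load.

```python
from dataclasses import dataclass, field


@dataclass
class TenantIndex:
    capacity: dict[str, int]  # shard name -> maximum number of tenants
    placements: dict[str, str] = field(default_factory=dict)  # tenant -> shard

    def tenants_on(self, shard: str) -> int:
        return sum(1 for s in self.placements.values() if s == shard)

    def onboard(self, tenant_id: str) -> str:
        """Place a new tenant on the least-loaded shard that has room."""
        if tenant_id in self.placements:
            return self.placements[tenant_id]
        open_shards = [s for s in self.capacity if self.tenants_on(s) < self.capacity[s]]
        if not open_shards:
            raise RuntimeError("no shard has spare capacity; provision a new shard")
        shard = min(open_shards, key=self.tenants_on)
        self.placements[tenant_id] = shard
        return shard

    def migrate(self, tenant_id: str, new_shard: str) -> None:
        """Update the index after a tenant's data has been moved."""
        if tenant_id not in self.placements:
            raise KeyError(tenant_id)
        if new_shard not in self.capacity:
            raise ValueError(f"unknown shard: {new_shard}")
        self.placements[tenant_id] = new_shard


index = TenantIndex(capacity={"shard-1": 2, "shard-2": 2})
index.onboard("tenant-a")  # placed on shard-1
index.onboard("tenant-b")  # placed on shard-2
```

Keeping onboarding and migration behind one small interface means the rest of the application only ever asks where a tenant lives, never how that decision was made.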

You can implement a data access manager as a helper service to manage and query this index. This hides the complexity of your dataset from developers and lets you change the dataset architecture in the future without requiring changes to your application. It is not limited to managing relational databases, and can index and map all datasets for your application.

The following figure shows an example architecture for a data access manager.

Shard Manager Diagram

In this example:

  1. We have a JSON Web Token (JWT) that is passed through our SaaS application and contains our tenant context.
  2. We pass the JWT to the data access manager, which calls a JWT manager.
  3. The JWT manager inspects the JWT and returns the tenant_id field.
  4. The data access manager uses the tenant_id to map to the correct database instance using a mapping stored in an Amazon DynamoDB table.
  5. The database details are returned to our application.
  6. Our application then connects to the correct database instance.
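The steps above can be sketched in Python as follows. This is a simplified illustration, not the architecture's actual implementation: an in-memory dictionary stands in for the DynamoDB mapping table, the host names and the DatabaseDetails type are hypothetical, and the JWT step only decodes the payload. A real JWT manager must verify the token signature before trusting any claim.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseDetails:
    host: str
    port: int
    database: str


# Stand-in for the DynamoDB mapping table (hypothetical values).
TENANT_MAPPING = {
    "tenant-a": DatabaseDetails("shard-1.example.internal", 5432, "shard_1"),
    "tenant-b": DatabaseDetails("shard-2.example.internal", 5432, "shard_2"),
}


def extract_tenant_id(jwt: str) -> str:
    """JWT manager step: read tenant_id from the payload.

    WARNING: this sketch does NOT verify the signature.
    """
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("malformed JWT")
    payload = parts[1] + "=" * (-len(parts[1]) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))["tenant_id"]


def resolve_database(jwt: str) -> DatabaseDetails:
    """Data access manager step: map the tenant to its database details."""
    tenant_id = extract_tenant_id(jwt)
    try:
        return TENANT_MAPPING[tenant_id]
    except KeyError:
        raise LookupError(f"no database mapping for tenant {tenant_id!r}") from None


def make_demo_token(claims: dict) -> str:
    """Build an UNSIGNED token for local experiments only (hypothetical helper)."""
    def encode(part: dict) -> str:
        raw = json.dumps(part).encode("utf-8")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{encode({'alg': 'none'})}.{encode(claims)}."


details = resolve_database(make_demo_token({"tenant_id": "tenant-a"}))
print(details.host)  # prints shard-1.example.internal
```

The application code only calls resolve_database and then connects with the returned details, so the mapping store can change without touching application logic.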

We use DynamoDB for the mapping table because it provides a cost-effective and scalable solution for storing the mapping data. Because this mapping data could be used by all the services and microservices within your application, it’s important that it not become a performance bottleneck. DynamoDB is well-suited to this key-value type of access pattern where only a single item is returned per query. You could extend this example architecture by adding a service_id attribute and creating a composite primary key if you had multiple services with different backing databases all using this sharded model. Additionally, you could introduce a caching layer to reduce the number of calls required to DynamoDB, because this data is expected to be fairly static.
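The caching idea can be sketched as a small time-to-live (TTL) cache keyed on the composite (tenant_id, service_id) pair. The class name, the 300-second default TTL, and the fetch callable are assumptions. In practice, fetch might wrap a DynamoDB GetItem call that uses tenant_id as the partition key and service_id as the sort key, which is one possible key design rather than a prescribed one.

```python
import time
from typing import Callable

Key = tuple[str, str]  # (tenant_id, service_id)


class CachedMapping:
    """Cache mapping lookups for a fixed TTL to reduce calls to the backing store."""

    def __init__(
        self,
        fetch: Callable[[str, str], dict],  # hypothetical lookup against the mapping table
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[Key, tuple[float, dict]] = {}

    def get(self, tenant_id: str, service_id: str) -> dict:
        key = (tenant_id, service_id)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh, skip the backing store
        value = self._fetch(tenant_id, service_id)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, tenant_id: str, service_id: str) -> None:
        """Drop an entry, for example right after migrating a tenant."""
        self._entries.pop((tenant_id, service_id), None)
```

Because the mapping changes when a tenant is migrated, the TTL bounds how long a stale entry can be served, and invalidate lets the migration process clear an entry immediately on the instance that performed the move.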

This routing complexity is abstracted away when using Aurora Limitless Database. Aurora Limitless Database includes a fleet of router instances that automatically route the query to the correct data access shards, based on the chosen shard key (for example, tenant_id). The application then only needs to provide the tenant_id as part of the query and addresses a single shard group endpoint. For more details on Aurora Limitless Database, see Join the preview of Amazon Aurora Limitless Database. The following diagram illustrates the architecture of Aurora Limitless Database.

Aurora Limitless Database

Conclusion

In this post, we explored sharding as an option for scaling your relational database inside your SaaS application and the concept of a data access manager for handling data routing.

Sharding your relational database introduces architectural and operational complexity into your SaaS application. You should consider the trade-offs of implementing a sharded architecture and make sure that it’s the best fit for your use case. The performance benefits that sharding offers may be available with alternate scaling strategies.

You can combine several scaling strategies and still decide to shard. The right decision in your SaaS journey may be to continue to operate your solution with the existing knowledge in your teams. Maintaining existing technology stacks lets you focus on other areas of growth before going into more scalable technologies in the future.

If you decide to implement sharding, you should consider a managed sharding solution such as Aurora Limitless Database to reduce application routing complexity and simplify scaling and maintenance operations.

You should thoroughly test any scaling strategy before putting it into a production environment. When designing a scaling approach, you should implement operational metrics to gain visibility into the performance of your scaling mechanism and validate that you are reaching your expected scaling goals.

About AWS SaaS Factory

AWS SaaS Factory helps organizations at any stage of the SaaS journey. Whether looking to build new products, migrate existing applications, or optimize SaaS solutions on AWS, we can help. Visit the AWS SaaS Factory Insights Hub to discover more technical and business content and best practices.

SaaS builders are encouraged to reach out to their account representative to inquire about engagement models and to work with the AWS SaaS Factory team.

About the Authors

Dave Roberts is a Senior Solutions Architect and member of the AWS SaaS Factory team where he guides AWS partners building SaaS products on AWS. When he’s not talking SaaS, he enjoys building guitar effects pedals and spending time in the forest with his family.

Josh Hart is a Principal Solutions Architect at Amazon Web Services. He works with ISV customers in the UK to help them build and modernize their SaaS applications on AWS.