AWS Storage Blog

Ten tips for multi-tenant, multi-Region object replication in Amazon S3

UPDATE (2/10/2022): Amazon S3 Batch Replication launched on 2/8/2022, allowing you to replicate existing S3 objects and synchronize your S3 buckets. See the S3 User Guide for additional details.


Independent software vendors (ISVs) want to build multi-tenant applications to benefit from more efficient resource usage as well as global reach. These applications require a durable, scalable, and highly available data storage layer. Building this layer from the ground up represents undifferentiated heavy lifting, and Amazon S3 offers a viable and effective alternative. S3 is highly durable (11 9s), but how do you configure the availability and resilience of multi-tenant datasets across multiple Regions?

Splunk, a data platform leader and an AWS Partner, developed a new capability to improve the resilience and availability of their customer environments by integrating S3 Replication. This capability seamlessly replicates customer data across multiple Regions. By relying on a managed solution, Splunk delivers on customer requirements without the complexity of building and maintaining its own custom solution.

Splunk decided to use the native S3 Replication capability for the following reasons:

  • Simplicity and ease of use in setting up and configuring replication across Regions.
  • An existing replication solution that has been proven in large-scale scenarios.
  • Faster time to market in delivering the disaster recovery (DR) solution to Splunk customers.
  • A cost-effective option that avoids the need for third-party replication software.

In this blog, we discuss data replication scenarios and tips for implementing well-architected multi-tenant, multi-Region applications backed by S3. This will help you find answers for common configuration questions, avoid operational mistakes, and follow well-architected best practices.

Use-case scenarios

ISV customers like Splunk are implementing a Software as a Service (SaaS) distribution model to deliver software to end customers. SaaS distribution models offer several benefits such as reduced time to value, lower costs (for customers and operators), and better scalability and integration.

In order to realize these benefits, SaaS providers are rethinking how they build and deploy software by introducing concepts such as:

  • Global availability in multiple AWS Regions
  • Logical isolation via the use of “tenants,” a logical grouping of application users and their data
  • Cell-based architectures and deployments to minimize the blast radius of failures

With a growing number of global deployments and compliance requirements, data replication becomes a key consideration for ISVs. Customers using an ISV’s products may have offices around the world, with end users requiring low-latency access to datasets for mission-critical decisions such as threat hunting or incident response. ISVs may be required to implement multi-Region DR processes. Or, they may simply need to migrate customer data between different cloud deployments or Regions. S3 Replication dramatically simplifies the process of keeping two buckets in sync for DR and availability.

The remainder of this section discusses three examples for data replication with Amazon S3 based on feedback from ISV customers.

Use case 1: Same-Region Replication

Same-Region Replication (SRR) enables you to replicate objects between buckets that reside in the same AWS Region. The following example uses SRR to replicate a user’s data between two cells, each with its own associated S3 bucket. SRR supports filters to restrict replication to objects with a specific prefix or object tag. For example, you can restrict replication to object prefixes that match the user or tenant ID of this user. You can also configure replication to be ongoing (creating a replica of objects in the destination bucket) or one-off (replicating a user’s data to a destination bucket and then purging objects from the source bucket). Users can then access their data in the target cell.

Same-Region Replication (SRR) enables you to replicate objects between buckets that reside in the same AWS Region
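
As a sketch of this configuration, the following replication rule restricts ongoing SRR to a single tenant's prefix. The role ARN, bucket names, rule ID, and the tenant-1234/ prefix are all illustrative:

<ReplicationConfiguration>
    <Role>arn:aws:iam::111122223333:role/s3-replication-role</Role>
    <Rule>
        <ID>tenant-1234-srr</ID>
        <Priority>1</Priority>
        <Status>Enabled</Status>
        <DeleteMarkerReplication>
            <Status>Disabled</Status>
        </DeleteMarkerReplication>
        <Filter>
            <Prefix>tenant-1234/</Prefix>
        </Filter>
        <Destination>
            <Bucket>arn:aws:s3:::cell-2-bucket</Bucket>
        </Destination>
    </Rule>
</ReplicationConfiguration>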

Use case 2: Cross-Region Replication

Cross-Region Replication (CRR) enables you to replicate objects between S3 buckets that reside in different AWS Regions. The following example uses CRR to perform a one-way replication of objects from one AWS Region (us-west-2) to another (us-east-1). Use this configuration to perform a backup of objects to the destination Region. When defining the replication rule, you can specify which storage class an object should move to in the destination Region. For example, you can move objects to the S3 Glacier storage class to lower cost or to the S3 Standard storage class for active/standby configurations.

Cross-Region Replication (CRR) enables you to replicate objects between S3 buckets that reside in different AWS Regions
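
A minimal sketch of such a rule follows, with the bucket ARN and rule ID as illustrative placeholders. The <StorageClass> element in the destination moves replicas to S3 Glacier for a lower-cost backup copy:

<Rule>
    <ID>backup-to-us-east-1</ID>
    <Priority>1</Priority>
    <Status>Enabled</Status>
    <DeleteMarkerReplication>
        <Status>Disabled</Status>
    </DeleteMarkerReplication>
    <Filter>
        <Prefix></Prefix>
    </Filter>
    <Destination>
        <Bucket>arn:aws:s3:::backup-bucket-us-east-1</Bucket>
        <StorageClass>GLACIER</StorageClass>
    </Destination>
</Rule>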

Use case 3: Two-way replication between Regions

Splunk initially targets one-way replication across two separate Regions to reduce the potential impact of outages. Two-way replication is an area of active exploration and will further enhance the speed with which customers are able to resume operations once an outage is resolved.

This example uses two-way replication rules to implement an active/active scenario. With two-way replication rules in place, changes to objects in either Region are replicated to the other. For example, creating a new object version in the primary Region also creates a corresponding version in the secondary Region. If the change occurs in the secondary Region, two-way replication makes sure it is replicated back to the primary Region. When using two-way replication, make sure to enable replica modification sync on both replication rules (source to destination and vice versa) if you want changes made to replica metadata (object tags, object access control lists, and object lock settings) to replicate to both Regions.

When using two-way replication rules, changes to objects in either Region are replicated across
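
To sketch this, each of the two rules (one per direction) would include replica modification sync in its source selection criteria. The element names follow the replication configuration schema, while the destination bucket is illustrative:

<Rule>
    ...
    <SourceSelectionCriteria>
        <ReplicaModifications>
            <Status>Enabled</Status>
        </ReplicaModifications>
    </SourceSelectionCriteria>
    <Destination>
        <Bucket>arn:aws:s3:::bucket-us-east-1</Bucket>
    </Destination>
</Rule>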

Improve security posture

Now that we have identified a number of useful use case scenarios for S3 Replication, it’s time to discuss tips and best practices for implementing data replication solutions in the context of SaaS applications.

In this first section, we discuss security posture. Grant least privilege and encrypt data at rest to improve the security posture of your multi-tenant, multi-Region object replication solution.

Tip 01: Grant least privilege

By default, objects in the source and destination S3 buckets are private. Amazon S3 needs permissions in order to replicate objects between S3 buckets. You set up permissions by creating an IAM role and specifying the role in your replication configuration. When you create the role, follow the standard security best practice of least privilege: grant only the permissions required to perform this task.

Make sure that the policy has the minimum permissions for object replication and is restricted to the source and destination prefixes using IAM conditions. Unresolved permission issues will lead to object replication failure. Set up Event Notifications to monitor and respond to failures. Use IAM Access Analyzer to validate your access policy.
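
As a sketch of least privilege, the following role policy grants only the actions S3 Replication needs, scoped to the replicated prefix. The bucket names and the tenant-1234/ prefix are illustrative:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetReplicationConfiguration", "s3:ListBucket"],
            "Resource": "arn:aws:s3:::source-bucket"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObjectVersionForReplication",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectVersionTagging"
            ],
            "Resource": "arn:aws:s3:::source-bucket/tenant-1234/*"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"],
            "Resource": "arn:aws:s3:::destination-bucket/tenant-1234/*"
        }
    ]
}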

Tip 02: Manage encryption keys

External factors such as compliance controls may require you to implement ways in which tenants can control encryption keys. Amazon S3 replication supports the following server-side encryption types:

  • Server-side encryption with Amazon S3-managed keys (SSE-S3) – each object is encrypted with a unique key and keys themselves are encrypted with a master key that regularly rotates.
  • Server-side encryption with AWS KMS keys stored in AWS Key Management Service (SSE-KMS) – enables you to define separate permissions for the use of the KMS key and provides an audit trail. You can create customer managed KMS keys or use AWS managed KMS keys that are unique to you, your service, and your Region.

To select between SSE-S3 and SSE-KMS where replication is concerned, consider who is responsible for key management in each scenario. With SSE-S3, AWS manages both the data keys and the master key, which minimizes the overhead of encryption for your solution. SSE-KMS requires you to manage the KMS key while AWS manages the data keys, which are wrapped by the KMS key. This split responsibility provides enhanced flexibility for key management, but introduces both costs for the additional AWS KMS operations and the potential to hit KMS-related Service Quotas.

Use S3 Bucket Keys to reduce the cost of Amazon S3 server-side encryption with SSE-KMS. S3 Bucket Keys reduce AWS KMS request costs by up to 99% by decreasing the request traffic from Amazon S3 to AWS KMS. This works as long as your customers do not require the use of a tenant-specific key. If you have such a requirement, consider setting up a separate S3 bucket per tenant, each with its own KMS key, so that you can still benefit from S3 Bucket Keys.

Object replication supports the SSE-S3 and SSE-KMS (both customer managed and AWS managed KMS keys) encryption methods. SSE-C, which grants the customer full control over key management and the data lifecycle, is currently not supported by S3 Replication out of the box.
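
A sketch of a rule that replicates SSE-KMS-encrypted objects and re-encrypts replicas with a KMS key in the destination Region follows; the key ARN and destination bucket are illustrative:

<Rule>
    ...
    <SourceSelectionCriteria>
        <SseKmsEncryptedObjects>
            <Status>Enabled</Status>
        </SseKmsEncryptedObjects>
    </SourceSelectionCriteria>
    <Destination>
        <Bucket>arn:aws:s3:::destination-bucket</Bucket>
        <EncryptionConfiguration>
            <ReplicaKmsKeyID>arn:aws:kms:us-east-1:111122223333:key/key-id</ReplicaKmsKeyID>
        </EncryptionConfiguration>
    </Destination>
</Rule>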

Optimize for performance

Next, we discuss performance-related tips and how to make sure your datasets remain in sync and replicate within a predictable timeframe.

Tip 03: Selective replication

Use selective replication to replicate only what you need by defining tenant-specific replication rules using either a prefix or object-tag filter. For example, you might define a prefix filter to filter by tenant, and object-tag filters to narrow down objects to be replicated.

Keep in mind that limits apply to the number of replication rules you can create. Review the following code to better understand how prefixes and tags can be used together to reduce the total number and complexity of the rules you must define. You can define a maximum of one prefix per replication rule.

<Rule>
    ...
    <Filter>
        <And>
            <Prefix>customers/</Prefix>
            <Tag>
                <Key>retention</Key>
                <Value>default</Value>
            </Tag>
            <Tag>
                <Key>access-tier</Key>
                <Value>premium</Value>
            </Tag>
            ...
        </And>
    </Filter>
    ...
</Rule>

Tip 04: Object size considerations

Follow S3 best practices for multipart uploads (MPU) for objects larger than 100 MB. Splitting larger objects into smaller chunks helps with parallelism and replication latency. Amazon S3 object sizes range from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT operation is 5 gigabytes.

A large number of very small objects can lead to a backlog of replication operations waiting to be scheduled. Prefer a moderate number of mid-sized objects (for example, around 200 megabytes each) over a single very large object or many small objects.
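
If you upload with the AWS CLI, a minimal way to apply these sizes is to set the multipart threshold and part size in the CLI's S3 configuration; the values shown are illustrative:

$ aws configure set default.s3.multipart_threshold 100MB
$ aws configure set default.s3.multipart_chunksize 200MB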

Tip 05: Enable Replication Time Control

You can use replication to copy data within or between Regions to meet regulatory requirements and to create redundancy as part of your DR plan. Enabling S3 Replication Time Control (RTC) increases the predictability of replication time and offers the following additional features:

  • Replication SLA – take advantage of the replication SLA to increase the predictability of replication time.
  • Replication metrics – monitor maximum replication time using CloudWatch metrics.
  • Replication events – track object replications that deviate from the SLA.

Replication time is influenced by object size and count, available bandwidth, and other traffic to the buckets. The replication SLA is expressed in terms of the percentage of objects that replicate within 15 minutes: RTC is designed to replicate 99.99% of objects within a 15-minute window. Subscribe to Event Notifications to track object replications that deviate from the replication SLA.
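
RTC is enabled per rule on the destination. A minimal sketch (the bucket ARN is illustrative) that enables both RTC and replication metrics:

<Destination>
    <Bucket>arn:aws:s3:::destination-bucket</Bucket>
    <ReplicationTime>
        <Status>Enabled</Status>
        <Time>
            <Minutes>15</Minutes>
        </Time>
    </ReplicationTime>
    <Metrics>
        <Status>Enabled</Status>
        <EventThreshold>
            <Minutes>15</Minutes>
        </EventThreshold>
    </Metrics>
</Destination>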

Optimize for cost

Cost is an important architecture outcome. Implement these additional tips to optimize storage cost when building multi-Region replication solutions.

Tip 06: Archive older object versions

S3 Replication requires you to enable bucket versioning on both the source and destination bucket defined in your replication rule.

Every time your application writes to an object key, Amazon S3 creates a new version of that object in the source bucket, and replication creates a corresponding version in the destination bucket. Over time, the number of versions you maintain will grow. Create and apply lifecycle rules that move older object versions to Amazon S3 Glacier to reduce storage cost.

Pay specific attention to the <Transition> and <NoncurrentVersionTransition> elements of the lifecycle rule configuration to ensure they meet your needs. You can define filters to apply different lifecycle rules to subsets of objects stored in your bucket (for example, by defining lifecycle rules based on common tenant retention requirements instead of the higher-cardinality ID property).

...
<Filter>
    <And>
        <Prefix>customers/</Prefix>
        <Tag>
            <Key>retention</Key>
            <Value>default</Value>
        </Tag>
    </And>
</Filter>
...
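
Building on that filter, here is a sketch of a complete lifecycle rule that archives noncurrent versions to S3 Glacier; the rule ID and the 30-day window are illustrative:

<LifecycleConfiguration>
    <Rule>
        <ID>archive-noncurrent-versions</ID>
        <Filter>
            <And>
                <Prefix>customers/</Prefix>
                <Tag>
                    <Key>retention</Key>
                    <Value>default</Value>
                </Tag>
            </And>
        </Filter>
        <Status>Enabled</Status>
        <NoncurrentVersionTransition>
            <NoncurrentDays>30</NoncurrentDays>
            <StorageClass>GLACIER</StorageClass>
        </NoncurrentVersionTransition>
    </Rule>
</LifecycleConfiguration>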

Tip 07: Remove incomplete multipart uploads (MPU)

Incomplete multipart uploads can accumulate after enabling S3 Replication, adding to overall storage cost. Consider multipart uploads when your object size reaches 100 MB. Parts remain in your S3 bucket until the multipart upload completes or is aborted. Use S3 Storage Lens to discover incomplete multipart uploads and gain insight into how much storage you can reclaim in your source and destination S3 buckets.
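
Lifecycle rules can also clean these up automatically. A minimal sketch (the rule ID and seven-day window are illustrative) that aborts incomplete multipart uploads bucket-wide:

<Rule>
    <ID>abort-incomplete-mpu</ID>
    <Filter>
        <Prefix></Prefix>
    </Filter>
    <Status>Enabled</Status>
    <AbortIncompleteMultipartUpload>
        <DaysAfterInitiation>7</DaysAfterInitiation>
    </AbortIncompleteMultipartUpload>
</Rule>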

Build highly resilient applications

Replicating data across multiple Regions increases the resilience of your SaaS application. Follow these tips to optimally configure your SaaS application and automate your response to prevent replication delays.

Tip 08: Handling concurrent writes

S3 Replication enables you to configure bidirectional replication rules. In this scenario, replication rules replicate object and metadata changes between two buckets in the same or different AWS Regions. Writes to either bucket are replicated to the other, allowing parallel and concurrent writes to objects stored in these buckets.

Because S3 Replication requires object versioning, replication is a non-destructive process: concurrent writes to an object are preserved and show up in the object’s version history after they have been replicated. For example, if your application writes to the same object in both Regions, the last write (based on timestamp) wins. The most recent write is retained as the current object version, whereas all previous writes are moved to the object’s version history.

In the event of concurrent writes in both Regions, use the version identifier with GET operations to retrieve a particular object version from the object’s version history.

When you request an object (using GET object) or object metadata (using HEAD object), Amazon S3 returns the x-amz-version-id header in the response:

$ aws s3api head-object --bucket source-bucket --key object-key --version-id object-version-id
{
    ...
    "ReplicationStatus": "COMPLETED",
    "VersionId": "jfnW.HIMOfYiD_9rGbSkmroXsFj3fqZ.",
    "ETag": "\"6805f2cfc46c0f04559748bb039d69ae\"",
    ...
}

Tip 09: Error handling and notifications

Monitor the replication status of objects when testing replication rules to identify any objects that failed to replicate. Non-transient replication failures are the result of configuration issues (for example, S3 Replication does not have the necessary permissions to access the source or destination S3 bucket).

You can configure S3 Event Notifications at the S3 bucket level to notify administrators of a non-transient failure. If you have enabled S3 RTC, you can also notify administrators when operations exceed the 15-minute S3 RTC threshold. Select from the events starting with s3:Replication:[Operation…] in the notification event types.
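
As a sketch, the following AWS CLI call subscribes an SNS topic to replication failure and missed-threshold events (the latter require RTC). The bucket name and topic ARN are illustrative:

$ aws s3api put-bucket-notification-configuration \
    --bucket source-bucket \
    --notification-configuration '{
        "TopicConfigurations": [{
            "TopicArn": "arn:aws:sns:us-west-2:111122223333:replication-alerts",
            "Events": [
                "s3:Replication:OperationFailedReplication",
                "s3:Replication:OperationMissedThreshold"
            ]
        }]
    }'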

As a best practice, configure S3 Inventory reports to retrieve replication status information for all objects in a given S3 bucket. This allows you to monitor whether all objects in buckets covered by S3 Replication have successfully replicated. S3 delivers inventory reports to a destination bucket that you specify. You can run S3 Inventory reports at regular intervals (such as every 24 hours) and specify the inventory scope (for example, only objects that match a particular object prefix).
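
A sketch of a daily inventory configuration that includes the ReplicationStatus field; the bucket names and the customers/ filter are illustrative:

$ aws s3api put-bucket-inventory-configuration \
    --bucket source-bucket \
    --id replication-status-report \
    --inventory-configuration '{
        "Id": "replication-status-report",
        "IsEnabled": true,
        "IncludedObjectVersions": "All",
        "Filter": {"Prefix": "customers/"},
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["ReplicationStatus"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-reports-bucket",
                "Format": "CSV"
            }
        }
    }'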

Design for operability

Finally, design your application for operability. Follow these tips to monitor replication metrics and to deliver a consistent and predictable experience to your end users.

Tip 10: Monitor replication progress

Activate replication metrics to monitor the progress of your replication rules. This helps you troubleshoot and alert on any issues that impact your customers’ service-level agreements (SLAs). Once activated, S3 Replication publishes replication metrics to Amazon CloudWatch.

You can track the following replication metrics in Amazon CloudWatch:

  • Bytes pending replication – total number of bytes pending replication
  • Replication latency – number of seconds the destination bucket is behind the source bucket
  • Operations pending replication – number of operations pending replication for a given replication rule

Use the S3 console, CloudWatch, or a third-party solution like Splunk to monitor replication metrics.
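
For example, here is a sketch of retrieving the maximum replication latency for one rule over an hour; the bucket names, rule ID, and time range are illustrative:

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/S3 \
    --metric-name ReplicationLatency \
    --dimensions Name=SourceBucket,Value=source-bucket \
                 Name=DestinationBucket,Value=destination-bucket \
                 Name=RuleId,Value=tenant-1234-crr \
    --start-time 2022-02-10T00:00:00Z \
    --end-time 2022-02-10T01:00:00Z \
    --period 300 \
    --statistics Maximum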

Conclusion

This blog summarized the challenges ISV customers face when building a durable, scalable, and highly available data storage layer for their multi-tenant, multi-Region applications. Examples include the need to replicate data within and between AWS Regions and to reduce undifferentiated heavy lifting. It then introduced 10 tips ISVs can follow to build well-architected solutions using S3 Replication. You should now be able to recognize common use cases for S3 Replication and apply best practices when integrating or configuring S3 Replication in your SaaS application.

Thanks for reading our blog post on multi-tenant, multi-Region object replication in Amazon S3. If you have any comments or questions, please don’t hesitate to leave them in the comments section.

Karsten Ploesser

Karsten is a Solutions Architect at AWS who helps customers navigate the challenges of building high-performance, scalable, and resilient cloud systems. He is the author of two white papers on compute modernization and benchmarking. He can be found cycling in the Santa Cruz Mountains when he’s not in front of a keyboard or whiteboard.

Matt Mousseau

Matt Mousseau serves as a Principal Technical Account Manager at AWS supporting ISVs throughout their lifecycle. Monolith to multi-tenant microservice, on-prem to SaaS – whatever his customers need. Outside of work you can expect to find him snowboarding or fishing with his family, visiting local breweries, or planning his next big holiday light show.

Srinivas Bobba

Srinivas Bobba is a Principal Product Manager for Splunk with a focus on new core technologies, including Federated Search and SmartStore. He's passionate about creating the latest technologies to take Splunk to the next level, and is excited to help make Splunk work seamlessly for every customer. Srini holds a MS degree in Computer Science from the University of Louisiana. Outside of work, Srini enjoys playing outdoor sports and is an avid fan of fantasy football.