AWS Storage Blog

Using Boto3 to configure Amazon S3 bucket replication at scale

Replicating your data on Amazon S3 is an effective way to meet business requirements by storing data across distant AWS Regions or across separate accounts in the same AWS Region. For example, you can minimize application latency by maintaining object copies in AWS Regions that are geographically closer to your users. Amazon S3 Replication is an elastic, fully managed, low-cost feature that replicates S3 objects between buckets, either within the same Region or across different Regions.

Replicating data using the AWS Management Console is simple: you can set up replication on a per-bucket basis, with replication rules that select source and target buckets, as well as other replication options. However, we often see customers with hundreds or thousands of buckets who need a way to automate and scale the process while ensuring consistency.

In this blog, I discuss a simple method to enable automation and to scale the S3 Replication setup process using Boto3, the AWS SDK for Python. This solution uses a spreadsheet as input to automatically create multiple S3 Replication rules for multiple buckets. I specify the replication parameters in the spreadsheet, and the solution creates the required buckets and replication rules. I use the AWS Command Line Interface (AWS CLI) to set up the permissions needed to operate, and then walk through the implementation. Among other use cases, this post shows you a way of using Amazon S3 Replication and Boto3 to increase infrastructure resilience, meet compliance requirements, minimize latency, and configure live replication between production and test accounts.

Amazon S3 and S3 Replication

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. You can organize objects stored in S3 buckets using shared name prefixes.

S3 Replication offers great flexibility and functionality in cloud storage. Amazon S3 Replication enables automatic, asynchronous copying of objects across Amazon S3 buckets. You can replicate buckets and objects in the same Region (SRR) or across different Regions (CRR). Some of the benefits of S3 Replication are protecting data against disasters, meeting compliance requirements, minimizing latency, and abiding by data sovereignty laws.

Replication template for Boto3

Boto3 makes it easy to integrate your Python application, library, or script with AWS services including Amazon S3, Amazon EC2, Amazon DynamoDB, and more.

For this solution, I use a spreadsheet template that the Python code reads to set up the parameters in the Boto3 methods. Each column on the spreadsheet relates to parameters used in the setup of the required buckets and replication rules. I use Boto3 methods to create the buckets and replication rules based on the parameters provided on the spreadsheet template.

Often, S3 buckets belong to different business units and accounts with different requirements and functions. Hence, the idea is to provide the spreadsheet template to each business unit, allowing them to make their decisions based on business, compliance, operational, and other requirements. Each business unit then uses those decisions to set up the parameters in the code. For example, there is a column in the template for replication filters (tag or prefix), another column for the target Region, and so on (explained further later in the post).

The following is an example of the spreadsheet template I use for this example:

Each column on the Excel template relates to parameters used in the setup of the required buckets and replication rules
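To make the mapping concrete, here is a minimal sketch of how one row of the template translates into replication parameters. The column names are assumed to match the template shown above, and the helper name is illustrative, not part of the original script. One detail worth highlighting: pandas reads an empty "Prefix Filter" cell as NaN (a float), so anything that is not a string is normalized to an empty string, meaning the whole bucket is replicated.

```python
# Sketch: normalize one template row (a dict of column -> cell value) into
# the parameters used later by the Boto3 calls. Column names are assumed
# from the template shown above.

def row_to_params(row):
    prefix = row.get('Prefix Filter')
    if not isinstance(prefix, str):
        prefix = ''  # NaN / empty cell means no prefix filter
    return {
        'bucket': row['Source Bucket Name'],
        'region': row['Region'],
        'storage_class': row['Target Storage Class'],
        'prefix_filter': prefix,
        'delete_marker_replication': row['DeleteMarkerReplication'],
    }

params = row_to_params({
    'Source Bucket Name': 'my-source-bucket',
    'Region': 'us-west-2',
    'Target Storage Class': 'STANDARD_IA',
    'Prefix Filter': float('nan'),  # empty cell as read by pandas
    'DeleteMarkerReplication': 'Disabled',
})
print(params['prefix_filter'])  # '' -> replicate the whole bucket
```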

Configuring permissions

Amazon S3 must have permissions to replicate objects from the source bucket to the destination bucket on your behalf. You must create an IAM role and attach a permissions policy to the role. The replication configuration and source bucket use the role, and Amazon S3 assumes it to replicate objects on your behalf. For more details, refer to the Amazon S3 User Guide.

The following is an example of a permissions policy to attach to the replication role:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "s3:GetObjectVersionForReplication",
            "s3:GetObjectVersionAcl"
         ],
         "Resource":[
            "arn:aws:s3:::source-bucket/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "s3:ListBucket",
            "s3:GetReplicationConfiguration"
         ],
         "Resource":[
            "arn:aws:s3:::source-bucket"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "s3:ReplicateObject",
            "s3:ReplicateDelete",
            "s3:ReplicateTags",
            "s3:GetObjectVersionTagging"
         ],
         "Resource":"arn:aws:s3:::destination-bucket/*"
      }
   ]
}
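The role itself can also be created programmatically. The sketch below builds the trust policy (which lets the Amazon S3 service assume the role) and a parameterized version of the permissions policy above; the helper function names and the role/policy names are illustrative. The actual create_role and put_role_policy calls are shown commented out because they require IAM permissions to run.

```python
# Sketch: build the two policy documents for a replication role. The helper
# names are illustrative; the commented-out calls are the real boto3 IAM
# API calls you would run with appropriate IAM permissions.

def replication_trust_policy():
    # Allows the Amazon S3 service to assume the replication role.
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def replication_permissions_policy(source_bucket, destination_bucket):
    # Mirrors the permissions policy shown above, parameterized by bucket name.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["s3:GetObjectVersionForReplication",
                        "s3:GetObjectVersionAcl"],
             "Resource": [f"arn:aws:s3:::{source_bucket}/*"]},
            {"Effect": "Allow",
             "Action": ["s3:ListBucket", "s3:GetReplicationConfiguration"],
             "Resource": [f"arn:aws:s3:::{source_bucket}"]},
            {"Effect": "Allow",
             "Action": ["s3:ReplicateObject", "s3:ReplicateDelete",
                        "s3:ReplicateTags", "s3:GetObjectVersionTagging"],
             "Resource": f"arn:aws:s3:::{destination_bucket}/*"},
        ],
    }

# import boto3, json
# iam = boto3.client('iam')
# iam.create_role(RoleName='ReplicationRole',
#                 AssumeRolePolicyDocument=json.dumps(replication_trust_policy()))
# iam.put_role_policy(RoleName='ReplicationRole',
#                     PolicyName='ReplicationPolicy',
#                     PolicyDocument=json.dumps(
#                         replication_permissions_policy('source-bucket',
#                                                        'source-bucket-target')))
```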

Implementation

The Python (Boto3) code reads the spreadsheet template and checks each source bucket. For each source bucket that has no existing replication configuration, it creates a target bucket with the same name as the source, appending “-target” as a suffix. After that, it creates the replication rule. The target bucket and replication rule use the parameters specified in the spreadsheet template.

The code performs the following steps:

  1. Check source buckets for an existing replication configuration and versioning status
  2. Add versioning to the source buckets (if needed)
  3. Create target bucket using parameters in the spreadsheet
  4. Create replication configuration using parameters in the spreadsheet
  5. Tag buckets

Let’s look at each step in more detail.

  1. Check source buckets for an existing replication configuration and versioning status

In this step, use the get_bucket_replication method to check whether a replication configuration already exists. If it does, skip the new replication configuration and report on it. Otherwise, check whether the bucket has versioning enabled and proceed to the next steps.

Use the get_bucket_versioning Boto3 method to check if the source bucket has versioning enabled. Versioning is required for Amazon S3 replication. When versioning is enabled, each overwrite of an existing object creates a new variant of the object in the same bucket.

# GET REPLICATION
    try:
        config = client.get_bucket_replication(Bucket=bucket)

    except botocore.exceptions.ClientError:
        # Log replication config
        logging.info(f'Bucket "{bucket}" : Replication config not enabled. Will enable versioning on source bucket and create a replication config')

        # GET VERSIONING
        client.get_bucket_versioning(Bucket=bucket)
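The get_bucket_versioning response is a dictionary whose 'Status' key is 'Enabled' or 'Suspended', and the key is absent entirely if versioning was never configured on the bucket. A small helper (illustrative, not part of the original script) makes that check explicit:

```python
# get_bucket_versioning returns a dict with 'Status' set to 'Enabled' or
# 'Suspended'; the key is missing if versioning was never configured.

def versioning_enabled(response):
    """True only when the bucket's versioning status is 'Enabled'."""
    return response.get('Status') == 'Enabled'

print(versioning_enabled({'Status': 'Enabled'}))    # True
print(versioning_enabled({'Status': 'Suspended'}))  # False
print(versioning_enabled({}))                       # False: never configured
```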
  2. Add versioning to the source buckets (if needed)

Use the put_bucket_versioning method to enable versioning on the source bucket if it is not already enabled.

 client.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={
                'Status': 'Enabled'
            },
        )
  3. Create target bucket using parameters in the spreadsheet

Use the create_bucket method to create the target bucket. In this request, I define the Region for the target bucket to satisfy distance and residency requirements.

Also, use the put_public_access_block method to ensure that all public access is blocked on the target bucket.

In addition, enable versioning on the target bucket.

 # CREATE TARGET BUCKET
        logging.info('Creating Target Bucket')

        client.create_bucket(
            Bucket=f'{bucket}-target',
            CreateBucketConfiguration={
                'LocationConstraint': region,
            },
        )

        logging.info(f'Blocking Public Access for the target bucket "{bucket}-target" ')
        client.put_public_access_block(
            Bucket=f'{bucket}-target',
            PublicAccessBlockConfiguration={
                'BlockPublicAcls': True,
                'IgnorePublicAcls': True,
                'BlockPublicPolicy': True,
                'RestrictPublicBuckets': True
            },
        )

        logging.info(f'Creating versioning for the target bucket "{bucket}-target" ')
        client.put_bucket_versioning(
            Bucket=f'{bucket}-target',
            VersioningConfiguration={
                'Status': 'Enabled'
            },
        )
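One caveat worth noting for the create_bucket call above: the S3 API does not accept 'us-east-1' as a LocationConstraint; for that Region, the CreateBucketConfiguration argument must be omitted entirely. A small helper (illustrative, not part of the original script) can build the keyword arguments accordingly and be invoked as client.create_bucket(**kwargs):

```python
# create_bucket rejects a LocationConstraint of 'us-east-1'; for that Region
# CreateBucketConfiguration must be omitted. This sketch builds the kwargs.

def create_bucket_kwargs(bucket_name, region):
    kwargs = {'Bucket': bucket_name}
    if region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs

print(create_bucket_kwargs('my-bucket-target', 'us-east-1'))
print(create_bucket_kwargs('my-bucket-target', 'eu-west-1'))
```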
  4. Create replication configuration using parameters in the spreadsheet

Next, use the remaining parameters in the spreadsheet in the put_bucket_replication method. This is the method that creates the actual replication rule.

  • Define the storage class to use for the target bucket: S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, Amazon S3 Glacier, or Amazon S3 Glacier Deep Archive.
  • Decide whether to replicate the whole bucket or to filter by tag or prefix.
  • Enable delete marker replication in case of an active-active architecture.
  • You can add an additional layer of protection by changing the object ownership, which protects against bad actors and compromised IAM credentials. Refer to the documentation on changing the replica owner for more details.
  • You can also replicate across accounts by choosing to replicate into a second account, which protects against compromise of the AWS account root user. Refer to the documentation on configuring S3 Replication for source and destination buckets owned by different accounts for more details.
  • Refer to the S3 Replication Time Control (S3 RTC) documentation if you need a predictable replication time backed by a Service Level Agreement (SLA).
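The template in this post filters by prefix only, but the Filter element of a replication rule can also select objects by a single tag, or combine a prefix and tags inside an And block. The helper below is a sketch (the name is illustrative) showing the three shapes the Filter element can take:

```python
# Sketch: build the Filter element of a replication rule. S3 accepts a
# Prefix filter, a Tag filter, or a combination wrapped in an 'And' block.

def build_filter(prefix=None, tag_key=None, tag_value=None):
    tag = {'Key': tag_key, 'Value': tag_value} if tag_key else None
    if prefix and tag:
        return {'And': {'Prefix': prefix, 'Tags': [tag]}}
    if tag:
        return {'Tag': tag}
    return {'Prefix': prefix or ''}

print(build_filter(prefix='logs/'))
print(build_filter(tag_key='replicate', tag_value='true'))
print(build_filter(prefix='logs/', tag_key='replicate', tag_value='true'))
```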

At this point, all the parameters are defined and I can create the replication rule.

 client.put_bucket_replication(
            Bucket=bucket,
            #Modify the entry below with your account and the replication role you created
            ReplicationConfiguration={
                'Role': 'arn:aws:iam::000000000000:role/ReplicationRole',
                'Rules': [
                    {
                        'Priority': 1,
                        'Filter': {
                            'Prefix': prefix_filter,
                        },
                        'Destination': {
                            'Bucket': f'arn:aws:s3:::{bucket}-target',
                            'StorageClass': storageClass,
                        },
                        'Status': 'Enabled',
                        'DeleteMarkerReplication': {'Status': deleteMarkerReplication},
                    },

                ],
            },
        )
  5. Tag buckets

As a final step, we can add a “replicated” tag to the source bucket to categorize it, using the s3.BucketTagging class.
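A minimal sketch of that tagging step follows, assuming a tag named “replicated”. One caveat: BucketTagging.put() replaces the bucket's entire tag set, so any existing tags should be merged in first. The merge helper is illustrative and runs offline; the actual Boto3 call is shown commented out.

```python
# Sketch of the tagging step. BucketTagging.put() replaces the bucket's
# whole tag set, so existing tags must be merged in before the call.

def merge_tag(existing_tags, key, value):
    """Return a tag set with key=value added or updated (pure helper)."""
    tags = [t for t in existing_tags if t['Key'] != key]
    tags.append({'Key': key, 'Value': value})
    return tags

tag_set = merge_tag([{'Key': 'team', 'Value': 'storage'}], 'replicated', 'true')
print(tag_set)

# s3 = boto3.resource('s3')
# s3.BucketTagging(bucket).put(Tagging={'TagSet': tag_set})
```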

Implementation example

The following example provides the complete code to implement the solution. It uses the template file CRR.xlsx with the format and Boto3 methods discussed previously.

# -*- coding: utf-8 -*-
import boto3
import botocore
import pandas as pd
import logging
logging.basicConfig(filename='example.log',filemode='w',level=logging.INFO)
file = 'CRR.xlsx'
df = pd.read_excel(file)
s3 = boto3.resource('s3')
client = boto3.client('s3')
for i in df.index:
    bucket = df['Source Bucket Name'][i]
    region = df['Region'][i]
    storageClass = df['Target Storage Class'][i]
    prefix_filter = df['Prefix Filter'][i]
    if not isinstance(prefix_filter,str):
        prefix_filter = ''
    deleteMarkerReplication = df['DeleteMarkerReplication'][i]
    existingObjectReplication = df['Existing Object Replication'][i]
    # GET REPLICATION
    try:
        config = client.get_bucket_replication(Bucket=bucket)
    except botocore.exceptions.ClientError:
        # Log replication config
        logging.info(f'Bucket "{bucket}" : Replication config not enabled. Will enable versioning on source bucket and create a replication config')
        # GET VERSIONING
        client.get_bucket_versioning(Bucket=bucket)
        logging.info(f'Bucket "{bucket}" : enabling versioning')
        client.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={
                'Status': 'Enabled'
            },
        )
        # CREATE TARGET BUCKET
        logging.info('Creating Target Bucket')
        client.create_bucket(
            Bucket=f'{bucket}-target',
            CreateBucketConfiguration={
                'LocationConstraint': region,
            },
        )
        logging.info(f'Blocking Public Access for the target bucket "{bucket}-target" ')
        client.put_public_access_block(
            Bucket=f'{bucket}-target',
            PublicAccessBlockConfiguration={
                'BlockPublicAcls': True,
                'IgnorePublicAcls': True,
                'BlockPublicPolicy': True,
                'RestrictPublicBuckets': True
            },
        )
        logging.info(f'Creating versioning for the target bucket "{bucket}-target" ')
        client.put_bucket_versioning(
            Bucket=f'{bucket}-target',
            VersioningConfiguration={
                'Status': 'Enabled'
            },
        )
        # CREATE REPLICATION CONFIGURATION
        logging.info(f'Inserting replication for the bucket "{bucket}" ')
        client.put_bucket_replication(
            Bucket=bucket,
            #Modify the entry below with your account and the role you created
            ReplicationConfiguration={
                'Role': 'arn:aws:iam::000000000000:role/ReplicationRole',
                'Rules': [
                    {
                        'Priority': 1,
                        'Filter': {
                            'Prefix': prefix_filter,
                        },
                        'Destination': {
                            'Bucket': f'arn:aws:s3:::{bucket}-target',
                            'StorageClass': storageClass,
                        },
                        'Status': 'Enabled',
                        'DeleteMarkerReplication': {'Status': deleteMarkerReplication},
                    },
                ],
            },
        )
        print(bucket)
    else:
        logging.info(f'Bucket "{bucket}" already had replication')
    config = client.get_bucket_replication(Bucket=bucket)
    logging.info(f'bucket config "{config}" ')

Validation

Validate the replication rules created by checking the log file the code writes, or by verifying the replication rules under the Management tab of the source buckets in the S3 console.

You can also use the AWS CLI to check the replication configuration created. Use the following syntax, where bucketname is the name of each source bucket:

aws s3api get-bucket-replication --bucket bucketname

Cleaning up

If you followed along and would like to delete resources used in this solution to avoid incurring any unwanted future charges, use the following AWS CLI steps to delete the target bucket and replication rule.

  • Delete the target bucket, where bucket-target is the name of the target bucket (the --force flag deletes the objects in the bucket before removing the bucket itself):
    aws s3 rb s3://bucket-target --force
  • Remove the replication rule, where source-bucket-name is the name of the source bucket:
    aws s3api delete-bucket-replication --bucket source-bucket-name

Conclusion

Amazon S3 Replication is a powerful tool to address business and compliance requirements by providing an automatic mechanism to make identical copies of your objects at another location. S3 Replication enables you to replicate objects while retaining metadata, replicate objects to more cost-effective storage classes, maintain object copies under a different account, and replicate your objects within 15 minutes. Boto3 makes it easy to integrate your Python application, library, or script with AWS services including Amazon S3. In this blog, I covered how you could scale the Amazon S3 Replication setup by using the AWS SDK for Python: Boto3. By using Boto3, you can set up a large number of replication rules and ensure that you use a standardized process.

The solution in this post can help you automate and scale the S3 Replication setup process. S3 Replication can be used to increase operational efficiency, minimize latency, abide by data sovereignty laws, set up log aggregation, and address many other use cases.


Thanks for reading this blog post! If you have any questions or suggestions, please leave your feedback in the comments section. If you need any further assistance, contact your AWS account team or a trusted AWS Partner.

Virgil Ennes


Virgil Ennes is a Specialty Sr. Solutions Architect at AWS. Virgil enjoys helping customers take advantage of the agility, costs savings, innovation, and global reach that AWS provides. He is mostly focused on Storage, AI, Blockchain, Analytics, IoT, and Cloud Economics. In his spare time, Virgil enjoys spending time with his family and friends, and also watching his favorite football club (GALO).