AWS Storage Blog

Managing duplicate objects in Amazon S3

When managing a large volume of data in a storage system, it is common for data duplication to happen. Data duplication in data management refers to the presence of multiple copies of the same data within your system, leading to additional storage usage as well as extra overhead when handling multiple copies of the same file. To improve data management and cost efficiency, data deduplication is the practice of identifying and removing duplicate data through various methods, such as merging and purging, while making sure that the integrity of data is not compromised.

In Amazon Simple Storage Service (Amazon S3), duplicate objects can exist within the same bucket, whether due to accidental file duplication, data synchronization processes, or backup operations. For less critical data, such as temporary log files, users might prefer to retain just one copy if a process happens to generate multiple files with the exact same content to optimize their storage costs.

In this post, I discuss how you can initiate your own data deduplication process for objects stored within an S3 bucket. We identify duplicate objects in your bucket using Amazon Athena, validate that the duplicates can be removed, and delete them using AWS Lambda and S3 Batch Operations. This helps you reduce storage costs for objects with duplicate content without having to manually pick out the objects to be deleted.

Identifying duplicate objects

We curate a list of duplicated objects using the entity tag (ETag) of the objects. The ETag is a hash of the object that reflects changes only to the contents of an object, not its metadata. For objects that are plaintext or encrypted with SSE-S3, and that are not uploaded with the Multipart Upload or Part Copy operations, the ETag is an MD5 digest of the object data. MD5 is a cryptographic hash algorithm that produces a consistent 128-bit hash value from an input. By comparing the MD5 digests of these objects, we can identify duplicated objects in a single bucket even if they have different keys or are located in separate prefixes.
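As a quick illustration of this property, the following sketch (not part of the walkthrough itself) compares the ETag reported by Amazon S3 with an MD5 digest computed locally; the bucket name and key are placeholder values you would replace with your own.

import hashlib

import boto3

s3 = boto3.client("s3")

# Placeholders: replace with your own bucket and key.
bucket, key = "amzn-s3-demo-bucket", "logs/app-2023-12-01.log"

# Download the object and compute its MD5 digest locally.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
local_md5 = hashlib.md5(body).hexdigest()

# Amazon S3 returns the ETag wrapped in double quotes, so strip them before comparing.
etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')

print(f"ETag: {etag}, local MD5: {local_md5}, match: {etag == local_md5}")

For a plaintext or SSE-S3 object uploaded with a single PUT, the two values should match; a mismatch is a sign that the object falls outside the scope described in the next paragraph.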

This solution uses Amazon S3 Inventory to provide a list of all the objects and their metadata in your target S3 bucket. Then, it uses Amazon Athena to query the list to identify the duplicate objects. After reviewing the results from Athena and making sure that the listed objects are deemed safe for deletion, you use S3 Batch Operations to invoke an AWS Lambda function that performs large-scale bulk deletion on the objects. This solution focuses on buckets that do not have versioning enabled, but it can also work for buckets that have versioning enabled with a few tweaks that are mentioned later in the post.

Note that this solution applies only to plaintext objects or objects encrypted with SSE-S3 that are created by the PUT Object, POST Object, or Copy operations. Since January 2023, all new object uploads to Amazon S3 are automatically encrypted with at least SSE-S3, and this solution is applicable to those objects. Because the ETag is not necessarily the same for duplicate objects encrypted with SSE-KMS or SSE-C keys, we cannot use the ETag to identify duplicate data, and those objects are outside the scope of this post. Similarly, objects uploaded through the Multipart Upload or Part Copy operations do not have an ETag that can be used for data deduplication, so they are also outside the scope of this post.

Solution overview

We identify duplicate objects within your S3 bucket using Athena to query the S3 Inventory report. Then, these objects are deleted using S3 Batch Operations through a Lambda function after you review that they are appropriate for deletion.

Figure 1: Architecture diagram of solution that queries S3 Inventory report with Amazon Athena to identify duplicated objects to delete with S3 Batch Operations invoking a Lambda function

Prerequisites

The following prerequisites are needed to continue with this post.

  1. A target S3 bucket without versioning enabled.
  2. Appropriate AWS Identity and Access Management (IAM) permissions to generate S3 Inventory reports, create tables and views in Athena, and configure S3 Batch Operations jobs. The following is a sample IAM policy with the minimum permissions needed, in addition to the AWS managed policy AWSLambda_FullAccess.
{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "S3Actions",
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:DeleteObject",
          "s3:PutObject",
          "s3:PutBucketPolicy",
          "s3:GetBucketPolicy",
          "s3:ListBucket",
          "s3:GetBucketLocation",
          "s3:GetInventoryConfiguration",
          "s3:PutInventoryConfiguration"
        ],
        "Resource": [
          "arn:aws:s3:::<target-bucket>",
          "arn:aws:s3:::<target-bucket>/*"
        ]
      },
      {
        "Sid": "CreateS3BatchOperationsJob",
        "Effect": "Allow",
        "Action": [
          "s3:CreateJob",
          "s3:ListAllMyBuckets",
          "s3:ListJobs",
          "s3:DescribeJob",
          "s3:UpdateJobStatus"
        ],
        "Resource": "*"
      },
      {
        "Sid": "Lambda",
        "Effect": "Allow",
        "Action": [
          "iam:CreateRole",
          "iam:CreatePolicy",
          "iam:AttachRolePolicy",
          "iam:ListPolicies",
          "iam:CreatePolicyVersion"
        ],
        "Resource": [
          "arn:aws:iam::111122223333:policy/service-role/AWSLambdaBasicExecutionRole*",
          "arn:aws:iam::111122223333:role/service-role/DeleteS3Objects*"
        ]
      },
      {
        "Sid": "S3BatchOperations",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::111122223333:role/<iam-batch-ops-role>"
      },
      {
        "Sid": "RunAthenaQueries",
        "Effect": "Allow",
        "Action": [
          "athena:GetWorkGroup",
          "athena:StartQueryExecution",
          "athena:StopQueryExecution",
          "athena:GetQueryExecution",
          "athena:GetQueryResults",
          "glue:GetTables",
          "glue:GetTable",
          "glue:CreateTable",
          "glue:GetDatabase",
          "glue:GetDatabases"
        ],
        "Resource": "*"
      }
    ]
  }

Walkthrough

We set up the needed resources in this architecture with the following steps:

  1. Configure S3 Inventory report
  2. Query S3 Inventory report using Athena
  3. Create the Lambda function to delete a single object
  4. Configure S3 Batch Operations Job to delete objects

Step 1. Configure S3 Inventory report

S3 Inventory allows you to generate a list of objects and their associated metadata for a particular S3 bucket that you can use to query and manage your objects. We use this feature to generate a list of objects so that you can compare their ETag value.

  1. In the Amazon S3 management console, select the target bucket storing the duplicate objects and navigate to the Management tab.
  2. Under the Inventory configurations section, select Create inventory configuration.
  3. Set the configuration name and inventory scope, limiting it to the desired prefix level if needed. Leave the default of Current version only for object versions. For a versioning-enabled bucket, you can select Include all versions if you want to remove duplicates across all versions of objects in the bucket.
  4. Set the Destination bucket and apply the provided sample policy to the destination bucket policy. You can use a bucket different from the target bucket for ease of management.
  5. Select Frequency, Format, and Enable the inventory report. Choose Daily or Weekly for the frequency. For the output format, you can choose from CSV, Apache ORC, and Apache Parquet. Using one of the columnar formats can result in better query performance, especially if your target S3 bucket has millions of objects. This post uses Apache ORC as the output format.
  6. Under Additional metadata fields, select Size, Last modified, and Encryption under the Object section, and ETag under the Data integrity section.

Figure 2: Selection of additional metadata fields for S3 Inventory report

7. Select Create to finish the configuration. The first report lands in your destination bucket within the next 48 hours.
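If you prefer to script this step, the equivalent configuration can be created with the AWS SDK. The following boto3 sketch mirrors the console settings above; the bucket names, account ID, and configuration ID are placeholder assumptions, and you still need to apply the sample policy to the destination bucket so that Amazon S3 can deliver the report.

import boto3

s3 = boto3.client("s3")

# Placeholders: replace with your own buckets and account ID.
TARGET_BUCKET = "amzn-s3-demo-source-bucket"
DEST_BUCKET = "amzn-s3-demo-inventory-bucket"
ACCOUNT_ID = "111122223333"

s3.put_bucket_inventory_configuration(
    Bucket=TARGET_BUCKET,
    Id="inventorycheck",
    InventoryConfiguration={
        "Id": "inventorycheck",
        "IsEnabled": True,
        # Use "All" instead of "Current" for versioning-enabled buckets.
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # The metadata fields used by the Athena queries in Step 2.
        "OptionalFields": ["Size", "LastModifiedDate", "ETag", "EncryptionStatus"],
        "Destination": {
            "S3BucketDestination": {
                "AccountId": ACCOUNT_ID,
                "Bucket": f"arn:aws:s3:::{DEST_BUCKET}",
                "Format": "ORC",
            }
        },
    },
)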

Step 2. Query S3 Inventory report using Athena

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. It is serverless, with no infrastructure to manage, and you pay only for the queries that you run. You use it to query the generated S3 Inventory report and identify files that are duplicates and can be deleted. By default, Athena does not support directly querying objects stored in the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes, so archived objects are ignored and not queried. To query archived objects, you must restore them first and then run queries on the restored data. Refer to this blog post for further guidance on how to query archived data in your bucket using Amazon Athena.

  1. Open Amazon Athena in your console and select Query editor on the left sidebar. Make sure you are in the same AWS Region as your S3 buckets.
  2. Run the first query in the Editor tab as follows. This creates the table schema for the S3 Inventory report in the Athena data catalog, which is used when you run queries. If your report is in Parquet or CSV format, then refer to the AWS documentation for how to create the table schema accordingly. For versioning-enabled buckets, include an extra column for Version ID.

Replace <S3 Inventory Location> with the S3 Inventory location configured in Step 1. For example, if my bucket name is “s3-duplicatebucket” and the inventory configuration name is “inventorycheck”, then to use the report generated on December 12, 2023, the location is “s3://s3-duplicatebucket/inventorycheck/hive/dt=2023-12-13-01-00/”.

If there are objects in the bucket that are encrypted with SSE-KMS or SSE-C, then filter out these objects by creating another view and select only the objects with SSE-S3 or NOT-SSE as their encryption status. Query from the created view instead of the table in the following steps.

CREATE EXTERNAL TABLE s3inventory(
         bucket string,
         key string,
         size bigint,
         last_modified_date timestamp,
         etag string,
         encryption_status string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION '<S3 Inventory Location>'

3. After the table schema has been created, the next query creates a view that lists all the ETags that appear more than once in the S3 Inventory report. Run this query.

CREATE VIEW duplicate_etag AS 
SELECT DISTINCT etag FROM "default"."s3inventory" 
GROUP BY etag 
HAVING COUNT(key) > 1

4. To decide which duplicated object to retain, we keep the object with the latest last modified date among the objects that share the same ETag and object size. In other words, each object is compared on both its ETag and its size before it is marked as a duplicate. This query identifies the latest last modified date for each distinct ETag and size combination.

CREATE VIEW etag_max_date as 
SELECT etag, size, MAX(last_modified_date) as max_date from "default"."s3inventory"
GROUP BY etag, size

5. This last query generates the list of duplicated objects, leaving out those that match etag_max_date, which are the objects to be retained in the bucket. The LEFT JOIN combined with the IS NULL filter keeps only the rows that do not match a retained object. For versioning-enabled buckets, include the version ID in the selection query.

SELECT s3inventory.bucket, s3inventory.key
FROM (s3inventory
LEFT JOIN etag_max_date 
ON ((s3inventory.etag = etag_max_date.etag) AND (s3inventory.last_modified_date = etag_max_date.max_date) AND (s3inventory.size = etag_max_date.size)))
WHERE (etag_max_date.max_date IS NULL)

  6. Download the results, review the keys that are listed, and make sure they are safe to delete. Reviewing the listed objects is important because they cannot be retrieved once they are permanently deleted in the later steps. Remove the header row, and save the results as a CSV in your preferred S3 bucket to be used later by S3 Batch Operations. A scripted alternative for this step is sketched below.
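If you prefer to automate this step, the following boto3 sketch is one way to do it: it runs the final query, waits for it to finish, and rewrites the result without its header row as a manifest CSV for S3 Batch Operations. The database name, result location, and manifest location are placeholder assumptions.

import csv
import io
import time

import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

# Placeholders: adjust to your environment.
RESULTS_BUCKET = "amzn-s3-demo-athena-results"
RESULTS_PREFIX = "queries/"
MANIFEST_BUCKET = "amzn-s3-demo-manifest-bucket"
MANIFEST_KEY = "manifests/duplicates.csv"

QUERY = """
SELECT s3inventory.bucket, s3inventory.key
FROM (s3inventory
LEFT JOIN etag_max_date
ON ((s3inventory.etag = etag_max_date.etag)
    AND (s3inventory.last_modified_date = etag_max_date.max_date)
    AND (s3inventory.size = etag_max_date.size)))
WHERE (etag_max_date.max_date IS NULL)
"""

# Start the query and poll until it completes.
execution_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": f"s3://{RESULTS_BUCKET}/{RESULTS_PREFIX}"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
if state != "SUCCEEDED":
    raise RuntimeError(f"Athena query ended in state {state}")

# Athena writes its CSV result to <output location>/<execution ID>.csv.
raw = s3.get_object(
    Bucket=RESULTS_BUCKET, Key=f"{RESULTS_PREFIX}{execution_id}.csv"
)["Body"].read().decode("utf-8")

# Drop the header row and save the remaining bucket,key rows as the manifest.
rows = list(csv.reader(io.StringIO(raw)))[1:]
out = io.StringIO()
csv.writer(out).writerows(rows)
s3.put_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY, Body=out.getvalue())

Review the generated manifest just as you would the downloaded results before moving on to the deletion steps.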

Step 3. Create the Lambda function to delete a single object

Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. To delete each object, we create a Lambda function that issues a DeleteObject API call for the key passed to it.

  1. Head to the Lambda console page in the same AWS Region.
  2. After selecting Dashboard on the side navigation bar, select Create function.
  3. Leave the mode as Author from scratch and enter DeleteS3Objects as the function name. Choose the latest version of Python as the Runtime and leave everything else as default. Select Create function.
  4. In the function page, insert the following code into the code source. This code deletes any key that is passed through the function. For versioning-enabled buckets, you must pass the version ID of the object through the DeleteObject API call to permanently remove the object version.
import logging
from urllib import parse
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
logger.setLevel("INFO")

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """
    Permanently deletes the object.
    SPECIFIED OBJECT VERSION WILL BE PERMANENTLY REMOVED

    :param event: The S3 batch event that contains the key of the object
                  to remove.
    :param context: Context about the event.
    :return: A result structure that Amazon S3 uses to interpret the result of the
             operation. When the result code is TemporaryFailure, S3 retries the
             operation.
    """
    # Parse job parameters from Amazon S3 batch operations
    logger.info(event)

    invocation_id = event["invocationId"]
    invocation_schema_version = event["invocationSchemaVersion"]

    results = []
    result_code = None
    result_string = None

    task = event["tasks"][0]
    task_id = task["taskId"]

    try:
        obj_key = parse.unquote_plus(task["s3Key"], encoding="utf-8")
        bucket_name = task["s3BucketArn"].split(":")[-1]

        logger.info("Got task: remove object %s.", obj_key)

        try:
            s3.delete_object(
                Bucket=bucket_name, Key=obj_key
            )
            result_code = "Succeeded"
            result_string = (
                f"Successfully removed object {obj_key}"
            )
            logger.info(result_string)
        except ClientError as error:
            # Mark request timeout as a temporary failure so it will be retried.
            if error.response["Error"]["Code"] == "RequestTimeout":
                result_code = "TemporaryFailure"
                result_string = (
                    f"Attempt to delete {obj_key} timed out."
                )
                logger.info(result_string)
            else:
                raise
            
    except Exception as error:
        # Mark all other exceptions as permanent failures.
        result_code = "PermanentFailure"
        result_string = str(error)
        logger.exception(error)
    finally:
        results.append(
            {
                "taskId": task_id,
                "resultCode": result_code,
                "resultString": result_string,
            }
        )
    return {
        "invocationSchemaVersion": invocation_schema_version,
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": invocation_id,
        "results": results,
    }

5. Select Deploy to save the changes to the function.
6. Apply the appropriate permissions for the Lambda function to execute the delete operation. Navigate to Configuration, then Permissions. Select the Lambda service role name to open up a tab in the IAM console. Add an additional policy that allows the Lambda function to perform the delete operation on the target bucket. Replace <target-bucket> with your bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ObjectActions",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject"
            ],
            "Resource:": "arn:aws:s3:::<target-bucket>/*"
        }
    ]
}

7. Your Lambda function is now ready for use.
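Before attaching the function to a job, you can exercise it with a test event shaped like the payload that S3 Batch Operations sends. The following is a minimal sketch of such an event; the invocation ID, task ID, bucket ARN, and key are made-up placeholder values.

# A minimal sketch of an S3 Batch Operations invocation payload, usable as a
# Lambda test event (paste it as JSON in the console). All values are placeholders.
test_event = {
    "invocationSchemaVersion": "1.0",
    "invocationId": "example-invocation-id",
    "job": {"id": "example-job-id"},
    "tasks": [
        {
            "taskId": "example-task-id",
            # Keys arrive URL-encoded; the function decodes them with unquote_plus.
            "s3Key": "logs/app-2023-12-01.log",
            "s3VersionId": None,
            "s3BucketArn": "arn:aws:s3:::amzn-s3-demo-source-bucket",
        }
    ],
}

# Invoking the handler with this event permanently deletes the referenced object,
# so point it at a disposable test object first.
# lambda_handler(test_event, None)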

Step 4. Configure S3 Batch Operations to delete objects

S3 Batch Operations is a data management feature that lets you manage billions of objects at scale by creating a job with a list of objects and the action to perform. Common actions would be to copy objects between buckets or restore archived objects from the S3 Glacier storage classes. In this case, we use S3 Batch Operations to invoke the Lambda function you just created on each object that is specified in the CSV results downloaded from Athena in Step 2.

  1. Head to the Amazon S3 management console and select Batch Operations on the sidebar. Select Create job.
  2. Make sure you are in the correct AWS Region. Select CSV for Manifest format and include the path for the file created earlier from the Athena results.
  3. For versioning-enabled buckets, select the box for Manifest includes version IDs. Otherwise, leave it unchecked. Choose Next.
  4. Choose Invoke AWS Lambda function and select the DeleteS3Objects Lambda function from the same AWS Region that you deployed in Step 3. Choose Next.
  5. Give your job a description, set its priority level, and choose a report type and its location.
  6. Specify the S3 Batch Operations IAM role. You can use the provided IAM role policy template and IAM trust policy and assign them to a new IAM role created specifically for S3 Batch Operations, so it has the appropriate permissions to delete the objects in the list. Select Next.
  7. Review the job details and select Create job.
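As an alternative to the console, you can create the same job with the AWS SDK. The following boto3 sketch shows one possible configuration; the account ID, Region, role ARN, function ARN, manifest location, manifest ETag, and report bucket are placeholders to replace with your own values.

import boto3

s3control = boto3.client("s3control")

# Placeholders: replace with your own account, role, function, and manifest details.
ACCOUNT_ID = "111122223333"
BATCH_OPS_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/S3BatchOpsDeleteRole"
LAMBDA_ARN = f"arn:aws:lambda:us-east-1:{ACCOUNT_ID}:function:DeleteS3Objects"
MANIFEST_ARN = "arn:aws:s3:::amzn-s3-demo-manifest-bucket/manifests/duplicates.csv"
MANIFEST_ETAG = "example-manifest-etag"  # ETag of the manifest object
REPORT_BUCKET_ARN = "arn:aws:s3:::amzn-s3-demo-report-bucket"

response = s3control.create_job(
    AccountId=ACCOUNT_ID,
    ConfirmationRequired=True,  # the job waits for your confirmation before it runs
    RoleArn=BATCH_OPS_ROLE_ARN,
    Priority=10,
    Description="Delete duplicate objects identified with Athena",
    Operation={"LambdaInvoke": {"FunctionArn": LAMBDA_ARN}},
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],  # add "VersionId" for versioning-enabled buckets
        },
        "Location": {"ObjectArn": MANIFEST_ARN, "ETag": MANIFEST_ETAG},
    },
    Report={
        "Bucket": REPORT_BUCKET_ARN,
        "Prefix": "batch-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "FailedTasksOnly",
    },
)
print(response["JobId"])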

The job now transitions into the Preparing state, in which S3 Batch Operations reads the job’s manifest, checks it for errors, and calculates the number of objects. Depending on the size of the manifest, this can take minutes or hours. Once ready, the job moves into the Awaiting your confirmation to run state.

  8. Once the job is in that state, select the job, check the details, and select Run job when ready.

Upon completion, a completion report is generated in the location that you specified during the creation of the job. Your duplicate objects are now deleted.

If you experience failures, then you can investigate by checking the completion report or the Lambda function’s Amazon CloudWatch logs to identify which objects failed to delete. Using the report, you can filter the failed objects into a new CSV manifest and run another S3 Batch Operations job to retry them, as sketched below.
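As one possible way to build that retry manifest, the following sketch reads a completion report CSV and writes the failed tasks back out as a new manifest. The report location is a placeholder, and the column layout (bucket, key, version ID, task status, followed by error details) is an assumption to verify against your own report.

import csv
import io

import boto3

s3 = boto3.client("s3")

# Placeholders: the completion report to read and the retry manifest to write.
REPORT_BUCKET = "amzn-s3-demo-report-bucket"
REPORT_KEY = "batch-reports/job-id/results/report.csv"
RETRY_BUCKET = "amzn-s3-demo-manifest-bucket"
RETRY_KEY = "manifests/retry.csv"

report = s3.get_object(Bucket=REPORT_BUCKET, Key=REPORT_KEY)["Body"].read().decode("utf-8")

retry_rows = []
for row in csv.reader(io.StringIO(report)):
    # Assumed layout: bucket, key, version ID, task status, ...; verify against your report.
    if len(row) >= 4 and row[3].strip().lower() != "succeeded":
        retry_rows.append([row[0], row[1]])

out = io.StringIO()
csv.writer(out).writerows(retry_rows)
s3.put_object(Bucket=RETRY_BUCKET, Key=RETRY_KEY, Body=out.getvalue())
print(f"Wrote {len(retry_rows)} failed objects to s3://{RETRY_BUCKET}/{RETRY_KEY}")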

Things to know

Although deleting objects can help you save on storage costs, it is important to review the list of objects and make sure that they can be deleted, because it is not possible to retrieve the objects once they are deleted. As this approach relies on the MD5 hash, there is a small possibility that objects with different content produce the same MD5 hash. To decrease this likelihood, the Athena queries we reviewed match on both the ETag and the size of objects before determining that they are duplicates. However, it is still crucial to review the object keys before running S3 Batch Operations on the results.

Even though this solution focuses on buckets without versioning, with a few tweaks mentioned in the post, it can also work for buckets with versioning enabled. You must include the version ID in S3 Inventory report generation, edit the Athena query to reflect the version ID in the results, and pass the version ID through the Delete API in the Lambda function.

If you have objects in the bucket that are encrypted with either SSE-KMS or SSE-C, then you can filter these objects out in the Athena query by only selecting objects that have SSE-S3 or NOT-SSE as their encryption status.

There are costs associated with the resources used in this solution. Athena is priced at $5 per TB of data scanned (for example, scanning a 1 GB inventory report costs well under one cent at that rate), Lambda is priced according to its request and compute charges, and S3 Inventory generation and S3 Batch Operations are charged based on the number of objects and jobs run. Estimate these charges to check whether the expected storage savings are substantial enough to justify running this solution on your bucket.

Cleaning up

Clean up the S3 bucket containing the S3 Inventory report and the S3 Batch Operations completion report created during the previous steps to avoid incurring additional storage charges. You should also delete the S3 Inventory configuration to stop generating reports for your target S3 bucket. To avoid accidentally invoking the Lambda function that was set up, delete the function. Lastly, drop the Athena table and views that were created to finish cleaning up the resources.

Conclusion

In this post, I covered how you can delete duplicate objects in Amazon S3 by using Amazon Athena to identify them and S3 Batch Operations to invoke an AWS Lambda function that deletes them. This allows you to reduce object duplication at the bucket level to optimize storage costs and clean up multiple identical objects for improved query performance.

Thank you for reading this post and I look forward to any questions or feedback in the comments section if you try this out.