AWS Storage Blog

Efficiently verify Amazon S3 data at scale with compute checksum operation

Organizations across industries must regularly verify the integrity of their stored datasets to protect valuable information, satisfy compliance requirements, and preserve trust. Media and entertainment customers validate assets to make sure that content remains intact, financial institutions run integrity checks to meet regulatory obligations, and research institutions confirm the reproducibility of scientific results. These verifications are critical for maintaining audit trails and chain of custody, but they are often resource-intensive. Traditionally, customers needed to download entire datasets or restore archived content, consuming significant compute resources, time, and bandwidth. The challenge only grows with large assets such as 4K and 8K footage, medical imaging files, or billions of archived objects. This creates operational complexity and high costs at scale.

Amazon S3 now enables users to perform these integrity checks directly in the storage layer. With the compute checksum option in Amazon S3 Batch Operations, users can generate checksums for up to billions of objects without downloading or restoring data, regardless of object size or storage class. This capability makes it easier to confirm data integrity, maintain compliance, and streamline content validation workflows. Users can reduce the operational burden and costs of traditional verification approaches while gaining stronger confidence in the long-term durability of their data.

In this post, we explain how compute checksums work, the algorithms and options available, and how users can integrate this feature into their fixity and data verification workflows. We also explore the benefits of full-object and composite checksums, discuss trade-offs between cryptographic and CRC algorithms, and walk through how to initiate a job and generate reports that support compliance, audit, and preservation needs.

Compute checksum using S3 Batch Operations

A checksum is a short, fixed-length value generated from the content of a file or object. In the context of data integrity, checksums are used to verify that media assets remain unchanged during storage, transfer, or processing. Even a single-bit error in the file produces a different checksum, making it clear when corruption or tampering has occurred. When users create a new compute checksum operation job, they can choose from several supported algorithms, such as SHA-1, SHA-256, MD5, CRC32, CRC32C, and CRC64NVME. These algorithms support both full-object and composite checksum calculation methods. In this post, we use the full-object checksum approach, which delivers comprehensive end-to-end integrity verification through a single hash value, similar to traditional on-premises methods, making the validation process more straightforward.

Different algorithms offer different trade-offs. Cryptographic hash functions such as MD5, SHA-1, and SHA-256 are widely used in the media and entertainment industry, as well as in digital preservation, because they provide strong guarantees against both accidental corruption and deliberate modification. Users benefit from being able to continue using algorithms already integrated into their existing applications. However, these functions operate sequentially, processing data one block at a time. As files grow into 4K, 8K, or high-frame-rate content, this sequential processing makes cryptographic hashes more computationally expensive and time-consuming.
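To make this concrete, the following is a minimal Python sketch (standard library only, using hypothetical in-memory content rather than a real media asset) that computes a SHA-256 checksum and shows how a single-bit change produces a completely different value:

import base64
import hashlib

# Hypothetical content standing in for a media asset
original = b"frame-0001: example video payload"
digest = hashlib.sha256(original).digest()
print("SHA-256 (hex):   ", digest.hex())
print("SHA-256 (base64):", base64.b64encode(digest).decode())

# Flip a single bit and recompute: the checksum changes completely
corrupted = bytearray(original)
corrupted[0] ^= 0x01
print("After 1-bit flip:", hashlib.sha256(bytes(corrupted)).hexdigest())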

For workflows that need faster performance, users can select CRC algorithms. These algorithms offer significant speed advantages because they support parallelization, which enables multiple data segments to be processed simultaneously. This reduces verification times when compared to cryptographic hashes, especially for large objects.

After users submit the compute checksum operation job, S3 calculates the checksum of each object listed in the manifest and generates a report containing the checksum value of each object. Users can use this report to track verification results, confirm data integrity across their media library, and maintain audit records as part of an ongoing fixity process.

Prerequisites

There are a few prerequisites you need before getting started:

  1. Make sure that you have the necessary permissions to create a job in S3 Batch Operations. For more information, visit Granting permissions for Batch Operations.
  2. Have or create an AWS Identity and Access Management (IAM) role to use with S3 Batch Operations. For more information, visit IAM roles.
  3. Have or create three general purpose S3 buckets: one bucket for the assets, one for the S3 Batch Operations manifest, and one for the S3 Batch Operations reports. In this post, we refer to these S3 buckets as assets-bucket, manifest-bucket, and report-bucket.
  4. Make sure that you have access to Amazon Athena from your account.

Walkthrough

This section walks you through how to verify the content of stored datasets:

  1. Create and upload a manifest to the manifest-bucket
  2. Configure correct permissions on the IAM role assumed by S3 Batch Operations
  3. Create the S3 Batch Operations job
  4. Locate the S3 Batch Operations completion reports
  5. Analyze the S3 Batch Operations completion reports

1. Creating and managing the manifest

This section walks through creating the manifest and uploading it to Amazon S3.

1.1. Creating the manifest

A manifest is an S3 object that contains the object keys for which you want to calculate the checksum, and it must be in the form of an Amazon S3 Inventory report or a CSV file. In this post, you use a CSV file. The manifest must contain the bucket name, object key, and optionally, the object version for each object. Each line of the manifest should look like the following:

bucket-name,object-name,<version-id>

1.2. Uploading the manifest

When you have created your manifest, upload it to the manifest-bucket.
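If you prefer to build and upload the manifest programmatically, the following is a minimal boto3 sketch that lists the objects in the assets bucket, writes one CSV line per object, and uploads the result. It assumes the assets-bucket and manifest-bucket names used in this post, unversioned objects (so the version ID column is omitted), and a hypothetical manifest key:

import boto3

s3 = boto3.client("s3")

# Build one CSV line per object: bucket-name,object-key
# Note: object keys containing special characters should be URL-encoded.
lines = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="assets-bucket"):
    for obj in page.get("Contents", []):
        lines.append(f"assets-bucket,{obj['Key']}")

# Upload the manifest CSV to the manifest bucket
s3.put_object(
    Bucket="manifest-bucket",
    Key="manifests/checksum-manifest.csv",
    Body=("\n".join(lines) + "\n").encode("utf-8"),
)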

2. Setting up IAM permissions

S3 must have permissions to perform S3 Batch Operations on your behalf. You grant these permissions through an IAM role. For more information, visit Granting permissions for Batch Operations. This section walks through creating the IAM policy and the IAM role that S3 Batch Operations assumes.

2.1. Create the IAM policy

Depending on the type of operation, you need to grant a different set of permissions to the IAM role that S3 Batch Operations assumes to run the job. Regardless of the operation, S3 needs permissions to read your manifest object from your S3 bucket and to write a report to your bucket. For the compute checksum operation, you also need to include permission to read the objects for which you want to compute checksums.

The entire policy looks like the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ChecksumPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::<assets-bucket>/*"
            ]
        },
        {
            "Sid": "ManifestPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<manifest-bucket>/*"
            ]
        },
        {
            "Sid": "ReportPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<report-bucket>/*"
            ]
        }
    ]
}

If the objects you want to compute checksums for are encrypted with AWS Key Management Service (AWS KMS), then your IAM policy must also include the kms:Decrypt and kms:GenerateDataKey permissions.
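As an illustration only, the additional statement might look like the following, with placeholder values that you would replace with the ARN of the KMS key used to encrypt your objects:

{
    "Sid": "KmsPolicy",
    "Effect": "Allow",
    "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey"
    ],
    "Resource": [
        "arn:aws:kms:<region>:<account-id>:key/<key-id>"
    ]
}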

To create the IAM policy, navigate to IAM in the AWS Management Console. On the left menu, choose Policies, and then Create policy.

IAM console with boxes around Policies and Create policy

At Step 1 – Specify permissions, choose JSON, replace the existing policy with the new policy, and choose Next. Remember to replace the placeholder values for the bucket names with the correct names.

IAM console where you specify permissions when creating a new IAM policy

In Step 2 – Review and create, enter a name for the IAM policy, for example ComputeChecksumPolicy, and choose Create policy.

IAM console where you review and create a new IAM policy

In the next step, you attach this IAM policy to the IAM role you create.

2.2. Creating the IAM role

To create the IAM role, navigate to IAM in the console. On the left menu, choose Roles, and then Create role.

IAM console with boxes around Roles and Create role

At Step 1 – Select trusted entity, choose AWS service, choose Amazon S3 from the Service or use case dropdown list, choose S3 Batch Operations as Use case, and choose Next.

IAM console where you specify the trust policy when creating a new IAM role

At Step 2 – Add permissions, search for the policy that you created in the previous step, choose the checkbox next to the policy, and choose Next.

IAM console where you specify permissions when creating a new IAM role

At Step 3 – Name, review, and create, enter a name for the IAM role, for example ComputeChecksumRole, and choose Create role.

S3 Batch Operations assumes this IAM role when calculating the checksum.

3. Creating and running the S3 Batch Operations job

In this section, we walk through the steps to create the S3 Batch Operations job.

Navigate to S3 in the console. On the left menu, choose Batch Operations, and then Create job.

S3 console with boxes around Batch Operations and Create job

At Step 1 – Choose manifest, choose CSV as the Manifest format, enter the S3 path of the manifest that you uploaded in Step 1 (or browse to the manifest location), and choose Next.

S3 console where you choose the manifest for a new S3 Batch Operations job

At Step 2 – Choose operation, choose Compute checksum – new from the Operation type dropdown list, choose Full object as the Checksum type, choose CRC64NVME from the Checksum algorithm dropdown list, and choose Next.

S3 console where you choose the operation for a new S3 Batch Operations job

At Step 3 – Configure additional options, leave the default values in the Additional options section. In the Completion report section, acknowledge the warning that the completion report is necessary, choose This account as Completion report destination account, and enter the S3 path where you want to store the completion report (or browse to the S3 location).

S3 console where you configure the completion report for a new S3 Batch Operations job

In the Permissions section, select Choose from existing IAM role, choose the IAM role that you have created at Step 2, and choose Next.

S3 console where you choose the IAM role for a new S3 Batch Operations job

Review all settings at Step 4 – Review and choose Create job to create the compute checksum job.

4. Locating the checksum report

In this section, we walk through monitoring the S3 Batch Operations job status and locating the S3 Batch Operations completion reports.

4.1. Monitoring the S3 Batch Operations job status

After you create a job, it progresses through a series of statuses. First, S3 processes the manifest object and other job parameters to set up and run the job. Then, the job transitions to the Awaiting your confirmation to run status and waits for you to run it.

To run the job, choose the job that you just created, and choose Run job.

S3 console with boxes around the Job ID checkbox and the Run job button

On the following page, review the job configuration, and choose the Run job button at the bottom of the page.

S3 console where you review the S3 Batch Operation job and then run the job

At this point the job changes to Ready and automatically moves to Active when S3 begins to run it. The amount of time that a job remains in the Ready state depends on whether you have higher-priority jobs running already and how long those jobs take to complete. Now S3 is calculating the checksum of the objects listed in the manifest. While a job is Active, you can monitor its progress using the S3 console.

S3 console with a box highlighting the Active status of the S3 Batch Operation job

A job moves out of the Active state when it completes successfully or fails. When the job has finished calculating the checksum of all objects in the manifest, the report becomes available.

S3 console with a box highlighting the Completed status of the S3 Batch Operation job
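If you prefer to confirm and monitor the job programmatically instead of using the console, the following is a minimal boto3 sketch using the S3 Control API. The account ID and job ID are placeholders that you would replace with your own values:

import time
import boto3

s3control = boto3.client("s3control")
account_id = "111122223333"   # placeholder AWS account ID
job_id = "<your-job-id>"      # job ID shown in the S3 console

# Confirm the job so it moves out of "Awaiting your confirmation to run"
s3control.update_job_status(
    AccountId=account_id,
    JobId=job_id,
    RequestedJobStatus="Ready",
)

# Poll the job status until it completes or fails
while True:
    job = s3control.describe_job(AccountId=account_id, JobId=job_id)["Job"]
    print(job["Status"], job.get("ProgressSummary", {}))
    if job["Status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(30)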

4.2. Locating the S3 Batch Operations completion reports

To locate the completion reports, choose the job that you created and note its Job ID. Scroll down to the Completion report section, and choose the Completion report destination link.

S3 console with boxes around the Job ID and the Completion report destination

Now find the folder named job-<jobId> and choose it to open the folder.

S3 console where you can copy the URI of the folder containing the completion reports

Choose the checkbox for the folder named results/ and choose the Copy S3 URI button. Save the S3 URI in your text editor because you need it in the next step.

S3 console where you can copy the URI of the folder containing the completion reports
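You can also list the completion report objects programmatically. The following is a minimal boto3 sketch, assuming the report-bucket name used in this post, that the completion report destination is the root of that bucket, and a placeholder job ID:

import boto3

s3 = boto3.client("s3")

# List the report objects written under the results/ prefix of the job folder
prefix = "job-<jobId>/results/"   # replace <jobId> with your job ID
response = s3.list_objects_v2(Bucket="report-bucket", Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])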

5. Analyzing the S3 Batch Operations completion reports

In this section, we walk through understanding the completion reports schema, creating a table in Athena, analyzing the completion reports, and generating a flat CSV file with the checksum report.

5.1. Understanding the completion reports schema

S3 Batch Operations completion reports contain the following columns:

  • Bucket
  • Key
  • VersionId
  • TaskStatus
  • ErrorCode
  • HTTPStatusCode
  • ResultMessage

The ResultMessage column contains the checksum data as a JSON object with the following structure:

{
    "checksum_base64": "HuAT0qdvFVs=",
    "etag": "b0f0be06b455717c169c93fdcce8a10e-2",
    "checksumAlgorithm": "CRC64NVME",
    "checksumType": "FULL_OBJECT",
    "checksum_hex": "1EE013D2A76F155B"
}
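Note that checksum_base64 and checksum_hex are two encodings of the same value. A quick Python check using the example values above confirms this:

import base64

checksum_base64 = "HuAT0qdvFVs="
checksum_hex = "1EE013D2A76F155B"

# Decode the base64 value and compare its hexadecimal representation
decoded_hex = base64.b64decode(checksum_base64).hex().upper()
print(decoded_hex)                  # 1EE013D2A76F155B
print(decoded_hex == checksum_hex)  # True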

For analyzing the completion reports, you can use Amazon Athena, an interactive query service that streamlines data analysis in S3 using standard SQL.

5.2. Creating a table in Athena

To analyze the completion reports, first create a table named checksum in Athena. To create the table, navigate to the Athena Query editor in the console, and run the following query. Remember to replace <S3 URI> with the URI you noted in your text editor previously.

CREATE EXTERNAL TABLE checksum(
    bucket string,
    key string,
    version_id string,
    task_status string,
    error_code string,
    https_status_code string,
    result_message string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"',
    'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION '<S3 URI>';

Athena Query editor console where you can run the SQL query to analyze the completion reports

When the query completes, you should see a table named checksum. You can expand the table to observe the schema.

Athena Query editor console showing the schema of the checksum table

5.3. Analyzing the completion reports

To verify access to the completion reports, you can run the following query:

SELECT * FROM checksum LIMIT 10;

Athena Query editor console showing a sample SQL query

You can count the number of failed compute checksum tasks by running the following query:

SELECT COUNT (error_code) AS "Count of errors" FROM checksum WHERE error_code != '200';

To troubleshoot the failed compute checksum tasks, you can run the following query:

SELECT key, result_message FROM checksum WHERE error_code != '200';

The result_message column contains details regarding the failure.

To select only the successful compute checksum tasks, you can run the following query:

SELECT key, result_message FROM checksum WHERE error_code = '200';

The ResultMessage column contains the compute checksum information stored as a JSON object with the following structure:

{
    "checksum_base64": "HuAT0qdvFVs=",
    "etag": "b0f0be06b455717c169c93fdcce8a10e-2",
    "checksumAlgorithm": "CRC64NVME",
    "checksumType": "FULL_OBJECT",
    "checksum_hex": "1EE013D2A76F155B"
}

Athena Query editor console showing a SQL query to select only successful Compute checksum tasks

5.4. Generating a flat CSV file with the checksum report

In some cases, you might need to parse the JSON string within the ResultMessage column and expand its key-value pairs into new columns in the table. To do so, you can run the following query:

SELECT
  key,
  json_extract_scalar(result_message, '$.checksum_base64') AS checksum_base64,
  json_extract_scalar(result_message, '$.etag') AS etag,
  json_extract_scalar(result_message, '$.checksumAlgorithm') AS checksum_algorithm,
  json_extract_scalar(result_message, '$.checksumType') AS checksum_type,
  json_extract_scalar(result_message, '$.checksum_hex') AS checksum_hex
FROM checksum
WHERE error_code = '200';

As a result, the compute checksum information is no longer stored as a JSON object; instead, each key-value pair has been expanded into a dedicated column. If you need to further analyze the data on your local computer, then you can download the query results as a CSV file by choosing the Download results CSV button.

Athena Query editor console showing a SQL query to expand the checksum key-value pairs into new columns in the table

Conclusion

The compute checksum operation in Amazon S3 Batch Operations helps users verify the content of datasets at scale. This capability works with any object stored in Amazon S3, regardless of size or storage tier, with predictable costs through standard S3 Batch Operations job execution and data processing fees.

Creating checksum verification jobs is straightforward. Provide a manifest of objects to verify, and S3 Batch Operations automatically generates detailed integrity reports, eliminating manual verification processes. This means that you can verify billions of objects without storage retrieval fees or operational overhead from third-party services. If you are managing large-scale datasets across multiple storage classes, then the compute checksum operation makes content verification direct and cost-effective.

Fabio Lattanzi

Fabio is a Sr. Solutions Architect focused on Amazon S3. He enjoys helping customers build the most durable, scalable, performant, and cost-effective storage solutions for their use case. He is based in Utah, loves traveling with his wife, taking photographs, and playing drums.

Yanko Bolanos

Yanko Bolanos is a Sr. Solutions Architect enabling customers to successfully run production workloads on AWS. With over 19 years of experience in media & entertainment, telco, gaming, and data analytics, he is passionate about driving innovation in cloud and technology solutions. Prior to AWS, Yanko applied his cross-disciplinary tech and media expertise while serving as a director leading R&D Engineering and Ad Engineering teams.

Nitish Pandey

Nitish Pandey is a Technical Product Manager at Amazon S3, specializing in data integrity and S3 Glacier archive storage. With a passion for solving complex problems, he has experience in building enterprise, AI/ML, and consumer products that delight customers. Outside of work, he enjoys exploring nature, delving into stoic philosophy, and staying informed on tech and geopolitics.