AWS Storage Blog

Enabling and validating additional checksums on existing objects in Amazon S3

Verifying data integrity during a data migration or data transfer is a data durability best practice that ensures your data is error-free and not corrupted when it reaches its destination. One way to verify the integrity of your data is through checksums, which you can think of as a digital fingerprint that lets you detect any change to data during transmission or at rest. Many media and entertainment organizations, as well as government agencies, use checksums for end-to-end data integrity verification and to maintain an evidential chain of custody.

Amazon S3 checksums let you validate the integrity of your data during uploads to and downloads from Amazon S3, making it easier to comply with the digital preservation best practices specific to your industry.

In this blog, we dive into how to add additional checksums to existing Amazon S3 objects, how to add additional checksums to newly created objects that are uploaded without enabling additional checksums, and how to validate a local file against the checksum information computed by Amazon S3 to verify that the two copies match.

With this guidance, you can confidently verify the integrity of data transferred to and stored in Amazon S3 for its entire lifetime. This capability is important because you can verify that assets are not altered when copied, accelerate integrity checking of data, and confirm that every byte is transferred without alteration, allowing you to maintain end-to-end data integrity.

Adding and validating checksums for existing Amazon S3 objects

Amazon S3 recently added new checksum capabilities that let customers choose between four hashing algorithms: SHA-1, SHA-256, CRC-32, and CRC-32C. S3 is the first cloud storage service to support such a range of checksum algorithms, giving you the flexibility to pick the algorithm that best suits your business needs. With these capabilities, customers can increase the performance and reduce the cost of data integrity checks during uploads and downloads to Amazon S3, and verify that assets are not altered when copied. The checksum calculated at upload is preserved with the object for its entire lifespan in S3 and is used by the service to validate the object during operations such as lifecycle actions, replication, and storage class transitions. Customers can also use checksums to compare their on-premises data with the objects stored in Amazon S3.

This AWS News blog explains how to use the new checksum feature when uploading content via the Amazon S3 console. Customers who want to automate the upload process at scale can refer to this blog on building scalable checksums, which explains how to build a custom uploader function that transfers data using the new checksum feature. However, many customers have already built solutions that ingest data using services like AWS Transfer Family and AWS Storage Gateway, or tools like the AWS Command Line Interface (AWS CLI) with the s3 cp command. Many of these solutions do not yet support the new checksum algorithms, so in this blog we will share how you can add the new checksums to objects after they have been uploaded.
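For reference, here is a minimal sketch of what uploading an object with an additional checksum looks like when using the AWS SDK for Python (Boto3); the bucket name, key, and file name are placeholders for illustration.

import boto3

s3 = boto3.client("s3")

# Upload a file and ask S3 to calculate and store a SHA-256 additional checksum.
# The bucket, key, and file names below are placeholders.
with open("report.pdf", "rb") as f:
    response = s3.put_object(
        Bucket="example-bucket",
        Key="data/report.pdf",
        Body=f,
        ChecksumAlgorithm="SHA256",
    )

# The response echoes the checksum that S3 calculated and stored with the object.
print(response["ChecksumSHA256"])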

As mentioned in the intro, in this blog post we focus on three key areas:

  1. Adding checksums to existing objects in Amazon S3.
  2. Integrating an automated mechanism to add checksums to subsequent uploads that do not specify a checksum.
  3. Verifying the checksums for objects in Amazon S3 against the original on-premises data.

And with that, let’s get started.

Part 1: Adding checksums to existing objects in Amazon S3

There are two scenarios for adding checksums to data that was uploaded without specifying a checksum algorithm: objects that already exist in Amazon S3, covered in this part, and objects uploaded from now on, covered in Part 2. In both cases we make use of the copy-object operation to add checksums to each object.
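To illustrate what this copy does, here is a minimal sketch using the AWS SDK for Python (Boto3); the bucket and key are placeholders, and objects larger than 5 GB would need a multipart copy instead of a single CopyObject call.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key for illustration.
bucket = "example-bucket"
key = "data/filename.img"

# Copy the object onto itself so that S3 recalculates and stores
# an additional SHA-256 checksum for it.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    ChecksumAlgorithm="SHA256",
)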

Part 1a: Prepare environment

To prepare the environment, we create a new bucket and upload a file via the AWS Management Console. This is just to create a mock example, so we can retrospectively add checksums to the objects.

  1. Create a bucket in Amazon S3.

a. Sign in to the AWS Management Console and access the S3 console.
b. Then, select Create bucket.
c. In the Bucket name box of the General configuration section, type a bucket name.

    • The bucket name you choose must be unique among all existing bucket names in Amazon S3. One way to help ensure uniqueness is to prefix your bucket names with the name of your organization. Bucket names must comply with certain rules. For more information, view the bucket restrictions and limitations in the Amazon S3 user guide.

d. Select an AWS Region.
e. Navigate down the page and keep all other default selections as they are.
f. Choose the Create bucket button. When Amazon S3 successfully creates your bucket, the console displays your empty bucket in the Buckets panel.

  2. Create a folder.

a. In the Buckets section, choose the name of the new bucket that you just created.
b. Then, in the Objects section, select Create folder.
c. In the Folder section, in the Folder name area, name the new folder data.
d. Keep the Server-side encryption selection as Amazon S3-managed keys (SSE-S3) and then select Create folder.

  3. Upload some data files to the new Amazon S3 bucket.

a. In the Objects section, select the name of the data folder.
b. Then, select the Upload button.
c. On the Upload page, select Add files.
d. Follow the Amazon S3 console instructions to upload local files.
e. Once files have been uploaded, choose the Upload button.
f. You will then see an Upload: status page indicating whether your upload completed successfully. If it succeeded, select the Close button.

Once you have uploaded data into the new bucket, select the object in the Objects section of the S3 console you have been working in. To select the object, hover over the object name and select the name itself. Note that selecting only the radio button next to the file name will not open the object's detail page.

Amazon S3 console showing a red arrow pointing to the newly uploaded object.

You will then be presented with another page with additional specifics on your object. Navigate down to the Additional checksums section and you will see Additional checksums are set to off.

Amazon S3 console checksum off
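If you prefer to script this environment preparation rather than use the console, a minimal sketch using the AWS SDK for Python (Boto3) might look like the following; the bucket name, Region, and file name are placeholders and should be replaced with your own values.

import boto3

region = "eu-west-2"                 # placeholder Region
bucket = "s3-integrity-demo"         # placeholder; bucket names must be globally unique

s3 = boto3.client("s3", region_name=region)

# Create the bucket (LocationConstraint is required outside us-east-1).
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Upload a local file under the data/ prefix without explicitly requesting
# an additional checksum algorithm, mirroring the console upload above.
s3.upload_file("filename.img", bucket, "data/filename.img")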

Part 1b: Adding checksums to existing objects

We are now going to set up the resources we'll use during the rest of the walkthrough: an IAM role and an AWS Lambda function that we'll use to add checksums to objects in Amazon S3.

Run:

git clone https://github.com/aws-samples/amazon-s3-checksum-verification
cd amazon-s3-checksum-verification

This downloads the code to your local machine. Follow this documentation to deploy the AWS CloudFormation stack defined in s3-checksum.yaml. Following the documentation steps will take you to the AWS CloudFormation console. When you reach the section for specifying stack parameters, enter the name you gave your bucket into the S3Bucket parameter.
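If you would rather deploy the stack from code than from the console, a hedged Boto3 sketch of the equivalent call is shown below. The stack name and IAM capabilities are assumptions, and the template may define additional parameters (such as the checksum algorithm referenced later), so check its parameter list before deploying.

import boto3

cfn = boto3.client("cloudformation")

# Read the template from the cloned repository.
with open("s3-checksum.yaml") as f:
    template_body = f.read()

# S3Bucket is the parameter named in the walkthrough; the stack name and
# IAM capabilities below are assumptions for illustration.
cfn.create_stack(
    StackName="s3-checksum",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "S3Bucket", "ParameterValue": "s3-integrity-demo"},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)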

To add checksums to all our existing Amazon S3 objects, we use Amazon S3 Batch Operations. For this walkthrough, we create a CSV manifest file for Amazon S3 Batch Operations to use.

  1. Create a CSV file with one line per object, in the format bucketname,objectkey. For example:

s3-integrity-demo,data/filename.img

  2. Upload the CSV file to the root of the bucket you created in Part 1a above. (Alternatively, you can generate the manifest with a short script, as shown after the note below.)

Note: If you are doing this at scale, we suggest you follow the documentation to set up Amazon S3 Inventory reports to automatically create the manifest file for you. Amazon S3 Inventory is one of the tools Amazon S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. You can also simplify and speed up business workflows and big data jobs using Amazon S3 Inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. You may need to wait up to 24 hours for the report to generate.
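For a small number of objects, you can also generate the manifest programmatically instead of writing it by hand. Here is a minimal Boto3 sketch, assuming the bucket name and data/ prefix used in this walkthrough and a manifest object name of manifest.csv (both are placeholders).

import boto3

s3 = boto3.client("s3")
bucket = "s3-integrity-demo"   # placeholder bucket name

# Build a "bucket,key" line for every object under the data/ prefix.
rows = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="data/"):
    for obj in page.get("Contents", []):
        rows.append(f"{bucket},{obj['Key']}")

# Upload the manifest to the root of the bucket for S3 Batch Operations to use.
s3.put_object(
    Bucket=bucket,
    Key="manifest.csv",
    Body="\n".join(rows).encode("utf-8"),
)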

Once you have a manifest file (either the CSV we created above or an Amazon S3 Inventory manifest), we create an Amazon S3 Batch Operations job using the steps below.

  1. Sign in to the AWS Management Console and open the Amazon S3 console.
  2. When in the Amazon S3 console, choose Batch Operations on the left navigation pane. Then, choose the Create job button. Next, choose the AWS Region where you want to create your job.
  3. In the Manifest section, under Manifest format, choose the type of manifest you are using. If you choose CSV, enter the path to your CSV-formatted manifest object in the Manifest object section; the manifest must follow the format described in the console (for further details, see the documentation on supported formats). If you choose S3 inventory report, enter the path to the manifest.json object that Amazon S3 generated as part of the inventory report. In either case, you can optionally provide a Manifest object version ID if you want to use a version other than the most recent. Then, choose Next.
  4. Complete the form as follows:

a. Under Operation, select Copy.
b. For the Destination in the Copy destination area, enter the bucket you created earlier along with the data prefix, e.g., s3://your-bucket-name/data.
c. Check I acknowledge that existing objects with the same name will be overwritten.
d. Under the Additional checksums section, select Replace with a new checksum and choose your algorithm of choice, e.g., SHA256. Then, select Next.

  5. Uncheck Generate completion report.
  6. Select Choose from existing IAM roles and select the role containing S3BatchRole…
  7. For Review, verify the settings. If you need to make changes, choose Previous. Otherwise, choose Create job.

Once the job has been created, select it and choose Run job. Upon job completion, you can validate the results by selecting the object you originally uploaded in the Amazon S3 console. Here you will be able to see the object has additional checksums enabled, which algorithm has been used to calculate the checksum, and what checksum value was computed.
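If you are automating this at scale, the same job can be created through the S3 Batch Operations API. The following Boto3 sketch mirrors the console settings above; the IAM role ARN and manifest object name are placeholders, and the completion report is disabled to match the walkthrough.

import boto3

s3 = boto3.client("s3")
s3control = boto3.client("s3control")
sts = boto3.client("sts")

bucket = "s3-integrity-demo"          # placeholder bucket name
manifest_key = "manifest.csv"         # placeholder manifest object name
role_arn = "arn:aws:iam::123456789012:role/S3BatchRole-example"  # placeholder
account_id = sts.get_caller_identity()["Account"]

# The manifest location requires the manifest object's ETag.
etag = s3.head_object(Bucket=bucket, Key=manifest_key)["ETag"].strip('"')

s3control.create_job(
    AccountId=account_id,
    ConfirmationRequired=True,        # run the job manually, as in the console steps
    Priority=10,
    RoleArn=role_arn,
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": f"arn:aws:s3:::{bucket}",   # copy in place
            "ChecksumAlgorithm": "SHA256",
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": f"arn:aws:s3:::{bucket}/{manifest_key}",
            "ETag": etag,
        },
    },
    Report={"Enabled": False},
)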

Amazon S3 console with additional checksums enabled, displaying a base64-encoded SHA256 checksum.

Part 2: Automatically enabling checksums on newly uploaded objects

To add checksums to newly uploaded data, we configure an Amazon S3 event to invoke our previously created Lambda function for any objects that are added to the bucket.

  1. Sign in to the AWS Management Console and open the Amazon S3 console.
  2. In the Buckets section, select the name of the bucket that you want to enable events for. Note, you must select the bucket name directly instead of the radio button. Then, choose Properties.
  3. Navigate down the page to the Event notifications section and choose Create event notification.
  4. In the General configuration section, specify a descriptive Event name for your event notification. Enter the folder name you created for the Prefix, e.g., data.
  5. In the Event types section, select Put and Post. These selections ensure that checksums are added only to new and updated Amazon S3 objects.
  6. Navigate further down the page to the Destination section. Keep the defaults in that section as is. In the Lambda function drop down menu, choose the Lambda function containing ChecksumLambda…

Amazon S3 specify lambda for event section, showing Lambda function selected as the destination

  7. Select Save changes.

Now when an object is uploaded to Amazon S3, the Lambda function will perform a copy-object operation, during which it adds the additional checksum based on the algorithm specified when deploying the CloudFormation stack in Part 1b above. You can test this by uploading a new file to the data folder and checking that checksums are successfully enabled.
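To illustrate the pattern, here is a minimal sketch of what such an event-driven handler might look like. It is not the actual function deployed by the CloudFormation stack; it assumes SHA256 as the chosen algorithm and objects no larger than 5 GB (larger objects would need a multipart copy).

import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Skip objects that already carry an additional checksum so the
        # function does not re-copy the object it just wrote.
        attrs = s3.get_object_attributes(
            Bucket=bucket, Key=key, ObjectAttributes=["Checksum"]
        )
        if "Checksum" in attrs:
            continue

        # Copy the object onto itself so S3 calculates and stores the checksum.
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            ChecksumAlgorithm="SHA256",
        )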

Part 3: Validating local files match the hashes generated by Amazon S3

We have now completed two procedures to apply additional checksums.

  1. Adding checksums to existing objects using a batch process.
  2. Adding an automated mechanism to add checksums to newly ingested data uploaded without specifying a checksum.

Now, we will cover how the checksum information can be used to validate the data in Amazon S3 against the local copy of our data.

In this example, we use a simple Python script to query the checksum details from Amazon S3 and then perform the same calculations against our local files to validate that the data matches.

  1. Clone the code from GitHub and enter the repository:
git clone https://github.com/aws-samples/amazon-s3-checksum-verification
cd amazon-s3-checksum-verification
  2. Install the Python packages the script needs:
pip install -r requirements.txt
  3. Configure AWS CLI credentials following this guide.
  4. Run the following command to check file integrity:
./integrity-check.py --bucketName <your bucketname> --objectName <prefix/objectname-in-s3> --localFileName <local file name>
  5. You should see a confirmation that the data matches. For example:
PASS: Checksum match! - s3Checksum: GgECtUetQSLtGNuZ+FEqrbkJ3712Afvx63E2pzpMKnk= | localChecksum: GgECtUetQSLtGNuZ+FEqrbkJ3712Afvx63E2pzpMKnk=

The integrity-check.py script queries the Amazon S3 object metadata using the GetObjectAttributes API call. The response contains the number and size of the object parts and their checksum details (algorithm and hash value). The script uses this information to split the local file and compute the hash for each part, and then validates the local file hashes against those returned from Amazon S3.
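The following is a minimal sketch of that approach, not the actual integrity-check.py script; it assumes SHA256 checksums and omits the argument parsing and error handling of the real tool.

import base64
import hashlib
import sys

import boto3


def validate(bucket, key, local_path):
    s3 = boto3.client("s3")

    # Fetch the checksum and part layout that S3 stored for the object.
    attrs = s3.get_object_attributes(
        Bucket=bucket,
        Key=key,
        ObjectAttributes=["Checksum", "ObjectParts", "ObjectSize"],
    )

    with open(local_path, "rb") as f:
        parts = attrs.get("ObjectParts", {}).get("Parts")
        if parts:
            # Multipart object: hash the matching byte range of the local
            # file for each part and compare it with the stored part checksum.
            for part in parts:
                digest = hashlib.sha256(f.read(part["Size"])).digest()
                if base64.b64encode(digest).decode() != part["ChecksumSHA256"]:
                    return False
            return True

        # Single-part object: compare the whole-object checksum.
        digest = hashlib.sha256(f.read()).digest()
        return base64.b64encode(digest).decode() == attrs["Checksum"]["ChecksumSHA256"]


if __name__ == "__main__":
    bucket, key, local_path = sys.argv[1:4]
    print("PASS" if validate(bucket, key, local_path) else "FAIL")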

Cleaning up

The only ongoing charges from this exercise are the Amazon S3 storage costs for the files in S3 and charges for Lambda invocations if you continue to upload further objects. If you do not want to incur any additional charges, you should delete the S3 bucket and the CloudFormation stack that were created for this walkthrough.

Conclusion

As assets are migrated and used across workflows, customers want to make sure the files are not altered by network corruption, hard drive failure, or other unintentional issues. Customers migrating large volumes of data to Amazon S3 should perform data integrity checks as a durability best practice.

In this post, we have shown you how to:

  1. Add additional checksums to existing Amazon S3 objects.
  2. Add additional checksums to newly created objects that are uploaded without enabling additional checksums.
  3. Validate local file integrity with the checksum information computed by Amazon S3 to verify data integrity between the two files.

To summarize, Amazon S3 uses checksum values to verify the integrity of data that you upload to or download from Amazon S3. We encourage you to use this information to confidently verify the integrity of data transferred to and stored in Amazon S3 for its entire lifetime in order to maintain end-to-end data integrity.

For more detail on the additional checksum feature, we recommend reading the technical documentation and following this handy getting started tutorial on how to check the integrity of data in S3 with additional checksums.

Thank you for reading this post. If you have any comments or questions, leave them in the comments section.

Charlie Llewellyn

Charlie Llewellyn is a Solutions Architect working in the Public Sector team with Amazon Web Services. He specializes in data analytics and enjoys helping customers use data to make better decisions. In his spare time he avidly enjoys mountain biking and cooking.

Muhammad Khas

Muhammad Khas is a Solutions Architect for the UK's small and medium businesses (SMBs). He enjoys supporting customers in using artificial intelligence and machine learning to enhance their decision-making. Outside of work, Muhammad enjoys swimming and horseback riding.