AWS for M&E Blog

Building scalable checksums

Customers in the media and entertainment industry interact with digital assets in various formats. Common assets include digital camera negatives, film scans, post-production renders, and more, all of which are business-critical. As assets move from one step to the next in a workflow, customers want to make sure the files are not altered by network corruption, hard drive failure, or other unintentional issues. Today, the industry uses algorithms that scan a file byte by byte to generate a unique fingerprint for it, known as a checksum.

With checksums, users can verify that assets are not altered when copied. Performing a checksum on a file entails iterating sequentially over every byte of the file with a hashing algorithm and using compute to calculate the digest. As files grow in size, the total time to compute the checksum increases and traditional methods become more costly.
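As a point of reference, the traditional sequential approach looks like the following minimal sketch in Go (the file name is an illustrative placeholder): a single hash consumes the file in one pass, so the runtime grows linearly with file size.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
)

func main() {
    // A single SHA-256 hash reads the whole file byte by byte on one core.
    f, err := os.Open("asset.mov") // hypothetical source file
    if err != nil {
        panic(err)
    }
    defer f.Close()

    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil { // one sequential pass over every byte
        panic(err)
    }
    fmt.Printf("%x\n", h.Sum(nil))
}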

In this post, we cover a different approach that splits the calculation across parts of the file. Each part has its own checksum value, distinct from the value produced by a sequential calculation over the whole file. This lets the calculations run concurrently and reduces the total time for the checksum to complete. The new Amazon S3 checksum feature exposes these part-level checksums of an object, which were not previously available.

Figure: New checksum options include SHA-256, SHA-1, CRC32, and CRC32C.

To increase the speed of uploading a file to Amazon S3, large objects are cut into smaller pieces, known as parts, via the multipart upload API. Amazon S3 calculates the MD5 digest of each individual part, and these digests are used to determine the ETag of the final object: Amazon S3 concatenates the binary MD5 digests of the parts and then calculates the MD5 digest of that concatenation. As the final step, Amazon S3 appends a dash and the total number of parts to the end of the ETag.
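To make that derivation concrete, here is a minimal sketch in Go that reproduces the calculation locally; the file name and the 16 MB part size are illustrative values, not from this post.

package main

import (
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

// multipartETag reproduces the multipart ETag: MD5 each part, concatenate the
// raw digests, MD5 the concatenation, and append "-<number of parts>".
func multipartETag(path string, partSize int64) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    var digests []byte // concatenated raw MD5 digests, one per part
    var parts int
    for {
        h := md5.New()
        n, err := io.CopyN(h, f, partSize)
        if n > 0 {
            digests = append(digests, h.Sum(nil)...)
            parts++
        }
        if err == io.EOF {
            break
        }
        if err != nil {
            return "", err
        }
    }
    return fmt.Sprintf("%x-%d", md5.Sum(digests), parts), nil
}

func main() {
    etag, err := multipartETag("asset.mov", 16*1024*1024) // hypothetical file and part size
    if err != nil {
        panic(err)
    }
    fmt.Println(etag)
}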

Traditional methods for performing object integrity validation are directly impacted by the size of the file. Calculating a SHA-256 checksum on 1TB of data on an Amazon Elastic Compute Cloud (Amazon EC2) i3en.2xlarge instance takes 86 minutes. Two variables affect this calculation. The first is compute power: CPUs scale horizontally by adding more cores, which pairs well with the ability to break our object into multiple parts and lets us modernize our approach by parallelizing integrity checks. The second variable is the underlying storage: the checksum calculation needs to read the file as quickly as possible. Amazon EC2 i3 instances are storage optimized instances that provide low latency and high performance I/O.

Multipart uploads and checksums

We gain several advantages by splitting the checksum across multiple parts of a file:

  • Parallelize the job across multiple cores, or even multiple processes.
  • Create a process that we can resume. In case of failure, we calculate the checksums only for the parts that failed and skip those already calculated.
  • Upload multiple parts concurrently to increase throughput and reduce transfer times.

Figure: Calculating a SHA-256 checksum on one core takes 86 minutes, while the parallelized calculation takes around 7 minutes.
In the previous example, where our 1TB file took 86 minutes to process, if the process calculating the checksum were to fail, we would lose all of our progress and need to start again. By calculating parts individually, we can save the state of completed parts and resume after a failure.
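The following minimal sketch in Go illustrates the idea (the file name and 100 MB part size are illustrative): each worker hashes its own byte range, so parts are computed concurrently and, after a failure, only the missing parts would need to be recomputed. A production tool would persist the per-part results to make the job resumable.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
    "runtime"
    "sync"
)

func main() {
    const partSize = int64(100 * 1024 * 1024) // 100 MB parts (illustrative)

    f, err := os.Open("asset.mov") // hypothetical source file
    if err != nil {
        panic(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        panic(err)
    }
    numParts := (info.Size() + partSize - 1) / partSize
    sums := make([][]byte, numParts)

    sem := make(chan struct{}, runtime.NumCPU()) // bound concurrency to the core count
    var wg sync.WaitGroup
    for i := int64(0); i < numParts; i++ {
        wg.Add(1)
        go func(part int64) {
            defer wg.Done()
            sem <- struct{}{}
            defer func() { <-sem }()

            offset := part * partSize
            size := partSize
            if offset+size > info.Size() {
                size = info.Size() - offset // the last part may be shorter
            }
            buf := make([]byte, size)
            // ReadAt is safe for concurrent use, so each worker reads its own byte range.
            if _, err := f.ReadAt(buf, offset); err != nil && err != io.EOF {
                return // a real tool would record the failure and retry just this part
            }
            sum := sha256.Sum256(buf)
            sums[part] = sum[:]
        }(i)
    }
    wg.Wait()

    for i, s := range sums {
        fmt.Printf("part %d: %x\n", i+1, s)
    }
}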

Calculating checksums in multipart uploads

Using the aws-sdk-go-v2, we can create a multipart upload and iterate through the parts of the file. To use the new checksum features, we pass the ChecksumAlgorithm parameter, and the SDK calculates the checksum of each part as it uploads it to Amazon S3.

// The SDK calculates the SHA-256 of this part as it streams the bytes to Amazon S3.
client.UploadPart(ctx, &s3.UploadPartInput{
    ChecksumAlgorithm: types.ChecksumAlgorithmSha256, // store a SHA-256 checksum for this part
    Bucket:            &bucket,
    Key:               &key,
    PartNumber:        partNum,
    UploadId:          &uploadId,
    Body:              &limitedReader,
})

Creating pre-computed checksums allows us to move assets from source to destination without worrying about object integrity. The UploadPart API now accepts the pre-computed value as a new attribute of the object part. When the pre-computed value is presented to the SDK, the local calculation is skipped. Amazon S3 calculates the checksum once it receives the part and compares it with the value provided by the client. If the checksums match, the upload completes as intended; otherwise the client receives an error so it can retry.
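Here is a minimal sketch of supplying a pre-computed value. It assumes partBytes already holds the bytes of this part and that client, ctx, bucket, key, partNum, and uploadId come from the earlier example; the crypto/sha256, encoding/base64, and bytes packages are also needed. The SHA-256 digest is base64 encoded, as Amazon S3 expects, and passed in the ChecksumSHA256 field.

// Pre-compute the part's SHA-256 locally and hand it to UploadPart; the SDK
// skips its own calculation and Amazon S3 verifies the value on receipt.
sum := sha256.Sum256(partBytes)
precomputed := base64.StdEncoding.EncodeToString(sum[:]) // S3 expects the digest base64 encoded

client.UploadPart(ctx, &s3.UploadPartInput{
    ChecksumSHA256: &precomputed, // pre-computed value for this part
    Bucket:         &bucket,
    Key:            &key,
    PartNumber:     partNum,
    UploadId:       &uploadId,
    Body:           bytes.NewReader(partBytes),
})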

GetObjectAttributes

Figure: The new GetObjectAttributes API provides the checksum algorithm, checksum value, number of parts, part boundaries, and part-level checksum values.
Along with the new checksums, we introduced the Amazon S3 API call s3.GetObjectAttributes. This new API takes in a list of attributes to retrieve from the object’s metadata, including the checksums of the parts calculated during upload. The attributes we can retrieve are the following: ETag, Checksum, ObjectParts, StorageClass, and ObjectSize.

client.GetObjectAttributes(ctx, &s3.GetObjectAttributesInput{
    Bucket: &bucket,
    Key:    &key,
    ObjectAttributes: []types.ObjectAttributes{
        types.ObjectAttributesChecksum,
        types.ObjectAttributesObjectParts,
    },
})

The following is an example of the response:

{
    "LastModified": "Fri, 25 Feb 2022 06:51:56 GMT",
    "VersionId": "9G.W.8LpAqXETuNaRra1mGnRHU2oBKjG",
    "ETag": "7931c2b8a6c796c7abb378bb546467cb-4",
    "ObjectParts": {
        "TotalPartsCount": 4,
        "PartNumberMarker": 0,
        "NextPartNumberMarker": 4,
        "MaxParts": 1000,
        "IsTruncated": false,
        "Parts": [
            {
                "PartNumber": 1,
                "Size": 16777216,
                "ChecksumSHA256": "Y17ezXQs3T0uBd1IfCd5+b0Z3S44Gfx5IuCN431af7I="
            },
            {
                "PartNumber": 2,
                "Size": 16777216,
                "ChecksumSHA256": "IO5OIm9O4H4UKCMYKqZ3g3sK1GK713lABEmOU991MC8="
            },
            {
                "PartNumber": 3,
                "Size": 16777216,
                "ChecksumSHA256": "yWHu1DARkVxDJ6fJyuD9DmdNIamLjAKE26jM8cRUVQM="
            },
            {
                "PartNumber": 4,
                "Size": 5000192,
                "ChecksumSHA256": "XxUYpRqbSiFjDrXJTrDdzJBmTNYblVqWQ6xkB5LGLlM="
            }
        ]
    }
}

This is a significant change in the API. Previously, the ETag only contained the final checksum of checksums, and we could not determine which part was altered. With the new API, we can inspect the checksum and size of each individual part.
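As a sketch of how this can be used, assume attrs holds the GetObjectAttributes response from above and localPart is a hypothetical helper that reads the same byte range from the local copy of the file; aws.ToString comes from the SDK's aws package. We can then recompute each part locally and pinpoint exactly which part differs.

// Compare locally computed part checksums with the values Amazon S3 returned.
for _, p := range attrs.ObjectParts.Parts {
    sum := sha256.Sum256(localPart(p.PartNumber, p.Size)) // localPart is a hypothetical helper
    local := base64.StdEncoding.EncodeToString(sum[:])
    if local != aws.ToString(p.ChecksumSHA256) {
        fmt.Printf("part %d does not match\n", p.PartNumber)
    }
}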

Concurrency

The aws-sdk-go-v2 has an Amazon S3 Transfer Manager library that simplifies the creation of multipart uploads by abstracting most of the complexity from developers. After initializing a new Uploader, the client can include the ChecksumAlgorithm parameter to generate checksums for every part it uploads. Behind the scenes, the Transfer Manager creates a multipart upload and uses trailing checksums to calculate the checksum of each part and append it to the upload request. This saves time, since we perform both operations in a single pass.

uploader := manager.NewUploader(client, func(u *manager.Uploader) {
    u.PartSize = 100 * 1024 * 1024 // 100MB per part
    u.Concurrency = 64             // how many workers transfer manager will use
})
uploader.Upload(ctx, &s3.PutObjectInput{
    ChecksumAlgorithm: types.ChecksumAlgorithmSha256,
    Bucket:            &bucket,
    Key:               &key,
    Body:              &reader,
})

Developers can customize several options on the Amazon S3 Transfer Manager, for example PartSize and Concurrency. With this in mind, we used the Amazon S3 Transfer Manager to create an uploader configured to split our 1TB object into 10,000 parts of 100MB each. For the test, we used an Amazon EC2 i3.16xlarge instance with a RAID array across all ephemeral NVMe drives to improve our I/O. The SHA-256 calculation and upload to Amazon S3 completed in 7 minutes and 57 seconds.

Conclusion

Using Amazon S3 multipart uploads, customers can build the foundation needed to parallelize object integrity checks. In addition, with the new checksums introduced, customers can continue to use the same algorithms they use in their current workflows. Modernizing a workflow to support multipart files ensures that customers can parallelize their integrity checks, take advantage of multi-core instances, reduce processing time significantly, and build solutions that remain scalable as the size of their digital assets continues to grow.

Matt Herson

Matt Herson is a Sr. Solution Architect Leader in Content Production, enabling global media customers to successfully run production workflows on AWS. With over 15 years of experience, Matt has a passion for innovation in the post-production space. Prior to AWS, he held Director of Technology, Post IT Manager, DIT, and Chief Architect roles, managing a large number of teams around the world.

Yanko Bolanos

Yanko Bolanos is a Senior Solutions Architect for Amazon Web Services. He enjoys helping customers innovate and architect workloads on AWS. Prior to AWS, Yanko held roles in the tech industry and the media and entertainment industry.

Nitish Pandey

Nitish Pandey is a Technical Product Manager at Amazon S3, specializing in data integrity and S3 Glacier archive storage. With a passion for solving complex problems, he has experience in building enterprise, AI/ML, and consumer products that delight customers. Outside of work, he enjoys exploring nature, delving into stoic philosophy, and staying informed on tech and geopolitics.