Encrypting objects with Amazon S3 Batch Operations

Update 4/29/2022: A clarification has been made to the pricing paragraph in the “Setting up and running your S3 Batch Operations job” section of this blog post.

Data security keeps your business safe, but encrypting individual files when you manage an extensive data archive can seem daunting. The new Amazon S3 Batch Operations feature lets you perform repetitive or bulk actions like copying or tagging across millions of objects with a single request. All you provide is the list of objects, and S3 Batch Operations handles the rote work, including managing retries and displaying progress.

The launch of Amazon S3’s default encryption feature automated the work of encrypting new objects, and you asked for similar, straightforward ways to encrypt existing objects in your buckets. While tools and scripts exist to do this work, each one requires some development work to set up. S3 Batch Operations gives you a solution for encrypting large numbers of archived files.

In this post, I demonstrate how to create that list of objects, filter to only include unencrypted objects, set up permissions, and perform an S3 Batch Operations job to encrypt your objects.

Encrypting existing objects is one of the many ways that you can use S3 Batch Operations to manage your Amazon S3 objects.

Prerequisites

To follow along with the process outlined in this post, you need an AWS account and at least one Amazon S3 bucket to hold your working files and encrypted results. You might also find much of the existing S3 Batch Operations documentation useful, including The Basics: Amazon S3 Batch Operations Jobs, Operations, and Managing Batch Operations Jobs.

Getting your list of objects with S3 Inventory

To get started, you must first identify the S3 bucket with the objects to encrypt, and get a list of its contents. An S3 Inventory report is the most convenient and affordable way to do this. The report provides the list of the objects in a bucket along with their associated metadata.

S3 Inventory will generate a list of 100 million objects for only $0.25 in the N. Virginia Region, and for even less with smaller quantities. If you already have an S3 Inventory report for this bucket, you can skip this step.

In the S3 console, select a bucket with objects to encrypt.
On the Management tab, choose Inventory, Add New.
Give your new inventory a name, enter the destination S3 bucket, and optionally enter a destination prefix under which S3 stores the report files in that bucket.
For Output format, select CSV and for Optional fields, select Encryption status. You might also set the Daily frequency for report deliveries, as doing so delivers the first report to your bucket sooner.
Save your configuration.
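
The console steps above can also be sketched in Python. This is a minimal sketch, not the only way to configure S3 Inventory: it assumes the boto3 SDK is installed and credentials are configured, and it reuses the bucket, prefix, and inventory names that appear in the sample manifest later in this post. Including all object versions is an assumption made so that the report carries the VersionId column.

```python
def build_inventory_configuration(config_id, destination_bucket_arn, prefix, frequency="Daily"):
    """Return an InventoryConfiguration dict that mirrors the console choices above."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        # Assumption: list every version so the report includes VersionId.
        "IncludedObjectVersions": "All",
        "Destination": {
            "S3BucketDestination": {
                "Bucket": destination_bucket_arn,  # bucket ARN, not bucket name
                "Format": "CSV",
                "Prefix": prefix,
            }
        },
        "OptionalFields": ["EncryptionStatus"],
        "Schedule": {"Frequency": frequency},
    }


def put_inventory_configuration(source_bucket, configuration):
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the dict can be built without the SDK

    boto3.client("s3").put_bucket_inventory_configuration(
        Bucket=source_bucket,
        Id=configuration["Id"],
        InventoryConfiguration=configuration,
    )


config = build_inventory_configuration(
    "DemoInventory", "arn:aws:s3:::rwxtestbucket", "demoinv"
)
# put_inventory_configuration("batchoperationsdemo", config)  # uncomment to apply
```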

S3 can take up to 48 hours to deliver the first report, so check back when the first report arrives. If you want an automated notification sent when the first report is delivered, implement the solution from How Do I Know When an Inventory Is Complete?

After you receive your first report, proceed to the next section to filter your S3 Inventory report’s contents. If you no longer want to receive S3 Inventory reports for this bucket, delete your S3 Inventory configuration. Otherwise, S3 delivers reports on a daily or weekly schedule.

Note that an inventory list is not a single point-in-time view of all objects. Inventory lists are a rolling snapshot of bucket items, which are eventually consistent (that is, the list might not include recently added or deleted objects). Combining S3 Inventory and S3 Batch Operations works best when you work with static objects, or with an object set you created two or more days ago. To work with more recent data, use the LIST (Get Bucket) API to build your list of objects manually.
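
For recent data, building the list by hand might look like the following sketch. It assumes boto3 and configured credentials, and the helper names are illustrative. A plain listing does not report encryption status, so the resulting list contains every object in the bucket. Keys are URL-encoded because S3 Batch Operations expects encoded keys in a CSV manifest.

```python
from urllib.parse import quote


def manifest_rows(bucket, listed_objects):
    """Turn listing entries (dicts with a 'Key') into 'bucket,key' manifest lines."""
    return [f"{bucket},{quote(obj['Key'], safe='/')}" for obj in listed_objects]


def list_bucket(bucket):
    """Yield every object entry in the bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so manifest_rows works without the SDK

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        yield from page.get("Contents", [])


# Example with a hand-written listing entry instead of a live bucket:
rows = manifest_rows("batchoperationsdemo", [{"Key": "photos/my photo.jpg"}])
# With a live bucket: rows = manifest_rows(name, list_bucket(name))
```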

Filtering your object list with S3 Select or Athena

After you receive your S3 Inventory report, you can filter the report’s contents to only list unencrypted objects in the bucket. If all of your objects are unencrypted, you can ignore this step. Likewise, to re-encrypt all objects or change the encryption type of all objects, you can ignore this step.

Filtering your S3 Inventory report at this stage saves you the time and expense of re-encrypting objects you previously encrypted. I demonstrate how to filter using both Amazon S3 Select and Amazon Athena. To decide which tool to use, look at your S3 Inventory report’s manifest.json file. This file lists the data files associated with that report. S3 Select works on one object at a time, so if the report has many data files, use Athena, which runs a single query across multiple S3 objects. Otherwise, both tools filter effectively.

Using S3 Select

1. Open the manifest.json file from your S3 Inventory report and look at the fileSchema section of the JSON. This informs the query that you run on the data.

{
"sourceBucket" : "batchoperationsdemo",
"destinationBucket" : "arn:aws:s3:::rwxtestbucket",
"version" : "2016-11-30",
"creationTimestamp" : "1558656000000",
"fileFormat" : "CSV",
"fileSchema" : "Bucket, Key, VersionId, IsLatest, IsDeleteMarker, EncryptionStatus",
"files" : [ {
"key" : "demoinv/batchoperationsdemo/DemoInventory/data/009a40e4-f053-4c16-8c75-6100f8892202.csv.gz",
"size" : 72691,
"MD5checksum" : "c24c831717a099f0ebe4a9d1c5d3935c"
} ]
}

2. The fileSchema is: “Bucket, Key, VersionId, IsLatest, IsDeleteMarker, EncryptionStatus” so pay attention to columns 1, 2, 3, and 6 when you run your query. S3 Batch Operations needs the bucket, key, and version ID as inputs to perform the job, as well as the field to search by, encryption status. You don’t need the version ID field, but it helps to specify it when you operate on a versioned bucket.

3. Locate the data files for the S3 Inventory report. The manifest.json object lists the data files under files.

4. After you locate and select the data file in the S3 console, choose Select from.

5. Leave the preset CSV, Comma, and GZIP fields selected and choose Next.

6. To review your S3 Inventory’s format before proceeding, choose Show file Preview.

7. Enter the columns to reference in the SQL expression field and choose Run SQL. The following expression returns columns 1–3 for all objects without server-side encryption (SSE):

select s._1, s._2, s._3 from s3object s where s._6 = 'NOT-SSE'

Here are example results:

batchoperationsdemo,0100059%7Ethumb.jpg,lsrtIxksLu0R0ZkYPL.LhgD5caTYn6vu
batchoperationsdemo,0100074%7Ethumb.jpg,sd2M60g6Fdazoi6D5kNARIE7KzUibmHR
batchoperationsdemo,0100075%7Ethumb.jpg,TLYESLnl1mXD5c4BwiOIinqFrktddkoL
batchoperationsdemo,0200147%7Ethumb.jpg,amufzfMi_fEw0Rs99rxR_HrDFlE.l3Y0
batchoperationsdemo,0301420%7Ethumb.jpg,9qGU2SEscL.C.c_sK89trmXYIwooABSh
batchoperationsdemo,0401524%7Ethumb.jpg,ORnEWNuB1QhHrrYAGFsZhbyvEYJ3DUor
batchoperationsdemo,200907200065HQ%7Ethumb.jpg,d8LgvIVjbDR5mUVwW6pu9ahTfReyn5V4
batchoperationsdemo,200907200076HQ%7Ethumb.jpg,XUT25d7.gK40u_GmnupdaZg3BVx2jN40
batchoperationsdemo,201103190002HQ%7Ethumb.jpg,z.2sVRh0myqVi0BuIrngWlsRPQdb7qOS

8. Download the results, save them into a CSV format, and upload them to S3 as your list of objects for the S3 Batch Operations job.

9. If your report has multiple data files, run S3 Select against each of them as well.

Depending on the size of the results, you could combine the lists and run a single S3 Batch Operations job or run each list as a separate job. Consider the price of running each S3 Batch Operations job when you decide the number of jobs to run.
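
If you prefer to script this filter, the same logic can be sketched locally in Python with only the standard library. The sketch assumes you have already downloaded manifest.json and a gzipped data file. The sample rows below are made up for illustration, and the function names are not part of any AWS API.

```python
import csv
import gzip
import io
import json


def column_positions(manifest_text):
    """Map each fileSchema column name to its zero-based position."""
    schema = json.loads(manifest_text)["fileSchema"]
    return {name.strip(): index for index, name in enumerate(schema.split(","))}


def data_file_keys(manifest_text):
    """Return the S3 keys of the data files listed under 'files'."""
    return [entry["key"] for entry in json.loads(manifest_text)["files"]]


def unencrypted_rows(csv_gz_bytes, positions):
    """Yield [bucket, key, version_id] for rows whose status is NOT-SSE."""
    text = gzip.decompress(csv_gz_bytes).decode("utf-8")
    status = positions["EncryptionStatus"]
    wanted = [positions["Bucket"], positions["Key"], positions["VersionId"]]
    for row in csv.reader(io.StringIO(text)):
        if row and row[status] == "NOT-SSE":
            yield [row[i] for i in wanted]


# Made-up manifest and data file, shaped like the example in step 1.
manifest_text = json.dumps({
    "fileSchema": "Bucket, Key, VersionId, IsLatest, IsDeleteMarker, EncryptionStatus",
    "files": [{"key": "data/example.csv.gz"}],
})
sample = gzip.compress(
    b'"batchoperationsdemo","a.jpg","v1","true","false","NOT-SSE"\n'
    b'"batchoperationsdemo","b.jpg","v2","true","false","SSE-S3"\n'
)
positions = column_positions(manifest_text)
rows = list(unencrypted_rows(sample, positions))
```

Looking up column positions by name, rather than hard-coding column 6, keeps the filter working if your report includes more optional fields than the one in this post.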

Using Athena

To query with Athena, first create a table with your S3 Inventory report data. For more information, see Step 2: Create a Table.

1. In the Athena console, choose Add Table.

2. Name your database and table, and add the S3 location of your S3 Inventory report data files. For example: s3://rwxtestbucket/demoinv/batchoperationsdemo/DemoInventory/data/.

3. Choose CSV, Next.

4. Choose Bulk add columns and enter the applicable fields from your S3 Inventory report, from the following list:

bucket string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string,
object_lock_retain_until_date timestamp,
object_lock_mode string,
object_lock_legal_hold_status string

Enter the ones applicable to this report:

bucket string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
encryption_status string

5. Choose Add, Next.

6. Choose Create Table.

7. After creating your table, select the database that you specified earlier and run the following query on the newly created table to get the bucket, key, and version ID for only the unencrypted objects:

SELECT distinct bucket, key, version_id FROM rwx_tabledemo
where encryption_status = 'NOT-SSE'

The query uses SELECT DISTINCT so that, if you have more than one S3 Inventory report saved in this location, it returns only unique records and eliminates duplicates.

8. Download the results, remove the header row, and save as a CSV in your S3 bucket.
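
Step 8 can be scripted. The following sketch assumes the downloaded Athena results are a CSV file with a header row and quoted fields. It drops the header and rewrites the remaining rows as plain bucket,key,version_id lines. The function name is illustrative.

```python
import csv
import io


def athena_results_to_manifest(results_csv_text):
    """Drop the header row and return the remaining rows as manifest CSV text."""
    reader = csv.reader(io.StringIO(results_csv_text))
    next(reader, None)  # skip the header row, if any
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if row:  # ignore blank lines
            writer.writerow(row)
    return out.getvalue()


results = (
    '"bucket","key","version_id"\n'
    '"batchoperationsdemo","0100059%7Ethumb.jpg","lsrtIxksLu0R0ZkYPL.LhgD5caTYn6vu"\n'
)
manifest_csv = athena_results_to_manifest(results)
```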

Setting up and running your S3 Batch Operations job

Now that you have your filtered CSV lists of S3 objects, you can begin the S3 Batch Operations job to encrypt the objects.

A job refers collectively to the list (manifest) of objects provided, the operation performed, and the specified parameters. The easiest way to encrypt this set of objects is by using the PUT copy operation and specifying the same destination prefix as the objects listed in the manifest. This either overwrites the existing objects in an unversioned bucket or creates a newer version of the objects, depending on the bucket’s versioning status.

As part of copying the objects, specify that S3 should encrypt the object with SSE-S3 or SSE-KMS encryption. This job copies the objects, so all your objects show an updated creation date upon completion, regardless of when you originally added them to S3. You also must specify the other properties for your set of objects as part of the S3 Batch Operations job, including object tags and storage class.
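
To make the copy-in-place idea concrete, here is a sketch of the equivalent request for a single object. It is not how S3 Batch Operations is implemented internally. It only shows the shape of one encrypting copy, assuming boto3 and SSE-S3. The helper names are illustrative.

```python
def encrypting_copy_arguments(bucket, key, version_id=None, storage_class="STANDARD"):
    """Build copy_object arguments that copy an object onto itself with SSE-S3."""
    source = {"Bucket": bucket, "Key": key}
    if version_id:
        source["VersionId"] = version_id  # copy this specific version
    return {
        "Bucket": bucket,  # destination bucket is the same as the source
        "Key": key,        # destination key is the same as the source
        "CopySource": source,
        "ServerSideEncryption": "AES256",  # SSE-S3
        "StorageClass": storage_class,
    }


def copy_in_place(arguments):
    """Run the copy (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the arguments can be built without the SDK

    return boto3.client("s3").copy_object(**arguments)


arguments = encrypting_copy_arguments("batchoperationsdemo", "example.jpg", "v1")
```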

To create your job and encrypt your objects:

1. In the S3 console, choose Buckets, Batch Operations, Create Job.

2. Choose the Region where your objects are stored and the CSV file created earlier from the S3 Select or Athena results. If your manifest contains version IDs, check that box. Choose Next.

3. Choose PUT copy, the bucket containing the objects listed in your manifest, the desired encryption type (such as SSE-S3), storage class, and the other parameters as desired. The parameters specified at this step will apply to all operations performed on the objects listed in the manifest. Choose Next.

4. Give your job a description, set its priority level, choose a report type, and specify its IAM role. For more information, see Managing Batch Operations Jobs.

In the IAM console, choose Role, Create Role.
Choose AWS service, S3, S3 Batch Operations. Choose Next: Permissions.
Choose Create Policy, JSON and paste in the permissions from the PUT copy object section from Granting Permissions for Amazon S3 Batch Operations. Modify the buckets to match the S3 location of your manifest, objects, and report.
Choose Review policy.

Use the following example policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CopyObjectsToEncrypt",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectTagging",
        "s3:PutObjectAcl",
        "s3:PutObjectVersionTagging",
        "s3:PutObjectVersionAcl",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectTagging",
        "s3:GetObjectVersion",
        "s3:GetObjectVersionAcl",
        "s3:GetObjectVersionTagging"
      ],
      "Resource": "arn:aws:s3:::{source_and_destination_bucket for copy}/*"
    },
    {
      "Sid": "ReadManifest",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::{manifest_key}"
    },
    {
      "Sid": "WriteReport",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::{reportbucket}/*"
    }
  ]
}

Add the remaining parameters and choose Create Policy.
With your S3 Batch Operations policy now complete, return to the Role tab and Create Role wizard to attach the newly created policy to the IAM role in the Attach Permissions Policies section of the wizard.

5. Return to your S3 Batch Operations create job workflow, refresh the list of IAM roles, select your newly created role, and choose Next.

6. Check that everything is correct before choosing Create Job.

The setup wizard automatically returns you to the S3 Batch Operations section of the S3 console. Your new job transitions from the New state to the Preparing state as S3 begins the process. During the Preparing state, S3 reads the job’s manifest, checks it for errors, and calculates the number of objects. Depending on the size of the manifest, reading can take minutes or hours, at roughly two minutes per one million objects.

After S3 finishes reading the job’s manifest, the job moves to the Awaiting your confirmation state. Check the number of objects in the manifest and choose Confirm the job.

After the job begins running, you can check its progress through the console dashboard view or by selecting the specific job. You can also choose to perform this or any other step through the AWS CLI, SDKs, or APIs.
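
As one illustration of the SDK route, the following sketch builds a create-job request that matches the console choices in this post: a PUT copy back into the same bucket with SSE-S3, a CSV manifest, and a completion report. It assumes boto3 and its s3control client. The account ID, role name, manifest key, ETag, priority, and report prefix are placeholders, not values from this walkthrough.

```python
import uuid


def build_copy_job_request(account_id, role_arn, target_bucket_arn,
                           manifest_arn, manifest_etag, report_bucket_arn,
                           has_version_ids=True):
    """Return keyword arguments for the s3control create_job call."""
    fields = ["Bucket", "Key"] + (["VersionId"] if has_version_ids else [])
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,  # job waits in "Awaiting your confirmation"
        "Description": "Encrypt existing objects with SSE-S3",
        "Priority": 10,  # placeholder priority
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),
        "Operation": {
            "S3PutObjectCopy": {
                "TargetResource": target_bucket_arn,
                "StorageClass": "STANDARD",
                "NewObjectMetadata": {"SSEAlgorithm": "AES256"},
            }
        },
        "Manifest": {
            "Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": fields},
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {
            "Enabled": True,
            "Bucket": report_bucket_arn,
            "Format": "Report_CSV_20180820",
            "Prefix": "batch-reports",  # placeholder prefix
            "ReportScope": "AllTasks",
        },
    }


def create_job(request):
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request can be built without the SDK

    return boto3.client("s3control").create_job(**request)["JobId"]


request = build_copy_job_request(
    account_id="111122223333",                                   # placeholder
    role_arn="arn:aws:iam::111122223333:role/example-batch-role",  # placeholder
    target_bucket_arn="arn:aws:s3:::batchoperationsdemo",
    manifest_arn="arn:aws:s3:::rwxtestbucket/example-manifest.csv",  # placeholder
    manifest_etag="example-etag",                                # placeholder
    report_bucket_arn="arn:aws:s3:::rwxtestbucket",
)
```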

You are charged for S3 Batch Operations jobs, objects, and requests in addition to any charges associated with the operation that S3 Batch Operations performs on your behalf, including data transfer, requests, and other charges. Writing new objects to the bucket, as is done here while copying, could also initiate S3 Replication, event-driven workflows, or other actions. You can find more information on S3 pricing on the S3 pricing page.

When the job completes, view the Successful and Failed object counts to confirm that everything performed as expected. For the exact cause of any Failed objects, see your job report.

To recap, you created a list of objects in your bucket, filtered that list to include only unencrypted objects, and used S3 Batch Operations to encrypt those objects using the copy operation. If you use a versioned bucket, the job described in this post creates new encrypted versions of your objects and keeps the previous unencrypted versions. To delete the old versions, set up an S3 lifecycle expiration policy for noncurrent versions as described in Lifecycle Configuration Elements.
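
A lifecycle rule for noncurrent versions might look like the following sketch. It assumes boto3, and the rule ID and the 30-day window are arbitrary placeholders. Note that this rule expires every noncurrent version under the prefix, not only the unencrypted ones, and that putting a lifecycle configuration replaces any configuration already on the bucket.

```python
def noncurrent_expiration_configuration(days, prefix=""):
    """Build a lifecycle configuration that expires noncurrent object versions."""
    return {
        "Rules": [
            {
                "ID": "ExpireNoncurrentVersions",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Replace the bucket's lifecycle configuration (needs boto3 and credentials)."""
    import boto3  # imported lazily so the dict can be built without the SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


lifecycle = noncurrent_expiration_configuration(days=30)
```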

Other operations

You can use similar steps for other large-scale tasks. In addition to encrypting objects as described previously, you can modify the process to do the following:

Adding object tags to drive lifecycle actions: Instead of writing a lifecycle rule for an entire bucket or prefix, tags allow you to perform actions on a targeted set of objects. You can filter objects by name, size, or any combination of these and other properties to identify objects for transition or expiration actions. You could, for example, tag all objects over 3 MB and only lifecycle those tagged objects to S3 Glacier to save storage costs.

Restoring objects from S3 Glacier: Similarly, filtering your objects helps identify only those objects that you must restore from S3 Glacier. This filtering can be done based on the suffix of the object (such as .jpg or .csv) or by other properties.
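
Both kinds of filter can be sketched over inventory rows in Python. The sketch assumes each row is a dict using the column names from the Athena table earlier in this post (bucket, key, size), that your inventory report includes the Size field, and that 3 MB means 3 × 1024 × 1024 bytes. The sample rows are made up.

```python
THREE_MB = 3 * 1024 * 1024  # assumption: binary megabytes


def larger_than(rows, min_size_bytes=THREE_MB):
    """Return (bucket, key) pairs for objects strictly larger than the threshold."""
    return [(r["bucket"], r["key"]) for r in rows if int(r["size"]) > min_size_bytes]


def with_suffix(rows, suffixes=(".jpg", ".csv")):
    """Return (bucket, key) pairs whose key ends with one of the suffixes."""
    return [(r["bucket"], r["key"]) for r in rows if r["key"].lower().endswith(suffixes)]


sample_rows = [
    {"bucket": "batchoperationsdemo", "key": "big.mov", "size": "5242880"},
    {"bucket": "batchoperationsdemo", "key": "small.JPG", "size": "1024"},
]
to_tag = larger_than(sample_rows)
to_restore = with_suffix(sample_rows)
```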

Things to know

S3 Batch Operations performs the same operation across all the objects listed in the manifest. If you want your copied objects to have different storage classes or other properties, create multiple jobs. Or, use an AWS Lambda function and have it assign specific properties to each object.
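
A Lambda function for this purpose might look like the sketch below. It assumes invocation schema version 1.0 of the S3 Batch Operations Lambda event (tasks carrying taskId, s3Key, s3VersionId, and s3BucketArn). The storage-class rule is a made-up example, and the copy function is passed in so that the handler logic can be exercised without AWS access.

```python
from urllib.parse import unquote


def choose_storage_class(key):
    """Made-up rule: CSV files go to STANDARD_IA, everything else to STANDARD."""
    return "STANDARD_IA" if key.lower().endswith(".csv") else "STANDARD"


def copy_with_boto3(bucket, key, version_id, storage_class):
    """Copy one object onto itself with SSE-S3 and the chosen storage class."""
    import boto3  # imported lazily; available in the Lambda Python runtime

    source = {"Bucket": bucket, "Key": key}
    if version_id:
        source["VersionId"] = version_id
    boto3.client("s3").copy_object(
        Bucket=bucket, Key=key, CopySource=source,
        ServerSideEncryption="AES256", StorageClass=storage_class,
    )


def handle_batch_event(event, copy_object=copy_with_boto3):
    """Process each task in the event and report a result per task."""
    results = []
    for task in event["tasks"]:
        key = unquote(task["s3Key"])  # keys arrive URL-encoded
        bucket = task["s3BucketArn"].split(":::")[-1]
        try:
            copy_object(bucket, key, task.get("s3VersionId"), choose_storage_class(key))
            code, message = "Succeeded", "copied"
        except Exception as error:  # report any failure back to the job
            code, message = "PermanentFailure", str(error)
        results.append({"taskId": task["taskId"], "resultCode": code,
                        "resultString": message})
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }


def lambda_handler(event, context):
    return handle_batch_event(event)
```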

S3 Batch Operations is available today in all commercial AWS Regions except Asia Pacific (Osaka). S3 Batch Operations is also available in both of the AWS GovCloud (US) Regions and the AWS China Regions.

Using the copy operation

The copy operation creates new objects with new creation dates, which can affect lifecycle actions like archiving. If you copy all objects in your bucket, all the new copies have identical or similar creation dates. To further identify these objects and create different lifecycle rules for various data subsets, consider using object tags.

Failure threshold

If greater than 50% of a job’s object operations fail after more than 1,000 operation attempts, the job automatically fails. This provides quicker feedback and reduces the cost of failed requests. Check your final report to identify the cause of the failures.
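
Read literally, the rule above can be written as a small check. This is only one interpretation of the sentence, not the logic that S3 runs.

```python
def job_fails_automatically(attempted, failed):
    """True when more than 1,000 operations were attempted and over 50% failed."""
    return attempted > 1000 and 2 * failed > attempted


examples = {
    (2000, 1001): job_fails_automatically(2000, 1001),  # just over half failed
    (2000, 1000): job_fails_automatically(2000, 1000),  # exactly half failed
    (500, 500): job_fails_automatically(500, 500),      # too few attempts so far
}
```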

Use S3 Select to save money

S3 Select is an S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that must access data in S3.

Without S3 Select, most applications must retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the S3 service. By reducing the volume of data loaded and processed by your applications, S3 Select can improve the performance of most applications that frequently access data from S3.

Conclusion

In this post, I demonstrated how to use S3 Batch Operations to encrypt existing data in your S3 buckets, including how to filter out objects that you have already encrypted. This process can save you time and money when you need to complete bulk operations like encrypting all existing objects.

Thanks for reading this post and getting started with S3 Batch Operations. I can’t wait to hear your feedback and feature requests for S3 Batch Operations in the comments.