AWS Storage Blog

Streamline data management at scale by automating the creation of Amazon S3 Batch Operations jobs

Over time, enterprises may need to perform operations on or make modifications to their data as part of general data management, to address changing business needs, or to comply with evolving data-management regulations and best practices. As the datasets being generated, stored, and analyzed continue to grow, the need for simplified, scalable, and reproducible data management processes has become a necessity.

Enterprises turn to Amazon Simple Storage Service (Amazon S3) object storage to provide a resilient, available, and scalable storage option for many different use cases. To simplify the process of data management at scale, enterprises can use Amazon S3 Batch Operations. S3 Batch Operations is an Amazon S3 data management feature that lets you manage billions of objects with just a few clicks in the S3 console or a single API request. With this feature, you can make changes to object metadata, copy or replicate objects between buckets, replace object tag sets, modify access controls, restore archived objects from S3 Glacier storage classes, or invoke custom AWS Lambda functions. Common use cases for Batch Operations include replicating objects across accounts for business continuity, as discussed in this AWS Storage Blog post, or managing object tags at scale, as discussed in this AWS Storage Blog post. Whatever the use case, S3 Batch Operations is a great option for managing large datasets in Amazon S3.

In this post, we discuss how you can automate the creation and execution of S3 Batch Operations jobs using an event-driven serverless architecture. When paired with existing workflows or continuous integration/continuous delivery (CI/CD) pipeline automations, this offers a standardized, reproducible mechanism to manage Amazon S3 data at scale, with less manual intervention and a reduced potential for human error. Automating data management tasks, such as data movement, replication, or retrieval from archives, helps enterprises tackle the operational overhead and efficiency challenges of data management at scale.

Solution overview

This solution is an event-driven architecture that leverages serverless infrastructure and relies on the following AWS services: Amazon S3 (including S3 Batch Operations), AWS CloudTrail, Amazon EventBridge, and AWS Lambda.

This architecture automates the creation and execution of S3 Batch Operations jobs. It relies on the creation and upload of a job bundle file, which includes the details needed to configure and run the desired batch operations actions. When the job bundle file is uploaded to the S3 bucket, the resulting object-level API event is captured by AWS CloudTrail. The event is evaluated by Amazon EventBridge rules and, if it matches, triggers an AWS Lambda function. The Lambda function then processes the information included in the job bundle file and generates the appropriate API calls to create and run the S3 Batch Operations job. After the job finishes, a completion report is generated and stored in the S3 bucket. The following diagram depicts the different stages of this architecture; in the next section, we take a closer look at each of them.

Architecture diagram for automating Amazon S3 Batch Operations jobs

Walkthrough

1. S3 Batch Operations has several required variables that must be defined before we can create a new job. Most notably, these include a list of objects you wish to take action upon (the manifest file), the operation you want to perform against these objects, and the AWS Identity and Access Management (IAM) role that will be used to complete the job. Details for each of these elements and which ones are required can be found in the Amazon S3 user guide. In this solution, we list the necessary elements and their values as key value pairs and document them in a comma-separated values (CSV) file we call the job details.csv file (see following example screenshot). This CSV file is used by Lambda to generate the API calls necessary to create the S3 Batch Operations job.

Example: Job details file (csv) for S3 Batch Operations
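
As a rough sketch, a job details file for a copy operation might contain key value pairs along these lines. The key names and values shown here are illustrative assumptions rather than a format required by S3 Batch Operations; use whatever keys your Lambda function is written to parse:

operation,S3PutObjectCopy
targetBucketArn,arn:aws:s3:::example-destination-bucket
roleArn,arn:aws:iam::111122223333:role/S3BatchOperationRole
priority,1
reportScope,FailedTasksOnly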

S3 Batch Operations supports manual and automated options for generating the manifest file. If your workflow needs the manifest file to be created manually, then you can use an existing CSV-formatted Amazon S3 Inventory report, or you can generate your own manifest file following the CSV format depicted in the following manifest file example (bucket,key). Optionally, if you are using a version-enabled bucket, then you can define object version IDs within your manifest file by adding the version IDs after the object key value, or enable version IDs in your Amazon S3 Inventory report configuration.

Example: Manifest file – manual creation
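
For example, a manually created manifest in the bucket,key format could look like the following (the bucket and object names are placeholders). On a version-enabled bucket, a third column can carry the object version ID:

example-source-bucket,images/photo1.jpg
example-source-bucket,images/photo2.jpg
example-source-bucket,logs/2023/archive.log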

Alternatively, S3 Batch Operations supports the automatic generation of the manifest file based on filter criteria specified during the job creation. Supported filters include creation date, replication status, object size, storage class, and key name constraints such as substring, prefix, or suffix matching. Details regarding each of these manifest creation options can be found in the Amazon S3 user guide. If you choose to use the automatic manifest creation option, then the manifest.csv used by the Lambda function would not include the entire list of objects. Instead, it would include the necessary elements needed for automatic manifest creation, such as the source bucket, manifest output, and filter criteria details. These details are used by Lambda to specify the automatic manifest option when the S3 Batch Operations job is created.

Example: Manifest file – automatic creation
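
The structure of this file is defined by your own Lambda implementation rather than by Amazon S3. As an illustrative assumption, it could carry the source bucket, manifest output location, and filter criteria as key value pairs such as:

operation,S3PutObjectCopy
sourceBucketArn,arn:aws:s3:::example-source-bucket
manifestOutputBucketArn,arn:aws:s3:::example-manifest-output-bucket
filterKeyPrefix,logs/2023/
filterStorageClass,STANDARD
filterCreatedBefore,2023-12-31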

Once these two files have been created for a specific job, they are combined using an archiving tool such as tar to create a job bundle file. A unique jobID is used in the naming of this file. Then, the job bundle file is uploaded to the S3 bucket designated for job ingest.
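The following is a minimal sketch of this step in Python, assuming hypothetical file names, a jobID of 0001, and a job ingest bucket named example-job-ingest-bucket:

import tarfile
import boto3

job_id = '0001'
bundle_name = f'job-bundle-{job_id}.tar'

# Combine the job details and manifest files into a single job bundle file
with tarfile.open(bundle_name, 'w') as bundle:
    bundle.add('job_details.csv')
    bundle.add('manifest.csv')

# Upload the job bundle file to the S3 bucket designated for job ingest
s3 = boto3.client('s3')
s3.upload_file(bundle_name, 'example-job-ingest-bucket', bundle_name)

You could accomplish the same thing with the tar command and the AWS CLI as part of a CI/CD pipeline step.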

Example: Job ingest S3 bucket after job bundle file has been uploaded

2. With AWS CloudTrail data events enabled on the job ingest bucket, the event generated by the upload of the job bundle file will be captured by CloudTrail.

3. An EventBridge rule is created using the object level data events generated from Amazon S3 through CloudTrail. The rule defines an event pattern limiting matches to PutObject events originating from the job ingest bucket.

Example: Event pattern from Amazon EventBridge rule configuration
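
A pattern along these lines matches PutObject data events delivered through CloudTrail for a hypothetical job ingest bucket named example-job-ingest-bucket (substitute your own bucket name):

{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {
      "bucketName": ["example-job-ingest-bucket"]
    }
  }
}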

When a job bundle file is uploaded to the job ingest bucket, the EventBridge rule pattern is matched and the event is passed on to the target defined in the EventBridge rule. For this rule, you use Lambda as the primary target. You could also add other targets; for example, you may want to configure an Amazon CloudWatch log group for additional monitoring.

4. The Lambda function is triggered by the matched event passed on from Amazon EventBridge. The Lambda function uses the event context to obtain the location of the uploaded job bundle file.

Example: Amazon EventBridge rule configured to trigger AWS Lambda function

# Parse the CloudTrail PutObject event passed in from EventBridge
manifest_bucket = event['detail']['requestParameters']
source_bucket_name = manifest_bucket['bucketName']
# resources[0] is the uploaded object (job bundle), resources[1] is the bucket
manifest_file_arn = event['detail']['resources'][0]['ARN']
manifest_bucket_arn = event['detail']['resources'][1]['ARN']

Example: AWS Lambda code snippet to parse details from the event notification. Written in Python using the Boto3 SDK.

5. Next, Lambda creates a new job-specific prefix inside the S3 ingest bucket using the unique jobID defined in the job bundle file name. Then it decompresses the bundle and writes the job details and manifest files to the jobID prefix previously created.

Example: AWS Lambda creates prefix and writes job and manifest details in the Amazon S3 bucket
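
A simplified sketch of this step, assuming the bundle has already been downloaded to Lambda's /tmp directory and the jobID has been parsed from the bundle file name:

import os
import tarfile
import boto3

s3 = boto3.client('s3')

def stage_job_files(bundle_path, ingest_bucket, job_id):
    # Decompress the job bundle into Lambda's temporary storage
    extract_dir = f'/tmp/{job_id}'
    os.makedirs(extract_dir, exist_ok=True)
    with tarfile.open(bundle_path, 'r') as bundle:
        bundle.extractall(extract_dir)

    # Write the job details and manifest files under the jobID-specific prefix
    for file_name in os.listdir(extract_dir):
        s3.upload_file(
            os.path.join(extract_dir, file_name),
            ingest_bucket,
            f'{job_id}/{file_name}')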

6. The Lambda function then parses the information provided in the CSV files and generates the create job request. See the reference documentation outlining all of the required and optional values for the create job request.

s3batchjob_client.create_job(
      AccountId=AccountId,
      ConfirmationRequired=False,
      Description='Amazon S3 Batch Job Operation',
      ClientRequestToken=clientRequestToken,
      Priority=1,
      RoleArn=f'arn:aws:iam::{AccountId}:role/S3BatchOperationRole',
      Operation=operationDetails,
      Report=reportDetails,
      Manifest=manifestDetails)

Example code snippet for create_job request
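
As an illustration, the Operation, Report, and Manifest arguments are dictionaries built from the parsed CSV files. The shapes below follow the S3 Control API; the bucket names, prefix, and ETag value are placeholders:

operationDetails = {
    'S3PutObjectCopy': {
        'TargetResource': 'arn:aws:s3:::example-destination-bucket'
    }
}

reportDetails = {
    'Bucket': 'arn:aws:s3:::example-job-ingest-bucket',
    'Format': 'Report_CSV_20180820',
    'Enabled': True,
    'Prefix': '0001',
    'ReportScope': 'FailedTasksOnly'
}

manifestDetails = {
    'Spec': {
        'Format': 'S3BatchOperations_CSV_20180820',
        'Fields': ['Bucket', 'Key']
    },
    'Location': {
        'ObjectArn': 'arn:aws:s3:::example-job-ingest-bucket/0001/manifest.csv',
        'ETag': 'example-etag-value'
    }
}

Note that when the automatic manifest creation option is used, the create_job request takes a ManifestGenerator argument in place of Manifest.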

Once the create job request has been submitted successfully, Amazon S3 returns a job ID. This job ID can be used to monitor the current status of the S3 Batch Operations job. Batch Operations begins processing the submitted job as soon as it is ready and there are no higher priority jobs waiting in the processing queue. S3 Batch Operations allows four jobs to run concurrently, and the priority can be defined as part of the job details.

Example: Amazon S3 Batch Operations job
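
For example, the returned job ID can be used to poll the job's status with a describe_job call against the S3 Control API; the account ID and job ID below are placeholders:

# Reusing the S3 Control client from the create_job step
response = s3batchjob_client.describe_job(
    AccountId='111122223333',
    JobId='example-job-id-from-create-job')

print(response['Job']['Status'])  # for example: Active, Complete, or Failed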

7. S3 Batch Operations can output a completion report when the job has completed, failed, or been canceled. You can configure this report to include a record of all tasks or only the tasks that have failed. Examples of S3 Batch Operations completion reports are available in the online documentation. If you need a completion report, then an output location for the report must be defined through the Lambda function when the job is created. This location could be added to the job details.csv file, or the Lambda function could simply use the same jobID-specific prefix location that was used to decompress the job details bundle.

Example: Amazon S3 bucket with S3 Batch Operations report

It is also possible to create automated notifications providing updates on the current status of an S3 Batch Operations job. You can leverage CloudTrail and EventBridge rules to capture events related to the creation and changes in status for each job. You can configure these EventBridge rules to send updates to useful targets, such as a CloudWatch log group, an Amazon Simple Queue Service (Amazon SQS) queue, or an Amazon Simple Notification Service (Amazon SNS) topic.

Example: Amazon EventBridge pattern for job status updates
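
One possible pattern for such a rule, matching the CloudTrail service events that S3 Batch Operations emits for job creation and status changes, is sketched below; confirm the event names against your own CloudTrail logs before relying on them:

{
  "source": ["aws.s3"],
  "detail-type": ["AWS Service Event via CloudTrail"],
  "detail": {
    "eventName": ["JobCreated", "JobStatusChanged"]
  }
}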

Conclusion

In this post, we discussed how you can leverage an event-driven, serverless architecture to streamline your Amazon S3 data management operations at scale with S3 Batch Operations. We highlighted a way to document and submit the necessary configuration variables for S3 Batch Operations job creation, how to create event rules and filters to automate the creation and execution of S3 Batch Operations jobs via AWS Lambda, and how you can implement job monitoring and reporting.

The concepts discussed here can help enterprises standardize and simplify operational processes when managing data stored in Amazon S3 at scale. This allows for reductions in operational overhead and limits opportunities for human error, resulting in more efficient, accurate, and reproducible workflows.

As a next step, we invite you to review the online documentation for the AWS services we discussed in this post. We have just scratched the surface of what these services can offer. Additionally, check out the great content that has been curated over on the Serverless Land site. Serverless Land offers the latest information, posts, videos, code, and learning resources for AWS Serverless technologies.

Thank you for taking the time to read this post, we hope it got you thinking about ways you can simplify and standardize how you manage Amazon S3 data at scale. If you have any questions or suggestions, leave your feedback in the comments section.

Shweta Singh

Shweta Singh is a Cloud Application Architect at AWS. She has a background in large-scale migration and modernizations. She is currently engaged with Mainframe Modernization projects, helping customers through Mass Modernization. When not working, Shweta loves to read books.

Eric Bowman

Eric Bowman is a storage specialist Solutions Architect at AWS. He enjoys working with customers to identify ways AWS storage solutions can solve business challenges. Eric is originally from Wisconsin, but has lived in Arizona, Texas, and currently resides in Northern Virginia. When not focusing on work, Eric enjoys spending time with his family and tinkering in his woodworking shop.