AWS Storage Blog

Process AWS Storage Gateway file upload notifications with AWS CDK

For AWS Storage Day 2020, we published a blog discussing how customers use AWS Storage Gateway (specifically, File Gateway) to upload individual files to Amazon S3. For some customers, these files constitute a larger logical set of data that must be grouped for downstream processing. As mentioned in that blog, before the release of file upload notifications, customers were unable to reliably initiate this processing based on individual file upload events. To demonstrate a real-world implementation of this feature, we have created an AWS Cloud Development Kit (AWS CDK) application based on the file notification event processing solution described in our earlier blog.

In this blog post, we discuss how the AWS CDK application, available in this GitHub repository, enables you to leverage individual file upload events to group together uploaded datasets for downstream processing. The repository contains a comprehensive workshop on how to deploy and test the solution for common data vaulting use cases. By conducting the workshop, you can gain hands-on experience in implementing file upload notifications as part of a larger AWS application stack. You can use this knowledge, along with the code provided, to create your own data processing pipelines for use-cases like backup and recovery.

Before proceeding, we recommend you read our previous blog to familiarize yourself with File Gateway file upload notifications. That blog post also covers the reference architecture that forms the basis of the event processing flow implemented by this AWS CDK application.

AWS CDK application architecture

The following diagram illustrates the architecture for the application: a data pipeline processing workflow that provides for the backup and recovery of critical business assets, for instance by moving data into a secure location on AWS.

A data pipeline processing workflow that provides for the backup and recovery of critical business assets

AWS CDK application principles

For the example data vaulting use-case, the AWS CDK application components operate according to the following principles:

  • Logical datasets: A group of files and directories stored in a uniquely named folder on a File Gateway file share. These files represent a single logical dataset vaulted by the File Gateway to Amazon S3 and are treated as a single entity for the purposes of downstream processing. The files are copied from a source location that mounts the File Gateway file share using NFS or SMB.
  • Logical dataset IDs: A unique string that identifies a specific logical dataset. It forms part of the name of the root directory containing a single logical dataset on a File Gateway file share. The dataset ID allows the event processing flow to distinguish between different vaulted datasets and reconcile each of them accordingly.
  • Data files: All files that constitute a logical dataset. These are contained within a root logical dataset folder on a File Gateway file share. File upload notification events generated for data files are written, by the processing flow, to Amazon DynamoDB. Directories are treated as file objects for the purposes of uploads to Amazon S3 via File Gateway.
  • Manifest files: A file, one per logical dataset, containing a manifest that lists all data files constituting that specific logical dataset. The file copy process generates the manifest file as part of the data vaulting operation for a logical dataset. The processing flow compares its contents against the data file upload events written to a DynamoDB table. Once the two data sources match, the File Gateway has completed uploading all files constituting that logical dataset to Amazon S3 and the data vaulting operation is complete.

The processing flow implemented by this AWS CDK application contains the following mandatory, but configurable, parameters. These can be modified via the AWS CDK context keys used by the application (described in detail in the workshop walkthrough). They enable you to customize the directory name clients use when vaulting data, and how long to allow for the reconciliation of File Gateway file upload notifications as part of the vaulting process to Amazon S3 (a code sketch of reading these context keys follows the list):

  • Vault folder directory suffix name: The directory suffix name of the root folder containing a logical dataset copied to File Gateway. The processing flow uses this to identify which directories created on a File Gateway should be processed; directories that do not end in this suffix are ignored.
  • Manifest file suffix name: The suffix name for the logical dataset manifest file. The processing flow uses this to identify which file to read to obtain the list of files constituting the logical dataset, and to reconcile that list against the file upload notification events received.
  • Number of iterations in state machine: The number of attempts the file upload reconciliation state machine makes to reconcile the contents of the logical dataset manifest file with the file upload notification events received. Due to the asynchronous nature in which File Gateway uploads files to Amazon S3, a manifest file may be uploaded before all data files in that logical dataset, especially for large datasets. Iterating as part of the file upload reconciliation process is therefore required.
  • Wait time in state machine: The time, in seconds, to wait between each iteration of the file upload reconciliation state machine. The total time the state machine continues to attempt file upload reconciliation is the product of this parameter and the total number of iterations configured for the state machine (the preceding parameter).
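
As a minimal sketch of how an AWS CDK stack might read these context keys (the key names and default values below are illustrative assumptions, not necessarily those used by the repository):

from aws_cdk import Stack
from constructs import Construct


class EventProcessingStack(Stack):
    """Sketch: read the configurable parameters from CDK context keys."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Hypothetical context key names; the workshop documents the actual keys.
        vault_suffix = self.node.try_get_context("vault_folder_suffix") or "-vaultjob"
        manifest_suffix = self.node.try_get_context("manifest_file_suffix") or ".manifest"
        max_iterations = int(self.node.try_get_context("state_machine_iterations") or 10)
        wait_seconds = int(self.node.try_get_context("state_machine_wait_seconds") or 60)

        # These values would then be passed on to the Lambda functions and the
        # Step Functions state machine created elsewhere in the stack.

Context values read this way can be overridden at deployment time, for example with cdk deploy -c state_machine_iterations=20.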

The following is an example logical dataset directory structure that a client would create on a File Gateway file share when vaulting a dataset:

[LOGICAL DATASET ID]-vaultjob (root logical dataset directory)
[LOGICAL DATASET ID]-vaultjob/[DATA FILE][..] (data files at top level)
[LOGICAL DATASET ID]-vaultjob/[DIRECTORY][..]/[DATA FILE][..] (data files at n levels)
[LOGICAL DATASET ID]-vaultjob/[LOGICAL DATASET ID].manifest (recursive list of all files and directories)

The AWS CDK application workshop provides scripts, used during the walkthrough, that automatically create sample data and perform a data vaulting operation. The file copy process generates a logical dataset ID. The following is a portion of an example directory structure where the randomly generated dataset ID is DhoTdbmBHm3DfBWL:

DhoTdbmBHm3DfBWL-vaultjob/dir-ryN77APt1rIo
DhoTdbmBHm3DfBWL-vaultjob/dir-ryN77APt1rIo/file-E3l7u3XG
DhoTdbmBHm3DfBWL-vaultjob/dir-ydEDerqUGfCS 
DhoTdbmBHm3DfBWL-vaultjob/dir-ydEDerqUGfCS/file-PsZ514Ug
[…]
DhoTdbmBHm3DfBWL-vaultjob/DhoTdbmBHm3DfBWL.manifest

In your own implementation, generate the logical dataset ID according to your required naming scheme; the contents of that dataset would be your own files and directories.
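
For illustration only, a script along the following lines could generate a sample logical dataset and its manifest on a mounted File Gateway file share. The mount point, file sizes, and exact manifest format are assumptions; the workshop's own scripts define the authoritative behavior:

import os
import random
import string
from pathlib import Path

SHARE_MOUNT = Path("/mnt/fgw-share")   # assumed NFS/SMB mount point of the file share
VAULT_SUFFIX = "-vaultjob"             # must match the configured vault folder suffix
MANIFEST_SUFFIX = ".manifest"          # must match the configured manifest file suffix


def random_id(length: int) -> str:
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))


def create_sample_dataset(num_dirs: int = 3, files_per_dir: int = 2) -> Path:
    """Create a sample logical dataset plus its manifest; return the dataset root."""
    dataset_id = random_id(16)
    root = SHARE_MOUNT / f"{dataset_id}{VAULT_SUFFIX}"

    entries = []
    for _ in range(num_dirs):
        subdir = root / f"dir-{random_id(12)}"
        subdir.mkdir(parents=True)
        entries.append(subdir)
        for _ in range(files_per_dir):
            data_file = subdir / f"file-{random_id(8)}"
            data_file.write_bytes(os.urandom(1024 * 1024))  # 1 MiB of random data
            entries.append(data_file)

    # Manifest: a recursive list of all files and directories, relative to the share.
    manifest = root / f"{dataset_id}{MANIFEST_SUFFIX}"
    manifest.write_text("\n".join(str(p.relative_to(SHARE_MOUNT)) for p in entries) + "\n")
    return root


if __name__ == "__main__":
    print(f"Created sample dataset at {create_sample_dataset()}")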

AWS CDK application stacks

The AWS CDK application contains two stacks:

  • EventProcessingStack: Deploys the event processing architecture only. This is intended to be used with a Storage Gateway (File Gateway) configured to generate file upload notifications. NOTE: This stack does not create the File Gateway or File Gateway client. For the workshop walkthrough, these are created as part of the data vaulting stack.
  • DataVaultingStack: Deploys a “minimal” Amazon VPC with two Amazon EC2 instances – a File Gateway appliance and a File Gateway NFS client. This stack is used to demonstrate an example data vaulting operation, triggering the components created by the event processing stack.

Since customers can deploy File Gateways in both hybrid and AWS Cloud-based environments, the AWS CDK application separates the data vaulting environment into a dedicated stack. This allows you to deploy the event processing flow in isolation and integrate it with File Gateways in your specific environments. To do this, you associate a File Gateway file share with the Amazon S3 bucket created by the event processing stack.
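
You also need file upload notifications enabled on that file share. As a hedged sketch using boto3 (the file share ARN below is a placeholder and the settling time is an arbitrary example value):

import json

import boto3

storagegateway = boto3.client("storagegateway")

# Enable file upload notifications on an existing NFS file share. The gateway
# emits an upload event once a file has settled (no further writes) for the
# configured number of seconds; tune this value for your workload.
storagegateway.update_nfs_file_share(
    FileShareARN="arn:aws:storagegateway:eu-west-1:111122223333:share/share-EXAMPLE",
    NotificationPolicy=json.dumps({"Upload": {"SettlingTimeInSeconds": 60}}),
)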

Example data vaulting environment

The AWS CDK application contains the data vaulting stack as a useful demonstration of a real-world use-case. All resources in this stack reside in a private VPC with no internet connectivity. The following is an illustration of the architecture:

AWS CDK Application - AWS Storage Gateway file upload notification processing - Data Vaulting CDK Stack Architecture

The stack creates the following resources:

  • An Amazon VPC with three private subnets and various Amazon VPC endpoints for the relevant AWS services.
  • An Amazon S3 bucket used to deploy the AWS CDK application scripts required in the workshop walkthrough. Amazon EC2 user data commands will automatically copy these scripts to the File Gateway client.
  • 1 x Amazon EC2 instance using a Storage Gateway AMI and 150 GB of additional Amazon EBS cache volume storage – to be used as a File Gateway. This instance resides within one of the private subnets. It cannot communicate outside of the Amazon VPC and only allows inbound NFS connections from the File Gateway client.
  • 1 x Amazon EC2 instance using Amazon Linux 2 and 150 GB of additional Amazon EBS storage – to be used as a File Gateway client. This instance resides in a private subnet. It cannot communicate outside of the Amazon VPC and allows no inbound connections.
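
The following is a minimal sketch of how such an isolated network could be expressed in CDK Python. The subnet layout and endpoint list shown here are assumptions for illustration; the repository's DataVaultingStack defines the authoritative set of resources:

from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class DataVaultingStack(Stack):
    """Sketch: an isolated VPC with private connectivity to AWS services."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Isolated subnets only: no internet gateway or NAT gateway is created.
        vpc = ec2.Vpc(
            self,
            "VaultVpc",
            max_azs=3,
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="vault-private",
                    subnet_type=ec2.SubnetType.PRIVATE_ISOLATED,
                    cidr_mask=24,
                )
            ],
        )

        # Gateway endpoint so the File Gateway can reach Amazon S3 privately.
        vpc.add_gateway_endpoint(
            "S3Endpoint", service=ec2.GatewayVpcEndpointAwsService.S3
        )

        # Interface endpoint for Storage Gateway activation and control traffic.
        vpc.add_interface_endpoint(
            "StorageGatewayEndpoint",
            service=ec2.InterfaceVpcEndpointService(
                f"com.amazonaws.{self.region}.storagegateway", 443
            ),
        )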

The workshop walks you through generating sample data within this environment and vaulting it to Amazon S3 via the File Gateway instance. The File Gateway instance generates upload notifications that the event processing flow consumes and reconciles.

Observing the event processing flow

To observe the event processing flow in action, following a data vaulting operation, you can inspect the resources created by the event processing stack. Viewing the following resources in the order listed demonstrates how the processing flow executed:

  • Amazon S3 bucket: Objects created in the Amazon S3 bucket, uploaded by the File Gateway.
  • Amazon CloudWatch Logs: Logs created to record “data” and “manifest” file upload event types.
  • Amazon DynamoDB table: Items created to record the receipt of upload events.
  • AWS Step Functions state machine: State machine execution that reconciles “manifest” file contents against the file upload events received.
  • Amazon CloudWatch Logs: File upload reconciliation events emitted by the Step Functions state machine.

Amazon EventBridge rules route file upload events to their corresponding Amazon CloudWatch log groups. The following are example screenshots of file upload events.

A “data” file upload event:

A 'data' file upload event

A “manifest” file upload event:

A 'manifest' file upload event

The following is a diagram of the Step Functions state machine. This state machine implements the file upload event reconciliation logic. It executes a combination of Choice, Pass, Task, and Wait states:

Diagram of the Step Functions state machine. This state machine implements the file upload event reconciliation logic

The following is a summary of the steps executed:

  • Configure Count: Configures the maximum total number of iterations the state machine executes. The relevant CDK context key sets the count value, as described in the “AWS CDK application principles” section of this post.
  • Reconcile Iterator: Executes an AWS Lambda function that increases the value of the current iteration count by one. If the current value equals the maximum count value configured, the Lambda function sets the Boolean variable continue to False, preventing the state machine from entering another iteration loop.
  • Check Count Reached: Checks if the Boolean variable continue is True or False. Proceeds to the “Reconcile Check Upload” step if True or the “Reconcile Notify” step if False.
  • Reconcile Check Upload: Executes an AWS Lambda function that reads the “manifest” file from the Amazon S3 bucket and compares the contents with the file upload events written to the Amazon DynamoDB table. If these are identical, the Lambda function sets the Boolean variable reconcileDone to True, indicating the reconcile process has completed. This variable is set to False if these data sources do not match.
  • Reconcile Check Complete: Checks if the Boolean variable reconcileDone is True or False. Proceeds to the “Reconcile Notify” step if True or the “Wait” step if False.
  • Wait: A simple wait state that sleeps for a configured time. This state obtains the sleep time from a CDK context key, as described in the “AWS CDK application principles” section of this blog post. This state is entered whenever the Boolean variables continue and reconcileDone are set to True and False, respectively.
  • Reconcile Notify: Executes an AWS Lambda function that sends an event to the EventBridge custom bus, notifying on the status of the reconciliation process. This is either Successful if completed within the maximum number of configured iterations or Timeout if not. Proceeds to the final “Done” state, completing the state machine execution.
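
To make the “Reconcile Check Upload” step more concrete, the following is a minimal sketch of the comparison it performs. The DynamoDB table name, key schema, attribute names, and manifest format are assumptions for illustration, not the repository's actual implementation:

import boto3
from boto3.dynamodb.conditions import Key

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")


def handler(event, context):
    """Compare the manifest contents with the upload events recorded in DynamoDB."""
    bucket = event["bucket-name"]
    manifest_key = event["object-key"]
    set_id = event["set-id"]

    # Files listed in the manifest that was vaulted alongside the dataset.
    manifest_body = s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read()
    expected = {line for line in manifest_body.decode().splitlines() if line}

    # Files for which a file upload notification has been recorded.
    # Assumed table name and key schema: partition key "set-id", sort key "object-key".
    # Query pagination is omitted for brevity.
    table = dynamodb.Table("FileUploadEvents")
    response = table.query(KeyConditionExpression=Key("set-id").eq(set_id))
    uploaded = {item["object-key"] for item in response["Items"]}

    # reconcileDone drives the "Reconcile Check Complete" choice state: treat the
    # reconciliation as done once every manifest entry has a matching upload event.
    event["reconcileDone"] = expected.issubset(uploaded)
    return event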

The notification sent by the Step Functions state machine is the final step in the event processing flow and is written, via EventBridge, to an Amazon CloudWatch log group. See the following for an example screenshot:

Notification sent by the Step Functions state machine is the final step in the event processing flow and is written to CloudWatch

The structure of this event is as follows:

{
    "version": "0",
    "id": "[ID]",
    "detail-type": "File Upload Reconciliation Successful",
    "source": "vault.application",
    "account": "[ACCOUNT ID]",
    "time": "[YYYY-MM-DDTHH:MM:SSZ]",
    "region": "[REGION]",
    "resources": [],
    "detail": {
        "set-id": "[LOGICAL DATASET ID]",
        "event-time": [EPOCH TIME],
        "bucket-name": "[BUCKET NAME]",
        "object-key": "[MANIFEST FILE OBJECT]",
        "object-size": [SIZE BYTES]
    }
}

Since an EventBridge custom event bus is used, you can extend and customize the solution by adding additional targets to the EventBridge rule. By doing so, you enable other applications or processes to consume the event and perform further downstream processing on the logical dataset.
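
For example, assuming you can look up the custom event bus created by the event processing stack (the bus name below is a placeholder), an additional rule could deliver successful reconciliation events to an Amazon SQS queue for downstream consumers:

from aws_cdk import Stack
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_sqs as sqs
from constructs import Construct


class DownstreamProcessingSketch(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder bus name; use the custom event bus created by the
        # event processing stack in your deployment.
        bus = events.EventBus.from_event_bus_name(self, "VaultBus", "vault-event-bus")

        queue = sqs.Queue(self, "ReconciledDatasetsQueue")

        # Match the reconciliation notification shown above and fan it out
        # to an additional consumer.
        events.Rule(
            self,
            "ReconciliationSuccessRule",
            event_bus=bus,
            event_pattern=events.EventPattern(
                source=["vault.application"],
                detail_type=["File Upload Reconciliation Successful"],
            ),
            targets=[targets.SqsQueue(queue)],
        )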

The File Gateway implements a write-back cache and asynchronously uploads data to Amazon S3. It optimizes cache usage and the order of file uploads. It may also perform temporary partial uploads while fully uploading a file (the partial copy can be seen momentarily in the Amazon S3 bucket at a smaller size than the original). Hence, you may observe a small delay and/or non-sequential uploads when comparing objects appearing in the Amazon S3 bucket with the arrival of the corresponding Amazon CloudWatch Logs entries.

However, since the File Gateway only generates file upload notifications after it has completely uploaded files to Amazon S3, it is in these scenarios that the file upload notification feature becomes a powerful mechanism to coordinate downstream processing. The AWS CDK workshop walkthrough is a good demonstration of this feature for real-world scenarios, where a File Gateway often manages hundreds of terabytes of uploads to Amazon S3, frequently comprising hundreds of thousands of files copied by multiple clients.

Cleaning up

Don’t forget to complete the cleanup section (Module 7) in the workshop, to prevent continuing AWS service charges in your account.

Conclusion

In this post, we discussed an AWS CDK application that enables you to leverage individual File Gateway file upload events to group together uploaded datasets for downstream processing. You can use the application to vault data to AWS for the backup and recovery of critical business assets, or to create your own custom data processing pipelines for files uploaded to Amazon S3.

Thanks for reading our blog and we hope you now enjoy working through the steps in the AWS CDK application workshop. If you have any comments or questions, please leave them in the comments section or create new issues and pull requests in the GitHub repository.


Atiek Arian

Atiek is a Global Solutions Architect at Amazon Web Services. He works with some of the largest AWS Financial Services customers in the world, assisting them in their adoption of AWS services. In his free time, Atiek enjoys spending time with his family, watching Formula One, and reading.

Dominic Searle

Dominic is a Solutions Architect at Amazon Web Services. He works with AWS Financial Services customers providing technical guidance and assistance to help them make the best use of AWS Services. Outside of work, he is either spending time with his family, diving into another hobby, including his latest automated beer brewing system, or learning to play the guitar…. Badly.