On-demand archival and retrieval of documents from Amazon WorkDocs to Amazon S3

Cloud storage of documents has grown rapidly as more customers and businesses move away from traditional physical storage. As the size and number of documents continue to grow, customers want to manage their documents and retain them in long-term, durable, cost-effective document archives. Businesses such as medical research labs that run frequent clinical trials for new drugs and vaccines generate a large amount of data in the form of reports and research notes that require long-term archival for future reference and mandatory regulatory compliance. They also need visibility into which documents are archived. Archived documents are rarely accessed, but when access is required (say, to refer to an earlier clinical study), these businesses need the ability to retrieve those documents on demand.

Amazon WorkDocs is a fully managed platform for creating, sharing, and collaborating on digital content. With Amazon WorkDocs, you can create, edit, and share content, and access it from anywhere using a browser, on any device using WorkDocs Drive on Windows File Explorer, macOS Finder, or your Amazon WorkSpaces desktop. The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, most retrieval flexibility, and the lowest cost archive storage in the cloud. End users of Amazon WorkDocs are not usually familiar with Amazon Simple Storage Service (Amazon S3) and its features, and may not have AWS Management Console or API access to the S3 service. For these users, moving documents from WorkDocs to S3 for archival and then retrieving those documents when needed would require multiple manual steps and multiple user interfaces.

In this blog, we walk you through a serverless solution that enables on-demand archival of documents from Amazon WorkDocs to Amazon S3, including the S3 Glacier storage classes. With this guidance, you can build and deploy this solution in your AWS account to enable Amazon WorkDocs end users to archive, view, and retrieve documents on demand to S3 using their existing WorkDocs interface.

Representative user flows for archival and retrieval from Amazon WorkDocs

Next, we examine how this solution would work from a user perspective:

1. Archival Flow: In this flow, an end user in WorkDocs selects one or more files or folders that they want to archive, and moves them to a designated Archive folder. These files will then be moved to Amazon S3 and an index-<jobId>.xls file will be placed in the Archive folder containing details of the files that were archived.

Archive flow

A sample index file looks like the following:

	A	B	C	D	E	F
1	File Name	Path	Archive Timestamp	Created Timestamp	Status	Retrieve
2	exec_summary.pptx	11-02-2022-ba8524e5/project 28/exec_summary.pptx	2022-11-02T13:19:18.668Z	2022-11-01T05:08:45.792Z	ARCHIVED
3	meeting notes.docx	11-02-2022-ba8524e5/project 28/meeting notes.docx	2022-11-02T13:19:17.181Z	2022-11-01T05:06:19.647Z	ARCHIVED
4	whiteboard.png	11-02-2022-ba8524e5/project 28/sprint 12/whiteboard.png	2022-11-02T13:19:27.952Z	2022-11-01T05:06:20.106Z	ARCHIVED
5	tnc.pdf	11-02-2022-ba8524e5/project 28/tnc.pdf	2022-11-02T13:19:23.880Z	2022-11-01T05:06:19.639Z	ARCHIVED

2. Restore Flow: In this flow, an end user in WorkDocs lists all the files that they want to restore in a restore.xls file in the designated Restore folder. A copy of the specified files will then be restored from S3 to the Restore folder in WorkDocs.

Restore folder in WorkDocs

The restore.xls file has the same layout as the index files, with the Retrieve column marked with ‘Y’ for every file that needs to be restored. A sample restore.xls file looks like the following:

	A	B	C	D	E	F
1	File Name	Path	Archive Timestamp	Created Timestamp	Status	Retrieve
2	exec_summary.pptx	11-02-2022-ba8524e5/project 28/exec_summary.pptx	2022-11-02T13:19:18.668Z	2022-11-01T05:08:45.792Z	ARCHIVED	Y
3	meeting notes.docx	11-02-2022-ba8524e5/project 28/meeting notes.docx	2022-11-02T13:19:17.181Z	2022-11-01T05:06:19.647Z	ARCHIVED	Y
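As a rough illustration of how the Retrieve column could be interpreted, the following sketch turns spreadsheet rows into objects and keeps only the entries marked for restore. It is not the solution's actual code: the function names, the object field names, the check on the Status column, and the tolerance for a lowercase "y" are all assumptions made for this example, and the rows are assumed to have already been read from restore.xls into arrays of cell values.

```javascript
// Column order taken from the sample index/restore files shown above.
const COLUMNS = ["fileName", "path", "archiveTimestamp", "createdTimestamp", "status", "retrieve"];

// Convert one spreadsheet row (an array of cell values) into a plain object.
// Missing trailing cells (for example an empty Retrieve column) become "".
function rowToEntry(cells) {
  const entry = {};
  COLUMNS.forEach((name, i) => {
    entry[name] = String(cells[i] ?? "").trim();
  });
  return entry;
}

// Keep only archived entries whose Retrieve column is marked "Y".
// Checking Status and accepting lowercase "y" are assumptions of this sketch.
function selectEntriesToRestore(rows) {
  return rows
    .map(rowToEntry)
    .filter((e) => e.status === "ARCHIVED" && e.retrieve.toUpperCase() === "Y");
}

// Example rows (header row already removed).
const rows = [
  ["exec_summary.pptx", "11-02-2022-ba8524e5/project 28/exec_summary.pptx",
    "2022-11-02T13:19:18.668Z", "2022-11-01T05:08:45.792Z", "ARCHIVED", "Y"],
  ["tnc.pdf", "11-02-2022-ba8524e5/project 28/tnc.pdf",
    "2022-11-02T13:19:23.880Z", "2022-11-01T05:06:19.639Z", "ARCHIVED"],
];

const toRestore = selectEntriesToRestore(rows);
console.log(toRestore.map((e) => e.path));
```

Only the first row is selected here, because the second row leaves the Retrieve column empty.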

These Archive and Restore folders in WorkDocs can be centralized folders shared with all users who want to manage and archive files, or they could be separate end user folders. For simplicity, in this blog we assume there is one centralized Archive and one centralized Restore folder.

Solution overview

The solution is built using a series of AWS Lambda functions written in Node.js. These functions use the WorkDocs and S3 APIs to monitor and sync files between WorkDocs folders and S3 buckets at scheduled intervals. The solution flow for each of the two use cases is described below:

Archival flow:

Solution overview: archive flow

1. The archive check Lambda function is invoked by Amazon EventBridge at pre-configured intervals (say every 1 hour, or every day at midnight). This Lambda function lists the files in the Archive WorkDocs folder that need to be moved to S3. The Lambda function then creates an archival task for each file and queues it in an Amazon SQS queue.
2. For each queued archive task, the archive worker Lambda function downloads the file from WorkDocs.
3. The file is then uploaded to S3 with the configured S3 storage class.
4. The archive worker Lambda function also writes a record with metadata in an Amazon DynamoDB table.
5. When all files have been archived, the archive worker Lambda function deletes the archived files from WorkDocs. The Lambda function then creates an index file (index-<jobId>.xls) containing a list of all files that were archived, and stores it in the Archive folder of WorkDocs.
6. The archive worker Lambda function then sends a notification to a pre-configured email address with the status of the archive job and a list of files that were archived.
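The task-creation step of the archival flow can be sketched in plain JavaScript as follows. This is a minimal illustration under stated assumptions rather than the solution's code: the S3 key layout is inferred from the Path column of the sample index file, the task field names (documentId, folderPath, name) are hypothetical, and grouping tasks in tens only matters if the tasks are sent with the SQS SendMessageBatch API, which accepts at most 10 entries per call.

```javascript
// SQS SendMessageBatch accepts at most 10 entries per call.
const SQS_MAX_BATCH = 10;

// Build the S3 object key for an archived file.
// The "<date>-<jobId>/<folder path>/<file name>" layout is inferred from the
// Path column of the sample index file and is an assumption of this sketch.
function buildArchiveKey(jobPrefix, folderPath, fileName) {
  return [jobPrefix, ...folderPath.split("/").filter(Boolean), fileName].join("/");
}

// Turn listed WorkDocs files into one archive task per file.
// The field names used here are hypothetical.
function buildArchiveTasks(jobPrefix, files) {
  return files.map((f) => ({
    documentId: f.documentId,
    s3Key: buildArchiveKey(jobPrefix, f.folderPath, f.name),
  }));
}

// Split tasks into groups small enough for one batch call each.
function toBatches(items, size = SQS_MAX_BATCH) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 23 files found in the Archive folder.
const files = Array.from({ length: 23 }, (_, i) => ({
  documentId: `doc-${i}`,
  folderPath: "project 28/sprint 12",
  name: `file-${i}.png`,
}));

const tasks = buildArchiveTasks("11-02-2022-ba8524e5", files);
const batches = toBatches(tasks);
console.log(tasks[0].s3Key);
console.log(batches.map((b) => b.length));
```

Keeping the key-building logic in one small function makes it easier to produce the same Path value in both the S3 upload and the index file.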

Retrieval flow:

1. Similar to the archive check Lambda function, the restore check Lambda function is also invoked by Amazon EventBridge at pre-configured intervals (say every 1 hour, or every day at midnight). This Lambda function checks for the restore.xls file in the Restore folder of WorkDocs. For every entry listed in the file that has ‘Y’ marked against the Retrieve column, it creates a restore task and queues it in an Amazon SQS queue.
2. For each queued restore task, the restore worker Lambda function does the following:

2a. For the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes, the files are not immediately accessible. To access such a file, the Lambda function first requests that the file be restored to S3 Standard storage (temporarily, for a duration of 1 day). When Amazon S3 has restored the archived file (the duration varies from minutes to hours, see Amazon S3 Performance Chart), an Amazon EventBridge rule is triggered, which invokes the restore worker Lambda function again to copy that file from S3 to Amazon WorkDocs.
2b. For other S3 storage classes, including S3 Glacier Instant Retrieval, the file is available to copy immediately.

3. The restore worker Lambda function then copies the file from S3 to Amazon WorkDocs.
4. When all files have been restored, the restore worker Lambda function renames the restore.xls file to restore-done-<jobId>.xls and then sends out a notification to a pre-configured email address with the status of the restore job.
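The branch between the two restore paths could look roughly like the sketch below. It only builds plain objects and makes no AWS calls. The storage class strings are the S3 API names for S3 Glacier Flexible Retrieval ("GLACIER") and S3 Glacier Deep Archive ("DEEP_ARCHIVE"), and the request shape follows the S3 RestoreObject API. The function names, the task fields, the action labels, the example bucket name, and the choice of the "Standard" retrieval tier are assumptions made for illustration.

```javascript
// Storage classes whose objects must be restored before they can be read.
const NEEDS_RESTORE = new Set(["GLACIER", "DEEP_ARCHIVE"]);

function needsRestoreRequest(storageClass) {
  return NEEDS_RESTORE.has(storageClass);
}

// Build the input for an S3 RestoreObject call that keeps the restored copy
// available for one day. The "Standard" retrieval tier is an assumption.
function buildRestoreParams(bucket, key) {
  return {
    Bucket: bucket,
    Key: key,
    RestoreRequest: { Days: 1, GlacierJobParameters: { Tier: "Standard" } },
  };
}

// Decide the next step for one restore task (task fields are hypothetical).
function planRestore(task) {
  if (needsRestoreRequest(task.storageClass)) {
    // The copy to WorkDocs happens later, once S3 reports the restore as complete.
    return { action: "REQUEST_RESTORE", params: buildRestoreParams(task.bucket, task.key) };
  }
  // All other classes, including GLACIER_IR, can be read immediately.
  return { action: "COPY_TO_WORKDOCS", params: { Bucket: task.bucket, Key: task.key } };
}

const deepArchivePlan = planRestore({
  bucket: "wdbr-archive-111122223333", // hypothetical account ID
  key: "11-02-2022-ba8524e5/project 28/tnc.pdf",
  storageClass: "DEEP_ARCHIVE",
});
const instantPlan = planRestore({
  bucket: "wdbr-archive-111122223333",
  key: "11-02-2022-ba8524e5/project 28/tnc.pdf",
  storageClass: "GLACIER_IR",
});
console.log(deepArchivePlan.action, instantPlan.action);
```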

Prerequisites

For deploying the solution, you need access to an AWS account with admin access in an AWS Region where Amazon WorkDocs is available.

If your organization has not used Amazon WorkDocs previously, then follow the steps to create an Amazon WorkDocs site, which generates a site URL as shown in the following screenshot. Then, select the site URL and log in to the site.

manage your workdocs sites

Then, create folders named “Archive” and “Restore” by choosing Create in the upper right corner, and selecting Folder.

create folders with name archive and restore

Once you have created the folders, they appear in Amazon WorkDocs:

folders appear in amazon workdocs

Note the folder IDs for both of the folders you created. You can find the folder IDs in the URL of each page (after the word “folder/” in the URL).

folder IDs for both of the folders you created
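If you prefer to extract the folder ID programmatically rather than copying it by hand, a small helper along these lines may be enough. It relies only on the rule stated above (the ID follows "folder/" in the URL); the function name and the example URL are hypothetical.

```javascript
// Return the path segment that follows "folder/" in a WorkDocs page URL,
// or null when the URL does not contain one.
function extractFolderId(pageUrl) {
  const match = /folder\/([^/?#]+)/.exec(pageUrl);
  return match ? match[1] : null;
}

// Hypothetical URL, used only to show the shape of the input.
const exampleUrl = "https://example.awsapps.com/workdocs/index.html#/folder/abc123def456";
const folderId = extractFolderId(exampleUrl);
console.log(folderId);
```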

Deploying the solution

AWS CloudFormation gives you the ability to model your entire infrastructure and application resources with either a text file or programming language. This removes the need for manual actions or custom scripts. With AWS CloudFormation, you work with stacks made up of templates, which can be JSON- or YAML-formatted text files. When you create a stack, AWS CloudFormation makes underlying service calls based on the templates that you provide, and provisions the resources.

To launch this solution using AWS CloudFormation, follow this walkthrough:

1. Open this URL in a new browser tab to deploy the CloudFormation template in your AWS account using the AWS Management Console.
2. By default, the link takes you to the Create stack page within the CloudFormation console. Some of the solution’s parameters in the CloudFormation template are automatically populated.
3. Fill out the rest of the required values in the Parameters section using the guidance below:

a. Enter the folder IDs of the Archive and Restore folders that were previously created in WorkDocs. (See the Prerequisites section on how to obtain the WorkDocs Folder IDs).
b. Enter an email address where the notifications should be delivered.
c. Select the desired S3 storage class.

Launch this solution using AWS CloudFormation

Note that the default storage class is set to S3 Glacier Instant Retrieval, which has a minimum storage duration of 90 days with a minimum billable object size of 128 KB. To avoid additional storage costs, choose the appropriate storage class and expiration days based on the actual use case requirements.
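To see how the two minimums mentioned above affect what you are billed for, consider the following back-of-the-envelope sketch. It encodes only the two figures stated in this post for S3 Glacier Instant Retrieval (128 KB minimum billable object size and 90 days minimum storage duration); the function names are hypothetical and no prices are included.

```javascript
const MIN_BILLABLE_BYTES = 128 * 1024; // 128 KB minimum billable object size
const MIN_STORAGE_DAYS = 90;           // minimum storage duration in days

// Size an object is billed as: small objects are rounded up to the minimum.
function billableBytes(objectBytes) {
  return Math.max(objectBytes, MIN_BILLABLE_BYTES);
}

// Days of the minimum storage duration still outstanding if the object
// is deleted after daysStored days.
function remainingMinimumDays(daysStored) {
  return Math.max(MIN_STORAGE_DAYS - daysStored, 0);
}

const smallNoteBytes = billableBytes(20 * 1024); // a 20 KB document
const earlyDeleteDays = remainingMinimumDays(30); // deleted after 30 days
console.log(smallNoteBytes, earlyDeleteDays);
```

An archive made up mostly of very small files, or of files deleted within weeks, is therefore a sign that a different storage class may be a better fit.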

The CloudFormation template will deploy and configure all the necessary components to enable this solution. It will also create an S3 bucket named wdbr-archive-${AWS::AccountId}, where all the archived files will be stored.

You can download and review the full template here.

Additional considerations

This solution should help you set up on-demand archival from Amazon WorkDocs to Amazon S3 and restore files back to Amazon WorkDocs. To expand this solution, consider the following factors:

Per user folders: This solution provides centralized folders which are monitored for archive and restore operations. This can be further expanded to per user folders. Note that the cost of the solution will increase as more folders are monitored.
Versioning: This sample implementation archives only the latest version of a document. The Amazon WorkDocs APIs support retrieving all versions of a document, and those versions can be archived in a similar fashion.
Error handling: You need to consider edge cases that could result in errors, such as manual deletion of files from Amazon S3 or deletion of index files from WorkDocs, and handle them gracefully.
File Size: This solution reads files in memory when copying between Amazon S3 and Amazon WorkDocs. This code can be enhanced to stream the file in chunks to improve memory utilization and handle large files.
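One possible starting point for the File Size enhancement is to plan fixed-size byte ranges up front, so that a large file is moved one chunk at a time instead of being held in memory in full. The sketch below only computes the ranges. The function name and the 8 MiB default chunk size are assumptions, and the header string uses the standard HTTP Range format, which the S3 GetObject API accepts in its Range parameter.

```javascript
const MIB = 1024 * 1024;

// Split a file of totalBytes into inclusive byte ranges of at most partBytes.
// The 8 MiB default is an arbitrary choice for this sketch.
function byteRanges(totalBytes, partBytes = 8 * MIB) {
  if (partBytes <= 0) {
    throw new RangeError("partBytes must be positive");
  }
  const ranges = [];
  for (let start = 0; start < totalBytes; start += partBytes) {
    const end = Math.min(start + partBytes, totalBytes) - 1;
    ranges.push({ start, end, header: `bytes=${start}-${end}` });
  }
  return ranges;
}

// Example: a 20 MiB file is covered by three ranges.
const ranges = byteRanges(20 * MIB);
console.log(ranges.map((r) => r.header));
```

Each range can then be fetched and forwarded independently, which keeps peak memory close to the chunk size rather than the file size.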

Cleaning up

To avoid incurring future charges, delete the resources set up as part of this post:

Delete the associated AWS CloudFormation stack that was launched.
Delete the S3 bucket named wdbr-archive-${AWS::AccountId}.

Conclusion

In this post, we demonstrated a solution that monitors designated Amazon WorkDocs folders and files at scheduled intervals to perform archival and retrieval operations to Amazon S3 using AWS Lambda functions. This solution allows your Amazon WorkDocs end users to archive, view, and restore files to and from S3 using their existing and familiar WorkDocs interface. This solution prevents users from having to manually move the files to/from S3, and eliminates the need for WorkDocs end users to access the AWS Management Console.

As businesses continue to move their documents to the cloud, the size and number of documents continue to grow. These organizations need to manage their documents more efficiently and retain them using long term, durable, cost-effective document archives. The solution demonstrated in this post offers a serverless, low-cost method for document archival that can help you save money and time with automation.

Thanks for reading this blog post! If you have any comments or questions, don’t hesitate to leave them in the comments section.