AWS Storage Blog

Auto-sync files from Amazon WorkDocs to Amazon S3

Today, many customers use Amazon S3 as their primary storage service for various use cases, including data lakes, websites, mobile applications, backup and restore, archive, big data analytics, and more. Versatile, scalable, secure, and highly available worldwide, S3 serves as a cost-effective data storage foundation for countless application architectures. Often, customers want to exchange files and documents between Amazon WorkDocs and Amazon S3. In our previous blog, we covered the process to auto-sync files from Amazon S3 to Amazon WorkDocs. In this blog post, we cover the sync process from Amazon WorkDocs to Amazon S3.

WorkDocs provides secure cloud storage and allows users to share and collaborate on content with other internal and external users easily. Additionally, Amazon WorkDocs Drive enables users to launch content directly from Windows File Explorer, macOS Finder, or Amazon WorkSpaces without consuming local disk space. Amazon S3 and Amazon WorkDocs both support rich API operations to exchange files.

Manually moving individual objects from WorkDocs to Amazon S3 can become tedious. Many customers are looking for a way to automate the process, enabling them to have their files available in S3 for further processing.

In this post, we walk you through setting up an auto-sync mechanism for synchronizing files from Amazon WorkDocs to Amazon S3 using Amazon API Gateway and AWS Lambda. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. AWS Lambda lets you run code without provisioning or managing servers, so you pay only for the compute time you consume and do not need to plan capacity in advance. This solution lets end users focus on analyzing data instead of manually moving files from Amazon WorkDocs to Amazon S3, which saves time and improves overall productivity and efficiency.

Solution overview

A common approach to automatically syncing files from Amazon WorkDocs to Amazon S3 is to set up an auto-sync tool using a Python module in AWS Lambda. We show you how to create this solution in the following steps. The following diagram shows each of the steps covered in this post:

Auto-sync files from Amazon WorkDocs to Amazon S3 - solution architecture

The scope of this post is limited to the following steps:

  1. Creating Amazon WorkDocs folders
  2. Setting up this solution’s Amazon S3 components
  3. Creating AWS Systems Manager Parameter Store
  4. Setting up Amazon SQS queue
  5. Setting up Amazon API Gateway
  6. Building AWS Lambda code with Python
  7. Setting up the WorkDocs notification
  8. Testing the Solution

As a first step, we create the Amazon WorkDocs folders, which generate WorkDocs folder IDs. We also set up an Amazon S3 bucket to receive the files. We use AWS Systems Manager Parameter Store to capture the Amazon S3 bucket name, WorkDocs folder IDs, folder names, and file extensions that need to sync. AWS Lambda uses the AWS Systems Manager Parameter Store to retrieve the information stored. We use Amazon API Gateway to integrate with Amazon SQS. We use an Amazon SQS queue to reprocess API events in case of a failure while syncing Amazon WorkDocs files to Amazon S3. Amazon SQS queues the Amazon API Gateway events and triggers AWS Lambda. As part of the process, we also enable WorkDocs notifications and subscribe to it using API Gateway to process the events generated from Amazon WorkDocs.

Prerequisites

For the following example walkthrough, you need access to an AWS account with admin access in the us-east-1 Region.

1. Creating Amazon WorkDocs folders

We use the Amazon WorkDocs folders created in this section to sync up with Amazon S3.

If your organization has not used Amazon WorkDocs before, follow the steps to create an Amazon WorkDocs site, which generates a site URL as shown in the following screenshot. Then, select the site URL and log in to the site.

Creating Amazon WorkDocs folders - accessing your WorkDocs Site

Then, create a folder named “test_user_1_reports” by choosing Create and selecting Folder.

Creating Amazon WorkDocs folders - create two folders

Once you have created the folder, it appears in WorkDocs.


Note the folder ID for the folder you created. Find the folder ID in the URL of each page (after the word “folder/” in the URL).

Note the folder IDs for the folders you created - Find the folder IDs in the URL of each page

The “test_user_1_reports” folder ID
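If you need to collect folder IDs for several folders, the lookup described above can be scripted. The following is a minimal sketch that assumes the folder ID is the path segment following "folder/" in the page URL; the helper name and the example site URL are hypothetical.

```python
import re
from typing import Optional

# Assumption: the folder ID is whatever follows "folder/" in the WorkDocs
# page URL, up to the next "/", "?", "#", or "&".
_FOLDER_ID = re.compile(r"folder/([^/?#&]+)")


def extract_folder_id(url: str) -> Optional[str]:
    """Return the folder ID found after 'folder/' in a WorkDocs URL, or None."""
    match = _FOLDER_ID.search(url)
    return match.group(1) if match else None


# Hypothetical site URL, used only to exercise the helper.
example_url = (
    "https://example.awsapps.com/workdocs/index.html#/folder/"
    "7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec"
)
print(extract_folder_id(example_url))
```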

2. Setting up this solution’s Amazon S3 components

Create an Amazon S3 bucket with public access blocked and with default encryption set to SSE-S3. This configuration is sufficient for this sample solution. For your own buckets, follow your organization's compliance requirements.

Create an Amazon S3 bucket with public access blocked and default encryption of SSE-S3.
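If you prefer to script this step instead of using the console, the same configuration can be sketched with boto3. The helper name and bucket name below are hypothetical, and the sketch assumes the us-east-1 Region, where no location constraint is needed. The client is passed in so the helper can be exercised without AWS credentials.

```python
def create_private_bucket(s3_client, bucket_name):
    """Create a bucket, block all public access, and default to SSE-S3."""
    # Assumption: us-east-1, so CreateBucketConfiguration is omitted.
    s3_client.create_bucket(Bucket=bucket_name)
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    s3_client.put_bucket_encryption(
        Bucket=bucket_name,
        ServerSideEncryptionConfiguration={
            # "AES256" is the algorithm name for SSE-S3.
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )


# Usage (requires boto3 and AWS credentials; bucket name is hypothetical):
#   import boto3
#   create_private_bucket(boto3.client("s3", region_name="us-east-1"),
#                         "my-workdocs-sync-bucket")
```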

3. Creating AWS Systems Manager Parameter Store

1. Create a parameter named “/dl/workdocstos3/bucketname” for storing the Amazon S3 bucket name.

You must create three different parameters in the Amazon AWS Systems Manager Parameter Store - bucket name (1)

2. Create a parameter named “/dl/workdocstos3/folderids” for storing the mapping between your Amazon WorkDocs folder ID and Amazon S3 prefix.

  • Sample value: {"7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec":"test_user_1_reports"}

Create three different parameters in the Amazon AWS Systems Manager Parameter Store - WorkDocs folder IDs and S3 Prefix mapping (2) - updated

3. Create a parameter named “/dl/workdocstos3/fileext” for storing the file extensions that should be synced from Amazon WorkDocs to Amazon S3.

  • Sample value: {"file_ext":".pdf,.xlsx,.csv"}

You must create three different parameters in the Amazon AWS Systems Manager Parameter Store - File extension (3) - update
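The three parameters can also be created programmatically. The sketch below builds the values in the format the Lambda function code in this post reads them (a plain bucket name plus two JSON strings) and writes them with the boto3 `put_parameter` call. The helper names and the bucket name are hypothetical; the parameter names are the ones the Lambda function retrieves.

```python
import json


def build_sync_parameters(bucket_name, folder_prefixes, extensions):
    """Return the three parameter name/value pairs used by this walkthrough."""
    return {
        "/dl/workdocstos3/bucketname": bucket_name,
        # WorkDocs folder ID -> Amazon S3 prefix, stored as a JSON object.
        "/dl/workdocstos3/folderids": json.dumps(folder_prefixes),
        # Comma-separated extensions under the "file_ext" key.
        "/dl/workdocstos3/fileext": json.dumps({"file_ext": ",".join(extensions)}),
    }


def put_sync_parameters(ssm_client, parameters):
    """Write each value as a String parameter, overwriting any existing one."""
    for name, value in parameters.items():
        ssm_client.put_parameter(Name=name, Value=value, Type="String", Overwrite=True)


params = build_sync_parameters(
    "my-workdocs-sync-bucket",  # hypothetical bucket name
    {"7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec": "test_user_1_reports"},
    [".pdf", ".xlsx", ".csv"],
)
for name, value in params.items():
    print(name, "=", value)
```

Storing the JSON with straight double quotes matters, because the Lambda function parses these values with `json.loads`.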

4. Setting up Amazon SQS queue

Create an SQS Queue with Default visibility timeout as 15 minutes.
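As a scripted alternative to the console, the queue can be sketched with the boto3 `create_queue` call. The helper name is hypothetical; the 15-minute value is expressed in seconds and passed as a string, which is how SQS attributes are supplied.

```python
def create_sync_queue(sqs_client, queue_name, visibility_timeout_seconds=15 * 60):
    """Create a standard queue with a 15-minute default visibility timeout."""
    response = sqs_client.create_queue(
        QueueName=queue_name,
        # SQS attribute values are strings, so 900 seconds becomes "900".
        Attributes={"VisibilityTimeout": str(visibility_timeout_seconds)},
    )
    return response["QueueUrl"]


# Usage (requires boto3 and AWS credentials; queue name is hypothetical):
#   import boto3
#   queue_url = create_sync_queue(boto3.client("sqs"), "workdocs-to-s3-queue")
```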


Create an IAM role to integrate Amazon SQS with Amazon API Gateway. Choose API Gateway as a use case and create the role.

Create an IAM role to integrate Amazon SQS with Amazon API Gateway

Use the default policy as shown in the following screenshot and create the role.

Use the default policy when creating an IAM role to integrate Amazon SQS with Amazon API Gateway

Once the role is created, add the additional policy “AmazonSQSFullAccess” to the same role.

Add the additional policy 'AmazonSQSFullAccess' to the same role.

As shown in the following screenshot, you should have both policies attached to the IAM role.

AmazonSQSFullAccess and AmazonAPIGatewayPushToCloudWatchLogs policies attached to the IAM role.

5. Setting up Amazon API Gateway

Create an API Gateway with Rest API as the API type.


Create the API with REST as your protocol and select New API. Then, select Edge optimized as your Endpoint Type.


Once the API is created, choose Create Method.


Create a POST method, as shown in the following screenshot.

Create a POST method

Once you select the POST method, select the checkmark icon as shown in the following screenshot:

Once you select the POST method, select the checkmark icon

Fill in the details per the following screenshot and Save.

  1. Path override should be set to <AWS account#>/<SQS name>
  2. Execution Role should have the value of the IAM role ARN created in the preceding section.

Post - Setup (1)

Select Integration Request, as shown in the following screenshot.

POST - Method execution - Select Integration Request

Fill in the HTTP Headers and Mapping Templates sections, as shown in the following screenshot.

  1. Under HTTP Headers
    1. Name: Content-Type
    2. Mapped from: 'application/x-www-form-urlencoded'
  2. To integrate API Gateway with Amazon SQS, we need to map the incoming message body to the MessageBody of the Amazon SQS service and set the Action to SendMessage. For details, refer to “How do I use API Gateway as a proxy for another AWS service?” For this solution’s walkthrough, under Mapping Templates, choose text/plain as the Content-Type, and under Generate template, provide the value Action=SendMessage&MessageBody=$util.urlEncode($input.body). Then, save it.

Fill in the HTTP Headers and Mapping Templates sections
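To make the effect of the mapping template concrete, the following sketch reproduces in Python roughly what API Gateway sends to Amazon SQS: a form-encoded body whose MessageBody field carries the URL-encoded incoming payload. It uses `urllib.parse.quote_plus` as a stand-in for `$util.urlEncode`, which is an assumption for illustration; the sample payload is hypothetical.

```python
from urllib.parse import parse_qs, quote_plus


def build_sqs_form_body(incoming_body: str) -> str:
    """Mimic Action=SendMessage&MessageBody=$util.urlEncode($input.body)."""
    return "Action=SendMessage&MessageBody=" + quote_plus(incoming_body)


# Hypothetical incoming payload containing characters that must be encoded.
raw = '{"Type": "Notification", "Message": "a & b = c"}'
form_body = build_sqs_form_body(raw)
print(form_body)

# Decoding the form body recovers the action and the original payload intact.
decoded = parse_qs(form_body)
print(decoded["Action"][0])
print(decoded["MessageBody"][0] == raw)
```

Without the URL encoding, characters such as "&" and "=" in the payload would be read as form field separators, and the message stored in the queue would be truncated.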

Once it’s saved, deploy the API. Choose Deploy API under API ACTIONS, as shown in the following screenshot.

Choose Deploy API under API ACTIONS

Under the Deploy API prompt, fill in the details as shown in the following screenshot, and then Deploy.

Under the Deploy API prompt, fill in the details and then Deploy.

Also, capture the API endpoint URL from the Stages tab, as shown in the following screenshot.

Also, capture the API endpoint URL from the Stages tab

6. Building AWS Lambda code with Python

Create an AWS Lambda function with the name “workdocs-to-s3” using the following function code. Select the Python runtime version 3.8.

Also, create an AWS Lambda layer compatible with Python 3.8 for Python’s Requests library (2.2.4) and its dependencies.

import json
import boto3
import requests
import logging

sns_client = boto3.client('sns')
ssm_client = boto3.client('ssm')
workdocs_client = boto3.client('workdocs')
s3_client = boto3.client('s3')

logger = logging.getLogger()
logger.setLevel(logging.INFO)

## The function to confirm the subscription from Amazon Workdocs
def confirmsubscription (topicArn, subToken):
    try:
        response = sns_client.confirm_subscription(
            TopicArn=topicArn,
            Token=subToken
        )
        # The response is a dict, so convert it before concatenating.
        logger.info ("Amazon Workdocs Subscription Confirmation Message : " + str(response))
    except Exception as e:
        logger.error("Error confirming subscription for Topic : " + str(topicArn) + " Exception Stacktrace : " + str(e) )
        # Re-raising fails the AWS Lambda function so that the event is retried.
        # One mechanism to handle retries is to configure a Dead Letter Queue (https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) as part of the Amazon SQS service.
        # Another mechanism is to skip raising the error and use Amazon CloudWatch to detect logged error messages, collect error metrics, and trigger a corresponding retry process.
        raise Exception("Error Confirming Subscription from Amazon Workdocs")
    
def copyFileworkdocstos3 (documentid):

    # ssm parameter code
    # Reading the Amazon S3 prefixes to Amazon Workdocs folder id mapping, Bucket Name and configured File Extensions from AWS System Manager.
    try:
        bucketnm = str(ssm_client.get_parameter(Name='/dl/workdocstos3/bucketname')['Parameter']['Value'])
        folder_ids = json.loads(ssm_client.get_parameter(Name='/dl/workdocstos3/folderids')['Parameter']['Value'])
        file_exts = str(json.loads(ssm_client.get_parameter(Name='/dl/workdocstos3/fileext')['Parameter']['Value'])['file_ext']).split(",")
        
        logger.info ("Configured Amazon S3 Bucket Name : " + bucketnm)
        logger.info ("Configured Folder Ids to be synced : " + str(folder_ids))
        logger.info ("Configured Supported File Extensions : " + str(file_exts))

        resp_doc = workdocs_client.get_document (DocumentId = documentid)
        logger.info ("Amazon Workdocs Metadata Response : " + str(resp_doc))
        
        # Retrieving the Amazon Workdocs Metadata
        parentfolderid = str(resp_doc['Metadata']['ParentFolderId'])
        docversionid = str(resp_doc['Metadata']['LatestVersionMetadata']['Id'])
        docname = str(resp_doc['Metadata']['LatestVersionMetadata']['Name'])
        
        logger.info ("Amazon Workdocs Parent Folder Id : " + parentfolderid)
        logger.info ("Amazon Workdocs Document Version Id : " + docversionid)
        logger.info ("Amazon Workdocs Document Name : " + docname)
        
        prefix_path = folder_ids.get(parentfolderid, None)
        logger.info ("Retrieving Amazon S3 Prefix Path : " + str(prefix_path))
        
        ## Currently the provided sample code supports syncing documents for the configured Amazon Workdocs Folder Ids in AWS System Manager and not for the sub-folders.
        ## It can be extended to supported syncing documents for the sub-folders.
        if ( (prefix_path != None) and (docname.endswith( tuple(file_exts) )) ):
            resp_doc_version = workdocs_client.get_document_version (DocumentId = documentid,
                                                     VersionId= docversionid,
                                                     Fields = 'SOURCE'
            )
            logger.info ("Retrieve Amazon Workdocs Document Latest Version Details : " + str(resp_doc_version))
            
            ## Retrieve Amazon Workdocs Download Url
            url = resp_doc_version["Metadata"]["Source"]["ORIGINAL"]
            logger.info ("Amazon Workdocs Download url : " + url)
            ## Retrieve Amazon Workdocs Document contents
            ## As part of this sample code, we are reading the document in memory but it can be enhanced to stream the document in chunks to Amazon S3 to improve memory utilization 
            workdocs_resp = requests.get(url)
            ## Uploading the Amazon Workdocs Document to Amazon S3
            response = s3_client.put_object(
                Body=bytes(workdocs_resp.content),
                Bucket=bucketnm,
                Key=f'{prefix_path}/{docname}',
            )
            logger.info ("Amazon S3 upload response : " + str(response))
        else:
            logger.info ("Unsupported File type")
    except Exception as e:
        logger.error("Error with processing Document : " + str(documentid) + " Exception Stacktrace : " + str(e) )
        # Re-raising fails the AWS Lambda function so that the event is retried.
        # One mechanism to handle retries is to configure a Dead Letter Queue (https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) as part of the Amazon SQS service.
        # Another mechanism is to skip raising the error and use Amazon CloudWatch to detect logged error messages, collect error metrics, and trigger a corresponding retry process.
        raise Exception("Error Processing Amazon Workdocs Events.")
    
    
def lambda_handler(event, context):
    logger.info ("Event Received from Amazon Workdocs : " + str(event))
        
    msg_body = json.loads(str(event['Records'][0]['body']))

    ## To Process Amazon Workdocs Subscription Confirmation Event
    if msg_body['Type'] == 'SubscriptionConfirmation':
        confirmsubscription (msg_body['TopicArn'], msg_body['Token'])
    ## To Process Amazon Workdocs Notifications
    elif (msg_body['Type'] == 'Notification') :
        event_msg = json.loads(msg_body['Message'])
        ## To Process Amazon Workdocs Move Document Event
        if (event_msg['action'] == 'move_document'):
            copyFileworkdocstos3 (event_msg['entityId'])
        ## To Process Amazon Workdocs Upload Document when a new version of the document is updated
        elif (event_msg['action'] == 'upload_document_version'):
            copyFileworkdocstos3 (event_msg['parentEntityId'])
        else:
            ## Currently the provided sample code supports two Amazon Workdocs Events but it can be extended to process other Amazon Workdocs Events.
            ## Refer to this link for details on other supported Amazon Workdocs events: https://docs.aws.amazon.com/workdocs/latest/developerguide/subscribe-notifications.html
            logger.info("Unsupported Action Type")
    else:
        ## Currently the provided sample code supports two Amazon Workdocs Events but it can be extended to process other Amazon Workdocs Events.
        ## Refer to this link for details on other supported Amazon Workdocs events: https://docs.aws.amazon.com/workdocs/latest/developerguide/subscribe-notifications.html
        logger.info("Unsupported Event Type")
   
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Amazon Workdoc sync to Amazon S3 Lambda!')
    }
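To see how the handler decides what to do with a message, it can help to isolate the routing logic from the AWS calls. The sketch below mirrors the branching in lambda_handler and runs it against a hand-built event. The function name `route_event` and the sample IDs are hypothetical, and the event contains only the fields the handler reads, not a complete Amazon SQS record.

```python
import json


def route_event(event):
    """Return (kind, value) describing what the first SQS record asks for."""
    msg_body = json.loads(event["Records"][0]["body"])
    if msg_body["Type"] == "SubscriptionConfirmation":
        return ("confirm", msg_body["Token"])
    if msg_body["Type"] == "Notification":
        event_msg = json.loads(msg_body["Message"])
        if event_msg["action"] == "move_document":
            return ("sync", event_msg["entityId"])
        if event_msg["action"] == "upload_document_version":
            # For a new version, the document ID is the parent entity.
            return ("sync", event_msg["parentEntityId"])
    return ("ignore", None)


# Hypothetical event: the SQS body is a JSON string whose "Message" field is
# itself a JSON string, so it is decoded twice.
sample_event = {
    "Records": [
        {
            "body": json.dumps(
                {
                    "Type": "Notification",
                    "Message": json.dumps(
                        {
                            "action": "upload_document_version",
                            "entityId": "version-id",
                            "parentEntityId": "document-id",
                        }
                    ),
                }
            )
        }
    ]
}
print(route_event(sample_event))
```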

The following screenshot shows the AWS Lambda layer created based on Python’s Requests library (2.2.4):

AWS Lambda layer created based on Python’s Requests library (2.2.4)

Add the AWS Lambda Layer to AWS Lambda function. For more details, refer to the documentation on configuring a function to use layers.

Add the AWS Lambda Layer to AWS Lambda function

Update the AWS Lambda function “workdocs-to-s3” Timeout and Memory (MB) settings as shown in the following screenshot (15 min 0 seconds and 3008 MB, respectively). For more details, refer to the documentation on configuring Lambda function memory.

Update the AWS Lambda function “workdocs-to-s3” Timeout and Memory (MB) settings

Update the IAM execution role of the AWS Lambda function “workdocs-to-s3” by selecting the function and navigating to the Permissions tab. For more details, refer to the documentation on the AWS Lambda execution role.

In this example, we add the following AWS managed policies:

  • AmazonSQSFullAccess
  • AmazonS3FullAccess
  • AmazonSSMFullAccess
  • AmazonSNSFullAccess
  • AmazonWorkDocsFullAccess

Note: In this example for simplicity, the AWS Lambda IAM Execution role will be provided full access to the concerned AWS services. We recommend enhancing the AWS Lambda function’s IAM execution role to provide more granular access for a production environment. For more details, refer to the documentation on policies and permissions in IAM.

Update the AWS Lambda function’s 'workdocs-to-s3' IAM execution role

Attach all the required policies, as shown in the following screenshot.

Attach all the required policies

Add a trigger to AWS Lambda by using the SQS Queue that was created. Change the Batch size to 1.
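The same trigger can be sketched with the boto3 `create_event_source_mapping` call. The helper name, function name argument, and queue ARN are placeholders for your own values.

```python
def add_sqs_trigger(lambda_client, function_name, queue_arn, batch_size=1):
    """Connect the SQS queue to the Lambda function with the given batch size."""
    response = lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=batch_size,  # 1 means one WorkDocs event per invocation
        Enabled=True,
    )
    return response["UUID"]


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   add_sqs_trigger(boto3.client("lambda"), "workdocs-to-s3",
#                   "arn:aws:sqs:us-east-1:<AWS account#>:<SQS name>")
```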


7. Setting up the WorkDocs notification

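You need an IAM role to set up WorkDocs notifications. For the purposes of this blog post, we use an admin role. For more details, refer to the Amazon WorkDocs developer guide on subscribing to notifications (https://docs.aws.amazon.com/workdocs/latest/developerguide/subscribe-notifications.html).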

In the WorkDocs console, access WorkDocs notifications by selecting Manage Notifications under Actions, as shown in the following screenshot.

In the WorkDocs console, access WorkDocs notifications by selecting Manage Notifications under Actions

Select Enable Notification, as shown in the following screenshot:

Select Enable Notification

Provide the ARN of the IAM role described at the start of this section and select Enable.

Provide the ARN and select Enable.

Access AWS CloudShell from the AWS Management Console. Run the following command to subscribe to the notification. The organization-id value is the directory ID of your WorkDocs site from AWS Directory Service.

aws workdocs create-notification-subscription \
--organization-id <directory id from Directory Service> \
--protocol HTTPS \
--subscription-type ALL \
--notification-endpoint <Api Endpoint from Setting up Amazon API Gateway step>

Run the following command to subscribe to the Amazon WorkDocs notification
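If you would rather subscribe from Python than from the AWS CLI, the equivalent boto3 call is `create_notification_subscription`. This is a minimal sketch: the helper name is hypothetical, and the organization ID and endpoint are the same two values the CLI command takes.

```python
def subscribe_to_workdocs(workdocs_client, organization_id, api_endpoint):
    """Subscribe an HTTPS endpoint to all WorkDocs notifications for a site."""
    response = workdocs_client.create_notification_subscription(
        OrganizationId=organization_id,  # directory ID from Directory Service
        Endpoint=api_endpoint,           # API endpoint URL from the Stages tab
        Protocol="HTTPS",
        SubscriptionType="ALL",
    )
    return response["Subscription"]["SubscriptionId"]


# Usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   subscribe_to_workdocs(boto3.client("workdocs"),
#                         "<directory id from Directory Service>",
#                         "<Api Endpoint from Setting up Amazon API Gateway step>")
```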

8. Testing the Solution

First, verify that the WorkDocs folder and Amazon S3 bucket are empty. Then, upload a file into the WorkDocs folder.

Upload a file into the WorkDocs folder

Next, you should see that the file is available in Amazon S3.


Things to consider

This solution should help you set up an auto-sync mechanism for files from Amazon WorkDocs to Amazon S3. For more ways to expand this solution, consider the following factors.

File size

This solution is designed to handle files in the range of a few MBs to 2 GB. As part of the solution, the file is read in memory before syncing it to Amazon S3, but the Lambda code can be enhanced to stream the file in chunks to improve memory utilization and handle large files.
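One way to approach the streaming enhancement is to hand a file-like HTTP response directly to the boto3 `upload_fileobj` call, which reads it in chunks instead of requiring the whole document in memory. The sketch below is an illustration, not the code used in this post: it swaps the Requests library for the standard library's `urllib.request.urlopen` so that it is self-contained, and the function name and `opener` parameter are hypothetical.

```python
import urllib.request


def stream_document_to_s3(s3_client, url, bucket, key, opener=urllib.request.urlopen):
    """Copy the document at `url` to S3 without loading it fully into memory."""
    with opener(url) as response:
        # upload_fileobj reads from the file-like object in chunks; the
        # `opener` argument exists so the download can be replaced in tests.
        s3_client.upload_fileobj(response, bucket, key)


# Usage inside copyFileworkdocstos3, replacing requests.get and put_object
# (assumes the same url, bucketnm, prefix_path, and docname variables):
#   stream_document_to_s3(s3_client, url, bucketnm, f"{prefix_path}/{docname}")
```

With the Requests library, a comparable file-like object is available as `requests.get(url, stream=True).raw`.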

Monitoring

Monitoring can be done using Amazon CloudWatch, which acts as a centralized logging service for all AWS services. You can configure Amazon CloudWatch to trigger alarms for AWS Lambda failures. You can further configure the CloudWatch alarms to trigger processes that can re-upload or copy the failed Amazon S3 objects. Another approach is to configure a dead-letter queue on the Amazon SQS queue, which captures messages that still fail after the configured number of retries so that a retry process can be invoked.
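As an illustration of the dead-letter queue option, the following sketch builds the RedrivePolicy attribute and applies it with the boto3 `set_queue_attributes` call. The helper names and the retry count of 5 are assumptions; choose a count that suits your retry requirements.

```python
import json


def build_redrive_attributes(dead_letter_queue_arn, max_receive_count=5):
    """Return queue attributes that route repeatedly failing messages to a DLQ."""
    return {
        # RedrivePolicy is itself a JSON string inside the attributes map.
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dead_letter_queue_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        )
    }


def attach_dead_letter_queue(sqs_client, queue_url, dead_letter_queue_arn, max_receive_count=5):
    """Apply the redrive policy to the source queue."""
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dead_letter_queue_arn, max_receive_count),
    )


# Placeholder ARN, used only to show the generated attribute.
print(build_redrive_attributes("arn:aws:sqs:us-east-1:<AWS account#>:<DLQ name>"))
```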

IAM policy

We recommend you turn on S3 Block Public Access to ensure that your data remains private. To ensure that public access to all your S3 buckets and objects is blocked, turn on block all public access at the account level. These settings apply account-wide for all current and future buckets. If you require some level of public access to your buckets or objects, you can customize the individual settings to suit your specific storage use cases. Also, update your AWS Lambda execution IAM role policy, Amazon WorkDocs enable notification role, and Amazon SQS access policy to follow the standard security advice of granting least privilege or granting only the permissions required to perform a task.
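As a starting point for tightening the Lambda execution role, the sketch below assembles a policy document limited to the API calls the sample function makes (PutObject, GetParameter, GetDocument, GetDocumentVersion, ConfirmSubscription) plus the Amazon SQS actions a queue trigger typically needs. The function name, statement IDs, and resource ARNs are assumptions, not a verified policy. Validate it against your own environment, and note that Amazon CloudWatch Logs permissions are still required separately.

```python
import json


def build_sync_policy(bucket_name, queue_arn, account_id, region="us-east-1"):
    """Return a candidate least-privilege policy for the sync function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "WriteSyncedObjects", "Effect": "Allow",
             "Action": ["s3:PutObject"],
             "Resource": f"arn:aws:s3:::{bucket_name}/*"},
            {"Sid": "ReadSyncParameters", "Effect": "Allow",
             "Action": ["ssm:GetParameter"],
             "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/dl/workdocstos3/*"},
            {"Sid": "ConsumeQueue", "Effect": "Allow",
             "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
             "Resource": queue_arn},
            # Assumption: these two statements are left unscoped for simplicity.
            {"Sid": "ReadWorkDocs", "Effect": "Allow",
             "Action": ["workdocs:GetDocument", "workdocs:GetDocumentVersion"],
             "Resource": "*"},
            {"Sid": "ConfirmSubscription", "Effect": "Allow",
             "Action": ["sns:ConfirmSubscription"],
             "Resource": "*"},
        ],
    }


# Hypothetical values, used only to print the resulting document.
policy = build_sync_policy("my-workdocs-sync-bucket",
                           "arn:aws:sqs:us-east-1:111122223333:workdocs-to-s3-queue",
                           "111122223333")
print(json.dumps(policy, indent=2))
```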

Amazon WorkDocs document locked

If the WorkDocs document is locked for collaboration, it will sync to Amazon S3 only after unlocking or releasing the document.

Lambda batch size

For our example in this blog post, we used a batch size of 1 for the AWS Lambda function’s Amazon SQS trigger. As shown in the following screenshot, this can be modified to process multiple events in a single batch. In addition, you can extend the AWS Lambda function code to process multiple events and handle partial failures in a particular batch.

We used a batch size of 1 for the AWS Lambda function’s Amazon SQS trigger - this can be modified and extended

Cleaning up and pricing

To avoid incurring future charges, delete the resources set up as part of this post:

  • Amazon WorkDocs
  • API Gateway
  • Amazon SQS
  • Systems Manager parameters
  • AWS Lambda
  • S3 bucket
  • IAM roles

For the cost details, please refer to the service pages: Amazon S3 pricing, Amazon API Gateway pricing, Lambda pricing, Amazon SQS pricing, AWS Systems Manager pricing, and Amazon WorkDocs pricing.

Conclusion

This post demonstrated a solution for setting up an auto-sync mechanism for synchronizing files from Amazon WorkDocs to Amazon S3 in near-real-time using Amazon API Gateway and AWS Lambda. This will avoid the tedious manual activity of moving files from Amazon WorkDocs to Amazon S3 and let customers focus on data analysis.

Thanks for reading this post on automatically syncing files from Amazon WorkDocs to Amazon S3. If you have any feedback or questions, feel free to leave them in the comments section. You can also start a new thread on the Amazon WorkDocs forum.

Vamsi Bhadriraju


Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.

Abhishek Gupta


Abhishek Gupta is a Data and ML Engineer with AWS Professional Services. He helps customers implement big data and analytics solutions.

Jeetendra Vaidya


Jeetendra Vaidya is a Solutions Architect at AWS based in Chicago, IL. He is a Serverless and AI/ML enthusiast and loves helping customers architect secure, scalable, reliable, and cost-effective solutions.