AWS Storage Blog

Event-driven data transfer to container-shared storage on AWS

Businesses using data lake solutions built on Amazon S3 often want their data science teams to have access to that same data for machine learning or analytics projects deployed on tools like RStudio Server and Shiny. To do so, they can deploy these tools in the cloud using Amazon ECS or Amazon EKS serverless containers with AWS Fargate, and access data through shared persistent file storage such as that provided by Amazon EFS. For data scientists to access the data in their project environments, automatic replication from Amazon S3 into Amazon EFS makes the data readily available within their containers in the cloud.

In this blog, we demonstrate using AWS DataSync to transfer ingested data from a cross-account Amazon S3 bucket to an Amazon EFS file system mounted on an Amazon ECS container with AWS Fargate. For data delivery outside of defined DataSync schedules, AWS Lambda synchronizes on demand using S3 file upload events. To save on costs, we use an Amazon EFS Lifecycle Management policy to automatically and transparently move data to the lower cost Amazon EFS Standard-Infrequent Access (EFS Standard-IA) storage class. This solution follows the best practices laid out in the AWS Well-Architected Framework.

Using this solution, you can containerize legacy applications that are dependent on attached storage (like RStudio Server and Shiny), moving them from on-premises to the cloud. The AWS Cloud Development Kit (AWS CDK) application example provided in this post can help you transfer ingested data in an event-driven fashion to persistent storage on your containers.

Architecting the solution

The serverless example use case in this blog addresses RStudio Server and Shiny App, respectively a popular R integrated development environment (IDE) and an interactive web application framework, both used by data scientists for developing R applications. In this use case, we refactor on-premises RStudio Server and Shiny App installations into serverless architectures in AWS, integrating them with an already existing data lake built on Amazon S3. In the demonstrated setup, users upload source files for analysis to an Amazon S3 bucket, and AWS DataSync transfers those files to Amazon ECS containers with AWS Fargate.

In cases where time-sensitive data analysis is required, you may need your files delivered quickly from Amazon S3 to the Amazon ECS container with AWS Fargate. For example, a financial company might require urgent data analysis on sudden market movements. We demonstrate this scenario in this blog, showing you how you can trigger the DataSync task on demand and programmatically. However, we recommend running tasks periodically for most data transfer use cases using DataSync, either on a schedule or at less frequent intervals. This allows for more efficient bulk transfers of data and avoids throttling that may occur if you conduct tasks too frequently.

In this solution, the agentless AWS DataSync performs serverless data delivery from Amazon S3 to Amazon EFS. An AWS Lambda function starts the DataSync task as soon as a file is uploaded to the S3 bucket, initiating the replication between S3 and EFS.

Figure 1: Data delivery from Amazon S3 to Amazon EFS using AWS DataSync

Process flow

Numbered items refer to Figure 1.

  1. Users upload files to Amazon S3 directly or via AWS Transfer Family using SFTP, or internal AWS services send files to S3.
  2. The Amazon S3 file upload event triggers AWS Lambda to start the DataSync task.
  3. The defined DataSync task copies the file from S3 to EFS.
  4. File becomes available on Amazon ECS containers with AWS Fargate that have the Amazon EFS volume mounted as persistent storage.
  5. File moves from Amazon EFS Standard to Amazon EFS Standard-IA as defined by the lifecycle policy to save costs.

Key design considerations

  • Amazon EFS on AWS Fargate: Amazon EFS is a shared file system that stores data in multiple Availability Zones within an AWS Region for data durability and high availability. AWS Fargate is a serverless container service that provides compute capacity for Amazon ECS and Amazon EKS. When a Fargate task terminates, the container file system is destroyed. As we want to persist data in the containers, we mount an Amazon EFS file system on the Amazon ECS container with AWS Fargate for persistent shared storage. RStudio Server and Shiny require persistent storage in the Amazon ECS container with AWS Fargate.
  • Integration with Amazon S3: AWS Identity and Access Management (AWS IAM) roles enable cross-account integration between the S3 account and the Amazon EFS file system on an Amazon ECS container with AWS Fargate. These roles allow the DataSync service to pick up files from the S3 bucket and push them to the Amazon EFS file system mounted on the Amazon ECS container with AWS Fargate that is running RStudio Server and Shiny App.
  • Data delivery via AWS DataSync: AWS DataSync automates data transfer between Amazon S3 and Amazon EFS. This service is responsible for maintaining the data integrity and security of the file transfer to Amazon EFS for the uploaded files in S3.
  • AWS Lambda for instant data delivery: AWS Lambda is a serverless compute service that executes code in response to events. The event-driven architecture in this solution requires the DataSync task to be triggered by an event. To address this, a Lambda function starts the DataSync task from S3 to Amazon EFS on the Amazon ECS container with AWS Fargate in response to file upload events on the S3 bucket.
  • Scalability and availability: AWS services used in this architecture are fault-tolerant, available, and durable without requiring additional intervention in case of a failure in the underlying infrastructure. Amazon ECS containers with AWS Fargate can be defined to tolerate failure and spin up services in another Availability Zone within the Amazon VPC. The Amazon EFS file systems and Amazon S3 span multiple Availability Zones for durability and availability. AWS KMS is a Regional service designed to withstand Availability Zone failures within a Region, and AWS IAM is a global service. As this architecture is automated and delivered via AWS CDK, it is possible to spin up the entire infrastructure in another AWS Region to address disaster recovery (DR) in case of an extremely rare Regional failure.
  • Cost: Amazon EFS can provide substantial cost savings (storage that costs 7-8 times less, depending on usage) with the appropriate use of a lifecycle policy that stores older files in Amazon EFS Standard-IA. Files stored in EFS Standard-IA remain available from the Amazon ECS container with AWS Fargate, although with a slightly increased response time.

Deployment walkthrough 

The following steps describe the process of configuring DataSync to synchronize an Amazon S3 bucket with an Amazon EFS file system using the AWS Management Console:

  1. Create the source bucket in account A and configure the source S3 bucket for cross-account access by attaching a resource policy to it.
  2. Create an IAM role on the target account (account B) to allow DataSync access to the source bucket (located in account A).
  3. Create a DataSync source location for S3 in the target account and point it to the source bucket.
  4. Create the target Amazon EFS file system and configure Lifecycle Management on the Amazon EFS file system.
  5. Create a DataSync location pointing to the target Amazon EFS file system and create a new DataSync task using the previously created source and destination locations.
  6. Define an AWS Lambda function to support immediate file transfer when a user uploads a file into the S3 bucket.
  7. Configure an Amazon ECS container with AWS Fargate, mount the Amazon EFS file system on it, and confirm that the file was transferred successfully.

For this walkthrough, we assume the bucket containing the source files is located in account A and the Amazon EFS file system is located in account B within the same AWS Region.

1. Create the source bucket in account A and configure the source bucket for cross-account access by attaching a resource policy to it:

  1. Log in to the AWS Management Console for account A.
  2. Create a new S3 bucket by following this guide.
  3. Name the bucket – test-source-upload-bucket. Keep this bucket name handy for the following sections.
  4. From the Amazon S3 console in account A, add a bucket policy. Details on how to add a policy to an S3 bucket can be obtained here.
  5. Copy and paste the policy located in file into the Policy area, replacing the values in the placeholders <…>.
  6. Save your changes.
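
The policy file referenced above is not reproduced in this post. As a minimal sketch of what such a cross-account bucket policy might contain, the following Python builds a policy document that grants a role in account B read access to the source bucket. The role ARN, the statement ID, and the exact action list are assumptions for illustration; check them against the DataSync documentation for your own setup.

```python
import json


def build_bucket_policy(bucket_name: str, datasync_role_arn: str) -> dict:
    """Return a bucket policy that lets a cross-account role read the bucket.

    Both arguments are supplied by the caller. The action list is an
    assumption and should be trimmed or extended to what your task needs.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DataSyncCrossAccountAccess",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": datasync_role_arn},
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:GetObject",
                    "s3:GetObjectTagging",
                    "s3:ListMultipartUploadParts",
                ],
                # Bucket-level actions match the bucket ARN, object-level
                # actions match the /* form, so both are listed.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    policy = build_bucket_policy(
        "test-source-upload-bucket",
        "arn:aws:iam::111122223333:role/datasync-s3-access-role",  # hypothetical
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Policy area of the bucket in account A.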

2. Create an IAM role on the target account (account B) to allow DataSync access to the source bucket (located in account A).

  1. Log in to account B (target account) console. Instructions on how to configure the role can be obtained here.
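
For readers who prefer scripting this step, here is a rough boto3 sketch of a role that DataSync can assume. The role name, inline policy name, and permission list are hypothetical; the trust policy names the DataSync service principal, which is what allows DataSync to assume the role.

```python
import json


def datasync_trust_policy() -> dict:
    """Trust policy that lets the DataSync service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "datasync.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def source_bucket_permissions(bucket_name: str) -> dict:
    """Read permissions on the source bucket (action list is an assumption)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectTagging"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


def create_datasync_role(role_name: str, bucket_name: str) -> str:
    """Create the role in account B and return its ARN (this calls AWS)."""
    import boto3  # imported lazily so the helpers above work without it

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(datasync_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="source-bucket-read",  # hypothetical inline policy name
        PolicyDocument=json.dumps(source_bucket_permissions(bucket_name)),
    )
    return role["Role"]["Arn"]
```

The returned role ARN is the value to place in the source bucket policy in account A and to pass as the bucket access role when creating the DataSync S3 location.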

3. Create a DataSync source location for Amazon S3 in the target account and point it to the source bucket.

  1. The DataSync console does not allow configuration of a source location pointing to a cross-account S3 bucket. We will use the AWS CLI to do this part.
  2. For this step to work, you must configure a profile for CLI with credentials pointing to the target account (account B).
  3. Create a DataSync location by executing the command located in file.
  4. The preceding command will return the ARN of the new Amazon S3 location you just created.
  5. Navigate to DataSync using the AWS Management Console. Choose Locations and filter by the location ID obtained in the previous step – for example, loc-00b5eccb098d4fb39. Note the source bucket name listed under the Host column:

DataSync in the console - choosing Locations and filtering by location ID, taking note of the source bucket name
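
The CLI command itself is not reproduced in this post. As an approximate boto3 equivalent, the sketch below creates the S3 location from a profile for account B and extracts the location ID used to filter in the console. The profile name and role ARN are placeholders you would supply.

```python
def location_id_from_arn(location_arn: str) -> str:
    """Return the 'loc-...' ID, which is the last path segment of the ARN."""
    return location_arn.rsplit("/", 1)[-1]


def create_s3_location(bucket_name: str, access_role_arn: str,
                       profile_name: str) -> str:
    """Create a DataSync S3 location in account B (this calls AWS).

    profile_name is assumed to be a local CLI profile whose credentials
    point to the target account.
    """
    import boto3  # imported lazily so location_id_from_arn works without it

    session = boto3.Session(profile_name=profile_name)
    datasync = session.client("datasync")
    response = datasync.create_location_s3(
        S3BucketArn=f"arn:aws:s3:::{bucket_name}",
        S3Config={"BucketAccessRoleArn": access_role_arn},
    )
    return response["LocationArn"]
```

For example, passing the ARN returned by `create_s3_location` to `location_id_from_arn` yields an ID of the form loc-00b5eccb098d4fb39.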

4. Create the target Amazon EFS file system and configure Lifecycle Management on the Amazon EFS file system.

  1. Using the AWS Management Console on the target account, refer to this documentation for instructions on how to create a file system.
  2. Give the file system a name – for example, “TestEFS.”
  3. Add an access point to the preceding file system (see instructions for working with access points).
  4. To save cost on the Amazon EFS file system, configure the Lifecycle Management policy (30 days since last access) by following this guide.
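
The 30-day lifecycle setting can also be applied programmatically. This is a minimal sketch using the Amazon EFS `put_lifecycle_configuration` API through boto3; the file system ID is a placeholder, and the helper deliberately accepts only a subset of the transition periods EFS supports.

```python
def lifecycle_policies(days: int = 30) -> list:
    """Build the LifecyclePolicies value for a transition to Standard-IA.

    Only a subset of the periods EFS supports is accepted here, to keep
    the sketch simple.
    """
    allowed = {7, 14, 30, 60, 90}
    if days not in allowed:
        raise ValueError(f"days must be one of {sorted(allowed)}")
    return [{"TransitionToIA": f"AFTER_{days}_DAYS"}]


def apply_lifecycle(file_system_id: str, days: int = 30) -> None:
    """Apply the policy to an existing file system (this calls AWS)."""
    import boto3  # imported lazily so lifecycle_policies works without it

    efs = boto3.client("efs")
    efs.put_lifecycle_configuration(
        FileSystemId=file_system_id,  # for example "fs-0123456789abcdef0"
        LifecyclePolicies=lifecycle_policies(days),
    )
```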

5. Create a DataSync location pointing to the target Amazon EFS file system and create a new DataSync task using the previously created source and destination locations.

  1. With the destination Amazon EFS file system and access point created in step 4, we can now create the DataSync destination location of Amazon EFS type using the console. To create an Amazon EFS location for DataSync using the console, follow the steps in this guide.
  2. Ensure you are logged into account B’s console and navigate to the DataSync console. Follow the process here for creating a DataSync task.
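
As an alternative to the console, the destination location and the task can be created with boto3. The sketch below is one possible shape, not the post's own code: all ARNs and the task name are placeholders, and the subnet and security groups are assumed to allow NFS access to the file system's mount targets.

```python
def efs_location_kwargs(file_system_arn: str, access_point_arn: str,
                        subnet_arn: str, security_group_arns: list) -> dict:
    """Build the arguments for DataSync's create_location_efs call."""
    return {
        "EfsFilesystemArn": file_system_arn,
        "AccessPointArn": access_point_arn,
        # DataSync requires TLS 1.2 in transit when an access point is used.
        "InTransitEncryption": "TLS1_2",
        "Ec2Config": {
            "SubnetArn": subnet_arn,
            "SecurityGroupArns": list(security_group_arns),
        },
    }


def create_efs_location_and_task(source_location_arn: str, task_name: str,
                                 **efs_kwargs) -> str:
    """Create the EFS location and the task, returning the task ARN (calls AWS)."""
    import boto3  # imported lazily so efs_location_kwargs works without it

    datasync = boto3.client("datasync")
    destination = datasync.create_location_efs(**efs_location_kwargs(**efs_kwargs))
    task = datasync.create_task(
        SourceLocationArn=source_location_arn,
        DestinationLocationArn=destination["LocationArn"],
        Name=task_name,
    )
    return task["TaskArn"]
```

Keep the returned task ARN; the Lambda function in the next step needs it to start the task.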

6. Define the Lambda function to support immediate file transfer when a file is uploaded into the S3 bucket.

The Lambda function will be created in account B, and the upload event to the source bucket in account A invokes it.
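
The sub-steps that follow set up this cross-account invocation through the console and CLI. As a rough boto3 sketch of the same wiring, to be run once the function exists, the code below grants Amazon S3 permission to invoke the function and then registers the bucket notification. The statement ID and the two profile names are hypothetical.

```python
def s3_invoke_permission(function_name: str, bucket_name: str,
                         source_account_id: str) -> dict:
    """Arguments for lambda add_permission, scoped to one bucket and account."""
    return {
        "FunctionName": function_name,  # a function name or ARN
        "StatementId": "AllowS3InvokeFromSourceBucket",  # hypothetical ID
        "Action": "lambda:InvokeFunction",
        "Principal": "s3.amazonaws.com",
        "SourceArn": f"arn:aws:s3:::{bucket_name}",
        "SourceAccount": source_account_id,
    }


def bucket_notification(function_arn: str) -> dict:
    """Notification configuration that fires on every object creation."""
    return {
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": function_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    }


def wire_bucket_to_lambda(function_arn: str, bucket_name: str,
                          source_account_id: str,
                          account_a_profile: str, account_b_profile: str) -> None:
    """Apply both halves of the wiring (this calls AWS in both accounts)."""
    import boto3  # imported lazily so the helpers above work without it

    # Account B owns the function, so the permission is added there.
    lambda_b = boto3.Session(profile_name=account_b_profile).client("lambda")
    lambda_b.add_permission(
        **s3_invoke_permission(function_arn, bucket_name, source_account_id)
    )

    # Account A owns the bucket. Note that this call replaces the bucket's
    # entire existing notification configuration.
    s3_a = boto3.Session(profile_name=account_a_profile).client("s3")
    s3_a.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=bucket_notification(function_arn),
    )
```

The permission is added before the notification because Amazon S3 validates that it can invoke the destination function when the notification is saved.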

1. Log in to the console and navigate to AWS Lambda. Create a Python Lambda function by using the code in file – replace the placeholders inside <…> with your values. Select the Deploy button when done.

2. Use the CLI to add a resource-based policy to the preceding Lambda function. This policy allows the remote S3 bucket in account A to invoke this Lambda function. Run the CLI command in file, replacing the placeholders inside <…> with your values. Upon success, the command returns the policy statement in JSON format.

3. Reload the Lambda function in the AWS Management Console, visit the Configuration tab, and click on Permissions. Scroll down to the Resource-based policy area and confirm that the policy has been added.

4. Add a permission to allow the Lambda function to start a DataSync task. From the Lambda console, go to Configuration > Permissions and then click on the role name shown under the Execution role section. See file for sample permissions.

5. Configure the bucket notification to invoke the Lambda function when an object is created. Copy the ARN of the Lambda function from the Function overview section on the Lambda console.

6. Log in to the AWS Management Console of the source account (account A) and navigate to the Amazon S3 console. Configure an event notification for object creation events that invokes the Lambda function, using the function ARN you copied. Details on configuring S3 Event Notifications can be referenced here.

7. Proceed to upload a file into the source bucket. The status of the DataSync task should change to RUNNING to show that the task has been triggered.

By default, DataSync compares the source and destination and copies over all the differences between Amazon S3 and the destination. For efficiency, the preceding function passes an include filter to DataSync so that only the file that triggered the Lambda function is transferred. See the documentation on how DataSync transfers files for further information.
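
The function code itself is not reproduced in this post, so the following is only a sketch of how such a handler might build the include filter from the S3 event and start the task. The `DATASYNC_TASK_ARN` environment variable is a hypothetical name, and the filter assumes the DataSync source location points at the bucket root rather than a subdirectory.

```python
import os
from urllib.parse import unquote_plus


def include_filter_for(event: dict) -> list:
    """Build a DataSync include filter from the object keys in an S3 event."""
    keys = [
        # S3 URL-encodes object keys in event notifications.
        unquote_plus(record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]
    # Patterns are paths relative to the source location, each starting
    # with "/", and multiple patterns are separated by "|".
    pattern = "|".join(f"/{key}" for key in keys)
    return [{"FilterType": "SIMPLE_PATTERN", "Value": pattern}]


def lambda_handler(event, context):
    """Start the DataSync task for just the uploaded object(s)."""
    import boto3  # available in the Lambda Python runtime

    datasync = boto3.client("datasync")
    # The execution role needs permission to start the task.
    response = datasync.start_task_execution(
        TaskArn=os.environ["DATASYNC_TASK_ARN"],  # hypothetical variable name
        Includes=include_filter_for(event),
    )
    return {"taskExecutionArn": response["TaskExecutionArn"]}
```

Passing `Includes` at execution time overrides the task's filter for that run only, so scheduled executions of the same task still perform a full comparison.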

7. Configure an Amazon ECS container with AWS Fargate, mount the Amazon EFS file system on it, and confirm that files transfer successfully.

1. Create an Amazon ECS cluster with a Fargate task by following this guide. Once the task is running, navigate to the Amazon ECS console and click on your new cluster to view the details. Click on the Tasks tab. Click on the Task definition link next to the task.

2. Click the Create new revision button.

Create new revision

3. Scroll down to the Volumes section and click Add volume:

Volumes section and click Add volume

4. Enter a name for the new volume and select EFS under the Volume type field. Select the File system ID that you created in the preceding steps, and then select the corresponding Access point ID. Check the Enable transit encryption check box. Click the Add button.

Amazon ECS console - adding a volume - volume type EFS

5. Click the Create button on the bottom right to proceed, then click on Actions and select Update service.

6. Check the Force new deployment check box, click Skip to review, and then click Update Service.

7. Once the task is in a RUNNING status, connect to your container instance and verify DataSync copied the files successfully. To connect to your container instance, follow this guide.
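
The console settings chosen in the sub-steps above correspond to two fragments of an Amazon ECS task definition. The sketch below builds them as plain dictionaries in the shape accepted by the `register_task_definition` API; the volume name, IDs, and container path are placeholders.

```python
def efs_volume(name: str, file_system_id: str, access_point_id: str) -> dict:
    """Task-level volume entry backed by an EFS access point."""
    return {
        "name": name,
        "efsVolumeConfiguration": {
            "fileSystemId": file_system_id,
            # Transit encryption must be enabled when an access point is used.
            "transitEncryption": "ENABLED",
            "authorizationConfig": {"accessPointId": access_point_id},
        },
    }


def mount_point(volume_name: str, container_path: str) -> dict:
    """Container-level entry that mounts the volume at container_path."""
    return {
        "sourceVolume": volume_name,
        "containerPath": container_path,
        "readOnly": False,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    volume = efs_volume("shared-data", "fs-0123456789abcdef0",
                        "fsap-0123456789abcdef0")
    mount = mount_point("shared-data", "/mnt/efs")
    print(volume)
    print(mount)
```

The volume entry goes in the task definition's `volumes` list, and the mount point goes in the `mountPoints` list of the container that should see the files. The `sourceVolume` value must match the volume's `name`.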

Automation with AWS CDK

We have created a set of three AWS CDK Python stacks to automate the following functionality:

  • Configuration of the source Amazon S3 Bucket
  • Configuration of the destination Amazon EFS file system
  • Creation and configuration of the Lambda function to trigger DataSync Tasks
  • Configuration of the DataSync source and destination locations in addition to the task

The stacks are available from aws-samples in GitHub, with instructions on how to deploy them provided in the readme.

Cleaning up

Remember to remove the deployment once you are done testing, either from the console or by deleting the AWS CloudFormation stacks that AWS CDK created.

Conclusion

In this blog, we demonstrated how to ingest data into container-shared persistent storage for applications that need access to such data. You can deploy the serverless and event-driven architecture described in this blog from the GitHub repository. Using persistent shared storage on Amazon ECS containers with AWS Fargate, along with AWS DataSync and Amazon EFS, can provide you with flexibility in architecting and migrating long-running legacy workloads like RStudio Server and Shiny App to containers in the cloud.

Thanks for reading this blog post on event-driven data transfer to container-shared storage using Amazon EFS and AWS DataSync. If you have any comments or questions, please don’t hesitate to leave them in the comments section.

Chayan Panda

Chayan Panda is a Cloud Infrastructure Architect. He provides advisory services and thought leadership to AWS customers on robust solution design for Cloud Migrations, Cloud Infrastructure (Security, Network, DevOps), Greenfield platform implementations, Big Data/AI/ML, Serverless and Database solutions. When he is not obsessing about customers, he enjoys a short run, music, a book or travel with his family.

Mukosi Mukwevho

Mukosi Mukwevho is a Consultant (Application Development) working for the AWS Professional Services team. He works with customers on both cloud migrations and improving cloud-native applications. Outside of work, he is into fitness and health, and enjoys long-distance running (30 km+) and weightlifting.