AWS Storage Blog
Automate data transfers and migrations with AWS DataSync and Terraform
In today’s data-driven world, organizations face the challenge of efficiently managing and consolidating vast amounts of information from diverse sources. Whether it’s for analytics, machine learning (ML), or other business-critical applications, the ability to seamlessly transfer and organize data is crucial. However, this process can be complex, time-consuming, and prone to errors when done manually.
AWS DataSync offers a powerful solution to address this challenge. It is a secure service that automates and accelerates data transfers. When combined with Infrastructure as Code (IaC) tools such as Terraform by HashiCorp, organizations can automate infrastructure provisioning and data transfer tasks while ensuring consistency in ML workflows and reducing human error. This approach enables businesses to streamline their data operations and maintain reliable environments through version control, making it valuable for any organization dealing with large-scale data transfers and management.
In this post, we explore how to combine DataSync with Terraform to streamline data transfers and migrations. Although the solution is applicable across various industry verticals, we focus on a practical use case for financial institutions. This scenario involves consolidating datasets for ML model development, such as Common Crawl's news dataset and US SEC filings. We demonstrate how to automate DataSync configuration using Terraform, implement cross-account transfer best practices, organize datasets effectively for ML workflows, and use automation for improved data management and ML initiatives.
DataSync overview
DataSync is a service that streamlines data migration and securely transfers file or object data between storage services, whether on premises, in other clouds, or in Amazon Web Services (AWS). It automates data movement, handles scheduling, and verifies data integrity while supporting use cases ranging from cloud migration to disaster recovery.
Key terminology:
- Location: An endpoint that specifies the source or destination for data transfer operations. Locations can be on-premises storage systems (NFS, SMB, HDFS), self-managed object storage, other clouds, or AWS storage services (Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx). This flexibility allows for diverse data transfer scenarios across different platforms.
- Task: The configuration that controls data transfer operations. Tasks specify the source and destination locations, along with settings for scheduling, filtering, and data verification. These parameters allow organizations to customize and automate their data transfer requirements effectively.
- Task execution: An individual run of a DataSync transfer task. A task execution moves through several phases, during which DataSync prepares, transfers, and verifies your data.
Solution overview
This solution uses the Terraform AWS DataSync Module to automate data transfers across AWS accounts. The module provides end-to-end examples for both Amazon S3 and Amazon EFS transfers, and this post focuses on S3-to-S3 cross-account scenarios. Through Terraform, we create and configure DataSync locations, tasks, AWS Identity and Access Management (IAM) roles, and S3 buckets with AWS Key Management Service (AWS KMS) encryption, keeping your data transfer infrastructure both secure and automated.
To illustrate this solution, we use two data sources from the financial sector: SEC filings, which provide structured financial data and compliance documents from public companies, and Common Crawl’s news dataset, which offers comprehensive global news articles. Automating the consolidation of these datasets from separate S3 buckets into a centralized repository with scheduled updates allows organizations to focus on deriving value from their data rather than managing complex transfer configurations and security requirements.
The following figure shows the architecture overview for this solution.
Figure 1: Architecture overview of organizing ML datasets with AWS DataSync
Prerequisites
The following prerequisites are necessary for completing this solution:
- An AWS account with permissions to create IAM resources
- An IAM or federated user in your AWS account with the permissions to create and administer the resources used in this solution
- See changing permissions for an IAM user for guidance on how to set up IAM permissions and define permissions boundaries
- Terraform version ≥ v1.0.7
Solution walkthrough
There are three AWS accounts involved in storing and moving the datasets:
- AWS Account A: Contains a selected subset of the Common Crawl dataset in an S3 bucket
- AWS Account B: Contains selected SEC documents in an S3 bucket
- AWS Account C: Destination for the datasets organized under specific prefixes
Transferring the Common Crawl dataset
In this section, we work with the Common Crawl news dataset (CC-News), which is publicly available in the commoncrawl S3 bucket under the prefix crawl-data/CC-NEWS/. You can list the files in the dataset using the AWS Command Line Interface (AWS CLI).
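For example, assuming your AWS CLI credentials allow read access to the public commoncrawl bucket, a listing command such as aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/ returns the year-based prefixes under the dataset.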
The dataset is organized into year and month sub-directories. For demonstration purposes, we have copied a small subset of files from the commoncrawl public S3 bucket to a private bucket called test-datasync-commoncrawl. Our goal is to efficiently transfer these files to a data preparation S3 bucket in Account C called pre-training-dataset-dest-bucket, in the following path: /CC-News/2016/08.
Step 1. Clone the Terraform DataSync module repository
Clone the module repository using the git clone command as shown in the following example:
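Assuming the module's source is the public terraform-aws-datasync repository in the aws-ia GitHub organization (the repository location is an assumption here; use the link from the module's documentation if it differs), the clone command looks like git clone https://github.com/aws-ia/terraform-aws-datasync.git.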
This repository contains the following directory structure:
For the ML data organization scenario, we call the datasync-locations/ and datasync-task/ modules from the examples/s3-to-s3-cross-account/main.tf. Change into the preceding directory using the following command:
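Assuming the repository was cloned into a local directory named terraform-aws-datasync, the command is cd terraform-aws-datasync/examples/s3-to-s3-cross-account.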
Step 2. Configure the Terraform AWS provider
The s3-to-s3-cross-account/provider.tf file uses default AWS CLI profiles named source-account and destination-account, which can be modified in variables.tf. Although there are a variety of ways to pass AWS credentials to Terraform, for this example we use temporary credentials vended by AWS IAM Identity Center, configured with the following steps (a sketch of the resulting provider configuration follows the list):
1. Create an IAM Identity Center user with access to both the source and destination accounts.
2. Configure the source account: run aws configure sso, choose Account A, and set the profile name to source-account.
3. Configure the destination account: repeat for Account C, and set the profile name to destination-account.
4. Set the default profile: run export AWS_DEFAULT_PROFILE=source-account (needed when an explicit AWS provider profile is not specified in Terraform).
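As a rough illustration of how those two profiles can map to Terraform providers, the following sketch defines a default provider for the source account and an aliased provider for the destination account. The alias name, region, and layout are assumptions for this sketch; the provider.tf and variables.tf files in the example are authoritative.

```hcl
# Illustrative provider configuration only; the example's actual provider.tf may differ.
provider "aws" {
  profile = "source-account" # Account A, used by default for DataSync resources
  region  = "us-east-1"      # region assumed for illustration
}

provider "aws" {
  alias   = "destination"
  profile = "destination-account" # Account C, used for destination-side resources such as the bucket policy
  region  = "us-east-1"
}
```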
Step 3. Set up values for Terraform variables
First, we assign appropriate values to the input variables needed by each module. The README.md file for each module provides a description of all required and optional Terraform variables.
3.1 Call the DataSync Location module
The following code snippets from the main.tf file show the child module blocks and example input variables for the CC-News dataset. The first snippet shows the source S3 location.
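Because the module's exact input variable names are documented in its README, the snippet below is a minimal, resource-level sketch of what the source S3 location amounts to, rather than the module block itself. The resource name, variable name, and subdirectory are assumptions for illustration.

```hcl
# Minimal sketch of the source S3 location (equivalent to what the
# datasync-locations module creates). Names here are illustrative.
variable "source_datasync_role_arn" {
  description = "IAM role that grants DataSync read access to the source bucket (the module can create this role when create_role = true)"
  type        = string
}

resource "aws_datasync_location_s3" "cc_news_source" {
  s3_bucket_arn = "arn:aws:s3:::test-datasync-commoncrawl"
  subdirectory  = "/" # read from the bucket root

  s3_config {
    bucket_access_role_arn = var.source_datasync_role_arn
  }
}
```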
The DataSync S3 locations module allows you to create a DataSync IAM role by setting create_role = true. This automatically generated IAM role has the necessary Amazon S3 permissions to allow the DataSync service to access the S3 bucket.
Cross-account Amazon S3 transfers through DataSync need specific permissions to access Amazon S3 in both AWS accounts. You create an IAM role in the source account that DataSync uses for the transfer, and then configure the destination account's S3 bucket policy to grant this source account IAM role permission to copy data into the destination bucket. The following shows the destination S3 location.
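As with the source, the following is a hedged, resource-level sketch of the destination S3 location rather than the module's exact inputs. The subdirectory places the transferred files under /CC-News/2016/08 in the Account C bucket; the variable and resource names are assumptions.

```hcl
# Minimal sketch of the destination S3 location. For a cross-account transfer,
# this location is created in the source account, and the role below is a
# source account role that the Account C bucket policy must allow.
variable "destination_datasync_role_arn" {
  description = "IAM role in the source account that is allowed to write to the destination bucket in Account C"
  type        = string
}

resource "aws_datasync_location_s3" "pre_training_destination" {
  s3_bucket_arn = "arn:aws:s3:::pre-training-dataset-dest-bucket"
  subdirectory  = "/CC-News/2016/08"

  s3_config {
    bucket_access_role_arn = var.destination_datasync_role_arn
  }
}
```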
The DataSync location and task modules are generic and do not contain any cross-account provider configuration. Therefore, the IAM role that gives DataSync the permissions to transfer data to the destination bucket in Account C must be created outside of the module and passed in as a parameter when configuring the destination S3 location.
By default, create_role is set to false for the destination S3 location because the IAM role is created outside of the DataSync locations module.
The depends_on meta-argument makes sure that Terraform creates the destination DataSync location only after the destination account S3 bucket policy has been updated to allow the source account IAM role to transfer data into the destination account bucket.
3.2 Call the DataSync task module
DataSync tasks need two locations configured: a source and a destination. The Amazon Resource Names (ARNs) of these locations are then used to create the DataSync task. The DataSync task module triggers the task execution based on the schedule defined by the schedule_expression attribute. The following example shows an hourly schedule that starts automatically upon task creation and then repeats every hour. For more information, see task options in DataSync and the DataSync Terraform arguments and attributes.
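The task module ultimately drives the aws_datasync_task resource, so the following sketch shows what an hourly schedule and an include filter look like at that level. The task name, filter value, and variable-based location ARNs are assumptions for illustration.

```hcl
# Illustrative sketch of a scheduled DataSync task with an include filter.
variable "source_location_arn" { type = string }
variable "destination_location_arn" { type = string }

resource "aws_datasync_task" "cc_news_transfer" {
  name                     = "cc-news-hourly-transfer"
  source_location_arn      = var.source_location_arn
  destination_location_arn = var.destination_location_arn

  schedule {
    schedule_expression = "rate(1 hour)" # starts on task creation, then repeats hourly
  }

  includes {
    filter_type = "SIMPLE_PATTERN"
    value       = "/2016/08/*" # only read this folder from the source location
  }
}
```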
Task filtering can be used to limit reading from a specific set of folders or files on the source location. This is useful if you want to include multiple folders from a top-level export path or further narrow a dataset within the specified source location path. Using an include filter or exclude filter allows you to specify unique folder paths for each DataSync task, then run those tasks in parallel.
When you've configured the necessary module input variables, the next step is to assign values to any Terraform variables in the root module that don't have default values. Using a .tfvars file provides a direct and common method for assigning variables in Terraform. We've provided a terraform.auto.tfvars.example file in the module for reference. Rename this file to terraform.auto.tfvars and then customize the variable values using your preferred text editor.
The variables configured in the terraform.auto.tfvars file are passed into the module.
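As a simple illustration of the .tfvars mechanism (the variable names below are placeholders; use the names defined in the example's variables.tf and terraform.auto.tfvars.example), a terraform.auto.tfvars file is just a set of assignments:

```hcl
# terraform.auto.tfvars — illustrative placeholder variable names only.
source_bucket_name      = "test-datasync-commoncrawl"
destination_bucket_name = "pre-training-dataset-dest-bucket"
```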
Step 4. Start the deployment
Before you can start a deployment, configure the AWS CLI credentials for Terraform using the user that was created as part of the prerequisites.
- Run the command terraform init to download the modules and initialize the directory.
- Run terraform plan and examine the outputs.
- Run terraform apply and allow the apply to complete.
If the terraform apply is successful, the output should appear as follows.
To view and examine the resources created by Terraform, you can use the terraform state list and terraform state show commands.
Step 5. Review DataSync task and data transfer in the AWS Management Console (Optional)
Log in to the AWS Management Console and navigate to the AWS DataSync service. In the DataSync console, locate the Data transfer section and choose Tasks. Here you can find the task created by Terraform, which displays the source and destination locations along with all associated task configuration settings.
The following screenshot shows a successful task execution that started automatically as per the schedule defined in the datasync-task module.
Figure 2: AWS Management Console screenshot of the source and destination locations from the DataSync task created by Terraform
The following screenshot shows the successful task execution along with the synchronized files. To minimize costs during testing, choose only a subset of data for synchronization.
Figure 3: AWS Management Console screenshot showing Common Crawl files transferred by DataSync
Transferring the SEC filings dataset
In this section, you configure a data transfer task from AWS Account B to AWS Account C, as shown in Figure 1. We assume that you have downloaded one or more SEC filing documents and uploaded them to a source S3 bucket in Account B. SEC filings are available online through the SEC's EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database. Our goal is to organize the company-specific files under a prefix based on the company's ticker symbol, such as /SEC/AMNZ/10-K2024.pdf.
Configure variables for the DataSync location module as shown in the following code snippet, then set up the provider configuration and trigger the deployment following the same guidance provided in Steps 2 through 4 in the Transferring the Common Crawl dataset section of this post. The following is an example configuration for the SEC filings dataset.
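The SEC source bucket is not named in this walkthrough, so the sketch below uses a hypothetical bucket name and role variables, again at the resource level rather than the module's exact inputs. The key point is the destination subdirectory, which organizes the filings under the /SEC prefix in Account C.

```hcl
# Illustrative sketch for the SEC filings transfer; names are assumptions.
variable "sec_source_role_arn" {
  description = "IAM role granting DataSync read access to the SEC source bucket in Account B"
  type        = string
}

variable "sec_destination_role_arn" {
  description = "IAM role (in the account running DataSync) allowed to write to the destination bucket in Account C"
  type        = string
}

resource "aws_datasync_location_s3" "sec_source" {
  s3_bucket_arn = "arn:aws:s3:::sec-filings-source-bucket" # hypothetical name for the Account B bucket
  subdirectory  = "/"

  s3_config {
    bucket_access_role_arn = var.sec_source_role_arn
  }
}

resource "aws_datasync_location_s3" "sec_destination" {
  s3_bucket_arn = "arn:aws:s3:::pre-training-dataset-dest-bucket"
  subdirectory  = "/SEC" # files keep their source-relative paths, so ticker folders land under /SEC/

  s3_config {
    bucket_access_role_arn = var.sec_destination_role_arn
  }
}
```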
The following screenshot shows a successful task execution that started automatically as per the schedule defined in the datasync-task module.
Figure 4: Source and destination locations from the DataSync task for transferring SEC filings
The following screenshot shows the successful task execution along with the synchronized SEC filing files. To minimize costs during testing, only a subset of data is chosen for synchronization purposes.
Figure 5: AWS Management Console screenshot showing SEC filing data transferred by DataSync
Further considerations
Supported locations: The preceding sections provided a walkthrough of using DataSync to copy and organize ML datasets between S3 buckets in different AWS accounts. DataSync supports data transfer across a range of AWS and cross-cloud storage locations, such as NFS, SMB, HDFS, and object storage. The terraform-aws-datasync module contains examples for syncing data from Amazon EFS to Amazon S3, and from S3 to S3 for same-account use cases.
Monitoring: Integration with Amazon CloudWatch provides comprehensive monitoring and logging: you can monitor your AWS DataSync transfers using CloudWatch Logs. More information on logging can be found in the DataSync User Guide.
The following figure shows the task logging details for the task created:
Figure 6: DataSync task monitoring options configured by Terraform
In this example, we've configured the log level to Log all transferred objects and files, which means DataSync creates detailed log records for each file or object transfer.
Figure 7: Events emitted by DataSync task in CloudWatch Logs group
DataSync helps ensure data integrity through checksum verification during transfers, as shown in the following figure. This example uses the ONLY_FILES_TRANSFERRED verify mode, where DataSync calculates checksums for transferred data and metadata at the source and then compares them to checksums calculated at the destination after the transfer. Additional verification can optionally be configured to run when the transfer completes.
Figure 8: Events emitted by DataSync task in CloudWatch Logs group for verification
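To tie these console settings back to Terraform, the following sketch shows the task arguments that control per-object logging and checksum verification. The log group name, task name, and variable-based location ARNs are assumptions for illustration.

```hcl
# Illustrative sketch of logging and verification options on a DataSync task.
variable "source_location_arn" { type = string }
variable "destination_location_arn" { type = string }

resource "aws_cloudwatch_log_group" "datasync" {
  name = "/aws/datasync" # assumed log group name
}

resource "aws_datasync_task" "monitored_transfer" {
  name                     = "monitored-transfer"
  source_location_arn      = var.source_location_arn
  destination_location_arn = var.destination_location_arn
  cloudwatch_log_group_arn = aws_cloudwatch_log_group.datasync.arn

  options {
    log_level   = "TRANSFER"               # "Log all transferred objects and files"
    verify_mode = "ONLY_FILES_TRANSFERRED" # checksum-verify transferred files and metadata
  }
}
```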
To enhance task reporting capabilities, you can set up task reports during DataSync task creation by implementing a task_report_configuration in the Terraform resource. For more comprehensive information about task reports, please refer to our documentation.
Cleaning up
To delete all the resources associated with this example, configure AWS CLI credentials as in Step 2 and change to the examples/s3-to-s3-cross-account/ directory. Run the terraform destroy command to delete all the resources that Terraform previously created. Any resources created outside of Terraform must be deleted manually, and any S3 buckets must be empty before Terraform can delete them.
Conclusion
This blog post demonstrated how to use HashiCorp’s Terraform to automate AWS DataSync deployment. We reviewed a scenario for organizing ML datasets with DataSync in preparation for downstream ML tasks such as Exploratory Data Analysis (EDA) and data cleaning, followed by model training or fine-tuning. Although this example focuses on an S3 to S3 configuration, the DataSync Terraform Module can be adapted for more location types.
Using IaC with DataSync allows for an automated and streamlined approach to complex data transfers, minimizing manual intervention and potential misconfigurations. Ultimately, organizations benefit from accelerated data lake development and ML model creation. To learn more about AWS DataSync and how the preceding datasets can be applied to fine-tuning a large language model (LLM), see the following resources: