AWS Storage Blog
Automate data transfers and migrations with AWS DataSync and Terraform
In today’s data-driven world, organizations face the challenge of efficiently managing and consolidating vast amounts of information from diverse sources. Whether it’s for analytics, machine learning (ML), or other business-critical applications, the ability to seamlessly transfer and organize data is crucial. However, this process can be complex, time-consuming, and prone to errors when done manually.
AWS DataSync offers a powerful solution to address this challenge. It is a secure service that automates and accelerates data transfers. When combined with Infrastructure as Code (IaC) tools such as Terraform by HashiCorp, organizations can automate infrastructure provisioning and data transfer tasks while ensuring consistency in ML workflows and reducing human error. This approach enables businesses to streamline their data operations and maintain reliable environments through version control, making it valuable for any organization dealing with large-scale data transfers and management.
In this post, we explore how to combine DataSync with Terraform to streamline data transfers and migrations. Although the solution is applicable across various industry verticals, we focus on a practical use case for financial institutions. This scenario involves consolidating datasets for ML model development, such as Common Crawl's news dataset and US SEC filings. We demonstrate how to automate DataSync configuration using Terraform, implement cross-account transfer best practices, organize datasets effectively for ML workflows, and use automation for improved data management and ML initiatives.
DataSync overview
DataSync is a service that streamlines data migration and securely transfers file or object data between storage services, whether on premises, in other clouds, or in Amazon Web Services (AWS). It automates data movement, handles scheduling, and verifies data integrity while supporting use cases ranging from cloud migration to disaster recovery.
Key terminology:
- Location: An endpoint that specifies the source or destination for data transfer operations. Locations can be on-premises storage systems (NFS, SMB, HDFS), self-managed object storage, other clouds, or AWS storage services (Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx). This flexibility allows for diverse data transfer scenarios across different platforms.
- Task: The configuration that controls data transfer operations. Tasks specify the source and destination locations, along with settings for scheduling, filtering, and data verification. These parameters allow organizations to customize and automate their data transfer requirements effectively.
- Task execution: An individual run of a DataSync transfer task. A task execution moves through several phases, during which DataSync prepares, transfers, and verifies your data.
Solution overview
This solution uses the Terraform AWS DataSync Module to automate data transfers across AWS accounts. The module provides end-to-end examples for both Amazon S3 and Amazon EFS transfers, and this post focuses on S3-to-S3 cross-account scenarios. Through Terraform, we create and configure DataSync locations, tasks, AWS Identity and Access Management (IAM) roles, and S3 buckets with AWS Key Management Service (AWS KMS) encryption, keeping your data transfer infrastructure both secure and automated.
To illustrate this solution, we use two data sources from the financial sector: SEC filings, which provide structured financial data and compliance documents from public companies, and Common Crawl’s news dataset, which offers comprehensive global news articles. Automating the consolidation of these datasets from separate S3 buckets into a centralized repository with scheduled updates allows organizations to focus on deriving value from their data rather than managing complex transfer configurations and security requirements.
The following figure shows the architecture overview for this solution.
Figure 1: Architecture overview of organizing ML datasets with AWS DataSync
Prerequisites
The following prerequisites are necessary for completing this solution:
- An AWS account with permissions to create IAM resources
- An IAM or federated user in your AWS account with the permissions to create and administer the resources used in this solution
- See changing permissions for an IAM user for guidance on how to set up IAM permissions and define permissions boundaries
- Terraform version ≥ v1.0.7
Solution walkthrough
There are three AWS accounts involved in storing and moving the datasets:
- AWS Account A: Contains a selected subset of the Common Crawl dataset in an S3 bucket
- AWS Account B: Contains selected SEC documents in an S3 bucket
- AWS Account C: Destination for the datasets organized under specific prefixes
Transferring the Common Crawl dataset
In this section, we work with the Common Crawl news dataset (CC-News), which is publicly available in the commoncrawl S3 bucket under the prefix crawl-data/CC-NEWS/. You can list the files in the dataset using the AWS Command Line Interface (AWS CLI).
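For example, assuming your AWS CLI credentials allow read access to the public commoncrawl bucket, a listing command such as aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/ returns the year-based prefixes under the dataset.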
The dataset is organized into year and month sub-directories. For demonstration purposes, we have copied a small subset of files from the commoncrawl public S3 bucket to a private bucket called test-datasync-commoncrawl. Our goal is to efficiently transfer these files to a data preparation S3 bucket in Account C called pre-training-dataset-dest-bucket, in the following path: /CC-News/2016/08.
Step 1. Clone the Terraform DataSync module repository
Clone the module repository using the git clone command as shown in the following example:
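Assuming the module's source is the public terraform-aws-datasync repository in the aws-ia GitHub organization (the repository location is an assumption here; use the link from the module's documentation if it differs), the clone command looks like git clone https://github.com/aws-ia/terraform-aws-datasync.git.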
This repository contains the following directory structure:
For the ML data organization scenario, we call the datasync-locations/ and datasync-task/ modules from the examples/s3-to-s3-cross-account/main.tf. Change into the preceding directory using the following command:
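Assuming the repository was cloned into a local directory named terraform-aws-datasync, the command is cd terraform-aws-datasync/examples/s3-to-s3-cross-account.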
Step 2. Configure the Terraform AWS provider
The s3-to-s3-cross-account/provider.tf file uses default AWS CLI profiles named source-account and destination-account, which can be modified in variables.tf. Although there are a variety of ways to pass AWS credentials to Terraform, for this example we use temporary credentials vended by AWS IAM Identity Center, configured with the following steps (a sketch of the resulting provider configuration follows the list):
1. Create an IAM Identity Center user with access to both the source and destination accounts.
2. Configure the source account: run aws configure sso, choose Account A, and set the profile name to source-account.
3. Configure the destination account: repeat for Account C, and set the profile name to destination-account.
4. Set the default profile: run export AWS_DEFAULT_PROFILE=source-account (needed when an explicit AWS provider profile is not specified in Terraform).
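As a rough illustration of how those two profiles can map to Terraform providers, the following sketch defines a default provider for the source account and an aliased provider for the destination account. The alias name, region, and layout are assumptions for this sketch; the provider.tf and variables.tf files in the example are authoritative.

```hcl
# Illustrative provider configuration only; the example's actual provider.tf may differ.
provider "aws" {
  profile = "source-account" # Account A, used by default for DataSync resources
  region  = "us-east-1"      # region assumed for illustration
}

provider "aws" {
  alias   = "destination"
  profile = "destination-account" # Account C, used for destination-side resources such as the bucket policy
  region  = "us-east-1"
}
```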
Step 3. Set up values for Terraform variables
First, we assign appropriate values to the input variables needed by each module. The README.md file for each module provides a description of all required and optional Terraform variables.
3.1 Call the DataSync Location module
The following code snippets from the main.tf file show the child module blocks and example input variables for the CC-News dataset. The first snippet shows the source S3 location.
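Because the module's exact input variable names are documented in its README, the snippet below is a minimal, resource-level sketch of what the source S3 location amounts to, rather than the module block itself. The resource name, variable name, and subdirectory are assumptions for illustration.

```hcl
# Minimal sketch of the source S3 location (equivalent to what the
# datasync-locations module creates). Names here are illustrative.
variable "source_datasync_role_arn" {
  description = "IAM role that grants DataSync read access to the source bucket (the module can create this role when create_role = true)"
  type        = string
}

resource "aws_datasync_location_s3" "cc_news_source" {
  s3_bucket_arn = "arn:aws:s3:::test-datasync-commoncrawl"
  subdirectory  = "/" # read from the bucket root

  s3_config {
    bucket_access_role_arn = var.source_datasync_role_arn
  }
}
```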
The DataSync S3 locations module allows you to create a DataSync IAM role by setting create_role = true. This automatically generated IAM role has the necessary Amazon S3 permissions to allow the DataSync service to access the S3 bucket.
Cross-account Amazon S3 transfers through DataSync need specific permissions to access Amazon S3 in both AWS accounts. You create an IAM role in the source account that DataSync uses for the transfer, and then configure the destination account's S3 bucket policy to grant this source account IAM role permission to copy data into the destination bucket. The following shows the destination S3 location.
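As with the source, the following is a hedged, resource-level sketch of the destination S3 location rather than the module's exact inputs. The subdirectory places the transferred files under /CC-News/2016/08 in the Account C bucket; the variable and resource names are assumptions.

```hcl
# Minimal sketch of the destination S3 location. For a cross-account transfer,
# this location is created in the source account, and the role below is a
# source account role that the Account C bucket policy must allow.
variable "destination_datasync_role_arn" {
  description = "IAM role in the source account that is allowed to write to the destination bucket in Account C"
  type        = string
}

resource "aws_datasync_location_s3" "pre_training_destination" {
  s3_bucket_arn = "arn:aws:s3:::pre-training-dataset-dest-bucket"
  subdirectory  = "/CC-News/2016/08"

  s3_config {
    bucket_access_role_arn = var.destination_datasync_role_arn
  }
}
```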
The DataSync location and task modules are generic and do not contain any cross-account provider configuration. Therefore, the IAM role that gives DataSync the permissions to transfer data to the destination bucket in Account C must be created outside of the module and passed in as a parameter when configuring the destination S3 location.
By default, create_role is set to false for the destination S3 location because the IAM role is created outside of the DataSync locations module.
The depends_on meta-argument makes sure that Terraform creates the destination DataSync location only after the destination account S3 bucket policy has been updated to allow the source account IAM role to transfer data into the destination account bucket.
3.2 Call the DataSync task module
DataSync tasks need two locations configured: a source and a destination. The Amazon Resource Names (ARNs) of these locations are then used to create the DataSync task. The DataSync task module triggers the task execution based on the schedule defined by the schedule_expression attribute. The following example shows an hourly schedule that starts automatically upon task creation and then repeats every hour. For more information, see task options in DataSync and the DataSync Terraform arguments and attributes.
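The task module ultimately drives the aws_datasync_task resource, so the following sketch shows what an hourly schedule and an include filter look like at that level. The task name, filter value, and variable-based location ARNs are assumptions for illustration.

```hcl
# Illustrative sketch of a scheduled DataSync task with an include filter.
variable "source_location_arn" { type = string }
variable "destination_location_arn" { type = string }

resource "aws_datasync_task" "cc_news_transfer" {
  name                     = "cc-news-hourly-transfer"
  source_location_arn      = var.source_location_arn
  destination_location_arn = var.destination_location_arn

  schedule {
    schedule_expression = "rate(1 hour)" # starts on task creation, then repeats hourly
  }

  includes {
    filter_type = "SIMPLE_PATTERN"
    value       = "/2016/08/*" # only read this folder from the source location
  }
}
```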
Task filtering can be used to limit reading from a specific set of folders or files on the source location. This is useful if you want to include multiple folders from a top-level export path or further narrow a dataset within the specified source location path. Using an include filter or exclude filter allows you to specify unique folder paths for each DataSync task, then run those tasks in parallel.
When you've configured the necessary module input variables, the next step is to assign values to any Terraform variables in the root module that don't have default values. Using a .tfvars file provides a direct and common method for assigning variables in Terraform. We've provided a terraform.auto.tfvars.example file in the module for reference. Rename this file to terraform.auto.tfvars and then customize the variable values using your preferred text editor.
The variables configured in the terraform.auto.tfvars file are passed into the module.
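As a simple illustration of the .tfvars mechanism (the variable names below are placeholders; use the names defined in the example's variables.tf and terraform.auto.tfvars.example), a terraform.auto.tfvars file is just a set of assignments:

```hcl
# terraform.auto.tfvars — illustrative placeholder variable names only.
source_bucket_name      = "test-datasync-commoncrawl"
destination_bucket_name = "pre-training-dataset-dest-bucket"
```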
Step 4. Start the deployment
Before you can start a deployment, configure the AWS CLI credentials for Terraform using the user that was created as part of the prerequisites.
- Run the command terraform init to download the modules and initialize the directory.
- Run terraform plan and examine the outputs.
- Run terraform apply and allow the apply to complete.
If the terraform apply is successful, the output should appear as follows.
To view and examine the resources created by Terraform, you can use the terraform state list and terraform state show commands.
Step 5. Review DataSync task and data transfer in the AWS Management Console (Optional)
Log in to the AWS Management Console and navigate to the AWS DataSync service. In the DataSync console, locate the Data transfer section and choose Tasks. Here you can find the task created by Terraform, which displays the source and destination locations along with all associated task configuration settings.
The following screenshot shows a successful task execution that started automatically as per the schedule defined in the datasync-task module.
Figure 2: AWS Management Console screenshot of the source and destination locations from the DataSync task created by Terraform
The following screenshot shows the successful task execution along with the synchronized files. To minimize costs during testing, choose only a subset of data for synchronization.
Figure 3: AWS Management Console screenshot showing Common Crawl files transferred by DataSync
Transferring the SEC filings dataset
In this section, you configure a data transfer task from AWS Account B to AWS Account C, as shown in Figure 1. We assume that you have downloaded one or more SEC filing documents and uploaded them to a source S3 bucket in Account B. SEC filings are available online through the SEC's EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database. Our goal is to organize the company-specific files under a prefix based on the company's ticker symbol, such as /SEC/AMNZ/10-K2024.pdf.
Configure variables for the DataSync location module as shown in the following code snippet, then set up the provider configuration and trigger the deployment following the same guidance provided in Steps 2 through 4 in the Transferring the Common Crawl dataset section of this post. The following is an example configuration for the SEC filings dataset.
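The SEC source bucket is not named in this walkthrough, so the sketch below uses a hypothetical bucket name and role variables, again at the resource level rather than the module's exact inputs. The key point is the destination subdirectory, which organizes the filings under the /SEC prefix in Account C.

```hcl
# Illustrative sketch for the SEC filings transfer; names are assumptions.
variable "sec_source_role_arn" {
  description = "IAM role granting DataSync read access to the SEC source bucket in Account B"
  type        = string
}

variable "sec_destination_role_arn" {
  description = "IAM role (in the account running DataSync) allowed to write to the destination bucket in Account C"
  type        = string
}

resource "aws_datasync_location_s3" "sec_source" {
  s3_bucket_arn = "arn:aws:s3:::sec-filings-source-bucket" # hypothetical name for the Account B bucket
  subdirectory  = "/"

  s3_config {
    bucket_access_role_arn = var.sec_source_role_arn
  }
}

resource "aws_datasync_location_s3" "sec_destination" {
  s3_bucket_arn = "arn:aws:s3:::pre-training-dataset-dest-bucket"
  subdirectory  = "/SEC" # files keep their source-relative paths, so ticker folders land under /SEC/

  s3_config {
    bucket_access_role_arn = var.sec_destination_role_arn
  }
}
```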
The following screenshot shows a successful task execution that started automatically as per the schedule defined in the datasync-task module.
Figure 4: Source and destination locations from the DataSync task for transferring SEC filings
The following screenshot shows the successful task execution along with the synchronized SEC filing files. To minimize costs during testing, only a subset of data is chosen for synchronization purposes.
Figure 5: AWS Management Console screenshot showing SEC filing data transferred by DataSync
Further considerations
Supported locations: The preceding sections provided a walkthrough of using DataSync to copy and organize ML datasets between S3 buckets in different AWS accounts. DataSync supports data transfer across a range of AWS and cross-cloud storage locations, such as NFS, SMB, HDFS, and object storage. The terraform-aws-datasync module contains examples for syncing data from Amazon EFS to Amazon S3, and from S3 to S3 for same-account use cases.
Monitoring: Integration with Amazon CloudWatch provides comprehensive monitoring and logging: you can monitor your AWS DataSync transfers using CloudWatch Logs. More information on logging can be found in the DataSync User Guide.
The following figure shows the task logging details for the task created:
Figure 6: DataSync task monitoring options configured by Terraform
In this example, we've configured the log level to Log all transferred objects and files, which means DataSync creates detailed log records for each file or object transfer.
Figure 7: Events emitted by DataSync task in CloudWatch Logs group
DataSync helps ensure data integrity through checksum verification during transfers, as shown in the following figure. This example uses the ONLY_FILES_TRANSFERRED verify mode, where DataSync calculates checksums for transferred data and metadata at the source and then compares them to checksums calculated at the destination after the transfer. Additional verification can optionally be configured to run when the transfer completes.
Figure 8: Events emitted by DataSync task in CloudWatch Logs group for verification
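To tie these console settings back to Terraform, the following sketch shows the task arguments that control per-object logging and checksum verification. The log group name, task name, and variable-based location ARNs are assumptions for illustration.

```hcl
# Illustrative sketch of logging and verification options on a DataSync task.
variable "source_location_arn" { type = string }
variable "destination_location_arn" { type = string }

resource "aws_cloudwatch_log_group" "datasync" {
  name = "/aws/datasync" # assumed log group name
}

resource "aws_datasync_task" "monitored_transfer" {
  name                     = "monitored-transfer"
  source_location_arn      = var.source_location_arn
  destination_location_arn = var.destination_location_arn
  cloudwatch_log_group_arn = aws_cloudwatch_log_group.datasync.arn

  options {
    log_level   = "TRANSFER"               # "Log all transferred objects and files"
    verify_mode = "ONLY_FILES_TRANSFERRED" # checksum-verify transferred files and metadata
  }
}
```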
To enhance task reporting capabilities, you can set up task reports during DataSync task creation by implementing a task_report_configuration in the Terraform resource. For more comprehensive information about task reports, please refer to our documentation.
Cleaning up
To delete all the resources associated with this example, configure AWS CLI credentials as in Step 2 and change to the examples/s3-to-s3-cross-account/ directory. Run the terraform destroy command to delete all the resources that Terraform previously created. Any resources created outside of Terraform must be deleted manually, and any S3 buckets must be empty before Terraform can delete them.
Conclusion
This blog post demonstrated how to use HashiCorp’s Terraform to automate AWS DataSync deployment. We reviewed a scenario for organizing ML datasets with DataSync in preparation for downstream ML tasks such as Exploratory Data Analysis (EDA) and data cleaning, followed by model training or fine-tuning. Although this example focuses on an S3 to S3 configuration, the DataSync Terraform Module can be adapted for more location types.
Using IaC with DataSync allows for an automated and streamlined approach to complex data transfers, minimizing manual intervention and potential misconfigurations. Ultimately, organizations benefit from accelerated data lake development and ML model creation. To learn more about AWS DataSync and how the preceding datasets can be applied to fine-tuning a large language model (LLM), see the following resources: