AWS Storage Blog

Automate online data transfers with AWS DataSync and AWS CloudFormation

Many of our customers use AWS DataSync to quickly and securely transfer data between on-premises storage and AWS Storage services, as well as between AWS Storage services. They use DataSync to migrate data into AWS, archive cold data to reduce their on-premises storage footprint, replicate data for business continuity, or transfer data for in-cloud processing. DataSync provides built-in capabilities such as scheduling, filtering, and end-to-end data verification that give customers control over their migrations. The service simplifies data transfer workflows by automating and managing many of the tasks required to move large amounts of data with open source or command line tools.

Managing the deployment of IT resources in a predictable manner is often a struggle for customers, particularly in organizations where the number of resources in use continues to grow over time. As customers migrate more of their workloads to AWS, they quickly discover that automating the deployment and management of cloud resources in a consistent, reliable way is key to achieving greater levels of scalability. AWS CloudFormation provides a common language for customers to model and provision AWS and third-party application resources in their cloud environments, providing a single source of truth for their deployments. CloudFormation uses templates, which are YAML or JSON-formatted text files, to describe resources and guide how they should be deployed. The CloudFormation service then automatically deploys the resources specified in template files.

We recently launched CloudFormation support for DataSync, giving customers more ways to automate the deployment of their DataSync resources such as agents, locations, and tasks. In this blog, I provide you with a high-level overview of how DataSync works, and I elaborate on using CloudFormation with DataSync by walking through a few common use cases.

How AWS DataSync works

Use AWS DataSync to quickly and securely transfer data from on-premises storage to the cloud and between services in the cloud

AWS DataSync allows you to copy data between your on-premises storage and AWS Storage services using one or more agents that are deployed in your on-premises VMware, Hyper-V, or KVM environments. DataSync agents can connect to file servers that support the NFS or SMB protocol, in addition to object storage systems that support a set of S3-compatible API operations. DataSync natively integrates with AWS Storage services, and can read from and write to Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server (Amazon FSx). DataSync can write directly to any Amazon S3 storage class, including Amazon S3 Glacier, S3 Glacier Deep Archive, and S3 on Outposts. DataSync can also transfer data to and from AWS Snowcone devices using a built-in DataSync agent.

Using AWS CloudFormation with AWS DataSync

CloudFormation supports the following DataSync resources:

  • Agents
  • Locations (NFS, SMB, object storage, Amazon S3, Amazon EFS, Amazon FSx)
  • Tasks

In the next few sections, I’m going to show you how to use CloudFormation templates to create DataSync resources for the following use cases:

  • Migrate data from an on-premises NFS server to an EFS file system.
  • Periodically replicate an FSx for Windows File Server file system to a different AWS Region for business continuity purposes.
  • Copy data from an on-premises Windows Server to S3 for in-cloud processing.

However, before we dive into each of these scenarios, let’s first talk about security groups and DataSync locations.

Security Groups and DataSync Locations

AWS DataSync uses Locations to define access to your storage systems. Some locations, such as Amazon FSx and Amazon EFS file systems, require one or more security groups to be specified when you create a location resource. DataSync uses these security groups to access AWS managed file systems.

For example, let’s say that you have an Amazon EFS file system that you want to use DataSync with. The file system has been configured with a mount target that has a security group attached to it named “sgEFS,” which controls network access to the file system, as shown in the EFS console:

Figure 1 – Network configuration for Amazon EFS file system

You can then create a security group named “sgDataSync” that is used with the DataSync location for Amazon EFS. To enable access to your EFS file system from DataSync, you would add an Inbound rule on the “sgEFS” security group that allows NFS traffic (TCP port 2049) from “sgDataSync,” as shown in the VPC console below:

Figure 2 – Configure Inbound rules on sgEFS security group

DataSync communicates with Amazon EFS via TCP port 2049, and all DataSync traffic to EFS will originate from the DataSync service. If you are using an Amazon FSx location, you would do something similar, but you would use type SMB and port 445 instead.
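
The same rule can be expressed in a CloudFormation template instead of the console. The following is a minimal sketch, not taken from the templates in this post: the parameter names `vpcId` and `efsSecurityGroupId` and the logical IDs are hypothetical, and it assumes that “sgEFS” already exists and is passed in by ID.

```yaml
Parameters:
  # Hypothetical parameters, named here only for illustration
  vpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC that contains the EFS mount target
  efsSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: ID of the existing sgEFS security group

Resources:
  # Security group that DataSync uses when it connects to the file system
  DataSyncSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: sgDataSync
      GroupDescription: Used by DataSync to reach the EFS mount target
      VpcId: !Ref vpcId

  # Inbound rule on sgEFS that allows NFS (TCP 2049) from sgDataSync
  EfsIngressFromDataSync:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref efsSecurityGroupId
      IpProtocol: tcp
      FromPort: 2049
      ToPort: 2049
      SourceSecurityGroupId: !GetAtt DataSyncSecurityGroup.GroupId
```

For an Amazon FSx location, the same pattern should apply with `FromPort` and `ToPort` set to 445.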

DataSync provides several options for network endpoints when connecting your on-premises or in-cloud agents to the DataSync service. Customers can use VPC endpoints so that data transferred between the agent and the DataSync service doesn’t need to traverse the public internet or need public IP addresses, increasing the security of data as it is copied over the network. If you are using a VPC endpoint, you would add an inbound rule on your DataSync security group for port 443 from the IP address of your agent. Note that this is only required when using DataSync agents with VPC endpoints, not for Public or FIPS endpoints.
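
As a rough sketch of the VPC endpoint rule described above, the following fragment adds an inbound rule for port 443 to the DataSync security group. The parameter names `dataSyncSecurityGroupId` and `agentIpAddress` are assumptions made for this example and do not come from the templates in this post.

```yaml
Parameters:
  # Hypothetical parameters, named here only for illustration
  dataSyncSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group attached to the DataSync VPC endpoint
  agentIpAddress:
    Type: String
    Description: IPv4 address of the DataSync agent

Resources:
  # Allow HTTPS (TCP 443) from the single agent address
  AgentHttpsIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref dataSyncSecurityGroupId
      IpProtocol: tcp
      FromPort: 443
      ToPort: 443
      CidrIp: !Sub '${agentIpAddress}/32'
```

Restricting the source to a /32 address keeps the rule scoped to one agent. If you run several agents, a wider CIDR range or one rule per agent may be more practical.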

Now that you know more about using security groups with Amazon FSx and Amazon EFS locations, let’s go through the three use cases I mentioned earlier.

Migrate data from an on-premises NFS server to an Amazon EFS file system

For this use case, I want to migrate data from my on-premises NFS server to an Amazon EFS file system. This enables me to take advantage of scalable storage without the need to provision and manage capacity. I’ve deployed my DataSync agent on-premises, and I’m going to use my internet connection to transmit the data to AWS, using the Public endpoint for the DataSync service in my Region.

I deployed my agent on-premises in my VM environment, locating the agent as close as possible to my NFS server in order to minimize network latency when communicating with my storage. After deploying my agent, I activate it and pass the activation key to my CloudFormation stack when it is created. Note that a DataSync agent only needs to be deployed once, during initial setup.

In my CloudFormation template, I’m going to create the following resources:

  • A DataSync agent to read data from my on-premises storage and efficiently transfer it to AWS.
  • Two DataSync locations – one for NFS, which will be my source location, and one for EFS, which will be my destination location.
  • A security group used by DataSync to access the Amazon EFS file system – this is the common DataSync security group (that is, sgDataSync) mentioned previously.
  • A DataSync task to configure data transfer settings such as scheduling, validation, and logging.

My CloudFormation template assumes that the NFS server, the Amazon EFS file system, and a CloudWatch log group for DataSync logs have already been created. These will be parameters to the template.

The following is a snippet from my template, focusing on the Resources section. You can see that the template creates an Agent, two Locations, and one Task, referencing parameters that are provided when the CloudFormation stack is created. To keep things simple, I only specified a subset of the task options that are available.

Resources:

  OnPremAgent:
    Type: AWS::DataSync::Agent
    Properties:
      ActivationKey: !Ref activationKey
      AgentName: 'OnPrem Agent'

  NfsLocation:
    Type: AWS::DataSync::LocationNFS
    Properties:
      OnPremConfig:
        AgentArns:
          - !Ref OnPremAgent
      ServerHostname: !Ref nfsServer
      Subdirectory: !Ref nfsPath

  EfsLocation:
    Type: AWS::DataSync::LocationEFS
    Properties:
      EfsFilesystemArn:
        !Sub arn:${AWS::Partition}:elasticfilesystem:${AWS::Region}:${AWS::AccountId}:file-system/${efsFS}
      Ec2Config:
        SecurityGroupArns:
          - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:security-group/${securityGroupId}
        SubnetArn: !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:subnet/${efsSubnetId}
      Subdirectory: !Ref efsPath

  NfsToEfsTask:
    Type: AWS::DataSync::Task
    Properties:
      Name: 'Copy NFS to EFS'
      SourceLocationArn: !Ref NfsLocation
      DestinationLocationArn: !Ref EfsLocation
      Options:
        VerifyMode: 'ONLY_FILES_TRANSFERRED'
        OverwriteMode: 'ALWAYS'
        PreserveDeletedFiles: 'PRESERVE'
        LogLevel: 'BASIC'
      CloudWatchLogGroupArn: !Ref logGroupArn

To try out this scenario in your environment, use the full template and deploy it using CloudFormation.
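
The snippet above references several parameters through `!Ref` and `!Sub`. A matching Parameters section could look like the following sketch. The parameter names are taken from the snippet, but the types and descriptions are my assumptions and may differ from the full template.

```yaml
Parameters:
  activationKey:
    Type: String
    NoEcho: true
    Description: Activation key obtained from the deployed DataSync agent
  nfsServer:
    Type: String
    Description: Hostname or IP address of the on-premises NFS server
  nfsPath:
    Type: String
    Description: Exported path on the NFS server to copy from
  efsFS:
    Type: String
    Description: ID of the destination Amazon EFS file system
  efsSubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Subnet that contains an EFS mount target
  securityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group that DataSync uses to reach the file system
  efsPath:
    Type: String
    Description: Directory on the EFS file system to copy into
  logGroupArn:
    Type: String
    Description: ARN of the CloudWatch log group for DataSync logs
```

Setting `NoEcho` on the activation key is a choice made for this sketch. It masks the value in the console and in API responses.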

Periodically replicate an Amazon FSx file system to a different AWS Region for business continuity purposes

For this use case, I want to replicate my primary Amazon FSx file system to another Region so I have a separate, distinct copy that I can fail over to for business continuity. Because I am transferring data between AWS Storage services, there is no need to create an agent – I only need to create my locations and my task. Note that for each file system I must provide a security group to enable DataSync access.

CloudFormation stacks are created separately for each AWS Region. For this use case, I am replicating from one AWS Region to another, so I will need two CloudFormation templates: one for the source Region and one for the destination Region. DataSync uses Amazon CloudWatch to provide monitoring data for tasks. Note that all logs, metrics, and events are generated in the AWS Region where the DataSync task is configured. In this example, because the task will be created in the source Region, that is where the CloudWatch data for my task will be stored.

My template assumes that you have already created two Amazon FSx for Windows File Server file systems (one in each Region) and security groups. The identifiers for each file system and the security group identifiers are provided as parameters to the template.

Note also that user credentials are required for DataSync to connect to an FSx for Windows File Server file system share. For security reasons, it’s recommended not to provide passwords directly as parameters to a CloudFormation template, so I am going to read the user credentials, which include the user name, password, and domain, from AWS Secrets Manager.

The following snippets are from my two templates – one for each Region. The first template is for the destination Region and I use it to create a DataSync location for the destination Amazon FSx file system. I then use the destination location output from the first template in the second template, which is for the source Region and creates the source location in addition to the task.

Resources:

  DestinationFSxWindowsLocation:
    Type: AWS::DataSync::LocationFSxWindows
    Properties:
      Domain: '{{resolve:secretsmanager:DestRegionSecret:SecretString:domain}}'
      FsxFilesystemArn: !Sub arn:${AWS::Partition}:fsx:${AWS::Region}:${AWS::AccountId}:file-system/${fsxwFSid}
      User: '{{resolve:secretsmanager:DestRegionSecret:SecretString:username}}'
      Password: '{{resolve:secretsmanager:DestRegionSecret:SecretString:password}}'
      SecurityGroupArns:
        - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:security-group/${securityGroupId}
      Subdirectory: !Ref fsxwPath
      
Outputs:
  destLocationArn:
    Description: Destination Location ARN
    Value: !Ref DestinationFSxWindowsLocation

After you have deployed the preceding template, you would then use the Destination Location ARN from the output as a parameter for the following template.

Resources:

  SourceFSxWindowsLocation:
    Type: AWS::DataSync::LocationFSxWindows
    Properties:
      Domain: '{{resolve:secretsmanager:SourceRegionSecret:SecretString:domain}}'
      FsxFilesystemArn: !Sub arn:${AWS::Partition}:fsx:${AWS::Region}:${AWS::AccountId}:file-system/${fsxwFSid}
      User: '{{resolve:secretsmanager:SourceRegionSecret:SecretString:username}}'
      Password: '{{resolve:secretsmanager:SourceRegionSecret:SecretString:password}}'
      SecurityGroupArns:
        - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:security-group/${securityGroupId}
      Subdirectory: !Ref fsxwPath

  FSxwToFSxwTask:
    Type: AWS::DataSync::Task
    Properties:
      Name: 'Copy FSxW to FSxW'
      SourceLocationArn: !Ref SourceFSxWindowsLocation
      DestinationLocationArn: !Ref destLocationArn
      Options:
        VerifyMode: 'ONLY_FILES_TRANSFERRED'
        OverwriteMode: 'ALWAYS'
        PreserveDeletedFiles: 'PRESERVE'
        LogLevel: 'BASIC'
      CloudWatchLogGroupArn: !Ref logGroupArn

To test out this use case, you can use the two templates available here: source and destination.
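
The task in the snippet above runs only when it is started. To make the replication periodic, one option is to add a `Schedule` property to the task. The fragment below is a sketch of that variation. The cron expression (daily at 02:00, which I believe DataSync evaluates in UTC) is an arbitrary example value, and the fragment still relies on the source location and parameters defined in the source Region template.

```yaml
Resources:
  FSxwToFSxwTask:
    Type: AWS::DataSync::Task
    Properties:
      Name: 'Copy FSxW to FSxW'
      SourceLocationArn: !Ref SourceFSxWindowsLocation
      DestinationLocationArn: !Ref destLocationArn
      # Example schedule only; adjust the expression to your recovery objectives
      Schedule:
        ScheduleExpression: 'cron(0 2 * * ? *)'
      Options:
        VerifyMode: 'ONLY_FILES_TRANSFERRED'
        OverwriteMode: 'ALWAYS'
        PreserveDeletedFiles: 'PRESERVE'
        LogLevel: 'BASIC'
      CloudWatchLogGroupArn: !Ref logGroupArn
```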

Copy data from an on-premises Windows Server to Amazon S3 for in-cloud processing

For this use case, I want to copy files from my on-premises Windows Server into my Amazon S3 bucket. Doing so enables me to process the data using AWS Cloud services such as Amazon SageMaker, Amazon Athena, and Amazon EMR. I also want to use my AWS Direct Connect link and a VPC endpoint for private connectivity from my on-premises data center into AWS.

My template assumes that the VPC endpoint, subnet, security group, S3 bucket (and an IAM role to access the bucket), and a CloudWatch log group have already been created. These will be parameters to the template.

As with the previous use case, user credentials are required for DataSync to connect to the Windows Server via the SMB protocol. I will read the credentials from AWS Secrets Manager.

The following is a snippet from my template, which creates the on-premises agent using the VPC endpoint, a location for the Windows Server, a location for the S3 bucket, and a task to copy the data.

Resources:

  OnPremAgent:
    Type: AWS::DataSync::Agent
    Properties:
      ActivationKey: !Ref activationKey
      AgentName: 'OnPrem Agent'
      SecurityGroupArns:
        - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:security-group/${securityGroupId}
      SubnetArns:
        - !Sub arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:subnet/${subnetId}
      VpcEndpointId: !Ref vpcEndpointId

  SmbLocation:
    Type: AWS::DataSync::LocationSMB
    Properties:
      AgentArns:
        - !Ref OnPremAgent
      User: '{{resolve:secretsmanager:MySecret:SecretString:username}}'
      Password: '{{resolve:secretsmanager:MySecret:SecretString:password}}'
      Domain: '{{resolve:secretsmanager:MySecret:SecretString:domain}}'
      ServerHostname: !Ref smbServer
      Subdirectory: !Ref smbPath

  S3Location:
    Type: AWS::DataSync::LocationS3
    Properties:
      S3BucketArn: !Sub arn:${AWS::Partition}:s3:::${bucketName}
      S3Config:
        BucketAccessRoleArn: !Ref bucketRoleArn
      S3StorageClass: STANDARD
      Subdirectory: !Ref s3Prefix

  SmbToS3Task:
    Type: AWS::DataSync::Task
    Properties:
      Name: 'Copy SMB to S3'
      SourceLocationArn: !Ref SmbLocation
      DestinationLocationArn: !Ref S3Location
      Options:
        VerifyMode: 'ONLY_FILES_TRANSFERRED'
        OverwriteMode: 'ALWAYS'
        PreserveDeletedFiles: 'PRESERVE'
        LogLevel: 'BASIC'
      CloudWatchLogGroupArn: !Ref logGroupArn

To deploy the resources for this use case, you can access the full template.
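
The SMB location above refers to the secret by a fixed name, MySecret. If you want to reuse the template with different secrets, one approach is to pass the secret name as a parameter and build the dynamic reference with `!Sub`. The following is a sketch under that assumption. The parameter name `smbSecretName` is hypothetical, and the secret is assumed to hold the same `username`, `password`, and `domain` keys used above.

```yaml
Parameters:
  # Hypothetical parameter; only the secret name is passed, never the password
  smbSecretName:
    Type: String
    Default: MySecret
    Description: Name of the Secrets Manager secret with the SMB credentials

Resources:
  SmbLocation:
    Type: AWS::DataSync::LocationSMB
    Properties:
      AgentArns:
        - !Ref OnPremAgent
      # !Sub fills in the secret name before the dynamic reference is resolved
      User: !Sub '{{resolve:secretsmanager:${smbSecretName}:SecretString:username}}'
      Password: !Sub '{{resolve:secretsmanager:${smbSecretName}:SecretString:password}}'
      Domain: !Sub '{{resolve:secretsmanager:${smbSecretName}:SecretString:domain}}'
      ServerHostname: !Ref smbServer
      Subdirectory: !Ref smbPath
```

This keeps the credentials out of the template and its parameters, in line with the recommendation earlier in this post.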

Cleanup

DataSync resources do not have an inherent cost. With DataSync, you pay only for the data you transfer. To delete any DataSync resources created using the preceding templates, simply delete the associated CloudFormation stacks.

Other AWS resources mentioned in this blog, such as Amazon EFS file systems, Amazon FSx file systems, or objects stored in S3 buckets, do have an ongoing cost. If you used this blog for testing or learning purposes, and don’t intend to continue using these resources, I recommend you delete them to avoid ongoing costs.

Conclusion and next steps

In this blog, I showed you how to use CloudFormation templates to create DataSync agents, locations, and tasks for various use cases. Using CloudFormation, you can automate the creation and deployment of resources in your AWS Regions, while managing your templates as code. This gives you a single source of truth for your deployments and allows you to automate your best practices, scale your infrastructure worldwide, and integrate with other AWS services. If you would like to read more on the DataSync resources supported by CloudFormation, check out the documentation. You can also find the full CloudFormation templates on GitHub.

To learn more about AWS DataSync and how it can help you with your online data transfer needs, take a look at the following links:

Thanks for reading about using AWS DataSync with AWS CloudFormation. If you have any comments or questions about anything covered in this post, please leave them in the comments section.

Jeff Bartley

Jeff is a Principal Product Manager on the AWS DataSync team. He enjoys helping customers tackle their biggest data challenges through cloud-scale architectures. A native of Southern California, Jeff loves to get outdoors whenever he can.