Transferring data between AWS accounts using AWS DataSync
In today’s business world, enterprises work together through many different means. One of those ways is the sharing of data. Data can come in many different types, like data streams, structured databases, and basic file data. File data is a common data type within companies, and it can be difficult to transfer file data between two different storage protocols. Moreover, transferring files between companies and changing the protocols at the same time can add to the complexity.
In this blog, we cover using AWS DataSync to copy file data on a daily basis from a Windows Server Message Block (SMB) share running on an Amazon EC2 Windows instance in one account, to an Amazon S3 bucket in a different AWS account and Region, via the internet. Normally, AWS recommends using VPC peering for this use case, but if you cannot peer VPCs together, perhaps due to internal security policies or regulatory compliance, you can still use the internet to securely transfer the data. For more information on using VPC peering, refer to the companion blog on AWS DataSync data transfers using VPC peering.
AWS DataSync components
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS Storage services, as well as between AWS Storage services. You can use DataSync to migrate active datasets to AWS, archive data to free up on-premises storage capacity, replicate data to AWS for business continuity, or transfer data to the cloud for analysis and processing. With DataSync, you can remove the manual tasks related to data transfers that can slow down migrations and business continuity projects. DataSync handles these tasks automatically, including copying the data, scheduling and monitoring transfers, validating data integrity, and optimizing network utilization. DataSync allows transfers between different sources and destinations on-premises or across different AWS accounts, and you can do these transfers securely over the internet if VPC peering is unavailable.
Before we share the steps to implement this solution, let’s review the DataSync components. When copying between on-premises and AWS, DataSync uses an agent to connect to on-premises storage systems. This agent communicates with the DataSync managed service running in AWS. A DataSync configuration consists of the source and destination locations (SMB and Amazon S3 in this case), the task that defines how the data copy takes place, and the execution of the task. We outline each of the components in the following subsections. For further details on the components and the process, see the AWS documentation on how AWS DataSync works.
AWS DataSync agent on Amazon EC2
An AWS DataSync agent on Amazon EC2 can transfer data between two locations in AWS, including cross-Region and cross-account transfers, which are the focus of this blog. One role of the DataSync agent is to access your self-managed storage system and manage the data transfer to and from AWS Storage services. Note, however, that you don’t need a DataSync agent when copying data and metadata between AWS Storage services in the cloud. DataSync determines which files are new or changed and only replicates new or changed files between the source and destination locations.
DataSync managed service
The DataSync service component is the AWS managed service for DataSync that orchestrates the data transfer between the agent and the final destination. You consume the service in the Region that you specify from the AWS Management Console.
A DataSync location is an endpoint of a task. Each task has two locations: a source location and a destination location. DataSync supports the following locations:
- Network File System (NFS)
- Server Message Block (SMB)
- Self-managed object storage
- Amazon EFS
- Amazon FSx for Windows File Server
- Amazon S3
An AWS DataSync task includes two locations (source and destination), and defines the configuration of how to transfer the data from one location to the other. Configuration settings can include task scheduling, file controls, and permissions. A task is the complete definition of a data transfer.
A task execution is an individual run of a task, which shows information such as start time, end time, number of transferred files, and status.
In this configuration, we are using SMB as the source location because we are copying files from an EC2 Windows file server instance and the target location is Amazon S3.
Figure 1: DataSync cross account architecture
The preceding architecture diagram shows the AWS DataSync agent running as an EC2 instance that connects to the EC2 Windows file server instance in the same Availability Zone. This setup avoids cross Availability Zone data transfer charges. The DataSync agent EC2 instance and the Windows file server instance are part of a source AWS account that will connect securely to the DataSync public endpoint in the destination Region and the destination AWS account. In the destination account and Region, the DataSync service will manage the connection to the Amazon S3 bucket and perform the transfer.
DataSync instance information
When deploying AWS DataSync on Amazon EC2, the instance size must be at least 2xlarge for your data transfer to take place.
We recommend using one of the following instance types:
- 2xlarge – For tasks to transfer up to 20 million files
- 4xlarge – For tasks to transfer more than 20 million files
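The sizing guidance above can be expressed as a small helper. This is only a sketch: the function name is ours, and it returns just the size suffix given in the list above, leaving the instance family for you to choose.

```python
def recommend_agent_size(file_count: int) -> str:
    """Return the EC2 size suffix suggested above for a given file count.

    Hypothetical helper for illustration; the 20 million threshold comes
    from the recommendation list in this post.
    """
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    threshold = 20_000_000
    return "2xlarge" if file_count <= threshold else "4xlarge"


print(recommend_agent_size(5_000_000))
print(recommend_agent_size(35_000_000))
```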
Transport Layer Security (TLS) encrypts all the data transferred between the source and destination. In addition, the data is never persisted in AWS DataSync itself. The service supports using default encryption for S3 buckets.
Now let’s discuss the setup and configuration.
Step 1: Create the EC2 DataSync instance
Create an EC2 DataSync agent in the source AWS account and Region. Assign a public IP to the instance. You must launch the DataSync agent in the source account and activate it in the destination account. Also, make sure to deploy the DataSync agent in the same Availability Zone as the EC2 Windows file server instance to avoid cross Availability Zone network charges.
Note: If you choose to keep the agent in a public subnet, make sure to lock down the security group and network ACL rules. The source AWS account administrator can remove the inbound TCP port 80 rule after the DataSync agent activation, but must keep outbound TCP 443, TCP/UDP 53, and UDP 123. Please review the DataSync network requirements documentation for more details.
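As a rough sketch, the ports from the note above can be written as rule definitions in the shape that the EC2 security group APIs accept for their IpPermissions parameter. The helper names and the default 0.0.0.0/0 destination are our assumptions; narrow the CIDR ranges wherever your environment allows.

```python
def _rule(protocol: str, port: int, cidr: str) -> dict:
    """Build one single-port rule in the EC2 IpPermissions shape."""
    return {
        "IpProtocol": protocol,
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def agent_egress_rules(cidr: str = "0.0.0.0/0") -> list:
    """Outbound rules the agent keeps: TCP 443, TCP/UDP 53, UDP 123."""
    return [
        _rule("tcp", 443, cidr),  # DataSync service endpoint (HTTPS)
        _rule("tcp", 53, cidr),   # DNS
        _rule("udp", 53, cidr),   # DNS
        _rule("udp", 123, cidr),  # NTP
    ]


def activation_ingress_rule(admin_cidr: str) -> dict:
    """Inbound TCP 80, needed only until the agent has been activated."""
    return _rule("tcp", 80, admin_cidr)


for rule in agent_egress_rules():
    print(rule["IpProtocol"], rule["FromPort"])
```

The inbound rule is kept separate so that it can be revoked once activation has completed.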
Step 2: Create and activate the DataSync agent
Open the DataSync console in the destination account and Region. When you create your agent, select the Public service endpoints in <Region> option from the dropdown, and type the public IP address of the DataSync agent that you created in step 1 into the Agent address box. Click the Get key button to activate the DataSync agent.
Note: Make sure to select Amazon EC2 for the hypervisor, the public service endpoint in your desired AWS Region, and that the browser you are using can connect to the public IP of the DataSync agent.
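If you prefer to script the activation rather than use the console, the sketch below shows one possible approach with boto3. It assumes the agent answers an HTTP request on port 80 with its activation key, in the same way as the curl-based activation described in the DataSync documentation; please verify the query parameters against the current documentation before relying on them. The function names, IP address, and agent name are placeholders of ours.

```python
from urllib.parse import urlencode


def activation_url(agent_ip: str, region: str) -> str:
    """Build the URL used to request an activation key from the agent.

    Assumption: the query string follows the CLI activation flow in the
    DataSync documentation.
    """
    query = urlencode({"gatewayType": "SYNC", "activationRegion": region})
    return f"http://{agent_ip}/?{query}&no_redirect"


def activate_agent(agent_ip: str, region: str, agent_name: str) -> str:
    """Fetch the key from the agent, then register it in the destination account.

    Run this with credentials for the DESTINATION account, from a machine
    that can reach the agent's public IP on TCP 80.
    """
    import urllib.request

    import boto3  # third-party: pip install boto3

    with urllib.request.urlopen(activation_url(agent_ip, region), timeout=30) as resp:
        activation_key = resp.read().decode().strip()
    datasync = boto3.client("datasync", region_name=region)
    response = datasync.create_agent(ActivationKey=activation_key, AgentName=agent_name)
    return response["AgentArn"]


# 203.0.113.10 is a documentation-only placeholder address.
print(activation_url("203.0.113.10", "us-west-2"))
```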
Figure 2: Create the DataSync Agent object in destination account
Step 3: Configure the source SMB location
Configure the source EC2 Windows file server instance as an SMB location. Click the Locations option from the left navigation panel, and then click Create Location. Next, select Server Message Block (SMB) as your Location type. Afterward, select the agent you created in the preceding step, and fill in the SMB Server IP address, Share name, and user credentials with the permissions to access the SMB file shares.
Note: The source AWS account Windows file server administrator must grant the file share domain or workgroup service account the permissions to access files, folders, and metadata. In addition, the EC2 Windows instance security group must allow inbound TCP/UDP 445 and TCP/UDP 139. In other words, the security group must allow SMB file sharing access from the DataSync EC2 instance private IP address so that the DataSync instance can access the SMB share and transfer the data.
Figure 3: Create the SMB location in the destination account
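For readers who script their setup, the console fields in this step map onto the parameters of the boto3 create_location_smb call. The following is a minimal sketch under our own assumptions: the helper name and the sample values are hypothetical, and in practice you would load the password from a secrets store rather than hard-code it.

```python
from typing import Optional


def smb_location_params(
    server_ip: str,
    share_name: str,
    user: str,
    password: str,
    agent_arn: str,
    domain: Optional[str] = None,
) -> dict:
    """Assemble keyword arguments for DataSync create_location_smb."""
    params = {
        "ServerHostname": server_ip,
        # DataSync expects the share path with forward slashes.
        "Subdirectory": "/" + share_name.strip("/\\"),
        "User": user,
        "Password": password,
        "AgentArns": [agent_arn],
    }
    if domain:
        params["Domain"] = domain
    return params


# All values below are placeholders for illustration.
params = smb_location_params(
    "10.0.1.25",
    "share",
    "svc-datasync",
    "example-password",
    "arn:aws:datasync:us-west-2:111122223333:agent/agent-EXAMPLE",
)
print(params["Subdirectory"])

# With destination-account credentials, the location could then be created with:
#   boto3.client("datasync", region_name="us-west-2").create_location_smb(**params)
```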
Step 4: Configure the destination location
Configure a destination location as Amazon S3. Select Locations from the left navigation menu, then click on Create Location. Choose your target Amazon S3 bucket, S3 storage class, folder, and the IAM role with the permissions to access the Amazon S3 bucket. DataSync can transfer data directly into all S3 storage classes without having to manage zero-day lifecycle policies. For each transfer, you can select the most cost-effective S3 storage class for your needs. DataSync detects existing files or objects in the destination file system or bucket. To prevent accidental modification or loss of data, you can configure DataSync to never overwrite existing data.
Note: If you target Amazon S3, DataSync applies default POSIX metadata to the Amazon S3 object. This includes using the default POSIX user ID and group ID values. Refer to how DataSync handles metadata and special files to learn more. Please also review the Amazon S3 storage class considerations with DataSync documentation.
Figure 4: Create the Amazon S3 location in the destination account
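The same destination settings can be sketched as parameters for the boto3 create_location_s3 call. The helper name, bucket, role, and prefix below are placeholders of ours, and the ARN format assumes the standard aws partition.

```python
def s3_location_params(
    bucket_name: str,
    role_arn: str,
    prefix: str = "",
    storage_class: str = "STANDARD",
) -> dict:
    """Assemble keyword arguments for DataSync create_location_s3."""
    params = {
        # Assumes the standard "aws" partition.
        "S3BucketArn": f"arn:aws:s3:::{bucket_name}",
        "S3StorageClass": storage_class,
        # IAM role that DataSync assumes to access the bucket.
        "S3Config": {"BucketAccessRoleArn": role_arn},
    }
    cleaned = prefix.strip("/")
    if cleaned:
        params["Subdirectory"] = "/" + cleaned
    return params


# All values below are placeholders for illustration.
params = s3_location_params(
    "example-destination-bucket",
    "arn:aws:iam::444455556666:role/ExampleDataSyncS3Role",
    prefix="incoming/smb",
    storage_class="STANDARD_IA",
)
print(params["S3BucketArn"], params["Subdirectory"])

# With destination-account credentials:
#   boto3.client("datasync", region_name="us-west-2").create_location_s3(**params)
```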
Step 5: Create the replication task
Configure task settings by mapping the existing source SMB location in step 3 and the destination Amazon S3 bucket in step 4. Refer to task settings documentation to learn more about the task settings and options.
Note: If you want to periodically replicate new files, make sure to select your preferred schedule.
Figure 5: Choose the source location for the task
After configuring the source location, do the same for the destination location:
Figure 6: Choose the destination location for the task
Figure 7: Verify the task settings
Review your settings and create your DataSync task.
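As an illustration of a daily schedule, the sketch below assembles parameters for the boto3 create_task call. The helper name, the 02:00 UTC default, and the optional never-overwrite switch are our choices for the example, not settings prescribed by this post.

```python
def daily_task_params(
    source_location_arn: str,
    destination_location_arn: str,
    name: str,
    hour_utc: int = 2,
    never_overwrite: bool = False,
) -> dict:
    """Assemble keyword arguments for DataSync create_task with a daily schedule."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    params = {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Name": name,
        # AWS cron syntax: minute hour day-of-month month day-of-week year.
        "Schedule": {"ScheduleExpression": f"cron(0 {hour_utc} * * ? *)"},
    }
    if never_overwrite:
        # Protects existing destination data; changed files are then not updated.
        params["Options"] = {"OverwriteMode": "NEVER"}
    return params


# ARNs below are placeholders for illustration.
params = daily_task_params(
    "arn:aws:datasync:us-west-2:444455556666:location/loc-SOURCEEXAMPLE",
    "arn:aws:datasync:us-west-2:444455556666:location/loc-DESTEXAMPLE",
    "daily-smb-to-s3",
)
print(params["Schedule"]["ScheduleExpression"])

# With destination-account credentials:
#   boto3.client("datasync", region_name="us-west-2").create_task(**params)
```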
Step 6: Start the DataSync task
Start your task so DataSync can start transferring the data by clicking Start from the task list, or inside the task overview itself. If you set a schedule during the task setup, then the task will start at the time you specified. You can learn more about task execution and monitoring your DataSync task with Amazon CloudWatch in the linked documentation.
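If you start the task from a script instead of the console, a simple polling loop is one way to wait for the result. In this sketch, the function name, polling interval, and retry limit are our assumptions, and we treat SUCCESS and ERROR as the final statuses. The describe function is passed in as an argument so that the loop can be exercised without calling AWS.

```python
import time

# Assumption: an execution is finished once it reports one of these statuses.
TERMINAL_STATUSES = {"SUCCESS", "ERROR"}


def wait_for_execution(describe, execution_arn, poll_seconds=30.0, max_polls=120):
    """Poll a task execution until it reaches a terminal status.

    `describe` is expected to behave like the boto3 DataSync client's
    describe_task_execution method.
    """
    for _ in range(max_polls):
        result = describe(TaskExecutionArn=execution_arn)
        if result["Status"] in TERMINAL_STATUSES:
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(f"{execution_arn} did not finish after {max_polls} polls")


# Example wiring with boto3 (not executed here):
#   datasync = boto3.client("datasync", region_name="us-west-2")
#   arn = datasync.start_task_execution(TaskArn=task_arn)["TaskExecutionArn"]
#   final = wait_for_execution(datasync.describe_task_execution, arn)
#   print(final["Status"])
```

For long-running transfers, the Amazon CloudWatch monitoring mentioned above is usually a better fit than keeping a polling script alive.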
AWS charges the destination account for the use of AWS DataSync, since this is where you use the DataSync endpoint. Refer to the DataSync pricing page for more information.
Compared to the VPC peering method, the source account incurs higher data transfer OUT charges when transferring data using the internet method. If possible, use the VPC peering method to reduce costs. Refer to the Amazon EC2 pricing page for more information.
Once the data migration is finished, be sure to remove the deployed resources that facilitated the migration. Terminate the DataSync agent EC2 instance in the source account to avoid incurring EC2 charges. In addition, delete the DataSync task, location, and agent configurations in the destination account, unless you are going to reuse those items later. AWS does not charge you for having a DataSync configuration.
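The destination-account cleanup can also be scripted. The sketch below assumes a boto3 DataSync client and uses a helper name of our own. It removes the task first, then the locations, and finally the agent registration, so that nothing is deleted while another resource still refers to it.

```python
def delete_datasync_config(datasync, task_arn, location_arns, agent_arn):
    """Delete the task, its locations, and the agent registration, in that order.

    `datasync` is expected to be a boto3 DataSync client created with
    destination-account credentials.
    """
    datasync.delete_task(TaskArn=task_arn)
    for location_arn in location_arns:
        datasync.delete_location(LocationArn=location_arn)
    datasync.delete_agent(AgentArn=agent_arn)


# Example wiring (not executed here):
#   datasync = boto3.client("datasync", region_name="us-west-2")
#   delete_datasync_config(datasync, task_arn, [smb_arn, s3_arn], agent_arn)
```

Removing the agent registration in the destination account does not terminate the EC2 instance in the source account, so that step still has to be done separately.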
In this blog, we covered setting up an AWS DataSync task to simplify transferring data between two AWS accounts across the internet when neither account can use VPC peering. VPC peering may be unavailable for several reasons, like internal security policies or regulatory compliance, but you can still transfer data easily across accounts, protocols, and Regions using AWS DataSync over the internet. First, we described the steps on how to set up the DataSync service to use public service endpoints. Then, we discussed setting the task to transfer data from a source SMB server to Amazon S3.
The solution outlined in this post can help with transferring massive amounts of data between accounts, with little effort. It also removes much of the complexity around copying data between protocols and locations. With this simplified process, you can save time transferring data, while also gaining all the capabilities of a fully managed and user-friendly data transfer service like AWS DataSync. Some of these benefits include being able to automate and monitor transfer tasks, and being able to transfer data to take advantage of different storage options, within minutes.
Thanks for reading this post on using AWS DataSync to transfer your data over the internet when you don’t have access to VPC peering. If you have any comments or questions, please don’t hesitate to leave them in the comments section.