Q: What is AWS DataSync?
A: AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data between storage systems and AWS storage services such as Amazon S3 and Amazon EFS, over the Internet or AWS Direct Connect.
Q: Why should I use AWS DataSync?
A: AWS DataSync allows you to move, copy, and synchronize large datasets with millions of files, without having to build custom solutions with open source tools, or license and manage expensive commercial network acceleration software. You can use DataSync for one-time migration of active data, periodic distribution for data processing workflows, or ongoing replication for business continuity.
Q: What problem does AWS DataSync solve for me?
A: DataSync reduces the complexity and cost of online data transfer, making it simple to transfer datasets between on-premises storage systems and Amazon S3 or Amazon Elastic File System (EFS). DataSync connects to existing storage systems and data sources with standard storage protocols (NFS or SMB), and uses a purpose-built network protocol and scale-out architecture to accelerate transfer to and from AWS. DataSync automatically scales and handles the tasks involved in moving data, monitoring the progress of transfers, encrypting and verifying data transfers, and notifying you of any failures. With DataSync you pay only for the amount of data copied, with no minimum commitments or upfront fees.
Q: Where can I transfer data to and from?
A: DataSync can copy data between NFS servers, SMB file shares, Amazon S3 buckets, and Amazon EFS file systems.
Q: Can I use AWS DataSync to migrate data to AWS?
A: Yes. You can use DataSync to migrate on-premises data to Amazon S3, Amazon EFS, and Amazon WorkDocs. Read the storage blog, "Migrating storage with AWS DataSync," to learn more about migration best practices and tips.
Q: How do I get started with AWS DataSync?
A: You can transfer data using DataSync with a few clicks in the AWS Management Console or through the AWS Command Line Interface (CLI). To get started, you deploy a DataSync agent, configure the source and destination storage locations, and initiate a data transfer task.
Q: How do I use AWS DataSync?
A: To use DataSync, follow these three steps:
1. Deploy an agent - Deploy a DataSync agent and associate it with your AWS account via the Management Console or API. The agent will be used to access your NFS server or SMB file share to read data from it or write data to it.
2. Create a data transfer task - Create a task by specifying the location of your data source and destination, and any options you want to use to configure the transfer, such as copying file metadata.
3. Start the transfer - Start the task and monitor data movement in the console or with Amazon CloudWatch.
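The three steps above can be sketched in Python. This is a minimal illustration, not a complete setup: it assumes the agent from step 1 is already deployed and activated, and that `client` behaves like the boto3 DataSync client (`boto3.client("datasync")`). The function name, task name, and all ARNs are placeholders chosen for this example.

```python
def run_transfer(client, agent_arn, nfs_host, nfs_path, bucket_arn, role_arn):
    """Create a source and destination location, create a task, and start it.

    `client` is assumed to behave like boto3.client("datasync").
    Returns the ARN of the started task execution.
    """
    # Step 1 is assumed to be done: agent_arn identifies an activated agent.
    source = client.create_location_nfs(
        ServerHostname=nfs_host,
        Subdirectory=nfs_path,
        OnPremConfig={"AgentArns": [agent_arn]},
    )
    destination = client.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": role_arn},
    )

    # Step 2: create the task from the two locations.
    task = client.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination["LocationArn"],
        Name="example-nfs-to-s3",  # placeholder name
    )

    # Step 3: start the transfer.
    execution = client.start_task_execution(TaskArn=task["TaskArn"])
    return execution["TaskExecutionArn"]
```

Passing the client in as an argument keeps the sketch independent of credentials and Region configuration, which you would set up separately.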
Q: How do I deploy an AWS DataSync agent?
A: You deploy a DataSync agent to your VMware ESXi hypervisor or in Amazon EC2. To copy data to or from an on-premises file server, you download the agent virtual machine image (an OVA file) from the AWS Management Console and deploy it to your on-premises VMware ESXi hypervisor. To copy data to or from an in-cloud file server, you create an Amazon EC2 instance from the agent AMI provided in the AWS Management Console. In both cases the agent must be deployed so that it can access your file server using either the NFS or SMB protocol.
Q: What are the resource requirements for the AWS DataSync agent?
A: You can find the minimum required resources to run the agent in the AWS DataSync documentation.
Q: How do I start an AWS DataSync data transfer task?
A: DataSync copies data when you initiate a task via the AWS Management Console or AWS Command Line Interface (CLI). Each time a task runs, it scans the source for changes and copies any differences between the source and the destination. You can configure which characteristics of the source are used to determine what changed, define filters to include and exclude specific files or folders, and control whether files or objects in the destination are overwritten when changed in the source or deleted when not found in the source.
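The scan-and-compare behavior described in this answer can be modeled in a few lines of Python. This is a conceptual sketch only, not DataSync's implementation: it assumes each side is summarized as a mapping from path to a `(size, mtime)` tuple, and the names `plan_transfer`, `overwrite`, and `delete_removed` are hypothetical.

```python
from fnmatch import fnmatch


def plan_transfer(source, destination, excludes=(), overwrite=True, delete_removed=False):
    """Return (to_copy, to_delete) as sorted lists of paths.

    source and destination map a path to a (size, mtime) tuple.
    excludes is a list of shell-style patterns for paths to skip.
    """
    def excluded(path):
        return any(fnmatch(path, pattern) for pattern in excludes)

    to_copy = []
    for path, attributes in source.items():
        if excluded(path):
            continue
        if path not in destination:
            to_copy.append(path)  # new in the source
        elif destination[path] != attributes and overwrite:
            to_copy.append(path)  # changed in the source

    to_delete = []
    if delete_removed:
        # Destination entries that no longer exist in the source.
        to_delete = [p for p in destination if p not in source and not excluded(p)]

    return sorted(to_copy), sorted(to_delete)
```

For example, with a changed `b.txt` in the source and a stale `old.txt` in the destination, `plan_transfer(source, destination, excludes=["*.tmp"], delete_removed=True)` would list `b.txt` for copying and `old.txt` for deletion.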
Q: How does AWS DataSync ensure my data is copied correctly? How does AWS DataSync conduct data verification?
A: As DataSync transfers and stores data, it performs integrity checks to ensure the data written to the destination matches the data read from the source. Additionally, an optional verification check can be performed to ensure the data stored in the destination matches the data stored in the source by calculating and comparing full-file checksums. You can check either the entire dataset or just the files or objects that DataSync transferred.
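To make the idea of a full-file checksum comparison concrete, here is a small Python sketch. It assumes SHA-256 and local file paths purely for illustration; this FAQ does not state which checksum algorithm DataSync uses, and the function names are hypothetical.

```python
import hashlib


def file_checksum(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_path, destination_path):
    """Return True when both files have the same full-file checksum."""
    return file_checksum(source_path) == file_checksum(destination_path)
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, which matters when the files being compared are large.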
Q: How can I monitor the status of data being transferred by AWS DataSync?
A: You can use the AWS Management Console or CLI to monitor the status of data being transferred. Using Amazon CloudWatch Metrics, you can see the number of files and the amount of data that have been copied. Amazon CloudWatch Logs are available for detailed error information. In addition, CloudWatch Events are triggered as your tasks transition state, enabling automation of dependent workflows. You can find additional information, such as transfer progress, in the AWS Management Console or CLI.
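As an illustration of programmatic monitoring, the sketch below polls a task execution until it finishes. It assumes `client` behaves like the boto3 DataSync client and that `SUCCESS` and `ERROR` are the terminal status values; the function name and the injectable `sleep` parameter are choices made for this example.

```python
import time

# Status values treated as final in this sketch (an assumption).
TERMINAL_STATES = {"SUCCESS", "ERROR"}


def wait_for_task_execution(client, task_execution_arn, poll_seconds=30.0, sleep=time.sleep):
    """Poll a task execution until it reaches a terminal state.

    Returns the final description returned by the client.
    """
    while True:
        description = client.describe_task_execution(TaskExecutionArn=task_execution_arn)
        status = description["Status"]
        transferred = description.get("BytesTransferred", 0)
        files = description.get("FilesTransferred", 0)
        print(f"{status}: {transferred} bytes, {files} files")
        if status in TERMINAL_STATES:
            return description
        sleep(poll_seconds)
```

For event-driven workflows, reacting to CloudWatch Events is likely a better fit than polling; this loop is only meant for simple scripts.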
Q: How does AWS DataSync convert files and folders to or from objects in Amazon S3?
A: When files or folders are copied to Amazon S3, there is a one-to-one relationship between a file or folder and an object. File and folder metadata timestamps and POSIX permissions, including user ID, group ID, and permissions, are stored in S3 user metadata. File metadata stored in S3 user metadata is interoperable with File Gateway, providing on-premises file-based access to data stored in Amazon S3 by DataSync.
When DataSync copies from an NFS server, the POSIX permissions from the files and folders on the source are stored in the S3 user metadata. When copying from an SMB file share, default POSIX permissions are stored in S3 user metadata.
When DataSync copies objects that contain this user metadata back to an NFS server, the file metadata is restored. When copying back to an SMB file share, ownership is set based on the user that was configured in DataSync to access that file share, and default permissions are assigned.
Learn more about how DataSync stores files and metadata in our documentation.
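The round trip described above, where POSIX attributes are stored as S3 user metadata and later restored, can be sketched as follows. The metadata key names and value formats here are hypothetical and chosen only for illustration; refer to the DataSync documentation for the keys the service actually writes.

```python
import stat


def posix_to_user_metadata(mode, uid, gid, mtime_ns):
    """Encode POSIX attributes as string-valued metadata (hypothetical keys)."""
    return {
        "file-permissions": format(stat.S_IMODE(mode), "04o"),  # e.g. "0644"
        "file-owner": str(uid),
        "file-group": str(gid),
        "file-mtime": f"{mtime_ns}ns",
    }


def user_metadata_to_posix(metadata):
    """Decode the metadata produced by posix_to_user_metadata."""
    return {
        "mode": int(metadata["file-permissions"], 8),
        "uid": int(metadata["file-owner"]),
        "gid": int(metadata["file-group"]),
        "mtime_ns": int(metadata["file-mtime"][:-2]),  # drop the "ns" suffix
    }
```

S3 user metadata values are strings, which is why every attribute is serialized to text and parsed back on restore.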
Q: Can I copy my data into Amazon S3 Glacier or other S3 storage classes?
A: Yes. When configuring an S3 bucket for use with DataSync you can select the S3 storage class that DataSync uses to store objects. DataSync supports storing data directly into S3 Standard, S3 Intelligent-Tiering, S3 Standard-Infrequent Access (S3 Standard-IA), S3 One Zone-Infrequent Access (S3 One Zone-IA), Amazon S3 Glacier (S3 Glacier), and Amazon S3 Glacier Deep Archive (S3 Glacier Deep Archive). More information on Amazon S3 storage classes can be found in the Amazon Simple Storage Service Developer Guide.
Objects smaller than the minimum charge capacity per object will be stored in S3 Standard. For example, folder objects, which are zero-bytes in size and hold only metadata, will be stored in S3 Standard. Read about considerations when working with Amazon S3 storage classes in our documentation, and for more information on minimum charge capacities see Amazon S3 Pricing.
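The small-object rule above amounts to a simple size check. The sketch below assumes you supply the minimum charge capacity yourself, because the threshold varies by storage class and is listed in Amazon S3 Pricing; the function name and the 128 KiB value used in the example are illustrative assumptions.

```python
def effective_storage_class(size_bytes, requested_class, minimum_bytes):
    """Return the storage class an object would land in under the small-object rule.

    minimum_bytes is the minimum charge capacity for requested_class,
    supplied by the caller (see Amazon S3 Pricing for actual values).
    """
    if size_bytes < minimum_bytes:
        return "STANDARD"
    return requested_class


# Example with an assumed 128 KiB minimum: a zero-byte folder object stays in Standard.
ASSUMED_MINIMUM = 128 * 1024
folder_class = effective_storage_class(0, "GLACIER", ASSUMED_MINIMUM)
large_class = effective_storage_class(5 * 1024 * 1024, "GLACIER", ASSUMED_MINIMUM)
```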
Q: Which S3 request and storage costs apply when using S3 storage classes with AWS DataSync?
A: Some S3 storage classes have behaviors that can affect your cost, such as data retrieval, minimum storage capacities, and minimum storage durations. DataSync automates management of data to address these factors, and provides settings to minimize data retrieval. For instance, DataSync verifies only files that were transferred, stores small objects in S3 Standard, and has controls for overwriting and deleting objects. Read about considerations when working with Amazon S3 storage classes in our documentation.
Q: Can I copy data out of S3 Glacier and other storage classes?
A: When using S3 as the source location for a DataSync task, the service uses GetObject to retrieve all objects from the bucket that need to be copied to the destination. Retrieving objects that are archived in the S3 Glacier or S3 Glacier Deep Archive storage class results in an error. Retrieving objects from other storage classes will succeed, but for some storage classes you may be charged a retrieval fee based on the size of the objects. Any errors retrieving archived objects will be logged by DataSync and will result in a failed task completion status. Read about considerations when working with Amazon S3 storage classes in our documentation.
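Before running a task with an S3 source, it can help to check which objects are archived. The sketch below assumes an object listing shaped like S3 list results, where each entry has a `Key` and an optional `StorageClass` with values such as `GLACIER` and `DEEP_ARCHIVE`. The function name is hypothetical, and the sketch does not account for archived objects that have been temporarily restored.

```python
# Storage class values treated as archived in this sketch.
ARCHIVED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def split_by_retrievability(objects):
    """Split object keys into (readable, archived) lists.

    objects is an iterable of dicts with "Key" and optional "StorageClass".
    A missing StorageClass is treated as STANDARD.
    """
    readable, archived = [], []
    for obj in objects:
        storage_class = obj.get("StorageClass", "STANDARD")
        if storage_class in ARCHIVED_CLASSES:
            archived.append(obj["Key"])
        else:
            readable.append(obj["Key"])
    return readable, archived
```

If the archived list is not empty, a task that includes those keys would be expected to finish with a failed status, as described above.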
Q: Can I use versioning, lifecycle, cross-region replication, and S3 event notification with AWS DataSync?
A: Yes. Your bucket policies for versioning, lifecycle management, cross-region replication, and S3 event notification apply directly to objects transferred to your bucket through DataSync.
When using versioning, note that changes to object metadata will create a new version of the object.
You can use S3 lifecycle policies to change an object's storage tier or delete old objects or object versions.
Q: What happens if an AWS DataSync task is interrupted?
A: If a task is interrupted, for instance, if the network connection goes down or the DataSync agent is restarted, the next run of the task will transfer missing files, and the data will be complete and consistent at the end of this run. Each time a task is started it performs an incremental copy, transferring only the changes from the source to the destination.
Q: Can I use AWS DataSync with AWS Direct Connect?
A: Yes. You can use DataSync with your Direct Connect link to access public service endpoints or private VPC endpoints. When using VPC endpoints, data transferred between the DataSync agent and AWS services doesn’t traverse the public internet or need public IP addresses, increasing the security of data as it is copied over the network.
Q: Does AWS DataSync support VPC endpoints or AWS PrivateLink?
A: Yes. You can use VPC endpoints to ensure data transferred between your DataSync agent, either deployed on-premises or in-cloud, doesn't traverse the public internet or need public IP addresses. Using VPC endpoints increases the security of your data by keeping network traffic within your Amazon Virtual Private Cloud (Amazon VPC). VPC endpoints for DataSync are powered by AWS PrivateLink, a highly available, scalable technology that enables you to privately connect your VPC to supported AWS services.
Q: How do I configure AWS DataSync to use VPC endpoints?
A: To use VPC endpoints with DataSync, you create an AWS PrivateLink interface VPC endpoint for the DataSync service in your chosen VPC, and then choose this endpoint elastic network interface (ENI) when creating your DataSync agent. Your agent will connect to this ENI to activate, and subsequently all data transferred by the agent will remain within your configured VPC. You can use either the AWS DataSync Console, AWS Command Line Interface (CLI), or AWS SDK, to configure VPC endpoints. To learn more, see Using AWS DataSync in a Virtual Private Cloud.
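With the SDK, associating an agent with a VPC endpoint might look like the sketch below. It assumes `client` behaves like the boto3 DataSync client and that you already have an activation key obtained from the agent through the endpoint. The function name, agent name, and all identifiers are placeholders.

```python
def create_private_agent(client, activation_key, vpc_endpoint_id, subnet_arn,
                         security_group_arn, name="example-agent"):
    """Register an agent that communicates through a VPC endpoint.

    `client` is assumed to behave like boto3.client("datasync").
    Returns the ARN of the created agent.
    """
    response = client.create_agent(
        ActivationKey=activation_key,
        AgentName=name,                      # placeholder name
        VpcEndpointId=vpc_endpoint_id,       # interface endpoint for DataSync
        SubnetArns=[subnet_arn],             # subnet hosting the endpoint ENIs
        SecurityGroupArns=[security_group_arn],
    )
    return response["AgentArn"]
```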
Q: Does AWS DataSync preserve the source directory structure when transferring files?
A: Yes. When transferring files, DataSync creates a directory structure on the destination that is similar to the source location's structure.
Q: How fast can AWS DataSync copy my file system to AWS?
A: The rate at which DataSync can copy a given dataset is a function of the amount of data, the I/O bandwidth achievable from the source and destination storage, the network bandwidth available, and network conditions. A single DataSync agent is capable of saturating a 10 Gbps network link.
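As a rough planning aid, you can estimate a lower bound on transfer time from the slowest of the rates mentioned above. This sketch is a simplification under stated assumptions: it ignores per-file overhead, scanning time, and changing network conditions, and the function name is hypothetical.

```python
def estimate_transfer_seconds(data_bytes, network_bps, source_bps, destination_bps):
    """Estimate the minimum transfer time in seconds.

    All rates are in bits per second. The slowest rate is the bottleneck.
    """
    bottleneck_bps = min(network_bps, source_bps, destination_bps)
    if bottleneck_bps <= 0:
        raise ValueError("rates must be positive")
    return data_bytes * 8 / bottleneck_bps


# Example: 10 TB over a 10 Gbps link, with faster storage on both ends.
seconds = estimate_transfer_seconds(10 * 10**12, 10 * 10**9, 20 * 10**9, 40 * 10**9)
hours = seconds / 3600
```

Real transfers of many small files will usually take longer than this bound, because per-file operations rather than raw bandwidth dominate.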
Q: Can I control the amount of network bandwidth that an AWS DataSync task uses?
A: Yes. You can control the amount of network bandwidth that DataSync will use by configuring the built-in bandwidth throttle. This can help to minimize impact on other users or applications who rely on the same network connection.
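When configuring a task through the API, the throttle is expressed in bytes per second. The sketch below converts a limit in megabits per second into a task options fragment. It assumes the option is named `BytesPerSecond` and that `-1` means no limit, which matches the DataSync API as I understand it; the helper name is hypothetical.

```python
def throttle_options(limit_mbps=None):
    """Build a task Options fragment for a bandwidth limit.

    limit_mbps is in megabits per second; None means no limit.
    """
    if limit_mbps is None:
        return {"BytesPerSecond": -1}  # assumed to mean unlimited
    if limit_mbps <= 0:
        raise ValueError("limit_mbps must be positive")
    # Megabits per second -> bytes per second.
    return {"BytesPerSecond": int(limit_mbps * 1_000_000 / 8)}


# Example: cap the task at 80 Mbps, which is 10,000,000 bytes per second.
options = throttle_options(80)
```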
Q: Will AWS DataSync affect the performance of my source file system?
A: Depending on the capacity of your on-premises file store, and the quantity and size of files to be transferred, DataSync may affect the response time of other clients when accessing the same source data store, because the agent reads or writes data from that storage system. Configuring a bandwidth limit for a task will reduce this impact by limiting the I/O against your storage system.
Security and compliance
Q: Is my data encrypted while being transferred and stored?
A: Yes. All data transferred between the source and destination is encrypted via Transport Layer Security (TLS, which replaced Secure Sockets Layer, SSL). Data is never persisted in DataSync itself. The service supports using default encryption for S3 buckets and Amazon EFS file system encryption of data at rest.
Q: How does AWS DataSync access my NFS server or SMB file share?
A: DataSync uses an agent that you deploy into your IT environment or into Amazon EC2 to access your files through the NFS or SMB protocol. These agents connect to DataSync service endpoints within AWS, and once activated are securely managed from the AWS Management Console or CLI. When copying data to or from your premises, there is no need to set up a VPN or tunnel or to allow inbound connections, and the agents can be configured to route through a firewall using standard network ports. You can also deploy DataSync within your Amazon Virtual Private Cloud (Amazon VPC) using VPC endpoints. When using VPC endpoints, data transferred between the DataSync agent and AWS services doesn’t need to traverse the public internet or need public IP addresses.
Q: How do my AWS DataSync agents connect to AWS?
A: Your DataSync agents connect to service endpoints within your chosen AWS Region. When creating an agent, you can choose to have the agent connect to public Internet facing endpoints, Federal Information Processing Standards (FIPS) validated endpoints, or endpoints within one of your VPCs. To learn more, see Choose a Service Endpoint.
Q: How does AWS DataSync access my Amazon S3 bucket?
A: DataSync assumes an IAM role that you provide. The policy you attach to the role determines which actions the role can perform. DataSync can autogenerate this role on your behalf or you can manually configure a role.
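If you configure the role manually, it needs a trust policy that lets the DataSync service assume it, plus a permissions policy for the bucket. The sketch below builds both as Python dictionaries. The trust policy uses the `datasync.amazonaws.com` service principal; the list of S3 actions is an assumption for illustration and should be checked against the DataSync documentation for your use case.

```python
import json


def datasync_trust_policy():
    """Trust policy that lets the DataSync service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "datasync.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def bucket_permissions_policy(bucket_name):
    """Permissions policy scoped to one bucket (action list is an assumption)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(datasync_trust_policy(), indent=2))
```

Scoping the permissions policy to a single bucket keeps the role limited to the data the task actually needs.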
Q: How does AWS DataSync access my Amazon EFS file system?
A: DataSync accesses your Amazon EFS file system using the NFS protocol. It does so by mounting your file system from within your VPC from Elastic Network Interfaces (ENIs) managed by the DataSync service. DataSync fully manages the creation, use, and deletion of these ENIs on your behalf.
Q: Which compliance programs does AWS DataSync support?
A: AWS has the longest-running compliance program in the cloud and is committed to helping customers navigate their requirements. DataSync has been assessed to meet global and industry security standards. It complies with PCI DSS; ISO 9001, 27001, 27017, and 27018; and SOC 1, 2, and 3, in addition to being HIPAA eligible. That makes it easier for you to verify our security and meet your own obligations. For more information and resources, visit our compliance pages. You can also go to the Services in Scope by Compliance Program page to see a full list of services and certifications.
Q: Is AWS DataSync PCI compliant?
A: Yes. DataSync is PCI-DSS compliant, which means you can use it to transfer payment information. You can download the PCI Compliance Package in AWS Artifact to learn more about how to achieve PCI Compliance on AWS.
Q: Is AWS DataSync HIPAA eligible?
A: Yes. DataSync is HIPAA eligible, which means if you have a HIPAA BAA in place with AWS, you can use DataSync to transfer protected health information (PHI).
Q: How is my DataSync agent patched and updated?
A: Updates to the agent VM, including both the underlying operating system and the DataSync software packages, are managed by the service once the agent is activated. Updates are applied non-disruptively when the agent is idle and not executing a data transfer task.
When to choose AWS DataSync
Q: How is AWS DataSync different from using command line tools such as rsync or S3 sync?
A: Compared to DIY solutions built around command line tools, DataSync provides automated, fully managed data transfers. It uses a purpose-built network protocol and scale-out architecture to transfer data at up to 10 times the speed of such tools.
Specifically, DataSync fully automates the data transfer. It comes with built-in retry and network resiliency mechanisms, monitoring via the DataSync API and console, and CloudWatch metrics, events, and logs that provide granular visibility into the transfer process. DataSync performs data integrity verification both during the transfer and at the end of the transfer. The service also supports flexible configuration to suit your specific needs, including bandwidth throttling and copying source permissions and metadata.
DataSync provides end to end security: all data transferred between the source and destination is encrypted via TLS, and access to your AWS storage is enabled via built-in AWS security mechanisms such as IAM roles.
Q: How do I choose between AWS DataSync and AWS Snowball Edge?
A: AWS Snowball Edge is suitable for customers who don’t need their data in AWS immediately, are bandwidth constrained, or are transferring data from remote, disconnected, or austere environments. DataSync is ideal for customers who need online migrations for active data sets, timely transfers for continuously generated data, or replication for business continuity.
Q: How do I choose between AWS DataSync and AWS Storage Gateway?
A: If you are looking to transfer data between on-premises and AWS storage such as S3 or EFS, you use DataSync. DataSync is commonly used for storage migration or for timely recurring transfers of data from on-premises devices such as cameras and instruments for processing in AWS. If you are looking for low-latency access from on-premises to data in AWS, you use AWS Storage Gateway. Storage Gateway is commonly used for backup, hybrid workloads, latency-sensitive on-premises applications, content distribution across offices, and for file-based access to objects in S3.
With the combination of DataSync and the File Gateway configuration of Storage Gateway, you can rapidly move your on-premises storage to AWS, while retaining on-premises access for latency-sensitive applications.
Q: How do I choose between AWS DataSync and Amazon S3 Transfer Acceleration?
A: If your applications are already integrated with the Amazon S3 API, and you want higher throughput for transferring large files to S3, you can use S3 Transfer Acceleration. If you want to transfer data from existing storage systems (e.g. Network Attached Storage), or from instruments that can’t be changed (e.g. DNA sequencers, video cameras), or if you want multiple destinations, you use DataSync. DataSync also automates and simplifies the data transfer by providing additional functionality such as built-in retry and network resiliency mechanisms, data integrity verification, and flexible configuration to suit your specific needs, including bandwidth throttling and copying source permissions and metadata.
Q: How do I choose between AWS DataSync and AWS Transfer for SFTP?
A: If you currently use SFTP to exchange data with third parties, Transfer for SFTP provides a fully managed SFTP transfer directly into and out of Amazon S3, while reducing your operational burden.
If you want an accelerated and automated data transfer between NFS servers, SMB file shares, Amazon S3, and Amazon EFS, you can use DataSync. DataSync is ideal for customers who need online migrations for active data sets, timely transfers for continuously generated data, or replication for business continuity.
Q: Does AWS DataSync enable me to migrate to WorkDocs?
A: Yes. DataSync is part of the WorkDocs Migration Service. DataSync makes it easier and faster to migrate home directories and department shares to WorkDocs.
AWS DataSync has simple, predictable, usage-based pricing; you pay only for the amount of data that you copy.