AWS Storage Blog

Protect your file and backup archives using AWS DataSync and Amazon S3 Glacier

As the amount of data being generated on-premises continues to grow, so does the demand for more storage to house file and backup archives. If you follow common backup methodologies and have multiple backups in different locations, then you likely have a lot of cold data sitting on-premises in disk storage or in physical tape archives. Keeping track of multiple on-premises copies of data can be a challenge and often leads to significant costs, both in time and money.

AWS Cloud storage provides a compelling alternative to on-premises backup storage or physical tape archives. For example, Amazon S3 Glacier Deep Archive provides 11 9’s (99.999999999%) of durability at a price point of about $1 per TB/month. There’s no storage hardware to manage, no tapes to send offsite, and no sticker shock with hardware refresh cycles. With AWS Cloud storage, you get all the advantages of cloud scalability and durability, while paying only for what you use.

AWS DataSync is an online data transfer service that is designed to help customers get their data to and from AWS quickly, easily, and securely. Using DataSync, you can copy data from your on-premises NFS or SMB shares directly to Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), or Amazon FSx for Windows File Server. DataSync uses a purpose-built, parallel transfer protocol for speeds up to 10x faster than open source tools. DataSync also has built-in verification of data both in flight and at rest, so you can be confident that your data was transferred successfully.

In this post, I discuss how to use DataSync to copy your on-premises archive data to your selected AWS Cloud storage service. I also review how to choose AWS Cloud storage services for storing your data, and why Amazon S3 is the ideal service for protecting your on-premises file and backup archives. Finally, I discuss how to configure DataSync for your data protection workloads while also monitoring your ongoing transfer tasks.

How AWS DataSync works

A DataSync agent is deployed as a virtual machine (VM) in your on-premises VMware environment. You define a task to copy data from your source file system on-premises to the destination storage in AWS. You then execute the task to securely transfer your files. The fully managed AWS DataSync service is optimized for working with AWS storage services and scales to meet the performance needs of your task.

AWS Cloud storage options

One of the first things you want to consider is where to put your archive data in the cloud. AWS DataSync currently supports three AWS Cloud storage services: Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server. Amazon EFS and Amazon FSx for Windows File Server provide scalable file storage for Linux and Windows applications, respectively. They are generally designed for applications that require fast performance with low latency. For more static workloads, such as file and backup archives, Amazon S3 is usually the better choice.

Amazon S3 has a wide variety of storage classes to cover different workloads and use cases. The S3 storage class you choose primarily depends upon two factors: accessibility and cost. If you need immediate access to your data, then you want to use either S3 Standard, S3 Intelligent-Tiering, or S3 Standard-Infrequent Access. If you don’t require regular and immediate access to your data, then S3 Glacier or S3 Glacier Deep Archive may be a good choice. The S3 Glacier storage classes have an overall lower cost than the S3 storage classes that provide immediate access to your data. See the Amazon S3 pricing page for a full breakdown of costs.

S3 Glacier and S3 Glacier Deep Archive are good choices for file and backup archives. This is because the data in archives usually must be preserved for at least a few months, does not change, and is not accessed regularly. If this is not the case for your particular workload, then consider using one of the other S3 storage classes.

Choosing between S3 Glacier and S3 Glacier Deep Archive comes down to how quickly you must retrieve your data and how long you want to retain it. With S3 Glacier, you can retrieve your data within a few minutes to a few hours, whereas with S3 Glacier Deep Archive, you should plan for retrieval times of about 12 hours. You also want to consider how long you must store your data. S3 Glacier has a minimum storage duration of 90 days and S3 Glacier Deep Archive has a minimum storage duration of 180 days. If you remove your data before the minimum storage duration has passed, then you are charged for the time remaining. For your backup and file archive, choose S3 Glacier if you must meet a recovery time objective (RTO) of a few hours, or if you will delete files after 90 days. S3 Glacier Deep Archive is a good choice for archives that must be held for long periods of time, such as for compliance purposes. Check out the S3 Glacier product page for more details on these storage classes.
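The guidance above can be sketched as a small decision helper. This is only an illustration of the rule of thumb in this post, not an official sizing tool: the function names are hypothetical, and the 12-hour, 90-day, and 180-day figures are taken directly from the text.

```python
# Planning figures taken from this post; adjust them to current AWS documentation.
GLACIER_MIN_DAYS = 90
DEEP_ARCHIVE_MIN_DAYS = 180
DEEP_ARCHIVE_RETRIEVAL_HOURS = 12


def pick_archive_class(rto_hours: float, retention_days: int) -> str:
    """Pick an archive storage class from an RTO and a planned retention period."""
    # A tight RTO or a short retention period points to S3 Glacier.
    if rto_hours < DEEP_ARCHIVE_RETRIEVAL_HOURS or retention_days < DEEP_ARCHIVE_MIN_DAYS:
        return "S3 Glacier"
    return "S3 Glacier Deep Archive"


def remaining_billable_days(min_days: int, stored_days: int) -> int:
    """Days still charged if an object is deleted before the minimum duration."""
    return max(0, min_days - stored_days)


if __name__ == "__main__":
    print(pick_archive_class(rto_hours=4, retention_days=365))
    print(pick_archive_class(rto_hours=48, retention_days=7 * 365))
    print(remaining_billable_days(DEEP_ARCHIVE_MIN_DAYS, stored_days=60))
```

Note that the helper does not cover data kept for less than 90 days. In that case, one of the S3 storage classes with immediate access is likely a better fit.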

To fully protect your archive data in Amazon S3, consider enabling object versioning on your bucket. When versioning is enabled, rewriting an object or changing object metadata creates a new version of that object while still keeping the previous version. This can protect you from the consequences of unintended overwrites and deletions. DataSync copies any changes from your on-premises storage, so if an event compromises your on-premises data, such as a ransomware event, object versioning enables you to recover your data from previous versions. Note that a metadata-only change on-premises results in a new S3 object version. Additionally, a single DataSync task execution may create more than one version of an S3 object.
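As a minimal sketch, versioning can be turned on with the S3 `put_bucket_versioning` API. The helper below takes an already created boto3 S3 client as an argument. The function name and the bucket name in the usage comment are placeholders, not values from this post.

```python
def enable_versioning(s3_client, bucket_name: str) -> None:
    """Enable object versioning on a bucket using a boto3 S3 client."""
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   enable_versioning(boto3.client("s3"), "example-archive-bucket")
```

Because every overwrite adds a version, you may also want a lifecycle rule that expires noncurrent versions after a period that matches your retention policy.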

Configuring AWS DataSync

Once you’ve decided where to store your data in AWS, you want to configure DataSync according to:

  • Your available network bandwidth
  • The amount of data to be copied
  • The time window in which you must copy your data

AWS DataSync is an online transfer service and can copy data over the internet or within your Amazon VPC using AWS Direct Connect or AWS VPN. Regardless of which method you use, all data is encrypted in flight. When planning your DataSync deployment, you want to know how much bandwidth you have available, as that directly affects the time it takes to transfer your data. For example, if you have 100 Mb/s of available bandwidth and 10 TB of data, then the transfer takes about 9 to 10 days: 10 TB is roughly 80 trillion bits, which at 100 million bits per second is about 800,000 seconds. You can set bandwidth limits on DataSync tasks to control how much network bandwidth a task uses.
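The transfer-time estimate above is simple arithmetic, sketched here so you can plug in your own numbers. The function names and the `efficiency` parameter are assumptions for illustration. The sketch uses decimal units (1 TB = 10^12 bytes, 1 Mb/s = 10^6 bits per second) and ignores protocol overhead unless you lower `efficiency`.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate days needed to move data_tb terabytes over a bandwidth_mbps link."""
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    usable_bits_per_second = bandwidth_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 86_400  # seconds -> days


def mbps_to_bytes_per_second(bandwidth_mbps: float) -> int:
    """Convert a limit in Mb/s to bytes per second, the unit used for task bandwidth limits."""
    return int(bandwidth_mbps * 1e6 / 8)


if __name__ == "__main__":
    print(f"{transfer_days(10, 100):.1f} days at full line rate")
    print(f"{transfer_days(10, 100, efficiency=0.9):.1f} days at 90% efficiency")
    print(mbps_to_bytes_per_second(100), "bytes per second")
```

The second helper is useful when you set a bandwidth limit on a task, because the DataSync API expresses that limit in bytes per second rather than megabits per second.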

AWS DataSync uses agents to access your on-premises NFS or SMB shares and then copy your data to AWS. A DataSync agent is deployed on-premises as a virtual machine in your VMware environment. An agent runs a single transfer task at a time. If you have multiple file shares that you want to transfer at the same time, you must deploy multiple agents. Additionally, there are some cases where it is more efficient to transfer a single file share using multiple agents. To read more on how to deploy DataSync agents, check out our documentation.

When protecting your file or backup archives using AWS DataSync, you likely have a one-time full transfer, followed by ongoing incremental transfers to keep your cloud archive up to date. The full transfer makes a complete copy of your source data, while the ongoing incremental transfers copy only the changes. You use the same DataSync task for both full and incremental transfers. If your incremental transfers must occur at a regular interval, say daily or weekly, then you can configure your tasks to run on a schedule. For more information on configuring scheduling in DataSync, check out this blog post.

In addition, there may be some cases where you want more control over which files are being copied to AWS. DataSync gives you the option to specify both exclude and include filters, enabling you to transfer only a subset of your files from your on-premises systems.
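Scheduling and filtering can both be set when a task is created. The sketch below only builds the arguments for the DataSync `create_task` API call. The helper name, ARNs, task name, and patterns are placeholders, and the cron expression (daily at 02:00 UTC) is just one example of a schedule.

```python
def build_task_kwargs(source_arn, dest_arn, name, schedule_expression, exclude_patterns=()):
    """Build keyword arguments for DataSync create_task with a schedule and excludes."""
    kwargs = {
        "SourceLocationArn": source_arn,
        "DestinationLocationArn": dest_arn,
        "Name": name,
        "Schedule": {"ScheduleExpression": schedule_expression},
    }
    if exclude_patterns:
        # DataSync expects one filter string with patterns separated by "|".
        kwargs["Excludes"] = [
            {"FilterType": "SIMPLE_PATTERN", "Value": "|".join(exclude_patterns)}
        ]
    return kwargs


if __name__ == "__main__":
    task_kwargs = build_task_kwargs(
        source_arn="arn:aws:datasync:us-east-1:111122223333:location/loc-EXAMPLE-SOURCE",
        dest_arn="arn:aws:datasync:us-east-1:111122223333:location/loc-EXAMPLE-DEST",
        name="nightly-archive-copy",
        schedule_expression="cron(0 2 * * ? *)",
        exclude_patterns=["*/tmp", "*.bak"],
    )
    print(task_kwargs)
    # To create the task (assumes boto3 is installed and credentials are configured):
    #   import boto3
    #   boto3.client("datasync").create_task(**task_kwargs)
```

Because the same task handles both the initial full copy and later incremental copies, the schedule only needs to be defined once.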

Monitoring your tasks

AWS DataSync is fully integrated with Amazon CloudWatch for logging, events, and metrics. Before you start using DataSync, you want to configure DataSync logging to CloudWatch. You can use CloudWatch Events to send notifications when a task completes. You can also monitor the performance of your tasks from the AWS DataSync Management Console, or from the CloudWatch console and API.
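A completion notification is typically driven by an event pattern attached to a CloudWatch Events rule. The sketch below builds such a pattern as JSON. The `detail-type` string and the `State` values are my understanding of the events DataSync emits, so treat them as assumptions and verify them against the DataSync monitoring documentation before relying on them. The function name is hypothetical.

```python
import json


def task_execution_event_pattern(states=("SUCCESS", "ERROR")) -> str:
    """Return a JSON event pattern matching DataSync task executions in the given states."""
    pattern = {
        "source": ["aws.datasync"],
        # Assumed detail-type and state names; confirm against current documentation.
        "detail-type": ["DataSync Task Execution State Change"],
        "detail": {"State": list(states)},
    }
    return json.dumps(pattern, indent=2)


if __name__ == "__main__":
    print(task_execution_event_pattern())
```

You could attach the resulting pattern to a rule whose target is an Amazon SNS topic, so that both successful and failed executions produce a notification.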

Conclusion

In this blog post, I’ve discussed how you can use AWS DataSync to copy data to Amazon S3 for low-cost, durable storage of your file and backup archives. I’ve provided some guidance on choosing the AWS storage services that best fit your cloud storage needs and how to configure and use DataSync to copy data to the AWS Cloud. By storing your archive data in AWS, you can free up on-premises storage capacity for other uses or even eliminate some of your secondary backup storage altogether, enabling you to reduce on-premises storage costs and operational overhead.

To learn more and get started with protecting your on-premises data, check out the following links: