Migrating and managing large datasets on Amazon S3 (Part 1)

UPDATE: The second post in this two-part series was published on October 29th, 2020.

This is the first of a two-post series intended for customers migrating and managing large datasets on Amazon S3. This post addresses moving your data to S3 and choosing the optimal storage class for your data. The second post addresses distributing, securing, protecting, and monitoring your data on S3.

I work with a number of public sector customers with large geo-spatial datasets, public record archives, educational materials, and scientific datasets that reside on Amazon S3. While some of the governance and accessibility requirements these customers face may be unique to the public sector, the best practices for managing large datasets on S3 apply to many customers.

This post discusses options and considerations for migrating large datasets to Amazon S3 and what you should be aware of when selecting a storage class for your data. I also cover how you can distribute your archival, scientific, or otherwise large dataset to your internal and external users.

Account structure strategy

Before we dive deep into the technical and operational areas covered in this post, I want to raise one top-level governance topic: account structure for large datasets. It is common for organizations to segment workloads into multiple AWS accounts depending on workload and use case. It is important to think this through for the AWS account holding your bucket with a large dataset. Consider a future possibility that another organization will become the custodian of your data, perhaps another government agency or national archival organization. In that case, creating an account where the bucket holding the dataset is the only critical asset makes transferring account ownership an easier task.

Migrating large datasets to Amazon S3

There is a range of options to choose from in order to move your data into S3. The most appropriate option depends on your specific use case. Factors to consider include:

Volume of data
Available bandwidth
Read throughput from source storage system
Schedule
Cost
Accessibility options to the source data system – available bandwidth, protocols

Network-based transfers

If you have high throughput network connectivity to AWS – via the internet, a government Trusted Internet Connection (TIC), a research network peered with AWS, or your own AWS Direct Connect – you should consider using AWS DataSync or AWS Transfer Family.

AWS DataSync

AWS DataSync is a managed file transfer service that can transfer data from on-premises sources to AWS up to 10 times faster than open-source tools. Take a look at this blog post on how to schedule a secure data transfer with AWS DataSync.

AWS Transfer Family

Many customers have existing workflows and data pipelines that require file transfer protocol (FTP)-based file transfer. You can send data to S3 via the fully managed AWS Transfer Family. This service presents an SFTP, FTPS, or FTP front end for S3 buckets. I’d first look to DataSync for a bulk migration use case, but if the source file system does not allow for DataSync, Transfer Family may be an option. There are both hourly and throughput charges for this service, so you should compare its cost with other transfer options.

In cases where DataSync or FTP are not available options, I’ve had customers pull data from HTTP source systems via Lambda functions (invoked from messages via an SQS queue, populated by a file list). You will want to implement a way to appropriately rate-limit this method so as not to overwhelm the source storage system and network capacity. This solution is more advanced and will require some coding.

AWS Command Line Interface

You can also copy or synchronize data from on-premises to AWS manually with the AWS CLI, but doing so requires a fair bit of system tuning. To match the throughput of AWS DataSync, you’ll have to test the number of concurrent S3 requests, the part size, possibly the TCP window size, and likely several parallel invocations of the AWS CLI. DataSync takes care of these configurations automatically. If you do want to use the AWS CLI, look at the documentation for configurable parameters for S3. Many of the AWS SDKs also use these same parameters. Please note that the AWS CLI includes two namespaces for S3: s3 and s3api. The s3 namespace includes the common commands for interfacing with S3, while s3api includes a more advanced set of commands. Some of the syntax differs between these two namespaces.
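As a rough illustration of the tuning knobs mentioned above, the sketch below uses the AWS SDK for Python (boto3), whose `TransferConfig` exposes concurrency and part-size settings comparable to the AWS CLI's S3 configuration. The function names and the specific values (64 MiB parts, 20 threads) are my own assumptions for illustration. Treat them as starting points to test against your own network, not as recommendations.

```python
MIB = 1024 * 1024


def transfer_settings(part_size_mib=64, concurrency=20):
    """Return keyword arguments for boto3's TransferConfig.

    The defaults are illustrative starting points, not tuned values.
    """
    return {
        "multipart_threshold": part_size_mib * MIB,  # use multipart above this size
        "multipart_chunksize": part_size_mib * MIB,  # size of each uploaded part
        "max_concurrency": concurrency,              # parallel threads per transfer
        "use_threads": True,
    }


def upload_file(path, bucket, key, settings):
    """Upload one local file (requires boto3 and configured AWS credentials)."""
    # Imported lazily so the settings helper works without boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key, Config=TransferConfig(**settings))


settings = transfer_settings(part_size_mib=64, concurrency=20)
print(settings)
```

A call such as `upload_file("scene-001.tif", "my-dataset-bucket", "imagery/scene-001.tif", settings)` would then perform the transfer; the file, bucket, and key names here are hypothetical.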

Staging storage and upload parts

Amazon S3 multipart upload allows you to upload a single object as a set of parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation. Multipart upload is automatically managed for you when using DataSync.

The AWS CLI uploads objects to the storage class you specify. Amazon S3 Standard is the default storage class. For example, you can copy an object from on-premises to S3 and specify the S3 Glacier Deep Archive storage class. This copies the file directly to S3 Glacier Deep Archive; however, there are some staging steps involved in that copy process. The S3 AWS CLI and most S3-compatible tools use S3 multipart upload. This means a large object of 500 MB would be uploaded in 63 parts with the default part size of 8 MB (62 full parts plus one final 4 MB part). These parts are stored in S3 staging storage until a complete object is uploaded and then committed to S3 Glacier Deep Archive or another storage class. If you are uploading a dataset and using the storage-class option to send that data directly to S3 Glacier Deep Archive, you’ll still have some S3 charges for the prorated amount of time those objects are stored in S3 staging storage.
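The part-count arithmetic is easy to check in a few lines of Python. This is a minimal sketch with helper names of my own choosing. It assumes the 8 MB default part size is 8 MiB, and it relies on the documented S3 limits of 10,000 parts per upload and a 5 MiB minimum part size (except for the last part).

```python
MIB = 1024 * 1024
MAX_PARTS = 10_000        # S3 limit on parts per multipart upload
MIN_PART_SIZE = 5 * MIB   # S3 minimum part size, except for the last part


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def part_count(object_size_bytes, part_size_bytes=8 * MIB):
    """Number of parts needed to upload an object at a given part size."""
    return max(1, ceil_div(object_size_bytes, part_size_bytes))


def min_part_size_bytes(object_size_bytes):
    """Smallest part size that keeps the upload within the 10,000-part limit."""
    return max(MIN_PART_SIZE, ceil_div(object_size_bytes, MAX_PARTS))


print(part_count(500 * MIB))         # 63 parts: 62 full parts plus one of 4 MiB
print(min_part_size_bytes(2 ** 40))  # part size needed for a 1 TiB object
```

The second helper matters for very large objects: at the 8 MiB default, 10,000 parts cover only about 78 GiB, so larger objects need a larger part size.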

Additionally, if you’ve uploaded a large volume of data, you may have had some interrupted transfers. Amazon S3 keeps uploaded parts from incomplete uploads until you delete them, or until an S3 Lifecycle policy that you have configured deletes them. These parts incur charges in your account, so it is a good idea to create a lifecycle policy to automatically remove them after a defined period of time. You can also view and abort multipart uploads with the AWS CLI.
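If you would rather script this cleanup than run it by hand, something like the following boto3 sketch could work. The function names and the seven-day cutoff are assumptions for illustration. The selection logic is kept separate from the AWS calls so it can be tested without credentials.

```python
from datetime import datetime, timedelta, timezone


def stale_uploads(uploads, max_age_days, now=None):
    """Select in-progress multipart uploads older than max_age_days.

    `uploads` is a list of dicts shaped like the "Uploads" entries returned
    by S3 ListMultipartUploads (keys: "Key", "UploadId", "Initiated").
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [u for u in uploads if u["Initiated"] < cutoff]


def abort_stale_uploads(bucket, max_age_days=7):
    """Abort stale uploads in a bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so stale_uploads works without boto3

    s3 = boto3.client("s3")
    aborted = []
    paginator = s3.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for upload in stale_uploads(page.get("Uploads", []), max_age_days):
            s3.abort_multipart_upload(
                Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
            )
            aborted.append(upload["Key"])
    return aborted
```

Aborting an upload discards its parts, so pick a cutoff comfortably longer than your slowest legitimate transfer.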

S3 Lifecycle policies

Amazon S3 has multiple storage classes. For the purpose of migrating data, you may first stage data in the S3 Standard storage class. DataSync allows you to specify a storage class directly. You can create an S3 Lifecycle policy to immediately transition your object to your desired storage class once all parts are successfully uploaded. S3 Lifecycle policies support filters and can also be used to purge data after a defined time period.
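A lifecycle configuration along these lines might look like the sketch below, which builds the rule set as a plain dictionary and hands it to boto3's `put_bucket_lifecycle_configuration`. The rule IDs, the `raw/` prefix, and the day counts are hypothetical. I'm also assuming that a 0-day transition is acceptable for your target class; some classes, such as S3 Standard-IA, require objects to be at least 30 days old before a lifecycle transition.

```python
def lifecycle_configuration(prefix="", transition_days=0,
                            storage_class="DEEP_ARCHIVE", abort_days=7):
    """Build an S3 Lifecycle configuration with two rules.

    Rule 1 transitions objects under `prefix` to `storage_class`.
    Rule 2 removes parts left behind by incomplete multipart uploads.
    """
    return {
        "Rules": [
            {
                "ID": "transition-after-upload",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": transition_days, "StorageClass": storage_class}
                ],
            },
            {
                "ID": "abort-incomplete-multipart-uploads",  # hypothetical
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: whole bucket
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_days
                },
            },
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Apply the configuration (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


config = lifecycle_configuration(prefix="raw/")
print(config["Rules"][0]["Transitions"])
```

Note that this API call replaces the bucket's existing lifecycle configuration, so include any rules you want to keep.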

Network considerations

If you have petabyte-scale datasets to move to Amazon S3, you probably also have high-throughput network capability. The S3 endpoint (used by the AWS CLI and other tools that use S3 APIs) and the DataSync endpoint both accept encrypted traffic over HTTPS. This encrypts your data in transit but also subjects the data to overhead from the TCP protocol. Even if you have a 1 Gbps or a 10 Gbps AWS Direct Connect connection, latency impacts overall throughput. This is due to the bandwidth-delay product: a TCP connection can have at most one window of data in flight per round trip, so the greater the latency, the lower the achievable throughput per connection, regardless of the provisioned size of the link.
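To make the bandwidth-delay relationship concrete, here is a small sketch of the two standard formulas: the throughput ceiling of a single TCP connection (window size divided by round-trip time) and the amount of data that must be in flight to fill a link. The function names and the example figures (a 64 KiB window and a 50 ms round trip) are assumptions for illustration.

```python
def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on one TCP connection's throughput, in bits per second."""
    return window_bytes * 8 / rtt_seconds


def bandwidth_delay_product_bytes(link_bps, rtt_seconds):
    """Bytes that must be in flight to keep a link of this speed full."""
    return link_bps * rtt_seconds / 8


# One connection with a 64 KiB window over a 50 ms round trip.
single_stream = max_tcp_throughput_bps(64 * 1024, 0.050)
print(f"{single_stream / 1e6:.1f} Mbps per connection")

# Data in flight needed to saturate a 10 Gbps link at the same latency.
in_flight = bandwidth_delay_product_bytes(10e9, 0.050)
print(f"{in_flight / 1e6:.1f} MB in flight")
```

With these inputs a single connection tops out at roughly 10.5 Mbps, while filling the 10 Gbps link needs about 62.5 MB in flight. This is why larger TCP windows and many parallel connections matter on high-latency paths.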

If your applications are already integrated with the Amazon S3 API, and you have high latency to your regional S3 endpoint, you could experience significant throughput increases by using S3 Transfer Acceleration. S3 Transfer Acceleration leverages Amazon CloudFront’s global points of presence to shorten the network path between your clients and S3. This minimizes latency for connections and can accelerate upload times for large datasets. Pricing for S3 Transfer Acceleration may be found on the S3 Transfer Acceleration pricing page.

If you don’t want to use the public Amazon S3 endpoint and want to keep the communications from traversing the public network, you could use DataSync with a private VPC interface endpoint. This endpoint is an IP address within your VPC, connected to your source network via VPN or AWS Direct Connect. You could also consider a web proxy server running on an EC2 instance within a VPC. This proxy instance can access S3 via an S3 gateway endpoint in the VPC. There is no networking charge for traffic passing through an S3 gateway endpoint.

Another critical factor to consider is available network bandwidth. Is that bandwidth (to the internet, or to AWS) shared with other workloads or applications in the organization? A large-volume data transfer has the potential to consume all available bandwidth, negatively impacting other users. It can also take some time. There are some simple online bandwidth calculators to help you gauge the amount of time to move a certain volume of data over a defined provisioned network capacity (assuming you are indeed able to saturate that network capacity). For example, 1 TB takes approximately 2.5 hours over a 1-Gbps connection, whereas 1 PB takes over 10 days on a 10-Gbps connection – and that’s if you have favorable network conditions.
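If you don't have a calculator handy, the estimate is a one-line formula. The sketch below assumes binary units (1 TB taken as 2^40 bytes, 1 PB as 2^50 bytes) and a fully saturated link by default. The `utilization` parameter is my own addition for modeling a link you can only partly fill.

```python
TIB = 2 ** 40
PIB = 2 ** 50


def transfer_seconds(data_bytes, link_bps, utilization=1.0):
    """Seconds to move data_bytes over a link, at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return data_bytes * 8 / (link_bps * utilization)


hours_1tib = transfer_seconds(TIB, 1e9) / 3600
days_1pib = transfer_seconds(PIB, 10e9) / 86400
days_1pib_shared = transfer_seconds(PIB, 10e9, utilization=0.8) / 86400

print(f"1 TiB at 1 Gbps: {hours_1tib:.1f} hours")
print(f"1 PiB at 10 Gbps: {days_1pib:.1f} days")
print(f"1 PiB at 10 Gbps, 80% utilization: {days_1pib_shared:.1f} days")
```

Under these assumptions the results are about 2.4 hours and about 10.4 days; with decimal units they drop to roughly 2.2 hours and 9.3 days. At 80 percent utilization, the petabyte transfer stretches to about 13 days.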

Offline migration of data to Amazon S3 – AWS Snow Family

The AWS Snow Family supports offline migration of data at scale to AWS.

AWS Snowball Edge devices are ruggedized compute and storage devices that plug in to your on-premises network. The devices have network options ranging from 10 Gbps to 100 Gbps depending on the Snowball Edge option you select. For customers with requirements for security accreditations including FedRAMP Moderate/High, ITAR, HIPAA, and others, you can confirm Snowball Edge’s compliance status here. You can also review Snowball Edge’s security capabilities in the official documentation.

For smaller data migration requirements, AWS Snowcone supports up to 8 TB of usable capacity per device. AWS Snowcone includes support for online transfers with the DataSync agent pre-installed on the device.

You define your data import or export job in the AWS Snow Family console and request an AWS Snow Family device, which is then shipped to you. This request should originate from the same Region where your Amazon S3 bucket exists. You’ll need to define your destination S3 buckets during the job provisioning process, since jobs can’t be changed once defined. If you need the data in a different destination S3 bucket than the one you defined at provisioning, you can move the data to a new bucket once imported.

If you have large volumes of data to move, and/or relatively limited available bandwidth, you should consider AWS Snowball Edge.

Choosing a storage class

Amazon S3 offers six storage classes that support different data access levels at corresponding rates. It is important to understand your application requirements against the pricing dimensions when selecting a storage class. The S3 storage classes include:

S3 Standard
S3 Intelligent-Tiering
S3 Standard-Infrequent Access (S3 Standard-IA)
S3 One Zone-Infrequent Access (S3 One Zone-IA)
Amazon S3 Glacier
S3 Glacier Deep Archive

Selecting the correct storage class for your large dataset depends on your use case. Will the data in your dataset be accessed frequently? Is your dataset archival data – critical to save in a durable storage system, but accessed infrequently? How infrequently? If archival data, what is your restore time requirement?

If you have large archival datasets, where a wait of up to 12 hours is acceptable, and where an individual object will likely not be accessed more than once every six months, S3 Glacier Deep Archive is an appropriate storage class as it provides the lowest-cost cloud storage. S3 Glacier Deep Archive has higher object retrieval costs, so you want to consider this tier for data accessed no more than once or twice per year.

One caveat to be aware of is object size. S3 Glacier Deep Archive has a higher charge for “puts” or writes to the storage system than other storage classes. Billions of small objects written to S3 Glacier Deep Archive cost significantly more than tens of thousands of larger files, even if the total amount stored is the same. If you have billions of smaller files (for example, individual email files for e-discovery, or thumbnail files), you’ll want to batch these small files into a gzip, zip, or other archive format before writing them to S3 Glacier Deep Archive. Be aware that S3 Glacier Deep Archive has a minimum storage duration of 180 days. Objects deleted early are charged a pro-rated early deletion fee.
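One way to batch small files before upload is Python's standard `tarfile` module. The sketch below is an assumption-laden illustration: the helper names, the batch size, and the archive naming scheme are all my own choices. It groups files into fixed-size batches and writes each batch to a gzip-compressed tar archive.

```python
import tarfile
import tempfile
from pathlib import Path


def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def bundle_files(base_dir, paths, archive_path):
    """Pack files into one .tar.gz, keeping paths relative to base_dir."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            archive.add(path, arcname=str(Path(path).relative_to(base_dir)))
    return archive_path


# Demo: five small files in a temporary directory, bundled two per archive.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "emails"
    base.mkdir()
    files = []
    for i in range(5):
        f = base / f"msg-{i}.eml"
        f.write_text(f"message {i}\n")
        files.append(f)

    archives = [
        bundle_files(base, batch, Path(tmp) / f"batch-{n:05d}.tar.gz")
        for n, batch in enumerate(batches(files, 2))
    ]
    with tarfile.open(archives[0], "r:gz") as archive:
        demo_names = archive.getnames()
    demo_archive_count = len(archives)

print(demo_archive_count, demo_names)
```

Because restoring one small file means retrieving its whole archive, it helps to keep an index of which archive holds which file and to size the batches with your expected restore pattern in mind.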

Understanding your data access patterns is important in selecting a storage class to optimize costs by matching the storage class with your workload’s data activity. S3 Standard is designed for frequently accessed data and has the lowest object access fee; however, it has the highest monthly cost per GB stored. S3 Standard-Infrequent Access (S3 Standard-IA) costs less per GB stored, and it is designed for data accessed infrequently.

To better understand data access patterns, consider using the S3 Storage Class Analysis. S3 Storage Class Analysis examines storage access patterns to help you decide when to transition the right data to the right storage class. After S3 Storage Class Analysis observes the infrequent access patterns of a filtered set of data over a period of time, you can use the analysis results to help you improve your lifecycle policies.

Sometimes, customers have large datasets that are mixed – a small number of objects are accessed frequently, while a large number of objects are seldom accessed. For example, geo-spatial imagery tiles of populated areas or interesting landmarks may be accessed at high rates, while tiles of the open ocean may never be accessed. For workloads with unknown or varying access patterns, AWS offers the S3 Intelligent-Tiering storage class. With this storage class, S3 looks at object access patterns and automatically moves objects to either frequent or infrequent access tiers based on learned access patterns. S3 Intelligent-Tiering makes storage cost optimization easy for you by automatically moving data to the most cost-effective tier without impacting performance or incurring operational overhead.

Wrap up

In this post, I discussed how customers with large datasets can migrate data to AWS via online and offline methods. I also discussed best practices for choosing a storage class based on cost and data access patterns.

Storing your datasets on S3 provides the benefit of an infinitely scalable, elastic, durable storage system with global reach. AWS offers multiple migration paths – online with services like DataSync or offline with the Snow Family of services to help you move your data. AWS also offers a range of options to serve your data to your stakeholders. Hopefully you’ve found this discussion of these options useful.

Thanks for reading this blog post! If you have any comments or questions, please don’t hesitate to submit them in the comments section.

All the best in your storage management journey.

AWS Storage Blog