Migrating and managing large datasets on Amazon S3 (Part 2)
This is the second of a two-post series intended for customers migrating and managing large datasets on Amazon Simple Storage Service (Amazon S3). The first post addresses moving your data to S3 and choosing the right storage class.
I work with public sector customers with large datasets – hundreds of TB to dozens of PB. Over the years, I have helped these customers get their data into the AWS Cloud, publish data stored there, and manage that data in the cloud for archival backup. I’ve helped customers with datasets that have immeasurable value – Earth observation data collected over time from spacecraft, other scientific datasets, and data comprising the collective history of nations. Many of my customers have the critical mission of collecting and preserving this data, while also making it broadly available for educational and R&D purposes.
Recently, I was helping a customer with a case of “missing” data in their S3 buckets. In their case, it turns out they ran a command on a production bucket that was intended to be run on a different development bucket. The logging options discussed in this post make it easier to determine what activity occurred; the permissions and data protection options help limit the blast radius of those sorts of operational issues. Fortunately for my customer, they had a redundant copy.
This blog series is intended to guide customers with similar use cases so you can easily and securely manage large datasets on AWS. This post addresses distributing, securing, protecting, and monitoring your data on Amazon S3. Some of the callouts in this post link to deeper technical resources on specific topics; take a look at those resources to dive deeper into each topic.
Distributing data from Amazon S3
To access data in Amazon S3, you must connect to the S3 service endpoint and be authorized to access the data. S3 can be accessed from the internet, via a VPC gateway endpoint for S3, or via a public virtual interface with AWS Direct Connect.
Public bucket access
When Amazon S3 resources are created, they are private by default and can only be accessed by the resource owner or account administrator. The owner or administrator can make them publicly accessible by taking that explicit action, which should only be done for datasets that you intend to make public globally. Once a bucket is public, anyone has unauthenticated access to it with the AWS CLI or other tools, and can bulk copy its data to their own account for their own purposes.
If you want to offload the access charges and require the requester to pay for object access and bandwidth, you can enable the Requester Pays option for your bucket. Anonymous access is not allowed when a bucket is set to Requester Pays, since the requester must be known in order to be charged. Therefore, recipients must access your bucket using their AWS account.
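As a minimal sketch, Requester Pays is enabled with the `aws s3api put-bucket-request-payment` command and a standard `RequestPaymentConfiguration` payload:

```json
{
  "Payer": "Requester"
}
```

Saved as `payer.json`, this could be applied with `aws s3api put-bucket-request-payment --bucket example-archive-bucket --request-payment-configuration file://payer.json` (the bucket name here is a placeholder).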
Web access to Amazon S3
You can also use an Amazon S3 bucket to enable public web access to its static content. Content is served from S3 via HTTP with a URL containing your bucket’s website endpoint, or via a custom URL for your S3 bucket. If you require SSL, you can place Amazon CloudFront in front of your S3 bucket and serve web content from it using an SSL certificate; AWS Certificate Manager (ACM) can issue the certificate, or you can import one into ACM from your own certificate authority.
Distributing data via CloudFront also enables advanced use cases with AWS Lambda@Edge. One Lambda@Edge use case for customers with large web-enabled datasets is a throttling mechanism that rate-limits heavy downloaders while still maintaining public accessibility for the dataset.
AWS Transfer Family – Simple and seamless file transfer to Amazon S3
You can use AWS Transfer Family to make data available from Amazon S3 via SFTP, FTPS, or FTP by using a custom hostname. You can integrate AWS Transfer Family with third-party directories, including Okta, Active Directory, and LDAP. You can customize paths for each user in real time with logical directories for S3.
AWS Storage Gateway
AWS Storage Gateway is a hybrid cloud storage service that provides on-premises applications access to virtually unlimited cloud storage through file, tape, and volume gateways. File Gateway provides a file interface into Amazon S3. You can store and retrieve objects in Amazon S3 using NFS and SMB.
You can deploy a single read/write File Gateway and/or multiple reader (read only) File Gateways attached to the same S3 bucket for data distribution at multiple locations. We recommend having only a single writer to a bucket when using File Gateway. If your bucket has multiple File Gateways in read-only mode, and data is being written to it from outside of File Gateway, you must refresh the cache to synchronize the File Gateway with the contents of the S3 bucket.
Another way to distribute data from your S3 bucket is by using presigned URLs. A presigned URL is a URL that contains the path to the object and a pre-signed authorization token. These can be generated with the AWS CLI or AWS SDK. Anybody with access to the signed URL can access the object during the allowed time period. Signed URLs can be valid for a period ranging from a few minutes to 7 days. Presigned URLs are linked to the role that generated the URL, so deleting the role that generated the presigned URL revokes the presigned URL.
Securing your data on Amazon S3
Some datasets have confidentiality requirements. Other datasets are intended to be open and shared broadly or globally. Some datasets are for long-term archive while others support mission-critical operations. Next, we’ll have a brief discussion of data confidentiality and integrity on Amazon S3.
S3 provides security capabilities to keep your data secure and accessible. Data is protected in transit to S3. You can enable encryption at the storage layer using your keys or AWS Key Management Service (AWS KMS) keys, and you can securely share your data with only the contacts you identify, with conditions you define.
Data in transit
Data transfers to and from S3, whether via the console UI, the AWS CLI, AWS DataSync, or another tool that uses the S3 SDKs, occur over HTTPS via the S3 endpoint in the bucket’s home Region. API calls (list bucket, copy object, put object) also occur via HTTPS. Protection of data in transit is inherent in the platform and managed by AWS. If you enable web access to your bucket (for example, to serve static content), you must use Amazon CloudFront for HTTPS access; otherwise your bucket distributes content via HTTP.
Access control is another component of Amazon S3 security and data confidentiality. Access control is enforced through three separate options – S3 bucket policies, S3 Access Points, and user policies via AWS IAM permissions. These policies control access to the following:
- Who – the security principal – user or role – allowed to perform a certain API action
- What – the specific API action allowed to be performed
- When – you can restrict time ranges for IAM user policies
- Where – you can control the location certain API commands can be executed from – objects can only be read from certain locations; delete commands only issued from certain IP addresses; access only permitted via an Amazon Virtual Private Cloud endpoint. You can also control access to Amazon S3 objects so they are only served if a request is routed through CloudFront.
- How – specifically, how the user was authenticated to perform that API action – policies can be implemented to require multi-factor authentication to execute a delete action
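As a hedged sketch, the “where” and “how” controls above can be expressed as conditions in a bucket policy. The bucket name, account ID, and IP range below are placeholders (the IP range is from the documentation block 203.0.113.0/24):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDeleteWithoutMFA",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::example-archive-bucket/*",
      "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
      }
    },
    {
      "Sid": "AllowReadFromCorporateRange",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-archive-bucket/*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "203.0.113.0/24"}
      }
    }
  ]
}
```

The first statement denies deletes unless the caller authenticated with MFA; the second allows reads only from a specific IP range.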
You can implement Amazon S3’s Block Public Access feature at the account level – this prevents users from enabling public access to S3 buckets. Permissions are required to remove this block. It is recommended to enable this at the account level, unless you have a specific reason for making objects in the bucket public. You can access the bucket from your own AWS account and share it with other specifically listed AWS accounts (enumerated by account ID) without granting public access.
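As a sketch, the account-level Block Public Access settings are a single `PublicAccessBlockConfiguration` payload, applied with `aws s3control put-public-access-block` (the account ID used with that command is a placeholder):

```json
{
  "BlockPublicAcls": true,
  "IgnorePublicAcls": true,
  "BlockPublicPolicy": true,
  "RestrictPublicBuckets": true
}
```

With all four settings enabled, neither ACLs nor bucket policies can open the account's buckets to the public.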
S3 Access Points are another way of controlling access to Amazon S3 objects. Access Points are unique Amazon Resource Names (ARNs) for accessing a single bucket. Each Access Point can have its own unique policy for accessing objects in the bucket – one Access Point could be created for uploading objects, while a second Access Point could have delete permissions, and a third could be used to share only list and read access with other AWS accounts.
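As an illustrative sketch, the policy below could be attached to a hypothetical read-only Access Point named `read-only-ap` to share read access with another AWS account; the Region, account IDs, and Access Point name are placeholders. Note that the underlying bucket policy must also delegate access control to its Access Points:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::444455556666:root"},
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:us-east-1:111122223333:accesspoint/read-only-ap/object/*"
    }
  ]
}
```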
You can enable S3 access logging and deliver these log files to a centralized log bucket. These logs can be useful to track access and determine which of your objects are most often accessed. Access logs also let you determine which user uploaded or deleted an object.
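A minimal sketch of the logging configuration, applied with `aws s3api put-bucket-logging`; the target bucket name and prefix are placeholders, and the target bucket must grant the S3 log delivery service permission to write to it:

```json
{
  "LoggingEnabled": {
    "TargetBucket": "example-log-bucket",
    "TargetPrefix": "s3-access-logs/"
  }
}
```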
You can store your objects in Amazon S3 encrypted or unencrypted. You can encrypt your objects using server-side encryption with Amazon S3-Managed Encryption Keys (SSE-S3). Alternatively, you can encrypt your data using server-side encryption with customer master keys (CMKs) stored in AWS KMS (SSE-KMS). You also have the option to provide and store your own encryption keys, providing them on object upload and download (SSE-C). Encryption takes effect on new objects uploaded to the bucket after encryption has been enabled. Of course, you can also encrypt your data in your client application before sending it to S3.
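As a sketch, default bucket encryption with SSE-KMS can be applied with `aws s3api put-bucket-encryption` and a `ServerSideEncryptionConfiguration` like the following; the KMS key ARN is a placeholder (for SSE-S3, set `SSEAlgorithm` to `AES256` and omit the key ID):

```json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/11111111-2222-3333-4444-555555555555"
      }
    }
  ]
}
```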
SSE-S3 uses an AES-256 block cipher to encrypt your objects. On the other hand, SSE-KMS gives you an additional layer of access control on the key material, but there are additional KMS costs on a per-request basis. Amazon S3 also supports SSE-C, but the onus is on the customer to maintain custody of their key or risk data loss.
There is no right or wrong answer for encryption – AWS supports a number of options suited for a variety of different data security requirements.
You can verify the integrity of data uploaded to AWS with the AWS CLI or a tool built with one of the AWS SDKs. This happens automatically with the AWS CLI, as explained in the FAQ here. In some cases, customers have already computed their own hash values and want to ensure uploaded objects match those hash values. You can upload your object and specify the SHA256 hash with the upload command; if the hash computed by Amazon S3 does not match, the upload fails.
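The base64-encoded SHA-256 digest that S3 compares against can be computed locally before upload, as in this sketch (the `--checksum-sha256` CLI option mentioned in the comment is an assumption that applies to newer AWS CLI versions supporting additional checksums; check your CLI version):

```python
import base64
import hashlib

def sha256_b64(path: str) -> str:
    """Compute the base64-encoded SHA-256 digest of a file, reading it in
    1 MiB chunks so large objects do not have to fit in memory. This is the
    format expected by, for example, `aws s3api put-object --checksum-sha256`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return base64.b64encode(h.digest()).decode("ascii")
```

If the digest you supply at upload time does not match what S3 computes, the upload is rejected, so a silent corruption in transit cannot land in your bucket.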
If you have a large archival dataset, you may also have requirements for ongoing fixity checking. The AWS Solutions Builder team has developed a serverless fixity checking solution that works with the Amazon S3 storage classes, including Amazon S3 Glacier and S3 Glacier Deep Archive. You can review the solution here and the more detailed deployment guide here. There are also some pricing examples for situations where you must retrieve data from Amazon S3 Glacier for fixity checking.
Data protection with Amazon S3
Amazon S3 is highly durable, offering 99.999999999% (11 9’s) of durability (excluding Reduced Redundancy Storage) across the various storage classes. Multiple copies of your data are stored on multiple storage devices, and in most cases (excluding S3 One Zone-IA), across multiple Availability Zones within an AWS Region. However, 11 9’s of durability does not protect your data from accidental deletion by an authorized user or process. Nor will single-Region durability meet the regulatory requirements some organizations have for multiple copies of data with hundreds or thousands of miles of geographic separation.
Amazon S3 has four features that customers can leverage in their data protection strategy.
S3 Versioning
Versioning should be considered for your data protection strategy. Versioning preserves the existing object when a new object is written with the same key name, and it preserves objects when they are deleted. You pay the storage cost for all stored versions of your objects, so versioning can increase costs. In return, versioning enables you to roll back to previous versions or restore deleted versions of objects. If an object is deleted from a bucket without versioning enabled, that object cannot be restored.
Versioning is not enabled on a bucket by default, and once enabled, cannot be removed, though it can be suspended. Consult vendors of third-party enterprise software applications that natively integrate with Amazon S3 before enabling versioning on buckets managed by these applications. If these applications frequently update files with the same file name, you could end up with a large number of unintended versions and increase your storage bill.
You can also enable multi-factor delete for versioned buckets. This requires presentation of a token generated by a multi-factor authentication (MFA) device before allowing deletion of versioned objects.
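As a sketch, both versioning and MFA delete are set through the `aws s3api put-bucket-versioning` call with a `VersioningConfiguration` like this (the call must also include the `--mfa` option carrying your MFA device serial and a current code, and enabling MFA delete requires the bucket owner's root credentials):

```json
{
  "Status": "Enabled",
  "MFADelete": "Enabled"
}
```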
S3 Object Lock
S3 Object Lock lets you store objects in a bucket using a write-once-read-many (WORM) model: once written, objects cannot be overwritten or deleted for the duration of a retention period. This feature is designed for organizations with high certification requirements in highly regulated industries. Objects can be removed at the end of the defined retention period.
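As a sketch, a default retention rule can be applied with `aws s3api put-object-lock-configuration`; the seven-year compliance-mode retention below is an illustrative assumption, and note that Object Lock can only be enabled on buckets created with it turned on:

```json
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {
      "Mode": "COMPLIANCE",
      "Years": 7
    }
  }
}
```

In `COMPLIANCE` mode no user, including the root user, can delete a locked object version before its retention date; `GOVERNANCE` mode allows specially privileged users to override the lock.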
Amazon S3 Replication
Some customers with large datasets employ replication strategies for redundant copies of their data. Amazon S3 supports multiple replication options. S3 Replication can be cross-Region or same-Region, from and to different storage classes, and from and to different AWS accounts.
Organizations with regulations requiring significant geographic separation (hundreds of miles) often need their data in multiple AWS Regions, and they can leverage S3 Cross-Region Replication (S3 CRR) to meet their requirements. S3 CRR is asynchronous replication of data from an S3 bucket in one Region to an S3 bucket in another Region. There are costs for cross-Region data transfer in addition to the S3 storage and access charges. S3 Replication Time Control is available if you have compliance or SLA requirements for your replicated data.
Same-Region Replication (SRR) can replicate a copy of your data to a bucket in the same AWS Region.
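A minimal replication sketch, applied with `aws s3api put-bucket-replication` on the source bucket; the IAM role and destination bucket ARNs are placeholders, and both source and destination buckets must have versioning enabled:

```json
{
  "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-to-archive",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {
        "Bucket": "arn:aws:s3:::example-archive-bucket-us-west-2",
        "StorageClass": "DEEP_ARCHIVE"
      }
    }
  ]
}
```

Here new objects are replicated into a destination bucket in another Region and land directly in S3 Glacier Deep Archive, which matches the archival-copy strategy described below.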
One strategy you can consider is storing your “hot” data in one AWS account in one Region, and an archival copy of your data in a separate AWS account. The archival data can also be stored in a less expensive storage class like S3 Glacier Deep Archive (though please recall the cost concern on billions of very small files, as mentioned in the first post of this series). This not only meets regulatory requirements for significant geographic separation, but also presents an additional layer of security in the event of account compromise or accidental administrative deletion.
Monitoring, logging and reporting options for data in Amazon S3
Customers with large datasets on Amazon S3 often pose several questions regarding their data. They want to know what they have stored on S3, how much storage space their data is consuming, how frequently they access specific data, and what administrative actions have occurred on their S3 configuration. This information is necessary when forensically investigating data spillage, unintended access, or accidental deletions. The table below shows some of these monitoring and logging options:
| Option | Description |
| --- | --- |
| S3 Inventory Reports | Inventory file in CSV format of objects in your bucket. Especially useful if you have millions or billions of objects. Can be scheduled to run periodically. |
| S3 Access Logging | Not enabled by default. Logs of who accessed what, when. Useful to determine usage or for security and forensic review. |
| S3 Data Events in AWS CloudTrail | Log of API calls, including puts/gets/deletes. Useful for forensic and debugging purposes. |
| Amazon CloudWatch Storage Metrics | Daily report of object access and stored objects. Jeff Barr has an extensive blog post on this here. |
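As a sketch, an inventory report can be scheduled with `aws s3api put-bucket-inventory-configuration`; the configuration ID and destination bucket ARN below are placeholders:

```json
{
  "Id": "weekly-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "All",
  "Schedule": {"Frequency": "Weekly"},
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::example-inventory-bucket",
      "Format": "CSV"
    }
  }
}
```

For buckets with billions of objects, a scheduled inventory is far cheaper and faster than repeatedly listing the bucket via the API.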
In this post, I discussed ways to secure, protect, and monitor your data on Amazon S3. I covered options for distributing data from S3, securing your data on S3, and monitoring and logging access to your data. With these tools at your disposal, you can secure your data in the event of error or other unexpected loss, while making it simple to control permissions and access.
Thanks for reading to the end of this two-part blog series! If you have large datasets, for instance of the earth observation, scientific, or national public archive variety, you have unique challenges. Whether it be the volume of the data, or the time and effort to copy that data to AWS, managing your data requires a methodical approach to evaluating your Amazon S3 configuration options. The entire idea for this series stemmed from my efforts to help a customer recover from an operational issue, and a desire to help other customers be more fully aware of technical options to hopefully avoid similar operational issues – or speed their recovery if they do occur. Hopefully, the insights in this post enable you to fully assess your data storage options on Amazon S3.
If you have any comments or questions about this blog post, please submit them in the comments section.
All the best in your storage management journey.