AWS Storage Blog

Data migration and cost saving at scale with Amazon S3 File Gateway

Migrating data to the cloud requires experience with different data types and the ability to preserve the source data structure and metadata attributes. Customers often have on-premises file data stored on traditional file servers, retaining the original creation timestamps for varying reasons, including data lifecycle management. Customers find it challenging to identify a migration path to the cloud that preserves both the source data structure and its metadata while supporting a hybrid implementation. This ultimately prevents customers from achieving the full breadth of benefits of cloud storage, including cost, performance, and scale.

Customers use Amazon S3 File Gateway to migrate on-premises data to the cloud and Amazon Simple Storage Service (Amazon S3) to store it. Once data is in Amazon S3, customers use S3 Lifecycle policies as a scalable way to automate object tiering across different Amazon S3 storage classes based on the object's 'Creation Date', which helps optimize storage costs at scale. However, when a customer uploads files to Amazon S3 using S3 File Gateway, each object receives a new 'Create Date' in S3 at upload time. With the new 'Create Date' in S3, the object's age in Amazon S3 resets, which prevents a Lifecycle policy from using the original 'Create Date' from the source server to tier objects into a lower-cost storage class. This presents a challenge for customers that have many older files on-premises and are using S3 File Gateway to migrate and upload them to S3 storage.
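For reference, the kind of age-based tiering customers typically configure looks like the following minimal sketch (the bucket name and day thresholds are placeholders for illustration). As noted above, such a rule keys off the object's creation date in S3, not the original timestamp on the source file, which is why this blog uses a Lambda-based approach instead.

import boto3

s3 = boto3.client('s3')

# Illustrative S3 Lifecycle rule: tier objects by age since their S3 creation date.
# Bucket name and day thresholds are placeholders for this example.
s3.put_bucket_lifecycle_configuration(
    Bucket='demo-storagegateway',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'tier-older-objects',
                'Filter': {'Prefix': ''},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 90, 'StorageClass': 'GLACIER_IR'},
                    {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'},
                ],
            }
        ]
    },
)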

In this blog, we walk through how AWS services help you migrate data to the cloud while keeping metadata attributes intact, optimizing storage costs, and providing access to the data in the cloud from on-premises applications using the standard SMB (Server Message Block) and NFS (Network File System) file protocols. We will guide you through how to preserve and use the original object/file metadata (the original date-modified attribute) to automatically place each file into the storage class of your choice.

Solution architecture

This architecture illustrates how we migrate and upload files with their original metadata from an on-premises file system to the cloud. As depicted, you copy the source data using Robocopy into the S3 File Gateway file share. The source files carry metadata fields such as 'Last Modified Date' and 'Create Date'; we refer to this source metadata as the original stat() time throughout this post. The solution then automates the data migration using S3 File Gateway and stores the data in the S3 bucket, in the storage class of your choice, based on the original stat() time.

For each object stored in an S3 bucket by Amazon S3 File Gateway, Amazon S3 maintains a set of user-defined object metadata that preserves the original stat() time of the source file. We use a Lambda function to read the original stat() time from the user-defined metadata fields and make API calls to move objects to the appropriate S3 storage class. In this example, the S3 Glacier Instant Retrieval (GIR) and S3 Glacier Deep Archive (GDA) storage classes are used for demonstration purposes.

This architecture diagram demonstrates two different use cases. The first is migrating your on-premises file server data to an Amazon S3 bucket using Amazon S3 File Gateway. The second is accessing the objects stored in Amazon S3 from your on-premises location through the file interface of Amazon S3 File Gateway.

Figure 1: Solution overview with Amazon S3 File Gateway and AWS Lambda

Services involved

The solution comprises the following main services. First, we provide context on each service; then we walk you through how these services combine into a holistic solution.

Robocopy

Robocopy allows you to copy your on-premises file metadata during the file copy process from a Microsoft Windows file server. There are a number of methods you can use to preserve file metadata for objects (files) being migrated to Amazon S3. We will show you how to run a Robocopy command that synchronizes file timestamp and permission metadata from the source to the S3 File Gateway file share.

Amazon S3 File Gateway

Amazon S3 File Gateway is an AWS service that enables migrating on-premises file data to Amazon Simple Storage Service (Amazon S3). Customers use Amazon S3 File Gateway to store and retrieve files over Server Message Block (SMB) and Network File System (NFS) as objects in Amazon S3 storage. Once objects are uploaded to S3 storage, customers can take advantage of the many features provided, including encryption, versioning, and lifecycle management.

AWS Lambda

AWS Lambda is a serverless, event-driven compute service that allows you to run code without provisioning or managing servers. In this solution, AWS Lambda automates moving objects to the appropriate Amazon S3 storage class by utilizing its deep integration with Amazon S3.

Solution walkthrough

The original data resides in on-premises NAS storage with SMB or NFS protocol access. For the purposes of this post, we focus on the solution workflow. We will walk you through:

  1. Use Robocopy to copy data and preserve the original stat() time.
  2. Configure Amazon S3 File Gateway and Amazon S3 storage.
  3. Use AWS Lambda to automate the S3 Lifecycle policy.

1. Use Robocopy to copy data and preserve the original stat() time

When copying data to the S3 File Gateway file share with the Robocopy tool, take care to use the appropriate metadata preservation options. This ensures the original file timestamps are preserved when data is copied from on-premises NAS systems to S3 File Gateway and, eventually, to Amazon S3. You must explicitly tell Robocopy to copy file metadata when executing the copy by using the appropriate command-line options. Refer to the S3 metadata documentation to learn more.

Here is an example showing how to run Robocopy to preserve source file metadata, which S3 File Gateway ultimately stores as user-defined metadata on the S3 object.

C:\> robocopy <source> <destination> /TIMFIX /E /SECFIX

Where,

  • /TIMFIX fixes (synchronizes) file timestamps on the destination
  • /E copies subfolders, including empty folders
  • /SECFIX copies the NTFS permissions
  • /V (optional) enables verbose logging

Here is an example of how to copy only the permissions and ACL information:

C:\> robocopy <source> <destination> /E /SEC /SECFIX /COPY:ATSOU

Robocopy has a preview or dry run (/L) option you can use to see what updates it would make before performing the operation. It is worth noting that Robocopy performance depends on the number of files in each prefix, the speed of the client workstation's disk, and whether Amazon S3 versioning is enabled and delete markers are present. Please see the Robocopy documentation for details.
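For example, you could preview the same copy shown earlier by appending the /L flag before running it for real (same placeholder source and destination paths as above):

C:\> robocopy <source> <destination> /TIMFIX /E /SECFIX /L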

2. Configure Amazon S3 File Gateway and Amazon S3 storage

You can install Amazon S3 File Gateway as a virtual machine in your on-premises environment (VMware, Microsoft Hyper-V, or Linux KVM) or run it as a dedicated hardware appliance. When Robocopy copies files from the source server to the file share on S3 File Gateway, the gateway uploads the files into S3 storage as objects in AWS. The file gateway serves as a local cache for the files that users access frequently; files that are accessed infrequently are retrieved from the Amazon S3 bucket through the file gateway.
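If you prefer to script the file share creation rather than use the console, the Storage Gateway API can create an SMB file share backed by your bucket. Below is a minimal sketch using boto3; the gateway ARN, IAM role ARN, and bucket ARN are placeholders you would replace with your own values.

import boto3
import uuid

sgw = boto3.client('storagegateway')

# Create an SMB file share on an existing S3 File Gateway.
# All ARNs below are placeholders for this example.
response = sgw.create_smb_file_share(
    ClientToken=str(uuid.uuid4()),  # idempotency token
    GatewayARN='arn:aws:storagegateway:us-east-1:111122223333:gateway/sgw-EXAMPLE',
    Role='arn:aws:iam::111122223333:role/StorageGatewayBucketAccessRole',
    LocationARN='arn:aws:s3:::demo-storagegateway',
    DefaultStorageClass='S3_STANDARD',   # initial storage class for uploaded objects
    Authentication='GuestAccess',        # or 'ActiveDirectory'
)
print(response['FileShareARN'])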

When files are copied into a file share, the data is ingested into Amazon S3 as objects. The original metadata of those files is preserved in the 'user metadata' of the S3 objects. While every Amazon S3 object receives a new creation time as recorded by S3, the field of interest for us is the original 'mtime' value stored in the user metadata.
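To confirm that the original timestamps made it into the object's user metadata, you can inspect an uploaded object with a HEAD request. A minimal sketch follows; the bucket name and object key are placeholders.

import boto3

s3 = boto3.client('s3')

# Inspect the user-defined metadata S3 File Gateway attached to an uploaded object.
# Bucket name and object key are placeholders for this example.
response = s3.head_object(Bucket='demo-storagegateway', Key='projects/report-2015.docx')

# The gateway preserves the source file attributes here; 'file-mtime' is the
# original modification time that the Lambda function in step 3 relies on.
print(response['Metadata'])
print(response['Metadata'].get('file-mtime'))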

After the initial migration of data into S3 storage, you can implement many S3 features to further optimize cost and improve the security posture of your data. For a deeper dive into the capabilities of S3 File Gateway and Amazon S3, please refer to their documentation.

3. Use AWS Lambda to automate S3 Lifecycle policy

With Amazon S3 File Gateway, every object written to Amazon S3 arrives through a PUT request, which triggers the AWS Lambda function. The Lambda function reads the original file timestamps from the 'user metadata' of the S3 object. It evaluates the 'mtime' value (stored in Unix time format) against the defined rules to determine whether the object qualifies to be tiered down to a different Amazon S3 storage class, and if it does, it moves the object accordingly. This is where you define your own thresholds, analogous to an S3 Lifecycle policy, for tiering objects into different storage classes.
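Wiring up the trigger is a one-time configuration on the bucket: an S3 event notification for object-creation events that invokes the function. A minimal sketch with boto3 follows; the function ARN is a placeholder, and the function's resource policy must already allow S3 to invoke it.

import boto3

s3 = boto3.client('s3')

# Invoke the tiering Lambda function whenever an object is created (PUT) in the bucket.
# The Lambda ARN below is a placeholder; S3 must also be granted lambda:InvokeFunction
# on the function (for example via lambda add-permission) before this notification works.
s3.put_bucket_notification_configuration(
    Bucket='demo-storagegateway',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111122223333:function:tier-by-file-mtime',
                'Events': ['s3:ObjectCreated:Put'],
            }
        ]
    },
)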

Please see the sample code below that you can use to build your solution. Modify it as needed to meet your requirements.

import boto3
import datetime
from urllib.parse import unquote_plus

bucket = 'demo-storagegateway'
s3 = boto3.client('s3')


def lambda_handler(event, context):
    # Key of the object that S3 File Gateway just uploaded (from the S3 PUT event)
    sourcekey = unquote_plus(event['Records'][0]['s3']['object']['key'])
    print("sourcekey", sourcekey)

    # Read the user-defined metadata that S3 File Gateway stored with the object
    response = s3.head_object(Bucket=bucket, Key=sourcekey)

    # 'file-mtime' holds the source file's original modification time as Unix
    # epoch seconds with a trailing 'ns' suffix, for example '1420070400.000000000ns'
    metadatareturn = response['Metadata']['file-mtime']
    original_mtime = datetime.datetime.fromtimestamp(
        float(metadatareturn.rstrip('ns')), tz=datetime.timezone.utc)
    print("original mtime", original_mtime)

    # Tier the object down if the source file is older than the threshold
    # (365 days here; adjust to match your own tiering rules)
    age = datetime.datetime.now(datetime.timezone.utc) - original_mtime
    if age > datetime.timedelta(days=365):
        file_key_name = sourcekey
        copy_source_object = {'Bucket': bucket, 'Key': file_key_name}
        print("copy_source_object", copy_source_object)
        # Copying the object onto itself with a new StorageClass moves it to the
        # target tier (for example GLACIER, GLACIER_IR, or DEEP_ARCHIVE)
        Testcopy = s3.copy_object(
            CopySource=copy_source_object,
            Bucket=bucket,
            Key=file_key_name,
            StorageClass='GLACIER',
            MetadataDirective='COPY')
        print("Testcopy", Testcopy)
Figure 2: Sample AWS Lambda function code
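Note that the Lambda function's execution role needs permission to read the object metadata and rewrite the object (for example, s3:GetObject and s3:PutObject on the bucket), since the copy operation reads the source object and writes the new copy. Grant these permissions on the bucket before testing the function.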

Summary

In this blog, we covered how to retain the original timestamps of files during the initial migration phase of moving data from your on-premises file server to Amazon S3 through S3 File Gateway. We also discussed how to utilize AWS Lambda's deep integration with Amazon S3 to move your files to lower-cost storage based on the dates defined in your code. This approach ensures that no data is left behind and that you can migrate and preserve all of your metadata in a cost-effective manner. This solution serves not only as a data migration path, but also opens the door for your journey to the cloud.

By leveraging AWS Storage Gateway, Amazon S3, and AWS Lambda, you can migrate your data into durable, scalable, and highly available cloud storage. You can take advantage of hybrid cloud storage by accessing your data from on-premises without making any changes to your existing environment. This enables you to implement cloud-native capabilities to simplify data management, optimize the value of your data, and realize cost savings at scale.

If you have any comments or questions, don’t hesitate to leave them in the comments section.

Vikas Kakkar

Vikas Kakkar is a Senior Storage Solution Architect covering AWS Storage services. He has 24 years’ experience in Infrastructure Solution Architecture across various industries like Telecom, Banking, Government, Defense, and Manufacturing. He is an avid tennis fan and enjoys playing the game quite often.

Ananta Khanal

Ananta Khanal is a Solutions Architect focused on Cloud storage solutions at AWS. He has worked in IT for over 15 years, and held various roles in different companies. He is passionate about cloud technology, infrastructure management, IT strategy, and data management.