AWS Storage Blog

Transferring data from Google Cloud Filestore to Amazon EFS using AWS DataSync

Organizations may need to transfer large numbers of files from one cloud provider to another for a variety of reasons, such as workload migration, disaster recovery, or a requirement to process data in other clouds. Data transfers typically require end-to-end encryption, the ability to detect changes, object validation, network throttling, monitoring, and cost optimization. Building a multicloud data transfer solution can be time consuming, expensive, and difficult to scale. Furthermore, data transfers between cloud providers have limitations, such as the inability to physically access the storage devices in another provider’s data center.

For organizations looking to transfer files from Google Cloud Filestore, a managed NFS file service in Google Cloud, to Amazon Elastic File System (Amazon EFS), a serverless, fully managed, elastic file storage service, AWS DataSync can be a great help. AWS DataSync is an online data movement and discovery service that simplifies and accelerates data migration to and from AWS.

In this blog, I walk through how to transfer files from Google Cloud Filestore to Amazon EFS using AWS DataSync. I start with an overview of AWS DataSync, highlighting the relevant components of the service. Then, I show how to configure DataSync so you can transfer data from Google Cloud Filestore to Amazon EFS. The process outlined in this post offers a seamless file migration solution that can get you up and running in AWS in no time, with minimal overhead, advanced security, and granular migration monitoring.

AWS DataSync overview

AWS DataSync is a data movement service that simplifies, automates, and accelerates moving data between on-premises, edge, or other cloud storage and AWS Storage services, as well as between AWS Storage services. AWS DataSync features include data encryption, validation, monitoring, task scheduling, auditing, and scaling.

AWS DataSync provides a custom image you can use to launch a DataSync agent as a Google Cloud Platform (GCP) Compute Engine virtual machine. The agent acts as a client to connect to Google Cloud Filestore and coordinates the data transfer to AWS with the AWS DataSync service.

DataSync components and terminology

DataSync has four components for data movement: task, locations, agent, and task execution. Figure 1 shows the relationship between the components and the attributes that I will use for this tutorial.

DataSync has four components for data movement: task, locations, agent, and task execution.

Figure 1: AWS DataSync four primary components: task, locations, agent, and task execution

  • Agent: A virtual machine (VM) that reads data from, or writes data to, a self-managed location. In this tutorial, the DataSync agent is deployed as a GCP Compute Engine virtual machine instance using the custom image that AWS DataSync provides.
  • Location: The source and destination location for the data transfer. In this tutorial, the source location is Google Cloud Filestore. The destination location is an Amazon EFS File System.
  • Task: A task consists of one source and one destination location with a configuration that defines how data is moved. A task always moves data from the source to the destination. Configuration can include options such as include/exclude patterns, task schedule, bandwidth limits, and more.
  • Task execution: This is an individual run of a task, which includes information such as start time, end time, bytes written, and status.

How to transfer data from Google Cloud Filestore to Amazon EFS with DataSync

In Figure 2, I laid out the basic architecture for DataSync using a single DataSync agent. The source files that will be transferred reside in Google Cloud Filestore. The destination is Amazon EFS.

Architectural diagram illustrates a single AWS DataSync agent’s cross-cloud connectivity between Google Cloud Filestore and Amazon EFS using the AWS DataSync service endpoint.

Figure 2: Single AWS DataSync agent connecting to DataSync public endpoint using TLS 1.3.

In this architecture, the agent is deployed as a virtual machine (VM) in Google Cloud Compute Engine. In this tutorial, the agent is activated with AWS DataSync using a public endpoint, and the data transfer from the agent to the DataSync service is encrypted with TLS 1.3.

Tutorial prerequisites

You should have the following prerequisites:

  • An AWS account
  • Google Cloud Command Line Interface (gcloud CLI)
  • Google Cloud Platform Filestore with source files to transfer

Source overview: Google Cloud Filestore

I have a Google Cloud Filestore instance named demo-customer-filestore. For this example, my Filestore is in the us-central1 Region within a VPC network named gcp-vpc-filestore. The Filestore instance is accessed with a private IP address that is automatically allocated from the VPC network. I have configured a file share named myfileshare. You can create a Filestore instance and add files on the mounted file share to follow along.

Google Cloud Filestore configuration for creating an instance with VPC network and file share

Figure 3: Google Cloud Filestore configuration

Once the Filestore instance is created, take note of the IP address and file share name for your instance, as this information will be used to configure a DataSync location.

Google Cloud Filestore instance with file share name and IP address

Figure 4: Google Cloud Filestore instance with file share name and IP address
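
If you prefer to script the source setup, a gcloud sketch like the following creates a comparable Filestore instance and retrieves the IP address and file share name needed later for the DataSync location. The instance name, zone, tier, capacity, and network shown here are illustrative assumptions; adjust them to your environment.

  # Create a basic Filestore instance (names, zone, tier, and capacity are example values).
  gcloud filestore instances create demo-customer-filestore \
    --zone=us-central1-a \
    --tier=BASIC_HDD \
    --file-share=name="myfileshare",capacity=1TB \
    --network=name="gcp-vpc-filestore"

  # Retrieve the private IP address and file share name for the DataSync source location.
  gcloud filestore instances describe demo-customer-filestore \
    --zone=us-central1-a \
    --format="value(networks[0].ipAddresses[0], fileShares[0].name)"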

For the demonstration, I created files to transfer in a sub-directory named business-data contained in the file share. Inside the business-data folder, there are six files. I will transfer the five text files while ignoring the temporary file named log.tmp to demonstrate DataSync’s capability to exclude certain files based on a name pattern.

Google Cloud Filestore contains source files to transfer to Amazon EFS.

Figure 5: Google Cloud Filestore contains source files to transfer to Amazon EFS
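
To stage similar test data, you can mount the file share over NFS from any VM in the same VPC and create the files there. The mount path and file names below are examples that mirror my setup, not the exact files shown in Figure 5.

  # Mount the Filestore file share from a VM in the same VPC (IP and share name are examples).
  # Requires an NFS client, for example the nfs-common package on Debian/Ubuntu.
  sudo mkdir -p /mnt/filestore
  sudo mount -t nfs 10.0.175.210:/myfileshare /mnt/filestore

  # Create a sub-directory with five text files to transfer and one temporary file to exclude.
  sudo mkdir -p /mnt/filestore/business-data
  cd /mnt/filestore/business-data
  sudo touch file1.txt file2.txt file3.txt file4.txt file5.txt log.tmp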

Walkthrough

Now, I am ready to build the solution to transfer data from Google Cloud Filestore to Amazon EFS using AWS DataSync. The following are the high-level steps:

  1. Deploy a DataSync agent as a VM in Google Cloud Compute Engine.
  2. Create Amazon EFS as the destination.
  3. Create DataSync locations.
  4. Create a DataSync task – Add locations.
  5. Create a DataSync task – Configure settings.
  6. Run the DataSync task.
  7. Verify the data transferred.

Step 1: Deploy a DataSync agent as a VM on GCP

The DataSync agent is deployed as a GCP Compute Engine virtual machine instance from an image that AWS DataSync provides. The first step is to download the VMware ESXi zip file from the AWS DataSync Management Console, and then upload the VMware Virtual Machine Disk (.vmdk) file to GCP. Note that uploading this image could take up to 2 hours, so plan ahead. You then use the custom image to launch the DataSync agent as a VM instance. Follow the instructions for creating a DataSync agent in GCP outlined in the AWS DataSync documentation. Take note of the public IP address of the DataSync agent once it is launched. In my case, the external IP address is 35.193.197.233.

DataSync agent is deployed as Google Cloud Compute Engine virtual machine with an external IP address.

Figure 6: DataSync agent is deployed as Google Cloud Compute Engine virtual machine with an external IP address
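
If you prefer to script the VM deployment once the custom image exists, a gcloud sketch like the following launches the agent instance and retrieves its external IP address. The instance name, image name, zone, machine type, and network names are illustrative assumptions; size the machine according to the DataSync agent requirements and follow the AWS DataSync documentation for the supported image-creation workflow.

  # Launch the agent VM from the custom image built from the AWS-provided .vmdk
  # (instance, image, zone, machine type, and network names are example values).
  gcloud compute instances create datasync-agent \
    --zone=us-central1-a \
    --machine-type=n2-standard-4 \
    --image=datasync-agent-image \
    --network=gcp-vpc-filestore \
    --subnet=subnet-filestore-1

  # Note the external IP address; it is used to activate the agent in a later step.
  gcloud compute instances describe datasync-agent \
    --zone=us-central1-a \
    --format="value(networkInterfaces[0].accessConfigs[0].natIP)"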

For this tutorial, I launched the DataSync agent within the same VPC as the Google Cloud Filestore instance. The agent VM is inside a subnet named subnet-filestore-1, which resides in the us-central1 region. The Filestore instance and the DataSync agent are therefore deployed in the same VPC subnet and region.

DataSync agent virtual machine and Google Cloud Filestore are deployed in the same VPC

Figure 7: Google Cloud Virtual Private Cloud for DataSync agent VM and Filestore.

Note that every VPC network in Google Virtual Private Cloud (VPC) has two implied firewall rules for IPv4. These implied rules allow the egress protocols/ports that DataSync requires. This means you don’t need to configure additional egress firewall rules unless you have higher priority rules that override the implied rules. For ingress, you only need to add a TCP/80 rule for activation.

For this tutorial, I’m applying the principle of least privilege, so I configured the following rules to open only the ports required for DataSync. At a minimum, DataSync agents require two outbound ports to communicate with the DataSync service through a public endpoint and one inbound port for automatic activation from your local AWS Management Console. The TCP/80 ingress rule is only required for the initial agent activation step and is not needed once activation is complete.

Minimum protocol and ports needed for DataSync agent to communicate with AWS DataSync public endpoint and GCP Filestore.

Figure 8: Google Cloud Virtual Private Cloud protocol for DataSync agent

To implement the minimum protocol/ports for DataSync connectivity, I configured the VPC firewall rules as shown in Figure 9. The block-all-egress rule overrides the implied allow all egress rule. Then, the next higher priority rule datasync-egress enables the required outbound ports for the DataSync agent to communicate with the DataSync service public endpoint. The highest priority egress rule vpc-egress-internal enables the DataSync agent to communicate with the Filestore internal IP address (in my case 10.0.175.210). Finally, the datasync-agent-activation-ingress rule allows activation of the DataSync agent over port 80. Once activated, you can delete or disable this ingress rule.

Minimum firewall rules needed for the DataSync agent to communicate with the AWS DataSync service public endpoint and GCP Filestore.

Figure 9: Google Cloud Virtual Private Cloud firewall rules for DataSync agent.
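
The rules in Figure 9 can also be expressed with gcloud. The sketch below mirrors the rule names and relative priorities described above; the exact egress ports for the public endpoint come from Figure 8 and the DataSync network requirements, so the TCP/443 value is a representative assumption, and the Filestore CIDR and activation source IP are placeholders.

  # Deny all egress by default (a lower priority number means a higher priority rule).
  gcloud compute firewall-rules create block-all-egress \
    --network=gcp-vpc-filestore --direction=EGRESS --action=DENY \
    --rules=all --destination-ranges=0.0.0.0/0 --priority=65000

  # Allow the outbound ports the agent needs to reach the DataSync public endpoint
  # (TCP/443 shown as an example; confirm the full list against the DataSync network requirements).
  gcloud compute firewall-rules create datasync-egress \
    --network=gcp-vpc-filestore --direction=EGRESS --action=ALLOW \
    --rules=tcp:443 --destination-ranges=0.0.0.0/0 --priority=1000

  # Allow the agent to reach the Filestore internal IP address over NFS
  # (additional NFS-related ports may be required depending on your Filestore tier).
  gcloud compute firewall-rules create vpc-egress-internal \
    --network=gcp-vpc-filestore --direction=EGRESS --action=ALLOW \
    --rules=tcp:2049 --destination-ranges=10.0.175.210/32 --priority=900

  # Temporary ingress rule for agent activation over TCP/80 from the machine performing the
  # activation (placeholder CIDR); delete or disable it after activation.
  gcloud compute firewall-rules create datasync-agent-activation-ingress \
    --network=gcp-vpc-filestore --direction=INGRESS --action=ALLOW \
    --rules=tcp:80 --source-ranges=<your-activation-source-ip>/32 --priority=1000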

Once the DataSync agent is running, the next step is to activate the agent.

  1. Navigate to the AWS DataSync console. Make sure your desired Region is selected.
  2. Click the Agents menu option, then click the Create Agent button.
  3. For Endpoint type, select the Public service endpoint in <Region> option from the drop-down menu.
  4. Select the Automatically get the activation key from your agent option.
  5. Enter the public IP address of the DataSync agent VM instance you launched in the previous step.
  6. Select Get Key.

Select the Endpoint type and enter the public IP address of the DataSync agent to activate

Figure 10: AWS DataSync agent activation.

Once the activation is successful, you should see an agent status of Online. This indicates that the agent is running and able to communicate with the DataSync service via the public endpoint.
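
If you obtain the activation key manually (for example, from the agent’s local console), you can activate the agent and confirm its status from the AWS CLI instead. The agent name is an example and the activation key is a placeholder.

  # Activate the agent with a key retrieved from the agent (placeholder value shown).
  aws datasync create-agent \
    --agent-name gcp-datasync-agent \
    --activation-key <your-activation-key>

  # Confirm the agent shows a status of ONLINE.
  aws datasync list-agents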

Step 2: Create Amazon EFS as the destination

Create a new Amazon EFS file system that will be used as the DataSync task’s destination location. Ensure the file system is created in the same region as the activated agent. Once created, take note of the File system ID.

Amazon EFS file system used as a destination with File system ID

Figure 11: Amazon EFS file system as a destination.
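
For reference, a hedged AWS CLI equivalent for creating the destination file system; the creation token is an example and the Region is a placeholder.

  # Create the destination EFS file system in the same Region as the activated agent.
  aws efs create-file-system \
    --creation-token datasync-destination-efs \
    --encrypted \
    --region <your-region>

  # Note the FileSystemId in the output; it is needed for the mount target and the DataSync location.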

DataSync provisions elastic network interfaces (ENIs) in the specified subnet, which enables data transfers to and from the Amazon EFS file system through the EFS mount target. In this tutorial, I am using the same security group, called EFS-SG, for both the DataSync ENIs and the EFS mount target. This security group allows NFS (TCP/2049) ingress traffic from the security group itself and allows all outbound traffic. These rules allow the DataSync ENIs to communicate directly with the EFS mount target ENI.

VPC security group ingress and egress rules to allow DataSync ENI and Amazon EFS Mount Target to communicate for data transfer.

Figure 12: VPC Security Group for AWS DataSync ENI and Amazon EFS Mount Target.
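
The EFS-SG rules and the mount target can also be created with the AWS CLI, as in the following sketch; the VPC, subnet, and file system IDs are placeholders.

  # Create the shared security group for the DataSync ENIs and the EFS mount target.
  SG_ID=$(aws ec2 create-security-group \
    --group-name EFS-SG \
    --description "DataSync ENIs and EFS mount target" \
    --vpc-id <your-vpc-id> \
    --query GroupId --output text)

  # Allow NFS (TCP/2049) ingress from the security group itself; egress is open by default.
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port 2049 --source-group "$SG_ID"

  # Create an EFS mount target in the subnet DataSync will use, attached to the same security group.
  aws efs create-mount-target \
    --file-system-id <your-file-system-id> \
    --subnet-id <your-subnet-id> \
    --security-groups "$SG_ID"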

Step 3: Create DataSync locations

Create a source (Google Cloud Filestore) location and a destination (Amazon EFS) location for DataSync.

Google Cloud Filestore Location

  1. Open the AWS DataSync console and choose Locations. Then select Create Location.
  2. For Location Type, select Network File System (NFS).
  3. For Agents, select the agent that was activated in the previous step.
  4. For NFS Server, enter the private IP address of the Google Cloud Filestore instance. In my example, this is 10.0.175.210.
  5. For Mount Path, enter the file share name of the Google Cloud Filestore. In my example, this is “/myfileshare”. Be sure to include the leading slash in front of the file share mount path name.
  6. Select Create location.

Configuration of a DataSync source location for Google Cloud Filestore.

Figure 13: DataSync source location for Google Cloud Filestore
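
The same source location can be created with the AWS CLI. The server IP and share path below match my example, and the agent ARN is a placeholder.

  # Create the DataSync NFS location pointing at the Google Cloud Filestore share.
  aws datasync create-location-nfs \
    --server-hostname 10.0.175.210 \
    --subdirectory /myfileshare \
    --on-prem-config AgentArns=<your-datasync-agent-arn>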

Amazon EFS location

  1. Open the AWS DataSync console and choose Locations. Then select Create Location.
  2. For Location Type, select Amazon EFS file system.
  3. For File system, select the Amazon EFS file system created in the previous step.
  4. For Mount Path, enter the subdirectory of the Amazon EFS file system. For my example, I’m targeting the root so I entered “/”. If you are targeting a subdirectory, then enter the path such as “/mysubdirectory”.
  5. For Subnet, select a subnet where DataSync will create the network interfaces used to communicate with the EFS file system. Note that this does not have to be the same subnet as the EFS mount target. However, the subnet you select must be in the same VPC as the EFS file system and in the same Availability Zone as one of the file system’s mount targets.
  6. For Security group, select the security group that can access the file system’s mount target. For this tutorial, this is EFS-SG, which is the same security group associated with the EFS mount target ENI. Hence, my security group has an ingress rule to allow NFS (TCP/2049) traffic from the security group itself.
  7. Select Create location.

Configuration of a DataSync destination location for Amazon EFS

Figure 14: DataSync destination location for Amazon EFS
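
A comparable AWS CLI call for the destination location is shown below; the file system, subnet, and security group ARNs are placeholders.

  # Create the DataSync Amazon EFS location targeting the root of the file system.
  aws datasync create-location-efs \
    --efs-filesystem-arn <your-efs-file-system-arn> \
    --subdirectory / \
    --ec2-config '{"SubnetArn":"<your-subnet-arn>","SecurityGroupArns":["<your-efs-sg-arn>"]}'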

Step 4: Create a DataSync task – Add locations

Follow these steps to create an AWS DataSync task selecting the source and destination locations previously created.

  1. Open the AWS DataSync console and choose Tasks. Then select Create task.
  2. For source location options, select Choose an existing location. For Existing locations, select the NFS location created for the Google Cloud Filestore. In my demonstration, this is “nfs://10.0.175.210/myfileshare/”. Then select Next.
  3. For Destination location options, select Choose an existing location. For Existing locations, select the EFS location previously created.
  4. Select Next. This opens the Configure settings page.

Step 5: Create a DataSync task – Configure settings

The next step is to configure the settings for the DataSync task.

  1. For Task Name, enter the name of the task.
  2. For Data transfer configuration, select the Entire source location.
  3. For Transfer mode, keep the defaults.
  4. For Exclude patterns, select Add pattern. For Pattern, enter “*.tmp”. With this exclude filter, the task ignores files whose names end with .tmp. Note that the pattern value is case-sensitive.
  5. For task logging, I want DataSync to write detailed logs to CloudWatch. For Log level, select Log all transferred objects and files. For our demonstration, I have a small number of files. If you have a large volume, consider the costs for CloudWatch and select the logging level that meets your requirements. For more information, see Monitoring your task.
    1. For CloudWatch log group, select Autogenerate. This will create the appropriate policy and CloudWatch log group. Then select Next. Review the task configuration.
  6. Select Create task, and wait for Task status to be Available.
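
For reference, a CLI sketch that creates a comparable task with the same exclude pattern and verbose CloudWatch logging; the location and log group ARNs are placeholders. Note that the console’s Autogenerate option also attaches the resource policy that lets DataSync write to the log group, whereas with the CLI you configure that policy yourself.

  # Create the task with an exclude filter for *.tmp files and per-file transfer logging.
  aws datasync create-task \
    --name gcp-filestore-to-efs \
    --source-location-arn <your-nfs-location-arn> \
    --destination-location-arn <your-efs-location-arn> \
    --excludes FilterType=SIMPLE_PATTERN,Value="*.tmp" \
    --cloud-watch-log-group-arn <your-log-group-arn> \
    --options LogLevel=TRANSFER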

Step 6: Run the DataSync task

The final step is to start the DataSync task to transfer the files. The status of the task will update as the task enters each phase. See AWS DataSync task statuses for the possible statuses (phases) and their meanings. You can also monitor DataSync using Amazon CloudWatch, which includes metrics such as BytesTransferred, FilesTransferred, and others. See Monitoring AWS DataSync with Amazon CloudWatch for more information. Note that transferring data out of GCP in this solution will incur internet egress networking charges. On the AWS side, there is no charge for inbound data transfer.
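
You can start the task from the console or with the AWS CLI, as in the following sketch (the task ARN is a placeholder).

  # Start the task and capture the execution ARN.
  EXEC_ARN=$(aws datasync start-task-execution \
    --task-arn <your-task-arn> \
    --query TaskExecutionArn --output text)

  # Check the execution status and transfer counters; re-run until Status reaches SUCCESS.
  aws datasync describe-task-execution \
    --task-execution-arn "$EXEC_ARN" \
    --query '{Status:Status,FilesTransferred:FilesTransferred,BytesTransferred:BytesTransferred}'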

Step 7: Verify the data transferred

You can now verify the files transferred from Google Cloud Filestore to the destination Amazon EFS file system (see Figure 15). On the source (Google Cloud Filestore), there are six files, including the log.tmp file. On the destination (Amazon EFS), there are five files. Notice that the log.tmp file was not transferred because my task had an exclude filter, *.tmp, which excluded this log file from being transferred. You can also see that DataSync transferred the permissions, owner, and group for each file. Note that DataSync preserves the numeric user ID (UID) and group ID (GID) values. See Managing how AWS DataSync transfers files, objects, and metadata for more details.

DataSync transferred files from Google Cloud Filestore to Amazon EFS excluding the log.tmp file.

Figure 15: Files transferred from Google Cloud Filestore to Amazon EFS
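
One way to inspect the destination is to mount the file system from an EC2 instance in the same VPC as the EFS mount target; the file system ID and Region are placeholders, and the sub-directory mirrors the source layout.

  # Mount the EFS file system over NFSv4.1 (a minimal mount; see the EFS mount documentation for tuning options).
  sudo mkdir -p /mnt/efs
  sudo mount -t nfs4 -o nfsvers=4.1 <your-file-system-id>.efs.<your-region>.amazonaws.com:/ /mnt/efs

  # List the transferred files; log.tmp should be absent and ownership/permissions preserved.
  ls -l /mnt/efs/business-data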

You can run the task multiple times. Each time a task runs, AWS DataSync detects the changes between the source and destination and only transfers files that are new or modified. This enables you to transfer only the incremental data that changes on the source location. You can also schedule your DataSync task and set a bandwidth limit to customize DataSync to work within your environment.
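
As an example of those two settings, the following sketch updates an existing task with a daily schedule and a bandwidth cap (the task ARN is a placeholder; the schedule expression format follows the DataSync task scheduling documentation).

  # Run the task daily at 02:00 UTC and cap throughput at roughly 10 MB/s.
  aws datasync update-task \
    --task-arn <your-task-arn> \
    --schedule 'ScheduleExpression=cron(0 2 * * ? *)' \
    --options BytesPerSecond=10485760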

Cleaning up

To avoid incurring future charges, delete the resources used in this tutorial.

  1. Delete the Filestore instance and the DataSync agent Compute Engine instance.
  2. Delete the DataSync task, the locations, and then the DataSync agent.
  3. Delete the Amazon EFS file system.
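
A CLI sketch of the cleanup, with the same resource names used earlier; the ARNs and IDs are placeholders.

  # Google Cloud resources (names and zones are example values).
  gcloud filestore instances delete demo-customer-filestore --zone=us-central1-a
  gcloud compute instances delete datasync-agent --zone=us-central1-a

  # AWS resources: delete the task before its locations and agent, and the mount target before the file system.
  aws datasync delete-task --task-arn <your-task-arn>
  aws datasync delete-location --location-arn <your-source-location-arn>
  aws datasync delete-location --location-arn <your-destination-location-arn>
  aws datasync delete-agent --agent-arn <your-agent-arn>
  aws efs delete-mount-target --mount-target-id <your-mount-target-id>
  aws efs delete-file-system --file-system-id <your-file-system-id>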

Conclusion

In this blog post, I walked through transferring files from Google Cloud Filestore to Amazon EFS using AWS DataSync. I deployed an agent as a GCP virtual machine using an image provided by AWS DataSync. The agent enables data transfer from Google Cloud Filestore to AWS Storage services. I demonstrated DataSync features such as filtering to exclude certain files and transferring only incremental changes. You can configure additional DataSync features to schedule tasks, verify data integrity, and set bandwidth limits. After running the task, you can see the task status and verify that the files transferred successfully. By using DataSync, you benefit from simplified migration planning, automated data movement, secure data transfer, and reduced operational costs.

Thank you for reading this post on transferring data from Google Cloud Filestore to Amazon EFS using AWS DataSync. I encourage you to try this solution today. If you have any comments or questions, leave them in the comments section.