AWS Storage Blog
Migrating Google Cloud Storage to Amazon S3 using AWS DataSync
Update (4/17/2024): The solution presented in this post using AWS DataSync for migration to Amazon S3 from Google Cloud Storage works best if you are looking for a secure managed service for your data transfer workflow that provides data validation, integrated auditing and monitoring capabilities, and the ability to transfer changed data. If you are familiar with AWS Glue and want a serverless service for data migration, consider using AWS Glue for your migration to Amazon S3 from Google Cloud Storage. If you have an EMR cluster, are comfortable writing and implementing your data transfer solution, and may desire to use EMR on Amazon EC2 Spot Instances for cost considerations, consider using Amazon EMR for your migration to Amazon S3 from Google Cloud Storage.
Organizations sometimes need to migrate large volumes of object data from one cloud provider to another. Reasons for this type of migration may include data consolidation, workload migration, disaster recovery, or the end of a discount program. Migrations typically require end-to-end encryption, the ability to detect changes, object validation, network throttling, monitoring, and cost optimization. Building such solutions can be time-consuming, expensive, and difficult to scale. Furthermore, migrating between public cloud providers limits certain options such as physically accessing the storage devices at the data center.
In this blog, I walk through migrating object data from Google Cloud Storage to Amazon Simple Storage Service (Amazon S3) using AWS DataSync. I start with an overview of AWS DataSync and highlight specific features for this use case. Then, I walk through configuring DataSync to migrate objects to Amazon S3.
AWS DataSync overview
AWS DataSync is a data movement service that simplifies, automates, and accelerates moving data to and from AWS Storage services, between AWS Storage and Google Cloud Storage, and between AWS Storage and Microsoft Azure Files. AWS DataSync features include data encryption, validation, monitoring, task scheduling, auditing, and scaling.
Specifically for Google Cloud Storage, AWS DataSync has built-in integration using the Google Storage XML API, a RESTful interface that lets applications manage Google Cloud Storage objects programmatically. DataSync connects to the Google Storage XML API with a Hash-based Message Authentication Code (HMAC) key. A Google Cloud Platform (GCP) HMAC key consists of an access ID and a secret, and the associated service account can be granted appropriate roles to access objects in a Google Cloud Storage bucket. With AWS DataSync, you don’t need to write any custom code to migrate data from Google Cloud Storage, and you get the benefit of all of the DataSync features.
AWS DataSync components and terminology
DataSync has four components for data movement: task, location, agent, and task execution. Figure 1 shows the relationship between the components and the attributes that I’ll use for the tutorial.
Figure 1: AWS DataSync four primary components
- Agent: A virtual machine (VM) that reads data from, or writes data to, a self-managed location. In this tutorial, the DataSync agent is mapped to an Amazon EC2 instance running inside an Amazon VPC.
- Location: The source and destination location for the data migration. In this tutorial, the source location is object storage pointing to the Google Storage XML API (https://storage.googleapis.com), with a specified Google Storage bucket name. The destination location is an Amazon S3 bucket.
- Task: A task consists of one source and one destination location with a configuration that defines how data is moved. A task always moves data from the source to the destination. Configuration can include options such as include/exclude patterns, task schedule, bandwidth limits, and more.
- Task execution: This is an individual run of a task, which includes information such as start time, end time, bytes written, and status.
How to migrate data from Google Cloud Storage to Amazon S3 with AWS DataSync
For this tutorial, I walk through the steps to configure AWS DataSync to migrate objects from Google Cloud Storage to Amazon S3. In Figure 2, I’ve laid out the basic architecture for DataSync using a single DataSync agent. The source objects to migrate reside in the Google Cloud Storage bucket. The destination is the Amazon S3 bucket.
Figure 2: Single AWS DataSync agent architecture
DataSync is made up of several components, starting with a DataSync agent. In this architecture, the agent is deployed as an Amazon Elastic Compute Cloud (Amazon EC2) instance into a subnet within an Amazon Virtual Private Cloud (Amazon VPC). The subnet contains a DataSync VPC endpoint that allows traffic to flow privately from the DataSync agent to Amazon S3. The DataSync agent connects to Google Cloud Storage over the public internet using an HMAC key.
Tutorial prerequisites
For this tutorial, you should have the following prerequisites:
- An AWS account
- AWS Command Line Interface (CLI)
- Google Cloud Platform Cloud Storage bucket with source objects to transfer
Source overview – Google Cloud Storage bucket
For our tutorial, I have a source Google Cloud Storage bucket named demo-customer-bucket. You can create a storage bucket and upload objects to your bucket to follow along. This bucket has two folders: business-data and system-data. In this use case, I only want to transfer the objects in the business-data folder and ignore the system-data folder. This is to demonstrate DataSync’s capability to include or exclude folders from the source bucket.
Figure 3: Google Cloud Storage bucket as source
Inside the business-data folder, there are six files. My goal is to transfer the five text files while ignoring the temporary file named log.tmp. This is to demonstrate DataSync’s capability to exclude certain objects based on name pattern.
Figure 4: Google Cloud Storage bucket objects as source
Walkthrough
Now I am ready to build the solution to move data from Google Cloud Storage to Amazon S3 using AWS DataSync. The following are the high-level steps:
- Create a Google Cloud Platform HMAC key.
- Create an Amazon S3 bucket as the destination.
- Create an IAM role to access the Amazon S3 bucket.
- Set up a network for the Amazon VPC.
- Deploy an Amazon EC2 DataSync agent.
- Create DataSync locations.
  - Google Cloud Storage location
  - Amazon S3 location
- Create and run a DataSync task.
  - Create a DataSync task – Location
  - Create a DataSync task – Configure settings
  - Run the DataSync task
- Verify the data migrated.
Step 1: Create a GCP HMAC key
The DataSync agent uses an HMAC credential to authenticate to Google Cloud Platform and manage objects in the Cloud Storage bucket. This requires creating a key for a service account. You can follow the directions at Manage HMAC keys for service accounts. When complete, you should have an access ID and a secret. Keep this information in a secure location.
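If you prefer the command line, you can also create the HMAC key with the Google Cloud CLI. The following is a minimal sketch; the service account email is a placeholder for your own service account.

```
# Create an HMAC key for an existing service account
# (the service account email below is a placeholder).
# The command prints the access ID and secret; store both securely.
gsutil hmac create datasync-migration@my-project.iam.gserviceaccount.com
```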
The service account principal needs sufficient permission for the DataSync agent to connect and migrate the objects. You can assign a predefined role named Storage Object Viewer to the service account principal as a way to grant this permission.
Figure 5: Google Cloud Storage role permission
You can further limit this role’s access by adding a condition that grants permission only when the resource name starts with projects/_/buckets/demo-customer-bucket, where demo-customer-bucket is the name of the source bucket. Notice that the condition uses starts with rather than an exact match, which grants permission for both the bucket and the objects within it using a single statement.
Common Expression Language (CEL) expression:
resource.name.startsWith("projects/_/buckets/demo-customer-bucket")
Figure 6: Google Cloud Storage role permission condition
Step 2: Create an Amazon S3 bucket as the destination
Create a new Amazon S3 bucket to use as the destination for the DataSync transfer. Once you create the destination bucket, obtain the bucket Amazon Resource Name (ARN) from the bucket’s Properties tab.
Figure 7: Amazon S3 bucket’s Amazon Resource Name
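If you prefer to script this step, the following AWS CLI sketch creates the bucket; the bucket name and Region are placeholders for your own values. If you create the bucket in us-east-1, omit the --create-bucket-configuration parameter.

```
# Create the destination bucket (bucket name and Region are placeholders).
aws s3api create-bucket \
    --bucket demo-customer-destination-bucket \
    --region us-west-2 \
    --create-bucket-configuration LocationConstraint=us-west-2

# Amazon S3 bucket ARNs follow a fixed format: arn:aws:s3:::<bucket-name>
```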
Step 3: Create an IAM role to access the Amazon S3 bucket
AWS DataSync needs to access the Amazon S3 bucket in order to transfer the data to the destination bucket. This requires DataSync to assume an IAM role with appropriate permission and trust relationship. Create a new role and attach a policy that allows DataSync to read and write to your Amazon S3 bucket.
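As an illustration, here is a minimal AWS CLI sketch of the trust relationship and bucket permissions. The role name and bucket name are placeholders, and you may want to tighten or extend the permissions to match your requirements.

```
# Trust policy that allows the DataSync service to assume the role.
cat > datasync-trust.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "datasync.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Permissions that let DataSync list the bucket and read and write its
# objects (the bucket name is a placeholder).
cat > datasync-s3-access.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::demo-customer-destination-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListMultipartUploadParts",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::demo-customer-destination-bucket/*"
    }
  ]
}
EOF

# Create the role and attach the permissions (role name is a placeholder).
aws iam create-role \
    --role-name datasync-s3-access-role \
    --assume-role-policy-document file://datasync-trust.json
aws iam put-role-policy \
    --role-name datasync-s3-access-role \
    --policy-name datasync-s3-access \
    --policy-document file://datasync-s3-access.json
```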
Step 4: Set up a network for the Amazon VPC endpoint
Create the VPC, subnet, route table, and security group based on the network requirements when using VPC endpoints. Then, create a DataSync interface endpoint. With this endpoint, the connection between an agent and the DataSync service doesn’t cross the public internet and doesn’t require public IP addresses.
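If you want to create the endpoint from the AWS CLI, a minimal sketch follows; the VPC, subnet, and security group IDs and the Region are placeholders.

```
# Create a DataSync interface endpoint in the agent's VPC
# (all IDs and the Region are placeholders).
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-west-2.datasync \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0
```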
Step 5: Deploy an Amazon EC2 DataSync agent
The next step is to deploy an agent as an Amazon EC2 instance. The Amazon EC2 instance is launched using the latest DataSync Amazon Machine Image (AMI) into the subnet from the previous step with the security group for agents. Once the Amazon EC2 instance is running, create a DataSync agent component using the VPC endpoint. Finally, activate your agent to associate it with your AWS account.
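The following AWS CLI sketch shows one way to look up the current DataSync AMI and launch the instance; the Region, instance type, key pair, subnet, and security group are placeholders, and you can finish creating and activating the agent in the DataSync console as described above.

```
# Look up the latest DataSync AMI ID for the Region from
# Systems Manager Parameter Store.
AMI_ID=$(aws ssm get-parameter \
    --name /aws/service/datasync/ami \
    --region us-west-2 \
    --query 'Parameter.Value' --output text)

# Launch the agent into the subnet created earlier
# (instance type, key pair, and IDs are placeholders).
aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type m5.2xlarge \
    --key-name my-key-pair \
    --subnet-id subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0
```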
Step 6: Create DataSync locations
Create source and destination DataSync locations using the following steps.
Google Cloud Storage location
- Open the AWS DataSync console and choose Locations. Then select Create location.
- For Location Type, select Object Storage.
- For Agents, select the agent that was activated in the previous step.
- For Server, type “storage.googleapis.com”
- For Bucket name, enter the name of the Google Cloud Storage source bucket. In this tutorial, this is “demo-customer-bucket.” Note that the bucket name is case-sensitive.
- For authentication, enter the Google HMAC key’s access ID and secret that were obtained in Step 1: Create a GCP HMAC Key.
- Select Create location.
Figure 8: DataSync source location for Google Cloud Storage
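If you would rather create the source location with the AWS CLI, the following sketch mirrors the console steps above; the agent ARN, access ID, and secret are placeholders.

```
# Create the Google Cloud Storage source location
# (agent ARN, access ID, and secret are placeholders).
aws datasync create-location-object-storage \
    --server-hostname storage.googleapis.com \
    --server-protocol HTTPS \
    --server-port 443 \
    --bucket-name demo-customer-bucket \
    --access-key GOOG1EXAMPLEACCESSID \
    --secret-key ExampleHmacSecret0123456789 \
    --agent-arns arn:aws:datasync:us-west-2:111122223333:agent/agent-0123456789abcdef0
```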
Amazon S3 location
- Open the AWS DataSync console and choose Locations. Then select Create location.
- For Location Type, select Amazon S3.
- For the Amazon S3 bucket, select the destination Amazon S3 bucket from Step 2: Create an Amazon S3 bucket as the destination.
- For IAM role, select the IAM role from Step 3: Create IAM Role to access Amazon S3.
- Select Create location.
Figure 9: DataSync destination location for Amazon S3
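The CLI equivalent is similar; the bucket ARN and role ARN are placeholders for the resources created in Steps 2 and 3.

```
# Create the Amazon S3 destination location
# (bucket ARN and role ARN are placeholders).
aws datasync create-location-s3 \
    --s3-bucket-arn arn:aws:s3:::demo-customer-destination-bucket \
    --s3-config BucketAccessRoleArn=arn:aws:iam::111122223333:role/datasync-s3-access-role
```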
Step 7: Create and run an AWS DataSync task
The next step is to create an AWS DataSync task. Follow these steps to create a task starting with configuring the source and destination locations.
Create an AWS DataSync task: Location
- Open the AWS DataSync console and choose Tasks. Then select Create Task.
- For Source location options, select Choose an existing location. For Existing locations, select object-storage://storage.googleapis.com/<bucket name>/ from Step 6: Create DataSync locations. Then select Next.
- For Destination location options, select Choose an existing location. For Existing locations, select s3://<bucket name>/ from Step 6: Create DataSync locations.
- Select Next. This opens the Configure settings page.
Create an AWS DataSync task: Configure settings
The next step is to configure the settings for the DataSync task.
- For Task Name, enter the name of the task.
- For Data transfer configuration, select the Specific files and folders option in order to specify the subfolder using the include patterns.
- For Transfer mode, select Transfer only data that has changed.
- Uncheck Keep deleted files. With this option unchecked, if an object is deleted from the Google Cloud Storage bucket, the task deletes the corresponding object in the Amazon S3 bucket. This is the behavior we want in this demo to keep Amazon S3 in sync with the source, because the source bucket is assumed to be a live dataset.
- Check Overwrite files. This means that if the task detects that the source and the target object are different, then the task will overwrite the target object in Amazon S3. This is the behavior we want to keep Amazon S3 in sync with the source bucket.
- For Include patterns, enter “/business-data/*”. Using the include patterns, you can filter the objects transferred to the scope of the location path. Note that the pattern value is case-sensitive.
- For Exclude patterns, select Add pattern. For Pattern, enter “*.tmp”. By using this exclude filter, the task will ignore any objects that end with a .tmp name. Note that the pattern value is case-sensitive.
- Expand Additional settings and uncheck Copy object tags. This prevents DataSync from attempting to read object tags from Google Cloud Storage, which is not supported.
- For Task logging, we want DataSync to write detailed logs to Amazon CloudWatch Logs. For Log level, select Log all transferred objects and files. For our demonstration, we have a small number of files. If you have a large volume, consider the CloudWatch cost and select the logging level that meets your requirements. For more information, see Monitoring your task.
- For CloudWatch log group, select Autogenerate. This will create the appropriate policy and CloudWatch log group. Then select Next. Review the task configuration.
- Select Create task, and wait for Task status to be Available.
Figure 10: DataSync task configuration settings
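For readers who prefer the AWS CLI, here is a minimal sketch of the same task configuration; the location and log group ARNs are placeholders, and the options mirror the console choices above (transfer only changed data, overwrite changed objects, remove deleted objects, skip object tags, and log transferred objects).

```
# Create the task with the settings chosen above
# (location and log group ARNs are placeholders).
aws datasync create-task \
    --name gcs-to-s3-migration \
    --source-location-arn arn:aws:datasync:us-west-2:111122223333:location/loc-0123456789abcdef0 \
    --destination-location-arn arn:aws:datasync:us-west-2:111122223333:location/loc-abcdef0123456789a \
    --cloud-watch-log-group-arn "arn:aws:logs:us-west-2:111122223333:log-group:/aws/datasync:*" \
    --includes FilterType=SIMPLE_PATTERN,Value="/business-data/*" \
    --excludes FilterType=SIMPLE_PATTERN,Value="*.tmp" \
    --options TransferMode=CHANGED,OverwriteMode=ALWAYS,PreserveDeletedFiles=REMOVE,ObjectTags=NONE,LogLevel=TRANSFER
```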
Run the AWS DataSync task
The final step is to start the DataSync task to transfer the files. The status of the task updates as the task passes through each phase. See Understanding task execution status for the possible statuses (phases) and their meanings. You can also monitor DataSync using Amazon CloudWatch.
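You can also start and monitor the task from the AWS CLI. The following sketch starts an execution and checks its status; the task and execution ARNs are placeholders.

```
# Start a task execution (task ARN is a placeholder).
aws datasync start-task-execution \
    --task-arn arn:aws:datasync:us-west-2:111122223333:task/task-0123456789abcdef0

# start-task-execution returns a task execution ARN that you can poll
# for status, bytes transferred, and other details.
aws datasync describe-task-execution \
    --task-execution-arn arn:aws:datasync:us-west-2:111122223333:task/task-0123456789abcdef0/execution/exec-0123456789abcdef0
```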
Step 8: Verify the data migrated
You can now verify that the objects migrated from Google Cloud Storage to the Amazon S3 bucket. Navigate to the Amazon S3 target bucket and open the business-data folder. The five text files should now be in the Amazon S3 bucket. Notice that the log.tmp file was not transferred, because our task had the exclude pattern *.tmp, which excluded this log file from the transfer.
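A quick way to confirm the result is to list the destination prefix with the AWS CLI; the bucket name is a placeholder, and log.tmp should not appear in the output.

```
# List the migrated objects in the destination bucket
# (bucket name is a placeholder); log.tmp should be absent.
aws s3 ls s3://demo-customer-destination-bucket/business-data/
```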
You can run the task multiple times. Each time a task runs, AWS DataSync will detect the changes between source and destination and only transfer the objects that are new or modified. This allows you to transfer data that may be changing on the source location.
Figure 11: Objects transferred from source to destination
DataSync also transfers the object’s custom metadata. In this tutorial, the file-1.txt file had custom metadata named Department with a value of Math. This information was transferred to the Amazon S3 object metadata.
Figure 12: Object metadata transferred from source to destination
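You can also confirm the metadata from the AWS CLI; the bucket name is a placeholder, and the key assumes the folder layout used in this tutorial.

```
# Show the user-defined metadata on the transferred object
# (bucket name and key are placeholders for this tutorial's layout).
aws s3api head-object \
    --bucket demo-customer-destination-bucket \
    --key business-data/file-1.txt \
    --query Metadata
```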
Cleaning up
To avoid incurring future charges, delete the resources used in this tutorial.
- Deactivate the GCP HMAC key. Then delete the GCP HMAC key and the Google Cloud Storage bucket.
- Delete the DataSync task, the locations, and then the agent.
- Shut down the Amazon EC2 instance.
- Delete the VPC endpoint.
- Delete the Amazon S3 bucket.
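On the AWS side, the cleanup can also be scripted; the following sketch uses placeholder ARNs, IDs, and names, and you should run delete-location once for each of the two locations.

```
# Delete DataSync resources, the agent instance, the VPC endpoint, and the
# destination bucket (all ARNs, IDs, and names are placeholders).
aws datasync delete-task --task-arn arn:aws:datasync:us-west-2:111122223333:task/task-0123456789abcdef0
aws datasync delete-location --location-arn arn:aws:datasync:us-west-2:111122223333:location/loc-0123456789abcdef0
aws datasync delete-agent --agent-arn arn:aws:datasync:us-west-2:111122223333:agent/agent-0123456789abcdef0
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids vpce-0123456789abcdef0
aws s3 rb s3://demo-customer-destination-bucket --force
```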
Conclusion
In this blog post, I discussed using AWS DataSync to migrate object data from Google Cloud Storage to Amazon S3, and walked through configuring AWS DataSync to migrate objects from a Google Cloud Storage bucket to an Amazon S3 bucket. I also demonstrated using AWS features to include and exclude source bucket folders and objects.
Using AWS DataSync, you can simplify migrating a large volume of data from Google Cloud Platform to Amazon S3. AWS DataSync has built-in capabilities that allow you to connect to Google Cloud Storage through the Google Storage XML API with an HMAC key. With the solution covered in this post, you can take advantage of DataSync’s features, including the ability to run tasks multiple times to capture changes in the source dataset.
Here are additional resources to help you get started with AWS DataSync:
- What’s New with AWS DataSync
- AWS DataSync User Guide
- AWS re:Post
- AWS DataSync Primer – free one-hour, self-paced online course
Thank you for reading this post on migrating Google Cloud Storage to Amazon S3 using AWS DataSync. I encourage you to try this solution today. If you have any comments or questions, please leave them in the comments section.