Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR
Many organizations have data sitting in various data sources in a variety of formats. Even though data is a critical component of decision-making, for many organizations this data is spread across multiple public clouds. Organizations are looking for tools that make it easy and cost-effective to copy large datasets across cloud vendors. With Amazon EMR and the Hadoop file copy tools Apache DistCp and S3DistCp, we can migrate large datasets from Google Cloud Storage (GCS) to Amazon Simple Storage Service (Amazon S3).
Apache DistCp is an open-source tool for Hadoop clusters that you can use to perform data transfers and inter-cluster or intra-cluster file transfers. AWS provides an extension of that tool called S3DistCp, which is optimized to work with Amazon S3. Both these tools use Hadoop MapReduce to parallelize the copy of files and directories in a distributed manner. Data migration between GCS and Amazon S3 is possible by utilizing Hadoop’s native support for S3 object storage and using a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCP, goes over the settings and parameters for both tools, performs a copy of a test 9.4 TB dataset, and compares the performance of the copy.
The following are the prerequisites for configuring the EMR cluster:
- Install the AWS Command Line Interface (AWS CLI) on your computer or server. For instructions, see Installing, updating, and uninstalling the AWS CLI.
- Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair for SSH access to your EMR nodes. For instructions, see Create a key pair using Amazon EC2.
- Create an S3 bucket to store the configuration files, bootstrap shell script, and the GCS connector JAR file. Make sure that you create a bucket in the same Region as where you plan to launch your EMR cluster.
- Create a shell script (sh) to copy the GCS connector JAR file and the Google Cloud Platform (GCP) credentials to the EMR cluster’s local storage during the bootstrapping phase. Upload the shell script to your bucket location:
s3://<S3 BUCKET>/copygcsjar.sh. The following is an example shell script:
- Download the GCS connector JAR file for Hadoop 3.x (if using a different version, you need to find the JAR file for your version) to allow reading of files from GCS.
- Upload the file to
- Create GCP credentials for a service account that has access to the source GCS bucket. The credentials should be named json and be in JSON format.
- Upload the key to
s3://<S3 BUCKET>/gcs.json. The following is a sample key:
- Create a JSON file named
gcsconfiguration.jsonto enable the GCS connector in Amazon EMR. Make sure the file is in the same directory as where you plan to run your AWS CLI commands. The following is an example configuration file:
Launch and configure Amazon EMR
For our test dataset, we start with a basic cluster consisting of one primary node and four core nodes for a total of five c5n.xlarge instances. You should iterate on your copy workload by adding more core nodes and check on your copy job timings in order to determine the proper cluster sizing for your dataset.
- We use the AWS CLI to launch and configure our EMR cluster (see the following basic create-cluster command):
- Create a custom bootstrap action to be performed at cluster creation to copy the GCS connector JAR file and GCP credentials to the EMR cluster’s local storage. You can add the following parameter to the create-cluster command to configure your custom bootstrap action:
Refer to Create bootstrap actions to install additional software for more details about this step.
- To override the default configurations for your cluster, you need to supply a configuration object. You can add the following parameter to the create-cluster command to specify the configuration object:
Refer to Configure applications when you create a cluster for more details on how to supply this object when creating the cluster.
Putting it all together, the following code is an example of a command to launch and configure an EMR cluster that can perform migrations from GCS to Amazon S3:
aws emr create-cluster \ --name "My First EMR Cluster" \ --release-label emr-6.3.0 \ --applications Name=Hadoop \ --ec2-attributes KeyName=myEMRKeyPairName \ --instance-type c5n.xlarge \ --instance-count 5 \ --use-default-roles \ --bootstrap-actions Path="s3:///copygcsjar.sh" \ --configurations file://gcsconfiguration.json
Submit S3DistCp or DistCp as a step to an EMR cluster
You can run the S3DistCp or DistCp tool in several ways.
When the cluster is up and running, you can SSH to the primary node and run the command in a terminal window, as mentioned in this post.
You can also start the job as part of the cluster launch. After the job finishes, the cluster can either continue running or be stopped. You can do this by submitting a step directly via the AWS Management Console when creating a cluster. Provide the following details:
- Step type – Custom JAR
- Name –
- JAR location –
- Arguments –
s3-dist-cp --src=gs://<GCS BUCKET>/ --dest=s3://<S3 BUCKET>/
- Action of failure – Continue
We can always submit a new step to the existing cluster. The syntax here is slightly different than in previous examples. We separate arguments by commas. In the case of a complex pattern, we shield the whole step option with single quotation marks:
aws emr add-steps \ --cluster-id j-ABC123456789Z \ --steps 'Name=LoadData,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=s3-dist-cp,--src=gs://<GCS BUCKET>/, --dest=s3://<S3 BUCKET>/'
DistCp settings and parameters
In this section, we optimize the cluster copy throughput by adjusting the number of maps or reducers and other related settings.
We use the following memory settings:
Both parameters determine the size of the map containers that are used to parallelize the transfer. Setting this value in line with the cluster resources and the number of maps defined is key to ensuring efficient memory usage. You can calculate the number of launched containers by using the following formula:
Dynamic strategy settings
We use the following dynamic strategy settings:
-Ddistcp.dynamic.max.chunks.tolerable=4000 -Ddistcp.dynamic.split.ratio=3 -strategy dynamic
We use the following map setting:
This determines the number of map containers to launch.
List status settings
We use the following list status setting:
This determines the number of threads to perform the file listing of the source GCS bucket.
The following is a sample command when running with 96 core or task nodes in the EMR cluster:
hadoop distcp -Dmapreduce.map.memory.mb=1536 \ -Dyarn.app.mapreduce.am.resource.mb=1536 \ -Ddistcp.dynamic.max.chunks.tolerable=4000 \ -Ddistcp.dynamic.split.ratio=3 \ -strategy dynamic \ -update \ -m 640 \ -numListstatusThreads 15 \ gs://<GCS BUCKET>/ s3://<S3 BUCKET>/
When running large copies from GCS using S3DistCP, make sure you have the parameter fs.gs.status.parallel.enable (also shown earlier in the sample Amazon EMR application configuration object) set in core-site.xml. This helps parallelize getFileStatus and listStatus methods to reduce latency associated with file listing. You can also adjust the number of reducers to maximize your cluster utilization. The following is a sample command when running with 24 core or task nodes in the EMR cluster:
Testing and performance
To test the performance of DistCp with S3DistCp, we used a test dataset of 9.4 TB (157,000 files) stored in a multi-Region GCS bucket. Both the EMR cluster and S3 bucket were located in us-west-2. The number of core nodes that we used in our testing varied from 24–120.
The following are the results of the DistCp test:
- Workload – 9.4 TB and 157,098 files
- Instance types – 1x c5n.4xlarge (primary), c5n.xlarge (core)
The following are the results of the S3DistCp test:
- Workload – 9.4 TB and 157,098 files
- Instance types – 1x c5n.4xlarge (primary), c5n.xlarge (core)
The results show that S3DistCP performed slightly better than DistCP for our test dataset. In terms of node count, we stopped at 120 nodes because we were satisfied with the performance of the copy. Increasing nodes might yield better performance if required for your dataset. You should iterate through your node counts to determine the proper number for your dataset.
Using Spot Instances for task nodes
Amazon EMR supports the capacity-optimized allocation strategy for EC2 Spot Instances for launching Spot Instances from the most available Spot Instance capacity pools by analyzing capacity metrics in real time. You can now specify up to 15 instance types in your EMR task instance fleet configuration. For more information, see Optimizing Amazon EMR for resilience and cost with capacity-optimized Spot Instances.
Make sure to delete the cluster when the copy job is complete unless the copy job was a step at the cluster launch and the cluster was set up to stop automatically after the completion of the copy job.
In this post, we showed how you can copy large datasets from GCS to Amazon S3 using an EMR cluster and two Hadoop file copy tools: DistCp and S3DistCp.
We also compared the performance of DistCp with S3DistCp with a test dataset stored in a multi-Region GCS bucket. As a follow-up to this post, we will run the same test on Graviton instances to compare the performance/cost of the latest x86 based instances vs. Graviton 2 instances.
You should conduct your own tests to evaluate both tools and find the best one for your specific dataset. Try copying a dataset using this solution and let us know your experience by submitting a comment or starting a new thread on one of our forums.
About the Authors
Hammad Ausaf is a Sr Solutions Architect in the M&E space. He is a passionate builder and strives to provide the best solutions to AWS customers.
Andrew Lee is a Solutions Architect on the Snap Account, and is based in Los Angeles, CA.