AWS Big Data Blog

Author: Andrew Lee

Andrew Lee is a Senior Prototyping Architect based in Los Angeles, CA. As an active member of the Argo CD scalability SIG, Andrew is interested in finding the limits of Argo CD by contributing to benchmarking tooling, running large-scale experiments, and sharing the results with the community.

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

You can migrate data between Google Cloud Storage (GCS) and Amazon S3 by combining Hadoop's native support for S3 object storage with a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCp, goes over the settings and parameters for both tools, copies a 9.4 TB test dataset with each, and compares their performance.
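As a rough illustration of what the two copy jobs might look like, the following Python sketch assembles the command lines for each tool. It is not taken from the post: the bucket names, paths, and mapper count are hypothetical placeholders, the `s3://` destination scheme is an assumption, and it presumes an EMR cluster where the GCS connector is already installed so that `gs://` paths resolve. The only options used are DistCp's `-m` (maximum number of map tasks) and S3DistCp's `--src` and `--dest`.

```python
import shlex


def distcp_command(src, dest, max_maps=None):
    """Build a `hadoop distcp` argument list (source and destination are URIs)."""
    cmd = ["hadoop", "distcp"]
    if max_maps is not None:
        # -m caps the number of simultaneous map tasks used for the copy.
        cmd += ["-m", str(max_maps)]
    cmd += [src, dest]
    return cmd


def s3distcp_command(src, dest):
    """Build an `s3-dist-cp` argument list using its --src/--dest options."""
    return ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]


if __name__ == "__main__":
    # Hypothetical buckets and paths, for illustration only.
    source = "gs://example-gcs-bucket/dataset/"
    target = "s3://example-s3-bucket/dataset/"

    # Print shell-ready commands that could be run on the cluster's primary node.
    print(shlex.join(distcp_command(source, target, max_maps=100)))
    print(shlex.join(s3distcp_command(source, target)))
```

Building the commands as argument lists, rather than as concatenated strings, keeps quoting predictable if the same helpers are later used to submit the jobs programmatically.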