I want to copy a large amount of data from Amazon Simple Storage Service (Amazon S3) to my Amazon EMR cluster. What is the best way to do that?

Use S3DistCp to copy data between Amazon S3 and Amazon EMR clusters. S3DistCp is installed on Amazon EMR clusters by default. To call S3DistCp, add it as a step in your Amazon EMR cluster at launch or after the cluster is running.
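
For example, a step similar to the following can be added to a running cluster with the AWS CLI. This is a sketch: the cluster ID, step name, and source and destination paths are placeholders that you would replace with your own values.

$ aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=S3DistCpStep,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://s3distcp-source/input-data,--dest=hdfs:///output-folder1]'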

For more information about adding an S3DistCp step to a running cluster with the AWS Command Line Interface (AWS CLI), see Adding S3DistCp as a Step in a Cluster. To add an S3DistCp step using the console:

1.    Open the Amazon EMR console and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list and then choose Steps.

3.    Choose Add step, and then choose the following options:
For Step type, choose Custom JAR.
For Name, enter a name for the S3DistCp step.
For JAR location, enter command-runner.jar. For more information, see Command Runner.
For Arguments, enter options similar to the following: s3-dist-cp --src=s3://s3distcp-source/input-data --dest=hdfs:///output-folder1.
For Action on failure, choose Continue.

4.    Choose Add.

5.    When the step Status changes to Completed, run a command similar to the following to verify that the files were copied to the cluster:

$ hadoop fs -ls hdfs:///output-folder1/
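
To further confirm the copy, the total size on HDFS can be compared with the size of the source prefix in Amazon S3. The commands below reuse the example paths from the step above:

$ hadoop fs -du -s -h hdfs:///output-folder1/
$ aws s3 ls --recursive --summarize s3://s3distcp-source/input-data/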

Note: It's a best practice to aggregate small files into fewer large files using the groupBy option and then compress the large files using the outputCodec option.
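
As a sketch of that approach, step arguments similar to the following combine the input files into gzip-compressed files of roughly 128 MiB each. The groupBy pattern and target size here are assumptions: files are grouped by the value captured in the parentheses, so adapt the pattern to your own file names:

s3-dist-cp --src=s3://s3distcp-source/input-data --dest=hdfs:///output-folder1 --groupBy=.*(part-).* --targetSize=128 --outputCodec=gz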

Troubleshooting

To troubleshoot problems with S3DistCp, check the step and task logs.

Step logs:

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list, and then choose Steps.

3.    In the Log files column, choose the appropriate step log:

  • controller: Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
  • syslog: Describes the execution of Hadoop jobs in the step.
  • stderr: The standard error channel of Hadoop while it processes the step.
  • stdout: The standard output channel of Hadoop while it processes the step.
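
If the cluster was launched with logging to Amazon S3 enabled, the same step logs are also archived to the log bucket a few minutes after the step runs. A sketch, assuming a placeholder log bucket, log prefix, cluster ID, and step ID:

$ aws s3 ls s3://your-log-bucket/elasticmapreduce/j-XXXXXXXXXXXXX/steps/s-XXXXXXXXXXXXX/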

If you can't find the root cause of the failure in the step logs, check the S3DistCp task logs:

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list, and then choose Steps.

3.    In the Log files column, choose View jobs.

4.    In the Actions column, choose View tasks.

5.    If there are failed tasks, choose View attempts to see the task logs.
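
If you have SSH access to the master node and YARN log aggregation is enabled on the cluster, the task logs can also be pulled with the YARN CLI. The application ID below is a placeholder; it appears in the step's syslog and in container IDs such as the one shown in the memory error example that follows:

$ yarn logs -applicationId application_1494287247949_0005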

Common errors

Reducer task fails due to insufficient memory:

If you see an error message similar to the following in the step's stderr log, the S3DistCp job failed because there wasn't enough memory to process the reducer tasks:

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=19135,containerID=container_1494287247949_0005_01_000003] is running beyond virtual memory limits. Current usage: 569.0 MB of 1.4 GB physical memory used; 3.0 GB of 3.0 GB virtual memory used. Killing container.

To resolve this problem, increase the memory resources available to the reducer tasks. For example, launch the cluster with larger instance types, or raise the reducer memory settings (such as mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts) with the mapred-site configuration classification, as sketched below.
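
A minimal sketch of raising the reducer memory settings at cluster launch. The cluster name, release label, instance type and count, and memory values are assumptions for illustration only; size them for your own workload:

$ aws emr create-cluster \
    --name S3DistCpCluster \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"mapred-site","Properties":{"mapreduce.reduce.memory.mb":"4096","mapreduce.reduce.java.opts":"-Xmx3276m"}}]'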

S3 permission error:

If you see an error message similar to the following in the step's stderr log, the S3DistCp task wasn't able to access S3 because of a permissions problem:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: REQUEST_ID)

To resolve this problem, see Permissions Errors.
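
As a starting point, check which policies are attached to the EC2 instance profile role that the cluster uses, and verify that they allow at least s3:ListBucket and s3:GetObject on the source bucket (and s3:PutObject if the destination is also in Amazon S3). The role name below assumes the default EMR_EC2_DefaultRole; your cluster might use a different role:

$ aws iam list-attached-role-policies --role-name EMR_EC2_DefaultRole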



Published: 2018-09-27