I want to copy a large amount of data from Amazon Simple Storage Service (Amazon S3) to my Amazon EMR cluster. What is the best way to do that?

Use S3DistCp to copy data between Amazon S3 and Amazon EMR clusters. S3DistCp is installed on Amazon EMR clusters by default. To call S3DistCp, add it as a step in your Amazon EMR cluster at launch or after the cluster is running.
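
For example, a step similar to the following can be added to a running cluster with the AWS CLI. This is a sketch: the cluster ID, step name, and source and destination paths are placeholders that you would replace with your own values.

$ aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=S3DistCpStep,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://s3distcp-source/input-data,--dest=hdfs:///output-folder1]'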

For more information about adding an S3DistCp step to a running cluster with the AWS Command Line Interface (AWS CLI), see Adding S3DistCp as a Step in a Cluster. To add an S3DistCp step using the console:

1.    Open the Amazon EMR console and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list and then choose Steps.

3.    Choose Add step, and then choose the following options:
For Step type, choose Custom JAR.
For Name, enter a name for the S3DistCp step.
For JAR location, enter command-runner.jar. For more information, see Command Runner.
For Arguments, enter options similar to the following: s3-dist-cp --src=s3://s3distcp-source/input-data --dest=hdfs:///output-folder1.
For Action on failure, choose Continue.

4.    Choose Add.

5.    When the step Status changes to Completed, run a command similar to the following to verify that the files were copied to the cluster:

$ hadoop fs -ls hdfs:///output-folder1/
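
To further confirm the copy, the total size on HDFS can be compared with the size of the source prefix in Amazon S3. The commands below reuse the example paths from the step above:

$ hadoop fs -du -s -h hdfs:///output-folder1/
$ aws s3 ls --recursive --summarize s3://s3distcp-source/input-data/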

Note: It's a best practice to aggregate small files into fewer large files using the groupBy option and then compress the large files using the outputCodec option.
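
As a sketch of that approach, step arguments similar to the following combine the input files into gzip-compressed files of roughly 128 MiB each. The groupBy pattern and target size here are assumptions: files are grouped by the value captured in the parentheses, so adapt the pattern to your own file names:

s3-dist-cp --src=s3://s3distcp-source/input-data --dest=hdfs:///output-folder1 --groupBy=.*(part-).* --targetSize=128 --outputCodec=gz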

Troubleshooting

To troubleshoot problems with S3DistCp, check the step and task logs.

Step logs:

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list, and then choose Steps.

3.    In the Log files column, choose the appropriate step log:

  • controller: Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
  • syslog: Describes the execution of Hadoop jobs in the step.
  • stderr: The standard error channel of Hadoop while it processes the step.
  • stdout: The standard output channel of Hadoop while it processes the step.
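
If the cluster was launched with logging to Amazon S3 enabled, the same step logs are also archived to the log bucket a few minutes after the step runs. A sketch, assuming a placeholder log bucket, log prefix, cluster ID, and step ID:

$ aws s3 ls s3://your-log-bucket/elasticmapreduce/j-XXXXXXXXXXXXX/steps/s-XXXXXXXXXXXXX/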

If you can't find the root cause of the failure in the step logs, check the S3DistCp task logs:

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list, and then choose Steps.

3.    In the Log files column, choose View jobs.

4.    In the Actions column, choose View tasks.

5.    If there are failed tasks, choose View attempts to see the task logs.
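
If you have SSH access to the master node and YARN log aggregation is enabled on the cluster, the task logs can also be pulled with the YARN CLI. The application ID below is a placeholder; it appears in the step's syslog and in container IDs such as the one shown in the memory error example that follows:

$ yarn logs -applicationId application_1494287247949_0005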

Common errors

Reducer task fails due to insufficient memory:

If you see an error message similar to the following in the step's stderr log, the S3DistCp job failed because there wasn't enough memory to process the reducer tasks:

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=19135,containerID=container_1494287247949_0005_01_000003] is running beyond virtual memory limits. Current usage: 569.0 MB of 1.4 GB physical memory used; 3.0 GB of 3.0 GB virtual memory used. Killing container.

To resolve this problem, increase the memory resources available to the reducer tasks. For example, launch the cluster with larger instance types, or raise the reducer memory settings (such as mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts) with the mapred-site configuration classification, as sketched below.
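
A minimal sketch of raising the reducer memory settings at cluster launch. The cluster name, release label, instance type and count, and memory values are assumptions for illustration only; size them for your own workload:

$ aws emr create-cluster \
    --name S3DistCpCluster \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"mapred-site","Properties":{"mapreduce.reduce.memory.mb":"4096","mapreduce.reduce.java.opts":"-Xmx3276m"}}]'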

S3 permission error:

If you see an error message similar to the following in the step's stderr log, the S3DistCp task wasn't able to access S3 because of a permissions problem:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: REQUEST_ID)

To resolve this problem, see Permissions Errors.
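
As a starting point, check which policies are attached to the EC2 instance profile role that the cluster uses, and verify that they allow at least s3:ListBucket and s3:GetObject on the source bucket (and s3:PutObject if the destination is also in Amazon S3). The role name below assumes the default EMR_EC2_DefaultRole; your cluster might use a different role:

$ aws iam list-attached-role-policies --role-name EMR_EC2_DefaultRole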



Published: 2018-09-27