How can I copy large amounts of data from Amazon S3 into HDFS on my Amazon EMR cluster?

I want to copy a large amount of data from Amazon Simple Storage Service (Amazon S3) to my Amazon EMR cluster.

Short description

Use S3DistCp to copy data between Amazon S3 and Amazon EMR clusters. S3DistCp is installed on Amazon EMR clusters by default. To call S3DistCp, add it as a step at launch or after the cluster is running.

Resolution

To add an S3DistCp step to a running cluster using the AWS Command Line Interface (AWS CLI), see Adding S3DistCp as a step in a cluster.

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
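
For example, a command similar to the following adds an S3DistCp step to a running cluster. The cluster ID, bucket, and HDFS path are placeholders that you replace with your own values:

$ aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=CUSTOM_JAR,Name=S3DistCpCopy,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[s3-dist-cp,--src=s3://s3distcp-source/input-data,--dest=hdfs:///output-folder1]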

To add an S3DistCp step using the console, do the following:

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the Amazon EMR cluster from the list, and then choose Steps.

3.    Choose Add step, and then choose the following options:

For Step type, choose Custom JAR.
For Name, enter a name for the S3DistCp step.
For JAR location, enter command-runner.jar. For more information, see Run commands and scripts on an Amazon EMR cluster.
For Arguments, enter options similar to the following: s3-dist-cp --src=s3://s3distcp-source/input-data --dest=hdfs:///output-folder1.
For Action on failure, choose Continue.

4.    Choose Add.

5.    When the step Status changes to Completed, verify that the files were copied to the cluster:

$ hadoop fs -ls hdfs:///output-folder1/
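
If you prefer to monitor the step from the command line instead of the console, a command similar to the following returns the status of the steps on the cluster. The cluster ID is a placeholder:

$ aws emr list-steps --cluster-id j-XXXXXXXXXXXXX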

Note: It's a best practice to aggregate small files into fewer large files using the groupBy option and then compress the large files using the outputCodec option.
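
For example, a command similar to the following concatenates files that match a regular expression into larger files of about the target size (in MiB) and compresses the output with gzip. The source path, regular expression, and target size are illustrative placeholders:

$ s3-dist-cp --src=s3://s3distcp-source/input-data \
    --dest=hdfs:///output-folder1 \
    --groupBy='.*/input-data/(\w+)/.*' \
    --targetSize=128 \
    --outputCodec=gz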

Troubleshooting

To troubleshoot problems with S3DistCp, check the step and task logs.

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the EMR cluster from the list, and then choose Steps.

3.    In the Log files column, choose the appropriate step log:

controller: Information about the processing of the step. If your step fails while loading, then you can find the stack trace in this log.
syslog: Logs from non-Amazon software, such as Apache and Hadoop.
stderr: Standard error channel of Hadoop while it processes the step.
stdout: Standard output channel of Hadoop while it processes the step.

If you can't find the root cause of the failure in the step logs, check the S3DistCp task logs:

1.    Open the Amazon EMR console, and then choose Clusters.

2.    Choose the EMR cluster from the list, and then choose Steps.

3.    In the Log files column, choose View jobs.

4.    In the Actions column, choose View tasks.

5.    If there are failed tasks, choose View attempts to see the task logs.
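
You can also pull the aggregated container logs for the S3DistCp job directly on the cluster, assuming YARN log aggregation is enabled (the default on Amazon EMR). The application ID below is a placeholder that you replace with the ID of the failed job:

$ yarn logs -applicationId application_1494287247949_0005 > s3distcp-task.log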

Common errors

Reducer task fails due to insufficient memory:

If you see an error message similar to the following in the step's stderr log, then the S3DistCp job failed because there wasn't enough memory to process the reducer tasks:

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=19135,containerID=container_1494287247949_0005_01_000003] is running beyond virtual memory limits. Current usage: 569.0 MB of 1.4 GB physical memory used; 3.0 GB of 3.0 GB virtual memory used. Killing container.

To resolve this problem, increase the memory resources that are available to the reducer tasks. For example, run the job on a cluster with larger instance types, or increase the reducer memory settings for the cluster.
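
As a sketch of the second approach, you can raise the reducer memory through the mapred-site configuration classification when you launch the cluster. The release label, instance settings, and memory values below are illustrative placeholders, not recommendations:

$ aws emr create-cluster \
    --name "S3DistCp copy cluster" \
    --release-label emr-6.10.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"mapred-site","Properties":{"mapreduce.reduce.memory.mb":"4096","mapreduce.reduce.java.opts":"-Xmx3276m"}}]'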

Amazon S3 permission error:

If you see an error message similar to the following in the step's stderr log, then the S3DistCp task wasn't able to access Amazon S3 because of a permissions problem:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: REQUEST_ID

To resolve this problem, see Permissions errors.
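
In many cases, the fix is to make sure that the cluster's EC2 instance profile role can read the source bucket. As a minimal sketch, assuming the cluster uses the default instance profile role (EMR_EC2_DefaultRole) and the example source bucket, a command similar to the following attaches a read-only policy. The policy name is hypothetical:

$ aws iam put-role-policy \
    --role-name EMR_EC2_DefaultRole \
    --policy-name S3DistCpSourceReadAccess \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": "arn:aws:s3:::s3distcp-source"},
        {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::s3distcp-source/*"}
      ]
    }'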


Related information

View log files
