When I try to write data to Apache Hive tables located in an Amazon Simple Storage Service (Amazon S3) bucket using an Amazon EMR cluster, the query fails with one of the following errors. How do I resolve this? 

java.io.IOException: rename for src path ERROR

java.io.FileNotFoundException File s3://yourbucket/.hive-staging_hive_xxx_xxxx does not exist.

When you run INSERT INTO, INSERT OVERWRITE, or other PARTITION commands, Hive creates staging directories in the same S3 bucket as the table. Hive runs RENAME operations to write the staging query data to that S3 bucket.

The RENAME operation includes low-level S3 API calls such as HEAD, GET, and PUT. If Hive makes a HEAD or GET request to check whether a file exists before it creates that file, S3 provides only eventual consistency for the subsequent read-after-write, so the newly written file might not be visible immediately. When this happens, Hive can't rename the temporary scratch directory to the final output directory, and it throws an error such as java.io.IOException or java.io.FileNotFoundException. For more information, see Amazon S3 Data Consistency Model.
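
The failure mode described above can be illustrated with a toy model. This is only a sketch: the class ToyEventuallyConsistentStore and its methods are hypothetical stand-ins, not the S3 API, and the number of stale answers is an arbitrary assumption chosen to make the effect visible.

```python
class ToyEventuallyConsistentStore:
    """Toy stand-in for an object store. Hypothetical; NOT the S3 API."""

    def __init__(self, stale_reads=2):
        self._objects = {}         # key -> data that has actually been written
        self._negative_cache = {}  # key -> stale "not found" answers still to serve
        self._stale_reads = stale_reads

    def head(self, key):
        """Return True if the key is currently visible to the caller."""
        remaining = self._negative_cache.get(key, 0)
        if remaining > 0:
            # An earlier "not found" answer is still being served.
            self._negative_cache[key] = remaining - 1
            return False
        if key not in self._objects:
            # Checking before the object exists arms the stale answer.
            self._negative_cache[key] = self._stale_reads
            return False
        return True

    def put(self, key, data):
        self._objects[key] = data

    def rename(self, src, dst):
        """Move src to dst, but only if src is visible first."""
        if not self.head(src):
            raise FileNotFoundError(f"File {src} does not exist.")
        self._objects[dst] = self._objects.pop(src)


store = ToyEventuallyConsistentStore(stale_reads=2)
store.head("staging/000000_0")        # existence check BEFORE the write
store.put("staging/000000_0", b"rows")
try:
    store.rename("staging/000000_0", "final/000000_0")
except FileNotFoundError as error:
    print("rename failed:", error)
```

In this model, the same rename succeeds if the early head() call is omitted or if the caller retries until the stale answers are used up.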

Note: The following steps apply to Amazon EMR release version 3.2.1 or later. If your cluster uses Amazon EMR version 5.7.0 or earlier, we recommend upgrading to version 5.8 or later. Versions 5.8 and later include Hive 2.3.x. The java.io.IOException and java.io.FileNotFoundException errors can still happen in Hive 2.3.x, but only with S3 tables. These errors don't happen with HDFS tables, because Hive creates the staging directory in a strongly consistent HDFS location, rather than in the same directory as the table that you're querying.

1. Connect to the master node using SSH.

2. Locate the Hive error log at /mnt/var/log/hive/user/hadoop/hive.log on the master node, or the YARN application container logs under your S3 log URI, as shown in the following example. For more information, see View Log Files.

s3://your_log_bucket/elasticmapreduce/j-3ABCDEF2BALUG5/Containers/application_11234567890654_0001/

3. Look for error messages similar to the following. 

2018-05-09T11:53:28,837 ERROR [HiveServer2-Background-Pool: Thread-64([])]: ql.Driver (SessionState.java:printError(1097)) - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 6, vertexId=vertex_1525862550243_0001_1_03, diagnostics=[Vertex vertex_1525862550243_0001_1_03 [Map 6] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: r initializer failed, vertex=vertex_1525862550243_0001_1_03 [Map 6], java.io.FileNotFoundException: File s3://mybucket/folder/subfolder/subfolder/.hive-staging_hive_2018-04-25_09-36-30_835_6368934499747071892-1 does not exist. at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:972)
            
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to rename output from: s3://mybucket/demo.db/folder/ingestion_date=20180427/.hive-staging_hive_2017-04-27_13-52-51_942_3098569974412217069-5/_task_tmp.-ext-10000/_tmp.000000_2 to: s3://mybucket/demo.db/folder/ingestion_date=20180427/.hive-staging_hive_2017-10-27_13-52-51_942_3098569974412217069-5/_tmp.-ext-10000/000000_2  at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.commit(FileSinkOperator.java:247)
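
Searching a large log by eye is slow, so a short script can help. The following is a minimal sketch that scans log lines for the two signatures shown above. The function names, pattern labels, and regular expressions are assumptions written for this example; adjust the patterns if your log lines are formatted differently.

```python
import re

# Patterns modeled on the two sample messages above (illustrative only).
PATTERNS = {
    "staging_missing": re.compile(
        r"java\.io\.FileNotFoundException: "
        r"File (s3://\S*\.hive-staging_hive_\S+) does not exist"
    ),
    "rename_failed": re.compile(r"Unable to rename output from: (s3://\S+)"),
}


def find_staging_errors(lines):
    """Yield (line_number, kind, s3_path) for each line that matches a pattern."""
    for number, line in enumerate(lines, start=1):
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                yield number, kind, match.group(1)


def scan_file(path):
    """Return all matches found in one log file."""
    with open(path, errors="replace") as log:
        return list(find_staging_errors(log))
```

For example, scan_file("/mnt/var/log/hive/user/hadoop/hive.log") returns a list of matches for the Hive log on the master node, or an empty list if neither error appears.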

4. If either of these errors is in your logs, Hive made a HEAD request during the RENAME operation before the file was created. Enable EMRFS consistent view to resolve these errors. For more information, see Consistent View.

If neither of these errors is in your logs, see How can I use logs to troubleshoot issues with Hive queries in Amazon EMR?
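
As one possible way to enable consistent view when you create a cluster, the following sketch builds a configuration object for the emrfs-site classification. It assumes an Amazon EMR release that accepts configuration classifications and assumes that the fs.s3.consistent property controls consistent view; confirm both against the Consistent View documentation for your release. The function name is hypothetical.

```python
import json


def consistent_view_configuration():
    """Build an emrfs-site classification that turns consistent view on.

    Assumption: the cluster is created with configuration classifications
    and fs.s3.consistent is the enabling property for your release.
    """
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.consistent": "true"},
        }
    ]


# Print the JSON so that it can be saved to a file and supplied at cluster creation.
print(json.dumps(consistent_view_configuration(), indent=2))
```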

5. If you still get these errors after enabling consistent view, configure additional settings for consistent view. For example, if you see throttle events in the metrics for the Amazon DynamoDB table that consistent view created, increase the table's read and write capacity units by changing the following parameters in emrfs-site.xml.

fs.s3.consistent.metadata.read.capacity

fs.s3.consistent.metadata.write.capacity
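
To show what the two parameters above might look like in emrfs-site.xml, the following sketch renders them in the standard Hadoop property layout. The capacity values 600 and 300 are illustrative assumptions, not recommendations; choose values based on the throttling that you observe. The function name is hypothetical.

```python
from xml.sax.saxutils import escape


def render_properties(properties):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Illustrative values only (assumptions for this example).
TUNED_CAPACITY = {
    "fs.s3.consistent.metadata.read.capacity": 600,
    "fs.s3.consistent.metadata.write.capacity": 300,
}

print(render_properties(TUNED_CAPACITY))
```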

When a request fails because of java.io.FileNotFoundException or java.io.IOException, EMRFS retries the request based on the default values in emrfs-site.xml. EMRFS continues to retry the request until S3 is consistent or until the number of retries defined in fs.s3.consistent.retryCount is reached. If the retry count is reached before the operation succeeds, a ConsistencyException is thrown. To resolve this problem, increase the value of the fs.s3.consistent.retryCount parameter.
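
The effect of the retry count can be sketched with a bounded retry loop. This is not the EMRFS implementation: ConsistencyError, call_with_retries, and the fixed wait between attempts are assumptions made for this example, and Python's OSError stands in for the Java exceptions named above.

```python
import time


class ConsistencyError(Exception):
    """Hypothetical Python stand-in for the ConsistencyException named above."""


def call_with_retries(operation, retry_count, retry_period_seconds=0.0,
                      sleep=time.sleep):
    """Run operation(); on OSError, retry up to retry_count more times."""
    retries = 0
    while True:
        try:
            return operation()
        except OSError as error:  # FileNotFoundError is a subclass of OSError
            if retries >= retry_count:
                raise ConsistencyError(
                    f"gave up after {retries} retries"
                ) from error
            retries += 1
            sleep(retry_period_seconds)


# Example: an operation that becomes "consistent" on its third call.
state = {"calls": 0}


def read_staging_file():
    state["calls"] += 1
    if state["calls"] < 3:
        raise FileNotFoundError("staging file is not visible yet")
    return "visible"


print(call_with_retries(read_staging_file, retry_count=5))
```

With retry_count=1, the same operation raises ConsistencyError, because the operation needs two retries to succeed. This mirrors why raising the retry limit can resolve the error.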


Published: 2018-10-24