I am using Hive queries to export data from Amazon DynamoDB to Amazon S3. The Hive script has been running for several hours and still has not completed. I tried a larger EMR cluster, but Hive still creates only a few map tasks, and most of the EMR nodes sit idle. How can I get Hive to use more of the EMR cluster's resources?

This usually happens when DynamoDB read throughput is the bottleneck. EMR requires each map task to have at least 100 read capacity units, so the number of map tasks is capped by the read throughput available to Hive, not by the size of the cluster. Even with a large EMR cluster, if the DynamoDB read throughput is configured too low, only a few map tasks are created, and the Hive query can run for several hours or more.

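Check the DynamoDB read throughput and the 'dynamodb.throughput.read.percent' configuration in Hive. If you need x map tasks, make sure that (DynamoDB read throughput) * (dynamodb.throughput.read.percent) / 100 > x. The division by 100 reflects the 100 read capacity units that each map task requires; dynamodb.throughput.read.percent is a fraction (0.5 by default), not a value out of 100. For additional information, see Hive Options in the Amazon EMR Developer Guide.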
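
The inequality above can be checked with a few lines of arithmetic before resizing the table. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that dynamodb.throughput.read.percent is expressed as a fraction (for example 0.5) and that the 100 read capacity units per map task stated above applies to your EMR release.

```python
import math

# Assumption from the answer above: each map task needs at least
# 100 DynamoDB read capacity units.
RCU_PER_MAP_TASK = 100


def supports_map_tasks(read_capacity_units, read_percent, map_tasks):
    """Return True if throughput * percent / 100 > map_tasks holds.

    read_percent is assumed to be a fraction such as 0.5, not 50.
    """
    return read_capacity_units * read_percent / RCU_PER_MAP_TASK > map_tasks


def min_read_capacity(map_tasks, read_percent):
    """Smallest whole number of read capacity units that satisfies
    the strict inequality for the requested number of map tasks."""
    return math.floor(map_tasks * RCU_PER_MAP_TASK / read_percent) + 1


if __name__ == "__main__":
    # Hypothetical table: 1,000 read capacity units, read percent 0.5.
    print(supports_map_tasks(1000, 0.5, 4))   # enough for 4 map tasks?
    print(supports_map_tasks(1000, 0.5, 20))  # enough for 20 map tasks?
    print(min_read_capacity(20, 0.5))         # throughput needed for 20
```

If the check fails for the number of map tasks you want, either raise the table's provisioned read throughput or raise dynamodb.throughput.read.percent in the Hive session, keeping in mind that a higher percentage leaves less read capacity for other applications that use the table.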

EMR, Hive, DynamoDB, S3, Simple Storage Service


Published: 2016-08-26