When I run a Hive query against a DynamoDB table, my query takes a long time to complete
Last updated: 2019-06-19
I'm using Amazon EMR to run Apache Hive queries against an Amazon DynamoDB table. The query has been running for several hours and is still not finished. How can I speed up the query?
A slow query usually means that the DynamoDB table doesn't have enough provisioned read capacity units. The number of read capacity units that you need depends on how much data is in the table and how quickly you want the query to run. For more information, see Provisioned Read Capacity Units.
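The relationship between table size, read capacity, and runtime can be estimated with simple arithmetic. The following is a rough sketch, not an exact model: it assumes that one read capacity unit covers one strongly consistent read of up to 4 KB per second, and that the query scans the whole table. The function names, the 20 GiB table size, and the 100 read capacity units are illustrative assumptions. The `read_fraction` parameter is also an assumption: it stands for the share of provisioned capacity that the query is allowed to consume (on Amazon EMR this share is typically governed by the `dynamodb.throughput.read.percent` Hive setting).

```python
import math

# Assumption: one read capacity unit (RCU) covers one strongly consistent
# read of up to 4 KB per second.
BYTES_PER_RCU = 4096


def estimate_scan_seconds(table_bytes, read_capacity_units, read_fraction=1.0):
    """Roughly estimate how long a full-table read takes, in seconds.

    read_fraction is a hypothetical knob for the share of provisioned
    capacity that the query may use (1.0 means all of it).
    """
    if read_capacity_units <= 0 or read_fraction <= 0:
        raise ValueError("capacity and read_fraction must be positive")
    usable_bytes_per_second = read_capacity_units * read_fraction * BYTES_PER_RCU
    return table_bytes / usable_bytes_per_second


def rcu_for_target(table_bytes, target_seconds, read_fraction=1.0):
    """Estimate the RCUs needed to read the table within target_seconds."""
    if target_seconds <= 0 or read_fraction <= 0:
        raise ValueError("target_seconds and read_fraction must be positive")
    return math.ceil(table_bytes / (target_seconds * read_fraction * BYTES_PER_RCU))


# Illustrative numbers only: a 20 GiB table with 100 RCUs.
table_bytes = 20 * 1024**3
seconds = estimate_scan_seconds(table_bytes, 100)
print(round(seconds / 3600, 2))           # 14.56 (hours)
print(rcu_for_target(table_bytes, 3600))  # 1457 RCUs to finish in about one hour
```

Under these assumptions, a query that runs for many hours against a table of this size points to read capacity as the bottleneck rather than to the query itself.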
To reduce query runtime, add more read capacity units to the source DynamoDB table:
1. Open the DynamoDB console.
2. Choose your table, and then choose the Metrics tab.
3. Find the Throttled read events graph, which corresponds to the ReadThrottleEvents Amazon CloudWatch metric. If there's a spike on the graph, it's probably because you don't have enough read capacity units provisioned for your table.
4. Choose the Capacity tab.
5. Increase the number of Read capacity units, and then choose Save. You can use the capacity calculator to estimate monthly charges for the number of read capacity units that you choose.
Note: Depending on how many read capacity units you add, you might need to add more mapper daemons to your Amazon EMR cluster. Each mapper daemon can process 250 read capacity units per second.
6. Run your Hive query again.
7. Check the Throttled read events graph again. If there's no spike but the query still takes a long time to complete, there might be an issue with your Amazon EMR cluster. For more information, see How can I use logs to troubleshoot issues with Hive queries in Amazon EMR?
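The console steps above can also be scripted. The following is a minimal sketch using the AWS SDK for Python (boto3), assuming that the table uses provisioned capacity mode and that your credentials and Region are already configured. The helper names are hypothetical, as are the three-hour lookback window and the five-minute period. The 250 read capacity units per mapper figure comes from the note in step 5.

```python
import math
from datetime import datetime, timedelta, timezone

RCU_PER_MAPPER = 250  # figure quoted in the note in step 5


def throttle_metric_request(table_name, hours=3, period=300):
    """Build CloudWatch GetMetricStatistics parameters for ReadThrottleEvents."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ReadThrottleEvents",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Sum"],
    }


def throttled_periods(datapoints):
    """Return (timestamp, count) pairs for periods with throttled reads."""
    return sorted(
        (p["Timestamp"], p["Sum"]) for p in datapoints if p.get("Sum", 0) > 0
    )


def capacity_update_request(table_name, read_units, write_units):
    """Build UpdateTable parameters; both capacity values must be supplied."""
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def mappers_needed(read_units):
    """Estimate mapper daemons needed to consume the provisioned read capacity."""
    return math.ceil(read_units / RCU_PER_MAPPER)


def raise_read_capacity(table_name, new_read_units):
    """Check for read throttling and, if found, raise the read capacity."""
    import boto3  # imported here so the helpers above work without the SDK

    cloudwatch = boto3.client("cloudwatch")
    dynamodb = boto3.client("dynamodb")

    stats = cloudwatch.get_metric_statistics(**throttle_metric_request(table_name))
    spikes = throttled_periods(stats["Datapoints"])
    if not spikes:
        print("No throttled reads found; check the Amazon EMR cluster instead.")
        return

    # Keep the current write capacity unchanged.
    current = dynamodb.describe_table(TableName=table_name)["Table"][
        "ProvisionedThroughput"
    ]
    dynamodb.update_table(
        **capacity_update_request(
            table_name, new_read_units, current["WriteCapacityUnits"]
        )
    )
    print(f"Requested {new_read_units} RCUs; "
          f"plan for about {mappers_needed(new_read_units)} mapper daemons.")
```

For example, `raise_read_capacity("MyTable", 1000)` (with a placeholder table name) would request 1,000 read capacity units if throttling was recorded in the lookback window, and report that about four mapper daemons are needed to use that capacity. After the update completes, rerun the Hive query and check the metric again.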