Magesh walks you through running concurrent EMR jobs using AWS Data Pipeline.


I want to maximize EMR cluster utilization by running concurrent Hadoop jobs using AWS Data Pipeline with a fair or capacity scheduler, instead of using serialized steps. How do I do that?

AWS Data Pipeline supports parallel or concurrent job submission using HadoopActivity. You can choose either a fair scheduler or a capacity scheduler to maximize cluster resource utilization; the better choice depends on your use case.

The following example uses the Amazon DynamoDB storage handler, a MapReduce application that imports and exports DynamoDB tables, to export a specified table to Amazon S3 with HadoopActivity.

Note: The DynamoDB table, EMR cluster, and Amazon S3 bucket used for this backup must be in the same AWS Region.

You can define the pipeline either with the JSON syntax described in HadoopActivity (see the sketch after the console steps below), or in the AWS Data Pipeline console:

  1. Sign in to the Data Pipeline console and choose Create Pipeline.
  2. Fill in the fields with the following:
    For Name, add something meaningful to you.
    For Source, choose Build using Architect.
    For Run, choose On pipeline activation.
    For Logging, add an S3 location to copy execution logs to, or choose Disabled.
  3. Choose Edit in Architect.
  4. Choose Add, and select HadoopActivity.
  5. Fill in the Jar URI field with s3://dynamodb-emr-<region>/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar, replacing <region> with the Region that your resources are in.
  6. From the Add an optional field menu, choose Argument. Repeat this so that there is one Argument field for each of the four values listed in the next step.
  7. Fill in the newly created Argument fields with the following values, in order, adding the names and locations of your resources as necessary:
    [org.apache.hadoop.dynamodb.tools.DynamoDbExport, <s3 output directory path>, <table name>, 0.25]
  8. From the Add an optional field menu, choose Runs On.
  9. Open the newly created Runs On drop-down menu, and choose Create new: EMR Cluster.
  10. Repeat steps 4 through 9 for each additional DynamoDB table that you want to back up, changing the table name and output path in the arguments and, in the Runs On field, choosing the EMR cluster you already created instead of creating a new one.
  11. Open the Resource drop-down on the right side of the screen and choose the EmrCluster panel.
  12. From the Add an optional field menu, choose the following options and settings:
    Choose Release Label, and enter emr-4.7.2.
    Choose Master Instance Type, and enter an instance size that meets your needs.
    Choose Core Instance Type, and enter an instance size that meets your needs.
    Choose Hadoop Scheduler Type, and enter PARALLEL_CAPACITY_SCHEDULING or PARALLEL_FAIR_SCHEDULING, depending on whether you want a capacity scheduler or a fair scheduler.
  13. Choose Activate.
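
If you prefer the JSON route mentioned earlier, the following is a minimal sketch of an equivalent pipeline definition. The bucket names, table names, instance types, IAM role names, and the us-east-1 Region in the jar path are placeholders for illustration; substitute your own values and check the field names against the HadoopActivity and EmrCluster object references. Note that both HadoopActivity objects reference the same EmrCluster resource, which is what lets the jobs run concurrently under the chosen scheduler.

    {
      "objects": [
        {
          "id": "Default",
          "name": "Default",
          "scheduleType": "ONDEMAND",
          "failureAndRerunMode": "CASCADE",
          "role": "DataPipelineDefaultRole",
          "resourceRole": "DataPipelineDefaultResourceRole",
          "pipelineLogUri": "s3://my-log-bucket/logs/"
        },
        {
          "id": "EmrClusterForBackup",
          "name": "EmrClusterForBackup",
          "type": "EmrCluster",
          "releaseLabel": "emr-4.7.2",
          "masterInstanceType": "m3.xlarge",
          "coreInstanceType": "m3.xlarge",
          "coreInstanceCount": "2",
          "hadoopSchedulerType": "PARALLEL_FAIR_SCHEDULING"
        },
        {
          "id": "BackupTableOne",
          "name": "BackupTableOne",
          "type": "HadoopActivity",
          "runsOn": { "ref": "EmrClusterForBackup" },
          "jarUri": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
          "argument": [
            "org.apache.hadoop.dynamodb.tools.DynamoDbExport",
            "s3://my-backup-bucket/TableOne/",
            "TableOne",
            "0.25"
          ]
        },
        {
          "id": "BackupTableTwo",
          "name": "BackupTableTwo",
          "type": "HadoopActivity",
          "runsOn": { "ref": "EmrClusterForBackup" },
          "jarUri": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
          "argument": [
            "org.apache.hadoop.dynamodb.tools.DynamoDbExport",
            "s3://my-backup-bucket/TableTwo/",
            "TableTwo",
            "0.25"
          ]
        }
      ]
    }

With a definition file like this, you can create and activate the pipeline from the AWS CLI using aws datapipeline create-pipeline, put-pipeline-definition, and activate-pipeline. The last argument (0.25) is the read throughput ratio that the export tool is allowed to consume against the table.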

If your EMR cluster has an attached EC2 key pair, you can log in to the master node and run the yarn application -list command to see the number of running applications.
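
For example, assuming a key pair file named my-key.pem and a placeholder master public DNS name, the check might look like this:

    # Connect to the EMR master node (key file and hostname are placeholders)
    ssh -i my-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

    # List YARN applications; with concurrent HadoopActivity jobs, more than
    # one application should appear in the RUNNING state
    yarn application -list -appStates RUNNING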


Published: 2016-10-04