How can I export multiple Amazon DynamoDB tables to Amazon Simple Storage Service (Amazon S3) using AWS Data Pipeline? I don't want to create multiple pipelines.

When you use the Export DynamoDB table to S3 template, you must create a separate pipeline for each table that you want to back up. To export multiple DynamoDB tables to Amazon S3 using one pipeline, use the HadoopActivity object to submit concurrent Amazon EMR jobs. To maximize resource usage on your Amazon EMR cluster, use either the FairScheduler or the CapacityScheduler, whichever works best for your use case. The steps below build the pipeline in Architect; a scripted version of the same definition follows the steps.

  1. Sign in to the Data Pipeline console.
  2. Choose Create new pipeline, and then complete the following fields:
    For Name, enter a name.
    For Source, choose Build using Architect.
    For Run, choose on pipeline activation.
    For Logging, choose Enabled or Disabled, depending on your use case.
  3. Choose Edit in Architect.
  4. Choose Add in the upper-left corner, and then choose HadoopActivity.
  5. Open the Activities section and find the HadoopActivity object. It's called something like "DefaultHadoopActivity1."
  6. For Jar URI, enter s3://dynamodb-emr-Region/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar. Replace Region with the Region that your DynamoDB table is in, such as us-east-1.
  7. In the Add an optional field list, choose Argument. Repeat this step three times to create a total of four Argument fields.
  8. In the first Argument field, enter org.apache.hadoop.dynamodb.tools.DynamoDbExport.
  9. In the second Argument field, enter the Amazon S3 path that you want to export the DynamoDB table to.
    Note: The DynamoDB tables and S3 bucket must be in the same AWS Region.
  10. In the third Argument field, enter the name of your DynamoDB table (for example, Users).
  11. In the fourth Argument field, enter a value between 0.1 and 1.0. This is the DynamoDB read throughput ratio, which controls how much of the table's provisioned read capacity the export job can consume.
  12. In the Add an optional field list, choose Runs On.
  13. In the Runs On list, choose Create new: EmrCluster.
  14. Repeat steps 4-12 for each DynamoDB table that you want to export. For each additional HadoopActivity, in the Runs On list, choose the EMR cluster that you created in step 13 so that the jobs run concurrently on the same cluster, or create a new cluster if you prefer to separate the jobs.
  15. Open the Resources section and find the Amazon EMR cluster object. It's called something like "DefaultEmrCluster1."
  16. Add the following fields from the Add an optional field list:
    Choose Release Label, and enter the Amazon EMR release version that you want to use, such as emr-5.20.0. For more information, see About Amazon EMR Releases.
    Choose Master Instance Type, and enter an instance size that fits your use case, such as m5.xlarge.
    Choose Core Instance Type, and enter an instance size that fits your use case.
    Choose Hadoop Scheduler Type, and enter PARALLEL_CAPACITY_SCHEDULING or PARALLEL_FAIR_SCHEDULING, depending on how you want the cluster resources to be distributed. For more information, see EmrCluster.
  17. Repeat step 16 for each Amazon EMR cluster in the Resources section.
  18. Choose Save in the upper-left corner, and then choose Activate.
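
If you prefer to script the pipeline rather than build it in Architect, you can submit the same definition through the AWS SDK. The following is a minimal sketch using the AWS SDK for Python (Boto3); the table names, bucket, instance types, and object IDs are placeholders for illustration, and it assumes that the default AWS Data Pipeline IAM roles (DataPipelineDefaultRole and DefaultResourceRole) exist in your account.

    import boto3

    REGION = "us-east-1"                    # Region of your tables and bucket (placeholder)
    TABLES = ["Users", "Orders"]            # hypothetical table names
    EXPORT_PATH = "s3://my-export-bucket"   # hypothetical bucket; same Region as the tables
    READ_RATIO = "0.25"                     # read throughput ratio, between 0.1 and 1.0

    client = boto3.client("datapipeline", region_name=REGION)

    # The Default object makes the pipeline run on activation. To enable
    # logging, add a pipelineLogUri field that points to an S3 location.
    pipeline_objects = [
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ONDEMAND"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DefaultResourceRole"},
            ],
        },
        # One EMR cluster shared by all of the activities, with the fair
        # scheduler dividing its resources among the concurrent jobs.
        {
            "id": "ExportCluster",
            "name": "ExportCluster",
            "fields": [
                {"key": "type", "stringValue": "EmrCluster"},
                {"key": "releaseLabel", "stringValue": "emr-5.20.0"},
                {"key": "masterInstanceType", "stringValue": "m5.xlarge"},
                {"key": "coreInstanceType", "stringValue": "m5.xlarge"},
                {"key": "coreInstanceCount", "stringValue": "2"},
                {"key": "hadoopSchedulerType", "stringValue": "PARALLEL_FAIR_SCHEDULING"},
            ],
        },
    ]

    # One HadoopActivity per table, mirroring steps 6-13: the jar, the
    # DynamoDbExport main class, the S3 output path, the table name, and
    # the read throughput ratio, all running on the shared cluster.
    for i, table in enumerate(TABLES, start=1):
        pipeline_objects.append(
            {
                "id": f"ExportActivity{i}",
                "name": f"Export-{table}",
                "fields": [
                    {"key": "type", "stringValue": "HadoopActivity"},
                    {
                        "key": "jarUri",
                        "stringValue": f"s3://dynamodb-emr-{REGION}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar",
                    },
                    {"key": "argument", "stringValue": "org.apache.hadoop.dynamodb.tools.DynamoDbExport"},
                    {"key": "argument", "stringValue": f"{EXPORT_PATH}/{table}/"},
                    {"key": "argument", "stringValue": table},
                    {"key": "argument", "stringValue": READ_RATIO},
                    {"key": "runsOn", "refValue": "ExportCluster"},
                ],
            }
        )

    pipeline_id = client.create_pipeline(
        name="export-dynamodb-tables", uniqueId="export-dynamodb-tables"
    )["pipelineId"]
    client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
    client.activate_pipeline(pipelineId=pipeline_id)

Because every HadoopActivity references the same EmrCluster through runsOn, the export jobs are submitted to one cluster, and the scheduler that you chose distributes the cluster's resources among them.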

If your Amazon EMR cluster has an attached Amazon Elastic Compute Cloud (Amazon EC2) key pair, you can connect to the master node using SSH and run yarn application -list to see how many applications are running.
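
You can also check the status of each export from the AWS Data Pipeline API without connecting to the cluster. A short sketch, assuming the pipeline ID from the previous example:

    import boto3

    client = boto3.client("datapipeline", region_name="us-east-1")
    pipeline_id = "df-XXXXXXXXXXXXX"  # hypothetical; use your own pipeline ID

    # List this pipeline's object instances, then print the status of
    # each one (for example, RUNNING or FINISHED).
    instance_ids = client.query_objects(pipelineId=pipeline_id, sphere="INSTANCE")["ids"]
    if instance_ids:
        objects = client.describe_objects(
            pipelineId=pipeline_id, objectIds=instance_ids
        )["pipelineObjects"]
        for obj in objects:
            status = next(
                (f["stringValue"] for f in obj["fields"] if f["key"] == "@status"),
                "UNKNOWN",
            )
            print(obj["name"], status)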


Did this page help you? Yes | No

Back to the AWS Support Knowledge Center

Need help? Visit the AWS Support Center

Published: 2016-10-04

Updated: 2019-02-26