Run Parallel Hadoop Jobs On Your Amazon EMR Cluster Using AWS Data Pipeline

Posted on: Jun 1, 2015

You can now run Hadoop jobs in parallel on your Amazon Elastic MapReduce (Amazon EMR) clusters from AWS Data Pipeline, which lets you significantly increase the utilization of your cluster. Using HadoopActivity, you can choose the fair scheduler or the capacity scheduler on your Amazon EMR cluster and submit work to the cluster. HadoopActivity lets you take advantage of scheduler pools on the cluster and assign jobs to specific queues. It provides job-level monitoring, direct access to the Hadoop logs, and the ability to cancel and re-run a single job. To learn more about using HadoopActivity, please visit our documentation.
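As a rough illustration of the paragraph above, the following Python sketch builds a pipeline definition in which two HadoopActivity objects run on one cluster and are assigned to different scheduler queues. The field names (hadoopSchedulerType, hadoopQueue, jarUri, mainClass) follow the Data Pipeline object reference as we understand it and should be checked against the documentation. The bucket, JAR paths, class names, queue names, and object ids are hypothetical placeholders.

```python
import json

# Hypothetical cluster object; the scheduler type value is an assumption
# taken from the EmrCluster reference and should be verified.
cluster = {
    "id": "MyEmrCluster",
    "type": "EmrCluster",
    "hadoopSchedulerType": "PARALLEL_FAIR_SCHEDULING",
}


def hadoop_activity(activity_id, jar_uri, main_class, queue):
    """Build one HadoopActivity object that targets a specific scheduler queue."""
    return {
        "id": activity_id,
        "type": "HadoopActivity",
        "runsOn": {"ref": cluster["id"]},  # both jobs share the same cluster
        "jarUri": jar_uri,
        "mainClass": main_class,
        "hadoopQueue": queue,
    }


# Placeholder JAR locations, classes, and queue names, for illustration only.
report_job = hadoop_activity(
    "ReportJob", "s3://example-bucket/jars/report.jar", "com.example.Report", "reports"
)
etl_job = hadoop_activity(
    "EtlJob", "s3://example-bucket/jars/etl.jar", "com.example.Etl", "etl"
)

definition = {"objects": [cluster, report_job, etl_job]}
print(json.dumps(definition, indent=2))
```

Because neither activity depends on the other, the scheduler on the cluster is free to run them side by side. A real definition would also need a schedule, roles, and the remaining cluster settings, which are omitted here.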

Additionally, for Amazon EMR clusters launched via AWS Data Pipeline, you can now use Spot instances for the core nodes, specify an availability zone, and configure custom security groups. To learn more about the configuration options on the EMRCluster object, please visit our documentation.
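The cluster options mentioned above might be expressed on the cluster object roughly as follows. This is a sketch only: the field names (coreInstanceBidPrice, availabilityZone, emrManagedMasterSecurityGroupId, emrManagedSlaveSecurityGroupId) are our reading of the EMRCluster object reference and should be confirmed there, and every value shown is a hypothetical placeholder.

```python
import json

# Hypothetical EmrCluster object; all values are placeholders.
spot_cluster = {
    "id": "MySpotEmrCluster",
    "type": "EmrCluster",
    "coreInstanceType": "m3.xlarge",
    "coreInstanceCount": "2",
    # Setting a bid price is what requests Spot capacity for the core nodes.
    "coreInstanceBidPrice": "0.10",
    # Launch the cluster in a specific availability zone.
    "availabilityZone": "us-east-1a",
    # Custom security groups for the master and the core/task nodes.
    "emrManagedMasterSecurityGroupId": "sg-00000000",
    "emrManagedSlaveSecurityGroupId": "sg-11111111",
}

print(json.dumps(spot_cluster, indent=2))
```

Field values are written as strings, matching how pipeline definitions are usually serialized to JSON.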