Posted On: Mar 14, 2016

You can now use Apache Sqoop 1.4.6, Apache HCatalog 1.0.0, an upgraded version of Apache Mahout (0.11.1), and upgraded sandbox releases of Presto (0.136) and Apache Zeppelin (0.5.6) on Amazon EMR release 4.4.0. Sqoop allows your Apache Hadoop MapReduce jobs (including Apache Hive and Apache Pig on MapReduce) to interact in parallel with SQL databases through JDBC. Mahout 0.11.1 now supports running your applications on Apache Spark. Zeppelin 0.5.6 includes GitHub integration and import/export support for Zeppelin notebooks.

Additionally, Apache Spark is now configured with improved default settings for executors on the nodes in your cluster. Dynamic allocation of executors is now enabled by default, and Amazon EMR configures the memory per executor at cluster creation based on the Amazon EC2 instance family of your core instance group. You can still override these defaults by using a configuration object or by passing additional parameters when submitting your Spark application with spark-submit (see the example below).

Lastly, you can now use Java Development Kit 8 (JDK 8) for your runtime environment (the default for your cluster is JDK 7). However, please note that JDK 8 is not compatible with Hive.
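As an illustration of overriding the new Spark defaults, you can disable dynamic allocation and size the executors yourself when submitting an application. This is a minimal sketch: the executor count, memory, and core values are illustrative rather than recommendations, and my-application.jar is a placeholder for your own application.

    # Disable dynamic allocation and set explicit executor resources (illustrative values)
    spark-submit --deploy-mode cluster \
      --conf spark.dynamicAllocation.enabled=false \
      --num-executors 10 \
      --executor-memory 4g \
      --executor-cores 2 \
      my-application.jar

If you leave dynamic allocation enabled, --num-executors is unnecessary, because Spark grows and shrinks the executor count with the workload.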

You can create an Amazon EMR cluster with release 4.4.0 by choosing release label “emr-4.4.0” from the AWS Management Console, AWS CLI, or SDK (an example CLI command appears below). You can specify Sqoop, HCatalog, Mahout, Presto-Sandbox, and Zeppelin-Sandbox to install these applications on your cluster. You can use JDK 8 by setting JAVA_HOME to the JDK 8 location in the relevant environment variables for applications on your cluster. Please visit the Amazon EMR documentation for more information about release 4.4.0, Sqoop 1.4.6, HCatalog 1.0.0, Mahout 0.11.1, Presto 0.136, Zeppelin 0.5.6, the improved default Spark settings, and using JDK 8.
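For example, a minimal AWS CLI sketch for creating a cluster with these applications might look like the following; the cluster name, instance type, and instance count are placeholder values, not recommendations:

    # Create an EMR 4.4.0 cluster with the newly available applications (illustrative settings)
    aws emr create-cluster --name "MyEmr440Cluster" \
      --release-label emr-4.4.0 \
      --applications Name=Sqoop Name=HCatalog Name=Mahout Name=Presto-Sandbox Name=Zeppelin-Sandbox \
      --use-default-roles \
      --instance-type m3.xlarge \
      --instance-count 3

To point applications at JDK 8, you can add a configuration object that exports JAVA_HOME. The sketch below uses the hadoop-env and spark-env classifications; /usr/lib/jvm/java-1.8.0 is the typical JDK 8 location on EMR instances, but you should verify the path on your cluster:

    --configurations '[
      {
        "Classification": "hadoop-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": { "JAVA_HOME": "/usr/lib/jvm/java-1.8.0" }
          }
        ],
        "Properties": {}
      },
      {
        "Classification": "spark-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": { "JAVA_HOME": "/usr/lib/jvm/java-1.8.0" }
          }
        ],
        "Properties": {}
      }
    ]'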