AWS Blog

Amazon EMR 4.4.0 – Sqoop, HCatalog, Java 8, and More

by Jeff Barr | on | in Amazon EMR | | Comments

Rob Leidle, Development Manager for Amazon EMR, wrote the guest post below to introduce you to the latest and greatest version!

Jeff;

Today we are announcing Amazon EMR release 4.4.0, which adds support for Apache Sqoop (1.4.6) and Apache HCatalog 1.0.0, an upgraded release of Apache Mahout (0.11.1), and upgraded sandbox releases for Presto (0.136) and Apache Zeppelin (0.5.6). We have also enhanced our default Apache Spark settings and added support for Java 8.

New Applications in Release 4.4.0
Amazon EMR provides an easy way to install and configure distributed big data applications in the Hadoop and Spark ecosystems on managed clusters of Amazon EC2 instances. You can create Amazon EMR clusters from the Amazon EMR Create Cluster Page in the AWS Management Console, AWS Command Line Interface (CLI), or using a SDK with an EMR API. In the latest release, we added support for several new versions of the following applications:

  • Zeppelin 0.5.6 – Zeppelin is an open-source interactive and collaborative notebook for data exploration using Spark. Zeppelin 0.5.6 adds the ability to import or export a notebook, notebook storage in GitHub, auto-save on navigation, and better Pyspark support. View the Zeppelin release notes or learn more about Zeppelin on Amazon EMR.
  • Presto 0.136 – Presto is an open-source, distributed SQL query engine designed for low-latency queries on large datasets in Amazon S3 and HDFS. This is a minor version release, with support for larger arrays, SQL binary literals, the ability to call connector-defined procedures, and improvements to the web interface. View the Presto release notes or learn more about Presto on Amazon EMR.
  • Sqoop 1.4.6 – Sqoop is a tool for transferring bulk data between HDFS, S3 (using EMRFS), and structured datastores such as relational databases.  You can use Sqoop to transfer structured data from RDS and Aurora to EMR for processing, and write out results back to S3, HDFS, or another database. Learn more about Sqoop on Amazon EMR.
  • Mahout 0.11.1 – Mahout is a collection of tools and libraries for building distributed machine learning applications. This release includes support for Spark as well as a new math environment based on Spark named SamsaraLearn more about Mahout on Amazon EMR.
  • HCatalog 1.0.0 – HCatalog is a sub-project within the Apache Hive project. It is a table and storage management layer for Hadoop which utilizes the Hive Metastore. It enables tools to execute SQL on Hadoop through an easy to use REST interface.

Enhancements to the default settings for Spark
We have improved our default configuration for Spark executors from the Apache defaults to better utilize resources on your cluster. Starting with release 4.4.0, EMR has enabled dynamic allocation of executors by default, which lets YARN determine how many executors to utilize when running a Spark application. Additionally, the amount of memory used for each executor is now automatically determined by the instance family used for your cluster’s core instance group.

Enabling dynamic allocation and customizing the executor memory allows Spark to utilize all resources on your cluster, place additional executors on nodes added to your cluster, and better allow for multitenancy for Spark applications. The previous maximizeResourceAllocation parameter is still available. However, this doesn’t use dynamic allocation, and specifies a static number of executors for your Spark application. You can also still override the new defaults by using the configuration API or passing additional parameters when submitting your Spark application using spark-submit. Learn more about Spark configuration on Amazon EMR.

Using Java 8 with your applications on Amazon EMR
By default, applications on your Amazon EMR cluster use the Java Development Kit 7 (JDK 7) for their runtime environment. However, on release 4.4.0, you can use JDK 8 by setting JAVA_HOME to point to JDK 8 for the relevant environment variables using a configuration object (though please note that JDK 8 is not compatible with Apache Hive). Learn more about using Java 8 on Amazon EMR.

Launch an Amazon EMR Cluster with Release 4.4.0 Today
To create an Amazon EMR cluster with 4.4.0, select release 4.4.0 on the Create Cluster page in the AWS Management Console, or use the release label emr-4.4.0 when creating your cluster from the AWS CLI or using a SDK with the EMR API.

— Rob Leidle – Development Manager, Amazon EMR