AWS News Blog

Amazon EMR Update – Apache Spark 1.5.2, Ganglia, Presto, Zeppelin, and Oozie

My colleague Jon Fritz wrote the guest post below to introduce you to the newest version of Amazon EMR.

Jeff;


Today we are announcing Amazon EMR release 4.2.0, which adds support for Apache Spark 1.5.2, Ganglia 3.6 for Apache Hadoop and Spark monitoring, and new sandbox releases for Presto (0.125), Apache Zeppelin (0.5.5), and Apache Oozie (4.2.0).

New Applications in Release 4.2.0
Amazon EMR provides an easy way to install and configure distributed big data applications in the Hadoop and Spark ecosystems on managed clusters of Amazon EC2 instances. You can create Amazon EMR clusters from the Amazon EMR Create Cluster Page in the AWS Management Console, AWS Command Line Interface (CLI), or using a SDK with EMR API. In the latest release, we added support for several new versions of applications:

  • Spark 1.5.2 – Spark 1.5.2 was released on November 9th, and we’re happy to give you access to it within two weeks of general availability. This version is a maintenance release, with improvements to Spark SQL, SparkR, the DataFrame API, and miscellaneous enhancements and bug fixes. Also, Spark documentation now includes information on enabling wire encryption for the block transfer service. For a complete set of changes, view the JIRA. To learn more about Spark on Amazon EMR, click here.
  • Ganglia 3.6 – Ganglia is a scalable, distributed monitoring system which can be installed on your Amazon EMR cluster to display Amazon EC2 instance level metrics which are also aggregated at the cluster level. We also configure Ganglia to ingest and display Hadoop and Spark metrics along with general resource utilization information from instances in your cluster, and metrics are displayed in a variety of time spans. You can view these metrics using the Ganglia web-UI on the master node of your Amazon EMR cluster. To learn more about Ganglia on Amazon EMR, click here.
  • Presto 0.125 – Presto is an open-source, distributed SQL query engine designed for low-latency queries on large datasets in Amazon S3 and the Hadoop Distributed Filesystem (HDFS). Presto 0.125 is a maintenance release, with optimizations to SQL operations, performance enhancements, and general bug fixes. To learn more about Presto on Amazon EMR, click here.
  • Zeppelin 0.5.5 – Zeppelin is an open-source interactive and collaborative notebook for data exploration using Spark. You can use Scala, Python, SQL, or HiveQL to manipulate data and visualize results. Zeppelin 0.5.5 is a maintenance release, and contains miscellaneous improvements and bug fixes. To learn more about Zeppelin on Amazon EMR, click here.
  • Oozie 4.2.0 – Oozie is a workflow designer and scheduler for Hadoop and Spark. This version now includes Spark and HiveServer2 actions, making it easier to incorporate Spark and Hive jobs in Oozie workflows. Also, you can create and manage your Oozie workflows using the Oozie Editor and Dashboard in Hue, an application which offers a web-UI for Hive, Pig, and Oozie. Please note that in Hue 3.7.1, you must still use Shell actions to run Spark jobs. To learn more about Oozie in Amazon EMR, click here.

Launch an Amazon EMR Cluster with Release 4.2.0 Today
To create an Amazon EMR cluster with 4.2.0, select release 4.2.0 on the Create Cluster page in the AWS Management Console, or use the release label emr-4.2.0 when creating your cluster from the AWS CLI or using a SDK with the EMR API.

Jon Fritz, Senior Product Manager