EMR 4.3.0 – New & Updated Applications + Command Line Export
My colleague Jon Fritz wrote the blog post below to introduce you to some new features of Amazon EMR.
Today we are announcing Amazon EMR release 4.3.0, which adds support for Apache Hadoop 2.7.1, Apache Spark 1.6.0, Ganglia 3.7.2, and a new sandbox release for Presto (0.130). We have also enhanced our maximizeResourceAllocation setting for Spark and added an AWS CLI Export feature to generate a create-cluster command from the Cluster Details page in the AWS Management Console.
New Applications in Release 4.3.0
Amazon EMR provides an easy way to install and configure distributed big data applications in the Hadoop and Spark ecosystems on managed clusters of Amazon EC2 instances. You can create Amazon EMR clusters from the Amazon EMR Create Cluster page in the AWS Management Console, from the AWS Command Line Interface (CLI), or by using an SDK with the EMR API. In the latest release, we added support for new versions of the following applications:
- Spark 1.6.0 – Spark 1.6.0 was released on January 4th by the Apache Software Foundation, and we’re excited to include it in Amazon EMR within four weeks of open source GA. This release includes several new features, such as compile-time type safety using the Dataset API (SPARK-9999), machine learning pipeline persistence using the Spark ML Pipeline API (SPARK-6725), a variety of new machine learning algorithms in Spark ML, and automatic memory management between execution and cache memory in executors (SPARK-10000). View the release notes or learn more about Spark on Amazon EMR.
- Presto 0.130 – Presto is an open-source, distributed SQL query engine designed for low-latency queries on large datasets in Amazon S3 and HDFS. This is a minor version release, with optimizations to SQL operations and support for S3 server-side and client-side encryption in the PrestoS3Filesystem. View the release notes or learn more about Presto on Amazon EMR.
- Hadoop 2.7.1 – This release includes improvements to and bug fixes in YARN, HDFS, and MapReduce. Highlights include enhancements to FileOutputCommitter to increase performance of MapReduce jobs with many output files (MAPREDUCE-4814) and adding support in HDFS for truncate (HDFS-3107) and files with variable-length blocks (HDFS-3689). View the release notes or learn more about Amazon EMR.
- Ganglia 3.7.2 – This release includes new features such as building custom dashboards using Ganglia Views, setting events, and creating new aggregate graphs of metrics. Learn more about Ganglia on Amazon EMR.
Enhancements to the maximizeResourceAllocation Setting for Spark
By default, Spark on your Amazon EMR cluster uses the Apache defaults for Spark executor settings: 2 executors, each with 1 core and 1 GB of RAM. Amazon EMR provides two easy ways to instruct Spark to use more resources across your cluster. First, you can enable dynamic allocation of executors, which allows YARN to programmatically scale the number of executors used by each Spark application; you can then adjust the number of cores and the amount of RAM per executor in your Spark configuration. Second, you can specify maximizeResourceAllocation, which automatically sets the executor size to consume all of the resources YARN allocates on a node and sets the number of executors to the number of nodes in your cluster (at creation time). These settings give a single Spark application a way to consume all of the available resources on a cluster. In release 4.3.0, we have enhanced maximizeResourceAllocation so that it also automatically increases driver program memory above the Apache default, based on the number of nodes and node types in your cluster (more information about configuring Spark).
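As a rough illustration of the two options, the Python sketch below builds the configuration objects you might pass when creating a cluster. It assumes the "spark" classification for maximizeResourceAllocation and the "spark-defaults" classification for dynamic allocation; the helper name spark_configurations and the executor memory and core values are hypothetical and chosen only for the example.

```python
import json


def spark_configurations(maximize=True, executor_memory="4g", executor_cores="2"):
    """Build an EMR configuration list for one of the two Spark options.

    The function name and the default executor sizes are illustrative
    assumptions, not values recommended by Amazon EMR.
    """
    if maximize:
        # Option 2: let EMR size executors to fill each node.
        return [
            {
                "Classification": "spark",
                "Properties": {"maximizeResourceAllocation": "true"},
            }
        ]
    # Option 1: dynamic allocation, plus hand-tuned executor sizing.
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.dynamicAllocation.enabled": "true",
                "spark.executor.memory": executor_memory,
                "spark.executor.cores": executor_cores,
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON form, which could be saved to a file and supplied
    # as the cluster's configuration at creation time.
    print(json.dumps(spark_configurations(maximize=True), indent=2))
```

You would normally choose one option or the other for a given cluster rather than both, since each one sets executor sizing in its own way.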
AWS CLI Export in the EMR Console
You can now generate an EMR create-cluster command representative of an existing cluster with a 4.x release using the AWS CLI Export option on the Cluster Details page in the AWS Management Console. This allows you to quickly create a cluster using the Create Cluster experience in the console, and easily generate the AWS CLI script to recreate that cluster from the AWS CLI.
Launch an Amazon EMR Cluster with Release 4.3.0 Today
To create an Amazon EMR cluster with release 4.3.0, select release 4.3.0 on the Create Cluster page in the AWS Management Console, or use the release label emr-4.3.0 when creating your cluster from the AWS CLI or by using an SDK with the EMR API.
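If you use the AWS SDK for Python (boto3), a request along the following lines is one way to do this. Treat it as a sketch under stated assumptions: the cluster name, instance type, instance count, and the helper names build_cluster_request and launch are hypothetical, the application name "Presto-Sandbox" is assumed for the sandbox release, and the role names assume the default EMR roles already exist in your account.

```python
import json


def build_cluster_request(name="emr-4-3-0-demo",
                          instance_type="m3.xlarge",
                          instance_count=3):
    """Assemble keyword arguments for the EMR RunJobFlow API call.

    All default values here are placeholders for illustration.
    """
    return {
        "Name": name,
        # The release label selects EMR release 4.3.0.
        "ReleaseLabel": "emr-4.3.0",
        "Applications": [
            {"Name": app}
            for app in ("Hadoop", "Spark", "Ganglia", "Presto-Sandbox")
        ],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after any steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles have been created.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request):
    """Submit the request; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so building the request has no dependency

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Only prints the request; call launch(request) to start a real
    # (billable) cluster.
    print(json.dumps(build_cluster_request(), indent=2))
```

Keeping the request builder separate from the call that submits it lets you review or log the exact parameters before any cluster is started.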
— Jon Fritz, Senior Product Manager, Amazon EMR