Category: Amazon Elastic MapReduce
The Amazon EMR team is cranking out new features at an impressive pace (guess they have lots of worker nodes)! So far this quarter they have added all of these features:
- September – Data Encryption for Apache Spark, Tez, and Hadoop MapReduce.
- September – Open-sourced EMR-DynamoDB Connector for Apache Hive.
- November – Stream Processing at Scale with Apache Flink.
- November – Fine-grained Access Control Using Cluster Tags.
Today we are adding automatic scaling for EMR clusters to this list. You can now use scale-out and scale-in policies to adjust the number of core and task nodes in your clusters in response to changing workloads and to optimize your resource usage:
Scale-out policies add capacity and allow you to tackle bigger problems. Applications like Apache Spark and Apache Hive will automatically take advantage of the increased processing power as it comes online.
Scale-in policies remove capacity, either at the end of an instance billing hour or as tasks complete. If a node is removed while it is running a YARN container, YARN will rerun that container on another node (read Configure Cluster Scale-Down Behavior for more info).
Using Auto Scaling
In order to make use of Auto Scaling, an IAM role that gives Auto Scaling permission to launch and terminate EC2 instances must be associated with your cluster. If you create a cluster from the EMR Console, it will create the EMR_AutoScaling_DefaultRole for you. You can use it as-is or customize it as needed. If you create a cluster programmatically or via the command line, you will need to create the role yourself. You can also create the default roles from the command line like this:
$ aws emr create-default-roles
From the console, you can edit the Auto Scaling policies by clicking on the Advanced Options when you create your cluster:
Simply click on the pencil icon to begin editing your policy. Here’s my Scale out policy:
Because this policy is driven by YARNMemoryAvailablePercentage, it will be activated under low-memory conditions when I am running a YARN-based framework such as Spark, Tez, or Hadoop MapReduce. I can choose many other metrics as well; here are some of my options:
And here’s my Scale in policy:
I can choose from the same set of metrics, and I can set a Cooldown period for each policy. This value sets the minimum amount of time between scaling activities, and allows the metrics to stabilize as the changes take effect.
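If you create clusters programmatically, a policy like the ones above can also be expressed as a JSON document. Here is a minimal sketch in Python that builds one scale-out rule. The field names follow my understanding of the EMR automatic scaling policy format; the rule name, capacity limits, threshold, and cooldown values are illustrative assumptions, not the values from my console session.

```python
import json


def scale_out_rule(threshold_percent=15.0, add_nodes=1, cooldown_seconds=300):
    """Build one scale-out rule driven by YARNMemoryAvailablePercentage."""
    return {
        "Name": "ScaleOutOnLowMemory",  # hypothetical rule name
        "Description": "Add nodes when available YARN memory runs low",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": add_nodes,   # positive value adds nodes
                "CoolDown": cooldown_seconds,     # minimum time between scaling activities
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "EvaluationPeriods": 1,
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold_percent,
                "Unit": "PERCENT",
            }
        },
    }


def build_policy(min_nodes, max_nodes, rules):
    """Wrap rules with the capacity limits that bound all scaling activity."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": list(rules),
    }


policy = build_policy(2, 10, [scale_out_rule()])
print(json.dumps(policy, indent=2))
```

A scale-in rule would look the same, with a negative adjustment and a comparison that fires when plenty of memory is available.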
Default policies (driven by YARNMemoryAvailablePercentage and ContainerPendingRatio) are also available in the console.
This feature is available now and you can start using it today. Simply select emr-5.1.0 from the Release menu to get started!
If you are running MapReduce jobs on premises and storing data in HDFS (the Hadoop Distributed File System), you can now copy that data directly from HDFS to an AWS Snowball without using an intermediary staging file. Because HDFS is often used for Big Data workloads, this can greatly simplify the process of importing large amounts of data to AWS for further processing.
To use this new feature, download and configure the newest version of the Snowball Client on the on-premises host that is running the desired HDFS cluster. Then use commands like this to copy files from HDFS to S3 via Snowball:
$ snowball cp -n hdfs://HOST:PORT/PATH_TO_FILE_ON_HDFS s3://BUCKET-NAME/DESTINATION-PATH
You can use the -r option to recursively copy an entire folder:
$ snowball cp -n -r hdfs://HOST:PORT/PATH_TO_FOLDER_ON_HDFS s3://BUCKET_NAME/DESTINATION_PATH
To learn more, read Using the HDFS Client.
My colleague Jon Fritz wrote the guest post below to introduce you to the newest version of Amazon EMR.
Today we are announcing Amazon EMR release 4.2.0, which adds support for Apache Spark 1.5.2, Ganglia 3.6 for Apache Hadoop and Spark monitoring, and new sandbox releases for Presto (0.125), Apache Zeppelin (0.5.5), and Apache Oozie (4.2.0).
New Applications in Release 4.2.0
Amazon EMR provides an easy way to install and configure distributed big data applications in the Hadoop and Spark ecosystems on managed clusters of Amazon EC2 instances. You can create Amazon EMR clusters from the Amazon EMR Create Cluster Page in the AWS Management Console, from the AWS Command Line Interface (CLI), or by using an SDK with the EMR API. In the latest release, we added support for several new versions of applications:
- Spark 1.5.2 – Spark 1.5.2 was released on November 9th, and we’re happy to give you access to it within two weeks of general availability. This version is a maintenance release, with improvements to Spark SQL, SparkR, the DataFrame API, and miscellaneous enhancements and bug fixes. Also, Spark documentation now includes information on enabling wire encryption for the block transfer service. For a complete set of changes, view the JIRA. To learn more about Spark on Amazon EMR, click here.
- Ganglia 3.6 – Ganglia is a scalable, distributed monitoring system which can be installed on your Amazon EMR cluster to display Amazon EC2 instance level metrics which are also aggregated at the cluster level. We also configure Ganglia to ingest and display Hadoop and Spark metrics along with general resource utilization information from instances in your cluster, and metrics are displayed in a variety of time spans. You can view these metrics using the Ganglia web-UI on the master node of your Amazon EMR cluster. To learn more about Ganglia on Amazon EMR, click here.
- Presto 0.125 – Presto is an open-source, distributed SQL query engine designed for low-latency queries on large datasets in Amazon S3 and the Hadoop Distributed Filesystem (HDFS). Presto 0.125 is a maintenance release, with optimizations to SQL operations, performance enhancements, and general bug fixes. To learn more about Presto on Amazon EMR, click here.
- Zeppelin 0.5.5 – Zeppelin is an open-source interactive and collaborative notebook for data exploration using Spark. You can use Scala, Python, SQL, or HiveQL to manipulate data and visualize results. Zeppelin 0.5.5 is a maintenance release, and contains miscellaneous improvements and bug fixes. To learn more about Zeppelin on Amazon EMR, click here.
- Oozie 4.2.0 – Oozie is a workflow designer and scheduler for Hadoop and Spark. This version now includes Spark and HiveServer2 actions, making it easier to incorporate Spark and Hive jobs in Oozie workflows. Also, you can create and manage your Oozie workflows using the Oozie Editor and Dashboard in Hue, an application which offers a web-UI for Hive, Pig, and Oozie. Please note that in Hue 3.7.1, you must still use Shell actions to run Spark jobs. To learn more about Oozie in Amazon EMR, click here.
Launch an Amazon EMR Cluster with Release 4.2.0 Today
To create an Amazon EMR cluster with 4.2.0, select release 4.2.0 on the Create Cluster page in the AWS Management Console, or use the release label emr-4.2.0 when creating your cluster from the AWS CLI or using an SDK with the EMR API.
— Jon Fritz, Senior Product Manager
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. EMR first launched in 2009 (Announcing Amazon Elastic MapReduce), and we have added comprehensive console support and many, many features since then. Some of the most recent features include:
- Support for S3 encryption (both server-side and client-side).
- Consistent view for the EMR Filesystem (EMRFS).
- Data import, export, and query via the Hive / DynamoDB Connector.
- Enhanced CloudWatch metrics.
Today we are announcing Amazon EMR release 4.0.0, which brings many changes to the platform. This release includes updated versions of Hadoop ecosystem applications and Spark, which are available to install on your cluster, and improves the application configuration experience. As part of this release we also adjusted some of the ports and paths to align better with several Hadoop and Spark standards and conventions. Unlike other AWS services, which do not emerge in discrete releases and are frequently updated behind the scenes, EMR has versioned releases so that you can write programs and scripts that make use of features that are found only in a particular EMR release or in a version of an application found in a particular EMR release.
If you are currently using AMI version 2.x or 3.x, read the EMR Release Guide to learn how to migrate to 4.0.0.
EMR users have access to a number of applications from the Hadoop ecosystem. This version of EMR features the following updates:
- Hadoop 2.6.0 – This version of Hadoop includes a variety of general functionality and usability improvements.
- Hive 1.0 – This version of Hive includes performance enhancements, additional SQL support, and some new security features.
- Pig 0.14 – This version of Pig features a new ORCStorage class, predicate pushdown for better performance, bug fixes, and more.
- Spark 1.4.1 – This release of Spark includes a binding for SparkR and the new DataFrame API, plus many smaller features and bug fixes.
Quick Cluster Creation in Console
You can now create an EMR cluster from the Console using the Quick cluster configuration experience:
Improved Application Configuration Editing
In Amazon EMR AMI versions 2.x and 3.x, bootstrap actions were primarily used to configure applications on your cluster. With Amazon EMR release 4.0.0, we have improved the configuration experience by providing a direct method to edit the default configurations for applications when creating your cluster. We have added the ability to pass a configuration object which contains a list of the configuration files to edit and the settings in those files to be changed. You can create a configuration object and reference it from the CLI, the EMR API, or from the Console. You can store the configuration information locally or in Amazon Simple Storage Service (S3) and supply a reference to it (if you are using the Console, click on Go to advanced options when you create your cluster in order to specify configuration values or to use a configuration file):
To learn more, read about Configuring Applications.
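To make the idea of a configuration object concrete, here is a minimal sketch in Python that builds one and serializes it to JSON. Each entry names a classification (the configuration file to edit) and the properties to change in it. The core-site setting and its value are purely illustrative, and the helper name make_configuration is my own.

```python
import json


def make_configuration(classification, properties):
    """One entry: which configuration file to edit and the settings to change.

    Property values are converted to strings, since the settings end up
    as text in the application's configuration file.
    """
    return {
        "Classification": classification,
        "Properties": {name: str(value) for name, value in properties.items()},
    }


# Illustrative example: change a single setting in core-site.
configurations = [
    make_configuration("core-site", {"hadoop.security.groups.cache.secs": 250}),
]

config_json = json.dumps(configurations, indent=2)
print(config_json)
```

The resulting JSON could be stored locally or in S3 and referenced when you create the cluster.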
New Packaging System / Standard Ports & Paths
Our release packaging system is now based on Apache Bigtop. This will allow us to add new applications and new versions of applications to EMR even more quickly.
Also, we have moved most ports and paths on EMR release 4.0.0 to open source standards. For more information about these changes read Differences Introduced in 4.x.
Additional EMR Configuration Options for Spark
The EMR team asked me to share a couple of tech tips with you:
Spark on YARN has the ability to dynamically scale the number of executors used for a Spark application. You still need to set the memory (spark.executor.memory) and cores (spark.executor.cores) used for an executor in spark-defaults, but YARN will automatically allocate the number of executors to the Spark application as needed. To enable dynamic allocation of executors, set the dynamic allocation property to true in the spark-defaults configuration file. Additionally, the Spark shuffle service is enabled by default in Amazon EMR, so you do not need to enable it yourself.
You can configure your executors to utilize the maximum resources possible on each node in your cluster by setting the maximizeResourceAllocation option to true when creating your cluster. You can set this by adding this property to the “spark” classification in your configuration object when creating your cluster. This option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information. It also sets the number of executors by setting spark.executor.instances to the number of initial core nodes specified when creating your cluster. Note, however, that you cannot use this setting and also enable dynamic allocation of executors.
To learn more about these options, read Configure Spark.
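The two tips can be sketched as configuration objects. This is a rough illustration in Python, not an official template: spark.dynamicAllocation.enabled is the standard Spark property name for dynamic allocation, the memory and core values are placeholders, and the helper names (including the uses_both check) are my own.

```python
def dynamic_allocation_config(executor_memory="4g", executor_cores=2):
    """spark-defaults settings for dynamic allocation (placeholder sizes)."""
    return [{
        "Classification": "spark-defaults",
        "Properties": {
            "spark.dynamicAllocation.enabled": "true",
            "spark.executor.memory": executor_memory,
            "spark.executor.cores": str(executor_cores),
        },
    }]


def maximize_resources_config():
    """The "spark" classification with maximizeResourceAllocation turned on."""
    return [{
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }]


def uses_both(configurations):
    """Return True if a configuration list enables both options at once."""
    dynamic = maximize = False
    for entry in configurations:
        props = entry.get("Properties", {})
        if entry.get("Classification") == "spark-defaults":
            dynamic = dynamic or props.get("spark.dynamicAllocation.enabled") == "true"
        if entry.get("Classification") == "spark":
            maximize = maximize or props.get("maximizeResourceAllocation") == "true"
    return dynamic and maximize


# Each option on its own is fine; combining them is the case to avoid.
assert not uses_both(dynamic_allocation_config())
assert not uses_both(maximize_resources_config())
assert uses_both(dynamic_allocation_config() + maximize_resources_config())
```

A check like uses_both can run before a cluster is created, so that the conflicting combination is caught early.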
All of the features listed above are available now and you can start using them today.
If you are new to large-scale data processing and EMR, take a look at our Getting Started with Amazon EMR page. You’ll find a new tutorial video, along with information about training and professional services, all aimed at getting you up and running quickly and efficiently.
The AWS Key Management Service (KMS) provides you with seamless, centralized control over your encryption keys. As I noted when we launched the service (see my post, New AWS Key Management Service, for more information), this service gives you a new option for data protection and relieves you of many of the more onerous scalability and availability issues that inevitably surface when you implement key management at enterprise scale. KMS uses Hardware Security Modules to protect the security of your keys. It is also integrated with AWS CloudTrail for centralized logging of all key usage.
AWS GovCloud (US), as you probably know, is an AWS region designed to allow U.S. government agencies (federal, state, and local), along with contractors, educational institutions, enterprises, and other U.S. customers to run regulated workloads in the cloud. AWS includes many security features and is also subject to many compliance programs. AWS GovCloud (US) allows customers to run workloads that are subject to U.S. International Traffic in Arms Regulations (ITAR), the Federal Risk and Authorization Management Program (FedRAMP), and levels 1-5 of the Department of Defense Cloud Security Model (CSM).
KMS in GovCloud (US)
Today we are making AWS Key Management Service (KMS) available in AWS GovCloud (US). You can use it to encrypt data in your own applications and within the following AWS services, all using keys that are under your control:
- Amazon EBS volumes.
- Amazon S3 objects using Server-Side Encryption (SSE-KMS) or client-side encryption using the encryption client in the AWS SDKs.
- Output from Amazon EMR clusters to S3 using the EMRFS client.
Many AWS customers use Amazon EMR to process huge amounts of data. Built around Hadoop, EMR allows these customers to build highly scalable processing systems that can quickly and efficiently digest raw data and turn it into actionable business intelligence.
EMR File System (EMRFS) enables Amazon EMR clusters to operate directly on data in Amazon Simple Storage Service (S3), making it easy for customers to work with input and output files in S3. Until now, EMRFS supported unencrypted and server-side encrypted objects in S3.
Support for Amazon S3 Client-Side Encryption in the EMRFS
Today we’re adding support for client-side encrypted objects in S3, enabling you to use your own keys. The EMRFS S3 client-side encryption uses the same envelope encryption method found in the generic S3 Encryption Client, allowing you to use Amazon EMR to easily process data uploaded to S3 using that client. This feature does not, however, encrypt data stored in HDFS on the local disks of your Amazon EMR cluster or data in transit between your cluster nodes.
The encryption is transparent to the applications running on the EMR cluster.
You can store your keys in the AWS Key Management Service (KMS) or provide custom logic to access keys in on-premises HSMs or other customer key management systems. Amazon EMR can use an Encryption Materials Provider that you supply, so you can store your keys in any location where Amazon EMR can use them.
Enabling Encryption From the Console
You can enable this new feature from the EMR Console like this:
Based on the option that you select, the console will prompt you for additional information. For example, if you choose to use the Key Management Service, you can choose the desired key from the menu (you can also enter the ARN of an AWS KMS key if the key is owned by another AWS account):
Custom Key Management With the EMRFS
You can create a custom Encryption Materials Provider class to provide keys to the EMRFS using user-defined logic. The EMRFS will pass information from the S3 object metadata to the provider to indicate which key to retrieve for decryption. Your code must contain the information about how to retrieve the keys, and the EMRFS will use the key that the provider presents. When you specify the custom encryption materials provider option, all you need to do is give the Amazon S3 location of your provider, and Amazon EMR will automatically add the provider to the cluster and use it with the EMRFS.
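The lookup that a provider performs can be illustrated with a toy model. The sketch below is a conceptual Python analogy only, not the provider interface that EMRFS loads: the class name, the method name, and the key_id metadata field are all hypothetical, and the keys are dummy byte strings.

```python
class ToyMaterialsProvider:
    """Toy stand-in for a provider: maps object metadata to a key."""

    def __init__(self, keys_by_id, default_key_id):
        self._keys = dict(keys_by_id)
        self._default = default_key_id

    def materials_for(self, object_metadata):
        """Pick the key named in the metadata, or the default for new objects."""
        key_id = object_metadata.get("key_id", self._default)
        try:
            return self._keys[key_id]
        except KeyError:
            raise LookupError(f"no key available for id {key_id!r}") from None


# Dummy keys; a real provider would fetch these from a key management system.
provider = ToyMaterialsProvider(
    {"2014-key": b"\x00" * 32, "2015-key": b"\x01" * 32},
    default_key_id="2015-key",
)

old_object_key = provider.materials_for({"key_id": "2014-key"})
new_object_key = provider.materials_for({})
```

The point of the design is that the metadata stored with each object records which key was used, so older objects remain readable after you start writing with a newer key.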
This feature is available now and you can start using it today. You will need to use the latest EMR AMI (version 3.6.0 or later).
For many years, AWS customers have used tags to organize their EC2 resources (instances, images, load balancers, security groups, and so forth), RDS resources (DB instances, option groups, and more), VPC resources (gateways, option sets, network ACLs, subnets, and the like), Route 53 health checks, and S3 buckets. Tags are used to label, collect, and organize resources and become increasingly important as you use AWS in larger and more sophisticated ways. For example, you can tag relevant resources and then take advantage of AWS Cost Allocation for Customer Bills.
Today we are making tags even more useful with the introduction of a pair of new features: Resource Groups and a Tag Editor. Resource Groups allow you to easily create, maintain, and view a collection of resources that share common tags. The new Tag Editor allows you to easily manage tags across services and Regions. You can search globally and edit tags in bulk, all with a couple of clicks.
Let’s take a closer look at both of these cool new features! Both of them can be accessed from the new AWS menu:
Until today, when you decided to start making use of tags, you were faced with the task of stepping through your AWS resources on a service-by-service, region-by-region basis and applying tags as needed. The new Tag Editor centralizes and streamlines this process.
Let’s say I want to find and then tag all of my EC2 resources. The first step is to open up the Tag Editor and search for them:
The Tag Editor searches my account for the desired resource types across all of the selected Regions and then displays all of the matches:
I can then select all or some of the resources for editing. When I click on the Edit tags for selected button, I can see and edit existing tags and add new ones. I can also see existing System tags:
I can see which values are in use for a particular tag by simply hovering over the Multiple values indicator:
I can change multiple tags simultaneously (changes take effect when I click on Apply changes):
A Resource Group is a collection of resources that shares one or more tags. It can span Regions and services and can be used to create what is, in effect, a custom console that organizes and consolidates the information you need on a per-project basis.
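Conceptually, a Resource Group is just a tag filter applied across services and Regions. Here is a small Python sketch of that idea; the resource records, IDs, and tag values are made up for illustration, and the function is not part of any AWS SDK.

```python
def resource_group(resources, required_tags):
    """Return the resources whose tags include every required key/value pair."""
    return [
        resource
        for resource in resources
        if all(resource["tags"].get(key) == value
               for key, value in required_tags.items())
    ]


# Made-up inventory spanning several services and Regions.
resources = [
    {"id": "i-0001", "service": "ec2", "region": "us-east-1", "tags": {"Service": "blog"}},
    {"id": "db-0001", "service": "rds", "region": "us-west-2", "tags": {"Service": "blog"}},
    {"id": "bucket-logs", "service": "s3", "region": "us-east-1", "tags": {"Service": "billing"}},
]

group = resource_group(resources, {"Service": "blog"})
print([resource["id"] for resource in group])  # prints ['i-0001', 'db-0001']
```

Because the filter looks only at tags, the group picks up resources from different services and Regions without any extra work.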
You can create a new Resource Group with a couple of clicks. I tagged a bunch of my AWS resources with Service and then added the EC2 instances, DB instances, and S3 buckets to a new Resource Group:
My Resource Groups are available from within the AWS menu:
Selecting a group displays information about the resources in the group, including any alarm conditions (as appropriate):
This information can be further expanded:
Each identity within an AWS account can have its own set of Resource Groups. They can be shared between identities by clicking on the Share icon:
Down the Road
We are, as usual, very interested in your feedback on this feature and would love to hear from you! To get in touch, simply open up the Resource Groups Console and click on the Feedback button.
Resource Groups and the Tag Editor are available now and you can start using them today!
Hue is an open source web user interface for Hadoop. Hue allows technical and non-technical users to take advantage of Hive, Pig, and many of the other tools that are part of the Hadoop and EMR ecosystem. You can think of Hue as the primary user interface to Amazon EMR and the AWS Management Console as the primary administrator interface.
I am happy to announce that Hue is now available for Amazon EMR as part of the newest (version 3.3) Elastic MapReduce AMI. You can load your data, run interactive Hive queries, develop and run Pig scripts, work with HDFS, check on the status of your jobs, and more.
We have extended Hue to work with Amazon Simple Storage Service (S3). Hue’s File Browser allows you to browse S3 buckets and you can use the Hive editor to run queries against data stored in S3. You can also define an S3-based table using Hue’s Metastore Manager.
To get started, you simply launch a cluster with the new AMI and log in to Hue (it runs on the cluster’s master node). You can save and share queries with your colleagues, you can visualize query results, and you can view logs in real time (this is very helpful when debugging).
Hue in Action
Here are some screen shots of Hue in action. The main page displays all of my Hue documents (Hive queries and Pig scripts):
I can click on a document to open it up in the appropriate query editor:
I can view and edit the query, and then run it on my cluster with a single click of the Execute button. After I do this, I can inspect the logs as the job runs:
After the query runs to completion I can see the results, again with one click:
I can also see the results in graphical (chart) form:
You can launch an EMR cluster (with Hue included) from the AWS Management Console. You can also launch it from the command line like this:
$ aws emr create-cluster --ami-version=3.3.0 \
    --applications Name=Hue Name=Hive Name=Pig \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.large
Hue for You
Hue is available on version 3.3 and above of the Elastic MapReduce AMI at no extra cost. It runs on the master node of your EMR cluster.
date: 2014-10-15 2:03:16 PM
The new Quick Start Reference Deployment Guide for Cloudera Enterprise Data Hub does exactly what the title suggests! The comprehensive (20 page) guide includes the architectural considerations and configuration steps that will help you to launch the new Cloudera Director and an associated Cloudera Enterprise Data Hub (EDH) in a matter of minutes. As the folks at Cloudera said in their blog post, “Cloudera Director delivers an enterprise-class, elastic, self-service experience for Hadoop in cloud environments.”
The reference deployment takes the form of a twelve-node cluster that will cost between $12 and $82 per hour in the US East (Northern Virginia) Region, depending on the instance type that you choose to deploy.
The cluster runs within a Virtual Private Cloud that includes public and private subnets, a NAT instance, security groups, a placement group for low-latency networking within the cluster, and an IAM role. The EDH cluster is fully customizable and includes worker nodes, edge nodes, and management nodes, each running on the EC2 instance type that you designate:
Many AWS developers are using Amazon EMR (a managed Hadoop service) to quickly and cost-effectively build applications that process vast amounts of data. The EMR File System (EMRFS) allows AWS customers to use Amazon Simple Storage Service (S3) as a durable and cost-effective data store that is independent of the memory and compute resources of any particular cluster. It also allows multiple EMR clusters to process the same data set. This file system is accessed via the s3:// scheme.
Because S3 is designed for eventual consistency, if one application creates an S3 object it may take a short time (typically measured in tens or hundreds of milliseconds) before it is visible in a LIST operation. This small window can sometimes lead to inconsistent results when the output files produced by one MapReduce job are used as the input of another job.
Today we are making EMRFS even more powerful with the addition of a consistent view of the files stored in Amazon Simple Storage Service (S3). If you enable this feature, you can be confident that all of your files will be processed as intended when you run a chained series of MapReduce jobs. This is not a replacement file system. Instead, it extends the existing file system with mechanisms that are designed to detect and react to inconsistencies. The detection and recovery process includes a retry mechanism. After it has reached a configurable limit on the number of retries (to allow S3 to return what EMRFS expects in the consistent view), it will either (your choice) raise an exception or log the issue and continue.
The EMRFS consistent view creates and uses metadata in an Amazon DynamoDB table to maintain a consistent view of your S3 objects. This table tracks certain operations but does not hold any of your data. The information in the table is used to confirm that the results returned from an S3 LIST operation are as expected, thereby allowing EMRFS to check list consistency and read-after-write consistency.
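The detect-and-retry idea can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not the EMRFS implementation: the set of expected keys stands in for the metadata table, list_keys stands in for an S3 LIST call, and all of the names and default values are my own.

```python
import time


class InconsistentListing(Exception):
    """Raised when a listing never matches the expected keys."""


def consistent_list(list_keys, expected_keys, max_retries=5,
                    delay_seconds=0.0, raise_on_failure=True):
    """Re-run a listing until every expected key shows up, or retries run out."""
    expected = set(expected_keys)
    listed, missing = set(), expected
    for attempt in range(max_retries + 1):
        listed = set(list_keys())
        missing = expected - listed
        if not missing:
            return sorted(listed)
        if attempt < max_retries:
            time.sleep(delay_seconds)  # give the store time to catch up
    if raise_on_failure:
        raise InconsistentListing(f"still missing after retries: {sorted(missing)}")
    print(f"warning: continuing without {sorted(missing)}")
    return sorted(listed)


# Simulated listing that is stale twice before the new object appears.
responses = iter([["part-0"], ["part-0"], ["part-0", "part-1"]])
result = consistent_list(lambda: next(responses), ["part-0", "part-1"], max_retries=3)
print(result)  # prints ['part-0', 'part-1']
```

The raise_on_failure flag mirrors the choice mentioned earlier: either raise an exception or log the issue and continue.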
Enabling the Consistent View
This feature is not enabled by default. You can, however, enable it when you create a new Elastic MapReduce cluster from the command line, the Elastic MapReduce API, or the Elastic MapReduce Console. Here are the options that are available to you when you use the console:
As you can see, you can also enable S3 server-side encryption for EMRFS.
Here’s how you enable the consistent view from the command line when you create a new EMR cluster:
$ aws emr create-cluster --name TestCluster --ami-version 3.2.1 \
    --instance-type m3.xlarge --instance-count 3 \
    --emrfs Consistent=True --ec2-attributes KeyName=YOURKEYNAME
In general, once enabled, this feature will enforce consistency with no action on your part. For example, it will create, populate, and update the DynamoDB table as needed. It will not, however, delete the table (it has no way to know when it is safe to do so). You can delete the table through the DynamoDB console or you can add a final cleanup step to the last job on your processing pipeline.
You can also sync a folder to load it into a consistent view. This is useful to add new folders to the view that were not written by EMRFS, or to manually sync a folder being managed by EMRFS. You can log in to the Master node of your cluster and run the emrfs command like this:
$ emrfs sync s3://bucket/folder
There is no charge for this feature, but you will pay an hourly charge for the data stored in the DynamoDB table (the first 100 MB is available to you at no charge as part of the AWS Free Usage tier) and for the level of provisioned read and write capacity. By default, the table is provisioned for 500 read capacity units and 100 write capacity units. As I noted earlier, you are responsible for deleting the table when you no longer need it.
This feature is available now and you can start using it today!