Amazon EMR Documentation

Usability

Amazon EMR is designed to simplify building and operating big data environments and applications. Related EMR features include provisioning, managed scaling, and reconfiguring of clusters, and EMR Studio for collaborative development.

Provision clusters quickly

EMR is designed to automate infrastructure provisioning, cluster setup, configuration, and tuning, helping you quickly launch EMR clusters.

Scale resources to meet business needs

EMR Managed Scaling policies are designed to scale resources with little manual effort. EMR clusters adjust their compute resources to meet your usage and performance needs.

Develop collaboratively with EMR Studio

EMR Studio is an integrated development environment (IDE) that aims to make it easier for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline Service to help simplify debugging.

High availability

You can configure high availability for multi-master applications such as YARN, HDFS, Apache Spark, Apache HBase, and Apache Hive. When you enable multi-master support, EMR is designed to configure these applications for high availability, to fail over to a standby master in the event of a failure to help minimize disruption to your cluster, and to place your master nodes in distinct racks to reduce the risk of simultaneous failure. EMR monitors hosts to detect failures, and when issues are detected, EMR provisions new hosts and adds them to the cluster.

EMR Managed Scaling

With EMR Managed Scaling you specify the minimum and maximum compute limits for your clusters and Amazon EMR resizes them to help improve performance and resource utilization. EMR Managed Scaling is designed to continuously sample key metrics associated with the workloads running on clusters.
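As an illustration of the minimum and maximum compute limits described above, the following minimal Python sketch builds the policy structure that the EMR PutManagedScalingPolicy API accepts. The helper name managed_scaling_policy, the limits of 2 and 10, and the cluster ID in the comment are assumptions chosen for the example, not values from this documentation.

```python
def managed_scaling_policy(min_units: int, max_units: int,
                           unit_type: str = "Instances") -> dict:
    """Build a ManagedScalingPolicy structure with compute limits.

    unit_type is one of "Instances", "InstanceFleetUnits", or "VCPU".
    """
    if min_units < 1 or max_units < min_units:
        raise ValueError("limits must satisfy 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Example limits (hypothetical): never fewer than 2 or more than 10 instances.
policy = managed_scaling_policy(2, 10)

# With boto3 installed and credentials configured, the policy could be
# attached to a cluster (placeholder cluster ID) like this:
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX", ManagedScalingPolicy=policy)
```

Keeping the limits in a small validated helper makes it harder to submit a policy whose minimum exceeds its maximum.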

Reconfigure running clusters

You can modify the configuration of applications running on EMR clusters, including Apache Hadoop, Apache Spark, Apache Hive, and Hue, without shutting down or re-creating the cluster. When you submit a change through EMR Application Reconfiguration, Amazon EMR applies your new configurations and gracefully restarts the reconfigured application. Configurations can be applied through the Console, SDK, or CLI.
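A reconfiguration submitted through the SDK might look like the following Python sketch, which builds the arguments for the EMR ModifyInstanceGroups API. The helper name, the cluster and instance group IDs, and the spark-defaults property are placeholders assumed for illustration.

```python
def reconfiguration_request(cluster_id: str, instance_group_id: str,
                            classification: str, properties: dict) -> dict:
    """Build keyword arguments for a ModifyInstanceGroups reconfiguration."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,
                "Configurations": [
                    {
                        "Classification": classification,
                        # Property values are passed as strings.
                        "Properties": {k: str(v) for k, v in properties.items()},
                    }
                ],
            }
        ],
    }


# Hypothetical IDs and setting, for illustration only.
request = reconfiguration_request(
    "j-XXXXXXXXXXXXX", "ig-XXXXXXXXXXXX",
    "spark-defaults", {"spark.executor.memory": "4g"},
)

# With boto3 installed and credentials configured:
#   boto3.client("emr").modify_instance_groups(**request)
```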

Elastic

Amazon EMR enables you to provision capacity as you need it, and manually add and remove capacity. With Amazon EMR you can provision instances, scale to match compute requirements, and shut your cluster down when your job is complete. If you need more capacity, you can launch a new cluster and terminate it when you no longer need it. You can also resize your clusters, as needed.

Amazon S3 Integration

The EMR File System (EMRFS) allows EMR clusters to use Amazon S3 as an object store for Hadoop. You can store your data in Amazon S3 and configure multiple Amazon EMR clusters to process the same data set. Each cluster can be optimized for a particular workload. In addition, by storing your input and output data in Amazon S3, you can shut down clusters when they are no longer needed.

EMRFS supports S3 server-side or S3 client-side encryption using AWS Key Management Service (KMS) or customer-managed keys, and offers an optional consistent view which checks for list and read-after-write consistency for objects tracked in its metadata. Also, Amazon EMR clusters can use both EMRFS and HDFS, so you don’t have to choose between on-cluster storage and Amazon S3.

AWS Glue Data Catalog Integration

You can use the AWS Glue Data Catalog as a managed metadata repository to store external table metadata for Apache Spark and Apache Hive. Additionally, it provides schema discovery and schema version history. This allows you to persist metadata for your external tables on Amazon S3 outside of your cluster.
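One way to point Hive and Spark at the AWS Glue Data Catalog is through configuration classifications supplied at cluster creation. The Python sketch below builds such a list; the helper name is an assumption, and the list would be passed as the Configurations argument when launching a cluster.

```python
# Metastore client factory class that routes metadata calls to AWS Glue.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(hive: bool = True, spark: bool = True) -> list:
    """Build configuration classifications that enable the Glue Data Catalog."""
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = []
    if hive:
        configurations.append(
            {"Classification": "hive-site", "Properties": dict(props)})
    if spark:
        configurations.append(
            {"Classification": "spark-hive-site", "Properties": dict(props)})
    return configurations


configurations = glue_catalog_configurations()
# With boto3, this list could be supplied as
#   run_job_flow(..., Configurations=configurations)
```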

Flexible data stores

With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.

Amazon S3

With the EMR File System (EMRFS), Amazon EMR can use Amazon S3 as an object store for Hadoop. Also, EMRFS can enable consistent view to check for list and read-after-write consistency for objects in Amazon S3. EMRFS supports S3 server-side or S3 client-side encryption to process encrypted Amazon S3 objects, and you can use the AWS Key Management Service (KMS) or a custom key vendor.

When you launch your cluster, Amazon EMR streams the data from Amazon S3 to each instance in your cluster and begins processing. One advantage of storing your data in Amazon S3 and processing it with Amazon EMR is you can use multiple clusters to process the same data.

Hadoop Distributed File System (HDFS)

HDFS is the Hadoop file system, which can be used in connection with EMR.

Each EC2 instance comes with a fixed amount of storage, referred to as "instance store", attached to the instance. You can also customize the storage on an instance by adding Amazon EBS volumes to it. Amazon EMR allows you to add General Purpose (SSD), Provisioned IOPS (SSD), and Magnetic volume types. The EBS volumes added to an EMR cluster do not persist data after the cluster is shut down. EMR cleans up the volumes once you terminate your cluster.

You can also enable encryption for HDFS using an Amazon EMR security configuration, or manually create HDFS encryption zones with the Hadoop Key Management Server. You can use a security configuration option to encrypt EBS root device and storage volumes when you specify AWS KMS as your key provider.

Amazon DynamoDB

Amazon DynamoDB is a managed NoSQL database service. Amazon EMR has direct integration with Amazon DynamoDB so you can process data stored in Amazon DynamoDB and transfer data between Amazon DynamoDB, Amazon S3, and HDFS in Amazon EMR.

Use open source applications

With versioned releases on Amazon EMR, you can select and use open source projects on your EMR cluster, including applications in the Apache Spark and Hadoop ecosystems. Software is installed and configured by Amazon EMR.

Data access control

Amazon EMR application processes use the EC2 instance profile when they call other AWS services. For multi-tenant clusters, Amazon EMR offers options to manage user access to Amazon S3 data.

Integration with AWS Lake Formation enables you to define and manage fine-grained authorization policies in AWS Lake Formation to access databases, tables, and columns in AWS Glue Data Catalog. You can enforce the authorization policies on jobs submitted through Amazon EMR Notebooks and Apache Zeppelin for interactive EMR Spark workloads, and send auditing events to AWS CloudTrail. By enabling this integration, you also enable federated Single Sign-On to EMR Notebooks or Apache Zeppelin from enterprise identity systems compatible with Security Assertion Markup Language (SAML) 2.0.

Native integration with Apache Ranger allows you to set up a new or an existing Apache Ranger server to define and manage fine-grained authorization policies for users to access databases, tables, and columns of Amazon S3 data via Hive Metastore. Apache Ranger is an open-source tool to enable, monitor, and manage comprehensive data security across the Hadoop platform.

This native integration allows you to define three types of authorization policies on the Apache Ranger Policy Admin server. You can set table, column, and row level authorization for Hive, table and column level authorization for Spark, and prefix and object level authorization for Amazon S3. Amazon EMR is designed to install and configure the corresponding Apache Ranger plugins on the cluster. These Ranger plugins sync with the Policy Admin server for authorization policies, enforce data access control, and send auditing events to Amazon CloudWatch Logs.

Additional features

Select the right instance for your cluster

You choose what types of EC2 instances to provision in your cluster (standard, high memory, high CPU, high I/O, etc.) based on your application’s requirements. You have root access to every instance and you can customize your cluster to suit your requirements.

Debug your applications

When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3 and then indexes those files. You can then use a graphical interface in the console to browse the logs and view job history in an intuitive way.

Monitor your cluster

You can use Amazon CloudWatch to monitor custom Amazon EMR metrics, such as the average number of running map and reduce tasks. You can also set alarms on these metrics.
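As a sketch of setting an alarm on an EMR metric, the Python snippet below builds the arguments for the CloudWatch PutMetricAlarm API. It uses the IsIdle metric in the AWS/ElasticMapReduce namespace; the helper name, alarm name, threshold window, and cluster ID are assumptions for the example.

```python
def idle_cluster_alarm(cluster_id: str, idle_minutes: int = 30) -> dict:
    """Build PutMetricAlarm arguments that fire when a cluster stays idle."""
    period_seconds = 300  # evaluate the metric in 5-minute periods
    return {
        "AlarmName": f"emr-idle-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        # Number of consecutive periods that must breach the threshold.
        "EvaluationPeriods": max(1, (idle_minutes * 60) // period_seconds),
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


alarm = idle_cluster_alarm("j-XXXXXXXXXXXXX")  # placeholder cluster ID

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```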

Respond to events

You can use Amazon EMR event types in Amazon CloudWatch Events to respond to state changes in your Amazon EMR clusters. Using simple rules that you set up, you can match events and route them to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, and more.
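A rule of this kind is driven by an event pattern. The following Python sketch builds a pattern that matches EMR cluster state changes; the helper name, the rule name in the comment, and the choice of the TERMINATED_WITH_ERRORS state are assumptions for illustration.

```python
import json


def cluster_state_pattern(states) -> dict:
    """Build an event pattern matching EMR cluster state-change events."""
    return {
        "source": ["aws.emr"],
        "detail-type": ["EMR Cluster State Change"],
        "detail": {"state": list(states)},
    }


# Match clusters that terminated with errors (example state).
pattern_json = json.dumps(cluster_state_pattern(["TERMINATED_WITH_ERRORS"]))

# With boto3 installed and credentials configured, the rule could be created
# with a hypothetical name and then given targets such as an SNS topic:
#   boto3.client("events").put_rule(
#       Name="emr-cluster-failed", EventPattern=pattern_json)
```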

Schedule recurring workflows

You can use AWS Data Pipeline to schedule recurring workflows involving Amazon EMR. AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.

Deep learning

Use popular deep learning frameworks like Apache MXNet to define, train, and deploy deep neural networks. You can use these frameworks on Amazon EMR clusters with GPU instances.

Control network access to your cluster

You can launch your cluster in an Amazon Virtual Private Cloud (VPC), a logically isolated section of the AWS cloud. You can configure control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

Manage users, permissions and encryption

You can use AWS Identity and Access Management (IAM) tools such as IAM Users and Roles to control access and permissions. You can also use Amazon EMR security configurations to set and control various at-rest and in-transit encryption options, including support for Amazon S3 encryption and Kerberos authentication.
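For example, a security configuration that enables at-rest encryption of Amazon S3 data with an AWS KMS key could be expressed as the JSON document built in this Python sketch. The helper name, configuration name, and KMS key ARN are hypothetical placeholders; in-transit encryption is left disabled to keep the example short.

```python
import json


def s3_sse_kms_security_configuration(kms_key_arn: str) -> str:
    """Return a security configuration JSON enabling SSE-KMS for S3 data."""
    document = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                }
            },
        }
    }
    return json.dumps(document)


# Placeholder ARN, not a real key.
security_configuration = s3_sse_kms_security_configuration(
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID")

# With boto3 installed and credentials configured:
#   boto3.client("emr").create_security_configuration(
#       Name="example-s3-sse-kms",
#       SecurityConfiguration=security_configuration)
```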

Install additional software

You can use bootstrap actions or a custom Amazon Machine Image (AMI) running Amazon Linux to install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can also preload and use software on a custom Amazon Linux AMI.
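A bootstrap action is declared as a name plus the Amazon S3 path of a script and its arguments. The Python sketch below builds that structure; the helper name, the bucket, the script path, and the package names are hypothetical.

```python
def bootstrap_action(name: str, script_s3_path: str, args=()) -> dict:
    """Build one entry for the BootstrapActions list of a cluster request."""
    if not script_s3_path.startswith("s3://"):
        raise ValueError("bootstrap scripts are referenced by an s3:// path")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args),
        },
    }


# Hypothetical script that installs extra Python libraries on every node.
bootstrap_actions = [
    bootstrap_action(
        "install-libraries",
        "s3://amzn-s3-demo-bucket/bootstrap/install-libs.sh",
        ["numpy", "pandas"],
    )
]
# With boto3, this list could be supplied as
#   run_job_flow(..., BootstrapActions=bootstrap_actions)
```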

Copy data

You can move data from Amazon S3 to HDFS, from HDFS to Amazon S3, and between Amazon S3 buckets using Amazon EMR’s S3DistCp, an extension of the open source tool DistCp, which uses MapReduce to move large amounts of data.
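S3DistCp is typically run as a cluster step. The Python sketch below builds a step definition that invokes s3-dist-cp through command-runner.jar; the helper name, step name, bucket, and paths are assumed for illustration.

```python
def s3distcp_step(src: str, dest: str, name: str = "Copy data") -> dict:
    """Build a step definition that runs s3-dist-cp from src to dest."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical copy from an S3 prefix into HDFS on the cluster.
step = s3distcp_step("s3://amzn-s3-demo-bucket/input/", "hdfs:///input/")

# With boto3 installed and credentials configured:
#   boto3.client("emr").add_job_flow_steps(
#       JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```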

Custom JAR

Write a Java program, compile it against the version of Hadoop you want to use, and upload it to Amazon S3. You can then submit Hadoop jobs to the cluster using the Hadoop JobClient interface.
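Once the JAR is in Amazon S3, it can also be referenced from a cluster step. The following Python sketch builds such a step definition; the helper name, JAR location, main class, and arguments are hypothetical.

```python
def custom_jar_step(name: str, jar_s3_path: str, args=(),
                    main_class: str = "") -> dict:
    """Build a step definition that runs a custom JAR stored in Amazon S3."""
    hadoop_jar_step = {"Jar": jar_s3_path, "Args": list(args)}
    if main_class:
        # Only needed when the JAR manifest does not declare a main class.
        hadoop_jar_step["MainClass"] = main_class
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": hadoop_jar_step,
    }


# Hypothetical word-count job reading from and writing to S3.
jar_step = custom_jar_step(
    "word-count",
    "s3://amzn-s3-demo-bucket/jars/wordcount.jar",
    ["s3://amzn-s3-demo-bucket/input/", "s3://amzn-s3-demo-bucket/output/"],
    main_class="com.example.WordCount",
)
# With boto3: add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[jar_step])
```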

AI Capabilities

Apache Spark upgrade agent

The Apache Spark upgrade agent provides a conversational interface in which engineers can express upgrade requirements in natural language while maintaining control over code modifications.

Amazon EMR Studio

EMR Studio is an integrated development environment (IDE) that enables data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.

EMR Studio provides managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Studio uses AWS Single Sign-On and allows you to log in directly with your corporate credentials without logging into the AWS console. Data scientists and analysts can install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and Bitbucket, or execute parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow or Amazon Managed Workflows for Apache Airflow.

Amazon EMR Notebooks

Amazon EMR Notebooks, a managed environment based on Jupyter and JupyterLab notebooks, enables users to interactively analyze and visualize data, collaborate with peers, and build applications using EMR clusters. EMR Notebooks is designed for Apache Spark. It supports Sparkmagic kernels, which allow you to remotely run queries and code on your EMR cluster using languages like PySpark, Spark SQL, SparkR, and Scala.

With EMR Notebooks, there are no software or instances to manage. You can either attach the notebook to an existing cluster or provision a new cluster directly from the console. You can attach multiple notebooks to a single cluster, and you can detach notebooks and re-attach them to new clusters.

Serverless storage for EMR Serverless

Amazon EMR Serverless is designed to reduce local storage provisioning for Apache Spark workloads and to prevent job failures caused by disk capacity constraints. The service handles intermediate data operations and enables elastic scaling. EMR Serverless allows compute resources to scale up and down across job stages without being constrained by locally stored data, handling peak I/O loads while maintaining enterprise-grade security with encryption in transit and at rest.

Amazon EMR on Amazon EKS

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management.

With Amazon EMR on Amazon EKS, you can share compute and memory resources across all of your applications and use a single set of Kubernetes tools to centrally monitor and manage your infrastructure. You can also use a single EKS cluster to run applications that require different Apache Spark versions and configurations, and take advantage of automated provisioning, scaling, faster runtimes, and development and debugging tools that EMR provides.
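Submitting a Spark job to EMR on EKS goes through the StartJobRun API of the emr-containers service. The Python sketch below builds the request arguments; the helper name, virtual cluster ID, role ARN, script location, release label, and Spark parameters are all placeholders assumed for the example.

```python
def spark_job_run_request(name: str, virtual_cluster_id: str,
                          execution_role_arn: str, entry_point: str,
                          release_label: str,
                          spark_submit_parameters: str = "") -> dict:
    """Build keyword arguments for an EMR on EKS StartJobRun call."""
    driver = {"entryPoint": entry_point}
    if spark_submit_parameters:
        driver["sparkSubmitParameters"] = spark_submit_parameters
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        # Each job can name its own release, so different Spark versions
        # can share one EKS cluster.
        "releaseLabel": release_label,
        "jobDriver": {"sparkSubmitJobDriver": driver},
    }


job_request = spark_job_run_request(
    "sample-spark-job",
    "abcdefghijklmnop1234567890",  # placeholder virtual cluster ID
    "arn:aws:iam::111122223333:role/ExampleJobExecutionRole",
    "s3://amzn-s3-demo-bucket/scripts/job.py",
    "emr-6.15.0-latest",
    "--conf spark.executor.instances=2",
)

# With boto3 installed and credentials configured:
#   boto3.client("emr-containers").start_job_run(**job_request)
```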

Amazon EMR on AWS Outposts

AWS Outposts brings AWS services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility. Amazon EMR is available on AWS Outposts, allowing you to set up, deploy, manage, and scale Apache Hadoop, Apache Hive, Apache Spark, and Presto clusters in your on-premises environments, just as you would in the cloud. Amazon EMR provides capacity in Outposts, while automating time-consuming administration tasks including infrastructure provisioning, cluster setup, configuration, and tuning. You can create managed EMR clusters on-premises using the same AWS Management Console, APIs, and CLI for EMR. EMR clusters launched in an Outpost will appear in the AWS console just like any other cluster, but will be running in your Outpost.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.