Amazon EMR Documentation

Usability

Amazon EMR is designed to simplify building and operating big data environments and applications. Related EMR features include provisioning, managed scaling, and reconfiguring of clusters, and EMR Studio for collaborative development.

Provision clusters in minutes

You can launch an EMR cluster in minutes. The service is designed to automate infrastructure provisioning, cluster setup, configuration, and tuning. EMR takes care of these tasks, allowing your teams to focus on developing differentiated big data applications.
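
As a rough illustration of launching a cluster programmatically, the sketch below assembles a request for the boto3 `run_job_flow` call. The cluster name, release label, instance types, and role names are placeholder assumptions rather than values from this documentation, so verify them for your account before use.

```python
def build_cluster_request(name="example-cluster", release="emr-6.15.0"):
    """Assemble keyword arguments for boto3's EMR run_job_flow call.

    Every value here is an illustrative placeholder (an assumption).
    """
    return {
        "Name": name,
        "ReleaseLabel": release,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed instance profile name
        "ServiceRole": "EMR_DefaultRole",      # assumed service role name
    }


def launch_cluster(request):
    """Submit the request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Keeping the request builder separate from the API call makes it possible to review or unit-test the cluster definition without touching AWS.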

Scale resources to meet business needs

You can define scale-out and scale-in limits using EMR Managed Scaling policies and let your EMR cluster automatically manage its compute resources to meet your usage and performance needs. This improves cluster utilization.

EMR Studio

EMR Studio is an integrated development environment (IDE) that makes it easier for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging.

High availability

You can configure high availability for multi-master applications such as YARN, HDFS, Apache Spark, Apache HBase, and Apache Hive. When you enable multi-master support, EMR is designed to configure these applications for high availability, automatically fail over to a standby master in the event of a failure so that your cluster is not disrupted, and place your master nodes in distinct racks to reduce the risk of simultaneous failure. Hosts are monitored to detect failures, and when issues are detected, new hosts are provisioned and added to the cluster automatically.

EMR Managed Scaling

With EMR Managed Scaling you specify the minimum and maximum compute limits for your clusters and Amazon EMR automatically resizes them for improved performance and resource utilization. EMR Managed Scaling is designed to continuously sample key metrics associated with the workloads running on clusters.
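
The minimum and maximum compute limits described above could be expressed as follows. This is a minimal sketch that builds the `ManagedScalingPolicy` argument for the boto3 `put_managed_scaling_policy` call; the helper names and the choice of `Instances` as the unit type are assumptions made for illustration.

```python
def build_managed_scaling_policy(min_units, max_units, max_on_demand=None):
    """Build a ManagedScalingPolicy dict with minimum and maximum limits."""
    if min_units < 1 or max_units < min_units:
        raise ValueError("need 1 <= min_units <= max_units")
    limits = {
        "UnitType": "Instances",  # assumption: count whole instances
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_units,
    }
    if max_on_demand is not None:
        # Optional cap on On-Demand capacity; the rest may come from Spot.
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand
    return {"ComputeLimits": limits}


def apply_policy(cluster_id, policy):
    """Attach the policy to a running cluster; requires boto3 and credentials."""
    import boto3

    boto3.client("emr").put_managed_scaling_policy(
        ClusterId=cluster_id, ManagedScalingPolicy=policy
    )
```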

Reconfigure running clusters

You can modify the configuration of applications running on EMR clusters, including Apache Hadoop, Apache Spark, Apache Hive, and Hue, without restarting the cluster. EMR Application Reconfiguration allows you to modify applications on the fly without needing to shut down or re-create the cluster. Amazon EMR applies your new configurations and gracefully restarts the reconfigured application. Configurations can be applied through the Console, SDK, or CLI.
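
One way to apply a new configuration through the SDK is the boto3 `modify_instance_groups` call. The sketch below only builds its `InstanceGroups` argument; the instance group ID, classification, and property shown in the usage note are hypothetical examples.

```python
def build_reconfiguration(instance_group_id, classification, properties):
    """Build the InstanceGroups argument for modify_instance_groups."""
    return [{
        "InstanceGroupId": instance_group_id,
        "Configurations": [{
            "Classification": classification,
            "Properties": dict(properties),
        }],
    }]


def reconfigure(cluster_id, instance_groups):
    """Send the reconfiguration; requires boto3 and AWS credentials."""
    import boto3

    boto3.client("emr").modify_instance_groups(
        ClusterId=cluster_id, InstanceGroups=instance_groups
    )
```

For example, `build_reconfiguration("ig-EXAMPLE", "spark-defaults", {"spark.executor.memory": "4g"})` describes a change to Spark's default executor memory for one instance group.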

Elastic

Amazon EMR enables you to provision capacity as you need it, and automatically or manually add and remove capacity. This is useful if you have variable or unpredictable processing requirements. For example, if the bulk of your processing occurs at night, you might need 100 instances during the day and 500 instances at night. Alternatively, you might need a significant amount of capacity for a short period of time. With Amazon EMR you can provision instances, automatically scale to match compute requirements, and shut your cluster down when your job is complete.

There are two main options for adding or removing capacity:

  • Deploy multiple clusters: If you need more capacity, you can launch a new cluster and terminate it when you no longer need it. There is no limit to how many clusters you can have. You may want to use multiple clusters if you have multiple users or applications. For example, you can store your input data in Amazon S3 and launch one cluster for each application that needs to process the data. One cluster might be optimized for CPU, a second cluster might be optimized for storage, etc.
  • Resize a running cluster: You can use EMR Managed Scaling to automatically scale a running cluster, or you can resize it manually. You may want to scale out a cluster to temporarily add more processing power to the cluster, or scale in your cluster to save on costs when you have idle capacity. For example, some customers add hundreds of instances to their clusters when their batch processing occurs, and remove the extra instances when processing completes. When adding instances to your cluster, EMR can start utilizing provisioned capacity as soon as it becomes available. When scaling in, EMR will proactively choose idle nodes to reduce impact on running jobs.

Amazon EC2 Spot Integration

Amazon EMR enables use of Spot Instances so you can save both time and money. Amazon EMR clusters include 'core nodes' that run HDFS and 'task nodes' that do not; task nodes are ideal for Spot because if the Spot price increases and you lose those instances, you will not lose data stored in HDFS. With the combination of instance fleets, allocation strategies for Spot Instances, EMR Managed Scaling, and more diversification options, you can optimize EMR for resilience and cost.
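
A task fleet that runs entirely on Spot might be described as in the sketch below, which builds one entry for the `InstanceFleets` list accepted by boto3's `run_job_flow`. The instance types, timeout, and allocation strategy are illustrative assumptions; diversifying across several instance types is what the fleet configuration is meant to show.

```python
def build_spot_task_fleet(spot_capacity,
                          instance_types=("m5.xlarge", "m5a.xlarge", "m4.xlarge")):
    """Build an instance fleet entry for task nodes that use only Spot."""
    return {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",  # task nodes do not store HDFS data
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": spot_capacity,
        # Several instance types give EMR more Spot pools to draw from.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }
```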

Amazon S3 Integration

The EMR File System (EMRFS) allows EMR clusters to use Amazon S3 as an object store for Hadoop. You can store your data in Amazon S3 and use multiple Amazon EMR clusters to process the same data set. Each cluster can be optimized for a particular workload, which can be more efficient than a single cluster serving multiple workloads with different requirements. For example, you might have one cluster that is optimized for I/O and another that is optimized for CPU, each processing the same data set in Amazon S3. In addition, by storing your input and output data in Amazon S3, you can shut down clusters when they are no longer needed.

EMRFS supports S3 server-side or S3 client-side encryption using AWS Key Management Service (KMS) or customer-managed keys, and offers an optional consistent view which checks for list and read-after-write consistency for objects tracked in its metadata. Also, Amazon EMR clusters can use both EMRFS and HDFS, so you don’t have to choose between on-cluster storage and Amazon S3.

AWS Glue Data Catalog Integration

You can use the AWS Glue Data Catalog as a managed metadata repository to store external table metadata for Apache Spark and Apache Hive. Additionally, it provides automatic schema discovery and schema version history. This allows you to persist metadata for your external tables on Amazon S3 outside of your cluster.
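
As a sketch of how a cluster can be pointed at the Glue Data Catalog, the helper below returns configuration objects that set the Hive metastore client factory for both Hive and Spark. The helper name is hypothetical; the list is meant to be passed as the `Configurations` argument when creating a cluster.

```python
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Return cluster Configurations that use Glue as the metastore."""
    props = {"hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},        # Hive
        {"Classification": "spark-hive-site", "Properties": dict(props)},  # Spark SQL
    ]
```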

Flexible data stores

With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.

Amazon S3

Amazon S3 is a highly durable, scalable, secure, fast, and inexpensive storage service. With the EMR File System (EMRFS), Amazon EMR can efficiently and securely use Amazon S3 as an object store for Hadoop. Amazon EMR has made numerous improvements to Hadoop, allowing you to process large amounts of data stored in Amazon S3. Also, EMRFS can enable consistent view to check for list and read-after-write consistency for objects in Amazon S3. EMRFS supports S3 server-side or S3 client-side encryption to process encrypted Amazon S3 objects, and you can use the AWS Key Management Service (KMS) or a custom key vendor.

When you launch your cluster, Amazon EMR streams the data from Amazon S3 to each instance in your cluster and begins processing. One advantage of storing your data in Amazon S3 and processing it with Amazon EMR is you can use multiple clusters to process the same data. For example, you might have a Hive development cluster that is optimized for memory and a Pig production cluster that is optimized for CPU, both using the same input data set.

Hadoop Distributed File System (HDFS)

HDFS is the Hadoop file system. Amazon EMR's current topology groups its instances into three logical instance groups: Master Group, which runs the YARN ResourceManager and the HDFS NameNode service; Core Group, which runs the HDFS DataNode daemon and the YARN NodeManager service; and Task Group, which runs the YARN NodeManager service. Amazon EMR installs HDFS on the storage associated with the instances in the Core Group.

Each EC2 instance comes with a fixed amount of storage, referred to as "instance store", attached to the instance. You can also customize the storage on an instance by adding Amazon EBS volumes. Amazon EMR allows you to add General Purpose (SSD), Provisioned IOPS (SSD), and Magnetic volume types. The EBS volumes added to an EMR cluster do not persist data after the cluster is shut down. EMR automatically cleans up the volumes once you terminate your cluster.

You can also enable complete encryption for HDFS using an Amazon EMR security configuration, or manually create HDFS encryption zones with the Hadoop Key Management Server. You can use a security configuration option to encrypt EBS root device and storage volumes when you specify AWS KMS as your key provider. 
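
A security configuration that turns on at-rest encryption with AWS KMS as the key provider might look like the sketch below. The field names reflect the security configuration JSON as commonly documented and should be verified against the EMR Management Guide; the helper names and the choice of SSE-KMS for Amazon S3 are assumptions.

```python
import json


def build_encryption_config(kms_key_arn):
    """Build a security configuration that encrypts S3 data and local disks."""
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                    # Also encrypt the EBS root device and storage volumes.
                    "EnableEbsEncryption": True,
                },
            },
        }
    }


def create_security_configuration(name, config):
    """Register the configuration; requires boto3 and AWS credentials."""
    import boto3

    boto3.client("emr").create_security_configuration(
        Name=name, SecurityConfiguration=json.dumps(config)
    )
```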

Amazon DynamoDB

Amazon DynamoDB is a managed NoSQL database service. Amazon EMR has direct integration with Amazon DynamoDB so you can process data stored in Amazon DynamoDB and transfer data between Amazon DynamoDB, Amazon S3, and HDFS in Amazon EMR.

Other AWS Data Stores

You can also use Amazon Relational Database Service (a web service designed to set up, operate, and scale a relational database in the cloud), Amazon Glacier (a storage service that provides secure and durable storage for data archiving and backup), and Amazon Redshift (a managed data warehouse service). AWS Data Pipeline is a web service that helps you process and move data between different AWS compute and storage services (including Amazon EMR) as well as on-premises data sources at specified intervals.

Use your favorite open source applications

With versioned releases on Amazon EMR, you can select and use the latest open source projects on your EMR cluster, including applications in the Apache Spark and Hadoop ecosystems. Software is installed and configured by Amazon EMR, so you can spend more time on increasing the value of your data without worrying about infrastructure and administrative tasks.

Big Data Tools

Amazon EMR supports Hadoop tools such as Apache Spark, Apache Hive, Presto, and Apache HBase. Data scientists use EMR to run deep learning and machine learning tools such as TensorFlow and Apache MXNet, and use bootstrap actions to add use case-specific tools and libraries. Data analysts use EMR Studio, Hue, and EMR Notebooks for interactive development, authoring Apache Spark jobs, and submitting SQL queries to Apache Hive and Presto. Data engineers use EMR for data pipeline development and data processing, and use Apache Hudi to simplify incremental data management and data privacy use cases requiring record-level insert, update, and delete operations.

Data Processing & Machine Learning

Apache Spark is an engine in the Hadoop ecosystem for processing large data sets. It uses in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to define data transformations. Spark also includes Spark SQL, Spark Streaming, MLlib, and GraphX.

Apache Flink is a streaming dataflow engine that enables you to run real-time stream processing on high-throughput data sources. It supports event time semantics for out of order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications. 

TensorFlow is an open source symbolic math library for machine intelligence and deep learning applications. TensorFlow bundles together multiple machine learning and deep learning models and algorithms and can train and run deep neural networks for many different use cases. 

Record-Level Amazon S3 Data Management

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. Apache Hudi enables you to manage data at the record-level in Amazon S3 to simplify Change Data Capture (CDC) and streaming data ingestion, and provides a framework to handle data privacy use cases requiring record level updates and deletes. 

SQL

Apache Hive is an open source data warehouse and analytics package that runs on top of Hadoop. Hive is operated by Hive QL, a SQL-based language which allows users to structure, summarize, and query data. Hive QL goes beyond standard SQL, adding first-class support for map/reduce functions and complex extensible user-defined data types like JSON and Thrift. This capability allows processing of complex and unstructured data sources such as text documents and log files. Hive allows user extensions via user-defined functions written in Java. Amazon EMR has made numerous improvements to Hive, including direct integration with Amazon DynamoDB and Amazon S3. For example, with Amazon EMR you can load table partitions automatically from Amazon S3, you can write data to tables in Amazon S3 without using temporary files, and you can access resources in Amazon S3 such as scripts for custom map/reduce operations and additional libraries. 

Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3. 

Apache Phoenix enables low-latency SQL with ACID transaction capabilities over data stored in Apache HBase. You can create secondary indexes for additional performance, and create different views over the same underlying HBase table. 

NoSQL

Apache HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of Apache Software Foundation's Hadoop project and runs on top of Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because it caches data in memory. HBase is optimized for sequential write operations, and for batch inserts, updates, and deletes. HBase works with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). With EMR, you can use S3 as a data store for HBase, enabling you to reduce operational complexity. If you use HDFS as a data store, you can back up HBase to S3 and you can restore from a previously created backup.

Interactive Analytics

EMR Studio is an integrated development environment (IDE) that enables data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging.

Hue is an open source user interface for Hadoop that makes it easier to run and develop Hive queries, manage files in HDFS, run and develop Pig scripts, and manage tables. Hue on EMR also integrates with Amazon S3, so you can query directly against S3 and transfer files between HDFS and Amazon S3. 

Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text. JupyterHub allows you to host multiple instances of a single-user Jupyter notebook server. When you create an EMR cluster with JupyterHub, EMR creates a Docker container on the cluster's master node. JupyterHub, all the components required for Jupyter, and Sparkmagic run within the container.

Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and visualize results. Zeppelin notebooks can be shared among several users, and visualizations can be published to external dashboards. 

Scheduling and workflow

Apache Oozie is a workflow scheduler for Hadoop, where you can create Directed Acyclic Graphs (DAGs) of actions. Also, you can trigger your Hadoop workflows by actions or time. AWS Step Functions allows you to add serverless workflow automation to your applications. The steps of your workflow can run anywhere, including in AWS Lambda functions, on Amazon Elastic Compute Cloud (EC2), or on-premises. 

Other projects and tools

EMR also supports a variety of other popular applications and tools, such as R, Apache Pig (data processing and ETL), Apache Tez (complex DAG execution), Apache MXNet (deep learning), Ganglia (monitoring), Apache Sqoop (relational database connector), HCatalog (table and storage management), and more. The Amazon EMR team maintains an open source repository of bootstrap actions that can be used to install additional software, configure your cluster, or serve as examples for writing your own bootstrap actions.

Data access control

By default, Amazon EMR application processes use the EC2 instance profile when they call other AWS services. For multi-tenant clusters, Amazon EMR offers three options to manage user access to Amazon S3 data.

Integration with AWS Lake Formation allows you to define and manage fine-grained authorization policies in AWS Lake Formation to access databases, tables, and columns in AWS Glue Data Catalog. You can enforce the authorization policies on jobs submitted through Amazon EMR Notebooks and Apache Zeppelin for interactive EMR Spark workloads, and send auditing events to AWS CloudTrail. By enabling this integration, you also enable federated Single Sign-On to EMR Notebooks or Apache Zeppelin from enterprise identity systems compatible with Security Assertion Markup Language (SAML) 2.0.

Native integration with Apache Ranger allows you to set up a new or an existing Apache Ranger server to define and manage fine-grained authorization policies for users to access databases, tables, and columns of Amazon S3 data via Hive Metastore. Apache Ranger is an open-source tool to enable, monitor, and manage comprehensive data security across the Hadoop platform.

This native integration allows you to define three types of authorization policies on the Apache Ranger Policy Admin server. You can set table, column, and row level authorization for Hive, table and column level authorization for Spark, and prefix and object level authorization for Amazon S3. Amazon EMR installs and configures the corresponding Apache Ranger plugins on the cluster. These Ranger plugins sync up with the Policy Admin server for authorization policies, enforce data access control, and send auditing events to Amazon CloudWatch Logs.

Additional features

Select the right instance for your cluster

You choose what types of EC2 instances to provision in your cluster (standard, high memory, high CPU, high I/O, etc.) based on your application’s requirements. You have root access to every instance and you can customize your cluster to suit your requirements. 

Debug your applications

When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3 and then indexes those files. You can then use a graphical interface in the console to browse the logs and view job history in an intuitive way. 

Monitor your cluster

You can use Amazon CloudWatch to monitor custom Amazon EMR metrics, such as the average number of running map and reduce tasks. You can also set alarms on these metrics. 
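
Setting an alarm on an EMR metric could look like the following sketch, which builds arguments for the CloudWatch `put_metric_alarm` call in boto3. It uses the `IsIdle` metric as an example; the alarm name, thresholds, and the SNS topic passed in are illustrative assumptions.

```python
def build_idle_alarm(cluster_id, sns_topic_arn, idle_minutes=30):
    """Build put_metric_alarm arguments that fire when a cluster sits idle."""
    return {
        "AlarmName": f"emr-{cluster_id}-idle",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds; one five-minute data point per period
        "EvaluationPeriods": max(1, idle_minutes // 5),
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(alarm):
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```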

Respond to events

You can use Amazon EMR event types in Amazon CloudWatch Events to respond to state changes in your Amazon EMR clusters. Using simple rules that you set up, you can match events and route them to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, and more.
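
A rule of this kind is driven by an event pattern. The sketch below builds a pattern for cluster state changes and includes a deliberately tiny local matcher for testing; the matcher is a simplification for this one pattern shape and does not implement the service's full matching rules. The default state value is an assumption chosen for illustration.

```python
import json


def build_cluster_state_pattern(states=("TERMINATED_WITH_ERRORS",)):
    """Build an event pattern that matches EMR cluster state changes."""
    return {
        "source": ["aws.emr"],
        "detail-type": ["EMR Cluster State Change"],
        "detail": {"state": list(states)},
    }


def matches(pattern, event):
    """Local check for the flat pattern above (not the full matching rules)."""
    return (
        event.get("source") in pattern["source"]
        and event.get("detail-type") in pattern["detail-type"]
        and event.get("detail", {}).get("state") in pattern["detail"]["state"]
    )


def create_rule(name, pattern):
    """Register the rule; requires boto3 and AWS credentials."""
    import boto3

    boto3.client("events").put_rule(Name=name, EventPattern=json.dumps(pattern))
```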

Schedule recurring workflows

You can use AWS Data Pipeline to schedule recurring workflows involving Amazon EMR. AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premises data sources at specified intervals.

Deep learning

Use popular deep learning frameworks like Apache MXNet to define, train, and deploy deep neural networks. You can use these frameworks on Amazon EMR clusters with GPU instances. 

Control network access to your cluster

You can launch your cluster in an Amazon Virtual Private Cloud (VPC), a logically isolated section of the AWS cloud. You have control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. 

Manage users, permissions and encryption

You can use AWS Identity and Access Management (IAM) tools such as IAM Users and Roles to control access and permissions. For example, you could give certain users read but not write access to your clusters. Also, you can use Amazon EMR security configurations to set various encryption at-rest and in-transit options, including support for Amazon S3 encryption, and Kerberos authentication. 

Install additional software

You can use bootstrap actions or a custom Amazon Machine Image (AMI) running Amazon Linux to install additional software on your cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can also preload and use software on a custom Amazon Linux AMI. 
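
A bootstrap action is declared when the cluster is created. The following minimal sketch builds one entry for the `BootstrapActions` list accepted by boto3's `run_job_flow`; the script location and arguments in the usage note are hypothetical.

```python
def build_bootstrap_action(name, script_path, args=()):
    """Build one BootstrapActions entry that runs a script on each node."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,   # location of the script to run
            "Args": list(args),    # arguments passed to the script
        },
    }
```

For example, `build_bootstrap_action("install-libs", "s3://example-bucket/install.sh", ["--with-numpy"])` describes a script that would run on every node before Hadoop starts.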

Copy data

You can move data from Amazon S3 to HDFS, from HDFS to Amazon S3, and between Amazon S3 buckets using Amazon EMR's S3DistCp, an extension of the open source tool DistCp, which uses MapReduce to move large amounts of data.
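
One common way to run S3DistCp is as a cluster step. The sketch below builds such a step for boto3's `add_job_flow_steps`, invoking `s3-dist-cp` through `command-runner.jar`; the step name, failure action, and the paths in the test data are assumptions.

```python
def build_s3distcp_step(src, dest, name="Copy data with S3DistCp"):
    """Build a step that copies data between S3 and HDFS with s3-dist-cp."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


def submit_step(cluster_id, step):
    """Add the step to a running cluster; requires boto3 and credentials."""
    import boto3

    response = boto3.client("emr").add_job_flow_steps(
        JobFlowId=cluster_id, Steps=[step]
    )
    return response["StepIds"][0]
```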

Custom JAR

Write a Java program, compile against the version of Hadoop you want to use, and upload to Amazon S3. You can then submit Hadoop jobs to the cluster using the Hadoop JobClient interface. 
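
Once the JAR is in Amazon S3, it can be submitted as a step. This is a minimal sketch of a step definition for boto3's `add_job_flow_steps`; the JAR location, main class, and arguments are hypothetical placeholders.

```python
def build_custom_jar_step(jar_s3_path, main_class=None, args=()):
    """Build a step that runs a custom JAR stored in Amazon S3."""
    jar_step = {"Jar": jar_s3_path, "Args": list(args)}
    if main_class:
        # Only needed when the JAR manifest does not name a main class.
        jar_step["MainClass"] = main_class
    return {
        "Name": "Custom JAR step",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": jar_step,
    }
```

The resulting dictionary goes into the `Steps` list of `add_job_flow_steps`, or into the `Steps` argument when the cluster is created.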

Amazon EMR Studio

EMR Studio is an integrated development environment (IDE) that enables data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.

EMR Studio provides managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Studio uses AWS Single Sign-On and allows you to log in directly with your corporate credentials without logging into the AWS console. Data scientists and analysts can install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and BitBucket, or execute parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow or Amazon Managed Workflows for Apache Airflow.

EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance-optimized Amazon EMR runtime for Apache Spark. Administrators can set up EMR Studio such that analysts can run their applications on existing EMR clusters or create new clusters using pre-defined AWS CloudFormation templates for EMR.

Benefits:
Simple to use

EMR Studio is designed to make it simple to interact with applications on an EMR cluster. You can access EMR Studio with your corporate credentials using AWS Single Sign-On, without logging into the AWS console or the cluster. You can interactively explore, process and visualize data using notebooks, build and schedule pipelines, and debug applications without logging into EMR clusters.

Managed Jupyter Notebooks

With EMR Studio, you can start developing analytics and data science applications in R, Python, Scala, and PySpark with managed Jupyter Notebooks. You can attach notebooks to existing EMR clusters or auto-provision clusters using pre-configured templates to run jobs. You can collaborate with others using repositories, and install custom Python libraries or kernels directly from Notebooks.

Easy to build applications

EMR Studio enables you to move from prototyping to production. You can trigger pipelines from code repositories, simply run Notebooks as pipelines using orchestration tools like Apache Airflow or Amazon Managed Workflows for Apache Airflow, or attach notebooks to a bigger cluster using a single click.

Simplified debugging

With EMR Studio, you can debug jobs and access logs without logging into the cluster for both active and terminated clusters. You can use native application interfaces such as Spark UI and YARN timeline service directly from EMR Studio. EMR Studio also allows you to locate the cluster or job to debug by using filters such as cluster state, creation time, and cluster ID.

Use cases:
Build data science and engineering applications

With EMR Studio, you can log in directly to managed notebooks without logging into the AWS console, start notebooks in seconds, get onboarded with sample notebooks, and perform your data exploration. You can collaborate with peers by sharing notebooks via GitHub and other repositories. You can also customize your environment by loading custom kernels and Python libraries from notebooks.

Deploy production pipelines

In EMR Studio, you can use a code repository to trigger pipelines. You can also parameterize and chain notebooks to build pipelines. You can integrate notebooks into scheduled workflows using workflow orchestration services such as Apache Airflow or Amazon Managed Workflows for Apache Airflow. EMR Studio also allows you to re-attach notebooks to a bigger cluster to run a job.

Simplify debugging applications

In EMR Studio, you can debug notebook applications from the notebook UI. You can also debug pipelines by first narrowing down clusters using filters like cluster state, and diagnose jobs on both active and terminated clusters with as few clicks as possible to open native debugging UIs like Spark UI, Tez UI, and YARN Timeline Service.

Amazon EMR Notebooks

Amazon EMR Notebooks, a managed environment based on Jupyter and JupyterLab notebooks, enables users to interactively analyze and visualize data, collaborate with peers, and build applications using EMR clusters. EMR Notebooks is designed for Apache Spark. It supports Sparkmagic kernels, which allow you to remotely run queries and code on your EMR cluster using languages like PySpark, Spark SQL, SparkR, and Scala.

With EMR Notebooks, you do not have to manage software or instances. You can either attach the notebook to an existing cluster or provision a new cluster directly from the console. You can attach multiple notebooks to a single cluster, and you can detach notebooks and re-attach them to new clusters.

EMR Notebooks allows you to:

  1. Monitor and debug Spark jobs directly from your notebook
  2. Install notebook-scoped libraries on a running EMR cluster
  3. Associate Git repositories with your notebook for version control, and simplified code collaboration and reuse
  4. Compare and merge two notebooks using the nbdime utility

Amazon EMR Serverless

Amazon EMR Serverless is a serverless option in Amazon EMR designed to help you run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. Select the open-source framework you want to run for your application, such as Apache Spark and Apache Hive, and EMR Serverless can automatically provision and manage the underlying compute and memory resources, including scaling those resources to meet changing data volumes and processing requirements.

Use cases:
Variable workloads

With EMR Serverless, you can automatically scale application resources as workload demands change, without having to preconfigure how much compute power and memory you need.

SLA-sensitive data pipelines

You can pre-initialize application resources in EMR Serverless to help speed up response time for SLA-sensitive data pipelines.
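
Pre-initialized capacity is declared when the application is created. The sketch below builds arguments for the `create_application` call of the boto3 `emr-serverless` client; the release label, worker sizes, and worker counts are illustrative assumptions to adjust for your workload.

```python
def build_serverless_application(name, driver_count=1, executor_count=4):
    """Build create_application arguments with pre-initialized Spark workers."""
    worker = {"cpu": "4vCPU", "memory": "16GB"}  # assumed worker size
    return {
        "name": name,
        "releaseLabel": "emr-6.15.0",
        "type": "SPARK",
        # Workers listed here are kept warm so jobs can start sooner.
        "initialCapacity": {
            "DRIVER": {
                "workerCount": driver_count,
                "workerConfiguration": dict(worker),
            },
            "EXECUTOR": {
                "workerCount": executor_count,
                "workerConfiguration": dict(worker),
            },
        },
    }


def create_application(request):
    """Create the application; requires boto3 and AWS credentials."""
    import boto3

    client = boto3.client("emr-serverless")
    return client.create_application(**request)["applicationId"]
```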

Development and test environments

EMR Serverless can help you quickly spin up a development and test environment that automatically scales with unpredictable usage.

Amazon EMR on Amazon EKS

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management.

With Amazon EMR on Amazon EKS, you can share compute and memory resources across all of your applications and use a single set of Kubernetes tools to centrally monitor and manage your infrastructure. You can also use a single EKS cluster to run applications that require different Apache Spark versions and configurations, and take advantage of automated provisioning, scaling, faster runtimes, and development and debugging tools that EMR provides.
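
To illustrate running different Spark versions on one EKS cluster, the sketch below builds job run requests for the `start_job_run` call of the boto3 `emr-containers` client. The job name, release labels, script location, and Spark parameters are placeholder assumptions; two requests that share a virtual cluster ID but differ in `releaseLabel` would run against different EMR releases.

```python
def build_eks_job_run(virtual_cluster_id, execution_role_arn, entry_point,
                      release_label="emr-6.2.0-latest"):
    """Build start_job_run arguments for a Spark job on EMR on EKS."""
    return {
        "name": "example-spark-job",
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": "--conf spark.executor.instances=2",
            }
        },
    }


def start_job(request):
    """Submit the job; requires boto3 and AWS credentials."""
    import boto3

    return boto3.client("emr-containers").start_job_run(**request)["id"]
```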

Benefits:
Simplify management

EMR benefits for Apache Spark on EKS include managed versions of Apache Spark 2.4 and 3.0, automatic provisioning, scaling, performance optimized runtime, and tools like EMR Studio for authoring jobs and an Apache Spark UI for debugging.

Optimize performance

By running analytics applications on EKS, you can reuse existing EC2 instances in your shared Kubernetes cluster and avoid the startup time of creating a new cluster of EC2 instances dedicated for analytics. 

Use cases:
Centralize resource management

With EMR on EKS, you can automate the provisioning, management, and scaling of Apache Spark, and use a single set of tools to centrally manage and monitor your infrastructure.

Co-location of workloads

Run multiple EMR workloads that require different frameworks, versions, and configurations on the same EKS cluster as your other application workloads.

Rapid adoption of new EMR versions

EMR on EKS provides a managed experience for developing, troubleshooting, and optimizing your analytics. You can deploy configurations and start jobs to test new EMR versions on the same EKS cluster without allocating dedicated resources.

Amazon EMR on AWS Outposts

AWS Outposts bring AWS services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility. Amazon EMR is available on AWS Outposts, allowing you to set up, deploy, manage, and scale Apache Hadoop, Apache Hive, Apache Spark, and Presto clusters in your on-premises environments, just as you would in the cloud. Amazon EMR provides capacity in Outposts, while automating time-consuming administration tasks including infrastructure provisioning, cluster setup, configuration, and tuning, freeing you to focus on your applications. You can create managed EMR clusters on-premises using the same AWS Management Console, APIs, and CLI for EMR. EMR clusters launched in an Outpost will appear in the AWS console just like any other cluster, but will be running in your Outpost.

Benefits:
Augment on-premises processing capacity

Once your Outpost is set up, you can launch a new EMR cluster on-premises and connect to existing HDFS storage. This allows you to respond when on-premises systems need additional processing capacity. Adding capacity to on-premises Hadoop and Spark clusters helps meet workload demands in periods of high utilization.

Process data that needs to remain on-premises

Apache Hadoop, Apache Hive, Apache Spark, and Presto are commonly used to process, transform, and analyze data that is part of a larger data architecture. For data that needs to remain on-premises for governance, compliance, or other reasons, you can use EMR to deploy and run applications like Apache Hadoop and Apache Spark on-premises, close to your data. This reduces the need to move large amounts of on-premises data to the cloud, reducing the overall time needed to process that data.

Accelerate data and workload migrations

If you’re in the process of migrating data and Apache Hadoop workloads to the cloud and want to start using EMR before your migration is complete, you can use AWS Outposts to launch EMR clusters on-premises that connect to your existing HDFS storage. You can then gradually migrate your data to Amazon S3 as part of an evolution to a cloud architecture.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.