Amazon EMR

Easily run and scale Apache Spark, Hive, Presto, and other big data frameworks

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. With EMR you can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. For short-running jobs, you can spin clusters up and down and pay per second for the instances used. For long-running workloads, you can create highly available clusters that automatically scale to meet demand. If you have existing on-premises deployments of open source tools such as Apache Spark and Apache Hive, you can also run EMR clusters on AWS Outposts.


Easy to use

Analysts, data engineers, and data scientists can use EMR Notebooks to collaborate and to interactively explore, process, and visualize data. You simply specify the version of EMR applications and the type of compute you want to use. EMR takes care of provisioning, configuring, and tuning clusters so that you can focus on running analytics.
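As a rough illustration of "specify the version and the type of compute", the sketch below assembles the parameters accepted by boto3's EMR `run_job_flow` call. The release label, instance type, cluster name, and the default role names are assumptions chosen for the example, not values from this page. The function only builds the request; it does not contact AWS.

```python
def build_cluster_request(release_label, instance_type, node_count, applications):
    """Assemble keyword arguments for boto3's emr.run_job_flow (no AWS call is made)."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    # One primary node; any remaining nodes become core nodes.
    groups = [{
        "Name": "Primary",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if node_count > 1:
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": node_count - 1,
        })
    return {
        "Name": "example-cluster",  # hypothetical cluster name
        "ReleaseLabel": release_label,
        "Applications": [{"Name": name} for name in applications],
        "Instances": {
            "InstanceGroups": groups,
            # False makes the cluster transient: it terminates once its steps finish.
            "KeepJobFlowAliveWhenNoStepsRemain": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


# Assumed example values: release emr-6.15.0 on three m5.xlarge instances.
request = build_cluster_request("emr-6.15.0", "m5.xlarge", 3, ["Spark", "Hive"])
```

With credentials configured, a request like this could be passed on as `boto3.client("emr").run_job_flow(**request)`.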

Low cost

EMR pricing is simple and predictable: You pay a per-instance rate for every second used, with a one-minute minimum charge. You can launch a 10-node EMR cluster for as little as $0.15 per hour. You can also save 50-80% on the cost of the instances by selecting Amazon EC2 Spot for transient workloads and Reserved Instances for long-running workloads. You can also use Savings Plans.
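The billing rule above can be worked through in a few lines. This is a simplified model, not an official price calculator. The per-instance rate is an assumption obtained by spreading the quoted $0.15 per hour evenly across 10 nodes, and the discount is applied to the whole rate, although the page says the 50-80% saving applies to the instance cost.

```python
import math


def billed_seconds(runtime_seconds):
    """Per-second billing with a one-minute minimum charge."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return max(60, math.ceil(runtime_seconds))


def cluster_cost(instances, rate_per_instance_hour, runtime_seconds, discount=0.0):
    """Cost in dollars for `instances` nodes running for `runtime_seconds`."""
    seconds = billed_seconds(runtime_seconds)
    return instances * rate_per_instance_hour * (seconds / 3600) * (1 - discount)


RATE = 0.015  # assumption: $0.15 per hour divided evenly over 10 nodes

one_hour = cluster_cost(10, RATE, 3600)                 # full hour on 10 nodes
short_job = cluster_cost(10, RATE, 20)                  # 20 s job, billed as 60 s
spot_hour = cluster_cost(10, RATE, 3600, discount=0.5)  # low end of the 50-80% range
```

Under these assumptions the 20-second job costs the same as a 60-second one, which is the effect of the one-minute minimum.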

Elastic

Unlike the rigid infrastructure of on-premises clusters, EMR decouples compute and storage, giving you the ability to scale each independently and take advantage of the tiered storage of Amazon S3. With EMR, you can provision one, hundreds, or thousands of compute instances to process data at any scale. The number of instances can be increased or decreased automatically using Auto Scaling (which manages cluster sizes based on utilization), and you only pay for what you use.
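To make "manages cluster sizes based on utilization" concrete, here is a minimal sketch of a utilization-driven sizing rule. It is an illustrative target-tracking heuristic, not the algorithm EMR uses. The function name, the percentage inputs, and the limits are all hypothetical.

```python
def desired_nodes(current_nodes, utilization_pct, target_pct, min_nodes, max_nodes):
    """Pick a node count that would bring utilization back toward the target.

    Percentages are integers (0-100) so that the ceiling division is exact.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    # Ceiling of current_nodes * utilization / target, using integer arithmetic.
    needed = -(-(current_nodes * utilization_pct) // target_pct)
    # Clamp to the configured minimum and maximum cluster size.
    return max(min_nodes, min(max_nodes, needed))


scale_out = desired_nodes(10, 90, 60, min_nodes=2, max_nodes=50)  # busy cluster
scale_in = desired_nodes(10, 15, 60, min_nodes=2, max_nodes=50)   # mostly idle cluster
capped = desired_nodes(10, 100, 60, min_nodes=2, max_nodes=12)    # limited by max_nodes
```

The clamp is what keeps cost bounded: however high utilization climbs, the cluster never grows past `max_nodes`.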

Reliable

Spend less time tuning and monitoring your cluster. EMR is tuned for the cloud and constantly monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. With multiple master nodes, clusters are highly available and automatically fail over in the event of a node failure. EMR provides the latest stable open source software releases, so you don’t have to manage updates and bug fixes, which leads to fewer issues and less effort to maintain the environment.

Secure

EMR automatically configures EC2 firewall settings controlling network access to instances and launches clusters in an Amazon Virtual Private Cloud (VPC). Server-side encryption or client-side encryption can be used with the AWS Key Management Service or your own customer-managed keys. EMR makes it easy to enable other encryption options, like in-transit and at-rest encryption, and strong authentication with Kerberos. You can use AWS Lake Formation or Apache Ranger to apply fine-grained data access controls for databases, tables, and columns.
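The encryption options mentioned above are typically expressed as an EMR security configuration, which is a JSON document. The sketch below builds one that enables at-rest encryption with KMS and in-transit encryption with PEM certificates. The key names follow the security configuration format as I understand it and should be checked against the EMR documentation. The KMS key ARN and the S3 certificate location are placeholders, not real resources.

```python
import json


def build_security_configuration(kms_key_arn, tls_certs_s3_uri):
    """Return a dict describing at-rest and in-transit encryption for a cluster."""
    return {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": True,
            "AtRestEncryptionConfiguration": {
                # Server-side encryption of S3 data with a KMS key.
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                # Encryption of the instances' local disks with the same key.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": tls_certs_s3_uri,
                },
            },
        }
    }


config = build_security_configuration(
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",  # placeholder ARN
    "s3://example-bucket/certs/my-certs.zip",                 # placeholder location
)
config_json = json.dumps(config, indent=2)
```

The resulting JSON string is the kind of value that boto3's `create_security_configuration` takes in its `SecurityConfiguration` parameter.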

Flexible

You have complete control over your cluster with root access to every instance. You can launch EMR clusters with custom Amazon Linux AMIs and easily install additional applications with bootstrap actions. EMR enables you to reconfigure applications on running clusters on the fly without the need to relaunch clusters. Additionally, using Hadoop 3.0, you can package library dependencies in Docker containers and submit them with your jobs to simplify environment dependencies.
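Bootstrap actions and application settings are both passed as structured parameters when a cluster is created. The sketch below builds them in the shape that boto3's `run_job_flow` expects for `BootstrapActions` and `Configurations`. The script path, the action name, and the Spark setting are hypothetical examples.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Describe a script that runs on every node before applications start."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": list(args)},
    }


def app_configuration(classification, properties):
    """Describe overrides for one application configuration file."""
    return {"Classification": classification, "Properties": dict(properties)}


customizations = {
    "BootstrapActions": [
        bootstrap_action(
            "install-libraries",                              # hypothetical name
            "s3://example-bucket/bootstrap/install_libs.sh",  # hypothetical script
            ["pandas", "scikit-learn"],
        ),
    ],
    "Configurations": [
        # Example override: give each Spark executor 4 GB of memory.
        app_configuration("spark-defaults", {"spark.executor.memory": "4g"}),
    ],
}
```

These two keys can be merged into the other `run_job_flow` arguments, for example `run_job_flow(**base_request, **customizations)`, where `base_request` holds the cluster name, release, and instances.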

Use cases

Machine learning

Use EMR's built-in machine learning tools, including Apache Spark MLlib, TensorFlow, and Apache MXNet, for scalable machine learning algorithms. Use custom AMIs and bootstrap actions to add your preferred libraries and tools and create your own predictive analytics toolset.

Extract, transform, load (ETL)

EMR can be used to quickly and cost-effectively perform data transformation workloads (ETL) such as sort, aggregate, and join on large datasets.
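The three operations named above (join, aggregate, sort) are shown below on a tiny in-memory dataset in plain Python. This only illustrates the logic. On EMR the same steps would normally be written in Spark or Hive and run across a cluster. The table and field names are made up for the example.

```python
from collections import defaultdict

customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
    {"customer_id": 3, "region": "US"},
]
orders = [
    {"customer_id": 1, "amount": 40},
    {"customer_id": 2, "amount": 25},
    {"customer_id": 3, "amount": 50},
    {"customer_id": 1, "amount": 10},
    {"customer_id": 9, "amount": 99},  # no matching customer, dropped by the join
]


def revenue_by_region(orders, customers):
    """Inner-join orders to customers, sum amounts per region, sort descending."""
    region_of = {c["customer_id"]: c["region"] for c in customers}  # join lookup
    totals = defaultdict(int)
    for order in orders:
        region = region_of.get(order["customer_id"])
        if region is not None:  # inner join: skip orders without a customer
            totals[region] += order["amount"]  # aggregate
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)  # sort


report = revenue_by_region(orders, customers)
```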

Clickstream analysis

Analyze clickstream data from Amazon S3 using Apache Spark and Apache Hive to segment users, understand user preferences, and deliver more effective ads.
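A very small sketch of user segmentation from click events follows. It assigns each user to the page category they visited most. Real clickstream pipelines on EMR would read events from S3 and use Spark or Hive. The event format and the "most-visited category" rule are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Each event is (user_id, page_category); the values are made up.
events = [
    ("u1", "shoes"), ("u1", "shoes"), ("u1", "hats"),
    ("u2", "hats"), ("u2", "hats"),
    ("u3", "shoes"),
]


def segment_users(events):
    """Map each user to the category they clicked most often."""
    clicks = defaultdict(Counter)
    for user_id, category in events:
        clicks[user_id][category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in clicks.items()}


def users_by_segment(segments):
    """Invert the user -> segment mapping into segment -> sorted list of users."""
    grouped = defaultdict(list)
    for user, segment in segments.items():
        grouped[segment].append(user)
    return {segment: sorted(users) for segment, users in grouped.items()}


segments = segment_users(events)
audiences = users_by_segment(segments)
```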

Real-time streaming

Analyze events from Apache Kafka, Amazon Kinesis, or other streaming data sources in real time with Apache Spark Streaming and Apache Flink to create long-running, highly available, and fault-tolerant streaming data pipelines on EMR. Persist transformed data sets to S3 or HDFS, and insights to Amazon Elasticsearch Service.
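A common computation in such pipelines is a windowed count. The sketch below shows a tumbling-window count over a finite list of events in plain Python. It only illustrates the windowing arithmetic. A real Spark Streaming or Flink job would also handle unbounded input, late events, and fault-tolerant state. The event format is an assumption.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, event_type) in fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, event_type) pairs.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = Counter()
    for timestamp, event_type in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, event_type)] += 1
    return dict(counts)


events = [(0, "click"), (30, "click"), (59, "view"), (60, "click"), (125, "view")]
per_minute = tumbling_window_counts(events, 60)
```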

Interactive analytics

EMR Notebooks provide a managed analytic environment based on open-source Jupyter that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive analyses.

Genomics

EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS.
