Amazon EMR

Easily Run and Scale Apache Spark, Hadoop, HBase, Presto, Hive, and other Big Data Frameworks

Amazon EMR is the industry-leading cloud-native big data platform for processing vast amounts of data quickly and cost-effectively at scale. Using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi (Incubating), and Presto, coupled with the dynamic scalability of Amazon EC2 and scalable storage of Amazon S3, EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis for a fraction of the cost of traditional on-premises clusters. EMR gives teams the flexibility to run use cases on single-purpose, short-lived clusters that automatically scale to meet demand, or on long-running, highly available clusters using the new multi-master deployment mode. If you have existing on-premises deployments of open source tools such as Apache Spark and Apache Hive, you can also run EMR clusters on AWS Outposts, giving you the ability to scale out both on premises via Outposts and in the cloud.

Benefits

Easy to use

EMR launches clusters in minutes. You don’t need to worry about node provisioning, infrastructure setup, Hadoop configuration, or cluster tuning. EMR takes care of these tasks so you can focus on analysis. Analysts, data engineers, and data scientists can launch a serverless Jupyter notebook in seconds using EMR Notebooks, allowing individuals and teams to collaborate and interactively explore, process, and visualize data in an easy-to-use notebook format.

Low cost

EMR pricing is simple and predictable: you pay a per-instance rate for every second used, with a one-minute minimum charge. You can launch a 10-node EMR cluster with applications such as Apache Spark and Apache Hive for as little as $0.15 per hour. Because EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
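
The per-second billing rule with a one-minute minimum can be illustrated with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, rounding partial seconds up is an assumption, and the $0.015 hourly rate per instance is an assumed figure chosen so that 10 instances total the $0.15 per hour quoted above.

```python
import math


def billed_seconds(seconds_used: float, minimum_seconds: int = 60) -> int:
    """Seconds billed for one instance under per-second billing.

    Assumption: partial seconds round up, and any instance that ran at all
    is billed for at least `minimum_seconds` (the one-minute minimum).
    """
    if seconds_used <= 0:
        return 0
    return max(minimum_seconds, math.ceil(seconds_used))


def cluster_cost(seconds_used: float,
                 hourly_rate_per_instance: float,
                 instance_count: int) -> float:
    """Estimated charge for a cluster whose instances all ran equally long."""
    per_second_rate = hourly_rate_per_instance / 3600
    return billed_seconds(seconds_used) * per_second_rate * instance_count


if __name__ == "__main__":
    # Hypothetical rate: $0.015 per instance-hour, 10 instances.
    print(f"45 seconds: ${cluster_cost(45, 0.015, 10):.4f}")   # billed as 60 s
    print(f"1 hour:     ${cluster_cost(3600, 0.015, 10):.4f}")
```

Under these assumptions a 45-second run is charged as a full minute, while a one-hour run costs the full hourly figure.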

Elastic

With EMR, you can provision one, hundreds, or thousands of compute instances to process data at any scale. The number of instances can be increased or decreased manually or automatically using Auto Scaling (which manages cluster sizes based on utilization), and you only pay for what you use. Unlike the rigid infrastructure of on-premises clusters, EMR decouples compute and persistent storage, giving you the ability to scale each independently.
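
To make the idea of utilization-based scaling concrete, here is a toy decision rule. It is not the EMR Auto Scaling implementation: the function name, the thresholds, the step size, and the node limits are all hypothetical values chosen for illustration.

```python
def target_node_count(current_nodes: int,
                      utilization: float,
                      min_nodes: int = 1,
                      max_nodes: int = 100,
                      scale_out_above: float = 0.80,
                      scale_in_below: float = 0.30,
                      step: int = 1) -> int:
    """Return the node count a simple threshold rule would aim for.

    `utilization` is a fraction between 0.0 and 1.0. All thresholds and
    limits are illustrative assumptions, not EMR defaults.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")

    if utilization > scale_out_above:
        desired = current_nodes + step      # busy: add capacity
    elif utilization < scale_in_below:
        desired = current_nodes - step      # idle: remove capacity
    else:
        desired = current_nodes             # within the target band

    # Never leave the configured minimum/maximum cluster size.
    return max(min_nodes, min(max_nodes, desired))


if __name__ == "__main__":
    for load in (0.95, 0.50, 0.10):
        print(load, "->", target_node_count(10, load))
```

A rule like this only ever changes the compute side of the cluster, which is consistent with compute and persistent storage being scaled independently.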

Reliable

Spend less time tuning and monitoring your cluster. EMR is tuned for the cloud and constantly monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. EMR provides the latest stable open source software releases, so you don’t have to manage updates and bug fixes, leading to fewer issues and less effort to maintain the environment. With multiple master nodes, clusters are highly available and automatically fail over in the event of a node failure.

Secure

EMR automatically configures EC2 firewall settings controlling network access to instances, and launches clusters in an Amazon Virtual Private Cloud (VPC), a logically isolated network you define. For objects stored in S3, server-side encryption or client-side encryption can be used with EMRFS (an object store for Hadoop on S3), using the AWS Key Management Service or your own customer-managed keys. EMR makes it easy to enable other encryption options, like in-transit and at-rest encryption, and strong authentication with Kerberos.

Flexible

You have complete control over your cluster. You have root access to every instance, can easily install additional applications, and can customize every cluster with bootstrap actions. You can also launch EMR clusters with custom Amazon Linux AMIs, and reconfigure running clusters on the fly without the need to re-launch the cluster.
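
As a rough illustration of how a bootstrap action and a custom AMI might be specified when launching a cluster programmatically, the sketch below builds a partial request dictionary. The field names follow the shape of the `run_job_flow` call in the AWS SDK for Python (boto3) as best understood here and should be checked against the current API reference; the helper name, bucket path, and cluster name are hypothetical, and a real request also needs instance and IAM role settings that are omitted.

```python
from typing import Any, Dict, Optional


def build_cluster_request(name: str,
                          release_label: str,
                          bootstrap_script_s3_path: str,
                          custom_ami_id: Optional[str] = None) -> Dict[str, Any]:
    """Build a partial cluster-launch request (hypothetical helper).

    Only the customization-related fields are shown; instance groups and
    IAM roles are intentionally left out of this sketch.
    """
    request: Dict[str, Any] = {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "BootstrapActions": [
            {
                "Name": "install-extra-libraries",
                # Script that runs on each node before applications start.
                "ScriptBootstrapAction": {
                    "Path": bootstrap_script_s3_path,
                    "Args": [],
                },
            }
        ],
    }
    if custom_ami_id is not None:
        # Optional custom Amazon Linux AMI for the cluster's instances.
        request["CustomAmiId"] = custom_ami_id
    return request


if __name__ == "__main__":
    example = build_cluster_request(
        name="example-cluster",                      # hypothetical name
        release_label="emr-5.30.0",
        bootstrap_script_s3_path="s3://example-bucket/bootstrap/install_libs.sh",
    )
    print(example["BootstrapActions"][0]["ScriptBootstrapAction"]["Path"])
```

Keeping the request as plain data makes it easy to review or version before it is passed to whichever client actually launches the cluster.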

Use cases

Machine learning

Use EMR's built-in machine learning tools, including Apache Spark MLlib, TensorFlow, and Apache MXNet, for scalable machine learning algorithms, and use custom AMIs and bootstrap actions to easily add your preferred libraries and tools to create your own predictive analytics toolset.

Learn how Intent Media uses Spark MLlib »

Extract, transform, load (ETL)

EMR can be used to quickly and cost-effectively perform data transformation (ETL) workloads, such as sort, aggregate, and join, on large datasets.

Learn how Redfin uses transient EMR clusters for ETL »

Clickstream analysis

Analyze clickstream data from Amazon S3 using Apache Spark and Apache Hive to segment users, understand user preferences, and deliver more effective ads.

Learn how Razorfish uses EMR for clickstream analysis »

Real-time streaming

Analyze events from Apache Kafka, Amazon Kinesis, or other streaming data sources in real time with Apache Spark Streaming and EMR to create long-running, highly available, and fault-tolerant streaming data pipelines. Persist transformed data sets to Amazon S3 or HDFS, and insights to Amazon Elasticsearch Service.

Learn how Hearst uses Spark Streaming »

Interactive analytics

EMR Notebooks provide a managed analytic environment based on open-source Jupyter that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis.

Genomics

EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS.

Learn about Apache Spark and Precision Medicine »

Case studies

Analyst research

Get started with AWS

Sign up for an AWS account

Instantly get access to the AWS Free Tier.

Learn with 10-minute Tutorials

Explore and learn with simple tutorials.

Start building with AWS

Begin building with step-by-step guides to help you launch your AWS project.

Migrate big data from on-premises to AWS

Read the Amazon EMR Migration Guide

Request an onsite Amazon EMR Migration Workshop

Learn more about big data on AWS

Visit the Big Data Blog