Getting started with Amazon EMR

How to use EMR

1. Develop your data processing application

You can use Java, Hive (a SQL-like language), Pig (a data processing language), Cascading, Ruby, Perl, Python, R, PHP, C++, or Node.js. Amazon EMR provides code samples and tutorials to get you up and running quickly.
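As an illustration of what a small data processing application might look like in Python, here is a minimal word-count sketch written in the map and reduce style. The function names (map_words, reduce_counts, word_count) are hypothetical, and the local sort only stands in for the shuffle that a cluster framework would perform between the two phases.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer: sum the counts for each word.

    The pairs must already be sorted by word so that equal words are adjacent.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort (a local stand-in for the shuffle), and reduce."""
    return dict(reduce_counts(sorted(map_words(lines))))
```

Calling word_count on an iterable of text lines returns a dictionary of word totals, which makes the logic easy to check on a laptop before you run it against real data.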

2. Upload your application and data to Amazon S3

If you have a large amount of data to upload, consider AWS Import/Export Snowball, which transfers data on physical storage devices, or AWS Direct Connect, which establishes a dedicated network connection from your data center to AWS. You can also write your data directly to a running cluster.
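For smaller uploads, a script is often enough. The sketch below assumes you use the boto3 SDK for Python and pass in a client created with boto3.client("s3"); the helper names (plan_uploads, upload_all), the bucket, and the key prefix are placeholders chosen for illustration.

```python
import os


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to an S3 object key under key_prefix."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # S3 keys always use forward slashes, whatever the local OS uses.
            relative = os.path.relpath(path, local_dir).replace(os.sep, "/")
            plan.append((path, f"{key_prefix.rstrip('/')}/{relative}"))
    return sorted(plan)


def upload_all(s3_client, bucket, plan):
    """Upload each planned file.

    s3_client is assumed to be boto3.client("s3"); upload_file takes the
    local filename, the bucket name, and the object key, in that order.
    """
    for path, key in plan:
        s3_client.upload_file(path, bucket, key)
    return len(plan)
```

Separating the plan from the upload lets you print and review the key layout before any data leaves your machine.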

3. Configure and launch your cluster

Using the AWS Management Console, the AWS CLI, SDKs, or APIs, specify the number of Amazon EC2 instances to provision in your cluster, the types of instances to use (standard, high memory, high CPU, high I/O, etc.), the applications to install (Apache Spark, Apache Hive, Apache HBase, Presto, etc.), and the location of your application and data. You can use Bootstrap Actions to install additional software or change default settings.
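If you launch through an SDK, the choices described above become fields in a single request. The following is a minimal sketch, assuming the boto3 SDK for Python and its EMR run_job_flow call. The helper names are hypothetical; the release label, instance type, and S3 paths in the usage note are placeholders; and the two role names are the EMR defaults, which may not exist in your account.

```python
def build_cluster_request(name, release_label, instance_type, instance_count,
                          applications, log_uri, step_args=None,
                          bootstrap_script=None):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    if instance_count < 1:
        raise ValueError("a cluster needs at least one instance")
    groups = [{"Name": "Primary node", "InstanceRole": "MASTER",
               "InstanceType": instance_type, "InstanceCount": 1}]
    if instance_count > 1:
        groups.append({"Name": "Core nodes", "InstanceRole": "CORE",
                       "InstanceType": instance_type,
                       "InstanceCount": instance_count - 1})
    request = {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": groups,
            # False: the cluster terminates once its steps are finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
    if step_args:
        # The step points the cluster at your application's location.
        request["Steps"] = [{
            "Name": "Run application",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar",
                              "Args": list(step_args)},
        }]
    if bootstrap_script:
        request["BootstrapActions"] = [{
            "Name": "Custom setup",
            "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
        }]
    return request


def launch_cluster(emr_client, request):
    """Start the cluster; emr_client is assumed to be boto3.client("emr")."""
    return emr_client.run_job_flow(**request)["JobFlowId"]
```

For example, build_cluster_request("demo", "emr-6.15.0", "m5.xlarge", 3, ["Spark"], "s3://example-bucket/logs/", step_args=["spark-submit", "s3://example-bucket/app/job.py"]) describes a three-instance Spark cluster that runs one application and then shuts down.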

4. Monitor the cluster

You can monitor the cluster’s health and progress using the Management Console, Command Line Interface, SDKs, or APIs. EMR integrates with Amazon CloudWatch for monitoring and alarms, and supports popular monitoring tools like Ganglia. You can add or remove capacity at any time to handle more or less data. For troubleshooting, you can use the console’s debugging GUI.
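Monitoring from an SDK usually comes down to polling the cluster's state. The sketch below assumes the boto3 SDK for Python and its EMR describe_cluster call; the function name, the polling interval, and the choice to stop at WAITING (idle and ready for more work) are illustrative assumptions.

```python
import time

# States at which this sketch stops polling.
STOP_STATES = {"WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(emr_client, cluster_id, poll_seconds=30, max_polls=120,
                     sleep=time.sleep):
    """Poll the cluster until it is idle or terminated.

    emr_client is assumed to be boto3.client("emr"). Returns the distinct
    states observed, in order, so the caller can log the progression.
    """
    seen = []
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in STOP_STATES:
            return seen
        sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_id} still {seen[-1]} after {max_polls} polls")
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.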

5. Retrieve the output

Retrieve the output from Amazon S3 or HDFS on the cluster. Visualize the data with tools like Amazon QuickSight, Tableau, and MicroStrategy. Amazon EMR will automatically terminate the cluster when processing is complete. Alternatively, you can leave the cluster running and give it more work to do.
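When the output lands in Amazon S3, you can pull it down with a short script. This is a minimal sketch, assuming the boto3 SDK for Python; the function names are hypothetical, and the bucket and prefix you pass in are whatever your application wrote to.

```python
import os


def list_output_keys(s3_client, bucket, prefix):
    """List every object key under prefix, following pagination."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def download_outputs(s3_client, bucket, prefix, dest_dir):
    """Download each output object into dest_dir and return the local paths.

    s3_client is assumed to be boto3.client("s3"). Files are stored by base
    name, so objects in different folders with the same name would collide.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for key in list_output_keys(s3_client, bucket, prefix):
        if key.endswith("/"):
            continue  # skip zero-byte folder placeholder objects
        target = os.path.join(dest_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, target)
        saved.append(target)
    return saved
```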

Are you ready to launch your first cluster?

Launch a cluster using the Amazon EMR Management Console. On the Create Cluster page, go to Advanced cluster configuration and click the gray "Configure Sample Application" button at the top right if you want to run a sample application with sample data.

Videos

Stay up to date with AWS webinars

A technical introduction to Amazon EMR (50:44)

Amazon EMR deep dive & best practices (49:12)

Tutorials

Learn at your own pace with other tutorials

Spark

Real-time stream processing using Apache Spark streaming and Apache Kafka on AWS

Learn how to set up Apache Kafka on EC2, use Spark Streaming on EMR to process data coming in to Apache Kafka topics, and query streaming data using Spark SQL on EMR.

Spark

Large-scale machine learning with Spark on Amazon EMR

Learn how Intent Media used Spark and Amazon EMR for their modeling workflows.

HBase

Low-latency SQL and secondary indexes with Phoenix and HBase

Learn how to connect to Phoenix using JDBC, create a view over an existing HBase table, and create a secondary index for increased read performance.

HBase

Using HBase with Hive for NoSQL and analytics workloads

Learn how to launch an EMR cluster with HBase and restore a table from a snapshot in Amazon S3.

Presto

Launch an Amazon EMR cluster with Presto and Airpal

Learn how to set up a Presto cluster and use Airpal to process data stored in S3.

Hive

Process and analyze big data using Hive on Amazon EMR and MicroStrategy Suite

Learn how to connect to a Hive job flow running on Amazon Elastic MapReduce to create a secure and extensible platform for reporting and analytics.

Flink

Build a real-time stream processing pipeline with Apache Flink on AWS

This tutorial outlines a reference architecture for a consistent, scalable, and reliable stream processing pipeline that is based on Apache Flink using Amazon EMR, Amazon Kinesis, and Amazon Elasticsearch Service.

Training and help

Short-term engagements

Do you need help building a proof of concept or tuning your EMR applications? AWS has a global support team that specializes in EMR. Please contact us if you are interested in learning more about short-term (2-6 week) paid support engagements.

AWS big data training

The Big Data on AWS course gives you hands-on experience using Amazon Web Services for big data workloads. It shows you how to run Amazon EMR jobs to process data using the broad ecosystem of Hadoop tools like Pig and Hive. It also teaches you how to create big data environments in the cloud by working with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and apply best practices to design big data environments for analysis, security, and cost-effectiveness.

Additional training

Scale Unlimited offers customized on-site training for companies that need to quickly learn how to use EMR and other big data technologies.

Additional resources

Stay connected with AWS

Big Data blog

Machine Learning blog

Documentation

FAQs

Articles and tutorials

AWS Cloud Economics Center

AWS Pricing Calculator

AWS Trusted Advisor

AWS Support Plans

Next steps

Getting started

Getting started tutorial

Resources

Discover more Amazon EMR resources

Free Tier

Sign up for a free account

Console

Ready to build? Get started with Amazon EMR