Getting started with Amazon EMR
How to use EMR
1. Choose your preferred EMR deployment model
Amazon EMR lets you process vast amounts of data using open-source tools including Apache Spark, Hive, Flink, and Trino. Start by choosing your preferred EMR deployment model:
- EMR Serverless: Run applications without managing clusters; resources scale automatically up and down with your workload.
- EMR on EC2: Take full control over cluster configuration, including instance types and custom AMIs.
- EMR on EKS: Consolidate analytics with your other Kubernetes-based applications on a shared Amazon EKS cluster.
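For the serverless path, a deployment begins with a single CreateApplication call. The sketch below uses boto3 (the AWS SDK for Python); the application name, release label, and region are illustrative placeholders, and the actual API call is commented out because it requires AWS credentials.

```python
# import boto3  # AWS SDK for Python; uncomment when credentials are configured

def create_application_request(name="etl-demo", release="emr-7.5.0"):
    """Build a CreateApplication request for EMR Serverless.

    The name and release label here are illustrative placeholders.
    """
    return {
        "name": name,
        "releaseLabel": release,  # EMR release the application runs
        "type": "SPARK",          # engine type, e.g. "SPARK" or "HIVE"
    }

params = create_application_request()
# client = boto3.client("emr-serverless", region_name="us-east-1")
# app_id = client.create_application(**params)["applicationId"]
```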
2. Develop your data processing application
Amazon EMR supports a wide range of frameworks and languages, allowing you to build everything from standard ETL pipelines to large-scale generative AI data preparation.
- Languages: Use Python (PySpark) for data science and machine learning, SQL (via Hive or Trino) for analytical queries, or Java and Scala for high-performance Spark applications.
- Frameworks: Build and run applications using Apache Spark for large-scale data processing, Apache Flink for real-time streaming, Trino for fast SQL across multiple data sources, and Apache Hudi or Iceberg for managing transactional data lakes.
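As a concrete example, a minimal PySpark ETL application might look like the sketch below. The S3 paths and application name are hypothetical, and the pyspark import is kept inside the function so the file can be read without a Spark runtime installed.

```python
def word_count_sql():
    """The aggregation this sketch runs (plain SQL, engine-agnostic)."""
    return "SELECT word, count(*) AS n FROM words GROUP BY word"

def main():
    # pyspark is provided by the EMR runtime; imported locally so this
    # module can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    df = spark.read.json("s3://my-bucket/input/")  # hypothetical input path
    df.createOrReplaceTempView("words")
    result = spark.sql(word_count_sql())
    result.write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical output
    spark.stop()

# When packaged as a job script, main() would run under an
# `if __name__ == "__main__":` guard and be submitted via spark-submit.
```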
3. Prepare and ingest data
To begin processing, your data must be accessible to Amazon EMR. While Amazon S3 is the standard storage layer for EMR applications, you have several high-speed methods to move data from your local environment or other AWS services.
- Direct Uploads: For immediate processing, upload objects directly to Amazon S3 using the AWS Management Console, CLI, or SDKs.
- High-Speed Connectivity: Use AWS Direct Connect to bypass the public internet and establish a private, dedicated network connection from your data center to AWS. This provides consistent bandwidth and reduced latency for large-scale transfers.
- Real-Time Streaming: Use Amazon Data Firehose or Amazon Managed Streaming for Apache Kafka (MSK) to feed data directly into your EMR applications as it is generated, enabling near real-time analytics.
- Zero-ETL Integrations: Analyze data from Amazon Aurora or Amazon Redshift using Zero-ETL features, which allow EMR to access operational data without the need for manual pipeline construction.
- Hybrid Access: If your data resides in a local Hadoop HDFS environment, you can use the S3 Connector to read data directly into EMR or sync specific datasets for cloud-based processing.
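The direct-upload option above can be scripted in a few lines. The helper below only assembles the upload arguments (the bucket and prefix are hypothetical); the actual boto3 call is commented out because it needs AWS credentials.

```python
# import boto3  # uncomment when AWS credentials are configured

def upload_args(local_path, bucket, prefix):
    """Derive S3 upload arguments for a local file (pure helper)."""
    filename = local_path.rsplit("/", 1)[-1]
    return {
        "Filename": local_path,
        "Bucket": bucket,                 # hypothetical bucket name
        "Key": f"{prefix}/{filename}",    # object key under the prefix
    }

args = upload_args("/tmp/events.json", "my-data-lake", "raw/2024")
# boto3.client("s3").upload_file(**args)  # performs the actual upload
```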
4. Launch your application
Amazon EMR offers a streamlined deployment experience, whether you are running a one-time job or a continuous production pipeline.
- Launch via EMR Studio: Open your EMR Studio notebook and attach it to a Serverless Application or an existing EC2 Cluster. With one click, you can execute your Spark or Hive code in a fully managed environment.
- Launch via EMR Serverless: Submit your job via the console, CLI, or API. EMR automatically provisions the compute and memory your job needs, scaling up to handle peaks and down to zero when the job finishes.
- Launch via SageMaker Unified Studio: Within SageMaker Unified Studio, you can open a serverless notebook and instantly connect it to an EMR Serverless application or an EMR on EC2 cluster.
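For the EMR Serverless path, job submission reduces to one StartJobRun call. The sketch below builds the request; the application ID, IAM role ARN, and script location are placeholders, and the boto3 call itself is commented out.

```python
# import boto3  # uncomment to actually submit the job

def build_job_run(app_id, role_arn, script_uri):
    """Assemble a StartJobRun request for a Spark job on EMR Serverless."""
    return {
        "applicationId": app_id,
        "executionRoleArn": role_arn,      # IAM role the job assumes
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,  # PySpark script stored on S3
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

request = build_job_run(
    app_id="00example",                                     # placeholder
    role_arn="arn:aws:iam::123456789012:role/EMRJobRole",   # placeholder
    script_uri="s3://my-bucket/scripts/etl.py",             # placeholder
)
# boto3.client("emr-serverless").start_job_run(**request)
```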
5. Monitor and optimize execution
EMR provides visibility into your data pipelines with built-in tools that help you identify bottlenecks and optimize costs automatically.
Monitor job progress and cluster health through the EMR Management Console, AWS CLI, or SDKs. EMR provides native integration with Amazon CloudWatch for real-time metrics, logs, and automated alerting.
Access the live and persistent Spark UI or Tez UI directly from the console to review execution plans and DAGs (Directed Acyclic Graphs). You can debug jobs while they run and, for serverless jobs, even after they finish.
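Programmatic monitoring usually means polling the job state until it reaches a terminal value. The helper below is generic: `get_state` is any zero-argument callable, for example one wrapping the EMR Serverless GetJobRun API, and the state names follow the EMR Serverless job lifecycle.

```python
import time

# Terminal states in the EMR Serverless job-run lifecycle.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}

def wait_for_job(get_state, poll_seconds=30):
    """Poll get_state() until the job reaches a terminal state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)

# With boto3 (requires credentials), get_state could be wired up as:
# client = boto3.client("emr-serverless")
# get_state = lambda: client.get_job_run(
#     applicationId=app_id, jobRunId=run_id)["jobRun"]["state"]
```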
Are you ready to launch your first cluster?
Learn more
Training and help
Do you need help building a proof of concept or tuning your EMR applications? AWS has a global support team that specializes in EMR. Please contact us if you are interested in learning more about short-term (2-6 week) paid support engagements.
The Big Data on AWS course is designed to give you hands-on experience using Amazon Web Services for big data workloads. AWS will show you how to run Amazon EMR jobs to process data using Hadoop ecosystem tools such as Pig and Hive. You will also learn how to create big data environments in the cloud by working with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and apply best practices to design big data environments for analysis, security, and cost-effectiveness. To learn more about the Big Data course, click here.
Scale Unlimited offers customized on-site training for companies that need to quickly learn how to use EMR and other big data technologies. To find out more, click here.