Getting started with Amazon EMR
How to use EMR
1. Choose your preferred EMR deployment model
Amazon EMR lets you process vast amounts of data using open-source tools including Apache Spark, Hive, Flink, and Trino. Start by choosing your preferred EMR deployment model:
- EMR Serverless: Run applications without managing clusters; resources scale automatically up and down with your workload.
- EMR on EC2: Take full control over cluster configuration, including instance types and custom AMIs.
- EMR on EKS: Consolidate analytics with your other Kubernetes-based applications on a shared Amazon EKS cluster.
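For the serverless path, a deployment begins with a single CreateApplication call. The sketch below uses boto3 (the AWS SDK for Python); the application name, release label, and region are illustrative placeholders, and the actual API call is commented out because it requires AWS credentials.

```python
# import boto3  # AWS SDK for Python; uncomment when credentials are configured

def create_application_request(name="etl-demo", release="emr-7.5.0"):
    """Build a CreateApplication request for EMR Serverless.

    The name and release label here are illustrative placeholders.
    """
    return {
        "name": name,
        "releaseLabel": release,  # EMR release the application runs
        "type": "SPARK",          # engine type, e.g. "SPARK" or "HIVE"
    }

params = create_application_request()
# client = boto3.client("emr-serverless", region_name="us-east-1")
# app_id = client.create_application(**params)["applicationId"]
```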
2. Develop your data processing application
Amazon EMR supports a wide range of frameworks and languages, allowing you to build everything from standard ETL pipelines to large-scale generative AI data preparation.
- Languages: Use Python (PySpark) for data science and machine learning, SQL (via Hive or Trino) for analytical queries, or Java and Scala for high-performance Spark applications.
- Frameworks: Build and run applications using Apache Spark for large-scale data processing, Apache Flink for real-time streaming, Trino for fast SQL across multiple data sources, and Apache Hudi or Iceberg for managing transactional data lakes.
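As a concrete example, a minimal PySpark ETL application might look like the sketch below. The S3 paths and application name are hypothetical, and the pyspark import is kept inside the function so the file can be read without a Spark runtime installed.

```python
def word_count_sql():
    """The aggregation this sketch runs (plain SQL, engine-agnostic)."""
    return "SELECT word, count(*) AS n FROM words GROUP BY word"

def main():
    # pyspark is provided by the EMR runtime; imported locally so this
    # module can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    df = spark.read.json("s3://my-bucket/input/")  # hypothetical input path
    df.createOrReplaceTempView("words")
    result = spark.sql(word_count_sql())
    result.write.mode("overwrite").parquet("s3://my-bucket/output/")  # hypothetical output
    spark.stop()

# When packaged as a job script, main() would run under an
# `if __name__ == "__main__":` guard and be submitted via spark-submit.
```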
3. Prepare and ingest data
To begin processing, your data must be accessible to Amazon EMR. While Amazon S3 is the standard storage layer for EMR applications, you have several high-speed methods to move data from your local environment or other AWS services.
- Direct Uploads: For immediate processing, upload objects directly to Amazon S3 using the AWS Management Console, CLI, or SDKs.
- High-Speed Connectivity: Use AWS Direct Connect to bypass the public internet and establish a private, dedicated network connection from your data center to AWS. This provides consistent bandwidth and reduced latency for large-scale transfers.
- Real-Time Streaming: Use Amazon Data Firehose or Amazon Managed Streaming for Apache Kafka (MSK) to feed data directly into your EMR applications as it is generated, enabling near real-time analytics.
- Zero-ETL Integrations: Analyze data from Amazon Aurora or Amazon Redshift using Zero-ETL features, which allow EMR to access operational data without the need for manual pipeline construction.
- Hybrid Access: If your data resides in a local Hadoop HDFS environment, you can use the S3 Connector to read data directly into EMR or sync specific datasets for cloud-based processing.
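The direct-upload option above can be scripted in a few lines. The helper below only assembles the upload arguments (the bucket and prefix are hypothetical); the actual boto3 call is commented out because it needs AWS credentials.

```python
# import boto3  # uncomment when AWS credentials are configured

def upload_args(local_path, bucket, prefix):
    """Derive S3 upload arguments for a local file (pure helper)."""
    filename = local_path.rsplit("/", 1)[-1]
    return {
        "Filename": local_path,
        "Bucket": bucket,                 # hypothetical bucket name
        "Key": f"{prefix}/{filename}",    # object key under the prefix
    }

args = upload_args("/tmp/events.json", "my-data-lake", "raw/2024")
# boto3.client("s3").upload_file(**args)  # performs the actual upload
```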
4. Launch your application
Amazon EMR offers a streamlined deployment experience, whether you are running a one-time job or a continuous production pipeline.
- Launch via EMR Studio: Open your EMR Studio notebook and attach it to a Serverless Application or an existing EC2 Cluster. With one click, you can execute your Spark or Hive code in a fully managed environment.
- Launch via EMR Serverless: Submit your job via the console, CLI, or API. EMR automatically provisions the compute and memory your job needs, scaling up to handle peaks and down to zero when the job finishes.
- Launch via SageMaker Unified Studio: Within SageMaker Unified Studio, you can open a serverless notebook and instantly connect it to an EMR Serverless application or an EMR on EC2 cluster.
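For the EMR Serverless path, job submission reduces to one StartJobRun call. The sketch below builds the request; the application ID, IAM role ARN, and script location are placeholders, and the boto3 call itself is commented out.

```python
# import boto3  # uncomment to actually submit the job

def build_job_run(app_id, role_arn, script_uri):
    """Assemble a StartJobRun request for a Spark job on EMR Serverless."""
    return {
        "applicationId": app_id,
        "executionRoleArn": role_arn,      # IAM role the job assumes
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,  # PySpark script stored on S3
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

request = build_job_run(
    app_id="00example",                                     # placeholder
    role_arn="arn:aws:iam::123456789012:role/EMRJobRole",   # placeholder
    script_uri="s3://my-bucket/scripts/etl.py",             # placeholder
)
# boto3.client("emr-serverless").start_job_run(**request)
```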
5. Monitor and optimize execution
EMR provides visibility into your data pipelines with built-in tools that help you identify bottlenecks and optimize costs automatically.
Monitor job progress and cluster health through the EMR Management Console, AWS CLI, or SDKs. EMR provides native integration with Amazon CloudWatch for real-time metrics, logs, and automated alerting.
Access the live and persistent Spark UI or Tez UI directly from the console to review execution plans and DAGs (Directed Acyclic Graphs). You can debug jobs while they run and, for serverless jobs, even after they finish.
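Programmatic monitoring usually means polling the job state until it reaches a terminal value. The helper below is generic: `get_state` is any zero-argument callable, for example one wrapping the EMR Serverless GetJobRun API, and the state names follow the EMR Serverless job lifecycle.

```python
import time

# Terminal states in the EMR Serverless job-run lifecycle.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}

def wait_for_job(get_state, poll_seconds=30):
    """Poll get_state() until the job reaches a terminal state."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)

# With boto3 (requires credentials), get_state could be wired up as:
# client = boto3.client("emr-serverless")
# get_state = lambda: client.get_job_run(
#     applicationId=app_id, jobRunId=run_id)["jobRun"]["state"]
```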
Are you ready to launch your first cluster?
Learn more
Training and help
Do you need help building a proof of concept or tuning your EMR applications? AWS has a global support team that specializes in EMR. Please contact us if you are interested in learning more about short-term (2-6 week) paid support engagements.
The Big Data on AWS course is designed to give you hands-on experience using Amazon Web Services for big data workloads. AWS will show you how to run Amazon EMR jobs to process data using Hadoop ecosystem tools such as Pig and Hive. You will also learn how to create big data environments in the cloud by working with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and apply best practices to design big data environments for analysis, security, and cost-effectiveness. To learn more about the Big Data course, click here.
Scale Unlimited offers customized on-site training for companies that need to quickly learn how to use EMR and other big data technologies. To find out more, click here.