Q: What is Amazon EMR?
Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto. With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 1.7x faster than standard Apache Spark.
Q: Why should I use Amazon EMR?
Amazon EMR lets you focus on transforming and analyzing your data without having to worry about managing compute capacity or open-source applications, and saves you money. Using EMR, you can instantly provision as much or as little capacity as you like on Amazon EC2 and set up scaling rules to manage changing compute demand. You can set up CloudWatch alerts to notify you of changes in your infrastructure and take actions immediately. If you use Kubernetes, you can also use EMR to submit your workloads to Amazon EKS clusters.. Whether you use EC2 or EKS, you benefit from EMR’s optimized runtimes which speed your analysis and save both time and money.
Q: How can I deploy and manage Amazon EMR?
You can deploy your workloads to EMR using Amazon EC2, Amazon Elastic Kubernetes Service (EKS), or on-premises AWS Outposts. You can run and manage your workloads withthe EMR Console, API, SDK or CLI and orchestrate them using Amazon Managed Workflows for Apache Airflow (MWAA) or AWS Step Functions. For an interactive experience you can use EMR Studio or SageMaker Studio.
Q: How can I get started with Amazon EMR?
Q: How reliable is Amazon EMR?Please refer to our Service Level Agreement.
Developing & debugging
Q: Where can I find code samples?
Q: How do I develop a data processing application?
You can develop, visualize and debug data science and data engineering applications written in R, Python, Scala, and PySpark in Amazon EMR Studio. You can also develop a data processing job on your desktop, for example, using Eclipse, Spyder, PyCharm, or RStudio, and run it on Amazon EMR. Additionally, you can select JupyterHub or Zeppelin in the software configuration when spinning up a new cluster and develop your application on Amazon EMR using one or more instances.
Q: What is the benefit of using the Command Line Tools or APIs vs. AWS Management Console?
The Command Line Tools or APIs provide the ability to programmatically launch and monitor progress of running clusters, to create additional custom functionality around clusters (such as sequences with multiple processing steps, scheduling, workflow, or monitoring), or to build value-added tools or applications for other Amazon EMR customers. In contrast, the AWS Management Console provides an easy-to-use graphical interface for launching and monitoring your clusters directly from a web browser.
Q: Can I add steps to a cluster that is already running?
Yes. Once the job is running, you can optionally add more steps to it via the AddJobFlowSteps API. The AddJobFlowSteps API will add new steps to the end of the current step sequence. You may want to use this API to implement conditional logic in your cluster or for debugging.
Q: Can I be notified when my cluster is finished?
You can sign up for up Amazon SNS and have the cluster post to your SNS topic when it is finished. You can also view your cluster progress on the AWS Management Console or you can use the Command Line, SDK, or APIs to get a status on the cluster.
Q: Can I terminate my cluster when my steps are finished?
Yes. You can terminate your cluster automatically when all your steps finish by turning the auto-terminate flag on.
Q: What OS versions are supported with Amazon EMR?
Amazon EMR 5.30.0 and later, and the Amazon EMR 6.x series are based on Amazon Linux 2. You can also specify a custom AMI that you create based on the Amazon Linux 2. This allows you to perform sophisticated pre-configuration for virtually any application. For more information, see Using a Custom AMI.
Q: Does Amazon EMR support third-party software packages?
Yes. You can use Bootstrap Actions to install third-party software packages on your cluster. You can also upload statically compiled executables using the Hadoop distributed cache mechanism. EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the EMR cluster host or inside a Docker container. Please see our documentation to learn more.
Q: What tools are available to me for debugging?
There are several tools you can use to gather information about your cluster to help determine what went wrong. If you use Amazon EMR studio, you can launch tools like Spark UI and YARN Timeline Service to simplify debugging. From the Amazon EMR Console, you can get off-cluster access to persistent application user interfaces for Apache Spark, Tez UI and the YARN timeline server, several on-cluster application user interfaces, and a summary view of application history in the Amazon EMR console for all YARN applications. You can also connect to your Master Node Using SSH and view cluster instances via these the web interfaces . For more information, see our documentation.
Q: What is EMR Studio?
EMR Studio is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.
It is a fully managed application with single sign-on, fully managed Jupyter Notebooks, automated infrastructure provisioning, and the ability to debug jobs without logging into the AWS Console or cluster. Data scientists and analysts can install custom kernels and libraries, collaborate with peers using code repositories like GitHub and BitBucket, or execute parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow. You can read orchestrating analytics jobs on Amazon EMR notesbooks using Amazon MWAA to learn more. EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark. Administrators can setup EMR Studio for analysts to run their applications on existing EMR clusters or create new clusters using pre-defined AWS CloudFormation templates for EMR.
Q: What can I do with EMR Studio?
With EMR Studio, you can log in directly to fully managed Jupyter notebooks using your corporate credentials without logging into the AWS console, start notebooks in seconds, get onboarded with sample notebooks, and perform your data exploration. You can also customize your environment by loading custom kernels and python libraries from notebooks. EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark. You can collaborate with peers by sharing notebooks via GitHub and other repositories. You can also run Notebooks directly as continuous integration and deployment pipelines. You can pass different parameter values to a notebook. You can also chain notebooks, and integrate notebooks into scheduled workflows using workflow orchestration services like Apache Airflow. Further, you can debug clusters and jobs using as few clicks as possible with native applications interfaces such as the Spark UI and the YARN Timeline service.
Q: How is EMR Studio different from EMR Notebooks?
There are five main differences.
- There is no need to access AWS Management Console for EMR Studio. EMR Studio is hosted outside of the AWS Management Console. This is useful if you do not provide data scientists or data engineers access to the AWS Management Console.
- You can use enterprise credentials from your identity provider using AWS IAM Identity Center (successor to AWS SSO) to log in to EMR Studio.
- EMR Studio brings you a notebook first experience. EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark. Running code on a cluster is as simple as attaching the notebook to an existing cluster or provisioning a new one.
- EMR Studio has a simplified user interface and abstracts hardware configurations. For example, you can setup cluster templates once and use the templates to start new clusters.
- EMR Studio enables a simplified debugging experience so that you can access the native application user interfaces in one place using as few clicks as possible.
Q: How is EMR Studio different from SageMaker Studio?
You can use both EMR Studio and SageMaker Studio with Amazon EMR. EMR Studio provides an integrated development environment (IDE) that makes it easy for you to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all machine learning development steps. SageMaker Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. You can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production all in one place, making you much more productive.
Q: How do I get started with EMR Studio?
Your administrator must first set up an EMR Studio. When you receive a unique sign-on URL for your Amazon EMR Studio from your administrator, you can log in to the Studio directly using your corporate credentials.
Q: Do I need to log in to the AWS Management Console to use EMR Studio?
No. After your administrator sets up an EMR Studio and provides the Studio access URL, your team can log in using corporate credentials. There’s no need to log in to the AWS Management Console. In an EMR Studio, your team can perform tasks and access resources conﬁgured by your administrator.
Q: What identity providers are supported for the single sign-on experience in EMR Studio?
AWS IAM Identity Center (successor to AWS SSO) is the single sign-on service provider for EMR Studio. The list of identity providers supported by AWS IAM can be found in our documentation.
Q: What is a Workspace in EMR Studio?
Workspaces help you organize Jupyter Notebooks. All notebooks in a Workspace are saved to the same Amazon S3 location and run on the same cluster. You can also link a code repository like a GitHub repository to all notebooks in a workspace. You can create and conﬁgure a Workspace before attaching it to a cluster, but you should connect to a cluster before running a notebook.
Q: In EMR Studio, can I create a workspace or open a workspace without a cluster?
Yes, you can create or open a workspace without attaching it to a cluster. Only when you need to execute, you should connect them to a cluster. EMR Studio kernels and applications are executed on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark.
Q: Can I install custom libraries to use in my notebook code?
All Spark queries run on your EMR cluster, so you need to install all runtime libraries that your Spark application uses on the cluster. You can easily install notebook-scoped libraries within a notebook cell. You can also install Jupyter Notebook kernels and Python libraries on a cluster master node either within a notebook cell or while connected using SSH to the master node of the cluster. For more information, see documentation. Additionally, you can use a bootstrap action or a custom AMI to install required libraries when you create a cluster. For more information, see Create Bootstrap Actions to Install Additional Software and Using a Custom AMI in the Amazon EMR Management Guide.
Q: Where are the notebooks saved?
Workspace together with notebook files in the workspace are saved automatically at regular intervals to the ipynb file format in the Amazon S3 location that you specify when you create the workspace. The notebook file has the same name as your notebook in the Amzon EMR Studio.
Q: How do I use version control with my notebook? Can I use repositories like GitHub?
You can associate Git-based repositories with your Amazon EMR Studio notebooks to save your notebooks in a version controlled environment.
Q: In EMR Studio, what compute resources can I run notebooks on?
With EMR Studio, you can run notebook code on Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS). You can attach notebooks to either existing or new clusters. You can create EMR clusters in two ways in EMR Studio: create a cluster using a pre-configured cluster template via AWS Service Catalog, create a cluster by specifying cluster name, number of instances, and instance type.
Q: Can I re-attach a workspace with a different compute resource in EMR Studio?
Yes, you can open your workspace, choose EMR Clusters icon on the left, push Detach button, and then select a cluster from the Select cluster drop down list, and push Attach button.
Q: Where do I find all my workspaces in EMR Studio?
In EMR Studio, you may choose Workspaces tab on the left and view all workspaces created by you and other users in the same AWS account.
Q: What are the IAM policies needed to use EMR Studio?
Each EMR studio needs permissions to interoperate with other AWS services. To give your EMR Studios the necessary permissions, your administrators need to create an EMR Studio service role with the provided policies. They also need to specify a user role for EMR Studio that deﬁnes Studio-level permissions. When they add users and groups from AWS IAM Identity Center (successor to AWS SSO) to EMR Studio, they can assign a session policy to a user or group to apply ﬁne-grained permission controls. Session policies help administrators reﬁne user permissions without the need to create multiple IAM roles. For more information about session policies, see Policies and Permissions in the AWS Identity and Access Management User Guide.
Q: Is there any limitations on EMR clusters I can attach my workspace to in EMR Studio?
Yes. High Availability (Multi-master) clusters, Kerberized clusters, and AWS Lake Formation clusters are currently not supported.
Q: What is the cost of using Amazon EMR Studio?
Amazon EMR Studio is provided at no additional charge to you. Applicable charges for Amazon Simple Storage Service storage and for Amazon EMR clusters apply when you use EMR Studio. For more information about pricing options and details, see Amazon EMR pricing.
Q: What are EMR Notebooks?
We recommend that new customers use Amazon EMR Studio, not EMR Notebooks. EMR Notebooks provide a managed environment, based on Jupyter Notebook, that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis using EMR clusters. Although we recommend that new customers use EMR Studio, EMR Notebooks is supported for compatibility.
Q: What can I do with EMR Notebooks?
You can use EMR Notebooks to build Apache Spark applications and run interactive queries on your EMR cluster with minimal effort. Multiple users can create serverless notebooks directly from the console, attach them to an existing shared EMR cluster, or provision a cluster directly from the console and immediately start experimenting with Spark. You can detach notebooks and re-attach them to new clusters. Notebooks are auto-saved to S3 buckets, and you can retrieve saved notebooks from the console to resume work. EMR Notebooks are prepackaged with the libraries found in the Anaconda repository, allowing you to import and use these libraries in your notebooks code and use them to manipulate data and visualize results. Further, EMR notebooks have integrated Spark monitoring capabilities that you can use to monitor the progress of your Spark jobs and debug code from within the notebook.
Q: How do I get started with EMR Notebooks?
To get started with EMR Notebooks, open the EMR console and choose Notebooks in the navigation pane. From there, just choose Create Notebook, enter a name for your notebook, choose an EMR cluster or instantly create a new one, provide a service role for the notebook to use, and choose an S3 bucket where you want to save your notebook files and then click on Create Notebook. After the notebook shows a Ready status, choose Open to start the notebook editor.
Q: What EMR release versions are supported with EMR Notebooks?
EMR Notebooks can be attached to EMR clusters running EMR release 5.18.0 or later.
Q: What is the cost of using EMR Notebooks?
EMR notebooks are provided at no additional charge to you. You will be charged as usual for the attached EMR clusters in your account. You and find out more about the pricing for your cluster by visiting https://aws.amazon.com/emr/pricing/
Q: How do I get my data into Amazon S3?
Amazon EMR provides several ways to get data onto a cluster. The most common way is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster. You can use the Distributed Cache feature of Hadoop to transfer files from a distributed file system to the local file system. For more details, see documentation. Alternatively, if you are migrating data from on premises to the cloud, you can use one of the Cloud Data Migration services from AWS.
Q: How do I get logs for terminated clusters?
Hadoop system logs as well as user logs will be placed in the Amazon S3 bucket which you specify when creating a cluster. Persistent application UIs are run off-cluster, Spark History Server, Tez UI and YARN timeline servers logs are available for 30 days after an application terminates.
Q: Do you compress logs?
No. At this time Amazon EMR does not compress logs as it moves them to Amazon S3.
Q: Can I load my data from the internet or somewhere other than Amazon S3?
Yes. You can use AWS Direct Connect to establish a private dedicated network connection to AWS. If you have large amounts of data, you can use AWS Import/Export. For more details refer to our documentation.
Q: Can Amazon EMR estimate how long it will take to process my input data?
No. As each cluster and input data is different, we cannot estimate your job duration.
Q: How much does Amazon EMR cost?
Amazon EMR pricing is simple and predictable: you pay a per-second rate for every second you use, with a one-minute minimum. You can estimate your bill using the AWS Pricing Calculator. Usage for other Amazon Web Services including Amazon EC2 is billed separately from Amazon EMR.
Q: When does billing of my Amazon EMR cluster begin and end?
Amazon EMR billing commences when the cluster is ready to execute steps. Amazon EMR billing ends when you request to shut down the cluster. For more details on when Amazon EC2 begins and ends billing, please refer to the Amazon EC2 Billing FAQ.
Q: Where can I track my Amazon EMR, Amazon EC2 and Amazon S3 usage?
You can track your usage in the Billing & Cost Management Console.
Q: How do you calculate the Normalized Instance Hours displayed on the console ?
On the AWS Management Console, every cluster has a Normalized Instance Hours column that displays the approximate number of compute hours the cluster has used, rounded up to the nearest hour.
Normalized Instance Hours are hours of compute time based on the standard of 1 hour of m1.small usage = 1 hour normalized compute time. You can view our documentation to see a list of different sizes within an instance family, and the corresponding normalization factor per hour.
For example, if you run a 10-node r3.8xlarge cluster for an hour, the total number of Normalized Instance Hours displayed on the console will be 640 (10 (number of nodes) x 64 (normalization factor) x 1 (number of hours that the cluster ran) = 640).
This is an approximate number and should not be used for billing purposes. Please refer to the Billing & Cost Management Console for billable Amazon EMR usage.
Q: Does Amazon EMR support Amazon EC2 On-Demand, Spot, and Reserved Instances?
Yes. Amazon EMR seamlessly supports On-Demand, Spot, and Reserved Instances. Click here to learn more about Amazon EC2 Reserved Instances. Click here to learn more about Amazon EC2 Spot Instances. Click here to learn more about Amazon EC2 Capacity Reservations.
Q: Do your prices include taxes?
Except as otherwise noted, our prices are exclusive of applicable taxes and duties, including VAT and applicable sales tax. For customers with a Japanese billing address, use of AWS services is subject to Japanese Consumption Tax. Learn more.
Security and Data Access Control
Q: How do I prevent other people from viewing my data during cluster execution?
Amazon EMR starts your instances in two Amazon EC2 security groups, one for the master and another for the other cluster nodes. The master security group has a port open for communication with the service. It also has the SSH port open to allow you to SSH into the instances, using the key specified at startup. The other nodes start in a separate security group, which only allows interaction with the master instance. By default both security groups are set up to not allow access from external sources including Amazon EC2 instances belonging to other customers. Since these are security groups within your account, you can reconfigure them using the standard EC2 tools or dashboard. Click here to learn more about EC2 security groups. Additionally, you can configure Amazon EMR block public access in each region that you use to prevent cluster creation if a rule allows public access on any port that you don't add to a list of exceptions.
Q: How secure is my data?
Amazon S3 provides authentication mechanisms to ensure that stored data is secured against unauthorized access. Unless the customer who is uploading the data specifies otherwise, only that customer can access the data. Amazon EMR customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission. In addition, Amazon EMR always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3 (using any common data encryption tool); they then need to add a decryption step to the beginning of their cluster when Amazon EMR fetches the data from Amazon S3.
Q: Can I get a history of all EMR API calls made on my account for security or compliance auditing?
Yes. AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The AWS API call history produced by CloudTrail enables security analysis, resource change tracking, and compliance auditing. Learn more about CloudTrail at the AWS CloudTrail detail page, and turn it on via CloudTrail's AWS Management Console.
Q: How do I control what EMR users can access in Amazon S3?
By default, Amazon EMR application processes use EC2 instance profiles when they call other AWS services. For multi-tenant clusters, Amazon EMR offers three options to manage user access to Amazon S3 data.
- Integration with AWS Lake Formation allows you to define and manage fine-grained authorization policies in AWS Lake Formation to access databases, tables, and columns in AWS Glue Data Catalog. You can enforce the authorization policies on jobs submitted through Amazon EMR Notebooks and Apache Zeppelin for interactive EMR Spark workloads, and send auditing events to AWS CloudTrail. By enabling this integration, you also enable federated Single Sign-On to EMR Notebooks or Apache Zeppelin from enterprise identity systems compatible with Security Assertion Markup Language (SAML) 2.0.
- Native integration with Apache Ranger allows you to set up a new or an existing Apache Ranger server to define and manage fine-grained authorization policies for users to access databases, tables, and columns of Amazon S3 data via Hive Metastore. Apache Ranger is an open-source tool to enable, monitor, and manage comprehensive data security across the Hadoop platform.
This native integration allows you to define three types of authorization policies on the Apache Ranger Policy Admin server. You can set table, column, and row level authorization for Hive, table and column level authorization for Spark, and prefix and object level authorization for Amazon S3. Amazon EMR automatically installs and configures the corresponding Apache Ranger plugins on the cluster. These Ranger plugins sync up with the Policy Admin server for authorization polices, enforce data access control, and send auditing events to Amazon CloudWatch Logs.
- Amazon EMR User Role Mapper allows you to leverage AWS IAM permissions to manage accesses to AWS resources. You can create mappings between users (or groups) and custom IAM roles. A user or group can only access the data permitted by the custom IAM role. This feature is currently available through AWS Labs.
Regions and Availability Zones
Q: How does Amazon EMR make use of Availability Zones?
Amazon EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone. Running a cluster in the same zone improves performance of the jobs flows. By default, Amazon EMR chooses the Availability Zone with the most available resources in which to run your cluster. However, you can specify another Availability Zone if required. You also have the option to optimize your allocation for lowest-priced on demand instances, optimal spot capacity, or use On-Demand Capacity Reservations.
Q: In what Regions is this Amazon EMR available?
For a list of the supported Amazon EMR AWS regions, please visit the AWS Region Table for all AWS global infrastructure.
Q: Is Amazon EMR supported in AWS Local Zones?
EMR supports launching clusters in the Los Angeles AWS Local Zone. You can use EMR in the US West (Oregon) region to launch clusters into subnets associated with the Los Angeles AWS Local Zone.
Q: Which Region should I select to run my clusters?
When creating a cluster, typically you should select the Region where your data is located.
Q: Can I use EU data in a cluster running in the US region and vice versa?
Yes, you can. If you transfer data from one region to the other you will be charged bandwidth charges. For bandwidth pricing information, please visit the pricing section on the EC2 detail page.
Q: What is different about the AWS GovCloud (US) region?
The AWS GovCloud (US) region is designed for US government agencies and customers. It adheres to US ITAR requirements. In GovCloud, EMR does not support spot instances or the enable-debugging feature. The EMR Management Console is not yet available in GovCloud.
Amazon EMR on Amazon EC2
Q: What is an Amazon EMR Cluster?
A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node and has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop. Every cluster has a unique identifier that starts with "j-".
Q: What are node types in a cluster?
An Amazon EMR cluster has three types of nodes:
- Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.
- Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
- Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
Q: What is a cluster step?
A cluster step is a user-defined unit of processing, mapping roughly to one algorithm that manipulates the data. A step is a Hadoop MapReduce application implemented as a Java jar or a streaming program written in Java, Ruby, Perl, Python, PHP, R, or C++. For example, to count the frequency with which words appear in a document, and output them sorted by the count, the first step would be a MapReduce application which counts the occurrences of each word, and the second step would be a MapReduce application which sorts the output from the first step based on the counts.
Q: What are different cluster states?
STARTING – The cluster starts by configuring EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING - The cluster is in the process of shutting down.
TERMINATED - The cluster was shut down without error.
TERMINATED_WITH_ERRORS - The cluster was shut down with errors.
Q: What are different step states?
PENDING – The step is waiting to be run.
RUNNING – The step is currently running.
COMPLETED – The step completed successfully.
CANCELLED – The step was cancelled before running because an earlier step failed or cluster was terminated before it could run.
FAILED – The step failed while running.
Launching a cluster
Q: How can I launch a cluster?
You can launch a cluster through the AWS Management Console by filling out a simple cluster request form. In the request form, you specify the name of your cluster, the location in Amazon S3 of your input data, your processing application, your desired data output location, and the number and type of Amazon EC2 instances you’d like to use. Optionally, you can specify a location to store your cluster log files and SSH Key to login to your cluster while it is running. Alternatively, you can launch a cluster using the RunJobFlow API or using the ‘create’ command in the Command Line Tools. For launching a cluster with EMR Studio, refer to the EMR Studio section above..
Q: How can I terminate a cluster?
At any time, you can terminate a cluster via the AWS Management Console by selecting a cluster and clicking the “Terminate” button. Alternatively, you can use the TerminateJobFlows API. If you terminate a running cluster, any results that have not been persisted to Amazon S3 will be lost and all Amazon EC2 instances will be shut down.
Q: Does Amazon EMR support multiple simultaneous cluster?
You can start as many clusters as you like. When you get started, you are limited to 20 instances across all your clusters. If you need more instances, complete the Amazon EC2 instance request form. Once your Amazon EC2 limit is raised, the new limit will be automatically applied to your Amazon EMR clusters.
Managing a cluster
Q: How does Amazon EMR use Amazon EC2 and Amazon S3?
You can upload your input data and a data processing application into Amazon S3. Amazon EMR then launches a number of Amazon EC2 instances that you specified. The service begins the cluster execution while pulling the input data from Amazon S3 using S3 URI scheme into the launched Amazon EC2 instances. Once the cluster is finished, Amazon EMR transfers the output data to Amazon S3, where you can then retrieve it or use as input in another cluster.
Q: How is a computation done in Amazon EMR?
Amazon EMR uses the Hadoop data processing engine to conduct computations implemented in the MapReduce programming model. The customer implements their algorithm in terms of map() and reduce() functions. The service starts a customer-specified number of Amazon EC2 instances, comprised of one master and multiple other nodes. Amazon EMR runs Hadoop software on these instances. The master node divides input data into blocks, and distributes the processing of the blocks to the other nodes. Each node then runs the map function on the data it has been allocated, generating intermediate data. The intermediate data is then sorted and partitioned and sent to processes which apply the reducer function to it locally on the nodes. Finally, the output from the reducer tasks is collected in files. A single “cluster” may involve a sequence of such MapReduce steps.
Q: Which Amazon EC2 instance types does Amazon EMR support?
See the EMR pricing page for details on latest available instance types and pricing per region.
Q: How long will it take to run my cluster?
The time to run your cluster will depend on several factors including the type of your cluster, the amount of input data, and the number and type of Amazon EC2 instances you choose for your cluster.
Q: If the master node in a cluster goes down, can Amazon EMR recover it?
Yes. You can launch an EMR cluster (version 5.23 or later) with three master nodes and support high availability of applications like YARN Resource Manager, HDFS Name Node, Spark, Hive, and Ganglia. Amazon EMR automatically fails over to a standby master node if the primary master node fails or if critical processes, like Resource Manager or Name Node, crash. Since the master node is not a potential single point of failure, you can run your long-lived EMR clusters without interruption. In the event of a failover, Amazon EMR automatically replaces the failed master node with a new master node with the same configuration and boot-strap actions.
Q: If another node goes down in a cluster, can Amazon EMR recover from it?
Yes. Amazon EMR is fault tolerant for node failures and continues job execution if a node goes down. Amazon EMR will also provision a new node when a core node fails. However, Amazon EMR will not replace nodes if all nodes in the cluster are lost.
Q: Can I SSH onto my cluster nodes?
Yes. You can SSH onto your cluster nodes and execute Hadoop commands directly from there. If you need to SSH into a specific node, you have to first SSH to the master node, and then SSH into the desired node.
Q: What is Amazon EMR Bootstrap Actions?
Bootstrap Actions is a feature in Amazon EMR that provides users a way to run custom set-up prior to the execution of their cluster. Bootstrap Actions can be used to install software or configure instances before running your cluster. You can read more about bootstrap actions in EMR's Developer Guide.
Q: How can I use Bootstrap Actions?
You can write a Bootstrap Action script in any language already installed on the cluster instance including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a cluster. Please refer to the Developer Guide for details on how to use Bootstrap Actions.
Q: How do I configure Hadoop settings for my cluster?
The EMR default Hadoop configuration is appropriate for most workloads. However, based on your cluster’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your cluster tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your cluster on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
Q: Can I modify the number of nodes in a running cluster?
Yes. Nodes can be of two types: (1) core nodes, which both host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks and (2) task nodes, which only run Hadoop tasks. While a cluster is running you may increase the number of core nodes and you may either increase or decrease the number of task nodes. This can be done through the API, Java SDK, or though the command line client. Please refer to the Resizing Running clusters section in the Developer’s Guide for details on how to modify the size of your running cluster. You can also use EMR Managed Scaling.
Q: When would I want to use core nodes versus task nodes?
As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your cluster completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis. You can launch task instance fleets on Spot Instances to increase capacity while minimizing costs.
Q: Why would I want to modify the number of nodes in my running cluster?
There are several scenarios where you may want to modify the number of nodes in a running cluster. If your cluster is running slower than expected, or timing requirements change, you can increase the number of core nodes to increase cluster performance. If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your cluster’s varying capacity requirements. You can also used EMR Managed Scaling.
Q: Can I automatically modify the number of nodes between cluster steps?
Yes. You may include a predefined step in your workflow that automatically resizes a cluster between steps that are known to have different capacity needs. As all steps are guaranteed to run sequentially, this allows you to set the number of nodes that will execute a given cluster step.
Q: How can I allow other IAM users to access my cluster?
To create a new cluster that is visible to all IAM users within the EMR CLI: Add the --visible-to-all-users flag when you create the cluster. For example: elastic-mapreduce --create --visible-to-all-users. Within the Management Console, simply select “Visible to all IAM Users” on the Advanced Options pane of the Create cluster Wizard.
To make an existing cluster visible to all IAM users you must use the EMR CLI. Use --set-visible-to-all-users and specify the cluster identifier. For example: elastic-mapreduce --set-visible-to-all-users true --jobflow j-xxxxxxx. This can only be done by the creator of the cluster.
To learn more, see the Configuring User Permissions section of the EMR Developer Guide.
Tagging a cluster
Q: What Amazon EMR resources can I tag?
You can add tags to an active Amazon EMR cluster. An Amazon EMR cluster consists of Amazon EC2 instances, and a tag added to an Amazon EMR cluster will be propagated to each active Amazon EC2 instance in that cluster. You cannot add, edit, or remove tags from terminated clusters or terminated Amazon EC2 instances which were part of an active cluster.
Q: Does Amazon EMR tagging support resource-based permissions with IAM Users?
No, Amazon EMR does not support resource-based permissions by tag. However, it is important to note that propagated tags to Amazon EC2 instances behave as normal Amazon EC2 tags. Therefore, an IAM Policy for Amazon EC2 will act on tags propagated from Amazon EMR if they match conditions in that policy.
Q: How many tags can I add to a resource?
You can add up to ten tags on an Amazon EMR cluster.
Q: Do my Amazon EMR tags on a cluster show up on each Amazon EC2 instance in that cluster? If I remove a tag on my Amazon EMR cluster, will that tag automatically be removed from each associated EC2 instance?
Yes, Amazon EMR propagates the tags added to a cluster to that cluster's underlying EC2 instances. If you add a tag to an Amazon EMR cluster, it will also appear on the related Amazon EC2 instances. Likewise, if you remove a tag from an Amazon EMR cluster, it will also be removed from its associated Amazon EC2 instances. However, if you are using IAM policies for Amazon EC2 and plan to use Amazon EMR's tagging functionality, you should make sure that permission to use the Amazon EC2 tagging APIs CreateTags and DeleteTags is granted.
Q: How do I get my tags to show up in my billing statement to segment costs?
Select the tags you would like to use in your AWS billing report here. Then, to see the cost of your combined resources, you can organize your billing information based on resources that have the same tag key values.
Q: How do I tell which Amazon EC2 instances are part of an Amazon EMR cluster?
An Amazon EC2 instance associated with an Amazon EMR cluster will have two system tags:
- Key = instance-group role ; Value = [CORE or TASK];
- Key = job-flow-id ; Value = [JobFlowID]
Q: Can I edit tags directly on the Amazon EC2 instances?
Yes, you can add or remove tags directly on Amazon EC2 instances that are part of an Amazon EMR cluster. However, we do not recommend doing this, because Amazon EMR’s tagging system will not sync the changes you make to an associated Amazon EC2 instance directly. We recommend that tags for Amazon EMR clusters be added and removed from the Amazon EMR console, CLI, or API to ensure that the cluster and its associated Amazon EC2 instances have the correct tags.
Q: What is Amazon EMR Serverless?
Amazon EMR Serverless is a new deployment option in Amazon EMR that allows you to run big data frameworks such as Apache Spark and Apache Hive without configuring, managing, and scaling clusters.
Q: Who can use EMR Serverless?
Data engineers, analysts, and scientists can use EMR Serverless to build applications using open-source frameworks such as Apache Spark and Apache Hive. They can use these frameworks to transform data, run interactive SQL queries, and machine learning workloads.
Q: How do I get started with EMR Serverless?
You can use EMR Studio, AWS CLI, or APIs to submit jobs, track job status, and build your data pipelines to run on EMR Serverless. To get started with EMR Studio, sign into the AWS Management Console, navigate to Amazon EMR under the Analytics category, and select Amazon EMR Serverless. Follow the directions in the AWS Management Console, navigate to Amazon EMR under the Analytics category, and select Amazon EMR Serverless. Follow the directions in the Getting started guide to create your EMR Serverless application and submit jobs. You can refer to the Interacting with your application on the AWS CLI page to launch your applications and submit jobs using CLI. You can also find EMR Serverless examples and sample code in our GitHub repository.
Q: What open-source frameworks does EMR Serverless support?
EMR Serverless currently supports Apache Spark and Apache Hive engines. If you want support for additional frameworks such as Apache Presto or Apache Flink, please send a request to firstname.lastname@example.org.
Q: In what Regions is EMR Serverless available?
At launch, EMR Serverless is available in the US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) Regions.
Q: What is the difference between Amazon EMR Serverless, Amazon EMR on EC2, Amazon EMR on AWS Outposts, and Amazon EMR on EKS?
Amazon EMR provides the option to run applications on EC2 based clusters, EKS clusters, Outposts, or Serverless. EMR on EC2 clusters are suitable for customers who need maximum control and flexibility over running their application. With EMR on EC2 clusters, customers can choose the EC2 instance type to meet application-specific performance needs, customize the Linux AMI, customize the EC2 instance configuration, customize and extend open-source frameworks, and install additional custom software on cluster instances. Amazon EMR on EKS is suitable for customers who want to standardize on EKS to manage clusters across applications or use different versions of an open-source framework on the same cluster. Amazon EMR on AWS Outposts is for customers who want to run EMR closer to their data center, within an Outpost. EMR Serverless is suitable for customers who want to avoid managing and operating clusters and prefer to run applications using open-source frameworks.
Q: What are some of the feature differences between EMR Serverless and Amazon EMR on EC2?
Amazon EMR on EKS
Resilience to Availability Zone failures
Automatically scale resources up and down as needed
Encryption for data at rest
Support for Apache Hudi and Apache Iceberg
Integration with Apache Ranger for table- and column-level permission control
Customize operating system images
Customize open-source framework installed
Customize and load additional libraries and dependencies
Run workloads from SageMaker Studio as part of machine learning (ML) workflow
Connect to self-hosted Jupyter Notebooks
Build and orchestrate pipelines using Apache Airflow and Amazon Managed Workflows for Apache Airflow (MWAA)
Build and orchestrate pipelines using AWS Step Functions
Q: What EMR releases are supported in EMR Serverless?
EMR Serverless supports EMR release labels 6.6 and above. With EMR Serverless, you get the same performance-optimized EMR runtime available in other EMR deployment options, which is 100% API-compatible with standard open-source frameworks.
Applications, workers, and jobs
Q: What is an application and how can I create it?
With Amazon EMR Serverless, you can create one or more EMR Serverless applications that use open-source analytics frameworks. To create an application, you must specify the following attributes: 1) the Amazon EMR release version for the open-source framework version you want to use and 2) the specific analytics engines that you want your application to use, such as Apache Spark 3.1 or Apache Hive 3.0. After you create an application, you can start running your data processing jobs or interactive requests to your application.
Q: What is a worker?
An EMR Serverless application internally uses workers to execute your workloads. When a job is submitted, EMR Serverless computes the resources needed for the job and schedules workers. EMR Serverless breaks down your workloads into tasks, provisions and sets up workers with the open-source framework, and decommissions them when the job completes. EMR Serverless automatically scales workers up or down depending on the workload and parallelism required at every stage of the job, thereby removing the need for you to estimate the number of workers required to run your workloads. The default size of these workers is based on your application type and Amazon EMR release version. You can override these sizes when scheduling a job run.
Q: Can I specify the minimum and maximum number of workers that my jobs can use?
With EMR Serverless, you can specify the minimum and maximum number of concurrent workers and the vCPU and memory configuration for workers. You can also set the maximum capacity limits on the application’s resources to control costs.
Q: When should I create multiple applications?
Consider creating multiple applications when doing any of the following:
- Using different open-source frameworks
- Using different versions of open-source frameworks for different use cases
- Performing A/B testing when upgrading from one version to another
- Maintaining separate logical environments for test and production scenarios
- Providing separate logical environments for different teams with independent cost controls and usage tracking
- Separating different lines-of-business applications
Q: Can I change default properties of an EMR Serverless application after it is created?
Yes, you can modify application properties such as initial capacity, maximum capacity limits, and network configuration using EMR Studio or the update-application API/CLI call.
Q: When should I create an application with a pre-initialized pool of workers?
An EMR Serverless application without pre-initialized workers takes up to 120 seconds to determine the required resources and provision them. EMR Serverless provides an optional feature that keeps workers initialized and ready to respond in seconds, effectively creating an on-call pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initial-capacity parameter of an application.
Pre-initialized capacity allows jobs to start immediately, making it ideal for implementing time-sensitive jobs. You can specify the number of workers that you want to pre-initialize when you start an EMR Serverless application. Subsequently, when users submit jobs, the pre-initialized workers can be used to immediately start the jobs. If the job requires more workers than what you have chosen to pre-initialize, EMR Serverless automatically adds more workers (up to the maximum concurrent limit that you specify). After the job finishes, EMR Serverless automatically reverts back to maintaining the pre-initialized workers that you specified. Workers automatically shut down if they have been idle for 15 minutes. You can change the default idle timeout for your application using the updateApplication API or EMR Studio.
Q: How do I submit and manage jobs on EMR Serverless?
You can submit and manage EMR Serverless jobs using EMR Studio, SDK/CLI, or our Apache Airflow connectors.
Q: How can I include dependencies with jobs that I want to run on EMR Serverless?
For PySpark, you can package your Python dependencies using virtualenv and pass the archive file using the --archives option, which enables your workers to use the dependencies during the job run. For Scala or Java, you can package your dependencies as jars, upload them to Amazon S3, and pass them using the --jars or --packages options with your EMR Serverless job run.
Q: Do EMR Serverless Spark and Hive applications support user-defined functions (UDFs)?
EMR Serverless supports Java-based UDFs. You can package them as jars, upload them to S3, and use them in your Spark or HiveQL scripts.
Q: What worker configurations does EMR Serverless support?
Refer to the Supported Worker Configuration for details.
Q: Can I cancel an EMR Serverless job in case it is running longer than expected?
Yes, you can cancel a running EMR Serverless job from EMR Studio or by calling the cancelJobRun API/CLI.
Q: Can I add extra storage to the workers?
EMR Serverless comes with 20 GB of ephemeral storage on each worker. If you need more storage, you can customize this during job submission from 20 GB up to 200 GB.
Q: Where can I find code samples?
You can find EMR Serverless code samples in our GitHub repository.
Monitoring and debugging
Q: How do I monitor Amazon EMR Serverless applications and job runs?
Amazon EMR Serverless application- and job-level metrics are published every 30 seconds to Amazon CloudWatch.
Q: How do I launch Spark UI and Tez UI with EMR Serverless?
From EMR Studio, you can select a running or completed EMR Serverless job and then click on the Spark UI or Tez UI button to launch them.
Security and data control
Q: Can I access resources in my Amazon Virtual Private Cloud (VPC)?
Yes, you can configure Amazon EMR Serverless applications to access resources in your own VPC. See the Configuring VPC access section in the documentation to learn more.
Q: What kind of isolation can I get with an EMR Serverless application?
Each EMR Serverless application is isolated from other applications and runs on a secure Amazon VPC.
Q: How does Amazon EMR Serverless help save costs on big data deployments?
There are three ways in which Amazon EMR Serverless can help you save costs. First, there is no operational overhead of managing, securing, and scaling clusters. Second, EMR Serverless automatically scales workers up at each stage of processing your job and scales them down when they’re not required. You’re charged for aggregate vCPU, memory, and storage resources used from the time a worker starts running until it stops, rounded up to the nearest second with a 1-minute minimum. For example, your job may require 10 workers for the first 10 minutes of processing the job and 50 workers for the next 5 minutes. With fine-grained automatic scaling, you incur costs for only 10 workers for 10 minutes and 50 workers for 5 minutes. As a result, you don’t have to pay for underutilized resources. Third, EMR Serverless includes the Amazon EMR performance-optimized runtime for Apache Spark and Apache Hive, and Presto. The Amazon EMR runtime is API-compatible and over twice as fast as standard open-source analytics engines, so your jobs run faster and incur fewer compute costs.
Q: Is the EMR Serverless cost comparable to Amazon EMR on EC2 Spot Instances?
It depends on your current EMR on EC2 cluster utilization. If you are running EMR clusters using EC2 On-Demand Instances, EMR Serverless will offer a lower total cost of ownership (TCO) if your current cluster utilization is less than 70%. If you are using the EC2 Savings Plans, EMR Serverless will offer a lower TCO if your current cluster utilization is less than 50%. And if you use EC2 Spot Instances, Amazon EMR on EC2 and Amazon EMR on EKS will continue to be more cost effective.
Q: Are pre-initialized workers charged even after jobs have run to completion?
Yes, if you do not stop workers after a job is complete, you will incur charges on pre-initialized workers.
Q. Who should I reach out to with questions, comments, and feature requests?
Please send us an email at email@example.com with your inquiries and valuable feedback on EMR Serverless.
Amazon EMR on Amazon EKS
Q: What is Amazon EMR on Amazon EKS?
Amazon EMR on Amazon EKS is a deployment model of Amazon EMR that enables customers to easily and cost-effectively process vast amounts of data. It utilizes hosted analytics frameworks running on the flexible Amazon EKS managed service in containers, with the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2), AWS Fargate, and Amazon Simple Storage Service (Amazon S3).
Q: Why should I use Amazon EMR on Amazon EKS?
Amazon EMR on Amazon EKS decouples the analytics job from the services and infrastructure that are processing the job by using a container-based approach. You can focus more on developing your application and less on operating the infrastructure as EMR on EKS dynamically configures the infrastructure based on the compute, memory, and application dependencies of the job. Infrastructure teams can centrally manage a common compute platform to consolidate EMR workloads with other container-based applications. Multiple teams, organizations, or business units can simultaneously and independently run their analytics processes on the shared infrastructure while maintaining isolation enabled by Amazon EKS and AWS Identity and Access Management (IAM).
Q: What are the benefits for users already running Apache Spark on Amazon EKS?
If you already run Apache Spark on Amazon EKS, you can get all of the benefits of Amazon EMR like automatic provisioning and scaling and the ability to use the latest fully managed versions of open source big data analytics frameworks. You get an optimized EMR runtime for Apache Spark with 3X faster performance than open source Apache Spark on EKS, a serverless data science experience with EMR Studio and Apache Spark UI, fine grained data access control, and support for data encryption.
Q: How does this feature relate to and work with other AWS services?
Amazon EKS provides customers with a managed experience for running Kubernetes on AWS, enabling you to add compute capacity using EKS Managed Node Groups or using AWS Fargate. Running EMR jobs on EKS can access their data on Amazon S3 while monitoring and logging can be integrated with Amazon CloudWatch. AWS Identity and Access Management (IAM) enables role based access control for both jobs and to dependent AWS services.
Q: How does Amazon EMR on Amazon EKS work?
Register your EKS cluster with Amazon EMR. Then, submit your Spark jobs to EMR using the CLI, SDK or EMR Studio. EMR requests the Kubernetes scheduler on EKS to schedule Pods. For each job that you run, EMR on EKS creates a container. The container contains Amazon Linux 2 base image with security updates, plus Apache Spark and associated dependencies to run Spark, plus your application-specific dependencies. Each Job runs in a pod. The Pod downloads this container and starts to execute it. If the container’s image has been previously deployed to the node, then a cached image is used and the download is bypassed. Sidecar containers, such as log or metric forwarders, can be deployed to the pod. The Pod terminates after the job terminates. After the job terminates, you can still debug it using Spark UI.
Q: What AWS compute services can I use with Amazon EMR on EKS?
You can use Amazon EMR for EKS with both Amazon Elastic Compute Cloud (EC2) instances to support broader customization options, or the serverless AWS Fargate service to process your analytics without having to provision or manage EC2 instances. Application availability can automatically improve by spreading your analytics jobs across multiple AWS Availability Zones (AZs).
Q: How do I get started with EMR on EKS?
To get started, register your Amazon EKS cluster with Amazon EMR. After registration, reference this registration in your job definition (that includes application dependencies and framework parameters) by submitting your workloads to EMR for execution. With EMR on EKS, you can use different open source big data analytics frameworks, versions, and configurations for analytics applications running on the same EKS cluster. For more information, refer to our documentation.
Q: Can I use the same EMR release for EMR clusters and applications running on EKS?
Yes, you can use the same EMR release for applications that run on EMR clusters and applications that run on EKS.
Q: How do I troubleshoot analytics applications?
You can use the Amazon EMR Spark UI to diagnose and troubleshoot Spark applications. For all analytics applications, EMR provides access to application details, associated logs, and metrics for up to 30 days after they have completed. Jobs can be individually configured to send logs to an Amazon S3 location or Amazon CloudWatch.
Q: Can I see EMR applications in EKS?
Yes, EMR applications show up in the EKS console as Kubernetes jobs and deployments.
Q: Can I isolate multiple jobs or applications from each other on the same EKS cluster?
Yes, Kubernetes natively provides job isolation. Additionally, each job can be configured to run with its own execution-role to limit which AWS resources the job can access.
Q: How does EMR on EKS help reduce costs?
EMR on EKS reduces cost by removing the need to run dedicated clusters. You can use a common, shared EKS cluster to run analytics applications that require different versions of open source big data analytics frameworks. You can also use the same EKS cluster to run your other containerized non-EMR applications.
Q: How do you charge for EMR on EKS?
Amazon EMR on EKS pricing is calculated based on the vCPU and memory resources requested for the pod(s) that are running your job at per minute granularity. For pricing information, visit the Amazon EMR pricing page.
Q: What are some of the differences between EMR on EKS and EMR on EC2?
EMR on EKS
EMR on EC2
Latest supported version of EMR
Multi-AZ Support for Jobs
Multi-Tenant with non-big data workloads
EMR version scope
AWS Lake Formation
Apache Ranger Integration
Custom AMI / Images
Integration with Sagemaker & Zeppelin
Y with Livy
Integration with EMR Studio
Orchestration with Apache Airflow
Orchestration with AWS Stepfunctions
Q: What are Pod Templates?
EMR on EKS enables you to use Kubernetes Pod Templates to customize where and how your job runs in the Kubernetes cluster. Kubernetes Pod Templates provide a reusable design pattern or boilerplate for declaratively expressing how a Kubernetes pod should be deployed to your EKS cluster.
Q: Why should I use Pod Templates with my EMR on EKS job?
Pod Templates can provide more control over how your jobs are scheduled in Kubernetes. For example, you can reduce cost by running Spark driver tasks on Amazon EC2 Spot instances or only allowing jobs requiring SSDs to run on SSD enabled instances. Pod Templates with EMR on EKS to enables fine-grained control of how resources are allocated and running custom containers alongside your job. Therefore, resulting in reduced cost and increased performance of your jobs.
Q: What is a Pod?
Pods are one or more containers, with shared network and storage resources, that run on a Kubernetes worker node. EMR on EKS uses pods to run your job by scheduling Spark driver and executor tasks as individual pods.
Q: What are some use-cases for Pod Templates?
You can optimize both performance and cost by using Pod Templates. For example, you can save cost by defining jobs to run on EC2 Spot instances or increase performance by scheduling them on GPU or SSD-backed EC2 instances. Customers often need fine-grained workload control in order to support multiple teams or organizations on EKS, and Pod Templates simplify running jobs on team designated node groups. In addition, you can deploy sidecar containers to run initialization code for your job or run common monitoring tools like Fluentd for log forwarding.
Q: Can I specify a different Pod Template for my Spark drivers and Spark executors?
You can, but is not required, to provide individual templates for drivers and executors. For example, you can configure nodeSelectors and tolerations to designate Spark drivers to run only on AWS EC2 On-Demand instances and Spark executors to run only on AWS Fargate instances. In your job submission, configure the spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile to reference the template’s S3 location.
Q: What template values can I specify?
You can specify both Pod Level Fields (including Volumes, Pod Affinity, Init Containers, Node Selector) and Spark Main Container level fields (including EnvFrom, Working Directory, Lifecycle, Container Volume Mounts). The full list of allowed values is provided in our documentation.
Q: Where can I find more information about Pod Templates?
Amazon EKS already supports Pod Templates and for more information about Amazon EMR on EKS’ support for Pod Templates, refer to our documentation and the Apache Spark Pod Template documentation.
Q: Why should I use Custom Images with EMR on EKS?
Without Custom Images, managing application dependencies with EMR on EKS required you to reference them at runtime from an external storage service such as Amazon S3. Now, with custom image support, you can create a self-contained docker image with the application and its’ dependent libraries. You no longer need to maintain, update or version externally stored libraries and your big data applications can be developed using the same DevOps processes that your other containerized applications are using. Just point at your image and run it.
Q: What is a custom image?
A Custom Image is an EMR on EKS provided docker image (“base image”) that contains the EMR runtime and connectors to other AWS services that you modify to include application dependencies or additional packages that your application requires. The new image can be stored in either Amazon Elastic Container Registry (ECR) or your own Docker container registry.
Q: What are some use-cases for Custom Images?
Customers can create a base image, add their corporate standard libraries, and then store it in Amazon Elastic Container Registry (Amazon ECR). Other customers can customize the image to include their application specific dependencies. The resulting immutable image can be vulnerability scanned, deployed to test and production environments. Examples of dependencies you can add include Java SDK, Python, or R libraries, you can add them to the image directly, just as with other containerized applications.
Q: What is included in the base image?
Q: When should I specify a different Custom Image for my Spark drivers and Spark executors?You can specify separate images for your Spark drivers and executors when you want include different dependencies or libraries. Removing libraries that are not required in both images can result in a smaller image size and thus reduce job start time. You can specify a single image for both drivers and executors (spark.kubernetes.container.image) or specify a different image for drivers (spark.kubernetes.driver.container.image) and executors (spark.kubernetes.executor.container.image).
Q: Where can I find more information about Custom Images?For more information about Amazon EMR on EKS’ support for Custom Images, refer to our documentation or the Apache Spark Documentation.
Q: Is there an additional charge for Custom Images?There is no charge for using the Custom Images feature.
Amazon EMR on AWS Outposts
Q: What is AWS Outposts?
AWS Outposts brings native AWS services, infrastructure, and operating models to virtually any data center, co-location space, or on-premises facility. Using EMR on Outposts, you can deploy, manage, and scale EMR clusters on-premises, just as you would in the cloud.
Q: When should I use EMR on Outposts?
If you have existing on-premises Apache Hadoop deployments and are struggling to meet capacity demands during peak utilization, you can use EMR on Outposts to augment your processing capacity without having to move data to the cloud. EMR on Outposts enables you to launch a new EMR cluster on-premises in minutes and connect to existing datasets in on-premises HDFS storage to meet this demand and maintain SLAs.
If you need to process data that needs to remain on-premises for governance, compliance, or other reasons, you can use EMR on Outposts to deploy and run applications like Apache Hadoop and Apache Spark on-premises, close to your data. This reduces the need to move large amounts of on-premises data to the cloud, reducing the overall time needed to process that data.
If you’re in the process of migrating data and Apache Hadoop workloads to the cloud and want to start using EMR before your migration is complete, you can use AWS Outposts to launch EMR clusters that connect to your existing on-premises HDFS storage. You can then gradually migrate your data to Amazon S3 as part of an evolution to a cloud architecture.
Q: What EMR versions are supported with EMR on Outposts?
The minimum supported Amazon EMR release is 5.28.0.
Q: What EMR applications are available when using Outposts?
All applications in EMR release 5.28.0 and above are supported. See our release notes for a full list of EMR applications.
Q: What EMR features are not supported with EMR on Outposts?
- EC2 Spot instances are not available in AWS Outposts. When creating a cluster, you must choose EC2 On-Demand instances.
- A subset of EC2 instance types are available in AWS Outposts. For a list of supported instance types with EMR and Outposts, please see our documentation.
- When adding Amazon EBS volumes to instances, only the General Purpose SSD (GP2) storage type is supported in AWS Outposts.
Q: Can I use EMR clusters in an Outpost to read data from my existing on-premises Apache Hadoop clusters?
Workloads running on EMR in an Outpost can read and write data in existing HDFS storage, allowing you to easily integrate with existing on-premises Apache Hadoop deployments. This gives you the ability to augment your data processing needs using EMR without the need to migrate data.
Q: Can I choose where to store my data?
When an EMR cluster is launched in an Outpost, all of the compute and data storage resources are deployed in your Outpost. Data written locally to the EMR cluster is stored on local EBS volumes in your Outpost. Tools such as Apache Hive, Apache Spark, Presto, and other EMR applications can each be configured to write data locally in an Outpost, to external file system such as an existing HDFS installation, or to Amazon S3. Using EMR on Outposts, you have full control over storing your data in Amazon S3 or locally in your Outpost.
Q: Do any EMR features need uploaded data to S3?
When launching an EMR cluster in an Outpost, you have the option to enable logging. When logging is enabled, cluster logs will be uploaded to the S3 bucket that you specify. These logs are used to simplify debugging clusters after they have been terminated. When disabled, no logs will be uploaded to S3.
Q: What happens if my Outpost is out of capacity?
When launching a cluster in an Outpost, EMR will attempt to launch the number and type of EC2 On-Demand instances you’ve requested. If there is no capacity available on the Outpost, EMR will receive an insufficient capacity notice. EMR will retry for a period of time, and if no capacity becomes available, the cluster will fail to launch. The same process applies when resizing a cluster. If there is insufficient capacity on the Outpost for the requested instance types, EMR will be unable to scale up the cluster. You can easily set up Amazon CloudWatch alerts to monitor your capacity utilization on Outposts and receive alerts when instance capacity is lower than a desired threshold.
Q: What happens if network connectivity is interrupted between my Outpost and AWS?
If network connectivity between your Outpost and its AWS Region is lost, your clusters in Outposts will continue to run, but there will be actions you will be unable to take until connectivity is restored. Until connectivity is restored, you cannot create new clusters or take new actions on existing clusters. In case of instance failures, the instance will not be automatically replaced. Also, actions such as adding steps to a running cluster, checking step execution status, and sending CloudWatch metrics and events will be delayed until connectivity is restored.
We recommend that you provide reliable and highly available network connectivity between your Outpost and the AWS Region. If network connectivity between your Outpost and its AWS Region is lost for more than a few hours, clusters that have terminate protection enabled will continue to run, and clusters that have terminate protection disabled may be terminated. If network connectivity will be impacted due to routine maintenance, we recommend proactively enabling terminate protection.
Using EBS volumes
Q: What can I do now that I could not do before?
Most EC2 instances have fixed storage capacity attached to an instance, known as an "instance store". You can now add EBS volumes to the instances in your Amazon EMR cluster, allowing you to customize the storage on an instance. The feature also allows you to run Amazon EMR clusters on EBS-Only instance families such as the M4 and C4.
Q: What are the benefits of adding EBS volumes to an instance running on Amazon EMR?
You will benefit by adding EBS volumes to an instance in the following scenarios:
- Your processing requirements are such that you need a large amount of HDFS (or local) storage that what is available today on an instance. With support for EBS volumes, you will be able to customize the storage capacity on an instance relative to the compute capacity that the instance provides. Optimizing the storage on an instance will allow you to save costs.
- You are running on an older generation instance family (such as the M1 and M2 family) and want to move to latest generation instance family but are constrained by the storage available per node on the next generation instance types. Now you can use any of the new generation instance type and add EBS volumes to optimize the storage. Internal benchmarks indicate that you can save cost and improve performance by moving from an older generation instance family (M1 or M2) to a new generation one (M4, C4 & R3). The Amazon EMR team recommends that you run your application to arrive at the right conclusion.
- You want to use or migrate to the next-generation EBS-Only M4 and C4 family.
Q: Can I persist my data on an EBS volume after a cluster is terminated?
Currently, Amazon EMR will delete volumes once the cluster is terminated. If you want to persist data outside the lifecycle of a cluster, consider using Amazon S3 as your data store.
Q: What kind of EBS volumes can I attach to an instance?
Amazon EMR allows you to use different EBS Volume Types: General Purpose SSD (GP2), Magnetic and Provisioned IOPS (SSD).
Q: What happens to the EBS volumes once I terminate my cluster?
Amazon EMR will delete the volumes once the EMR cluster is terminated.
Q: Can I use an EBS with instances that already have an instance store?
Yes, you can add EBS volumes to instances that have an instance store.
Q: Can I attach an EBS volume to a running cluster?
No, currently you can only add EBS volumes when launching a cluster.
Q: Can I snapshot volumes from a cluster?
The EBS API allows you to Snapshot a cluster. However, Amazon EMR currently does not allow you to restore from a snapshot.
Q: Can I use encrypted EBS volumes?
You can encrypt EBS root device and storage volumes using AWS KMS as your key provider. For more information, see Local Disk Encryption.
Q: What happens when I remove an attached volume from a running cluster?
Removing an attached volume from a running cluster will be treated as a node failure. Amazon EMR will replace the node and the EBS volume with each of the same.
Q: What is Apache Spark?
Apache SparkTM is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. Amazon EMR is the best place to deploy Apache Spark in the cloud, because it combines the integration and testing rigor of commercial Spark distributions with the scale, simplicity, and cost effectiveness of the cloud. It allows you to launch Spark clusters in minutes without needing to do node provisioning, cluster setup, Spark configuration, or cluster tuning. EMR features Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters. Amazon EMR runtime for Apache Spark can be over 3x faster than clusters without the EMR runtime, and has 100% API compatibility with standard Apache Spark. Learn more about Spark and Spark on Amazon EMR.
Q: What is Presto?
Presto is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. With Amazon EMR, you can launch Presto clusters in minutes without needing to do node provisioning, cluster setup, Presto configuration, or cluster tuning. EMR enables you to provision one, hundreds, or thousands of compute instances in minutes. Presto has two community projects –PrestoDB and PrestoSQL. Amazon EMR supports both projects. Learn more about Presto and Presto on Amazon EMR.
Q: What is Apache Hive?
Hive is an open source data warehouse and analytics package that runs on top of Hadoop. Hive is operated by a SQL-based language called Hive QL that allows users to structure, summarize, and query data sources stored in Amazon S3. Hive QL goes beyond standard SQL, adding first-class support for map/reduce functions and complex extensible user-defined data types like Json and Thrift. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Hive allows user extensions via user-defined functions written in Java and deployed via storage in Amazon S3. You can learn more about Apache Hive here.
Q: What can I do with Hive running on Amazon EMR?
Using Hive with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and easy to use tools available with Amazon EMR. With Amazon EMR, you can turn your Hive applications into a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence tasks.
Q: How is Hive different than traditional RDBMS systems?
Traditional RDBMS systems provide transaction semantics and ACID properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly. They provide for fast update of small amounts of data and for enforcement of referential integrity constraints. Typically they run on a single large machine and do not provide support for executing map and reduce functions on the table, nor do they typically support acting over complex user defined data types.
In contrast, Hive executes SQL-like queries using MapReduce. Consequently, it is optimized for doing full table scans while running on a cluster of machines and is therefore able to process very large amounts of data. Hive provides partitioned tables, which allow it to scan a partition of a table rather than the whole table if that is appropriate for the query it is executing.
Traditional RDMS systems are best for when transactional semantics and referential integrity are required and frequent small updates are performed. Hive is best for offline reporting, transformation, and analysis of large data sets; for example, performing click stream analysis of a large website or collection of websites.
One of the common practices is to export data from RDBMS systems into Amazon S3 where offline analysis can be performed using Amazon EMR clusters running Hive.
Q: How can I get started with Hive running on Amazon EMR?
The best place to start is to review our written documentation located here.
Q: Are there new features in Hive specific to Amazon EMR?
Yes. Refer to our documentation for further details:
- You can launch an EMR cluster with multiple master nodes to support high availability for Apache Hive. Amazon EMR automatically fails over to a standby master node if the primary master node fails or if critical processes, like Resource Manager or Name Node, crash. This means that you can run Apache Hive on EMR clusters without interruption.
- Amazon EMR allows you to define EMR Managed Scaling for Apache Hive clusters to help you optimize your resource usage. With EMR Managed Scaling, you can automatically resize your cluster for best performance at the lowest possible cost. With EMR Managed Scaling you specify the minimum and maximum compute limits for your clusters and Amazon EMR automatically resizes them for best performance and resource utilization. EMR Managed Scaling continuously samples key metrics associated with the workloads running on clusters.
- Amazon EMR 6.0.0 adds support for Hive LLAP, providing an average performance speedup of 2x over EMR 5.29. You can learn more here.
- You can now use S3 Select with Hive on Amazon EMR to improve performance. S3 Select allows applications to retrieve only a subset of data from an object, which reduces the amount of data transferred between Amazon EMR and Amazon S3.
- Amazon EMR also enables fast performance on complex Apache Hive queries. EMR uses Apache Tez by default, which is significantly faster than Apache MapReduce. Apache MapReduce uses multiple phases, so a complex Apache Hive query would get broken down into four or five jobs. Apache Tez is designed for more complex queries, so that same job on Apache Tez would run in one job, making it significantly faster than Apache MapReduce.
- With Amazon EMR, you have the option to leave the metastore as local or externalize it. EMR provides integration with the AWS Glue Data Catalog, Amazon Aurora, Amazon RDS and AWS Lake Formation. Amazon EMR can pull information directly from Glue or Lake Formation to populate the metastore.
- You can load table partitions automatically from Amazon S3. Previously, to import a partitioned table you needed a separate alter table statement for each individual partition in the table. Amazon EMR a now includes a new statement type for the Hive language: “alter table recover partitions.” This statement allows you to easily import tables concurrently into many clusters without having to maintain a shared meta-data store. Use this functionality to read from tables into which external processes are depositing data, for example log files.
- You can write data directly to Amazon S3. When writing data to tables in Amazon S3, the version of Hive installed in Amazon EMR writes directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement but it means that HDFS and S3 from a Hive perspective behave differently. You cannot read and write within the same statement to the same table if that table is located in Amazon S3. If you want to update a table located in S3, then create a temporary table in the cluster’s local HDFS filesystem, write the results to that table, and then copy them to Amazon S3.
- You can access resources located in Amazon S3. The version of Hive installed in Amazon EMR allows you to reference resources such as scripts for custom map and reduce operations or additional libraries located in Amazon S3 directly from within your Hive script (e.g., add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar).
Q: What types of Hive clusters are supported?
There are two types of clusters supported with Hive: interactive and batch. In an interactive mode a customer can start a cluster and run Hive scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Hive script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.
Q: How can I launch a Hive cluster?
Both batch and interactive clusters can be started from AWS Management Console, EMR command line client, or APIs. Please refer to the Hive section in the Release Guide for more details on launching a Hive cluster.
Q: When should I use Hive vs. PIG?
Hive and PIG both provide high level data-processing languages with support for complex data types for operating on large datasets. The Hive language is a variant of SQL and so is more accessible to people already familiar with SQL and relational databases. Hive has support for partitioned tables which allow Amazon EMR clusters to pull down only the table partition relevant to the query being executed rather than doing a full table scan. Both PIG and Hive have query plan optimization. PIG is able to optimize across an entire scripts while Hive queries are optimized at the statement level.
Ultimately the choice of whether to use Hive or PIG will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.
Q: What version of Hive does Amazon EMR support?
For latest version of Hive on Amazon EMR, please refer to documentation.
Q: Can I write to a table from two clusters concurrently
No. Hive does not support concurrently writing to tables. You should avoid concurrently writing to the same table or reading from a table while you are writing to it. Hive has non-deterministic behavior when reading and writing at the same time or writing and writing at the same time.
Q: Can I share data between clusters?
Yes. You can read data in Amazon S3 within a Hive script by having ‘create external table’ statements at the top of your script. You need a create table statement for each external resource that you access.
Q: Should I run one large cluster, and share it amongst many users or many smaller clusters?
Amazon EMR provides a unique capability for you to use both methods. On the one hand one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad-hoc querying or workloads that vary with time, you may choose to create several separate cluster tuned to the specific task sharing data sources stored in Amazon S3. You can use EMR Managed Scaling to optimize resource usage.
Q: Can I access a script or jar resource which is on my local file system?
No. You must upload the script or jar to Amazon S3 or to the cluster’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.
Q: Can I run a persistent cluster executing multiple Hive queries?
Yes. You run a cluster in a manual termination mode so it will not terminate between Hive steps. To reduce the risk of data loss we recommend periodically persisting all of your important data in Amazon S3. It is good practice to regularly transfer your work to a new cluster to test your process for recovering from master node failure.
Q: Can multiple users execute Hive steps on the same source data?
Yes. Hive scripts executed by multiple users on separate clusters may contain create external table statements to concurrently import source data residing in Amazon S3.
Q: Can multiple users run queries on the same cluster?
Yes. In the batch mode, steps are serialized. Multiple users can add Hive steps to the same cluster; however, the steps will be executed serially. In interactive mode, several users can be logged on to the same cluster and execute Hive statements concurrently.
Q: Can data be shared between multiple AWS users?
Yes. Data can be shared using standard Amazon S3 sharing mechanism described here.
Q: Does Hive support access from JDBC?
Yes. Hive provides JDBC drive, which can be used to programmatically execute Hive statements. To start a JDBC service in your cluster you need to pass an optional parameter in the Amazon EMR command line client. You also need to establish an SSH tunnel because the security group does not permit external connections.
Q: What is your procedure for updating packages on EMR AMIs?
On first boot, the Amazon Linux AMIs for EMR connect to the Amazon Linux AMI yum repositories to install security updates. When you use a custom AMI, you can disable this feature, but we don’t recommend this for security reasons.
Q: Can I update my own packages on EMR clusters?
Yes. You can use Bootstrap Actions to install updates to packages on your clusters.
Q: Can I process DynamoDB data using Hive?
Yes. Simply define an external Hive table based on your DynamoDB table. You can then use Hive to analyze the data stored in DynamoDB and either load the results back into DynamoDB or archive them in Amazon S3. For more information please visit our Developer Guide.
Q: What is Apache Hudi?
Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. Apache Hudi enables you to manage data at the record-level in Amazon S3 to simplify Change Data Capture (CDC) and streaming data ingestion, and provides a framework to handle data privacy use cases requiring record level updates and deletes. Data sets managed by Apache Hudi are stored in S3 using open storage formats, and integrations with Presto, Apache Hive, Apache Spark, and AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.
Q: When should I use Apache Hudi?
Apache Hudi helps you with uses cases requiring record-level data management on S3. There are five common use-cases that benefit from these abilities:
- Complying with data privacy laws that require organizations to remove user data, or update user preferences when users choose to change their preferences as to how their data can be used. Apache Hudi gives you the ability to perform record-level insert, update, and delete operations on your data stored in S3, using open source data formats such as Apache Parquet, and Apache Avro.
- Consuming real time data streams and applying change data capture logs from enterprise systems. Many organizations require Enterprise Data Warehouses (EDW) and Operational Data Stores (ODS) data to be available in Amazon S3 so it’s accessible to SQL engines like Apache Hive and Presto for data processing and analytics. Apache Hudi simplifies applying change logs, and gives users near real-time access to data.
- Reinstating late arriving, or incorrect data. Late arriving, or incorrect data requires the data to be restated, and existing data sets updated to incorporate new, or updated records. Apache Hudi allows you to “upsert” records into an existing data set, relying on the framework to insert or update records based on their presence in the data set.
- Tracking change to data sets and providing the ability to rollback changes. With Apache Hudi, each change to a data set is tracked as a commit, and can be easily rolled back, allowing you to find specific changes to a data set and “undo” them.
- Simplifying file management on S3. To make sure data files are efficiently sized, customers have to build custom solutions that monitor and re-write many small files into fewer large files. With Apache Hudi, data files on S3 are managed, and users can simply configure an optimal file size to store their data and Hudi will merge files to create efficiently sized files.
- Writing deltas to a target Hudi dataset. Hudi datasets can be pulled incrementally, which means you can get ALL and ONLY the updated & new rows since a specified instant time.
Q: How do I create an Apache Hudi data set?
Apache Hudi data sets are created using Apache Spark. Creating a data set is as simple as writing an Apache Spark DataFrame. The metadata for Apache Hudi data sets can optionally be stored in the AWS Glue Data Catalog or the Hive metastore to simplify data discovery and for integrating with Apache Hive and Presto.
Q: How does Apache Hudi manage data sets?
When creating a data set with Apache Hudi, you can choose what type of data access pattern the data set should be optimized for. For read-heavy use cases, you can choose the “Copy on Write” data management strategy to optimize for frequent reads of the data set. This strategy organizes data using columnar storage formats, and merges existing data and new updates when the updates are written. For write-heavy workloads, Apache Hudi uses the “Merge on Read” data management strategy which organizes data using a combination of columnar and row storage formats, where updates are appended to a file in row based storage format, while the merge is performed at read time to provide the updated results.
Q: How do I write to an Apache Hudi data set?
Changes to Apache Hudi data sets are made using Apache Spark. With Apache Spark, Apache Hudi data sets are operated on using the Spark DataSource API, enabling you to read and write data. DataFrame containing newly added data or updates to existing data can be written using the same DataSource API". You can also use the Hudi DeltaStreamer utility.
Q: How do I read from an Apache Hudi data set?
You can read data using either Apache Spark, Apache Hive, Presto, Amazon Redshift Spectrum or Amazon Athena. When you create a data set, you have the option to publish the metadata of that data set in either the AWS Glue Data Catalog, or the Hive metastore. If you choose to publish the metadata in a metastore, your data set will look just like an ordinary table, and you can query that table using Apache Hive and Presto.
Q: What considerations or limitations should I be aware of when using Apache Hudi?
For a full list of consideration and limitations when using Apache Hudi on Amazon EMR, please refer to our Amazon EMR documentation.
Q: How does my existing data work with Apache Hudi?
If you have existing data that you want to now manage with Apache Hudi, you can easily convert your Apache Parquet data to Apache Hudi data sets using an import tool provided with Apache Hudi on Amazon EMR, or you can use Hudi DeltaStreamer utility, or Apache Spark to rewrite your existing data as an Apache Hudi data set.
Q: What is Impala?
Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS). With this architecture, you can query your data in HDFS or HBase tables very quickly, and leverage Hadoop’s ability to process diverse data types and provide schema at runtime. This lends Impala to interactive, low-latency analytics. In addition, Impala uses the Hive metastore to hold information about the input data, including the partition names and data types. Also, Impala on Amazon EMR requires AMIs running Hadoop 2.x or greater. Click here to learn more about Impala.
Q: What can I do with Impala running on Amazon EMR?
Similar to using Hive with Amazon EMR, leveraging Impala with Amazon EMR can implement sophisticated data-processing applications with SQL syntax. However, Impala is built to perform faster in certain use cases (see below). With Amazon EMR, you can use Impala as a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence. Here are three use cases:
- Use Impala instead of Hive on long-running clusters to perform ad hoc queries. Impala reduces interactive queries to seconds, making it an excellent tool for fast investigation. You could run Impala on the same cluster as your batch MapReduce workflows, use Impala on a long-running analytics cluster with Hive and Pig, or create a cluster specifically tuned for Impala queries.
- Use Impala instead of Hive for batch ETL jobs on transient Amazon EMR clusters. Impala is faster than Hive for many queries, which provides better performance for these workloads. Like Hive, Impala uses SQL, so queries can easily be modified from Hive to Impala.
- Use Impala in conjunction with a third party business intelligence tool. Connect a client ODBC or JDBC driver with your cluster to use Impala as an engine for powerful visualization tools and dashboards.
Both batch and interactive Impala clusters can be created in Amazon EMR. For instance, you can have a long-running Amazon EMR cluster running Impala for ad hoc, interactive querying or use transient Impala clusters for quick ETL workflows.
Q: How is Impala different than traditional RDBMSs?
Traditional relational database systems provide transaction semantics and database atomicity, consistency, isolation, and durability (ACID) properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly, provide for fast updates of small amounts of data, and for enforcement of referential integrity constraints. Typically, they run on a single large machine and do not provide support for acting over complex user defined data types. Impala uses a similar distributed query system to that found in RDBMSs, but queries data stored in HDFS and uses the Hive metastore to hold information about the input data. As with Hive, the schema for a query is provided at runtime, allowing for easier schema changes. Also, Impala can query a variety of complex data types and execute user defined functions. However, because Impala processes data in-memory, it is important to understand the hardware limitations of your cluster and optimize your queries for the best performance.
Q: How is Impala different than Hive?
Impala executes SQL queries using a massively parallel processing (MPP) engine, while Hive executes SQL queries using MapReduce. Impala avoids Hive’s overhead from creating MapReduce jobs, giving it faster query times than Hive. However, Impala uses significant memory resources and the cluster’s available memory places a constraint on how much memory any query can consume. Hive is not limited in the same way, and can successfully process larger data sets with the same hardware. Generally, you should use Impala for fast, interactive queries, while Hive is better for ETL workloads on large datasets. Impala is built for speed and is great for ad hoc investigation, but requires a significant amount of memory to execute expensive queries or process very large datasets. Because of these limitations, Hive is recommended for workloads where speed is not as crucial as completion. Click here to view some performance benchmarks between Impala and Hive.
Q: Can I use Hadoop 1?
No, Impala requires Hadoop 2, and will not run on a cluster with an AMI running Hadoop 1.x.
Q: What instance types should I use for my Impala cluster?
For the best experience with Impala, we recommend using memory-optimized instances for your cluster. However, we have shown that there are performance gains over Hive when using standard instance types as well. We suggest reading our Performance Testing and Query Optimization section in the Amazon EMR Developer’s Guide to better estimate the memory resources your cluster will need with regards to your dataset and query types. The compression type, partitions, and the actual query (number of joins, result size, etc.) all play a role in the memory required. You can use the EXPLAIN statement to estimate the memory and other resources needed for an Impala query.
Q: What happens if I run out of memory on a query?
If you run out of memory, queries fail and the Impala daemon installed on the affected node shuts down. Amazon EMR then restarts the daemon on that node so that Impala will be ready to run another query. Your data in HDFS on the node remains available, because only the daemon running on the node shuts down, rather than the entire node itself. For ad hoc analysis with Impala, the query time can often be measured in seconds; therefore, if a query fails, you can discover the problem quickly and be able to submit a new query in quick succession.
Q: Does Impala support user defined functions?
Yes, Impala supports user defined functions (UDFs). You can write Impala specific UDFs in Java or C++. Also, you can modify UDFs or user-defined aggregate functions created for Hive for use with Impala. For information about Hive UDFs, click here.
Q: Where is the data stored for Impala to query?
Impala queries data in HDFS or in HBase tables.
Q: Can I run Impala and MapReduce at the same time on a cluster?
Yes, you can set up a multitenant cluster with Impala and MapReduce. However, you should be sure to allot resources (memory, disk, and CPU) to each application using YARN on Hadoop 2.x. The resources allocated should be dependent on the needs for the jobs you plan to run on each application.
Q: Does Impala support ODBC and JDBC drivers?
While you can use ODBC drivers, Impala is also a great engine for third-party tools connected through JDBC. You can download and install the Impala client JDBC driver from http://elasticmapreduce.s3.amazonaws.com/libs/impala/1.2.1/impala-jdbc-1.2.1.zip. From the client computer where you have your business intelligence tool installed, connect the JDBC driver to the master node of an Impala cluster using SSH or a VPN on port 21050. For more information, see Open an SSH Tunnel to the Master Node.
Q: What is Apache Pig?
Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by a SQL-like language called Pig Latin, which allows users to structure, summarize, and query data sources stored in Amazon S3. As well as SQL-like operations, Pig Latin also adds first-class support for map/reduce functions and complex extensible user defined data types. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Pig allows user extensions via user-defined functions written in Java and deployed via storage in Amazon S3.
Q: What can I do with Pig running on Amazon EMR?
Using Pig with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and easy to use tools available with Amazon EMR. With Amazon EMR, you can turn your Pig applications into a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence tasks.
Q: How can I get started with Pig running on Amazon EMR?
The best place to start is to review our written documentation located here.
Q: Are there new features in Pig specific to Amazon EMR?
Yes. There are three new features which make Pig even more powerful when used with Amazon EMR, including:
a/ Accessing multiple filesystems. By default a Pig job can only access one remote file system, be it an HDFS store or S3 bucket, for input, output and temporary data. EMR has extended Pig so that any job can access as many file systems as it wishes. An advantage of this is that temporary intra-job data is always stored on the local HDFS, leading to improved performance.
b/ Loading resources from S3. EMR has extended Pig so that custom JARs and scripts can come from the S3 file system, for example “REGISTER s3:///my-bucket/piggybank.jar”
c/ Additional Piggybank function for String and DateTime processing.
Q: What types of Pig clusters are supported?
There are two types of clusters supported with Pig: interactive and batch. In an interactive mode a customer can start a cluster and run Pig scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.
Q: How can I launch a Pig cluster?
Both batch and interactive clusters can be started from AWS Management Console, EMR command line client, or APIs.
Q: What version of Pig does Amazon EMR support?
Amazon EMR supports multiple versions of Pig.
Q: Can I write to a S3 bucket from two clusters concurrently
Yes, you are able to write to the same bucket from two concurrent clusters.
Q: Can I share input data in S3 between clusters?
Yes, you are able to read the same data in S3 from two concurrent clusters.
Q: Can data be shared between multiple AWS users?
Yes. Data can be shared using standard Amazon S3 sharing mechanism described here http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3_ACLs.html
Q: Should I run one large cluster, and share it amongst many users or many smaller clusters?
Amazon EMR provides a unique capability for you to use both methods. On the one hand one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad-hoc querying or workloads that vary with time, you may choose to create several separate cluster tuned to the specific task sharing data sources stored in Amazon S3.
Q: Can I access a script or jar resource which is on my local file system?
No. You must upload the script or jar to Amazon S3 or to the cluster’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.
Q: Can I run a persistent cluster executing multiple Pig queries?
Yes. You run a cluster in a manual termination mode so it will not terminate between Pig steps. To reduce the risk of data loss we recommend periodically persisting all important data in Amazon S3. It is good practice to regularly transfer your work to a new cluster to test you process for recovering from master node failure.
Q: Does Pig support access from JDBC?
No. Pig does not support access through JDBC.
Q: What is Apache HBase?
HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of Apache Software Foundation's Hadoop project and runs on top of Hadoop Distributed File System(HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). You can learn more about Apache HBase here.
Q: Are there new features in HBase specific to Amazon EMR?
With Amazon EMR, you can you can use HBase on Amazon S3 to store a cluster's HBase root directory and metadata directly to Amazon S3 and create read replicas and snapshots. Please see our documentation to learn more.
Q: Which versions of HBase are supported on Amazon EMR?
You can look at the latest HBase versions supported on Amazon EMR here.
Q: What does EMR Connector to Kinesis enable?
The connector enables EMR to directly read and query data from Kinesis streams. You can now perform batch processing of Kinesis streams using existing Hadoop ecosystem tools such as Hive, Pig, MapReduce, Hadoop Streaming, and Cascading.
Q: What does the EMR connector to Kinesis enable that I couldn’t have done before?
Reading and processing data from a Kinesis stream would require you to write, deploy and maintain independent stream processing applications. These take time and effort. However, with this connector, you can start reading and analyzing a Kinesis stream by writing a simple Hive or Pig script. This means you can analyze Kinesis streams using SQL! Of course, other Hadoop ecosystem tools could be used as well. You don’t need to developed or maintain a new set of processing applications.
Q: Who will find this functionality useful?
The following types of users will find this integration useful:
- Hadoop users who are interested in utilizing the extensive set of Hadoop ecosystem tools to analyze Kinesis streams.
- Kinesis users who are looking for an easy way to get up and running with stream processing and ETL of Kinesis data.
- Business analysts and IT professionals who would like to perform ad-hoc analysis of data in Kinesis streams using familiar tools like SQL (via Hive) or scripting languages like Pig.
Q: What are some use cases for this integration?
The following are representative use cases are enabled by this integration:
- Streaming Log Analysis: You can analyze streaming web logs to generate a list of top 10 error type every few minutes by region, browser, and access domains.
- Complex Data Processing Workflows: You can join Kinesis stream with data stored in S3, Dynamo DB tables, and HDFS. You can write queries that join clickstream data from Kinesis with advertising campaign information stored in a DynamoDB table to identify the most effective categories of ads that are displayed on particular websites.
- Ad-hoc Queries: You can periodically load data from Kinesis into HDFS and make it available as a local Impala table for fast, interactive, analytic queries.
Q: What EMR AMI version do I need to be able to use the connector?
You need to use EMR’s AMI version 3.0.4 and later.
Q: Is this connector a stand-alone tool?
No, it is a built in component of the Amazon distribution of Hadoop and is present on EMR AMI versions 3.0.4 and later. Customer simply needs to spin up a cluster with AMI version 3.0.4 or later to start using this feature.
Q: What data format is required to allow EMR to read from a Kinesis stream?
The EMR Kinesis integration is not data format-specific. You can read data in any format. Individual Kinesis records are presented to Hadoop as standard records that can be read using any Hadoop MapReduce framework. Individual frameworks like Hive, Pig and Cascading have built in components that help with serialization and deserialization, making it easy for developers to query data from many formats without having to implement custom code. For example, in Hive users can read data from JSON files, XML files and SEQ files by specifying the appropriate Hive SerDe when they define a table. Pig has a similar component called Loadfunc/Evalfunc and Cascading has a similar component called a Tap. Hadoop users can leverage the extensive ecosystem of Hadoop adapters without having to write format-specific code. You can also implement custom deserialization formats to read domain specific data in any of these tools.
Q: How do I analyze a Kinesis stream using Hive in EMR?
Create a table that references a Kinesis stream. You can then analyze the table like any other table in Hive. Please see our tutorials for page more details.
Q: Using Hive, how do I create queries that combine Kinesis stream data with other data source?
First create a table that references a Kinesis stream. Once a Hive table has been created, you can join it with tables mapping to other data sources such as Amazon S3, Amazon Dynamo DB, and HDFS. This effectively results in joining data from Kinesis stream to other data sources.
Q: Is this integration only available for Hive?
No, you can use Hive, Pig, MapReduce, Hadoop Streaming, and Cascading.
Q: How do I setup scheduled jobs to run on a Kinesis stream?
The EMR Kinesis input connector provides features that help you configure and manage scheduled periodic jobs in traditional scheduling engines such as Cron. For example, you can develop a Hive script that runs every N minutes. In the configuration parameters for a job, you can specify a Logical Name for the job. The Logical Name is a label that will inform the EMR Kinesis input connector that individual instances of the job are members of the same periodic schedule. The Logical Name allows the process to take advantage of iterations, which are explained next.
Since MapReduce is a batch processing framework, to analyze a Kinesis stream using EMR, the continuous stream is divided in to batches. Each batch is called an Iteration. Each Iteration is assigned a number, starting with 0. Each Iteration’s boundaries are defined by a start sequence number and end sequence number. Iterations are then processed sequentially by EMR.
In the event of an attempt’s failure, the EMR Kinesis input connector will re-try the iteration within the Logical Name from the known start sequence number of the iteration. This functionality ensures that successive attempts on the same iteration will have precisely the same input records from the Kinesis stream as the previous attempts. This guarantees idempotent (consistent) processing of a Kinesis stream.
You can specify Logical Names and Iterations as runtime parameters in your respective Hadoop tools. For example, in the tutorial section “Running queries with checkpoints”, the code sample shows a scheduled Hive query that designates a Logical Name for the query and increments the iteration with each successive run of the job.
Additionally, a sample cron scheduling script is provided in the tutorials.
Q: Where is the metadata for Logical Names and Iterations stored?
The metadata that allows the EMR Kinesis input connector to work in scheduled periodic workflows is stored in Amazon DynamoDB. You must provision an Amazon Dynamo DB table and specify it as an input parameter to the Hadoop Job. It is important that you configure appropriate IOPS for the table to enable this integration. Please refer to the getting started tutorial for more information on setting up your Amazon Dynamo DB table.
Q: What happens when an iteration processing fails?
Iterations identifiers are user-provided values that map to specific boundary (start and end sequence numbers) in a Kinesis stream. Data corresponding to these boundaries is loaded in the Map phase of the MapReduce job. This phase is managed by the framework and will be automatically re-run (three times by default) in case of job failure. If all the retries fail, you would still have options to retry the processing starting from last successful data boundary or past data boundaries. This behavior is controlled by providing kinesis.checkpoint.iteration.no parameter during processing. Please refer to the getting started tutorial for more information on how this value is configured for different tools in the Hadoop ecosystem.
Q: Can I run multiple queries on the same iteration?
Yes, you can specify a previously run iteration by setting the kinesis.checkpoint.iteration.no parameter in successive processing. The implementation ensures that successive runs on the same iteration will have precisely the same input records from the Kinesis stream as the previous runs.
Q: What happens if records in an Iteration expire from the Kinesis stream?
In the event that the beginning sequence number and/or end sequence number of an iteration belong to records that have expired from the Kinesis steam, the Hadoop job will fail. You would need to use a different Logical Name to process data from the beginning of the Kinesis stream.
Q: Can I push data from EMR into Kinesis stream?
No. The EMR Kinesis connector currently does not support writing data back into a Kinesis stream.
Q: Does the EMR Hadoop input connector for Kinesis enable continuous stream processing?
The Hadoop MapReduce framework is a batch processing system. As such, it does not support continuous queries. However there is an emerging set of Hadoop ecosystem frameworks like Twitter Storm and Spark Streaming that enable to developers build applications for continuous stream processing. A Storm connector for Kinesis is available at on GitHub here and you can find a tutorial explaining how to setup Spark Streaming on EMR and run continuous queries here.
Additionally, developers can utilize the Kinesis client library to develop real-time stream processing applications. You can find more information on developing custom Kinesis applications in the Kinesis documentation here.
Q: Can I specify access credential to read a Kinesis stream that is managed in another AWS account?
Yes. You can read streams from another AWS account by specifying the appropriate access credentials of the account that owns the Kinesis stream. By default, the Kinesis connector utilizes the user-supplied access credentials that are specified when the cluster is created. You can override these credentials to access streams from other AWS Accounts by setting the kinesis.accessKey and kinesis.secretKey parameters. The following examples show how to set the kinesis.accessKey and kinesis.secretKey parameters in Hive and Pig.
Code sample for Hive:
Code sample for Pig:
raw_logs = LOAD 'AccessLogStream' USING com.amazon.emr.kinesis.pig.Kin
) AS (line:chararray);
Q: Can I run multiple parallel queries on a single Kinesis Stream? Is there a performance impact?
Yes, a customer can run multiple parallel queries on the same stream by using separate logical names for each query. However, reading from a shard within a Kinesis stream is subjected to a rate limit of 2MB/sec. Thus, if there are N parallel queries running on the same stream, each one would get roughly (2/N) MB/sec egress rate per shard on the stream. This may slow down the processing and in some cases fail the queries as well.
Q: Can I join and analyze multiple Kinesis streams in EMR?
Yes, for example in Hive, you can create two tables mapping to two different Kinesis streams and create joins between the tables.
Q: Does the EMR Kinesis connector handle Kinesis scaling events, such as merge and split events?
Yes. The implementation handles split and merge events. The Kinesis connector ties individual Kinesis shards (the logical unit of scale within a Kinesis stream) to Hadoop MapReduce map tasks. Each unique shard that exists within a stream in the logical period of an Iteration will result in exactly one map task. In the event of a shard split or merge event, Kinesis will provision new unique shard Ids. As a result, the MapReduce framework will provision more map tasks to read from Kinesis. All of this is transparent to the user.
Q: What happens if there are periods of “silence” in my stream?
The implementation allows you to configure a parameter called kinesis.nodata.timeout. For example, consider a scenario where kinesis.nodata.timeout is set to 2 minutes and you want to run a Hive query every 10 minutes. Additionally, consider some data has been written to the stream since the last iteration (10 minutes ago). However, currently no new records are arriving, i.e. there is a silence in the stream. In this case, when the current iteration of the query launches, the Kinesis connector would find that no new records are arriving. The connector will keep polling the stream for 2 minutes and if no records arrive for that interval then it will stop and process only those records that were already read in the current batch of stream. However, if new records start arriving before kinesis.nodata.timeout interval is up, then the connector will wait for an additional interval corresponding to a parameter called kinesis.iteration.timeout. Please look at the tutorials to see how to define these parameters.
Q: How do I debug a query that continues to fail in each iteration?
In the event of a processing failure, you can utilize the same tools they currently do when debugging Hadoop Jobs. Including the Amazon EMR web console, which helps identify and access error logs. More details on debugging an EMR job can be found here.
Q: What happens if I specify a DynamoDB table that I don’t have access to?
The job would fail and the exception would show up in error logs for the job.
Q: What happens if job doesn’t fail but checkpointing to DynamoDB fails?
The job would fail and the exception would show up in error logs for the job.
Q: How do I maximize the read throughput from Kinesis stream to EMR?
Throughput from Kinesis stream increases with instance size used and record size in the Kinesis stream. We recommend that you use m1.xlarge and above for both master and core nodes for this feature.
Service Level Agreement
Q: What is Amazon EMR Service Level Agreement?
Please refer to our Service Level Agreement.
Q: What does your Amazon EMR Service Level Agreement provide?
Q: What happens if you don’t meet your Service Commitment?In the event any of the Amazon EMR Services do not meet the Service Commitment, you will be eligible to receive a Service Credit.
Q: How do I know if I qualify for a Service Credit? How do I claim one?To receive Service Credits, you will need to submit a claim by opening a case in the AWS Support Center. To understand the eligibility and claim format, please see https://aws.amazon.com/emr/sla/