General Questions
-
Q: What is Amazon Elastic MapReduce?
-
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
-
Q: What can I do with Amazon Elastic MapReduce?
-
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
Amazon Elastic MapReduce is ideal for problems that necessitate the fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows, and programmatically monitor progress of running job flows. In addition, you can use the simple web interface of the AWS Management Console to launch your job flows and monitor processing-intensive computation on clusters of Amazon EC2 instances.
-
Q: Who can use Amazon Elastic MapReduce?
-
Anyone who requires simple access to powerful data analysis can use Amazon Elastic MapReduce. Customers don’t need any software development experience to experiment with several sample applications available in the Developer Guide and in our Resource Center.
-
Q: What can I do with Amazon Elastic MapReduce that I could not do before?
-
Amazon Elastic MapReduce significantly reduces the complexity of the time-consuming set-up, management. and tuning of Hadoop clusters or the compute capacity upon which they sit. You can instantly spin up large Hadoop job flows which will start processing within minutes, not hours or days. When your job flow completes, unless you specify otherwise, the service automatically tears down your instances.
Using this service you can quickly perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
As a software developer, you can also develop and run your own more sophisticated applications, allowing you to add functionality such as scheduling, workflows, monitoring, or other features.
-
Q: What is the data processing engine behind Amazon Elastic MapReduce?
-
Amazon Elastic MapReduce uses Apache Hadoop as its distributed data processing engine. Hadoop is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a programming model named “MapReduce,” where the data is divided into many small fragments of work, each of which may be executed on any node in the cluster. This framework has been widely used by developers, enterprises and startups and has proven to be a reliable software platform for processing up to petabytes of data on clusters of thousands of commodity machines.
-
Q: What is an Amazon Elastic MapReduce job flow?
-
A job flow is a collection of processing steps that Amazon Elastic MapReduce runs on a specified dataset using a set of Amazon EC2 instances. A job flow consists of one or more steps, each of which must complete in sequence successfully, for the job flow to finish.
-
Q: What is a job flow step?
-
A job flow step is a user-defined unit of processing, mapping roughly to one algorithm that manipulates the data. A step is a Hadoop MapReduce application implemented as a Java jar or a streaming program written in Java, Ruby, Perl, Python, PHP, R, or C++. For example, to count the frequency with which words appear in a document, and output them sorted by the count, the first step would be a MapReduce application which counts the occurrences of each word, and the second step would be a MapReduce application which sorts the output from the first step based on the counts.
-
Q: What are different job flow states?
-
STARTING – The job flow provisions, starts, and configures EC2 instances.
RUNNING – A step for the job flow is currently being run.
WAITING – The job flow is currently active, but has no steps to run.
SHUTTING_DOWN- The job flow is in the process of shutting down.
COMPLETED – The job flow shut down after all steps completed successfully.
FAILED – The job flow shut down after a step failed or due to an internal error.
TERMINATED – The job terminated on request of the user.
-
Q: What are different step states?
-
PENDING – The step is waiting to be run.
RUNNING – The step is currently running.
COMPLETED – The step completed successfully.
CANCELLED – The step was cancelled before running – because an earlier step failed or job flow was terminated before it could run.
FAILED – The step failed while running.
Starting a Job Flow
-
Q: How can I access Amazon Elastic MapReduce?
-
You can access Amazon Elastic MapReduce by using the AWS Management Console, Command Line Tools, or the API calls defined by the service.
-
Q: How can I launch a job flow?
-
You can launch a job flow through the AWS Management Console by filling out a simple job flow request form. In the request form, you specify the name of your job flow, the location in Amazon S3 of your input data, your processing application, your desired data output location, and the number and type of Amazon EC2 instances you’d like to use. Optionally, you can specify a location to store your job flow log files and SSH Key to login to your job flow while it is running. Alternatively, you can launch a job flow using the RunJobFlow API or using the ‘create’ command in the Command Line Tools.
-
Q: How can I get started with Amazon Elastic MapReduce?
-
To sign up for Amazon Elastic MapReduce, click the “Sign up for This Web Service” button on the Amazon Elastic MapReduce detail page http://aws.amazon.com/elasticmapreduce. You must be signed up for Amazon EC2 and Amazon S3 to access Amazon Elastic MapReduce; if you are not already signed up for these services, you will be prompted to do so during the Amazon Elastic MapReduce sign-up process. After signing up, please refer to the Amazon Elastic MapReduce documentation, which includes our Getting Started Guide – the best place to get going with the service.
-
Q: How can I terminate a job flow?
-
At any time, you can terminate a job flow via the AWS Management Console by selecting a job flow and clicking the “Terminate” button. Alternatively, you can use the TerminateJobFlows API. If you terminate a running job flow, any results that have not been persisted to Amazon S3 will be lost and all Amazon EC2 instances will be shut down.
-
Q: Does Amazon Elastic MapReduce support multiple simultaneous job flows?
-
Yes. At any time, you can begin a new job flow, even if you’re already running one or more job flows.
-
Q: How many job flows can I run simultaneously?
-
You can start as many job flows as you like. You are limited to 20 instances across all your job flows. If you need more instances, complete the Amazon EC2 instance request form and your use case and instance increase will be considered. If your Amazon EC2 limit has been already raised, the new limit will be applied to your Amazon Elastic MapReduce job flows.
Developing
-
Q: Are there any job flow examples?
-
The Amazon Elastic MapReduce Developer’s Guide provides two examples of job flows. There are also several examples in our Resource Center.
-
Q: How do I develop a data processing application?
-
You can develop a data processing job on your desktop, for example, using Eclipse or NetBeans plug-ins such as IBM MapReduce Tools for Eclipse (http://www.alphaworks.ibm.com/tech/mapreducetools) or Karmasphere Studio (http://www.karmasphere.com). These tools make it easy to develop and debug MapReduce jobs and test them locally on your machine. Additionally, you can develop your job flow directly on Amazon Elastic MapReduce using one or more instances.
-
Q: What is the benefit of using the Command Line Tools or APIs vs. AWS Management Console?
-
The Command Line Tools or APIs provide the ability to programmatically launch and monitor progress of running job flows, to create additional custom functionality around job flows (such as sequences with multiple processing steps, scheduling, workflow, or monitoring), or to build value-added tools or applications for other Amazon Elastic MapReduce customers. In contrast, the AWS Management Console provides an easy-to-use graphical interface for launching and monitoring your job flows directly from a web browser.
-
Q: Can I add steps to a job flow that is already running?
-
Yes. Once the job is running, you can optionally add more steps to it via the AddJobFlowSteps API. The AddJobFlowSteps API will add new steps to the end of the current step sequence. You may want to use this API to implement conditional logic in your job flow or for debugging.
-
Q: Can I run a persistent job flow?
-
Yes. Amazon Elastic MapReduce job flows that are started with the –alive flag will continue until explicitly terminated. This allows customers to add steps to a job flow on demand. You may want to use this to debug your job flow logic without having to repeatedly wait for job flow startup. You may also use a persistent job flow to run a long-running data warehouse cluster. This can be combined with data warehouse and analytics packages that runs on top of Hadoop such as Hive and Pig.
-
Q: Can I be notified when my job flow is finished?
-
Not at this time. You can view your job flow progress on the AWS Management Console or you can call the DescribeJobFlows API to get a status on the job flow.
-
Q: What programming languages does Amazon Elastic MapReduce support?
-
You can use Java to implement Hadoop custom jars. Alternatively, you may use other languages including Perl, Python, Ruby, C++, PHP, and R via Hadoop Streaming. Please refer to the Developer’s Guide for instructions on using Hadoop Streaming.
-
Q: What OS versions are supported with Amazon Elastic MapReduce?
-
At this time Amazon Elastic MapReduce supports Debian/Squeeze in 32 and 64 bit modes.
-
Q: Can I view the Hadoop UI while my job flow is running?
-
Yes. Please refer to the Hadoop UI section in the Developer’s Guide for instructions on how to access the Hadoop UI.
-
Q: Does Amazon Elastic MapReduce support third-party software packages?
-
Yes. The recommended way to install third-party software packages on your cluster is to use Bootstrap Actions. Alternatively you can package any third party libraries directly into your Mapper or Reducer executable. You can also upload statically compiled executables using the Hadoop distributed cache mechanism.
-
Q: Which Hadoop versions does Amazon Elastic MapReduce support?
-
Amazon Elastic MapReduce supports Hadoop 0.20.205 and Hadoop 1.0.3 with custom patches.
-
Q: Can I use a data processing engine other than Hadoop?
-
At this time, Amazon Elastic MapReduce supports Hadoop 0.20.205 and Hadoop 1.0.3. We are always listening to our customers and will work to provide additional capabilities over time as our customers ask for them.
-
Q: Does Amazon contribute Hadoop improvements to the open source community?
-
Yes. Amazon Elastic MapReduce is active with the open source community and contributes many fixes back to the Hadoop source.
-
Q: Does Amazon Elastic MapReduce update the version of Hadoop it supports?
-
Amazon Elastic MapReduce periodically updates its supported version of Hadoop based on the Hadoop releases by the community. Amazon Elastic MapReduce may choose to skip some Hadoop releases.
-
Q: How quickly does Amazon Elastic MapReduce retire support for old Hadoop versions?
-
Amazon Elastic MapReduce service retires support for old Hadoop versions several months after deprecation. However, Amazon Elastic MapReduce APIs are backward compatible, so if you build tools on top of these APIs, they will work even when Amazon Elastic MapReduce updates the Hadoop version it’s using.
Debugging
-
Q: How can I debug my job flow?
-
You first select the job flow you want to debug, then click on the “Debug” button to access the debug a job flow window in the AWS Management Console. This will enable you to track progress and identify issues in steps, jobs, tasks, or task attempts of your job flows. Alternatively you can SSH directly into the Amazon Elastic Compute Cloud (Amazon EC2) instances that are running your job flow and use your favorite command-line debugger to troubleshoot the job flow.
-
Q: What is the job flow debug window?
-
The job flow debug window is a part of the AWS Management Console where you can track progress and identify issues in steps, jobs, tasks, or task attempts of your job flows. To access the job flow debug window, first select the job flow you want to debug and then click on the “Debug” button.
-
Q: How can I enable debugging of my job flow?
-
To enable debugging you need to set “Enable Debugging” flag when you create a job flow in the AWS Management Console. Alternatively, you can pass the --enable-debugging and --log-uri flags in the Command Line Client when creating a job flow.
-
Q: Where can I find instructions on how to use the debug a job flow window?
-
Please reference the AWS Management Console section of the Developer’s Guide for instructions on how to access and use the debug a job flow window.
-
Q: What types of job flows can I debug with the debug a job flow window?
-
You can debug all types of job flows currently supported by Amazon Elastic MapReduce including custom jar, streaming, Hive, and Pig.
-
Q: Why do I have to sign-up for Amazon SimpleDB to use job flow debugging?
-
Amazon Elastic MapReduce stores state information about Hadoop jobs, tasks and task attempts under your account in Amazon SimpleDB. You can subscribe to Amazon SimpleDB here.
-
Q: Can I use the job flow debugging feature without Amazon SimpleDB subscription?
-
You will be able to browse job flow steps and step logs but will not be able to browse Hadoop jobs, tasks, or task attempts if you are not subscribed to Amazon SimpeDB.
-
Q: Can I delete historical job flow data from Amazon SimpleDB?
-
Yes. You can delete Amazon SimpleDB domains that Amazon Elastic MapReduce created on your behalf. Please reference the Amazon SimpleDB documentation for instructions.
Managing Data
-
Q: How do I get my data into Amazon S3?
-
You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.
-
Q: How do I get logs for completed job flows?
-
Hadoop system logs as well as user logs will be placed in the Amazon S3 bucket which you specify when creating a job flow.
-
Q: Do you compress logs?
-
No. At this time Amazon Elastic MapReduce does not compress logs as it moves them to Amazon S3.
-
Q: Can I load my data from the internet or somewhere other than Amazon S3?
-
Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon Elastic MapReduce also provides Hive-based access to data in DynamoDB.
Billing
-
Q: Can Amazon Elastic MapReduce estimate how long it will take to process my input data?
-
No. As each job flow and input data is different, we cannot estimate your job duration.
-
Q: How much does Amazon Elastic MapReduce cost?
-
Amazon Elastic MapReduce is available in the US, EU, and APAC Regions. As with the rest of AWS, you pay only for what you use. There is no minimum fee and there are no up-front commitments or long-term contracts. Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and Amazon S3 pricing.
For Amazon Elastic MapReduce pricing information, please visit the pricing section on the Amazon Elastic MapReduce detail page.
Amazon EC2, Amazon S3 and Amazon SimpleDB charges are billed separately. Pricing for Amazon Elastic MapReduce is per instance-hour consumed for each instance type, from the time job flow began processing until it is terminated. Each partial instance-hour consumed will be billed as a full hour. For additional details on Amazon EC2 Instance Types, Amazon EC2 Spot Pricing, Amazon EC2 Reserved Instances Pricing, Amazon S3 Pricing, or Amazon SimpleDB Pricing, follow the links below:
Amazon EC2 Instance Types
Amazon EC2 Reserved Instances Pricing
Amazon EC2 Spot Instances Pricing
Amazon S3 Pricing
Amazon SimpleDB Pricing
-
Q: When does billing of my Amazon Elastic MapReduce job flow begin and end?
-
Billing commences when Amazon Elastic MapReduce starts running your job flow. You are only charged for the resources actually consumed. For example, let’s say you launched 100 Amazon EC2 Standard Small instances for an Amazon Elastic MapReduce job flow, where the Amazon Elastic MapReduce cost is an incremental $0.015 per hour. The Amazon EC2 instances will begin booting immediately, but they won’t necessarily all start at the same moment. Amazon Elastic MapReduce will track when each instance starts and will check it into the cluster so that it can accept processing tasks.
In the first 10 minutes after your launch request, Amazon Elastic MapReduce either starts your job flow (if all of your instances are available) or checks in as many instances as possible. Once the 10 minute mark has passed, Amazon Elastic MapReduce will start processing (and charging for) your job flow as soon as 90% of your requested instances are available. As the remaining 10% of your requested instances check in, Amazon Elastic MapReduce starts charging for those instances as well.
So, in the above example, if all 100 of your requested instances are available 10 minutes after you kick off a launch request, you’ll be charged $1.50 per hour (100 * $0.015) for as long as the job flow takes to complete. If only 90 of your requested instances were available at the 10 minute mark, you’d be charged $1.35 per hour (90 * $0.015) for as long as this was the number of instances running your job flow. When the remaining 10 instances checked in, you’d be charged $1.50 per hour (100 * $0.015) for as long as the balance of the job flow takes to complete.
Each job flow will run until one of the following occurs: you terminate the job flow with the TerminateJobFlows API call (or an equivalent tool), the job flow shuts itself down, or the job flow is terminated due to software or hardware failure. Partial instance hours consumed are billed as full hours.
-
Q: Where can I track my Amazon Elastic MapReduce, Amazon EC2 and Amazon S3 usage?
-
You can track your usage on your AWS Account Activity page.
-
Q: Can I track how many instance hours each job flow consumed?
-
Yes. On the AWS Management Console, every job flow has a Normalized Instance Hours column that displays the approximate number of compute hours the job flow took so far. Normalized Instance Hours are hours of compute time based on the standard of 1 hour of m1.small = 1 hour normalized compute time:
- 1 hour of m1.small = 1 hour normalized compute time
- 1 hour of m1.medium = 2 hours normalized compute time
- 1 hour of m1.large = 4 hours normalized compute time
- 1 hour of m1.xlarge = 8 hours normalized compute time
- 1 hour of c1.medium = 2 hours normalized compute time
- 1 hour of c1.xlarge = 9 hours normalized compute time
- 1 hour of m2.xlarge = 7 hours normalized compute time
- 1 hour of m2.2xlarge = 14 hours normalized compute time
- 1 hour of m2.4xlarge = 27 hours normalized compute time
- 1 hour of cc1.4xlarge = 21 hours normalized compute time
- 1 hour of cg1.4xlarge = 34 hours normalized compute time
- 1 hour of cc2.8xlarge = 39 hours normalized compute time
- 1 hour of hi1.4xlarge = 48 hours normalized compute time
- 1 hour of hs1.8xlarge = 71 hours normalized compute time
This is an approximate number and should not be used for billing purposes. Please refer to AWS Account Activity page for billable Amazon Elastic MapReduce usage.
-
Q: Does Amazon Elastic MapReduce support Amazon EC2 On-Demand, Spot, and Reserved Instances?
-
Yes. Amazon Elastic MapReduce seamlessly supports On-Demand, Spot, and Reserved Instances. Click here to learn more about Amazon EC2 Reserved Instances. Click here to learn more about Amazon EC2 Spot Instances.
Security
-
Q: How do I prevent other people from viewing my data during job flow execution?
-
Amazon Elastic MapReduce starts your instances in two Amazon EC2 security groups, one for the master and another for the slaves. The master security group has a port open for communication with the service. It also has the SSH port open to allow you to SSH into the instances, using the key specified at startup. The slaves start in a separate security group, which only allows interaction with the master instance. By default both security groups are set up to not allow access from external sources including Amazon EC2 instances belonging to other customers. Since these are security groups within your account, you can reconfigure them using the standard EC2 tools or dashboard. Click here to learn more about EC2 security groups.
-
Q: How secure is my data?
-
Amazon S3 provides authentication mechanisms to ensure that stored data is secured against unauthorized access. Unless the customer who is uploading the data specifies otherwise, only that customer can access the data. Amazon Elastic MapReduce customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission. In addition, Amazon Elastic MapReduce always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3 (using any common data compression tool); they then need to add a decryption step to the beginning of their job flow when Amazon Elastic MapReduce fetches the data from Amazon S3.
Regions & Availability Zones
-
Q: How does Amazon Elastic MapReduce make use of Availability Zones?
-
Amazon Elastic MapReduce launches all nodes for a given cluster in the same Amazon EC2 Availability Zone. Running a cluster in the same zone improves performance of the jobs flows because it provides a higher data access rate. By default, Amazon Elastic MapReduce chooses the Availability Zone with the most available resources in which to run your job flow. However, you can specify another Availability Zone if required.
-
Q: In what Regions is this service available?
-
Amazon Elastic MapReduce is available in the US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), South America (Sao Paulo), and AWS GovCloud (US) Regions.
-
Q: Which Region should I select to run my job flows?
-
When creating a job flow, typically you should select the Region where your data is located.
-
Q: Can I use EU data in a job flow running in the US region and vice versa?
-
Yes you can. If you transfer data from one region to the other you will be charged bandwidth charges. For bandwidth pricing information, please visit the pricing section on the EC2 detail page.
-
Q: What is different about the AWS GovCloud (US) region?
-
The AWS GovCloud (US) region is designed for US government agencies and customers. It adheres to US ITAR requirements. In GovCloud, EMR does not support spot instances or the enable-debugging feature. The EMR Management Console is not yet available in GovCloud.
Managing your Cluster
-
Q: How does Amazon Elastic MapReduce use Amazon EC2 and Amazon S3?
-
Customers upload their input data and a data processing application into Amazon S3. Amazon Elastic MapReduce then launches a number of Amazon EC2 instances as specified by the customer. The service begins the job flow execution while pulling the input data from Amazon S3 using S3N protocol into the launched Amazon EC2 instances. Once the job flow is finished, Amazon Elastic MapReduce transfers the output data to Amazon S3, where customers can then retrieve it or use as input in another job flow.
-
Q: How is a computation done in Amazon Elastic MapReduce?
-
Amazon Elastic MapReduce uses the Hadoop data processing engine to conduct computations implemented in the MapReduce programming model. The customer implements their algorithm in terms of map() and reduce() functions. The service starts a customer-specified number of Amazon EC2 instances, comprised of one master and multiple slaves. Amazon Elastic MapReduce runs Hadoop software on these instances. The master node divides input data into blocks, and distributes the processing of the blocks to the slave node. Each slave node then runs the map function on the data it has been allocated, generating intermediate data. The intermediate data is then sorted and partitioned and sent to processes which apply the reducer function to it. These processes also run on the slave nodes. Finally, the output from the reducer tasks is collected in files. A single “job flow” may involve a sequence of such MapReduce steps.
-
Q: How reliable is Amazon Elastic MapReduce?
-
Amazon Elastic MapReduce manages an Amazon EC2 cluster of compute instances using Amazon’s highly available, proven network infrastructure and datacenters. Amazon Elastic MapReduce uses industry proven, fault-tolerant Hadoop software as its data processing engine. Hadoop splits the data into multiple subsets and assigns each subset to more than one Amazon EC2 instance. So, if an Amazon EC2 instance fails to process one subset of data, the results of another Amazon EC2 instance can be used.
-
Q: How quickly will my job flow be up and running and processing my input data?
-
Amazon Elastic MapReduce starts resource provisioning of Amazon EC2 On-Demand instances almost immediately. If the instances are not available, Amazon Elastic MapReduce will keep trying to provision the resources for your job flow until they are provisioned or you cancel your request. The instance provisioning is done on a best-efforts basis and depends on the number of instances requested, time when the job flow is created, and total number of requests in the system. After resources have been provisioned, it typically takes fewer than 15 minutes to start processing.
In order to guarantee capacity for your job flows at the time you need it, you may pay a one-time fee for Amazon EC2 Reserved Instances to reserve instance capacity in the cloud at a discounted hourly rate. Like On-Demand Instances, customers pay usage charges only for the time when their instances are running. In this way, Reserved Instances enable businesses with known instance requirements to maintain the elasticity and flexibility of On-Demand Instances, while also reducing their predictable usage costs even further.
-
Q: Which Amazon EC2 instance types does Amazon Elastic MapReduce support?
-
Amazon Elastic MapReduce supports 12 EC2 instance types including Standard, High CPU, High Memory, Cluster Compute, and High Storage. Standard Instances have memory to CPU ratios suitable for most general-purpose applications. High CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications. High Memory instances offer large memory sizes for high throughput applications. Cluster Compute instances have proportionally high CPU with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. High Storage instances offer 48 TB of storage across 24 disks and are ideal for applications that require sequential access to very large data sets such as data warehousing and log processing. See the EMR pricing page for details on available instance types and pricing per region.
-
Q: How do I select the right Amazon EC2 instance type?
-
When choosing instance types, you should consider the characteristics of your application with regards to resource utilization and select the optimal instance family. One of the advantages of Amazon Elastic MapReduce with Amazon EC2 is that you pay only for what you use, which makes it convenient and inexpensive to test the performance of your job flows on different instance types and quantity. One effective way to determine the most appropriate instance type is to launch several small clusters and benchmark your job flows.
-
Q: How do I select the right number of instances for my job flow?
-
The number of instances to use in your job flow is application-dependent and should be based on both the amount of resources required to store and process your data and the acceptable amount of time for your job to complete.
As a general guideline, we recommend that you limit 60% of your disk space to storing the data you will be processing, leaving the rest for intermediate output. Hence, given 3x replication on HDFS, if you were looking to process 5 TB on m1.xlarge instances, which have 1,690 GB of disk space, we recommend your cluster contains at least (5 TB * 3) / (1,690 GB * .6) = 15 m1.xlarge core nodes. You may want to increase this number if your job generates a high amount of intermediate data or has significant I/O requirements. You may also want to include additional task nodes to improve processing performance. See Amazon EC2 Instance Types for details on local instance storage for each instance type configuration.
-
Q: How long will it take to run my job flow?
-
The time to run your job flow will depend on several factors including the type of your job flow, the amount of input data, and the number and type of Amazon EC2 instances you choose for your job flow.
-
Q: If the master node in a job flow goes down, can Amazon Elastic MapReduce recover it?
-
No. If the master node goes down, your job flow will be terminated and you’ll have to rerun your job. Amazon Elastic MapReduce currently does not support automatic failover of the master nodes or master node state recovery. In case of master node failure, the AWS Management console displays “The master node was terminated” message which is an indicator for you to start a new job flow. Customers can instrument check pointing in their job flows to save intermediate data (data created in the middle of a job flow that has not yet been reduced) on Amazon S3. This will allow resuming the job flow from the last check point in case of failure.
-
Q: If a slave node goes down in a job flow, can Amazon Elastic MapReduce recover from it?
-
Yes. Amazon Elastic MapReduce is fault tolerant for slave failures and continues job execution if a slave node goes down. In the current version, Amazon Elastic MapReduce does not automatically provision another node to take over failed slaves.
-
Q: Can I SSH onto my cluster nodes?
-
Yes. You can SSH onto your cluster nodes and execute Hadoop commands directly from there. If you need to SSH into a slave node, you have to first SSH to the master node, and then SSH into the slave node.
-
Q: Can I use Microsoft Windows instances with Amazon Elastic MapReduce?
-
At this time, Amazon Elastic MapReduce supports Debian/Lenny in 32 and 64 bit modes. We are always listening to customer feedback and will add more capabilities over time to help our customers solve their data crunching business problems.
-
Q: What is Amazon Elastic MapReduce Bootstrap Actions?
-
Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.
-
Q: How can I use Bootstrap Actions?
-
You can write a Bootstrap Action script in any language already installed on the job flow instance including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a job flow. Please refer to the “Developer’s Guide”: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ for details on how to use Bootstrap Actions.
-
Q: How do I configure Hadoop settings for my job flow?
-
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
-
Q: Can I modify the number of slave nodes in a running job flow?
-
Yes. Slave nodes can be of two types: (1) core nodes, which both host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks and (2) task nodes, which only run Hadoop tasks. While a job flow is running you may increase the number of core nodes and you may either increase or decrease the number of task nodes. This can be done through the API, Java SDK, or though the command line client. Please refer to the Resizing Running Job Flows section in the Developer’s Guide for details on how to modify the size of your running job flow.
-
Q: When would I want to use core nodes versus task nodes?
-
As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your job flow completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis.
-
Q: Why would I want to modify the number of slave nodes in my running job flow?
-
There are several scenarios where you may want to modify the number of slave nodes in a running job flow. If your job flow is running slower than expected, or timing requirements change, you can increase the number of core nodes to increase job flow performance. If different phases of your job flow have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your job flow’s varying capacity requirements.
-
Q: Can I automatically modify the number of slave nodes between job flow steps?
-
Yes. You may include a predefined step in your workflow that automatically resizes a job flow between steps that are known to have different capacity needs. As all steps are guaranteed to run sequentially, this allows you to set the number of slave nodes that will execute a given job flow step.
-
Q: How can I allow other IAM users to access my job flow?
-
To create a new job flow that is visible to all IAM users within the EMR CLI: Add the --visible-to-all-users flag when you create the job flow. For example: elastic-mapreduce --create --visible-to-all-users. Within the Management Console, simply select “Visible to all IAM Users” on the Advanced Options pane of the Create Job Flow Wizard.
To make an existing job flow visible to all IAM users you must use the EMR CLI. Use --set-visible-to-all-users and specify the job flow identifier. For example: elastic-mapreduce --set-visible-to-all-users true --jobflow j-xxxxxxx. This can only be done by the creator of the job flow.
To learn more, see the Configuring User Permissions section of the EMR Developer Guide.
Using Hive
-
Q: What is Apache Hive?
-
Hive is an open source datawarehouse and analytics package that runs on top of Hadoop. Hive is operated by a SQL-based language called Hive QL that allows users to structure, summarize, and query data sources stored in Amazon S3. Hive QL goes beyond standard SQL, adding first-class support for map/reduce functions and complex extensible user-defined data types like Json and Thrift. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Hive allows user extensions via user-defined functions written in Java and deployed via storage in Amazon S3.
-
Q: What can I do with Hive running on Amazon Elastic MapReduce?
-
Using Hive with Amazon Elastic MapReduce, you can implement sophisticated data-processing applications with a familiar SQL-like language and easy to use tools available with Amazon Elastic MapReduce. With Amazon Elastic MapReduce, you can turn your Hive applications into a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence tasks.
-
Q: How is Hive different than traditional RDBMS systems?
-
Traditional RDBMS systems provide transaction semantics and ACID properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly. They provide for fast update of small amounts of data and for enforcement of referential integrity constraints. Typically they run on a single large machine and do not provide support for executing map and reduce functions on the table, nor do they typically support acting over complex user defined data types.
In contrast, Hive executes SQL-like queries using MapReduce. Consequently, it is optimized for doing full table scans while running on a cluster of machines and is therefore able to process very large amounts of data. Hive provides partitioned tables, which allow it to scan a partition of a table rather than the whole table if that is appropriate for the query it is executing.
Traditional RDMS systems are best for when transactional semantics and referential integrity are required and frequent small updates are performed. Hive is best for offline reporting, transformation, and analysis of large data sets; for example, performing click stream analysis of a large website or collection of websites.
One of the common practices is to export data from RDBMS systems into Amazon S3 where offline analysis can be performed using Amazon Elastic MapReduce job flows running Hive.
-
Q: How can I get started with Hive running on Amazon Elastic MapReduce
-
The best place to start is to review our written or video tutorial located here http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2862
-
Q: Are there new features in Hive specific to Amazon Elastic MapReduce?
-
Yes. There are four new features which make Hive even more powerful when used with Amazon Elastic MapReduce, including:
a/ The ability to load table partitions automatically from Amazon S3. Previously, to import a partitioned table you needed a separate alter table statement for each individual partition in the table. Amazon Elastic MapReduce a now includes a new statement type for the Hive language: “alter table recover partitions.” This statement allows you to easily import tables concurrently into many job flows without having to maintain a shared meta-data store. Use this functionality to read from tables into which external processes are depositing data, for example log files.
b/ The ability to specify an off-instance metadata store. By default, the metadata store where Hive stores its schema information is located on the master node and ceases to exist when the job flow terminates. This feature allows you to override the location of the metadata store to use, for example a MySQL instance that you already have running in EC2.
c/ Writing data directly to Amazon S3. When writing data to tables in Amazon S3, the version of Hive installed in Amazon Elastic MapReduce writes directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement but it means that HDFS and S3 from a Hive perspective behave differently. You cannot read and write within the same statement to the same table if that table is located in Amazon S3. If you want to update a table located in S3, then create a temporary table in the job flow’s local HDFS
filesystem, write the results to that table, and then copy them to Amazon S3.
d/ Accessing resources located in Amazon S3. The version of Hive installed in Amazon Elastic MapReduce allows you to reference resources such as scripts for custom map and reduce operations or additional libraries located in Amazon S3 directly from within your Hive script (e.g., add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar).
-
Q: What types of Hive job flows are supported?
-
There are two types of job flows supported with Hive: interactive and batch. In an interactive mode a customer can start a job flow and run Hive scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Hive script is stored in Amazon S3 and is referenced at the start of the job flow. Typically, batch mode is used for repeatable runs such as report generation.
-
Q: How can I launch a Hive job flow?
-
Both batch and interactive job flows can be started from AWS Management Console, Elastic MapReduce command line client, or APIs. Please refer to the Using Hive section in the Developer’s Guide for more details on launching a Hive job flow.
-
Q: When should I use Hive vs. PIG?
-
Hive and PIG both provide high level data-processing languages with support for complex data types for operating on large datasets. The Hive language is a variant of SQL and so is more accessible to people already familiar with SQL and relational databases. Hive has support for partitioned tables which allow Amazon Elastic MapReduce job flows to pull down only the table partition relevant to the query being executed rather than doing a full table scan. Both PIG and Hive have query plan optimization. PIG is able to optimize across an entire scripts while Hive queries are optimized at the statement level.
Ultimately the choice of whether to use Hive or PIG will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.
-
Q: What version of Hive does Amazon Elastic MapReduce support?
-
Amazon Elastic MapReduce supports Hive version 0.7.1 and 0.8.1.
-
Q: Can I write to a table from two job flows concurrently
-
No. Hive does not support concurrently writing to tables. You should avoid concurrently writing to the same table or reading from a table while you are writing to it. Hive has non-deterministic behavior when reading and writing at the same time or writing and writing at the same time.
-
Q: Can I share data between job flows?
-
Yes. You can read data in Amazon S3 within a Hive script by having ‘create external table’ statements at the top of your script. You need a create table statement for each external resource that you access.
-
Q: Should I run one large job flow, and share it amongst many users or many smaller job flows?
-
Amazon Elastic MapReduce provides a unique capability for you to use both methods. On the one hand one large job flow may be more efficient for processing regular batch workloads. On the other hand, if you require ad-hoc querying or workloads that vary with time, you may choose to create several separate job flow tuned to the specific task sharing data sources stored in Amazon S3.
-
Q: Can I access a script or jar resource which is on my local file system?
-
No. You must upload the script or jar to Amazon S3 or to the job flow’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.
-
Q: Can I run a persistent job flow executing multiple Hive queries?
-
Yes. You run a job flow in a manual termination mode so it will not terminate between Hive steps. To reduce the risk of data loss we recommend periodically persisting all of your important data in Amazon S3. It is good practice to regularly transfer your work to a new job flow to test you process for recovering from master node failure.
-
Q: Can multiple users execute Hive steps on the same source data?
-
Yes. Hive scripts executed by multiple users on separate job flows may contain create external table statements to concurrently import source data residing in Amazon S3.
-
Q: Can multiple users run queries on the same job flow?
-
Yes. In the batch mode, steps are serialized. Multiple users can add Hive steps to the same job flow, however, the steps will be executed serially. In interactive mode, several users can be logged on to the same job flow and execute Hive statements concurrently.
-
Q: Can data be shared between multiple AWS users?
-
Yes. Data can be shared using standard Amazon S3 sharing mechanism described here http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3_ACLs.html
-
Q: Does Hive support access from JDBC?
-
Yes. Hive provides JDBC drive, which can be used to programmatically execute Hive statements. To start a JDBC service in your job flow you need to pass an optional parameter in the Amazon Elastic MapReduce command line client. You also need to establish an SSH tunnel because the security group does not permit external connections.
-
Q: What is your procedure for updating packages on Elastic MapReduce AMIs?
-
We run a select set of packages from Debian/stable including security patches. We will upgrade a package whenever it gets upgraded in Debian/stable. The “r-recommended” package on our image is up to date with Debian/stable (http://packages.debian.org/search?keywords=r-recommended).
-
Q: Can I update my own packages on Elastic MapReduce job flows?
-
Yes. You can use Bootstrap Actions to install updates to packages on your clusters.
-
Q: Can I process DynamoDB data using Hive?
-
Yes. Simply define an external Hive table based on your DynamoDB table. You can then use Hive to analyze the data stored in DynamoDB and either load the results back into DynamoDB or archive them in Amazon S3. For more information please visit our Developer Guide.
Using Pig
-
Q: What is Apache Pig?
-
Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by a SQL-like language called Pig Latin, which allows users to structure, summarize, and query data sources stored in Amazon S3. As well as SQL-like operations, Pig Latin also adds first-class support for map/reduce functions and complex extensible user defined data types. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Pig allows user extensions via user-defined functions written in Java and deployed via storage in Amazon S3.
-
Q: What can I do with Pig running on Amazon Elastic MapReduce?
-
Using Pig with Amazon Elastic MapReduce, you can implement sophisticated data-processing applications with a familiar SQL-like language and easy to use tools available with Amazon Elastic MapReduce. With Amazon Elastic MapReduce, you can turn your Pig applications into a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence tasks.
-
Q: How can I get started with Pig running on Amazon Elastic MapReduce?
-
The best place to start is to review our written or video tutorial located here http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2735&categoryID=269
-
Q: Are there new features in Pig specific to Amazon Elastic MapReduce?
-
Yes. There are three new features which make Pig even more powerful when used with Amazon Elastic MapReduce, including:
a/ Accessing multiple filesystems. By default a Pig job can only access one remote file system, be it an HDFS store or S3 bucket, for input, output and temporary data. Elastic MapReduce has extended Pig so that any job can access as many file systems as it wishes. An advantage of this is that temporary intra-job data is always stored on the local HDFS, leading to improved perfomance.
b/ Loading resources from S3. Elastic MapReduce has extended Pig so that custom JARs and scripts can come from the S3 file system, for example “REGISTER s3:///my-bucket/piggybank.jar”
c/ Additional Piggybank function for String and DateTime processing. These are documented here http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2730.
-
Q: What types of Pig job flows are supported?
-
There are two types of job flows supported with Pig: interactive and batch. In an interactive mode a customer can start a job flow and run Pig scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the job flow. Typically, batch mode is used for repeatable runs such as report generation.
-
Q: How can I launch a Pig job flow?
-
Both batch and interactive job flows can be started from AWS Management Console, Elastic MapReduce command line client, or APIs.
-
Q: What version of Pig does Amazon Elastic MapReduce support?
-
Amazon Elastic MapReduce supports Pig version 0.6 and Pig version 0.9.1.
-
Q: Can I write to a S3 bucket from two job flows concurrently
-
Yes, you are able to write to the same bucket from two concurrent job flows.
-
Q: Can I share input data in S3 between job flows?
-
Yes, you are able to read the same data in S3 from two concurrent job flows.
-
Q: Can data be shared between multiple AWS users?
-
Yes. Data can be shared using standard Amazon S3 sharing mechanism described here http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3_ACLs.html
-
Q: Should I run one large job flow, and share it amongst many users or many smaller job flows?
-
Amazon Elastic MapReduce provides a unique capability for you to use both methods. On the one hand one large job flow may be more efficient for processing regular batch workloads. On the other hand, if you require ad-hoc querying or workloads that vary with time, you may choose to create several separate job flow tuned to the specific task sharing data sources stored in Amazon S3.
-
Q: Can I access a script or jar resource which is on my local file system?
-
No. You must upload the script or jar to Amazon S3 or to the job flow’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.
-
Q: Can I run a persistent job flow executing multiple Pig queries?
-
Yes. You run a job flow in a manual termination mode so it will not terminate between Pig steps. To reduce the risk of data loss we recommend periodically persisting all important data in Amazon S3. It is good practice to regularly transfer your work to a new job flow to test you process for recovering from master node failure.
-
Q: Does Pig support access from JDBC?
-
No. Pig does not support access through JDBC.