Getting started with Amazon EMR
How to use EMR
Develop your data processing application
You can use Java, Hive (a SQL-like language), Pig (a data processing language), Cascading, Ruby, Perl, Python, R, PHP, C++, or Node.js. Amazon EMR provides code samples and tutorials to get you up and running quickly.
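As an illustration of what a small application might look like in Python, here is a minimal word-count sketch written in the Hadoop Streaming style (read lines from standard input, write tab-separated key/value pairs to standard output). The file name `wordcount.py`, the `map`/`reduce` command-line convention, and the function names are assumptions made for this example; they are not taken from the Amazon EMR samples.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Mapper: emit one (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def parse_pairs(lines):
    """Parse the tab-separated "word<TAB>count" lines written by the mapper."""
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        yield word, int(count)


def reduce_counts(pairs):
    """Reducer: sum the counts per word. Pairs must arrive sorted by word,
    which is what the shuffle phase provides between mapper and reducer."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run(mode, stdin, stdout):
    """Run one phase, reading from stdin and writing word<TAB>count lines."""
    if mode == "map":
        results = map_words(stdin)
    else:
        results = reduce_counts(parse_pairs(stdin))
    for word, count in results:
        stdout.write(f"{word}\t{count}\n")


if __name__ == "__main__":
    # Hypothetical convention: `wordcount.py map` or `wordcount.py reduce`.
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    if mode in ("map", "reduce"):
        run(mode, sys.stdin, sys.stdout)
```

You can try the logic locally before uploading anything, for example with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.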
Upload your application and data to Amazon S3
If you have a large amount of data to upload, consider using AWS Import/Export Snowball to transfer it on physical storage devices, or AWS Direct Connect to establish a dedicated network connection from your data center to AWS. You can also write your data directly to a running cluster.
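For modest amounts of data, the upload can be scripted with the AWS SDK for Python (boto3). The following is a minimal sketch, assuming boto3 is installed and AWS credentials are already configured; the helper names and the bucket/prefix layout are hypothetical.

```python
from pathlib import Path


def s3_uri(bucket, key):
    """Return the s3:// URI used to reference an uploaded object."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def upload_inputs(local_dir, bucket, prefix):
    """Upload every file under local_dir to s3://bucket/prefix/ and return the URIs."""
    import boto3  # imported lazily so s3_uri() works without the SDK installed

    s3 = boto3.client("s3")
    uploaded = []
    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            relative = path.relative_to(local_dir).as_posix()
            key = f"{prefix.strip('/')}/{relative}"
            s3.upload_file(str(path), bucket, key)  # Filename, Bucket, Key
            uploaded.append(s3_uri(bucket, key))
    return uploaded
```

A call such as `upload_inputs("data", "my-example-bucket", "input")` (bucket name hypothetical) would return URIs of the form `s3://my-example-bucket/input/...`, which you can then pass to the cluster as the location of your data.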
Configure and launch your cluster
Using the AWS Management Console, the AWS CLI, SDKs, or APIs, specify the number of Amazon EC2 instances to provision in your cluster, the types of instances to use (standard, high memory, high CPU, high I/O, etc.), the applications to install (Apache Spark, Apache Hive, Apache HBase, Presto, etc.), and the location of your application and data. You can use Bootstrap Actions to install additional software or change default settings.
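The choices described above map onto the parameters of the `run_job_flow` call in boto3. The sketch below builds such a request as a plain dictionary and submits it in a separate function. The release label, instance type, application list, S3 URIs, and function names are assumptions chosen for illustration, and the default EMR roles are assumed to exist in the account.

```python
def build_cluster_request(name, log_uri, script_uri, input_uri, output_uri,
                          core_count=2, bootstrap_uri=None):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    request = {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: pick the release you need
        "LogUri": log_uri,
        # Applications to install on the cluster.
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            # Number and type of EC2 instances (instance type is an assumption).
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_count},
            ],
            # Terminate the cluster once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Location of the application and its data, passed as step arguments.
        "Steps": [{
            "Name": "Run application",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_uri, input_uri, output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }
    if bootstrap_uri:
        # Optional bootstrap action: a script in S3 run on each node at startup.
        request["BootstrapActions"] = [{
            "Name": "Custom setup",
            "ScriptBootstrapAction": {"Path": bootstrap_uri, "Args": []},
        }]
    return request


def launch_cluster(request):
    """Submit the request and return the new cluster ID."""
    import boto3  # imported lazily; requires configured AWS credentials

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]
```

Keeping the request as a dictionary makes it easy to inspect or store alongside your application before anything is launched.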
Monitor the cluster
You can monitor the cluster’s health and progress using the AWS Management Console, the AWS CLI, SDKs, or APIs. EMR integrates with Amazon CloudWatch for monitoring and alarms, and supports popular monitoring tools such as Ganglia. You can add or remove capacity at any time to handle more or less data. For troubleshooting, you can use the console’s simple debugging GUI.
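Monitoring and resizing through the SDK might look like the following sketch. It polls `describe_cluster` until the cluster reaches a stopping state and changes the instance count of one instance group with `modify_instance_groups`. The function names, the polling interval, and the choice to stop on `WAITING` are assumptions; `emr` is expected to be a `boto3.client("emr")` object.

```python
import time

# States after which the cluster will not process further work on its own.
TERMINAL_STATES = frozenset({"TERMINATED", "TERMINATED_WITH_ERRORS"})


def wait_for_cluster(emr, cluster_id, poll_seconds=30,
                     stop_states=TERMINAL_STATES | {"WAITING"}):
    """Poll the cluster state until it is idle (WAITING) or terminated."""
    while True:
        response = emr.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        print(f"{cluster_id}: {state}")
        if state in stop_states:
            return state
        time.sleep(poll_seconds)


def resize_instance_group(emr, cluster_id, instance_group_id, instance_count):
    """Add or remove capacity by setting a new instance count for one group."""
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": instance_group_id,
            "InstanceCount": instance_count,
        }],
    )
```

Passing the client in as an argument keeps the functions easy to exercise with a stand-in object before pointing them at a real cluster.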
Retrieve the output
Retrieve the output from Amazon S3 or from HDFS on the cluster. Visualize the data with tools such as Amazon QuickSight, Tableau, and MicroStrategy. Amazon EMR will automatically terminate the cluster when processing is complete. Alternatively, you can leave the cluster running and give it more work to do.
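When the output is in Amazon S3, it can be copied to a local directory with boto3. The sketch below lists every object under an output prefix and downloads each one; the helper names and the prefix layout are hypothetical, and `s3` is expected to be a `boto3.client("s3")` object.

```python
from pathlib import Path


def local_path_for(key, prefix, dest_dir):
    """Map an S3 key under prefix to a file path under dest_dir."""
    relative = key[len(prefix):].lstrip("/")
    return Path(dest_dir) / relative


def download_output(s3, bucket, prefix, dest_dir):
    """Download every object under s3://bucket/prefix into dest_dir."""
    downloaded = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            target = local_path_for(key, prefix, dest_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, key, str(target))  # Bucket, Key, Filename
            downloaded.append(target)
    return downloaded
```

Using the paginator matters because a single listing call returns at most 1,000 objects, and a large job can write more output files than that.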
Are you ready to launch your first cluster?
Launch a cluster using the Amazon EMR Management Console. On the Create Cluster page, go to Advanced cluster configuration and click the gray "Configure Sample Application" button at the top right if you want to run a sample application with sample data.