Video: A Technical Introduction to Amazon EMR (AWS re:Invent, October 2015, Total: 50 minutes)

Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this presentation, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage, and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.

 

Video: Amazon EMR, Deep Dive and Best Practices (AWS re:Invent, October 2015, Total: 49 minutes)

In this presentation, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.

  1. Develop your data processing application. You can use Java, Hive (a SQL-like language), Pig (a data processing language), Cascading, Ruby, Perl, Python, R, PHP, C++, or Node.js. Amazon EMR provides code samples and tutorials to get you up and running quickly.
  2. Upload your application and data to Amazon S3. If you have a large amount of data to upload, consider using AWS Import/Export Snowball to transfer it on physical storage devices, or AWS Direct Connect to establish a dedicated network connection from your data center to AWS. If you prefer, you can also write your data directly to a running cluster.
  3. Configure and launch your cluster. Using the AWS Management Console, the AWS CLI, SDKs, or APIs, specify the number of Amazon EC2 instances to provision in your cluster, the types of instances to use (standard, high memory, high CPU, high I/O, etc.), the applications to install (Hive, Pig, HBase, etc.), and the location of your application and data. You can use Bootstrap Actions to install additional software or change default settings.
  4. Monitor the cluster (Optional). You can monitor the cluster’s health and progress using the Management Console, Command Line Interface, SDKs, or APIs. EMR integrates with Amazon CloudWatch for monitoring and alarming, and supports popular monitoring tools like Ganglia. You can add or remove cluster capacity at any time to handle more or less data. For troubleshooting, you can use the console’s simple debugging GUI.
  5. Retrieve the output. Retrieve the output from Amazon S3 or HDFS on the cluster. Visualize the data with tools like Tableau and MicroStrategy. Amazon EMR will automatically terminate the cluster when processing is complete. Alternatively, you can leave the cluster running and give it more work to do.
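
As a rough illustration of steps 3 and 5 when using the SDK route, the sketch below builds the request that the AWS SDK for Python (boto3) passes to the EMR `run_job_flow` call. The bucket paths, release label, instance types, and the helper names `build_cluster_request` and `launch_cluster` are assumptions made for this example. It also assumes the default EMR roles already exist in your account.

```python
"""Sketch: build and submit an Amazon EMR cluster request with boto3."""


def build_cluster_request(name, log_uri, script_uri, core_count=2):
    """Return keyword arguments for the EMR client's run_job_flow call.

    The release label, instance types, and role names below are
    illustrative assumptions; adjust them for your own account.
    """
    return {
        "Name": name,
        "LogUri": log_uri,  # S3 location for cluster logs
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            # False: terminate the cluster once all steps finish (step 5).
            # True: keep it running so more work can be added later.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "Run Hive script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "hive-script",
                        "--run-hive-script",
                        "--args",
                        "-f",
                        script_uri,  # Hive script uploaded in step 2
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(emr_client, request):
    """Submit the request and return the new cluster (job flow) ID.

    emr_client is expected to be boto3.client("emr").
    """
    response = emr_client.run_job_flow(**request)
    return response["JobFlowId"]
```

With boto3 installed and credentials configured, a call such as `launch_cluster(boto3.client("emr"), build_cluster_request("demo", "s3://my-bucket/logs/", "s3://my-bucket/scripts/query.q"))` would submit the cluster. The bucket name here is a placeholder.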

Are you ready to launch your first cluster?

Click here to launch a cluster using the Amazon EMR Management Console. On the Create Cluster page, go to Advanced cluster configuration, and click on the gray "Configure Sample Application" button at the top right if you want to run a sample application with sample data.

For a step-by-step written tutorial, click here. This tutorial walks you through creating a cluster that counts the frequency of words in a text file.
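
To give a feel for what a word-frequency job computes, here is a small local sketch of the map and reduce phases in plain Python. It is not the tutorial's code, and the function names are made up for this example. On a cluster, the same logic would be split across mapper and reducer tasks.

```python
"""Sketch: word-frequency counting in a map/reduce style, run locally."""
import re
from collections import Counter


def map_words(line):
    """Map phase: emit a (word, 1) pair for each word on the line."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z']+", line)]


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines):
    """Run the map phase over every line, then reduce the combined pairs."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)
```

For example, `word_count(["the cat", "The dog"])` counts "the" twice and "cat" and "dog" once each.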

Need Help? Ask Us!

Do you need help building a proof of concept or tuning your EMR applications? AWS has a global support team that specializes in EMR. Please contact us if you are interested in learning more about short-term (2-6 week) paid support engagements.

The Big Data on AWS course gives you hands-on experience using Amazon Web Services for big data workloads. AWS will show you how to run Amazon EMR jobs to process data using the broad ecosystem of Hadoop tools like Pig and Hive. AWS will also teach you how to create big data environments in the cloud by working with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and apply best practices to design big data environments for analysis, security, and cost-effectiveness. To learn more about the Big Data course, click here.

If you are planning to process more than 1 TB per day, you may be eligible for EMR Bootcamp, an onsite proof-of-concept and knowledge transfer workshop with an AWS Solutions Architect who specializes in EMR. To learn more, click here or contact us.

Scale Unlimited offers customized on-site training for companies that need to quickly learn how to use EMR and other big data technologies.  To find out more, click here.