AWS Big Data Blog

Installing Apache Spark on an Amazon EMR Cluster

Jonathan Fritz is a Senior Product Manager for Amazon EMR

———————–

Please note – Amazon EMR now officially supports Spark. For more information about Spark on EMR, visit the Spark on Amazon EMR page or read Intent Media’s guest post on the AWS Big Data Blog about Spark on EMR.

——–—————

Over the last five years, Amazon EMR has evolved into a platform for running many distributed computing frameworks beyond Hadoop MapReduce. Customers can run a variety of engines such as HBase, Impala, Spark, or Presto on their EMR clusters and take advantage of Amazon EMR features like fast access to data in Amazon S3, connectivity with other AWS services, and ease of cluster creation and management.

We’re particularly excited about Apache Spark, an engine in the Apache Hadoop ecosystem for fast and efficient processing of large datasets. By using in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to define data transformations, Spark has shown significant performance increases over Hadoop MapReduce for certain workloads.

EMR is no stranger to Spark. In fact, customers have been running Spark on EMR-managed Hadoop clusters for years. To give our customers easy access to Spark on their EMR clusters, we published a bootstrap action, along with an article on how to use Spark and Shark, back in February 2013.

Much has changed in the Spark ecosystem since then: Spark graduated to 1.x, guaranteeing the stability of its core API across all 1.x releases; Shark has been deprecated in favor of Spark SQL; and Spark can now run on top of YARN (the resource manager for Hadoop 2). In light of these changes, we have revised the bootstrap action to install Spark 1.x on our Hadoop 2.x AMIs and run it on top of YARN. The bootstrap action also installs and configures Spark SQL, Spark Streaming, MLlib, and GraphX.

The S3 location for the Spark installation bootstrap action is:

s3://support.elasticmapreduce/spark/install-spark

You can also find more information about the bootstrap script to install Spark on our EMR Labs GitHub page.

With this bootstrap action, you can easily install Spark on an EMR cluster from the Console or the AWS CLI (shown here; replace “MyKeyPair” with the name of the EC2 key pair you use to SSH into your cluster):

aws emr create-cluster --name SparkCluster --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=MyKeyPair --applications Name=Hive --use-default-roles --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark

Note: If you have not created the default IAM roles for EMR, you can do so using the EMR create-default-roles command. On AWS CLI version 1.7.17 or later, this command adds values to the AWS CLI config file that specify the default IAM roles (service role and instance profile) for use in the create-cluster command. If these values are present in your AWS CLI config file, you don’t need to include the --use-default-roles shortcut in your create-cluster command as shown in the example above.
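For example, you can create these default roles with a single AWS CLI command:

aws emr create-default-roles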

Currently, the bootstrap action will install:

  • Spark 0.8.1 on Hadoop 1.0.3 (AMI 2.x)
  • Spark 1.0.0 on Hadoop 2.2.0 (AMI 3.0.x)
  • Spark 1.1.0 on Hadoop 2.4.0 (AMI 3.1.x and 3.2.x)

We have also updated the original Spark on Amazon EMR article to refer to the new bootstrap action and use new syntax in the Spark and Spark SQL examples. This is a great way to start exploring Spark and Spark SQL on EMR.
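You can also experiment directly on the cluster by connecting to the master node and launching the interactive Spark shell. Here is a minimal sketch, assuming the bootstrap action has installed Spark under /home/hadoop/spark on the master node (replace the cluster ID placeholder and key pair file with your own):

# Open an SSH session to the master node of your running cluster
aws emr ssh --cluster-id j-XXXXXXXXXXXXX --key-pair-file ~/MyKeyPair.pem

# On the master node, start the Spark shell on YARN
/home/hadoop/spark/bin/spark-shell --master yarn-client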

If you have any questions or suggestions, please leave a comment below.

—————

Related:

Using IPython Notebook to Analyze Data with EMR

Getting Started with Elasticsearch and Kibana on EMR

Strategies for Reducing your EMR Costs

—————————————————————-

Love to work on open source? Check out EMR’s careers page.

—————————————————————-