Posted On: Jun 16, 2015
Apache Spark is now supported on Amazon EMR. Like Apache Hadoop, Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Spark uses in-memory caching and optimized execution for fast performance, and it supports batch processing, streaming, machine learning, graph processing, and ad hoc queries. Spark applications can be written in Scala, Python, Java, and SQL (using the Spark SQL module), so you can develop for Spark on Amazon EMR in many popular languages. Spark also includes several libraries to help build applications for machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX). You can install Spark alongside the other Hadoop applications available in Amazon EMR and use the EMR File System (EMRFS) to directly access data in Amazon S3.
You can create an Amazon EMR cluster with Apache Spark from the AWS Management Console, AWS CLI, or SDK by choosing AMI 3.8.0 and adding Spark as an application. Amazon EMR currently supports Spark version 1.3.1 and uses Hadoop YARN as the cluster manager. To submit applications to Spark on your Amazon EMR cluster, you can add Spark steps with the Step API or interact directly with the Spark API on your cluster’s master node. To learn more, visit the Apache Spark on Amazon EMR page. For instructions on how to launch an Amazon EMR cluster with Spark, see the Amazon EMR documentation.
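As a rough illustration of the SDK route, the sketch below builds a request for the `run_job_flow` call in boto3 (the AWS SDK for Python) that selects AMI 3.8.0 and adds Spark as an application. The cluster name, instance type, instance count, region, and the use of the default EMR role names are assumptions chosen for the example, not values from this announcement, and the helper function names are hypothetical.

```python
def build_spark_cluster_request(name="spark-cluster",
                                instance_type="m3.xlarge",
                                instance_count=3):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The defaults here (name, instance type, count) are illustrative
    assumptions; adjust them for your own account and workload.
    """
    return {
        "Name": name,
        # AMI version named in the announcement.
        "AmiVersion": "3.8.0",
        # Install Spark as an application on the cluster.
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running so steps can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Send the request to Amazon EMR and return the new cluster ID.

    Requires boto3 and valid AWS credentials; the region is an assumption.
    """
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_spark_cluster_request())` would start a billable cluster, so the request builder is kept separate to allow inspecting the parameters first.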