Apache Spark on Amazon EMR

Why Apache Spark on EMR?

Amazon EMR is the best place to run Apache Spark. You can quickly and easily create managed Spark clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. Additionally, you can leverage additional Amazon EMR features, including fast Amazon S3 connectivity using the Amazon EMR File System (EMRFS), integration with the Amazon EC2 Spot market and the AWS Glue Data Catalog, and EMR Managed Scaling to add or remove instances from your cluster. AWS Lake Formation brings fine-grained access control, while integration with AWS Step Functions helps with orchestrating your data pipelines. EMR Studio (preview) is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter Notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Notebooks make it easy for you to experiment and build applications with Spark. If you prefer, you can use Apache Zeppelin to create interactive and collaborative notebooks for data exploration using Spark.

Learn more about Apache Spark here

Features and benefits

EMR features Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters. Amazon EMR runtime for Apache Spark can be over 3x faster than clusters without the EMR runtime, and has 100% API compatibility with standard Apache Spark. This improved performance means your workloads run faster and saves you compute costs, without making any changes to your applications.

By using a directed acyclic graph (DAG) execution engine, Spark can create efficient query plans for data transformations. Spark also stores input, output, and intermediate data in-memory as resilient dataframes, which allows for fast processing without I/O cost, boosting performance of iterative or interactive workloads.

Apache Spark natively supports Java, Scala, SQL, and Python, which gives you a variety of languages for building your applications. Also, you can submit SQL or HiveQL queries using the Spark SQL module. In addition to running applications, you can use the Spark API interactively with Python or Scala directly in the Spark shell or via EMR Studio, or Jupyter notebooks on your cluster. Support for Apache Hadoop 3.0 in EMR 6.0 brings Docker container support to simplify managing dependencies. You can also leverage cluster-independent EMR Notebooks (based on Jupyter) or use Zeppelin to create interactive and collaborative notebooks for data exploration and visualization. You can tune and debug your workloads in the EMR console which has an off-cluster, persistent Spark History Server.

Apache Spark includes several libraries to help build applications for machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX). These libraries are tightly integrated in the Spark ecosystem, and they can be leveraged out of the box to address a variety of use cases. Additionally, you can use deep learning frameworks like Apache MXNet with your Spark applications. Integration with AWS Step Functions enables you to add serverless workflow automation and orchestration to your applications.

Submit Apache Spark jobs with the EMR Step API, use Spark with EMRFS to directly access data in S3, save costs using EC2 Spot capacity, use EMR Managed Scaling to dynamically add and remove capacity, and launch long-running or transient clusters to match your workload. You can also easily configure Spark encryption and authentication with Kerberos using an EMR security configuration. Additionally, you can use the AWS Glue Data Catalog to store Spark SQL table metadata or use Amazon SageMaker with your Spark machine learning pipelines. EMR installs and manages Spark on Hadoop YARN, and you can also add other big data applications on your cluster. EMR with Apache Hudi lets you more efficiently manage change data capture (CDC) and helps with privacy regulations like GDPR and CCPA by simplifying record deletion. Click here for more details about EMR features.

Use cases

Consume and process real-time data from Amazon KinesisApache Kafka, or other data streams with Spark Streaming on EMR. Perform streaming analytics in a fault-tolerant way and write results to S3 or on-cluster HDFS.

Apache Spark on EMR includes MLlib for a variety of scalable machine learning algorithms, or you can use your own libraries. By storing datasets in-memory during a job, Spark has great performance for iterative queries common in machine learning workloads. You can enhance Amazon SageMaker capabilities by connecting the notebook instance to an Apache Spark cluster running on Amazon EMR, with Amazon SageMaker Spark for easily training models and hosting models.

Use Spark SQL for low-latency, interactive queries with SQL or HiveQL. Spark on EMR can leverage EMRFS, so you can have ad hoc access to your datasets in S3. Also, you can utilize EMR Studio, EMR Notebooks, Zeppelin notebooks, or BI tools via ODBC and JDBC connections.

Customer success

  • Yelp

    Yelp’s advertising targeting team makes prediction models to determine the likelihood of a user interacting with an advertisement. By using Apache Spark on Amazon EMR to process large amounts of data to train machine learning models, Yelp increased revenue and advertising click-through rate.

  • The Washington Post

    The Washington Post uses Apache Spark on Amazon EMR to build models powering its website’s recommendation engine to boost reader engagement and satisfaction. They leverage Amazon EMR's performant connectivity with Amazon S3 to update models in near real-time.

  • Krux

    As part of its Data Management Platform for customer insights, Krux runs many machine learning and general processing workloads using Apache Spark. Krux utilizes ephemeral Amazon EMR clusters with Amazon EC2 Spot Capacity to save costs and uses Amazon S3 with EMRFS as a data layer for Apache Spark.

    Read more »
  • GumGum

    GumGum, an in-image and in-screen advertising platform, uses Spark on Amazon EMR for inventory forecasting, processing of clickstream logs, and ad hoc analysis of unstructured data in Amazon S3. Spark’s performance enhancements saved GumGum time and money for these workflows.

    Read more »
  • Hearst Corporation

    Hearst Corporation, a large diversified media and information company, has customers viewing content on over 200 web properties. Using Apache Spark Streaming on Amazon EMR, Hearst’s editorial staff can keep a real-time pulse on which articles are performing well and which themes are trending.

    Read more »
  • CrowdStrike

    CrowdStrike provides endpoint protection to stop breaches. They use Amazon EMR with Spark to process hundreds of terabytes of event data and roll it up into higher-level behavioral descriptions on the hosts. From that data, CrowdStrike can pull event data together and identify the presence of malicious activity.

    Read more »