Announcing EMR Runtime for Apache Spark

Posted on: Nov 18, 2019

We are happy to announce the Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is available and turned on by default on Amazon EMR clusters starting with EMR release 5.28. EMR runtime for Spark is up to 32x faster, with 100% API compatibility with open source Spark.

To measure the impact of these improvements, we ran TPC-DS benchmark queries at 3 TB scale on a 6-node c4.8xlarge EMR cluster with data in Amazon S3. We measured the performance improvement in two ways: as the geometric mean of the improvement in query execution time, and as the improvement in total query execution time across all queries. Between EMR 5.16 and EMR 5.28, we observed a 2.4x improvement in geometric mean and a 3.2x improvement in total query run time. For more details on the performance improvements and the impact on short and long running queries, see our AWS Big Data Blog post: Amazon EMR introduces EMR runtime for Apache Spark.
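
The two summary metrics can differ because they weight queries differently. The sketch below shows one way to compute both from per-query timings. The query names and timings are hypothetical placeholders, not benchmark results, and the function names are illustrative. It assumes the geometric mean is taken over per-query speedup ratios, which gives the same value as the ratio of the geometric means of the two sets of run times.

```python
import math


def geometric_mean_speedup(baseline, candidate):
    """Geometric mean of per-query speedups (baseline time / candidate time)."""
    ratios = [baseline[q] / candidate[q] for q in baseline]
    # Average the logs, then exponentiate, to avoid overflow on long products.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def total_runtime_speedup(baseline, candidate):
    """Ratio of summed run times across all queries."""
    return sum(baseline.values()) / sum(candidate[q] for q in baseline)


# Hypothetical timings in seconds, for illustration only.
baseline_secs = {"q1": 40.0, "q2": 90.0, "q3": 800.0}
candidate_secs = {"q1": 20.0, "q2": 45.0, "q3": 100.0}

gm = geometric_mean_speedup(baseline_secs, candidate_secs)
total = total_runtime_speedup(baseline_secs, candidate_secs)
print(f"geometric mean speedup: {gm:.2f}x")
print(f"total run time speedup: {total:.2f}x")
```

The geometric mean treats every query equally, so short queries count as much as long ones, while the total run time ratio is dominated by the longest-running queries. In the placeholder data, the large gain on the one long query lifts the total ratio above the geometric mean.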