Achieve 3x better Spark performance with EMR 5.25.0

Posted on: Aug 1, 2019

You can now use Spark 2.4.3, Presto 0.220, Apache Hive 2.3.5, and Apache Tez 0.9.2 on Amazon EMR release 5.25.0.

This release also includes two new optimizations, Bloom Filter Join and Optimized Join Reorder, that improve Spark performance by up to 3x* over EMR 5.24.0.

  • Bloom Filter Join filters table joins dynamically to include only relevant rows. This reduces the amount of data Spark processes, which improves query runtime.
  • Optimized Join Reorder dynamically reorders joins to execute smaller joins with filters first, reducing the processing required for larger subsequent joins.
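To make the Bloom Filter Join idea concrete, here is a minimal pure-Python sketch of the general technique: build a Bloom filter from the join keys of the smaller table, use it to discard rows from the larger table that cannot match, then run the exact join on what remains. This is an illustration only and not EMR's implementation; the `BloomFilter` class, the `bloom_filter_join` helper, and the filter sizes are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_filter_join(small_rows, large_rows, key):
    """Inner-join two lists of dicts on `key`, pre-filtering the large side.

    Returns the joined rows and the number of large-side rows that
    survived the Bloom filter (hypothetical helper, for illustration).
    """
    bloom = BloomFilter()
    lookup = {}
    for row in small_rows:
        bloom.add(row[key])
        lookup.setdefault(row[key], []).append(row)

    # Cheap pre-filter: drop large-side rows whose key cannot match.
    candidates = [row for row in large_rows if bloom.might_contain(row[key])]

    # Exact join on the survivors; this also removes Bloom false positives.
    joined = [
        {**left, **right}
        for right in candidates
        for left in lookup.get(right[key], [])
    ]
    return joined, len(candidates)
```

In a distributed engine the benefit comes from applying the filter before rows are shuffled, so far fewer rows from the large table reach the join stage. The exact join afterwards is still required because a Bloom filter can let a small number of non-matching rows through.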

Please refer to our EMR Spark Performance documentation and EMR 5.25.0 release notes for details on enabling these optimizations. 
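As a rough sketch of what enabling the optimizations can look like, the snippet below builds `--conf` arguments for `spark-submit`. The two property names are assumptions based on the EMR Spark performance documentation and are not stated in this announcement, so confirm them against the EMR 5.25.0 release notes before relying on them. The `spark_submit_args` helper is hypothetical.

```python
# Assumed EMR-specific Spark properties; verify the exact names and
# defaults in the EMR 5.25.0 release notes before use.
OPTIMIZATION_FLAGS = {
    "spark.sql.bloomFilterJoin.enabled": "true",
    "spark.sql.optimizer.sizeBasedJoinReorder.enabled": "true",
}


def spark_submit_args(flags):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for name, value in sorted(flags.items()):
        args += ["--conf", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Prints the flags to append to a spark-submit command line.
    print(" ".join(spark_submit_args(OPTIMIZATION_FLAGS)))
```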

Additionally, we have updated the default Spark configuration for memory-optimized R4 instances to achieve better CPU and memory utilization. This update improves Spark runtime performance by 1.5x*.

Amazon EMR release 5.25.0 is now available in all supported regions for Amazon EMR.

You can stay up to date on EMR releases by subscribing to the feed for EMR release notes. Use the icon at the top of the EMR Release Guide to link the feed URL directly to your favorite feed reader. 

*Based on 3TB TPC-DS benchmark comparing EMR 5.24.0 with EMR 5.25.0.