Posted On: Aug 27, 2019
With EMR release 5.26.0, Spark users benefit from all the new Spark performance optimizations introduced in EMR release 5.24.0 and 5.25.0 without the need to make any configuration or code changes. The following optimizations are enabled by default in the 5.26.0 release:
- Dynamic partition pruning - Allows the Spark engine to infer relevant partitions at runtime, saving time and compute resources both by reading less data from storage and by reducing the number of records that need to be processed.
- DISTINCT before INTERSECT - Eliminates duplicate values in each input collection prior to computing the intersection, which improves performance by reducing the amount of data shuffled between hosts.
- Flattening scalar subqueries - Helps in situations where multiple different conditions need to be applied to rows from a specific table, preventing the table from being read multiple times for each condition.
- Optimized join reorder - Reorders joins to execute smaller joins with filters first, reducing the processing required for larger subsequent joins.
- Bloom filter join - Filters table joins dynamically to include only relevant rows, reducing the amount of data processed by Spark and improving query runtime performance.
Please visit Optimizing Spark Performance documentation and the EMR 5.26.0 release notes for details on these optimizations.
Also included in EMR 5.26.0, is a Beta integration with AWS Lake Formation and new versions of Apache HBase 1.4.10, and Apache Phoenix 4.14.2. Please see Integrating Amazon EMR with AWS Lake Formation (Beta) for more details on the integration.
Amazon EMR release 5.26.0 is now available in all supported regions for Amazon EMR.
The integration between AWS Lake Formation and Amazon EMR is in Beta, and is available in the US East (N. Virginia), and US West (Oregon) regions.
You can stay up to date on EMR releases by subscribing to the feed for EMR release notes. Use the icon at the top of the EMR Release Guide to link the feed URL directly to your favorite feed reader.