Posted On: Jul 11, 2022
The Amazon EMR runtime for Apache Spark is a performance optimized runtime environment for Apache Spark, available and turned on by default on Amazon EMR clusters 5.28 onward. Amazon EMR runtime for Spark is up to 3x faster with 100% API compatibility with open source Spark.
Today, we are excited to announce Result Fragment Caching, a new feature available in Amazon EMR runtime for Apache Spark to help speed up queries that target a static subset of data. In our internal tests derived from the TPC-H benchmark, when using Result Fragment Caching, queries that employ rolling and incremental window functions consistently delivered query performance speedups of up to 15.6x.
Customers often have data in Amazon S3 that does not change over time. However, repeated queries over such datasets need to re-compute results in each run. For e.g. consider eCommerce orders data which contains information about orders that are delivered or returned within 30 days. Suppose a user wants to compare the daily trend of orders for a certain product for the entire year. All orders data is mostly unchanged, except the last 30 days of data, but a query aggregating order returns for every day of last 1 year would need to recompute the results for each day of the year every time the query runs. Result Fragment Caching addresses this by automatically caching results fragments i.e. all parts of the results from subtrees of Spark query plans and then reusing the results on subsequent query executions. Result Fragment Caching allows maximum reuse of the cached data as the fragments i.e. parts of the result can be reused in subsequent queries as opposed to Result Set caching where the query has to exactly match to return the results from cache. Results fragments are stored in a customer-designated S3 bucket and can be re-used across multiple clusters. This can speedup queries especially when the results are significantly smaller than the input data set. With Result Fragment Caching, repeated queries aggregating orders in the last 1 year would only need to compute the last 30 days of changed data while for all previous days of the year, aggregated results could be simply served from cache significantly improving query performance. The speedup could be even more pronounced for queries on data that is mostly immutable and is only inserted for the current day e.g. for IOT events, clickstream data etc. In the rare cases where data is modified, Result Fragment Caching automatically detects and invalidates the cached data ensuring data accuracy.
Result Fragment Caching is available in all regions where the Amazon EMR runtime for Apache Spark is available.
To get started with Result Fragment Caching, see our documentation here.