Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

Trino is an open source distributed SQL query engine designed for interactive analytic workloads. On AWS, you can run Trino on Amazon EMR, where you have the flexibility to run your preferred version of open source Trino on Amazon Elastic Compute Cloud (Amazon EC2) instances that you manage, or on Amazon Athena for a serverless experience. When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS developed optimizations.

Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino. In this post, we compare Amazon EMR 6.15.0 with open source Trino 426 and show that TPC-DS queries ran up to 2.7 times faster on Amazon EMR 6.15.0 Trino 426 compared to open source Trino 426. Later, we explain a few of the AWS-developed performance optimizations that contribute to these results.

Benchmark setup

In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format and metadata for databases and tables is stored in the AWS Glue Data Catalog. This benchmark uses unmodified TPC-DS data schema and table relationships. Fact tables are partitioned on the date column and contained 200-2100 partitions. Table and column statistics were not present for any of the tables. We used TPC-DS queries from the open source Trino Github repository without modification. Benchmark queries were run sequentially on two different Amazon EMR 6.15.0 clusters: one with Amazon EMR Trino 426 and the other with open source Trino 426. Both clusters used 1 r5.4xlarge coordinator and 20 r5.4xlarge worker instances.

Results observed

Our benchmarks show consistently better performance with Trino on Amazon EMR 6.15.0 compared to open source Trino. The total query runtime of Trino on Amazon EMR was 2.7 times faster compared to open source. The following graph shows performance improvements measured by the total query runtime (in seconds) for the benchmark queries.

Many of the TPC-DS queries demonstrated performance gains over five times faster compared to open source Trino. Some queries showed even greater performance, like query 72 which improved by 160 times. The following graph shows the top 10 TPC-DS queries with the largest improvement in runtime. For succinct representation and to avoid skewness of performance improvements in the graph, we’ve excluded q72.

Performance enhancements

Now that we understand the performance gains with Trino on Amazon EMR, let’s delve deeper into some of the key innovations developed by AWS engineering that contribute to these improvements.

Choosing a better join order and join type is critical to better query performance because it can affect how much data is read from a particular table, how much data is transferred to the intermediate stages through the network, and how much memory is needed to build up a hash table to facilitate a join. Join order and join algorithm decisions are typically a function performed by cost-based optimizers, which uses statistics to improve query plans by deciding how tables and subqueries are joined.

However, table statistics are often not available, out of date, or too expensive to collect on large tables. When statistics aren’t available, Amazon EMR and Athena use S3 file metadata to optimize query plans. S3 file metadata is used to infer small subqueries and tables in the query while determining the join order or join type. For example, consider the following query:

SELECT ss_promo_sk FROM store_sales ss, store_returns sr, call_center cc WHERE 
ss.ss_cdemo_sk = sr.sr_cdemo_sk AND ss.ss_customer_sk = cc.cc_call_center_sk 
AND cc_sq_ft > 0

The syntactical join order is store_sales joins store_returns joins call_center. With the Amazon EMR join type and order selection optimization rules, optimal join order is determined even if these tables don’t have statistics. For the preceding query if call_center is considered a small table after estimating the approximate size through S3 file metadata, EMR’s join optimization rules will join store_sales with call_center first and convert the join to a broadcast join, speeding-up the query and reducing memory consumption. Join reordering minimizes the intermediate result size, which helps to further reduce the overall query runtime.

With Amazon EMR 6.10.0 and later, S3 file metadata-based join optimizations are turned on by default. If you are using Amazon EMR 6.8.0 or 6.9.0, you can turn on these optimizations by setting the session properties from Trino clients or adding the following properties to the trino-config classification when creating your cluster. Refer to Configure applications for details on how to override the default configurations for an application.

Configuration for Join type selection:

session property: rule_based_join_type_selection=true
config property: rule-based-join-type-selection=true

Configuration for Join reorder:

session property: rule_based_join_reorder=true
config property: rule-based-join-reorder=true

Conclusion

With Amazon EMR 6.8.0 and later, you can run queries on Trino significantly faster than open source Trino. As shown in this blog post, our TPC-DS benchmark showed a 2.7 times improvement in total query runtime with Trino on Amazon EMR 6.15.0. The optimizations discussed in this post, and many others, are also available when running Trino queries on Athena where similar performance improvements are observed. To learn more, refer to the Run queries 3x faster with up to 70% cost savings on the latest Amazon Athena engine.

In our mission to innovate on behalf of customers, Amazon EMR and Athena frequently release performance and reliability enhancements on their latest versions. Check the Amazon EMR and Amazon Athena release pages to learn about new features and enhancements.

About the Authors

Bhargavi Sagi is a Software Development Engineer on Amazon Athena. She joined AWS in 2020 and has been working on different areas of Amazon EMR and Athena engine V3, including engine upgrade, engine reliability, and engine performance.

Sushil Kumar Shivashankar is the Engineering Manager for EMR Trino and Athena Query Engine team. He has been focusing in the big data analytics space since 2014.