Run queries up to 9x faster using Trino with Amazon S3 Select on Amazon EMR
Customers building data lakes continue to innovate in the ways that they store and access their data. For these customers, performance is critical, particularly when they are accessing large amounts of data. For example, data scientists, data analysts, and data engineers running queries from open source frameworks like Trino want to accelerate access to their data, which lets them spend more time analysing their data and generating insights.
Amazon S3 Select is an Amazon Simple Storage Service (Amazon S3) feature that makes it easy to retrieve specific data from an object using simple SQL expressions without having to retrieve the entire object. Trino is an open source SQL query engine that can be used to run interactive analytics on data stored in Amazon S3. By using Trino with S3 Select, you retrieve only a subset of data from an object, reducing the amount of data returned from Amazon S3 and accelerating query performance.
On November 21, 2022, AWS announced its upstream contributions to open source Trino, which improves query performance when accessing CSV and JSON data formats. Now, you can use the S3 Select Pushdown performance enhancements to reduce the amount of data that must be transferred and processed by Trino. S3 Select Pushdown enables Trino to “push down” the computational work of projection operations (for example, selecting a subset of columns) and predicate operations (for example, applying filtering in a WHERE clause ) to Amazon S3. This allows queries to retrieve only the required data from Amazon S3.
In this post, we discuss the performance benchmarks on Trino release 397 with S3 Select using TPC-DS-like benchmark queries at 3 TB scale. We show that queries run up to 9x faster as a result of pushing down the computational load of scanning and filtering data to Amazon S3 when compared to using Trino without S3 Select.
How to configure the Trino Hive connector
You enable S3 Select Pushdown using the
s3_select_pushdown_enabled Hive session property, or using the
hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, which lets you enable or disable S3 Select Pushdown on a per-query basis.
Results observed using TPC-DS-like benchmarks
To evaluate the performance improvements on Trino with S3 Select, we ran all 99 TPC-DS-like benchmark queries at 3 TB scale on a 33-node r5.16xlarge EMR v 6.8.0 cluster patched with Trino release 397 and all of the data stored on Amazon S3.
We ran all queries successfully multiple times with and without S3 Select. To ensure consistent results, we monitored the standard deviations for each query run. The average deviation in query runtimes across all 99 queries was 0.70 seconds with S3 Select and 0.99 seconds without S3 Select.
First, we’ll compare the total aggregate runtime and the total aggregate amount of data processed for all 99 queries in the TPC-DS-like 3 TB CSV (uncompressed) dataset. The first graph shows the total aggregate runtime in seconds with and without S3 Select:
The next graph shows the total aggregate data (in terabytes) processed with and without S3 Select:
In our tests, we found that S3 Select sped up all 99 queries. The maximum query acceleration with S3 Select was 9.2x, the minimum query acceleration with S3 Select was 1.1x, and the average query acceleration was 2.5x. The following graph shows the query speedup for each of the 99 queries:
In our tests, we found that S3 Select reduced the amount of bytes processed by Trino for all 99 queries. For example, we saw a reduction of 17 TB (99%) of processed data with S3 Select on Query 9. During testing, the average reduction in the amount of processed data per query with S3 Select was 2 TB, and the total reduction in processed data across all 99 queries was 200 TB (21x better) with S3 Select.
The following graph shows the reduction of data processed for each of the 99 queries with S3 Select:
The performance results provided in this post required no tuning to Amazon EMR or Trino, and all of the results are from default configurations. The following is the default Trino configuration used with our EMR cluster.
/etc/trino/conf/config.properties: coordinator=true node-scheduler.include-coordinator=false http-server.threads.max=500 discovery-server.enabled=true sink.max-buffer-size=1GB query.max-memory=6606029MB query.max-memory-per-node=214061170033B query.max-history=40 query.min-expire-age=30m query.client.timeout=30m query.stage-count-warning-threshold=100 query.max-stage-count=150 http-server.http.port=8889 http-server.log.path=/var/log/trino/http-request.log http-server.log.max-size=67108864B http-server.log.max-history=5 log.max-size=268435456B log.max-history=5 jmx.rmiregistry.port=9080 jmx.rmiserver.port=9081
Next, let’s look at the enhancements that we made to Trino that contributed to these results.
Performance enhancements to the Trino Hive connector
Our contributions to Trino improve how Trino sends requests to Amazon S3 by enhancing its use of S3 Select. There are two contributing factors that accelerate the query runtime when S3 Select is used. First, S3 Select reduces the number of bytes transferred between Amazon S3 and Trino by pushing down the filtering to Amazon S3. With S3 Select, Trino retrieves a pre-filtered subset of Amazon S3 data because filtering and projection is performed by S3 Select. Second, using S3 Select to push down the computation work of filtering to Amazon S3 increases Trino’s ability to parallelise projection and predicate operations.
In this post, we presented our results from running our TPC-DS-like 3TB scale benchmark. With the S3 Select Pushdown performance optimizations available in Trino release 397 and later, you can run queries faster than before by using Trino with S3 Select to “pushdown” the computational work of projection and predicate operations to Amazon S3. Our benchmark testing demonstrated up to a 9x performance improvement in query runtime (2.5x on average), and a 21x overall reduction in the total data processed.
If you have a data lake built on Amazon S3 and use Trino today, then you can use S3 Select’s filtering capability to quickly and easily run interactive ad-hoc queries. For more information, see the Trino release notes to learn about the enhancements to the Hive connector.
By default, S3 Select Pushdown is disabled in the Trino Hive connector. Its performance depends on the amount of data filtered by the query. For example, by filtering a large number of rows, you will achieve better performance. We recommend that you benchmark your workloads with and without S3 Select to see if using it may be suitable for your specific use case. For more information on S3 Select pricing, please visit the Amazon S3 pricing page.