Posted On: Nov 21, 2022

Today, Amazon EMR announced support for long-running, fault-tolerant SQL queries on the Trino engine (Project Tardigrade), with checkpointing in Amazon S3 or HDFS. Project Tardigrade aims to improve the user experience of long-running, resource-intensive queries on Trino when it is used for ETL-style workloads. Project Tardigrade uses Amazon S3 for checkpointing buffered intermediate data. With the Amazon EMR 6.9 release, we are also adding checkpointing on HDFS for performance-sensitive, long-running SQL workloads.

Long-running ETL workloads can be challenging to run reliably and cost-effectively on Trino. A failed query must restart from scratch, which wastes cluster resources, and the lack of an iterative query capability can increase costs on large clusters. Project Tardigrade introduced a new fault-tolerant execution mechanism that enables Trino clusters to mitigate query failures by retrying them using the intermediate exchange data collected on Amazon S3. The Amazon EMR team extended this capability to checkpoint in HDFS to further improve the performance of these Trino queries. With support for fault-tolerant, long-running queries, Amazon EMR users can now run ETL workflows reliably while also benefiting from the performance gains and cost savings of iterative task runs. You can enable fault tolerance on Amazon EMR Trino clusters by using the Trino configuration classification on the Amazon EMR console, through the CLI, or through the API.
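As a rough illustration of what enabling fault tolerance through a configuration classification might look like, the Python sketch below builds a configuration list of the kind Amazon EMR accepts. It rests on assumptions that are not stated in this announcement: that the `trino-config` classification carries Trino's `retry-policy` property, that the exchange storage is set through a classification named `trino-exchange-manager`, and that the S3 location is given with `exchange.base-directories`. The bucket name and the helper function are hypothetical. Check the Amazon EMR documentation for the exact classification names and properties in your release.

```python
import json

# Retry policies that turn on Trino fault-tolerant execution.
VALID_RETRY_POLICIES = {"QUERY", "TASK"}


def build_fault_tolerance_config(retry_policy, exchange_base_dir):
    """Return an EMR-style configuration list for Trino fault tolerance.

    Hypothetical helper: the classification names and property keys
    below are assumptions to verify against the EMR documentation.
    """
    if retry_policy not in VALID_RETRY_POLICIES:
        raise ValueError(
            "retry_policy must be one of %s" % sorted(VALID_RETRY_POLICIES)
        )
    return [
        {
            # Assumed to map to Trino's config.properties.
            "Classification": "trino-config",
            "Properties": {"retry-policy": retry_policy},
        },
        {
            # Assumed to map to Trino's exchange-manager.properties.
            "Classification": "trino-exchange-manager",
            "Properties": {"exchange.base-directories": exchange_base_dir},
        },
    ]


if __name__ == "__main__":
    # Placeholder bucket; replace with a location your cluster can write to.
    config = build_fault_tolerance_config(
        "TASK", "s3://amzn-s3-demo-bucket/trino-exchange"
    )
    print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and supplied when the cluster is created, for example through the `--configurations` option of `aws emr create-cluster`. Checkpointing to HDFS instead of S3 would use different exchange properties, which this sketch does not attempt to guess.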

You can use this capability in all AWS Regions where Amazon EMR Trino is available. To learn more about this feature, please refer to our documentation.