Introducing optimized Spark 3.1 runtime for data integration with AWS Glue 3.0

Posted on: Aug 19, 2021

Today, we’re pleased to announce AWS Glue version 3.0, a new version of the AWS Glue Spark runtime for batch and streaming jobs that accelerates your data integration workloads in AWS. AWS Glue 3.0 introduces a performance-optimized Spark runtime that is based on open-source Apache Spark 3.1.1 and includes optimizations from AWS Glue and Amazon EMR.

The AWS Glue 3.0 runtime optimizes both read and write access to Amazon Simple Storage Service (Amazon S3), using faster vectorized readers and Amazon S3 optimized output committers. It also optimizes access to the AWS Glue Data Catalog through partition predicates. For highly partitioned datasets, AWS Glue 3.0 improves execution speed by using partition indexes to filter out unnecessary partitions. The AWS Glue 3.0 runtime is also fully integrated with AWS Lake Formation, so you can secure data access at different granularities, including database-, table-, column-, row-, and cell-level access control, using resource names and AWS Lake Formation tag-based access control.

AWS Glue 3.0 also brings new capabilities that improve the experience of monitoring, debugging, and tuning Spark applications. Spark 3.1.1 enables an improved Spark UI that includes new Spark executor memory metrics and Spark Structured Streaming metrics, which are useful for AWS Glue streaming jobs. Like AWS Glue 2.0, AWS Glue 3.0 reduces startup latency and improves overall job completion times.

AWS Glue 3.0 is available in every AWS Region where AWS Glue is available. To learn more about this feature, visit the blog and the AWS Glue User Guide.
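
As a rough illustration of selecting the new runtime, the sketch below builds the keyword arguments for a Glue job definition with `GlueVersion` set to `"3.0"`. It is a minimal sketch rather than an official example: the helper name `build_glue3_job_request`, the job name, the role ARN, the script location, and the worker settings are all hypothetical placeholders, and the commented-out `create_job` call assumes boto3 is installed and AWS credentials are configured.

```python
def build_glue3_job_request(name, role_arn, script_location, workers=2):
    """Build keyword arguments for a Glue job definition targeting Glue 3.0.

    Hypothetical helper for illustration; every value passed in is a placeholder.
    """
    return {
        "Name": name,
        "Role": role_arn,
        # Selects the Spark 3.1-based runtime described in this announcement.
        "GlueVersion": "3.0",
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # Worker settings are assumptions; size them for your workload.
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


if __name__ == "__main__":
    request = build_glue3_job_request(
        name="example-glue3-job",  # placeholder job name
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # placeholder
        script_location="s3://example-bucket/scripts/job.py",  # placeholder
    )
    print(request)
    # With boto3 installed and credentials configured, the request could be
    # submitted like this (not executed here):
    #   import boto3
    #   boto3.client("glue").create_job(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or version-control the job definition before anything is sent to AWS.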
