AWS Glue
Discover, prepare, and integrate all your data at any scale
Why AWS Glue?
Preparing your data to obtain quality results is the first step in an analytics or AI project. AWS Glue is a serverless service that makes data integration simpler, faster, and cheaper. You can discover and connect to more than 100 diverse data sources, manage your data in a centralized data catalog, and visually create, run, and monitor data pipelines to load data into your data lakes, data warehouses, and lakehouses. With built-in generative AI capabilities, you can modernize Apache Spark jobs and develop faster with intelligent assistance for ETL authoring and Spark troubleshooting.
Integrate your data with AWS Glue in the next generation of Amazon SageMaker
With AWS Glue in the next generation of Amazon SageMaker, you can manage and build your workloads in one place with cost-effective, serverless, and scalable data integration.

Benefits
Use Cases
Simplify ETL pipeline management
Interactively explore, experiment on, and process data
Discover data efficiently
Support various processing frameworks and workloads
What's New
AWS Glue Data Catalog usage metrics now available with Amazon CloudWatch
AWS Glue Data Catalog now offers usage metrics for APIs in Amazon CloudWatch, enabling you to monitor, troubleshoot, and optimize your API usage with greater visibility. The insights from these API usage metrics will help you better understand your lakehouse runtime API usage in production environments.
Customers seek better observability of their API usage to identify bottlenecks, detect anomalies, and understand usage patterns in their lakehouse architecture. With Data Catalog Usage Metrics in CloudWatch, you can track critical API usage performance indicators per minute, including reads, updates, and deletions of lakehouse resources such as catalogs, tables, partitions, connections, and statistics. You can set up CloudWatch alarms to receive notifications when metrics exceed specified thresholds, allowing proactive management of your lakehouse.
To get started, navigate to Metrics in the CloudWatch console and filter usage by AWS Glue resource. You can then graph the metrics and configure alarms that alert you when usage approaches specified thresholds.
This feature is available in all AWS Regions where Data Catalog is available. To learn more, read the launch blog and the Data Catalog documentation.
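As a rough illustration of the alarm setup described above, the following Python sketch assembles the arguments for the CloudWatch PutMetricAlarm API (`put_metric_alarm` in boto3). The `AWS/Usage` namespace, the `CallCount` metric name, the dimension values, the `GetTable` resource, the alarm name, and the threshold are all assumptions made for illustration and do not come from the announcement. Check the Metrics view in the CloudWatch console for the names actually published in your account.

```python
from typing import Any, Dict


def build_usage_alarm(resource: str, threshold: float, period_seconds: int = 60) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The namespace, metric name, and dimensions below are ASSUMED values,
    not taken from the announcement; verify them in the CloudWatch console.
    """
    return {
        "AlarmName": f"glue-data-catalog-{resource}-usage",  # hypothetical name
        "Namespace": "AWS/Usage",    # assumption
        "MetricName": "CallCount",   # assumption
        "Dimensions": [              # assumption
            {"Name": "Service", "Value": "Glue"},
            {"Name": "Type", "Value": "API"},
            {"Name": "Resource", "Value": resource},
            {"Name": "Class", "Value": "None"},
        ],
        "Statistic": "Sum",
        "Period": period_seconds,    # metrics are tracked per minute
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Alarm when more than 1000 calls are counted in a one-minute period.
alarm_args = build_usage_alarm("GetTable", threshold=1000)

# With boto3 installed and AWS credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_args)
```

Keeping the argument construction separate from the API call makes it easy to review or unit test the alarm definition before anything is created in an account.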
AWS Glue enables enhanced Apache Spark capabilities for AWS Lake Formation tables with full table access
AWS Glue now supports read and write operations from AWS Glue 5.0 Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table access. This capability enables Data Definition Language (DDL) and Data Manipulation Language (DML) operations, including CREATE, ALTER, DELETE, UPDATE, and MERGE INTO statements, on Apache Hive and Iceberg tables from within the same Apache Spark application.
While Lake Formation's fine-grained access control (FGAC) offers granular security controls at row, column, and cell levels, many ETL workloads simply need full table access. This new feature enables AWS Glue 5.0 Spark jobs to directly read and write data when full table access is granted, removing limitations that previously restricted certain Extract, Transform, and Load (ETL) operations. You can now leverage advanced Spark capabilities including Resilient Distributed Datasets (RDDs), custom libraries, and User Defined Functions (UDFs) with Lake Formation tables. Additionally, data teams can run complex, interactive Spark applications through SageMaker Unified Studio in compatibility mode while maintaining Lake Formation's table-level security boundaries.
This feature is available in all AWS Regions where AWS Glue and AWS Lake Formation are supported. To learn more, visit the AWS Glue product page and documentation.
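To make the MERGE INTO operation mentioned above more concrete, here is a minimal Python sketch that builds a Spark SQL upsert statement. The helper `build_merge_sql`, the table names `glue_catalog.sales.orders` and `order_updates`, and the column names are hypothetical and chosen only for illustration. Running the statement assumes a Spark session that is already configured for the target table, which is not shown here.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, key: str, columns: Sequence[str]) -> str:
    """Return a Spark SQL MERGE INTO statement that upserts `source` rows into `target`.

    Rows that match on `key` have the listed columns updated;
    rows with no match are inserted as-is.
    """
    assignments = ", ".join(f"t.{col} = s.{col}" for col in columns)
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {assignments} "
        f"WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and column names.
merge_sql = build_merge_sql(
    target="glue_catalog.sales.orders",
    source="order_updates",
    key="order_id",
    columns=["status", "amount"],
)

# Inside a Spark job with an existing SparkSession named `spark`:
#   spark.sql(merge_sql)
```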
Amazon S3 now supports sort and z-order compaction for Apache Iceberg tables
Amazon S3 now supports sort and z-order compaction for Apache Iceberg tables, available both in Amazon S3 Tables and general purpose S3 buckets using AWS Glue Data Catalog optimization. Sort compaction in Iceberg tables minimizes the number of data files scanned by query engines, leading to improved query performance and reduced costs. Z-order compaction provides additional performance benefits through efficient file pruning when querying across multiple columns simultaneously.
S3 Tables provide a fully managed experience where hierarchical sorting is automatically applied on columns during compaction when a sort order is defined in table metadata. When multiple query predicates need to be prioritized equally, you can enable z-order compaction through the S3 Tables maintenance API. If you are using Iceberg tables in general purpose S3 buckets, optimization can be enabled in the AWS Glue Data Catalog console, where you can specify your preferred compaction method.
These additional compaction capabilities are available in all AWS Regions where S3 Tables or optimization with the AWS Glue Data Catalog are available. To learn more, read the AWS News Blog, and visit the S3 Tables maintenance documentation and AWS Glue Data Catalog optimization documentation.
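The pruning benefit of z-ordering across multiple columns can be sketched with a toy example. The Python code below illustrates the general z-order (Morton) curve idea by interleaving the bits of two integer columns. It is not a description of how S3 Tables or AWS Glue Data Catalog compaction is implemented, and the column names `x` and `y`, the 4x4 grid of rows, and the four-rows-per-file layout are hypothetical.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


def files_touched(files, column: int, value: int):
    """Indexes of files holding at least one row where row[column] == value."""
    return [i for i, rows in enumerate(files) if any(r[column] == value for r in rows)]


def chunk(rows, size: int = 4):
    """Split sorted rows into fixed-size groups standing in for data files."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Hypothetical table: 16 rows identified by two integer columns (x, y).
rows = [(x, y) for x in range(4) for y in range(4)]

# Hierarchical sort: order by x, then y. Each file holds a single x value.
sorted_files = chunk(sorted(rows))

# Z-order: order by the interleaved key. Each file covers a 2x2 block of (x, y).
z_files = chunk(sorted(rows, key=lambda r: z_order_key(*r)))

# A filter on the second column alone must read every hierarchically sorted file,
# but only half of the z-ordered files.
print(files_touched(sorted_files, column=1, value=3))
print(files_touched(z_files, column=1, value=3))
```

In this toy layout the hierarchical sort still prunes best for filters on the leading column, while the z-ordered layout prunes equally well on either column, which mirrors the case where multiple query predicates are prioritized equally.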
AWS Glue now available in Asia Pacific (Taipei) Region
AWS Glue, a serverless data integration service, is now available in the Asia Pacific (Taipei) Region, enabling customers to build and run their ETL workloads closer to their data sources in this Region.
AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides both visual and code-based interfaces to make data integration simpler, so you can analyze your data and put it to use in minutes.
To learn more, visit the AWS Glue product page and our documentation. For AWS Glue region availability, please see the AWS Region table.
Amazon EMR enables enhanced Apache Spark capabilities for Lake Formation tables with full table access
Amazon EMR now supports read and write operations from Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table access. This capability enables Data Definition Language (DDL) and Data Manipulation Language (DML) operations, including CREATE, ALTER, DELETE, UPDATE, and MERGE INTO statements, on Apache Hive and Iceberg tables from within the same Apache Spark application.
While Lake Formation's fine-grained access control (FGAC) offers granular security controls at row, column, and cell levels, many ETL workloads simply need full table access. This new feature enables Apache Spark to directly read and write data when full table access is granted, removing FGAC limitations that previously restricted certain ETL operations. You can now leverage advanced Spark capabilities including RDDs, custom libraries, UDFs, and custom images (AMIs for EMR on EC2, custom images for EMR-Serverless) with Lake Formation tables. Additionally, data teams can run complex, interactive Spark applications through SageMaker Unified Studio in compatibility mode while maintaining Lake Formation's table-level security boundaries.
This feature is available in all AWS Regions where Amazon EMR and AWS Lake Formation are supported.
To learn more about this feature, visit the Lake Formation unfiltered access section in the EMR Serverless documentation.
AWS Glue Studio now supports additional file types and single file output
Today, AWS Glue Studio announces support for additional compressed file types, Excel files (as source), and XML and Tableau's Hyper files (as target). We are also introducing the option to select the number of output files for an S3 target. These enhancements will allow you to use visual ETL jobs for additional data processing workflows not supported today, for example loading data from an Excel file into a single XML file output.
The new experience lets you write a single file as the output of your Glue job, or specify a custom number of output files. Further, Glue now supports Excel files via S3 file source nodes, and XML or Tableau Hyper files for S3 file target nodes. The newly supported compression types are LZ4, SNAPPY, DEFLATE, LZO, BROTLI, ZSTD, and ZLIB.
These new features are now available in all AWS commercial Regions and AWS GovCloud (US) Regions where AWS Glue is available. Access the AWS Regional Services List for the most up-to-date availability information.
To learn more, visit the AWS Glue documentation.
Amazon SageMaker scheduling experience for Visual ETL and Query editors
Amazon SageMaker now offers a unified scheduling experience for visual ETL flows and queries. The next generation of Amazon SageMaker is the center for all your data, analytics, and AI, and includes SageMaker Unified Studio, a single data and AI development environment. Visual ETL in Amazon SageMaker provides a drag-and-drop interface for building ETL flows and authoring flows with Amazon Q. The query editor tool provides a place to write and run queries, view results, and share your work with your team. This new scheduling experience simplifies the scheduling process for Visual ETL and Query editor users.
With unified scheduling, you can now schedule your workloads with Amazon EventBridge Scheduler from the same visual interface you use to author your query or visual ETL flow. Previously, you needed to create a code-based workflow to run a single flow or query on a schedule. You can also view, modify, pause, or resume these schedules and monitor the runs they invoke.
This new feature is now available in all AWS Regions where Amazon SageMaker is available. Access the supported Region list for the most up-to-date availability information.
To learn more, visit our Amazon SageMaker Unified Studio documentation, blog post and Amazon EventBridge Scheduler pricing page.
Amazon Redshift adds history mode support to 8 third-party SaaS applications
Amazon Redshift now supports history mode for zero-ETL integrations with eight third-party applications including Salesforce, ServiceNow, and SAP. This addition complements existing history mode support for Amazon Aurora PostgreSQL-Compatible Edition, Amazon Aurora MySQL-Compatible Edition, Amazon DynamoDB, and Amazon RDS for MySQL databases. The expansion enables you to track historical data changes without Extract, Transform, and Load (ETL) processes, simplifying data management across AWS and third-party applications.
History Mode for zero-ETL integrations with third-party applications lets customers easily run advanced analytics on historical data from their applications, build comprehensive lookback reports, and perform trend analysis and data auditing across multiple zero-ETL data sources. This feature preserves the complete history of data changes without maintaining duplicate copies across various external data sources, allowing organizations to meet data retention requirements while significantly reducing storage needs and operational costs. Available for both existing and new integrations, history mode offers enhanced flexibility by allowing selective enabling of historical tracking for specific tables within third-party application integrations, giving businesses precise control over their data analysis and storage strategies.
To learn more about history mode for zero-ETL integrations in Amazon Redshift and how it can benefit your data analytics workflows, visit the history mode documentation. To learn more about the supported third-party applications, visit the AWS Glue documentation. To get started with zero-ETL integrations, visit the getting started guides for Amazon Redshift.
Get started with Glue