Posted On: Jun 6, 2023

AWS announces general availability of AWS Glue Data Quality, a capability that automatically measures and monitors data lake and data pipeline quality. AWS Glue is a serverless, scalable data integration and ETL (extract, transform, and load) service that makes it easier to discover, prepare, move, and integrate data from multiple sources.

AWS Glue Data Quality helps reduce the need for manual data quality work by automatically analyzing your data to gather data statistics. It uses the open-source Deequ library to evaluate rules and to measure and monitor the data quality of petabyte-scale data lakes, and then recommends data quality rules to get you started. You can update the recommended rules or add new ones. If data quality deteriorates, you can configure actions that alert users and help you drill down into the issue’s root cause. Data quality rules and actions can also be configured on AWS Glue data pipelines, helping prevent “bad” data from entering data lakes and data warehouses.
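As an illustration of this flow, the following minimal boto3 sketch requests rule recommendations for a cataloged table and saves the recommended DQDL rules as a named ruleset. The database name, table name, IAM role ARN, and ruleset name are placeholder values, and error handling is omitted.

```python
import time

import boto3

glue = boto3.client("glue")

# Ask AWS Glue Data Quality to profile a cataloged table and recommend rules.
recommendation = glue.start_data_quality_rule_recommendation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
)

# Poll until the recommendation run reaches a terminal state.
while True:
    run = glue.get_data_quality_rule_recommendation_run(RunId=recommendation["RunId"])
    if run["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

# Store the recommended rules (optionally edited or extended first)
# as a ruleset attached to the table.
if run["Status"] == "SUCCEEDED":
    glue.create_data_quality_ruleset(
        Name="orders-ruleset",
        Ruleset=run["RecommendedRuleset"],
        TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
    )
```

The saved ruleset can then be evaluated on a schedule or within a Glue pipeline, and its results used to trigger the alerting actions described above.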

With general availability, we have launched new features to identify the specific records that failed data quality checks and added new rules that validate data consistency across different datasets. You can now validate the data quality of Amazon Redshift, Apache Iceberg, Apache Hudi, and Delta Lake datasets that are cataloged in the AWS Glue Data Catalog. AWS Glue Data Quality results are now published to Amazon EventBridge, simplifying how users are alerted and integrating data quality results with other applications. These features help you perform robust data quality checks across various datasets and identify issues for correction.
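For example, a minimal boto3 sketch of routing those EventBridge events to a notification target might look like the following. The event source and detail-type strings, the rule name, and the SNS topic ARN are assumptions to verify against the AWS Glue Data Quality documentation for your setup.

```python
import json

import boto3

events = boto3.client("events")

# Create an EventBridge rule that matches Glue Data Quality evaluation results.
# Verify the source and detail-type values against the current documentation.
events.put_rule(
    Name="glue-data-quality-results",
    EventPattern=json.dumps({
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }),
    State="ENABLED",
)

# Send matching events to an SNS topic (placeholder ARN) for alerting.
events.put_targets(
    Rule="glue-data-quality-results",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:dq-alerts"}],
)
```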

AWS Glue Data Quality is generally available in all AWS Regions where AWS Glue is available.

To learn more, visit AWS Glue Data Quality.