Posted On: Nov 30, 2022

AWS Glue announces the preview of AWS Glue Data Quality, a new capability that automatically measures and monitors data lake and data pipeline quality. AWS Glue is a serverless, scalable data integration service that makes it more efficient to discover, prepare, move, and integrate data from multiple sources. Managing data quality today is a manual, time-consuming process: analysts must analyze the data, write data quality rules, validate the data against those rules on a recurring basis, and write code both to implement the rules and to set up alerts when quality deteriorates.

AWS Glue Data Quality automatically analyzes your data to gather data statistics and then recommends data quality rules to get you started. You can update the recommended rules or add new ones from the provided set of data quality rules, and you can configure actions to alert users if data quality deteriorates. Data quality rules and actions can also be configured on AWS Glue extract, transform, and load (ETL) jobs in data pipelines, where these rules can prevent “bad” data from entering data lakes and data warehouses. AWS Glue is serverless, so there is no infrastructure to manage, and AWS Glue Data Quality evaluates rules with open-source Deequ, which AWS uses to measure and monitor the data quality of petabyte-scale data lakes.
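
To make the pipeline use case concrete, the following is a minimal sketch of how a data quality check might be embedded in a Glue ETL script. The EvaluateDataQuality transform, the awsgluedq.transforms module path, the DQDL-style rule syntax, and the database/table names are assumptions for illustration only; consult the AWS Glue Data Quality documentation for the exact API.

```python
# Sketch: evaluate data quality rules inside an AWS Glue ETL job (assumed API).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsgluedq.transforms import EvaluateDataQuality  # assumed module/transform
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical source table registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Rules written in a DQDL-style syntax; start from the recommended rules
# and adjust or add rules as needed.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "order_total" >= 0
]
"""

# Evaluate the ruleset and publish results so deteriorating quality can trigger alerts.
dq_results = EvaluateDataQuality.apply(
    frame=orders,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_dq_check",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

# The result frame contains one row per rule with its pass/fail outcome;
# inspect it (or fail the job) before loading data downstream.
dq_results.toDF().show(truncate=False)

job.commit()
```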

AWS Glue Data Quality is available in preview in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).

To learn more, review the AWS Glue Data Quality documentation for data quality on data at rest and for data quality in data pipelines.