AWS Glue Data Quality

Deliver high-quality data across your data lakes and pipelines

Data lakes may become data swamps without proper oversight. Setting up data quality checks is time-consuming, tedious, and error-prone: you must manually create data quality rules, write code to monitor data pipelines, and alert data consumers when data quality deteriorates. AWS Glue Data Quality reduces this manual effort from days to hours. It automatically computes statistics, recommends quality rules, monitors your data, and alerts you when it detects issues. For hidden and hard-to-find issues, AWS Glue Data Quality uses ML algorithms. The combined power of rule-based and ML approaches, together with a serverless, scalable, and open solution, enables you to deliver high-quality data and make confident business decisions.


Features of AWS Glue Data Quality

AWS Glue is serverless, so you can scale without having to manage infrastructure. It scales to any data size, and its pay-as-you-go billing increases agility and reduces costs. AWS Glue Data Quality is built on Deequ, an open-source framework developed by Amazon to manage petabyte-scale datasets. Because it is built on open source, AWS Glue Data Quality provides flexibility and portability without lock-in.
AWS Glue Data Quality automatically computes statistics for your datasets. It uses these statistics to recommend a set of quality rules that check for freshness, accuracy, integrity, and even hard-to-find issues. You can adjust the recommended rules, discard rules, or add new ones as needed. If AWS Glue Data Quality detects quality issues, it also alerts you so that you can act on them.
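As a rough sketch, the following shows how a rule recommendation run could be started with the AWS SDK for Python (boto3); the database, table, and IAM role names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Start a rule recommendation run against a Data Catalog table
# (database, table, and role names below are placeholders).
run = glue.start_data_quality_rule_recommendation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "sales_db",
            "TableName": "orders",
        }
    },
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
)

# Once the run completes, the response includes the recommended
# rules as a DQDL string that you can edit and save as a ruleset.
status = glue.get_data_quality_rule_recommendation_run(RunId=run["RunId"])
print(status.get("RecommendedRuleset", "run still in progress"))
```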
AWS Glue Data Quality is intelligent. Using ML algorithms, it learns patterns from data statistics gathered over time, detects anomalies and unusual data patterns, and alerts users. It also auto-creates rules to monitor these specific patterns so that you can progressively build your data quality rules.
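As an illustrative sketch (the column names and analyzer choices are assumptions), a DQDL ruleset can pair a DetectAnomalies rule, which flags statistics that deviate from their learned trend, with analyzers that gather statistics over time without enforcing a pass/fail check:

```
Rules = [
    DetectAnomalies "RowCount"
]

Analyzers = [
    Completeness "order_id",
    DistinctValuesCount "status"
]
```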
Your data rests in different repositories and moves from one repository to another, so it is important to monitor data quality both once data lands and while it is in transit. AWS Glue Data Quality rules can be applied to data at rest in your datasets and data lakes, and to entire data pipelines where data is in motion; you can also apply rules across multiple datasets. For data pipelines built in AWS Glue Studio, you can apply a transform that evaluates quality for the entire pipeline at a fraction of the cost, because the data is already in memory. You can also define rules that stop the pipeline if quality deteriorates, preventing bad data from landing in your data lakes.
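A minimal sketch of such an in-pipeline check in a Glue ETL script is shown below, using the EvaluateDataQuality transform; the table names, the ruleset, and the failure-handling logic are illustrative assumptions, not a definitive implementation:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the incoming data (database and table names are placeholders).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders_staging"
)

# DQDL ruleset evaluated against the in-memory DynamicFrame.
ruleset = """Rules = [
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99
]"""

results = EvaluateDataQuality.apply(
    frame=orders,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_staging_check",
        "enableDataQualityResultsPublishing": True,
    },
)

# Each row of the result frame describes one rule outcome. Stop the
# job before writing if any rule failed, so bad data never lands.
if results.filter(lambda row: row["Outcome"] == "Failed").count() > 0:
    raise Exception("Data quality checks failed; stopping the pipeline")
```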
Use more than 25 out-of-the-box AWS Glue Data Quality rules to validate your data and identify the specific records that cause issues. Implement data quality checks that compare datasets across disparate data sources in minutes. Using AWS Glue ETL, you can then remediate these issues and ingest high-quality data into your data repositories.
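For example, a single DQDL ruleset might combine several of these built-in rule types, including a cross-dataset ReferentialIntegrity check; the column names and the "reference" alias here are illustrative, and the reference dataset is supplied to the run separately:

```
Rules = [
    RowCount > 1000,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"],
    ReferentialIntegrity "customer_id" "reference.customer_id" = 1.0
]
```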