AWS Glue Data Quality (Preview)

Deliver high-quality data across your data lakes and pipelines

Hundreds of thousands of customers build data lakes, which can become data swamps without data quality. Setting up data quality is a time-consuming, tedious process: you must manually analyze your data, create data quality rules, and write code to alert you when quality deteriorates. AWS Glue Data Quality reduces this manual effort from days to hours. It automatically computes statistics, recommends quality rules, monitors your data, and alerts you when it detects that quality has deteriorated, so you can identify missing, stale, or bad data before it impacts your business.

Key features

Automatic rule recommendations customized to your data

Getting started with data quality can be difficult because you must manually analyze data to create quality rules. AWS Glue Data Quality automatically computes statistics for your datasets and uses them to recommend a set of quality rules that check for freshness, accuracy, and integrity. You can adjust recommended rules, discard them, or add new rules as needed. If it detects quality issues, AWS Glue Data Quality also alerts you so that you can act.
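
To make the idea of statistics-driven recommendations concrete, here is a minimal sketch in plain Python. It is not the AWS Glue Data Quality API or its actual recommendation logic: the function names (`profile`, `recommend_rules`), the rule labels (`is_complete`, `is_unique`), and the sample rows are all hypothetical, chosen only to show how observed column statistics can be turned into candidate rules.

```python
def profile(rows, column):
    """Compute simple statistics for one column (hypothetical helper)."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    # Fraction of rows where the column has a value.
    completeness = len(non_null) / len(values) if values else 0.0
    # Fraction of non-null values that are distinct.
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    return {"completeness": completeness, "uniqueness": uniqueness}


def recommend_rules(rows, columns):
    """Suggest candidate rules based on what the data currently looks like."""
    rules = []
    for column in columns:
        stats = profile(rows, column)
        if stats["completeness"] == 1.0:
            rules.append(("is_complete", column))
        if stats["completeness"] > 0 and stats["uniqueness"] == 1.0:
            rules.append(("is_unique", column))
    return rules


# Illustrative sample data, not taken from any real dataset.
rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 3, "email": "a@example.com"},
]
recommended = recommend_rules(rows, ["order_id", "email"])
print(recommended)
```

In this sketch only `order_id` earns rules, because `email` has a missing value and a duplicate. A reviewer would then adjust, discard, or extend the suggestions, as described above.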

Achieve data quality at rest and in pipelines

Your data rests in different repositories, and it moves from one repository to another. It is important to monitor data quality both after the data lands and while it is in transit. You can apply AWS Glue Data Quality rules to data at rest in your datasets and data lakes, and to entire data pipelines where data is in motion. For data pipelines built on AWS Glue Studio, you can apply a transform that evaluates quality for the entire pipeline. You can also define rules that stop the pipeline if quality deteriorates, preventing bad data from landing in your data lakes.
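
The "stop the pipeline if quality deteriorates" pattern can be sketched as a quality gate placed before the write step. The example below is plain Python under stated assumptions, not the AWS Glue Studio transform: `DataQualityError`, `quality_gate`, `run_pipeline`, the `customer_id` column, and the 0.9 threshold are all hypothetical names and values used for illustration.

```python
class DataQualityError(Exception):
    """Raised when a batch fails a quality rule (hypothetical name)."""


def completeness(rows, column):
    """Fraction of rows where the column is present and not None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is not None) / len(rows)


def quality_gate(rows, column, min_completeness):
    """Return the batch unchanged if it passes; otherwise stop the pipeline."""
    score = completeness(rows, column)
    if score < min_completeness:
        raise DataQualityError(
            f"{column}: completeness {score:.2f} is below {min_completeness:.2f}"
        )
    return rows


def run_pipeline(rows, sink):
    """Check the batch first, and write to the sink only if it passes."""
    checked = quality_gate(rows, "customer_id", 0.9)
    sink.extend(checked)


# Illustrative batches; a list stands in for the data lake.
good = [{"customer_id": i} for i in range(10)]
bad = [{"customer_id": None}, {"customer_id": 7}]
lake = []

run_pipeline(good, lake)
try:
    run_pipeline(bad, lake)
    stopped = False
except DataQualityError as err:
    stopped = True
    print(f"Pipeline stopped: {err}")
```

Because the gate runs before the write, the failing batch never reaches `lake`; only the ten rows from the passing batch land there.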

Serverless, cost-effective, petabyte-scale data quality without lock-in

AWS Glue is serverless, so you can scale without having to manage infrastructure. It scales to any data size, and its pay-as-you-go billing increases agility and helps lower costs. AWS Glue Data Quality uses Deequ, an open-source framework built by Amazon and used to manage petabyte-scale datasets. Because it is built on open source, AWS Glue Data Quality provides flexibility and portability without lock-in.