AWS Big Data Blog
AWS Glue Data Quality is Generally Available
We are excited to announce the General Availability of AWS Glue Data Quality.
Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. To make confident business decisions, the underlying data needs to be accurate and recent; otherwise, data consumers lose trust in the data and make suboptimal or incorrect decisions. For example, medical researchers found that across 79,000 pediatric emergency department encounters at a hospital, incorrect or missing patient weight measurements led to medication dosing errors in 34% of cases. A data quality check that identifies missing patient weight measurements, or one that ensures patients' weights trend within certain thresholds, would have alerted the respective teams to these discrepancies.
For our customers, setting up these data quality checks is manual, time-consuming, and error-prone. It takes days for data engineers to identify and implement data quality rules. They have to gather detailed data statistics, such as minimums, maximums, averages, and correlations; review those statistics to identify data quality rules; and write code to implement the checks in their data pipelines. Data engineers must then write code to monitor the pipelines, visualize quality scores, and alert them when anomalies occur. They have to repeat these processes across thousands of datasets and the hundreds of data pipelines populating them. Some customers adopt commercial data quality solutions; however, those solutions require time-consuming infrastructure management and are expensive. Our customers needed a simple, cost-effective, and automatic way to manage data quality.
In this post, we discuss the capabilities and features of AWS Glue Data Quality.
Capabilities of AWS Glue Data Quality
AWS Glue Data Quality accelerates your data quality journey with the following key capabilities:
- Serverless – AWS Glue Data Quality is a feature of AWS Glue, which eliminates the need for infrastructure management, patching, and maintenance.
- Reduced manual effort with recommended and out-of-the-box data quality rules – AWS Glue Data Quality computes data statistics such as minimums, maximums, histograms, and correlations for datasets. It then uses these statistics to automatically recommend data quality rules that check for data freshness, accuracy, and integrity, reducing manual data analysis and rule identification efforts from days to hours. You can then augment the recommendations with out-of-the-box data quality rules. The following table lists the rules supported by AWS Glue Data Quality as of this writing; for an up-to-date list, refer to Data Quality Definition Language (DQDL). A sketch of starting a recommendation run programmatically follows the table.
| Rule type | Description |
| --- | --- |
| AggregateMatch | Checks whether two datasets match by comparing summary metrics, such as total sales amount. Useful for confirming that all data has been ingested from source systems. |
| ColumnCorrelation | Checks how well two columns are correlated. |
| ColumnCount | Checks whether any columns have been dropped. |
| ColumnDataType | Checks whether a column is compliant with a data type. |
| ColumnExists | Checks whether columns exist in a dataset. This allows customers building self-service data platforms to ensure certain columns are made available. |
| ColumnLength | Checks whether the length of data is consistent. |
| ColumnNamesMatchPattern | Checks whether column names match defined patterns. Useful for governance teams to enforce column name consistency. |
| ColumnValues | Checks whether data is consistent with defined values. This rule supports regular expressions. |
| Completeness | Checks for blank or NULL values in data. |
| CustomSql | Lets customers implement almost any type of data quality check in SQL. |
| DataFreshness | Checks whether data is fresh. |
| DatasetMatch | Compares two datasets and identifies whether they are in sync. |
| DistinctValuesCount | Checks for duplicate values. |
| Entropy | Checks the entropy of the data. |
| IsComplete | Checks whether 100% of the data is complete. |
| IsPrimaryKey | Checks whether a column is a primary key (not NULL and unique). |
| IsUnique | Checks whether 100% of the data is unique. |
| Mean | Checks whether the mean matches the set threshold. |
| ReferentialIntegrity | Checks whether two datasets have referential integrity. |
| RowCount | Checks whether record counts match a threshold. |
| RowCountMatch | Checks whether record counts between two datasets match. |
| StandardDeviation | Checks whether the standard deviation matches a threshold. |
| SchemaMatch | Checks whether the schemas of two datasets match. |
| Sum | Checks whether the sum matches a set threshold. |
| Uniqueness | Checks whether the uniqueness of a dataset matches a threshold. |
| UniqueValueRatio | Checks whether the unique value ratio matches a threshold. |
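As a minimal sketch of triggering a recommendation run programmatically with the AWS SDK for Python (boto3); the database, table, and role names here are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Start a rule recommendation run against a cataloged table.
response = glue.start_data_quality_rule_recommendation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},  # placeholders
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role ARN
)

# Once the run succeeds, the recommended DQDL ruleset can be retrieved.
run = glue.get_data_quality_rule_recommendation_run(RunId=response["RunId"])
print(run.get("RecommendedRuleset", "run still in progress"))
```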
- Embedded in customer workflows – AWS Glue Data Quality must blend into customer workflows to be useful; disjointed experiences create friction in getting started. You can access AWS Glue Data Quality from the AWS Glue Data Catalog, allowing data stewards to set up rules while they are using the Data Catalog. You can also access it from AWS Glue Studio (AWS Glue's visual authoring tool), AWS Glue Studio notebooks (a notebook-based interface for coders to create data integration pipelines), and AWS Glue interactive sessions, an API through which data engineers can submit jobs from the code editor of their choice.
- Pay-as-you-go and cost-effective – AWS Glue Data Quality is charged based on the compute used, a simple pricing model that doesn't lock you into annual licenses. AWS Glue ETL-based data quality checks can use Flex execution, which is 34% cheaper for non-SLA-sensitive data quality checks. Additionally, data quality rules on data pipelines can save you costs because bad-quality data is detected early, before compute resources are wasted processing it downstream. And when data quality checks are configured as part of data pipelines, you incur only an incremental cost, because the data is already read and mostly in memory.
- Built on open source – AWS Glue Data Quality is built on Deequ, an open-source library that Amazon uses internally to manage the quality of data lakes exceeding 60 PB. Deequ is optimized to run data quality rules in a minimal number of passes over the data, which makes it efficient. Rules authored in AWS Glue Data Quality can run in any environment that runs Deequ, so you remain on an open-source foundation.
- Simplified rule authoring language – As part of AWS Glue Data Quality, we announced Data Quality Definition Language (DQDL). DQDL standardizes data quality rules so that you can use the same rules across different databases and engines. DQDL is simple to author and read, and brings the benefits developers expect from code, such as version control and deployment. To demonstrate the simplicity of the language, the following example shows three rules that check that the record count is greater than 10, that `VendorID` doesn't have any empty values, and that `VendorID` stays within a certain range of values.
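A minimal sketch of those three rules in DQDL (the specific value set for `VendorID` is illustrative):

```
Rules = [
    RowCount > 10,
    IsComplete "VendorID",
    ColumnValues "VendorID" in ["1", "2", "3", "4"]
]
```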
General Availability features
AWS Glue Data Quality has several key enhancements from the preview version:
- Error record identification – You need to know which records failed data quality checks. We have launched this capability in AWS Glue ETL: the data quality transform now enriches the input dataset with new columns that identify which records failed data quality checks. This helps you quarantine bad data so that only good records flow into your data repositories (see the sketch after this list).
- New rule types that validate data across multiple datasets – With new rule types such as `ReferentialIntegrity`, `DatasetMatch`, `RowCountMatch`, and `AggregateMatch`, you can compare two datasets to ensure that data integrity is maintained. The `SchemaMatch` rule type ensures that a dataset accurately matches a set schema, preventing downstream errors that may be caused by schema changes.
- Amazon EventBridge integration – Integration with Amazon EventBridge simplifies how you set up alerts when quality rules fail. A one-time setup is now sufficient to alert data consumers about data quality failures (an example event pattern follows this list).
- AWS CloudFormation support – With support for AWS CloudFormation, you can easily deploy data quality rules across many environments (a sample template snippet follows this list).
- Join support in the CustomSql rule type – You can now join datasets in `CustomSql` rules to write complex business rules (an example follows this list).
- New data source support – You can check data quality on open transactional table formats such as Apache Hudi, Apache Iceberg, and Delta Lake. Additionally, you can set up data quality rules on Amazon Redshift and Amazon Relational Database Service (Amazon RDS) data sources cataloged in the AWS Glue Data Catalog.
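For error record identification, here is a minimal sketch of quarantining failed rows in an AWS Glue ETL job. The `EvaluateDataQuality` transform and the per-row result column name follow the AWS Glue documentation at the time of writing; treat them, along with the database and table names, as placeholders to verify against the current docs:

```python
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the input dataset (placeholder database/table names).
input_frame = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

ruleset = 'Rules = [ IsComplete "VendorID" ]'

# process_rows returns a collection holding rule-level and row-level outcomes.
results = EvaluateDataQuality().process_rows(
    frame=input_frame,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "orders_check"},
)

row_outcomes = SelectFromCollection.apply(dfc=results, key="rowLevelOutcomes").toDF()

# Each row is enriched with an overall pass/fail column; route failures to
# quarantine and let only passing rows continue downstream.
good_rows = row_outcomes.filter(row_outcomes.DataQualityEvaluationResult == "Passed")
bad_rows = row_outcomes.filter(row_outcomes.DataQualityEvaluationResult == "Failed")
```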
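For the EventBridge integration, an event pattern along the following lines can route data quality results to a target such as Amazon SNS. The source and detail-type values reflect the AWS Glue Data Quality event documentation at the time of writing; verify them before use, and note that you can additionally filter on fields in the event detail:

```json
{
  "source": ["aws.glue-dataquality"],
  "detail-type": ["Data Quality Evaluation Results Available"]
}
```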
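For AWS CloudFormation support, a minimal sketch of a ruleset defined with the `AWS::Glue::DataQualityRuleset` resource; the ruleset name and target table are placeholders:

```yaml
Resources:
  OrdersQualityRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: orders-ruleset          # placeholder name
      Description: Basic quality checks for the orders table
      Ruleset: >-
        Rules = [
          RowCount > 10,
          IsComplete "VendorID"
        ]
      TargetTable:
        DatabaseName: sales_db      # placeholder database
        TableName: orders           # placeholder table
```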
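And for join support in `CustomSql`, a sketch of a rule that reconciles the dataset under evaluation against a second dataset. DQDL exposes the evaluated dataset under the `primary` alias; the `reference_orders` alias and the column names are illustrative and assume a reference dataset has been configured with that alias:

```
Rules = [
    CustomSql "SELECT COUNT(*) FROM primary JOIN reference_orders ON primary.order_id = reference_orders.order_id WHERE primary.total <> reference_orders.total" = 0
]
```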
Summary
AWS Glue Data Quality is now generally available. To help you get started, we have created a five-part blog series:
- Part 1: Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog
- Part 2: Getting started with AWS Glue Data Quality for ETL Pipelines
- Part 3: Set up data quality rules across multiple datasets using AWS Glue Data Quality
- Part 4: Set up alerts and orchestrate data quality rules with AWS Glue Data Quality
- Part 5: Visualize data quality score and metrics generated by AWS Glue Data Quality
Get started today with AWS Glue Data Quality and tell us what you think.
About the authors
Shiv Narayanan is a Technical Product Manager for AWS Glue's data management capabilities, such as data quality, sensitive data detection, and streaming. Shiv has over 20 years of data management experience in consulting, business development, and product management.
Tome Tanasovski is a Technical Manager at AWS for a team that delivers capabilities for Amazon's big data platforms through AWS Glue. Prior to working at AWS, Tome was an executive at a market-leading global financial services firm in New York City, where he helped run the firm's Artificial Intelligence & Machine Learning Center of Excellence. Before that role, he spent nine years at the firm focusing on automation, cloud, and distributed computing. Tome has a quarter century of experience in technology in the tri-state area across a wide variety of industries, including big tech, finance, insurance, and media.
Brian Ross is a Senior Software Development Manager at AWS. He has spent 24 years building software at scale and currently focuses on serverless data integration with AWS Glue. In his spare time, he studies ancient texts, cooks modern dishes and tries to get his kids to do both.
Alona Nadler is AWS Glue Head of Product and is responsible for AWS Glue Service. She has a long history of working in the enterprise software and data services spaces. When not working, Alona enjoys traveling and playing tennis.