Posted On: Jul 23, 2021

AWS Glue DataBrew now lets customers specify which data quality statistics are auto-generated for a dataset when running a profile job. Customers can tailor the profile to the nature and size of their datasets, for example by choosing whether to compute duplicate values, correlations, and outliers, and create a custom data profile overview that contains only the statistics they need.
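
For readers who script their jobs rather than use the console, the statistic selection can be sketched as a request payload for the DataBrew CreateProfileJob API. This is a minimal sketch, not an official sample: the helper `build_profile_job_request`, the job, dataset, role, and bucket values are hypothetical, and the statistic names `CORRELATION` and `DUPLICATE_ROWS_COUNT` are assumed identifiers that should be checked against the DataBrew documentation.

```python
from typing import Any, Dict, List


def build_profile_job_request(
    job_name: str,
    dataset_name: str,
    role_arn: str,
    output_bucket: str,
    included_statistics: List[str],
) -> Dict[str, Any]:
    """Assemble keyword arguments for a DataBrew CreateProfileJob call.

    Hypothetical helper: it only builds a dictionary and does not contact AWS.
    """
    if not included_statistics:
        raise ValueError("included_statistics must name at least one statistic")
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        "RoleArn": role_arn,
        "OutputLocation": {"Bucket": output_bucket},
        "Configuration": {
            # Restrict dataset-level statistics to the ones listed here.
            "DatasetStatisticsConfiguration": {
                "IncludedStatistics": list(included_statistics),
            },
        },
    }


# All values below are placeholders chosen for illustration.
request = build_profile_job_request(
    job_name="orders-profile",
    dataset_name="orders",
    role_arn="arn:aws:iam::123456789012:role/ExampleDataBrewRole",
    output_bucket="example-profile-output",
    # Assumed statistic identifiers; verify them in the DataBrew documentation.
    included_statistics=["CORRELATION", "DUPLICATE_ROWS_COUNT"],
)
```

With boto3 installed and credentials configured, the payload could then be submitted with `boto3.client("databrew").create_profile_job(**request)`. Building the payload separately keeps the statistic selection easy to review and test before any job is created.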

DataBrew surfaces all statistics from a profile job on a visual profile dashboard and stores the raw results as a JSON object in an Amazon S3 bucket. Customers can control which statistics to show, monitor the quality of incoming data over time, and discover changes to data within minutes, all without writing any code. Customers can also set up automated data quality alerts using DataBrew and AWS Lambda, as outlined in this blog post.
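
For teams that do want to script their own checks on top of the stored JSON, one possible approach is to compare two profile outputs and flag numeric values that moved by more than a tolerance. The sketch below makes no claim about the actual layout of the DataBrew profile file: it treats the profile as arbitrary nested JSON, and the field names in the sample data (`columns`, `missing`, `mean`) are invented for illustration. The same function could be called from an AWS Lambda handler after the two files have been downloaded.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def flatten_numbers(node: Any, prefix: str = "") -> Iterator[Tuple[str, float]]:
    """Yield (path, value) pairs for every numeric leaf in nested JSON data."""
    if isinstance(node, bool):
        return  # booleans are ints in Python; skip them
    if isinstance(node, (int, float)):
        yield prefix, float(node)
    elif isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_numbers(value, path)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten_numbers(value, f"{prefix}[{index}]")


def changed_statistics(
    previous: Any, current: Any, tolerance: float = 0.1
) -> Dict[str, Tuple[float, float]]:
    """Return numeric paths present in both profiles whose relative change exceeds tolerance."""
    before = dict(flatten_numbers(previous))
    after = dict(flatten_numbers(current))
    changes = {}
    for path in before.keys() & after.keys():
        old, new = before[path], after[path]
        scale = max(abs(old), abs(new))
        if scale > 0 and abs(new - old) / scale > tolerance:
            changes[path] = (old, new)
    return changes


# Invented sample data; a real profile file will have a different structure.
previous = json.loads('{"columns": [{"name": "price", "missing": 2, "mean": 10.0}]}')
current = json.loads('{"columns": [{"name": "price", "missing": 40, "mean": 10.2}]}')
changes = changed_statistics(previous, current)
print(changes)
```

In this sample only the missing-value count is reported, because the mean moved by less than the 10% tolerance.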

To get started, visit the AWS Management Console or install the DataBrew plugin in your notebook environment, and refer to the DataBrew documentation.