Posted On: Feb 11, 2021
When you run profile jobs in AWS Glue DataBrew to auto-generate 40+ data quality statistics, such as column-level cardinality, numerical correlations, unique values, and standard deviation, you can now configure how much of the dataset is analyzed. You can run the profile on a specified percentage of a very large dataset, or focus on a smaller sub-sample for faster results.
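For readers who create profile jobs programmatically, the sample size can be sketched with the boto3 `databrew` client. This is a minimal illustration, not an official recipe: it assumes the `JobSample` setting of `create_profile_job`, which is expressed as a row count (`CUSTOM_ROWS`) rather than a percentage, so a percentage would first have to be converted to rows. The helper names, job name, dataset name, role ARN, and bucket below are hypothetical placeholders.

```python
from typing import Any, Dict, Optional


def build_profile_job_request(
    name: str,
    dataset_name: str,
    role_arn: str,
    output_bucket: str,
    sample_rows: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 DataBrew create_profile_job call.

    sample_rows=None profiles the full dataset; a positive integer
    profiles only that many rows.
    """
    if sample_rows is not None and sample_rows <= 0:
        raise ValueError("sample_rows must be a positive integer")

    if sample_rows is None:
        job_sample = {"Mode": "FULL_DATASET"}
    else:
        job_sample = {"Mode": "CUSTOM_ROWS", "Size": sample_rows}

    return {
        "Name": name,
        "DatasetName": dataset_name,
        "RoleArn": role_arn,
        "OutputLocation": {"Bucket": output_bucket},
        "JobSample": job_sample,
    }


def create_profile_job(request: Dict[str, Any]) -> str:
    """Submit the request to DataBrew (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("databrew")
    return client.create_profile_job(**request)["Name"]


# Hypothetical values, for illustration only.
request = build_profile_job_request(
    name="orders-profile",
    dataset_name="orders",
    role_arn="arn:aws:iam::123456789012:role/ExampleDataBrewRole",
    output_bucket="example-profile-results",
    sample_rows=20000,
)
```

Calling `create_profile_job(request)` would then register the job. Keeping the request builder separate from the AWS call makes the sampling choice easy to inspect before anything is submitted.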
Once the profile job completes, DataBrew surfaces all statistics on a visual profile dashboard in the console and stores the raw statistics as a JSON object in your Amazon S3 bucket. With this, you can monitor the quality of incoming data over time, detect unanticipated or undesirable changes in data, and set up automated data quality alerts within minutes instead of hours, days, or weeks, without writing any code.
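Teams that do want to script their own checks against the stored statistics could compare two profile runs along the following lines. This is only a sketch: it assumes you have already extracted a flat mapping of statistic name to numeric value from each profile JSON file (the actual layout of that file is not reproduced here), and the key names, the `find_drift` helper, and the 10% tolerance are all hypothetical.

```python
from typing import Dict, List


def find_drift(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.1,
) -> List[str]:
    """Return a message for each statistic that moved more than `tolerance`.

    Change is measured relative to the baseline value; a zero baseline
    falls back to an absolute comparison to avoid dividing by zero.
    """
    alerts = []
    for key, old in baseline.items():
        if key not in current:
            alerts.append(f"{key}: missing from current profile")
            continue
        new = current[key]
        denominator = abs(old) if old != 0 else 1.0
        change = abs(new - old) / denominator
        if change > tolerance:
            alerts.append(f"{key}: changed from {old} to {new}")
    return alerts


# Hypothetical statistics extracted from two profile runs.
baseline_stats = {"price.mean": 100.0, "price.stddev": 10.0}
current_stats = {"price.mean": 101.0, "price.stddev": 15.0}
alerts = find_drift(baseline_stats, current_stats)
```

Here only the standard deviation moves by more than the tolerance, so `alerts` holds a single message for `price.stddev`.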
To get started, visit the AWS Management Console or install the DataBrew plugin in your notebook environment, and refer to the DataBrew documentation.