Posted On: Mar 30, 2021

With AWS Glue DataBrew, you can now visually detect outliers in data from your data lake, data warehouses, and other JDBC-accessible data sources. You can then handle outliers by replacing, removing, rescaling, or flagging them. Detection uses statistical methods such as z-score (the difference between a value and the mean, divided by the standard deviation), modified z-score (the difference between a value and the median, scaled by the median absolute deviation), and interquartile range (the spread between the first quartile and the third quartile, used to flag values that fall far outside it). You can combine these methods with one or more transformations, such as creating a flag column or applying window functions, or choose from more than 250 other transformations.
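For readers who want to see what these three methods compute, the following is a minimal Python sketch using only the standard library. It is not DataBrew's implementation, which this announcement does not describe. The function names, the sample data, and the cutoffs (3.0 for z-score, 3.5 and the 0.6745 scaling constant for modified z-score, 1.5 times the interquartile range) are conventional choices assumed here for illustration.

```python
import statistics

# Hypothetical helpers for illustration only; names and thresholds are
# assumptions, not DataBrew settings.

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) / stdev > threshold]

def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and the median absolute deviation (MAD)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # MAD of zero makes the score undefined
    # 0.6745 is the conventional constant that makes the score comparable
    # to a z-score for normally distributed data.
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def flag_column(values, outliers):
    """Build a boolean flag column marking which rows are outliers."""
    outlier_set = set(outliers)
    return [v in outlier_set for v in values]

# Made-up sample: 19 ordinary readings and one aberrant value (95).
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
            12, 11, 10, 12, 11, 13, 12, 11, 10, 95]

print(zscore_outliers(readings))
print(modified_zscore_outliers(readings))
print(iqr_outliers(readings))

# Handling step: flag the outliers, then remove them.
flags = flag_column(readings, iqr_outliers(readings))
cleaned = [v for v, is_outlier in zip(readings, flags) if not is_outlier]
print(len(cleaned))
```

On this sample all three methods single out the value 95. The plain z-score is sensitive to the outlier itself, because an extreme value inflates both the mean and the standard deviation, which is one reason the median-based variants are often preferred for small samples.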

For analytics and machine learning use cases, datasets often contain outliers that are either valuable information or meaningless aberrations caused by measurement and recording errors. Including or excluding outliers can directly affect the results of an analysis or machine learning model and the decisions made based on that data. When working with small samples of data from your data lake and data warehouses, you previously had to slice and dice the data multiple times in code to detect and handle all outliers, because there was no visual way to inspect them. With DataBrew, you can now visually preview outliers in your dataset profiles and handle them appropriately without writing any code.

AWS Glue DataBrew is a visual data preparation tool that makes it easy to clean and normalize data using 250+ pre-built transformations for data preparation, without the need to write any code.  

To learn more, view this getting started video or use a sample dataset to explore DataBrew. To get started, visit the AWS Management Console or install the DataBrew plugin in your Notebook environment and refer to the DataBrew documentation.