Detect multicollinearity and easily export results in a few clicks with Amazon SageMaker Data Wrangler

Posted On: Aug 16, 2021

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. Starting today, you can use new capabilities of Amazon SageMaker Data Wrangler that make it easier and faster to prepare data for ML: multicollinearity detection, easy export of results to Amazon S3, support for column delimiters, and the ability to reuse the same SageMaker Data Wrangler flow on different datasets of your choice.

Multicollinearity occurs when two or more features in a dataset are highly correlated with one another. Detecting it is important because multicollinearity can hinder the performance of an ML model. Amazon SageMaker Data Wrangler now offers three diagnostic visualizations to help detect multicollinearity in a dataset. The first plots the variance inflation factor (VIF) of each feature; high VIFs may indicate the presence of multicollinearity. The second uses Principal Component Analysis (PCA) and the Singular Value Decomposition (SVD) to calculate singular values; a highly non-uniform distribution of singular values may also indicate multicollinearity. The third plots the coefficient values of a LASSO (least absolute shrinkage and selection operator) model trained on your data. Variables with coefficient values close to zero may be redundant and may not contribute significantly to the performance of an ML model.
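To make the VIF diagnostic concrete, here is a minimal Python sketch of how variance inflation factors can be computed outside the tool. It is not SageMaker Data Wrangler's implementation. The function names (`correlation_matrix`, `invert`, `variance_inflation_factors`) and the synthetic columns are assumptions made for illustration. The sketch relies on the standard identity that the VIF of feature j, 1 / (1 - R_j^2), equals the j-th diagonal entry of the inverse of the feature correlation matrix.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length feature columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])  # centered, unit-length column
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: some features are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """Return one VIF per feature: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Synthetic example (hypothetical data): x3 is x1 plus a little noise,
# so x1 and x3 are nearly collinear while x2 is independent of both.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [rng.gauss(0, 1) for _ in range(500)]
x3 = [a + 0.1 * rng.gauss(0, 1) for a in x1]

vifs = variance_inflation_factors([x1, x2, x3])
for name, vif in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {vif:.1f}")
```

In this sketch the two nearly duplicated columns receive large VIFs while the independent column stays close to 1. A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of multicollinearity, though the appropriate threshold depends on the application.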

Starting today, you can also export your prepared data with a few clicks. Amazon SageMaker Data Wrangler’s new export functionality provides a push-button export experience. Click “Export Data” on the prepare tab and specify the Amazon S3 location where you would like the results to be stored. Your results are then exported directly to S3 for use in other ML applications. Additionally, you can now import data in a variety of delimited formats, including comma-separated, tab-separated, pipe-separated, semicolon-separated, and colon-separated data. Finally, you can now change the datasets used in your SageMaker Data Wrangler data flows. Click a source node in the data view and select “Edit dataset” to modify the source data used in a SageMaker Data Wrangler flow file.

To get started with the new capabilities of Amazon SageMaker Data Wrangler, open Amazon SageMaker Studio and click File > New > Flow from the menu, or “new data flow” from the SageMaker Studio launcher. To learn more, visit the feature page or view the documentation. Instructions for upgrading to the latest release are also available.
