AWS Machine Learning Blog

Run AutoML experiments with large parquet datasets using Amazon SageMaker Autopilot

Starting today, you can use Amazon SageMaker Autopilot to tackle regression and classification tasks on large datasets up to 100 GB. Additionally, you can now provide your datasets in either CSV or Apache Parquet content types.

Businesses are generating more data than ever, and demand is growing to turn these large datasets into insights that shape business decisions. However, training state-of-the-art machine learning (ML) algorithms on datasets of this size can be challenging. Autopilot automates this process and lets you run automated machine learning (AutoML) on datasets up to 100 GB.

Autopilot automatically subsamples large datasets to fit within the maximum supported size, and it preserves the rare class when classes are imbalanced. Class imbalance is an important problem to be aware of in ML, especially with large datasets. Consider a fraud detection dataset in which only a small fraction of transactions is expected to be fraudulent. In this case, Autopilot subsamples only the majority class (non-fraudulent transactions) and keeps every example of the rare class (fraudulent transactions).
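To make the idea concrete, here is a minimal sketch of majority-class subsampling. It is an illustration only: this post does not describe Autopilot's internal algorithm, and the function name subsample_majority, the row-count budget max_rows, and the fixed seed are all assumptions made for this example.

```python
import random
from collections import defaultdict


def subsample_majority(rows, label_of, max_rows, seed=0):
    """Shrink a dataset by sampling only from its most frequent class.

    Hypothetical helper for illustration; not Autopilot's implementation.
    rows     : list of records
    label_of : function that returns the class label of a record
    max_rows : target size of the subsampled dataset
    """
    rows = list(rows)
    if len(rows) <= max_rows:
        return rows  # already small enough, nothing to do

    # Group the records by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # The majority class is the one with the most records.
    majority = max(by_class, key=lambda label: len(by_class[label]))

    # Keep every record from the other classes, then fill the remaining
    # budget with a random sample of the majority class.
    kept = [row for label, group in by_class.items()
            if label != majority for row in group]
    budget = max(max_rows - len(kept), 0)
    rng = random.Random(seed)
    kept.extend(rng.sample(by_class[majority], budget))
    return kept


# Example: 5 fraudulent and 95 legitimate transactions, reduced to 20 rows.
transactions = [("fraud", i) for i in range(5)] + [("ok", i) for i in range(95)]
sample = subsample_majority(transactions, lambda row: row[0], max_rows=20)
```

In this sketch all five fraudulent records survive and only the legitimate ones are dropped, which is the behavior the paragraph above describes.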

When you run an AutoML job using Autopilot, the details of any subsampling are written to Amazon CloudWatch. To find them, navigate to the log group /aws/sagemaker/ProcessingJobs, search for the name of your AutoML job, and choose the CloudWatch log stream that includes -db- in its name.
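The same lookup can be scripted. The sketch below assumes boto3 is installed and AWS credentials are configured. The helper names are hypothetical, and the matching rule (the stream name contains both the job name and -db-) is taken directly from the console steps above rather than from a documented naming contract.

```python
LOG_GROUP = "/aws/sagemaker/ProcessingJobs"


def pick_subsampling_streams(stream_names, automl_job_name):
    """Keep the stream names that mention the job and contain '-db-'."""
    return [name for name in stream_names
            if automl_job_name in name and "-db-" in name]


def find_subsampling_streams(automl_job_name, logs_client=None):
    """List matching log streams in the processing-jobs log group."""
    if logs_client is None:
        import boto3  # imported lazily so the pure helper works without it
        logs_client = boto3.client("logs")

    # describe_log_streams is paginated, so walk through every page.
    paginator = logs_client.get_paginator("describe_log_streams")
    names = []
    for page in paginator.paginate(logGroupName=LOG_GROUP):
        names.extend(stream["logStreamName"] for stream in page["logStreams"])
    return pick_subsampling_streams(names, automl_job_name)


# Made-up stream names, used only to show the filtering rule.
example = pick_subsampling_streams(
    ["my-job-db-1-abc/algo-1", "my-job-dpp0-1-xyz/algo-1", "other-db-1/algo-1"],
    "my-job",
)
```

Listing every stream in the group can be slow in a busy account. If you know how the stream names begin, passing logStreamNamePrefix to the paginator narrows the search.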

Many of our customers prefer the Parquet content type for storing their large datasets, generally because of its compression, support for advanced data structures, efficiency, and low-cost operations. These datasets can reach tens or even hundreds of GB. Now, you can bring these Parquet datasets directly to Autopilot. You can either use our API or navigate to Amazon SageMaker Studio to create an Autopilot job with a few clicks. You can specify the input location of your Parquet dataset as a single file or as multiple files listed in a manifest file. Autopilot automatically detects the content type of your dataset, parses it, extracts meaningful features, and trains multiple ML algorithms.
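As a rough sketch of the API route, the snippet below builds a request for the SageMaker CreateAutoMLJob operation as exposed by boto3. The bucket names, role ARN, job name, and target column are placeholders, and build_automl_request is a hypothetical helper. Because Autopilot detects the content type automatically, the ContentType field should be optional; it is set here only to make the intent explicit, and the value shown is an assumption to verify against the current API reference.

```python
# Content type string for Parquet input (assumption; check the API reference).
PARQUET_CONTENT_TYPE = "x-application/vnd.amazon+parquet"


def build_automl_request(job_name, input_s3_uri, target_column,
                         output_s3_uri, role_arn, use_manifest=False):
    """Build keyword arguments for the SageMaker create_auto_ml_job call.

    Set use_manifest=True when input_s3_uri points to a manifest file
    that lists multiple Parquet files.
    """
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "ManifestFile" if use_manifest else "S3Prefix",
                    "S3Uri": input_s3_uri,
                }
            },
            "TargetAttributeName": target_column,
            "ContentType": PARQUET_CONTENT_TYPE,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
    }


# Placeholder values for illustration only.
request = build_automl_request(
    job_name="parquet-automl-demo",
    input_s3_uri="s3://example-bucket/data/train.manifest",
    target_column="label",
    output_s3_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    use_manifest=True,
)

# To submit the job (requires boto3, credentials, and real resources):
#   import boto3
#   boto3.client("sagemaker").create_auto_ml_job(**request)
```

Keeping the request construction separate from the API call makes it easy to inspect or log the configuration before anything is submitted.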

You can get started using our sample notebook for running AutoML using Autopilot on Parquet datasets.

About the Authors

H. Furkan Bozkurt, Machine Learning Engineer, Amazon SageMaker Autopilot.

Valerio Perrone, Applied Science Manager, Amazon SageMaker Autopilot.