Posted On: Jan 28, 2022
Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning models based on your data, while allowing you to maintain full control and visibility. Starting today, SageMaker Autopilot provides support for the Apache Parquet file format. Apache Parquet is a free and open-source column-oriented data storage format for the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance. This new feature allows the creation of SageMaker Autopilot experiments with files stored in the Apache Parquet file format.
When creating an Autopilot experiment with this release, you can specify an Amazon S3 location for input Parquet data that points to either a single Parquet file or a manifest file that contains metadata and references multiple Parquet files. Autopilot accepts up to 2 GB of compressed data for each Parquet file in the input location or manifest. You can raise the default 2 GB service limit for a compressed Parquet file by filing a service limit increase request in the AWS Support Center console. When you specify an Amazon S3 folder or a manifest file with multiple Parquet files as input, the default 2 GB limit is enforced for each Parquet file separately. This release also supports processing large Parquet datasets: SageMaker Autopilot automatically subsamples the uncompressed data stored in the Parquet file(s) to fit the maximum supported limit, while accounting for class imbalance and preserving rare class labels.
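As a rough illustration of the input options described above, the following Python sketch builds one input channel entry in the shape that the AutoMLChannel API expects, and flags files that exceed the default per-file limit. The helper names (parquet_channel, oversized_files), the bucket and column names, and the reading of 2 GB as 2 * 1024^3 bytes are assumptions made for this example, not part of the announcement; check the Quotas topic for the exact limit definition.

```python
GB = 1024 ** 3
# Assumption: "2 GB" is interpreted here as 2 * 1024**3 bytes.
DEFAULT_PARQUET_LIMIT_BYTES = 2 * GB


def parquet_channel(s3_uri, target_column, use_manifest=False):
    """Build one InputDataConfig entry (an AutoMLChannel) for Parquet input.

    Hypothetical helper: s3_uri is either an S3 prefix or file holding
    Parquet data, or a manifest file that references several Parquet files.
    """
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "ManifestFile" if use_manifest else "S3Prefix",
                "S3Uri": s3_uri,
            }
        },
        "TargetAttributeName": target_column,
        # ContentType value that marks the channel as Parquet.
        "ContentType": "x-application/vnd.amazon+parquet",
    }


def oversized_files(sizes_by_key, limit=DEFAULT_PARQUET_LIMIT_BYTES):
    """Return the keys whose compressed size exceeds the per-file limit.

    The limit applies to each file separately, so sizes are never summed.
    """
    return sorted(key for key, size in sizes_by_key.items() if size > limit)


if __name__ == "__main__":
    # Placeholder bucket and column names.
    channel = parquet_channel("s3://example-bucket/train/", "label")
    print(channel)
    print(oversized_files({"part-0.parquet": 1 * GB, "part-1.parquet": 3 * GB}))
    # The channel could then be passed to the SageMaker API, for example:
    #   boto3.client("sagemaker").create_auto_ml_job(
    #       AutoMLJobName="example-job",
    #       InputDataConfig=[channel],
    #       OutputDataConfig={"S3OutputPath": "s3://example-bucket/output/"},
    #       RoleArn="<execution role ARN>",
    #   )
```

The size check runs per file because the limit is enforced for each Parquet file separately; a folder whose files are each under the limit is acceptable even if the total is larger.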
The Parquet file format is supported in all AWS Regions where SageMaker Autopilot is available. For more information, see the Data and Problem types and Quotas topics in the Amazon SageMaker Autopilot Developer Guide, and ContentType in the AutoMLChannel API Reference. For a deep dive, check out our blog post and sample notebook previewing this feature launch. To get started with SageMaker Autopilot, see Getting Started or access Autopilot within SageMaker Studio.