Posted On: Jan 24, 2019

Amazon SageMaker Batch Transform now supports TFRecord as a SplitType, so datasets can be split along TFRecord boundaries. TFRecord joins the existing supported formats, which include RecordIO, CSV, and Text.

Amazon SageMaker is a fully managed service that enables every developer and data scientist to build, train, and deploy machine learning models quickly and easily. Batch Transform is a major SageMaker feature that lets you run predictions on batch data.

TFRecord is a standard TensorFlow data format. It is a record-oriented binary file format that stores data as a sequence of binary records, which makes storage and processing efficient and suits it to large datasets in SageMaker Batch Transform. To use TFRecord in a Batch Transform job, choose TFRecord as the SplitType and your dataset will be split along TFRecord boundaries. Additionally, you can specify a BatchStrategy of MultiRecord to batch multiple records into a single request.
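As a rough illustration, the two settings described above map onto the SplitType and BatchStrategy fields of a CreateTransformJob request. The sketch below builds such a request for the boto3 SageMaker client. The job name, model name, S3 locations, and instance type are placeholders chosen for this example, not values from the announcement, and the helper functions are hypothetical.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri):
    """Build a CreateTransformJob request that splits input on TFRecord boundaries.

    All argument values are supplied by the caller; nothing here is
    prescribed by the announcement beyond SplitType and BatchStrategy.
    """
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Batch several records into each inference request.
        "BatchStrategy": "MultiRecord",
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": input_s3_uri,
                }
            },
            # Split each input object along TFRecord boundaries.
            "SplitType": "TFRecord",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        # Placeholder instance settings; pick what fits your model.
        "TransformResources": {"InstanceType": "ml.m4.xlarge", "InstanceCount": 1},
    }


def submit(request):
    """Send the request to SageMaker (needs boto3 and AWS credentials)."""
    import boto3  # imported here so building the request works without boto3

    return boto3.client("sagemaker").create_transform_job(**request)


# Placeholder names and buckets, for illustration only.
request = build_transform_request(
    job_name="tfrecord-batch-job",
    model_name="my-model",
    input_s3_uri="s3://my-bucket/input/",
    output_s3_uri="s3://my-bucket/output/",
)
```

Calling submit(request) would start the job. It is kept separate so the request can be inspected or logged before anything is sent to AWS.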

TFRecord support is now available in all AWS Regions where Amazon SageMaker is available. To learn more, see the documentation and the sample.