Amazon SageMaker Processing now supports built-in Spark containers for big data processing

Posted on: Sep 30, 2020

We’re excited to announce that Amazon SageMaker now supports Apache Spark as a pre-built big data processing container. You can use this container with Amazon SageMaker Processing to take advantage of a fully managed Spark environment for data processing and feature engineering workloads.

Apache Spark is a unified analytics engine for large-scale data processing. Amazon SageMaker now provides pre-built Docker images that include Apache Spark and the other dependencies needed to run distributed data processing jobs. Managing and scaling infrastructure for Spark jobs requires considerable heavy lifting: developers and data scientists spend significant time managing infrastructure for shared usage and tuning it for performance, scale, and cost. Maintaining persistent Spark infrastructure that is used only for the duration of data processing jobs is also expensive, because costs are incurred even when no jobs are running.

With Amazon SageMaker Processing and the built-in Spark container, you can run Spark processing jobs for data preparation easily and at scale. Customers get the benefits of a fully managed Spark environment and on-demand, scalable infrastructure, with all the security and compliance capabilities of Amazon SageMaker. You can easily manage Spark configuration and submit custom jobs for distributed processing. When you submit a job, Amazon SageMaker provisions the infrastructure, bootstraps the Spark cluster, runs your application, and releases the resources upon completion.
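As an illustration, the following minimal sketch shows one way to submit a PySpark script as a processing job using the PySparkProcessor class from the SageMaker Python SDK. The bucket names, script path, IAM role, instance types, and framework version shown here are placeholder assumptions, not values from this announcement.

```python
# Minimal sketch: run a PySpark script on the built-in Spark container
# via SageMaker Processing. All resource names below are placeholders.
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",   # prefix for the processing job name
    framework_version="2.4",            # Spark version of the built-in container
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder IAM role
    instance_count=2,                   # size of the ephemeral Spark cluster
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1800,
)

# SageMaker provisions the cluster, bootstraps Spark, runs the script,
# and releases the instances when the job completes.
spark_processor.run(
    submit_app="./code/preprocess.py",                      # your PySpark script
    arguments=["--input", "s3://my-bucket/raw/",            # hypothetical script arguments
               "--output", "s3://my-bucket/prepared/"],
    spark_event_logs_s3_uri="s3://my-bucket/spark-logs/",   # optional: persist Spark event logs
)
```

Because the cluster exists only for the lifetime of the job, you pay only for the instances while the job is running.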

Amazon SageMaker Processing with the built-in Spark container is now generally available in all AWS Regions in the Americas and Europe, and in select Regions in Asia Pacific, with additional Regions coming soon. You can find the details of the specific Regions here. Read the documentation for more information and for sample notebooks. To learn how to use the feature, visit the blog post.