Amazon SageMaker Data Processing FAQs
General
What is Amazon SageMaker Data Processing?
SageMaker Data Processing lets you analyze, prepare, integrate, and orchestrate your data with processing capabilities from Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA). You can use open source data-processing frameworks such as Apache Spark, analyze data at scale with Trino, and build real-time analytics with Apache Flink and Apache Spark.
What services are included in SageMaker Data Processing?
SageMaker Data Processing brings together Amazon EMR, Athena, AWS Glue, and Amazon MWAA.
Why should I use SageMaker Data Processing?
SageMaker Data Processing helps you explore data, build data-transformation jobs, and orchestrate and deploy data pipelines at scale. It improves performance and delivers insights faster than traditional open source systems by providing cost-effective versions of Apache Spark, Apache Airflow, Apache Flink, Trino, and more that are API-compatible with their open source counterparts. SageMaker Data Processing also provides access to your data sources in Amazon SageMaker Lakehouse through zero-ETL integrations, federated querying capabilities, and connectors.
Migration and access
Do I need to migrate to SageMaker to use existing services like Amazon EMR, Athena, or AWS Glue?
No, you do not need to migrate to SageMaker. You can continue to use Amazon EMR, Athena, AWS Glue, and Amazon MWAA as you do today. However, we recommend that you get started with SageMaker to use unified tooling, built-in data governance, and simplified SageMaker Lakehouse architectures.
What happens to the jobs, queries, code, and resources that I've already created or plan to create in Amazon EMR, Athena, or AWS Glue?
There is no impact on the code, queries, jobs, and other resources that you've created and used with Amazon EMR, Athena, or AWS Glue. You can continue to use these services for new workloads, if you prefer. Resources created in these services, such as Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) clusters, are visible in SageMaker to simplify the development of analytics and AI applications. The existing development experiences built into Amazon EMR, AWS Glue, and Athena remain available alongside the new development experience within SageMaker.
What version of AWS Glue is available in SageMaker?
The latest version of AWS Glue, AWS Glue 5.0, is available in SageMaker. AWS Glue 5.0 accelerates data-processing workloads and delivers the latest performance-optimized Apache Spark 3.5.2 runtime so you can develop, run, and scale your workloads and get insights faster. To learn more, visit the AWS Glue page.
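As a rough illustration of targeting this runtime, the sketch below builds the arguments for the AWS Glue CreateJob API with the Glue version pinned to 5.0. It assumes you define the job through the AWS SDK for Python (boto3) rather than through the SageMaker interface. The job name, IAM role ARN, Amazon S3 script path, worker type, and worker count are hypothetical placeholders, and the helper `build_glue_job_request` is not part of any AWS library.

```python
from typing import Any, Dict


def build_glue_job_request(name: str, role_arn: str, script_location: str) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Glue CreateJob API, pinned to Glue 5.0."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "5.0",  # request the AWS Glue 5.0 runtime
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # Assumed sizing for illustration; choose values that fit your workload.
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


# Hypothetical values for illustration only.
request = build_glue_job_request(
    name="example-glue5-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/job.py",
)

# To submit the request (requires boto3 and valid AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**request)
```

Keeping the request in a plain dictionary makes it easy to review or version the job definition before anything is created in your account.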
Pricing
What is the pricing model for SageMaker Data Processing?
Each AWS service that you use through SageMaker is subject to its own pricing. For details, see the AWS pricing pages for Athena, Amazon EMR, AWS Glue, and Amazon MWAA.