AWS Big Data Blog
Apache Spark 4.0.1 preview now available on Amazon EMR Serverless
Amazon EMR Serverless now supports Apache Spark 4.0.1 in preview, making analytics accessible to more users, simplifying data engineering workflows, and strengthening governance capabilities. The release introduces ANSI SQL compliance, VARIANT data type support for JSON handling, Apache Iceberg v3 table format support, and enhanced streaming capabilities. This preview is available in all AWS Regions where EMR Serverless is available.
In this post, we explore key benefits, technical capabilities, and considerations for getting started with Spark 4.0.1 on Amazon EMR Serverless, a serverless deployment option that simplifies running open-source big data frameworks without requiring you to manage clusters. With the emr-spark-8.0-preview release label, you can evaluate new SQL capabilities, Python API improvements, and streaming enhancements in your existing EMR Serverless environment.
Benefits
Spark 4.0.1 helps you solve data engineering problems with specific improvements. This section shows how new capabilities help with real-world scenarios.
Make analytics accessible to more users
Simplify extract, transform, and load (ETL) development with SQL scripting. Data engineers often switch between SQL and Python to build complex ETL logic with control flow. SQL scripting in Spark 4.0.1 enables loops, conditionals, and session variables directly in SQL, reducing context-switching and simplifying pipeline development. Use pipe syntax (|>) to chain operations for more readable, maintainable queries.
Improve data quality with ANSI SQL mode. Silent type conversion failures can introduce data quality issues. ANSI SQL mode, now enabled by default, enforces standard SQL behavior, raising errors for invalid operations instead of producing unexpected results. Test your queries thoroughly during this preview evaluation.
Simplify data engineering workflows
Process JSON data efficiently with VARIANT. Teams working with semi-structured data often see slow performance from repeated JSON parsing. The VARIANT data type stores JSON in an optimized binary format, eliminating parsing overhead. You can efficiently store and query JSON data in data lakes without schema rigidity.
Build Python data sources without Scala. Integrating custom data sources previously required Scala expertise. The Python Data Source API lets you build connectors entirely in Python, using existing Python skills and libraries without learning a new language.
Debug streaming applications with queryable state. Troubleshooting stateful streaming applications has historically required indirect methods. The new state data source reader shows streaming state as queryable DataFrames. You can inspect state during debugging, test state values in unit tests, and diagnose production incidents.
Strengthen governance capabilities
Establish comprehensive audit trails with Apache Iceberg v3. The Apache Iceberg v3 table format provides transaction guarantees and tracks data changes over time, giving you the audit trails needed for regulatory compliance. When combined with VARIANT data type support, you can maintain governance controls while handling semi-structured data efficiently in data lakes.
Key capabilities
Spark 4.0.1 Preview on EMR Serverless introduces four major capability areas:
- SQL enhancements – ANSI mode, pipe syntax, VARIANT type, SQL scripting, user-defined functions (UDFs)
- Python API advances – custom data sources, UDF profiling
- Streaming improvements – stateful processing API v2, queryable state
- Table format support – Amazon S3 Tables, AWS Lake Formation integration
The following sections provide technical details and code examples for each capability.
SQL enhancements
Spark 4.0.1 introduces new SQL capabilities including ANSI mode compliance, SQL UDFs, pipe syntax for readable queries, VARIANT type for JSON handling, and SQL scripting with control flow.
ANSI SQL mode by default
ANSI SQL mode is now enabled by default, enforcing standard SQL behavior for data integrity. Silent casting of out-of-range values now raises errors rather than producing unexpected results. Existing queries may behave differently, particularly around null handling, string casting, and timestamp operations. Use spark.sql.ansi.enabled=false if you need legacy behavior during migration.
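To illustrate the difference, here is a minimal PySpark sketch (the invalid cast is illustrative) that shows ANSI mode rejecting a bad cast and the legacy flag restoring the old NULL-on-failure behavior:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-mode-check").getOrCreate()

# With ANSI mode on (the Spark 4.0.1 default), an invalid cast raises an error
try:
    spark.sql("SELECT CAST('not-a-number' AS INT) AS value").show()
except Exception as e:
    print(f"ANSI mode rejected the cast: {e}")

# Opt back into legacy behavior during migration: the same cast returns NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('not-a-number' AS INT) AS value").show()
spark.conf.set("spark.sql.ansi.enabled", "true")  # restore the default
```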
SQL pipe syntax
You can now chain SQL operations using the |> operator for improved readability. The following example shows how you can replace nested subqueries with a more maintainable pipeline:
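The sketch below chains filtering, aggregation, and ordering with the |> operator; the orders table, columns, and thresholds are hypothetical:

```python
# Illustrative pipe-syntax query; the orders table and columns are hypothetical.
spark.sql("""
    FROM orders
    |> WHERE order_date >= DATE '2025-01-01'
    |> AGGREGATE SUM(amount) AS total_amount GROUP BY region
    |> WHERE total_amount > 100000
    |> ORDER BY total_amount DESC
""").show()
```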
This replaces nested subqueries, making complex transformations easier to understand and maintain.
VARIANT data type
The VARIANT type handles semi-structured JSON/XML data efficiently without repeated parsing. It uses an optimized binary representation internally while maintaining schema-less flexibility. Previously, JSON expressions required repeated parsing, degrading performance. VARIANT eliminates this overhead. The following snippet shows how to parse JSON into the VARIANT type:
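A minimal sketch follows, assuming a hypothetical payload column of JSON strings; PARSE_JSON produces a VARIANT value and VARIANT_GET extracts a typed field:

```python
# Parse a JSON string into a VARIANT column and extract a typed field.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW raw_events AS
    SELECT '{"user": {"id": 42, "plan": "pro"}, "action": "login"}' AS payload
""")

spark.sql("""
    SELECT
        PARSE_JSON(payload) AS event,
        VARIANT_GET(PARSE_JSON(payload), '$.user.id', 'int') AS user_id
    FROM raw_events
""").show(truncate=False)
```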
Spark 4.0.1 on EMR Serverless supports Apache Iceberg v3, enabling the VARIANT data type with Iceberg tables. This combination provides efficient storage and querying of semi-structured JSON data in your data lake. Store VARIANT columns in Iceberg tables and use Iceberg’s schema evolution and time travel capabilities alongside Spark’s optimized JSON processing. The following example shows how to create an Iceberg table with a VARIANT column:
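A hedged sketch follows, assuming an Iceberg-enabled catalog named glue_catalog and hypothetical database and table names; the format-version table property requests Iceberg v3:

```python
# Create an Iceberg v3 table with a VARIANT column and insert parsed JSON.
spark.sql("""
    CREATE TABLE glue_catalog.analytics.device_events (
        event_id   BIGINT,
        event_data VARIANT
    )
    USING iceberg
    TBLPROPERTIES ('format-version' = '3')
""")

spark.sql("""
    INSERT INTO glue_catalog.analytics.device_events
    SELECT 1, PARSE_JSON('{"device": "sensor-7", "temp_c": 21.5}')
""")
```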
SQL scripting with session variables
Manage state and control flow directly in SQL using session variables, and IF/WHILE/FOR statements. The following example demonstrates a loop that populates a results table:
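A minimal sketch, assuming a hypothetical load_audit table; a variable declared inside the compound statement drives a WHILE loop (if SQL scripting is gated behind the spark.sql.scripting.enabled flag in your environment, enable it first):

```python
# Enable SQL scripting if it is not already on in your environment.
spark.conf.set("spark.sql.scripting.enabled", "true")

spark.sql("CREATE TABLE IF NOT EXISTS load_audit (batch_id INT, loaded_at TIMESTAMP) USING parquet")

# A compound statement: declare a variable, loop, and insert one row per iteration.
spark.sql("""
    BEGIN
        DECLARE batch_id INT DEFAULT 1;
        WHILE batch_id <= 3 DO
            INSERT INTO load_audit VALUES (batch_id, current_timestamp());
            SET batch_id = batch_id + 1;
        END WHILE;
    END
""")

spark.sql("SELECT * FROM load_audit ORDER BY batch_id").show()
```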
This enables complex ETL logic entirely in SQL without switching to Python.
SQL user-defined functions
Define custom functions directly in SQL. Functions can be temporary (session-scoped) or permanent (catalog-stored). The following example shows how to register and use a simple UDF:
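A minimal sketch of a temporary, session-scoped SQL UDF; the function name and logic are illustrative:

```python
# Define a session-scoped SQL UDF and call it from a query.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION to_full_name(first STRING, last STRING)
    RETURNS STRING
    RETURN concat(initcap(first), ' ', initcap(last))
""")

spark.sql("SELECT to_full_name('ada', 'lovelace') AS full_name").show()
```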
Python API advances
This section covers new Python capabilities including custom data sources and UDF profiling tools.
Python data source API
You can now build custom data sources in Python without Scala knowledge. The following example shows how to create a simple data source that returns sample data:
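A minimal sketch of a batch data source that returns a few hard-coded rows; the source name sample_numbers and its schema are illustrative:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType

class SampleNumbersReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def read(self, partition):
        # Yield tuples that match the declared schema
        for i in range(3):
            yield (i, f"row-{i}")

class SampleNumbersDataSource(DataSource):
    @classmethod
    def name(cls):
        return "sample_numbers"

    def schema(self):
        return "id INT, label STRING"

    def reader(self, schema: StructType):
        return SampleNumbersReader(schema, self.options)

# Register the data source with the session, then read from it like any other format
spark.dataSource.register(SampleNumbersDataSource)
spark.read.format("sample_numbers").load().show()
```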
Unified UDF profiling
Profile Python and Pandas UDFs for performance and memory insights. The following code enables performance profiling:

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # or "memory"
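With the profiler enabled, a minimal sketch of profiling a Python UDF follows; it assumes the session-level spark.profile.show() accessor introduced with unified profiling in PySpark 4.0, and the UDF and data are illustrative:

```python
from pyspark.sql.functions import col, udf

@udf("int")
def add_ten(x):
    # A deliberately simple Python UDF to generate profile samples
    return x + 10

spark.range(1000).select(add_ten(col("id")).alias("value")).collect()

# Print the accumulated per-UDF performance profile
spark.profile.show(type="perf")
```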
Structured streaming enhancements
This section covers improvements to stateful stream processing, including queryable state and enhanced state management APIs.
Arbitrary stateful processing API v2
The transformWithState operator provides robust state management with timer and TTL support for automatic cleanup, schema evolution capabilities, and initial state support for pre-populating state from batch DataFrames.
State data source reader
Query streaming state as a DataFrame for debugging and monitoring. Previously, state data was internal to streaming queries. Now you can verify state values in unit tests, diagnose production incidents, detect state corruption, and optimize performance. Note: This feature is experimental. Source options and behavior may change in future releases.
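A hedged sketch follows; the checkpoint path is hypothetical, and additional reader options (such as operatorId, storeName, and batchId) are available but omitted here:

```python
# Load the state of a streaming query from its checkpoint location as a DataFrame.
state_df = (
    spark.read.format("statestore")
    .load("s3://amzn-s3-demo-bucket/checkpoints/orders-agg/")
)
state_df.show(truncate=False)
```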
State store improvements
Upgraded changelog checkpointing for RocksDB removes performance bottlenecks. Enhanced checkpoint coordination and improved sorted string table (SST) file reuse management optimize streaming operations.
Table format support
This section covers support for Amazon S3 Tables and full table access (FTA) with AWS Lake Formation.
Amazon S3 Tables
Use Spark 4.0.1 with Amazon S3 Tables, a storage solution that provides managed Apache Iceberg tables with automatic optimization and maintenance. S3 Tables simplify data lake operations by handling compaction, snapshot management, and metadata cleanup automatically.
Full table access with Lake Formation
FTA is supported for Apache Iceberg, Delta Lake, and Apache Hive tables when using AWS Lake Formation, a managed service that simplifies data access control. FTA provides coarse-grained access control at the table level. Note that fine-grained access control (FGAC) with column-level or row-level permissions is not available in this preview.
Getting started
Follow these steps to create an EMR Serverless application, run sample code to test new features, and provide feedback on the preview.
Prerequisites
Before you begin, confirm you have the following:
- AWS Command Line Interface (CLI): Install and configure the AWS CLI with credentials that have permissions to create and manage EMR Serverless applications
- EMR Serverless access: Your Identity and Access Management (IAM) role must have permissions for EMR Serverless operations (emr-serverless:CreateApplication, emr-serverless:StartJobRun). See the EMR Serverless IAM policy examples for minimum required permissions
- S3 bucket: An S3 bucket to store your job scripts and data
- Runtime role: An IAM execution role for EMR Serverless with access to your S3 resources
Note: EMR Studio Notebooks and SageMaker Unified Studio are not supported during this preview. Use the AWS CLI or AWS SDK to submit jobs.
Step 1: Create your EMR Serverless application
Create or update your application with the emr-spark-8.0-preview release label. The following command creates a new application:
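A minimal sketch using boto3 (the equivalent AWS CLI create-application command also works); the application name and Region are illustrative:

```python
import boto3

# Create an EMR Serverless application on the Spark 4.0.1 preview release label.
emr_serverless = boto3.client("emr-serverless", region_name="us-east-1")

response = emr_serverless.create_application(
    name="spark-4-preview",
    releaseLabel="emr-spark-8.0-preview",
    type="SPARK",
)
application_id = response["applicationId"]
print(f"Created application: {application_id}")
```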
Step 2: Test sample code
Run this PySpark job to verify setup and test Spark 4.0.1 features:
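A sketch of such a job follows; the feature checks and values are illustrative. Save it to S3, for example under a hypothetical path such as s3://amzn-s3-demo-bucket/scripts/sample_spark4_job.py:

```python
# sample_spark4_job.py: exercises a few Spark 4.0.1 features (ANSI mode, VARIANT, SQL UDFs).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-4-preview-check").getOrCreate()

print(f"Spark version: {spark.version}")
print(f"ANSI mode: {spark.conf.get('spark.sql.ansi.enabled')}")

# VARIANT: parse a JSON string and extract a typed field
spark.sql("""
    SELECT VARIANT_GET(PARSE_JSON('{"device": "sensor-7", "temp_c": 21.5}'),
                       '$.temp_c', 'double') AS temp_c
""").show()

# SQL UDF: define and call a function entirely in SQL
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION fahrenheit(c DOUBLE)
    RETURNS DOUBLE RETURN c * 9 / 5 + 32
""")
spark.sql("SELECT fahrenheit(21.5) AS temp_f").show()

spark.stop()
```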
Submit the job with the following command:
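Continuing the boto3 sketch from step 1 (the AWS CLI start-job-run command is equivalent); the role ARN and S3 path are placeholders to replace with your own values:

```python
# Submit the PySpark script as an EMR Serverless job run.
response = emr_serverless.start_job_run(
    applicationId=application_id,
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessExecutionRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://amzn-s3-demo-bucket/scripts/sample_spark4_job.py",
        }
    },
)
print(f"Started job run: {response['jobRunId']}")
```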
Step 3: Test your workloads
Review the Spark SQL Migration Guide and PySpark Migration Guide, then test production workloads in non-production environments. Focus on queries affected by ANSI SQL mode and benchmark performance.
Step 4: Clean up resources
After testing, delete all resources created during this evaluation to avoid ongoing charges:
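A hedged boto3 sketch follows (the stop-application and delete-application CLI commands are equivalent); the application must reach the STOPPED state before deletion, and any S3 objects you created should be removed separately:

```python
import time

# Stop the preview application, wait for it to reach STOPPED, then delete it.
emr_serverless.stop_application(applicationId=application_id)

while emr_serverless.get_application(applicationId=application_id)["application"]["state"] != "STOPPED":
    time.sleep(10)

emr_serverless.delete_application(applicationId=application_id)
```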
Migration considerations
Before evaluating Spark 4.0.1, review the updated runtime requirements and behavioral changes that may affect your existing code.
Runtime requirements
- Scala: Version 2.13.16 required (2.12 support dropped)
- Java: JDK 17 or higher required (JDK 8 and 11 support removed)
- Python: Version 3.9 or later required, with continued support for 3.11 and newly added support for 3.12 (3.8 support removed)
- Pandas: Minimum version 2.0.0 (previously 1.0.5)
- SparkR: Deprecated; migrate to PySpark
Behavioral changes
With ANSI SQL mode enforcement, you may see different behavior in:
- Null handling: Stricter null propagation in expressions
- String casting: Invalid casts now raise errors instead of returning null
- Map key operations: Duplicate keys now raise errors
- Timestamp conversions: Overflow returns null instead of wrapped values
- CREATE TABLE statements: Now respect the spark.sql.sources.default configuration instead of defaulting to Hive format when USING or STORED AS clauses are omitted
You can control many of these behaviors via legacy configuration flags. For details, refer to the Spark SQL Migration Guide: 3.5 to 4.0 and the PySpark Migration Guide: 3.5 to 4.0.
Preview limitations
The following capabilities are not available in this preview:
- Fine-grained access control: Fine-grained access control (FGAC) with row-level or column-level filtering is not supported in this preview. Jobs with spark.emr-serverless.lakeformation.enabled=true will fail.
- Spark Connect: Not supported in this preview. Use standard Spark job submission with the StartJobRun API.
- Open Table Format limitations: Hudi is not supported in this preview. Delta 4.0.0 does not support Flink connectors (deprecated in Delta 4.0.0). Delta Universal Format is not supported in this preview.
- Connectors: spark-sql-kinesis, emr-dynamodb, and spark-redshift are unavailable.
- Interactive applications: Livy and JupyterEnterpriseGateway are not included. Also, SageMaker Unified Studio and EMR Studio are not supported.
- EMR features: Serverless Storage and Materialized Views are not supported.
This preview lets you evaluate Spark 4.0.1’s core capabilities on EMR Serverless, including SQL enhancements, Python API improvements, and streaming state management. Test your migration path, assess performance improvements, and provide feedback to shape the general availability release.
Conclusion
This post showed you how to get started with the Apache Spark 4.0.1 preview release on Amazon EMR Serverless. You explored how the VARIANT data type works with Iceberg v3 to process JSON data efficiently, how SQL scripting and pipe syntax eliminate context-switching for ETL development, and how queryable streaming state simplifies debugging stateful applications. You also learned about the preview limitations, runtime requirements, and behavioral changes to consider during evaluation.
Test the Spark 4.0.1 preview on EMR Serverless and provide feedback through AWS Support to help shape the general availability release.
To learn more about Apache Spark 4.0.1 features, see the Spark 4.0.1 Release Notes. For EMR Serverless documentation, see the EMR Release Guide.
Resources
Apache Spark Documentation
Amazon EMR Resources