AWS Big Data Blog

Apache Spark 4.0.1 preview now available on Amazon EMR Serverless

Amazon EMR Serverless now supports Apache Spark 4.0.1 in preview, making analytics accessible to more users, simplifying data engineering workflows, and strengthening governance capabilities. The release introduces ANSI SQL compliance, VARIANT data type support for JSON handling, Apache Iceberg v3 table format support, and enhanced streaming capabilities. This preview is available in all AWS Regions where EMR Serverless is available.

In this post, we explore key benefits, technical capabilities, and considerations for getting started with Spark 4.0.1 on Amazon EMR Serverless—a serverless deployment option that simplifies running open-source big data frameworks without requiring you to manage clusters. With the emr-spark-8.0-preview release label, you can evaluate new SQL capabilities, Python API improvements, and streaming enhancements in your existing EMR Serverless environment.

Benefits

Spark 4.0.1 helps you solve data engineering problems with specific improvements. This section shows how new capabilities help with real-world scenarios.

Make analytics accessible to more users

Simplify extract, transform, and load (ETL) development with SQL scripting. Data engineers often switch between SQL and Python to build complex ETL logic with control flow. SQL scripting in Spark 4.0.1 enables loops, conditionals, and session variables directly in SQL, reducing context-switching and simplifying pipeline development. Use pipe syntax (|>) to chain operations for more readable, maintainable queries.

Improve data quality with ANSI SQL mode. Silent type conversion failures can introduce data quality issues. ANSI SQL mode, now enabled by default, enforces standard SQL behavior, raising errors for invalid operations instead of producing unexpected results. Test your queries thoroughly during this preview evaluation.

Simplify data engineering workflows

Process JSON data efficiently with VARIANT. Teams working with semi-structured data often see slow performance from repeated JSON parsing. The VARIANT data type stores JSON in an optimized binary format, eliminating parsing overhead. You can efficiently store and query JSON data in data lakes without schema rigidity.

Build Python data sources without Scala. Integrating custom data sources previously required Scala expertise. The Python Data Source API lets you build connectors entirely in Python, using existing Python skills and libraries without learning a new language.

Debug streaming applications with queryable state. Troubleshooting stateful streaming applications has historically required indirect methods. The new state data source reader shows streaming state as queryable DataFrames. You can inspect state during debugging, test state values in unit tests, and diagnose production incidents.

Strengthen governance capabilities

Establish comprehensive audit trails with Apache Iceberg v3. The Apache Iceberg v3 table format provides transaction guarantees and tracks data changes over time, giving you the audit trails needed for regulatory compliance. When combined with VARIANT data type support, you can maintain governance controls while handling semi-structured data efficiently in data lakes.

Key capabilities

Spark 4.0.1 Preview on EMR Serverless introduces four major capability areas:

  1. SQL enhancements – ANSI mode, pipe syntax, VARIANT type, SQL scripting, user-defined functions (UDFs)
  2. Python API advances – custom data sources, UDF profiling
  3. Streaming improvements – stateful processing API v2, queryable state
  4. Table format support – Amazon S3 Tables, AWS Lake Formation integration

The following sections provide technical details and code examples for each capability.

SQL enhancements

Spark 4.0.1 introduces new SQL capabilities including ANSI mode compliance, SQL UDFs, pipe syntax for readable queries, VARIANT type for JSON handling, and SQL scripting with control flow.

ANSI SQL mode by default

ANSI SQL mode is now enabled by default, enforcing standard SQL behavior for data integrity. Silent casting of out-of-range values now raises errors rather than producing unexpected results. Existing queries may behave differently, particularly around null handling, string casting, and timestamp operations. Use spark.sql.ansi.enabled=false if you need legacy behavior during migration.
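The following is a minimal sketch of the difference, assuming a SparkSession named spark; an invalid cast raises an error under ANSI mode and returns NULL only after falling back to legacy behavior:

# With ANSI mode on (the default), an invalid cast raises an error at execution time
try:
    spark.sql("SELECT CAST('abc' AS INT)").show()
except Exception as e:
    print(f"ANSI mode raised: {e}")

# Temporarily fall back to legacy behavior during migration: the cast returns NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()

# Re-enable ANSI mode after testing
spark.conf.set("spark.sql.ansi.enabled", "true")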

SQL pipe syntax

You can now chain SQL operations using the |> operator for improved readability. The following example shows how you can replace nested subqueries with a more maintainable pipeline:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
|> AGGREGATE COUNT(o_orderkey) c_count GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist GROUP BY c_count
|> ORDER BY custdist DESC

This replaces nested subqueries, making complex transformations easier to understand and maintain.

VARIANT data type

The VARIANT type handles semi-structured JSON/XML data efficiently without repeated parsing. It uses an optimized binary representation internally while maintaining schema-less flexibility. Previously, JSON expressions required repeated parsing, degrading performance. VARIANT eliminates this overhead. The following snippet shows how to parse JSON into the VARIANT type:

df = spark.sql("SELECT parse_json('{\"name\":\"Alice\",\"age\":30}') as data")
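To read values back out of a VARIANT column, you can use the variant_get function with a JSON path and a target type. The following is a short sketch that builds on the DataFrame above, assuming a SparkSession named spark:

# Extract typed values from the VARIANT column with variant_get
df.selectExpr(
    "variant_get(data, '$.name', 'string') AS name",
    "variant_get(data, '$.age', 'int') AS age"
).show()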

Spark 4.0.1 on EMR Serverless supports Apache Iceberg v3, enabling the VARIANT data type with Iceberg tables. This combination provides efficient storage and querying of semi-structured JSON data in your data lake. Store VARIANT columns in Iceberg tables and use Iceberg’s schema evolution and time travel capabilities alongside Spark’s optimized JSON processing. The following example shows how to create an Iceberg table with a VARIANT column:

CREATE TABLE catalog.db.events (
  event_id BIGINT,
  event_data VARIANT,
  timestamp TIMESTAMP
) USING iceberg;

INSERT INTO catalog.db.events SELECT 1, parse_json('{"user":"alice","action":"login"}'), current_timestamp();

SQL scripting with session variables

Manage state and control flow directly in SQL using session variables and IF/WHILE/FOR statements. The following example demonstrates a loop that populates a results table:

BEGIN
  DECLARE counter INT = 10;
  WHILE counter > 0 DO
    INSERT INTO results VALUES (counter);
    SET counter = counter - 1;
  END WHILE;
END

This enables complex ETL logic entirely in SQL without switching to Python.
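Session variables also work outside scripting blocks and persist for the duration of the session. The following is a minimal sketch, assuming a SparkSession named spark:

# Declare a session variable, update it, and reference it in a query
spark.sql("DECLARE VARIABLE min_age INT DEFAULT 30")
spark.sql("SET VAR min_age = 40")
spark.sql("SELECT min_age AS current_threshold").show()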

SQL user-defined functions

Define custom functions directly in SQL. Functions can be temporary (session-scoped) or permanent (catalog-stored). The following example shows how to register and use a simple UDF:

CREATE FUNCTION plusOne(x INT) RETURNS INT RETURN x + 1;
SELECT plusOne(5);
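A session-scoped variant uses the TEMPORARY keyword. The following sketch, run here through spark.sql, registers a temporary UDF and calls it; the function name and logic are illustrative:

# Temporary SQL UDF, visible only in the current session
spark.sql("""
    CREATE TEMPORARY FUNCTION greet(name STRING)
    RETURNS STRING
    RETURN CONCAT('Hello, ', name)
""")
spark.sql("SELECT greet('Spark 4.0.1') AS greeting").show()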

Python API advances

This section covers new Python capabilities including custom data sources and UDF profiling tools.

Python data source API

You can now build custom data sources in Python without Scala knowledge. The following example shows how to create a simple data source that returns sample data:

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

class SampleDataSource(DataSource):
    def schema(self):
        # Schema of the rows this data source produces
        return StructType([
            StructField("name", StringType()),
            StructField("age", IntegerType())
        ])

    def reader(self, schema):
        # Return the reader that generates the data
        return SampleReader()

class SampleReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples that match the schema
        yield ("Alice", 30)
        yield ("Bob", 25)

# Register the data source and read from it by class name
spark.dataSource.register(SampleDataSource)
spark.read.format("SampleDataSource").load().show()

Unified UDF profiling

Profile Python and Pandas UDFs for performance and memory insights. The following setting enables performance profiling:

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")  # or "memory"
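The following is a minimal sketch of profiling a Python UDF with the setting above; it assumes a SparkSession named spark and uses the spark.profile accessor that accompanies the unified profiler in PySpark 4.x:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def add_one(x):
    # Simple UDF whose execution is captured by the profiler
    return x + 1

# Run the UDF so the profiler collects results
spark.range(100).select(add_one("id")).collect()

# Display accumulated performance results for the UDFs above
spark.profile.show(type="perf")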

Structured streaming enhancements

This section covers improvements to stateful stream processing, including queryable state and enhanced state management APIs.

Arbitrary stateful processing API v2

The transformWithState operator provides robust state management with timer and TTL support for automatic cleanup, schema evolution capabilities, and initial state support for pre-populating state from batch DataFrames.

State data source reader

Query streaming state as a DataFrame for debugging and monitoring. Previously, state data was internal to streaming queries. Now you can verify state values in unit tests, diagnose production incidents, detect state corruption, and optimize performance. Note: This feature is experimental. Source options and behavior may change in future releases.
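The following is a hedged sketch of loading state from an existing streaming query checkpoint, assuming a SparkSession named spark; the checkpoint path is a placeholder:

# Read the latest state for a streaming query as a DataFrame
state_df = spark.read.format("statestore").load("s3://<your-bucket>/<checkpoint-location>")
state_df.show(truncate=False)

# The companion state-metadata source lists the operators and partitions in the checkpoint
spark.read.format("state-metadata").load("s3://<your-bucket>/<checkpoint-location>").show()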

State store improvements

Upgraded changelog checkpointing for RocksDB removes performance bottlenecks. Enhanced checkpoint coordination and improved sorted string table (SST) file reuse management optimize streaming operations.
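As a reference point, the following sketch shows the standard configuration for opting in to the RocksDB state store provider with changelog checkpointing; it assumes a SparkSession named spark and is set before starting the streaming query:

# Use RocksDB for streaming state and enable changelog checkpointing
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true")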

Table format support

This section covers support for Amazon S3 Tables and full table access (FTA) with AWS Lake Formation.

Amazon S3 Tables

Use Spark 4.0.1 with Amazon S3 Tables, a storage solution that provides managed Apache Iceberg tables with automatic optimization and maintenance. S3 Tables simplify data lake operations by handling compaction, snapshot management, and metadata cleanup automatically.
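The following is a hedged sketch of pointing Spark at an S3 table bucket through the Iceberg catalog API. The catalog name, table bucket ARN, namespace, and availability of the S3 Tables catalog dependency are assumptions; refer to the Amazon S3 Tables documentation for the exact settings for your environment:

from pyspark.sql import SparkSession

# Register an Iceberg catalog backed by an S3 table bucket (names and ARN are placeholders)
spark = (SparkSession.builder
    .appName("S3TablesExample")
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:<account-id>:bucket/<table-bucket-name>")
    .getOrCreate())

spark.sql("SELECT * FROM s3tables.<namespace>.<table> LIMIT 10").show()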

Full table access with Lake Formation

FTA is supported for Apache Iceberg, Delta Lake, and Apache Hive tables when using AWS Lake Formation, a managed service that simplifies data access control. FTA provides coarse-grained access control at the table level. Note that fine-grained access control (FGAC) with column-level or row-level permissions is not available in this preview.

Getting started

Follow these steps to create an EMR Serverless application, run sample code to test new features, and provide feedback on the preview.

Prerequisites

Before you begin, confirm you have the following:

  • An AWS account with permissions to use Amazon EMR Serverless
  • An AWS Identity and Access Management (IAM) job execution role that can access your Amazon S3 bucket
  • An S3 bucket to store the sample PySpark script
  • The AWS Command Line Interface (AWS CLI) installed and configured

Note: EMR Studio Notebooks and SageMaker Unified Studio are not supported during this preview. Use the AWS CLI or AWS SDK to submit jobs.

Step 1: Create your EMR Serverless application

Create or update your application with the emr-spark-8.0-preview release label. The following command creates a new application:

aws emr-serverless create-application --type spark \
  --release-label emr-spark-8.0-preview \
  --region us-east-1 --name spark4-test

Step 2: Test sample code

Run this PySpark job to verify setup and test Spark 4.0.1 features:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark 4.0.1 Test").getOrCreate()
print(f"Spark Version: {spark.version}")

# Create sample data
data = [("Alice", 34, "Engineering"), ("Bob", 45, "Sales"),
        ("Charlie", 28, "Engineering"), ("Diana", 52, "Marketing")]
df = spark.createDataFrame(data, ["name", "age", "department"])
df.createOrReplaceTempView("employees")

# Test SQL PIPE syntax
try:
    result = spark.sql("""
        FROM employees
        |> WHERE age > 30
        |> SELECT name, age, department
        |> ORDER BY age DESC
    """)
    result.show()
    print("✓ SQL pipe syntax test passed")
except Exception as e:
    print(f"✗ SQL pipe syntax test failed: {e}")

# Test VARIANT data type
try:
    json_data = spark.sql("""
        SELECT parse_json('{"name":"Alice","skills":["Python","Spark","SQL"]}') as data
    """)
    json_data.show(truncate=False)
    print("✓ VARIANT data type test passed")
except Exception as e:
    print(f"✗ VARIANT data type test failed: {e}")

Submit the job with the following command:

aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-execution-role-arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket>/spark_4_test.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=4 --conf spark.executor.memory=16g"
        }
    }'

Step 3: Test your workloads

Review the Spark SQL Migration Guide and PySpark Migration Guide, then test production workloads in non-production environments. Focus on queries affected by ANSI SQL mode and benchmark performance.

Step 4: Clean up resources

After testing, delete all resources created during this evaluation to avoid ongoing charges:

# Stop the application before deleting it (it must not be in a started state)
aws emr-serverless stop-application \
    --application-id <your-application-id> \
    --region us-east-1
# Delete the EMR Serverless application
aws emr-serverless delete-application \
    --application-id <your-application-id> \
    --region us-east-1
# Remove the test script from S3
aws s3 rm s3://<your-bucket>/spark_4_test.py

Migration considerations

Before evaluating Spark 4.0.1, review the updated runtime requirements and behavioral changes that may affect your existing code.

Runtime requirements

  • Scala: Version 2.13.16 required (2.12 support dropped)
  • Java: JDK 17 or higher required (JDK 8 and 11 support removed)
  • Python: Version 3.9 or later required, with continued support for 3.11 and newly added support for 3.12 (3.8 support removed)
  • Pandas: Minimum version 2.0.0 (previously 1.0.5)
  • SparkR: Deprecated; migrate to PySpark

Behavioral changes

With ANSI SQL mode enforcement, you may see different behavior in:

  • Null handling: Stricter null propagation in expressions
  • String casting: Invalid casts now raise errors instead of returning null
  • Map key operations: Duplicate keys now raise errors
  • Timestamp conversions: Overflow returns null instead of wrapped values
  • CREATE TABLE statements: Now respect the spark.sql.sources.default configuration instead of defaulting to Hive format when USING or STORED AS clauses are omitted

You can control many of these behaviors via legacy configuration flags. For details, refer to the Spark SQL Migration Guide: 3.5 to 4.0 and the PySpark Migration Guide: 3.5 to 4.0.

Preview limitations

The following capabilities are not available in this preview:

  • Fine-grained access control: Fine-grained access control (FGAC) with row-level or column-level filtering is not supported in this preview. Jobs with spark.emr-serverless.lakeformation.enabled=true will fail.
  • Spark Connect: Not supported in this preview. Use standard Spark job submission with the StartJobRun API.
  • Open Table Format limitations: Hudi is not supported in this preview. Delta 4.0.0 does not support Flink connectors (deprecated in Delta 4.0.0). Delta Universal Format is not supported in this preview.
  • Connectors: spark-sql-kinesis, emr-dynamodb, and spark-redshift are unavailable.
  • Interactive applications: Livy and JupyterEnterpriseGateway are not included. Also, SageMaker Unified Studio and EMR Studio are not supported.
  • EMR features: Serverless Storage and Materialized Views are not supported.

This preview lets you evaluate Spark 4.0.1’s core capabilities on EMR Serverless, including SQL enhancements, Python API improvements, and streaming state management. Test your migration path, assess performance improvements, and provide feedback to shape the general availability release.

Conclusion

This post showed you how to get started with the Apache Spark 4.0.1 preview release on Amazon EMR Serverless. You explored how the VARIANT data type works with Iceberg v3 to process JSON data efficiently, how SQL scripting and pipe syntax eliminate context-switching for ETL development, and how queryable streaming state simplifies debugging stateful applications. You also learned about the preview limitations, runtime requirements, and behavioral changes to consider during evaluation.

Test the Spark 4.0.1 preview on EMR Serverless and provide feedback through AWS Support to help shape the general availability release.

To learn more about Apache Spark 4.0.1 features, see the Spark 4.0.1 Release Notes. For EMR Serverless documentation, see the EMR Release Guide.

Resources

Apache Spark Documentation

Amazon EMR Resources


About the authors

Al MS

Al is a product manager for Amazon EMR at Amazon Web Services.

Emilie Faracci

Emilie is a Software Development Engineer at Amazon Web Services, working on Amazon EMR. She focuses on Spark development and has contributed to the open-source Apache Spark 4.0.1 release.

Karthik Prabhakar

Karthik is a Data Processing Engines Architect for Amazon EMR at AWS. He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.