AWS Big Data Blog

Upgrade PySpark from Spark 3.5 to Spark 4.0 with AWS Spark Upgrade Agent

Upgrading Apache Spark applications across major versions means tracking down breaking changes, manually debugging failures from log files, and running repeated test cycles. This process can stretch across weeks for complex code bases.

In this post, we walk through a hands-on PySpark migration from Spark 3.5 to Spark 4.0 on Amazon EMR Serverless, using the AWS Spark Upgrade Agent. You’ll see how the agent iteratively validates your application on a live Amazon EMR Serverless application, automatically diagnosing and resolving failures from Amazon CloudWatch logs until the job succeeds. By the end, you have a multi-pipeline PySpark application running on Spark 4.0 with four distinct breaking changes resolved. The fixes include configuration key removals, codec renames, and stricter charset validation, all driven through natural language interaction in the Integrated Development Environment (IDE).

This is part 2 of a three-part series on how the AWS Spark Upgrade Agent can automate and simplify Spark upgrades.

Part 1 introduced the agent’s architecture and capabilities, and Part 3 covers Scala application upgrades.

In the sections that follow, you will set up the prerequisites and infrastructure, explore the sample application, run the iterative validation workflow on EMR Serverless, review data quality results, and generate a comprehensive upgrade summary.

Note: Because this upgrade is performed using the AWS Spark Upgrade Agent Model Context Protocol (MCP) server, an agentic artificial intelligence (AI) system, the agent might take different paths to reach the same successful outcome. The workflow demonstrated here represents one successful upgrade path. The key takeaway is the end-to-end workflow: generating an upgrade plan, iteratively validating on Amazon EMR Serverless, and producing a comprehensive upgrade summary.

1. Prerequisites and setup

This section covers the tools, infrastructure, and IDE configuration you need before starting the upgrade. To follow along, you need an AWS account with an AWS Identity and Access Management (AWS IAM) user or role that has permissions to deploy AWS CloudFormation stacks, create AWS IAM roles and policies, and create Amazon EMR Serverless applications. Intermediate knowledge of AWS Command Line Interface (AWS CLI), AWS CloudFormation, and Python is helpful.

1.1 Install Kiro CLI and local tools

In this post, we use Kiro CLI to demonstrate the upgrade workflow. You can also use another MCP-compatible IDE or framework, such as VS Code with Cline, Cursor, Windsurf, or Claude Desktop. To follow along with Kiro CLI, install it on your workstation. For more details on installation and setup, refer to Setup for Upgrade Agent:

curl -fsSL https://cli.kiro.dev/install | bash

Run the following command and use your builder ID to log in:

kiro-cli login --use-device-flow

With Kiro CLI installed and logged in, you can have it set up and verify the remaining prerequisites instead of installing them manually. Use the following prompt:

kiro-cli chat
> Install AWS CLI, Python 3.10, and uv on my system if they are not already installed

[Figure: Kiro CLI output showing successful installation of AWS CLI, Python, and uv.]

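The upgrade workflow needs these three local tools: the AWS CLI, Python 3.10 or later, and uv. The AWS CLI configures the profile and deploys the stacks in the next section, and uv provides the uvx command that launches the MCP proxy.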

1.2 Infrastructure setup (AWS CloudFormation)

Two AWS CloudFormation stacks create the required resources. The first creates an AWS IAM role and an Amazon Simple Storage Service (Amazon S3) staging bucket. The second creates the source and target Amazon EMR Serverless applications (Spark 3.5.0 and Spark 4.0.1) and their shared execution role.

Stack 1 – AWS IAM role and Amazon S3 staging bucket:

The spark-upgrade-mcp-setup template creates the AWS IAM role and Amazon S3 staging bucket required by the upgrade agent. Choose the Launch Stack button for your Region. For additional Regions, see the full region list.

# | Region | Launch
1 | US East (N. Virginia) | Launch Stack
2 | US East (Ohio) | Launch Stack
3 | US West (Oregon) | Launch Stack
4 | Europe (Ireland) | Launch Stack

After deployment, open the AWS CloudFormation Outputs tab, copy the ExportCommand value, and run it in your terminal. This sets SMUS_MCP_REGION, IAM_ROLE, and STAGING_BUCKET_PATH automatically.

[Figure: Outputs tab of the AWS CloudFormation stack, showing the ExportCommand with SMUS_MCP_REGION, IAM_ROLE, and STAGING_BUCKET_PATH values.]

# Sets SMUS_MCP_REGION, IAM_ROLE, and STAGING_BUCKET_PATH
export SMUS_MCP_REGION=<YOUR-REGION> && export IAM_ROLE=arn:aws:iam::<YOUR-ACCOUNT-ID>:role/spark-upgrade-role-* && export STAGING_BUCKET_PATH=<YOUR-BUCKET>

Then configure the AWS CLI profile:

aws configure set profile.spark-upgrade-profile.role_arn ${IAM_ROLE}
aws configure set profile.spark-upgrade-profile.source_profile default
aws configure set profile.spark-upgrade-profile.region ${SMUS_MCP_REGION}
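The three commands above write a named profile into the AWS CLI configuration file (by default ~/.aws/config). As a rough illustration of what that profile section looks like, the following sketch renders the equivalent INI text in Python. The function name render_profile and the sample ARN are hypothetical, and the exact spacing the AWS CLI writes may differ.

```python
import configparser
import io


def render_profile(role_arn: str, region: str,
                   name: str = "spark-upgrade-profile",
                   source_profile: str = "default") -> str:
    """Return INI text equivalent to the profile the three commands create."""
    config = configparser.ConfigParser()
    # Named profiles live under a "[profile <name>]" section in the config file.
    config[f"profile {name}"] = {
        "role_arn": role_arn,
        "source_profile": source_profile,
        "region": region,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Sample values only; use the IAM_ROLE and SMUS_MCP_REGION from your ExportCommand.
    print(render_profile("arn:aws:iam::123456789012:role/example-role", "us-east-1"))
```

The profile assumes the upgrade role from your default credentials, which is why source_profile points at default.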

Stack 2 – Amazon EMR Serverless source and target applications and execution role:

Clone the sample repository and change to the demo directory:

git clone https://github.com/aws-samples/sample-amazon-emr-spark4-examples
cd sample-amazon-emr-spark4-examples/pyspark/AWSSpark4AutoUpgradeDemo

The PySpark sample lives at resources/global_logistics_platform/. The AWS CloudFormation template lives at resources/cloudformation/.

Deploy the AWS CloudFormation template to create the source and target Amazon EMR Serverless applications and a shared execution role:

aws cloudformation deploy \
  --template-file resources/cloudformation/emr-serverless-target-setup.yaml \
  --stack-name spark-emr-serverless-upgrade \
  --region ${SMUS_MCP_REGION} \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    StagingBucketName=${STAGING_BUCKET_PATH} \
    SourceReleaseLabel=emr-7.0.0 \
    TargetReleaseLabel=emr-spark-8.0-preview \
    SourceApplicationName=spark-upgrade-source \
    TargetApplicationName=spark-upgrade-target

This creates two Amazon EMR Serverless applications with a shared execution role: a source (Spark 3.5.0) for the data quality baseline and a target (Spark 4.0.1) for upgrade validation. Both applications auto-stop after 15 minutes of idle time, so there is no cost when they are not in use. To upgrade between different Spark versions, set SourceReleaseLabel and TargetReleaseLabel to the Amazon EMR release labels of your source and target versions.

After the stack completes deployment, note the outputs:

aws cloudformation describe-stacks \
  --stack-name spark-emr-serverless-upgrade \
  --region ${SMUS_MCP_REGION} \
  --query "Stacks[0].Outputs" --output table

This gives you the SourceApplicationId, TargetApplicationId, and ExecutionRoleArn needed for the upgrade prompt. Make a note of them.
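If you prefer to capture these values programmatically, you can run the same describe-stacks command with --output json and without the --query filter, then read the outputs from the response. The following is a minimal sketch, assuming the standard describe-stacks response shape (a Stacks list whose entries carry an Outputs list of OutputKey/OutputValue pairs). The helper name stack_outputs is hypothetical.

```python
import json


def stack_outputs(describe_stacks_json: str) -> dict[str, str]:
    """Map each OutputKey to its OutputValue for the first stack in the response."""
    response = json.loads(describe_stacks_json)
    outputs = response["Stacks"][0].get("Outputs", [])
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


if __name__ == "__main__":
    # Placeholder values standing in for a real describe-stacks response.
    sample = json.dumps({"Stacks": [{"Outputs": [
        {"OutputKey": "SourceApplicationId", "OutputValue": "<YOUR-SOURCE-APP-ID>"},
        {"OutputKey": "TargetApplicationId", "OutputValue": "<YOUR-TARGET-APP-ID>"},
        {"OutputKey": "ExecutionRoleArn", "OutputValue": "<YOUR-EXECUTION-ROLE-ARN>"},
    ]}]})
    for key, value in stack_outputs(sample).items():
        print(f"{key} = {value}")
```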

1.3 IDE and MCP server configuration

Configure the spark-upgrade MCP server. For Kiro CLI:

kiro-cli-chat mcp add \
    --name "spark-upgrade" \
    --command "uvx" \
    --args '[
      "mcp-proxy-for-aws@latest",
      "https://sagemaker-unified-studio-mcp.'${SMUS_MCP_REGION}'.api.aws/spark-upgrade/mcp",
      "--service", "sagemaker-unified-studio-mcp",
      "--profile", "spark-upgrade-profile",
      "--region", "'${SMUS_MCP_REGION}'",
      "--read-timeout", "180"
    ]' \
    --timeout 180000 \
    --scope global

For other MCP clients, refer to your IDE’s MCP configuration documentation and use the same server parameters shown previously.
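Many MCP clients take a JSON entry with a command and an args list for each server. As a hedged sketch, the following Python builds such an entry from the same parameters as the Kiro CLI command above. The function name spark_upgrade_server_entry is hypothetical, and the outer "mcpServers" key is an assumption that varies by client, so check your client's documentation for the exact file and schema.

```python
import json


def spark_upgrade_server_entry(region: str,
                               profile: str = "spark-upgrade-profile") -> dict:
    """Build the command and arguments used to launch the spark-upgrade MCP proxy."""
    endpoint = f"https://sagemaker-unified-studio-mcp.{region}.api.aws/spark-upgrade/mcp"
    return {
        "command": "uvx",
        "args": [
            "mcp-proxy-for-aws@latest",
            endpoint,
            "--service", "sagemaker-unified-studio-mcp",
            "--profile", profile,
            "--region", region,
            "--read-timeout", "180",
        ],
    }


if __name__ == "__main__":
    # "mcpServers" is an assumed wrapper key; your client may expect a different layout.
    entry = {"mcpServers": {"spark-upgrade": spark_upgrade_server_entry("us-east-1")}}
    print(json.dumps(entry, indent=2))
```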

Verify the connection: Start Kiro CLI and confirm the spark-upgrade tools are loaded:

$ kiro-cli chat
...
spark-upgrade (MCP):
- generate_spark_upgrade_plan          * not trusted
- update_build_configuration           * not trusted
- fix_upgrade_failure                  * not trusted
- run_validation_job                   * not trusted
- check_job_status                     * not trusted
...

Tip: After Kiro CLI and the MCP server are configured, you can ask the agent to verify your setup. For example: “Check if I have AWS CLI, Python 3.10+, and uv installed, and confirm the spark-upgrade MCP server is connected.”

[Figure: Kiro CLI output confirming the spark-upgrade MCP server connection and showing the status of each tool, the AWS CLI, and the MCP server.]

Tip: Trust mode vs. confirm mode: When running the upgrade agent in Kiro CLI, you have two options:

Trust mode: Type t when prompted to approve a tool. The agent auto-approves subsequent uses of that tool without asking for confirmation. You can also use /tools trust-all to trust every tool at once for a fully autonomous experience.

Confirm mode: Type y for each individual tool invocation. This lets you review, verify, and approve every action before the agent runs it. If this is your first time using the agent, use confirm mode for full visibility.

2. Hands-on PySpark upgrade from Spark 3.5 to Spark 4.0

This section walks through the complete migration of a representative PySpark application from Amazon EMR Serverless 7.0.0 (Spark 3.5.0) to EMR Serverless with the emr-spark-8.0-preview release label (Spark 4.0.1), using the global_logistics_platform sample.

2.1 Sample project: global logistics platform

The sample application is a multi-domain PySpark data processing application with three pipelines:

  • Fleet management: Processes vehicle telemetry data (GPS tracking, fuel consumption, driver behavior scoring) using window functions, lag/lead operations, and statistical aggregations. Writes Parquet with lz4raw compression.
  • International shipping: Handles cross-border shipment documents with multi-language address standardization using character encoding functions (encode/decode with charsets like Shift_JIS, GB2312, EUC-KR), and processes carrier manifests with ISO-8859-1 encoding.
  • Historical compliance: Processes regulatory audit records spanning centuries (including pre-1582 Julian calendar dates), requiring legacy datetime rebasing for Parquet writes.

Project structure:

global_logistics_platform/
├── main.py                          # Orchestrator - runs all 3 pipelines
├── src/
│   ├── utils/
│   │   └── spark_config.py          # Spark session config & logging
│   └── domain/                      # Application code that needs migration
│       ├── fleet_management/
│       │   └── telemetry_processor.py
│       ├── international_shipping/
│       │   └── shipment_processor.py
│       └── historical_compliance/
│           └── compliance_processor.py
└── data/                             # Sample dataset for the workflow
    └── sample/
        ├── fleet_telemetry.csv
        ├── international_shipments.csv
        └── compliance_records.csv

2.2 The four Spark 4.0 incompatibilities

Before diving into the upgrade, here are the four specific breaking changes present in this code base that the agent discovers and resolves entirely through runtime validation:

# | Incompatibility | File(s)
1 | Legacy Parquet configuration key removed: spark.sql.legacy.parquet.datetimeRebaseModeInWrite removed in Spark 4.0. Must use spark.sql.parquet.datetimeRebaseModeInWrite. | spark_config.py
2 | Parquet compression codec rename: lz4raw codec renamed to lz4_raw in Spark 4.0. | telemetry_processor.py
3 | Stricter charset encoding validation: Spark 4.0 tightened encode() behavior. Encoding CJK (Chinese, Japanese, Korean) characters to ISO-8859-1 now throws MALFORMED_CHARACTER_CODING. In Spark 3.x this silently replaced unmappable chars with ?. Restored via spark.sql.legacy.codingErrorAction. | spark_config.py
4 | Character encoding restrictions: encode()/decode() in Spark 4.0 supports US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16, and UTF-32. Code uses Shift_JIS, GB2312, EUC-KR. | shipment_processor.py

The agent resolves each of these through iterative runtime validation on EMR Serverless: submitting the job, diagnosing failures from Amazon CloudWatch logs, applying fixes, and resubmitting until the job succeeds.

[Figure: Architecture diagram showing the iterative validation workflow between the IDE, MCP server, and Amazon EMR Serverless.]

2.3 Step 1: Invoke the upgrade agent

Open the project in Kiro CLI and enter the following prompt:

Upgrade my Spark application in the current directory from EMR serverless version 7.0.0 to EMR serverless version 8.0.0.
Use Amazon EMR Serverless target app-id <YOUR-TARGET-APP-ID> and execution role
<YOUR-EXECUTION-ROLE-ARN> for validation.
Use source Amazon EMR Serverless app-id <YOUR-SOURCE-APP-ID> for data quality baseline.
Store artifacts at s3://${STAGING_BUCKET_PATH}/spark4-upgrade/python/
Enable data quality validation

Tip: The SourceApplicationId, TargetApplicationId, and ExecutionRoleArn are in the Outputs of the spark-emr-serverless-upgrade AWS CloudFormation stack you deployed in Section 1.2.

The agent invokes generate_spark_upgrade_plan, scans the project structure, identifies the Spark version mapping (EMR 7.0.0 → Spark 3.5.0, EMR 8.0.0 → Spark 4.0.1), and produces a structured upgrade plan with an Analysis ID for traceability.

The agent presents the plan and asks for confirmation. Type y to approve the tool invocation, or t to trust that tool for the rest of the session.

You can save the plan as a local JSON file for future reference or to resume the upgrade later. Ask Kiro to save it locally, and tell it which AWS CLI profile to use, with the following prompt:

Yes I would like to save the plan to a local file and use spark-upgrade-profile

2.4 Step 2: Build and package

The agent validates the Python project compiles successfully, then packages it for Amazon EMR Serverless deployment:

  • Runs py_compile on each .py file to verify syntax.
  • Creates src.zip containing the src/ directory (preserving the import structure used by from src.utils import ...).
  • Uploads src.zip, main.py, and sample input data to the Amazon S3 staging path.

# What the agent does behind the scenes:
zip -r src.zip src/
aws s3 cp main.py s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/source/main.py
aws s3 cp src.zip s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/source/src.zip
aws s3 cp data/sample/ s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/input/ --recursive

This sample has no external dependencies (there is no requirements.txt), so no virtual environment is needed. If your project lists external dependencies in a requirements.txt, the agent packages them into a virtual environment archive and includes it in the EMR Serverless submission parameters.
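To reproduce the compile-and-package step locally, the following sketch uses only the Python standard library. It is an approximation of the steps described above, not the agent's own implementation: the function name compile_and_package is hypothetical, and unlike zip -r it archives only the .py files under src/.

```python
import py_compile
import tempfile
import zipfile
from pathlib import Path


def compile_and_package(project_root: Path, archive_name: str = "src.zip") -> Path:
    """Syntax-check main.py and every .py file under src/, then zip src/."""
    project_root = Path(project_root)
    sources = sorted((project_root / "src").rglob("*.py"))
    for source in sources + [project_root / "main.py"]:
        # doraise=True turns a syntax error into py_compile.PyCompileError.
        py_compile.compile(str(source), doraise=True)

    archive_path = project_root / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sources:
            # Keep the leading "src/" so that "from src.utils import ..." still resolves.
            archive.write(source, source.relative_to(project_root).as_posix())
    return archive_path


if __name__ == "__main__":
    # Tiny throwaway project that mirrors the sample layout.
    root = Path(tempfile.mkdtemp())
    (root / "src" / "utils").mkdir(parents=True)
    (root / "src" / "utils" / "spark_config.py").write_text("APP_NAME = 'demo'\n")
    (root / "main.py").write_text("print('ok')\n")
    with zipfile.ZipFile(compile_and_package(root)) as archive:
        print(archive.namelist())
```

Storing paths relative to the project root is what preserves the src/ prefix inside the archive, which the --py-files import path depends on.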

2.5 Step 3: Data quality baseline on source application

Before migrating the code, the agent establishes a data quality baseline by running the original (pre-upgrade) code on the source Amazon EMR Serverless application (Spark 3.5.0 / EMR 7.0.0). This captures the expected output that the upgraded application must match.

The agent submits the job to the source application with the data quality check enabled:

{
  "executionRoleArn": "arn:aws:iam::<YOUR-ACCOUNT-ID>:role/<YOUR-EXECUTION-ROLE>",
  "jobDriver": {
    "sparkSubmit": {
      "entryPoint": "s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/source/main.py",
      "entryPointArguments": [
        "s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/input/",
        "s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/output/source/"
      ],
      "sparkSubmitParameters": "--py-files s3://<YOUR-BUCKET>/spark4-upgrade/python/<ANALYSIS-ID>/source/src.zip"
    }
  },
  "configurationOverrides": {
    "monitoringConfiguration": {
      "cloudWatchLoggingConfiguration": {
        "enabled": true,
        "logGroupName": "/aws/emr-serverless"
      }
    }
  }
}

The agent monitors the source run via check_job_status until it completes successfully. This baseline output is stored for comparison after the target validation succeeds.
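The job configuration above follows a fixed layout under the staging prefix, so it can be generated rather than edited by hand. The following sketch builds the same structure in Python. The function name build_job_run_request and the run_label parameter are hypothetical, and using a label other than "source" for the output folder (for example, for the target run) is an assumption, because the post shows only the source configuration.

```python
import json


def build_job_run_request(bucket: str, analysis_id: str,
                          execution_role_arn: str, run_label: str = "source") -> dict:
    """Build a job configuration shaped like the one shown above."""
    base = f"s3://{bucket}/spark4-upgrade/python/{analysis_id}"
    return {
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": f"{base}/source/main.py",
                # main.py receives the input and output locations as arguments.
                "entryPointArguments": [f"{base}/input/", f"{base}/output/{run_label}/"],
                "sparkSubmitParameters": f"--py-files {base}/source/src.zip",
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "cloudWatchLoggingConfiguration": {
                    "enabled": True,
                    "logGroupName": "/aws/emr-serverless",
                }
            }
        },
    }


if __name__ == "__main__":
    request = build_job_run_request(
        "<YOUR-BUCKET>", "<ANALYSIS-ID>",
        "arn:aws:iam::<YOUR-ACCOUNT-ID>:role/<YOUR-EXECUTION-ROLE>")
    print(json.dumps(request, indent=2))
```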

2.6 Step 4: Iterative runtime validation on target application

This is the core of the upgrade. The agent submits the unmodified application to the target Amazon EMR Serverless application (Spark 4.0.1), and every incompatibility is discovered, diagnosed, and fixed through runtime failures. The agent drives the entire fix cycle by submitting to EMR, reading errors from Amazon CloudWatch logs, applying fixes, rebuilding, and resubmitting.

The agent presents the proposed Amazon EMR Serverless job configuration for your review before each submission. Type y to approve.

2.6.1 Fix 1: Legacy Parquet configuration key removed (iteration 1)

The first submission fails immediately at SparkSession initialization:

org.apache.spark.sql.AnalysisException:
The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInWrite' was removed
in the version 4.0.0. Use 'spark.sql.parquet.datetimeRebaseModeInWrite' instead.

The Historical Compliance pipeline configures spark.sql.legacy.parquet.datetimeRebaseModeInWrite for handling pre-1582 Julian calendar dates. Spark 4.0 removed the legacy. prefix from this configuration key.

The agent calls fix_upgrade_failure, which identifies the migration rule and recommends the fix:

File: src/utils/spark_config.py

# Before
.config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")

# After
.config("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")

After applying the fix, the agent rebuilds src.zip, re-uploads to Amazon S3, and resubmits the job.
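If your own code base keeps Spark settings in a dictionary instead of inline .config() calls, the same rename can be applied in one place. The following is a minimal sketch under that assumption. The names RENAMED_CONFIGS and migrate_config_keys are hypothetical, and the mapping contains only the key named in the error message above.

```python
# Removed key -> replacement key, taken from the AnalysisException above.
RENAMED_CONFIGS = {
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite":
        "spark.sql.parquet.datetimeRebaseModeInWrite",
}


def migrate_config_keys(settings: dict[str, str]) -> dict[str, str]:
    """Return a copy of the settings with removed keys renamed; values are unchanged."""
    return {RENAMED_CONFIGS.get(key, key): value for key, value in settings.items()}


if __name__ == "__main__":
    legacy_settings = {
        "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "LEGACY",
        "spark.sql.shuffle.partitions": "200",
    }
    for key, value in migrate_config_keys(legacy_settings).items():
        print(f"{key} = {value}")
```

Each migrated pair can then be passed to the SparkSession builder's .config(key, value) call, as in the fix above.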

2.6.2 Fix 2: Parquet compression codec rename (iteration 2)

The resubmitted job fails with a new error, which confirms progress:

pyspark.errors.exceptions.captured.IllegalArgumentException:
[CODEC_NOT_AVAILABLE.WITH_AVAILABLE_CODECS_SUGGESTION]
The codec lz4raw is not available.
Available codecs are brotli, uncompressed, lzo, snappy, lz4_raw, none, zstd, lz4, gzip.
SQLSTATE: 56038

The Fleet Management pipeline’s telemetry_processor.py uses lz4raw as the Parquet compression codec. Spark 4.0 renamed this to lz4_raw (with an underscore).

The recommended fix:

File: src/domain/fleet_management/telemetry_processor.py

# Before
.option("compression", "lz4raw")

# After
.option("compression", "lz4_raw")

The agent applies the change, rebuilds, and resubmits.
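If the same code must keep running on both the Spark 3.5 source and the Spark 4.0 target during a transition period, one option is to choose the codec name from the running Spark version. The following sketch assumes only what this post shows: lz4raw is accepted on Spark 3.5 and lz4_raw is required on Spark 4.0. The helper name parquet_lz4_raw_codec is hypothetical, and other Spark versions are not considered.

```python
def parquet_lz4_raw_codec(spark_version: str) -> str:
    """Return the raw LZ4 Parquet codec name for a Spark version string such as '4.0.1'."""
    major = int(spark_version.split(".")[0])
    # Spark 4.0 renamed the codec from "lz4raw" to "lz4_raw".
    return "lz4_raw" if major >= 4 else "lz4raw"


if __name__ == "__main__":
    for version in ("3.5.0", "4.0.1"):
        print(version, parquet_lz4_raw_codec(version))
```

In a PySpark job, the result could be passed as .option("compression", parquet_lz4_raw_codec(spark.version)), where spark is the active SparkSession.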

2.6.3 Fix 3: Stricter charset encoding validation (iteration 3)

The next submission surfaces a different failure:

org.apache.spark.SparkRuntimeException:
[MALFORMED_CHARACTER_CODING]
Invalid value found when performing `encode` with ISO-8859-1
SQLSTATE: 22000

The International Shipping pipeline’s process_carrier_manifests() method uses encode(..., 'ISO-8859-1') on data containing CJK (Chinese, Japanese, Korean) characters. Although ISO-8859-1 is in Spark 4.0’s supported charset list, it is a single-byte encoding that cannot represent CJK characters. In Spark 3.x, the Java charset encoder silently replaced unmappable characters with ?. Spark 4.0 tightened this behavior to throw MALFORMED_CHARACTER_CODING for unmappable characters.

The agent identifies the migration rule and adds a legacy compatibility configuration:

File: src/utils/spark_config.py

# Added to SparkSession builder
.config("spark.sql.legacy.codingErrorAction", "true")

This restores the Spark 3.x behavior where unmappable characters are silently replaced instead of throwing errors.

With the configuration added, the agent rebuilds and resubmits.
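The difference between the two behaviors can be seen without a Spark cluster. The following plain-Python analogy (not Spark code) uses the standard str.encode error handlers: "replace" substitutes ? for unmappable characters, as the post describes for Spark 3.x, and "strict" raises an error, similar to Spark 4.0. The function name encode_latin1 and the sample string are hypothetical.

```python
def encode_latin1(text: str, legacy: bool) -> bytes:
    """Encode to ISO-8859-1, replacing unmappable characters only in legacy mode."""
    errors = "replace" if legacy else "strict"
    return text.encode("iso-8859-1", errors=errors)


if __name__ == "__main__":
    manifest_entry = "Tokyo 東京"
    # Legacy-style behavior: each unmappable character becomes "?".
    print(encode_latin1(manifest_entry, legacy=True))
    # Strict behavior: the unmappable characters raise an error instead.
    try:
        encode_latin1(manifest_entry, legacy=False)
    except UnicodeEncodeError as error:
        print(f"strict mode failed: {error.reason}")
```

Note that the legacy setting hides the data loss instead of removing it, so it is worth checking whether ISO-8859-1 is still the right target encoding for multi-language data.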

2.6.4 Fix 4: Character encoding restrictions (iteration 4)

The fourth submission fails with yet another encoding error:

org.apache.spark.SparkIllegalArgumentException:
[INVALID_PARAMETER_VALUE.CHARSET]
The value of parameter(s) `charset` in `encode` is invalid:
expects one of the iso-8859-1, us-ascii, utf-16, utf-16be, utf-16le, utf-32, utf-8,
but got Shift_JIS. SQLSTATE: 22023

The International Shipping pipeline’s standardize_addresses_with_charset() method uses Shift_JIS, GB2312, and EUC-KR charsets in encode()/decode() calls. Spark 4.0 restricts these functions to seven standard charsets. These regional charsets are not in the supported list.

The agent replaces the unsupported charsets with UTF-8:

File: src/domain/international_shipping/shipment_processor.py

Before (Spark 3.5.0):

df = df.withColumn(
    "shipper_address_normalized",
    when(col("origin_country") == "JP",
         expr("decode(encode(shipper_address, 'Shift_JIS'), 'UTF-8')"))
    .when(col("origin_country") == "CN",
         expr("decode(encode(shipper_address, 'GB2312'), 'UTF-8')"))
    .when(col("origin_country") == "KR",
         expr("decode(encode(shipper_address, 'EUC-KR'), 'UTF-8')"))
    .otherwise(col("shipper_address"))
)

After (Spark 4.0.1):

df = df.withColumn(
    "shipper_address_normalized",
    when(col("origin_country") == "JP",
         expr("decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"))
    .when(col("origin_country") == "CN",
         expr("decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"))
    .when(col("origin_country") == "KR",
         expr("decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"))
    .otherwise(col("shipper_address"))
)

The same transformation is applied to consignee_address_normalized.

The agent rebuilds and resubmits one final time.
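To find similar calls in a larger code base before submitting a job, you could scan SQL expression strings for charset literals outside the supported list. The following is a heuristic sketch, not part of the agent: the names SUPPORTED_CHARSETS, QUOTED_LAST_ARGUMENT, and unsupported_charsets are hypothetical, and the regular expression simply treats any quoted literal in the last argument position as a charset name, so it can report false positives.

```python
import re

# Charsets listed in the INVALID_PARAMETER_VALUE.CHARSET error above (compared case-insensitively).
SUPPORTED_CHARSETS = {"iso-8859-1", "us-ascii", "utf-16", "utf-16be", "utf-16le", "utf-32", "utf-8"}

# Heuristic: a quoted literal used as the last argument of a call, such as ", 'Shift_JIS')".
QUOTED_LAST_ARGUMENT = re.compile(r",\s*'([^']+)'\s*\)")


def unsupported_charsets(sql_expression: str) -> list[str]:
    """Return the sorted charset literals in the expression that are not in the supported list."""
    literals = QUOTED_LAST_ARGUMENT.findall(sql_expression)
    return sorted({name for name in literals if name.lower() not in SUPPORTED_CHARSETS})


if __name__ == "__main__":
    before = "decode(encode(shipper_address, 'Shift_JIS'), 'UTF-8')"
    after = "decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"
    print(unsupported_charsets(before))
    print(unsupported_charsets(after))
```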

2.6.5 Final submission: success

The fifth submission completes successfully:

{"success": true, "message": "EMR SERVERLESS job completed successfully",
"compute_run_id": "<JOB-RUN-ID>", "status": "SUCCESS",
"application_type": "EMR-Serverless"}

The three pipelines (Fleet Management, International Shipping, and Historical Compliance) complete on EMR Serverless with the emr-spark-8.0-preview release label (Spark 4.0.1).

2.7 Summary of the iterative runtime validation

The runtime validation loop is the core value of the upgrade agent. Here’s the complete iteration history:

[Figure: Table showing the four validation iterations with error types and fixes applied.]

Each iteration follows the same cycle:

[Figure: Diagram showing the submit, diagnose, fix, rebuild, and resubmit cycle.]

Failures that would normally require manual log analysis, root cause investigation, and code patching are resolved automatically by the agent in this workflow.
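The control flow of this cycle can be sketched in a few lines of Python. This is a simplified model with simulated failures, not the agent's implementation: run_validation_loop, submit_job, and apply_fix are hypothetical names, and in the real workflow the submit and fix steps correspond to MCP tool calls such as run_validation_job, check_job_status, and fix_upgrade_failure.

```python
from typing import Callable, Optional


def run_validation_loop(submit_job: Callable[[], Optional[str]],
                        apply_fix: Callable[[str], None],
                        max_attempts: int = 10) -> int:
    """Submit until the job succeeds and return the number of submissions used."""
    for attempt in range(1, max_attempts + 1):
        error = submit_job()  # None on success, otherwise the error text from the logs
        if error is None:
            return attempt
        apply_fix(error)  # diagnose, patch the code, and rebuild before the next submission
    raise RuntimeError(f"job still failing after {max_attempts} submissions")


if __name__ == "__main__":
    # Simulated failures, in the order they surfaced in this walkthrough.
    pending = [
        "spark.sql.legacy.parquet.datetimeRebaseModeInWrite was removed",
        "CODEC_NOT_AVAILABLE",
        "MALFORMED_CHARACTER_CODING",
        "INVALID_PARAMETER_VALUE.CHARSET",
    ]
    submissions = run_validation_loop(
        submit_job=lambda: pending[0] if pending else None,
        apply_fix=lambda error: pending.remove(error),
    )
    print(f"succeeded on submission {submissions}")
```

The max_attempts bound is a design choice in this sketch so that a failure the fix step cannot resolve ends the loop instead of resubmitting indefinitely.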

3. Data quality validation

With both the source baseline (Section 2.5) and the upgraded target run (Section 2.6) completed successfully, the agent performs data quality validation to verify the migration hasn’t changed your application’s output. This is the key advantage of including the source application in your upgrade prompt: the agent can compare outputs from both Spark versions side by side.

3.1 Data quality comparison

The agent invokes get_data_quality_summary to compare the outputs across four dimensions:

  • Schema validation: Confirms column names, data types, and column ordering match between source and target outputs.
  • Row count validation: Verifies no data loss or duplication during migration.
  • Nullability validation: Detects changes in null handling.
  • Statistical summary validation: Compares numeric and string column distributions (min, max, mean, count, distinct values).
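To make the four dimensions concrete, the following plain-Python sketch compares two small outputs represented as lists of dictionaries. It illustrates the kind of checks described above and is not how get_data_quality_summary is implemented. All function names and the sample rows are hypothetical, and the statistics are limited to count, distinct count, min, and max.

```python
def schema(rows):
    """Ordered (column, type name) pairs, inferred from the first row."""
    return [(name, type(value).__name__) for name, value in rows[0].items()] if rows else []


def null_counts(rows):
    """Number of None values per column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def column_summary(rows, column):
    """Count, distinct count, min, and max of the non-null values in one column."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "distinct": len(set(values)),
        "min": min(values, default=None),
        "max": max(values, default=None),
    }


def compare_outputs(source, target):
    """Return a pass/fail flag for each of the four validation dimensions."""
    report = {
        "schema": schema(source) == schema(target),
        "row_count": len(source) == len(target),
        "nullability": null_counts(source) == null_counts(target),
    }
    shared = [name for name, _ in schema(source) if name in dict(schema(target))]
    report["statistics"] = all(
        column_summary(source, name) == column_summary(target, name) for name in shared
    )
    return report


if __name__ == "__main__":
    # Hypothetical rows: the source address is garbled, the target address is intact.
    source = [{"shipment_id": 1, "shipper_address": "\ufffd\ufffd\ufffd"},
              {"shipment_id": 2, "shipper_address": "1 Main St"}]
    target = [{"shipment_id": 1, "shipper_address": "東京都"},
              {"shipment_id": 2, "shipper_address": "1 Main St"}]
    print(compare_outputs(source, target))
```

With these sample rows, the schema, row count, and nullability checks pass while the statistics check fails, because the max value of shipper_address differs between the two outputs.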

The agent presents the comparison results:

[Figure: Data quality summary showing schema, row count, and nullability checks passing, with a statistical mismatch in shipper_address.]

Three of four checks pass cleanly. The statistical summary validation detects a mismatch in the shipper_address column of the customs_declarations output: the max and min summary values differ between source and target.

3.2 Understanding and resolving the mismatch

This mismatch is a direct consequence of Fix 4 (Section 2.6.4). The original code ran addresses through a Shift_JIS/GB2312/EUC-KR → UTF-8 roundtrip that produced garbled text, because the intermediate regional charset corrupted multi-byte UTF-8 characters. The upgraded code uses UTF-8 → UTF-8, preserving addresses faithfully. The mismatch reflects improved data quality, not a regression.
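The effect of the two roundtrips can be reproduced in plain Python, outside Spark. The following sketch encodes a sample Japanese address with an intermediate charset and then decodes the bytes as UTF-8, substituting the replacement character for invalid sequences. It is an analogy for the behavior described above, and the exact garbled output in Spark 3.x may differ. The function name roundtrip and the sample address are hypothetical.

```python
def roundtrip(text: str, intermediate_charset: str) -> str:
    """Encode with one charset, then decode the bytes as UTF-8, replacing invalid sequences."""
    return text.encode(intermediate_charset).decode("utf-8", errors="replace")


if __name__ == "__main__":
    address = "東京都"
    # Shift_JIS bytes are not valid UTF-8, so the decoded text is garbled.
    print(repr(roundtrip(address, "shift_jis")))
    # A UTF-8 to UTF-8 roundtrip returns the original text.
    print(roundtrip(address, "utf-8") == address)
```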

Schema, row counts, and nullability matched exactly: the difference is limited to string values that were previously garbled. No further action is needed. The upgraded application is production-ready.

Expected behavior: Character encoding migrations might change string values, although they preserve semantic meaning. When data quality validation reports mismatches, trace each one back to a specific code change. If the mismatch is explained by a required migration fix (as here), verify the new behavior is correct and document it. If a mismatch cannot be explained, investigate before promoting to production.

4. Upgrade summary

After the agent completes the entire upgrade workflow, it produces a comprehensive upgrade summary following a structured template. This summary lets you review the job configuration updates, code modifications with diffs and file references, relevant migration rules applied, and data quality validation status.

Here is the summary the agent produced for this upgrade:

Upgrade plan

  • Compile and build project with current Spark 3.5.0: validated that Python files compile successfully.
  • Run baseline validation on source EMR Serverless (00g4vhvt1lhtrs09) with Spark 3.5.0: established data quality baseline.
  • Run target validation on target EMR Serverless (00g4vhvt3np1bj09) with Spark 4.0.1: fixed 4 issues iteratively across 4 validation attempts.
  • Compare data quality between source and target runs: detected expected mismatch in shipper_address.
  • Generate and persist upgrade summary.

Upgrade result

Upgrade completed with data validation enabled. Data validation detected an expected mismatch in the shipper_address column because of the charset encoding migration from unsupported charsets (Shift_JIS, GB2312, EUC-KR) to UTF-8.

Dependency changes

No external dependencies were changed in this project (no requirements.txt).

Job configuration changes

  • Parquet datetime rebase configuration key renamed.
    • Change: spark.sql.legacy.parquet.datetimeRebaseModeInWrite → spark.sql.parquet.datetimeRebaseModeInWrite.
    • Migration rule: In Spark 4.0, the legacy datetime rebasing SQL configurations with the prefix spark.sql.legacy are removed. The SQL configuration spark.sql.legacy.parquet.datetimeRebaseModeInWrite was removed in the version 4.0.0. Use spark.sql.parquet.datetimeRebaseModeInWrite instead.
  • Legacy coding error action enabled.
    • Change: Added spark.sql.legacy.codingErrorAction set to true.
    • Migration rule: In Spark 4.0, the encode() and decode() functions raise MALFORMED_CHARACTER_CODING error when handling unmappable characters. In Spark 3.5 and earlier versions, these characters are replaced with garbled text. To restore the previous behavior, set spark.sql.legacy.codingErrorAction to true.

Code changes

  • Validation attempt 1: Legacy Parquet configuration key.
    • Validation run: EMR-Serverless job_run_id 00g4vm14v118vg0b.
    • Error: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInWrite' was removed in the version 4.0.0.
    • Applied changes: src/utils/spark_config.py: Changed .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") to .config("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY").
  • Validation attempt 2: Parquet compression codec.
    • Validation run: EMR-Serverless job_run_id 00g4vm5pm1hig00b.
    • Error: [CODEC_NOT_AVAILABLE.WITH_AVAILABLE_CODECS_SUGGESTION] The codec lz4raw is not available.
    • Applied changes: src/domain/fleet_management/telemetry_processor.py: Changed .option("compression", "lz4raw") to .option("compression", "lz4_raw").
  • Validation attempt 3: Stricter charset encoding.
    • Validation run: EMR-Serverless job_run_id 00g4vm8sh4sp0g0b.
    • Error: [MALFORMED_CHARACTER_CODING] Invalid value found when performing encode with ISO-8859-1.
    • Applied changes: src/utils/spark_config.py: Added .config("spark.sql.legacy.codingErrorAction", "true") to the SparkSession builder.
  • Validation attempt 4: Unsupported charsets.
    • Validation run: EMR-Serverless job_run_id 00g4vmc668ng6o0b.
    • Error: [INVALID_PARAMETER_VALUE.CHARSET] charset in encode is invalid: expects one of iso-8859-1, us-ascii, utf-16, utf-16be, utf-16le, utf-32, utf-8, but got Shift_JIS.
    • Applied changes: src/domain/international_shipping/shipment_processor.py: Replaced Shift_JIS, GB2312, EUC-KR with UTF-8 for shipper and consignee address encoding.

Data validation result

# | Validation | Status
1 | Schema validation (column names, types, ordering) | Passed (no difference)
2 | Row count validation (no data loss) | Passed (no difference)
3 | Nullability validation (null handling changes) | Passed (no difference)
4 | Statistical summary validation (numeric/string distributions) | Failed (with difference)

Data mismatch:

  1. The shipper_address column max summary value changed in customs_declarations output. This is expected because of the charset encoding migration from Shift_JIS/GB2312/EUC-KR to UTF-8.
  2. The shipper_address column min summary value changed in customs_declarations output for the same expected cause.

5. Conclusion

The AWS Spark Upgrade Agent turns a traditionally time-consuming PySpark migration into an automated, iterative workflow. For the Global Logistics Platform sample, the agent identified and resolved four distinct Spark 4.0 breaking changes: a legacy Parquet configuration key removal, a compression codec rename, stricter charset encoding validation, and character encoding restrictions. The fixes touched the shared Spark session configuration and two of the three domain processors, all through natural language interaction in the IDE.

Every incompatibility was discovered through runtime validation on Amazon EMR Serverless. The agent submitted the unmodified application to the target application, and each failure revealed the next breaking change:

  • The spark.sql.legacy.parquet.datetimeRebaseModeInWrite configuration removal, which crashes SparkSession initialization.
  • The lz4raw → lz4_raw codec rename, which fails when Parquet writes run.
  • ISO-8859-1 encoding of CJK characters: ISO-8859-1 is a valid Spark 4.0 charset, so the failure surfaces only when the code runs against real multi-language data, because Spark 4.0 tightened charset encoding validation to reject unmappable characters.
  • The Shift_JIS, GB2312, and EUC-KR charsets, which are not in Spark 4.0’s supported charset list, so encode() and decode() reject them outright.

The agent diagnosed each error from Amazon CloudWatch logs, applied the fix, rebuilt, and resubmitted without manual intervention beyond approving each step. The data quality validation then confirmed that the upgraded application produces equivalent output on Spark 4.0.1: schema, row counts, and nullability matched exactly. The one difference, in the shipper_address column, resulted from the charset migration from regional encodings to UTF-8, which actually improved data quality by eliminating garbled text from incorrect encoding roundtrips. With each mismatch traced back to a specific, understood code change, the upgraded application is production-ready.

# | Category | Spark 3.x behavior | Spark 4.0 change | Agent fix
1 | Parquet datetime configuration | spark.sql.legacy.parquet.datetimeRebaseModeInWrite | legacy. prefix removed from key name | Update configuration key
2 | Parquet compression | lz4raw codec name | Renamed to lz4_raw (with underscore) | Update codec name
3 | Charset + CJK data | ISO-8859-1 silently replaced unmappable CJK chars with ? | Stricter charset validation throws MALFORMED_CHARACTER_CODING for unmappable characters | Add spark.sql.legacy.codingErrorAction=true
4 | Character encoding | encode()/decode() supported Java charsets | Restricted to 7 standard charsets | Replace unsupported charsets with UTF-8

Next steps after your first upgrade:

  1. Apply the agent to your production PySpark code base.
  2. Integrate the upgrade workflow into your CI/CD pipeline.
  3. Explore Scala application upgrades (see Part 3 of this series).

To get started with your own PySpark migration:

  • Deploy the AWS CloudFormation templates from Section 1.2 for one-time AWS IAM, Amazon S3, and Amazon EMR Serverless setup.
  • Configure the spark-upgrade MCP server in your MCP-compatible IDE.
  • Point the agent at your PySpark project and let it handle the rest.

For more information, see the Amazon EMR Serverless documentation, the Apache Spark 4.0 migration guide, and the AWS Spark Upgrade Agent setup guide.

6. Clean up resources

To avoid ongoing costs, delete the resources you created:

  1. Delete the Amazon EMR Serverless stack:
    aws cloudformation delete-stack --stack-name spark-emr-serverless-upgrade --region ${SMUS_MCP_REGION}
  2. If the Amazon S3 staging bucket contains objects, empty it before deleting its stack:
    aws s3 rm s3://${STAGING_BUCKET_PATH} --recursive
  3. Delete the AWS IAM and Amazon S3 staging stack:
    aws cloudformation delete-stack --stack-name spark-upgrade-mcp-setup --region ${SMUS_MCP_REGION}

About the authors

Prasad Nadig

Prasad Nadig is a Senior Analytics Specialist Solutions Architect at AWS, specializing in data and AI, including data lakes, data warehousing, and analytics services such as Amazon Redshift, Amazon EMR, and AWS Glue. He helps customers architect, migrate, and modernize their data and analytics workloads to achieve scalable, performant, and cost-effective solutions on AWS.

Karthik Prabhakar

Karthik is a Data Processing Engines Architect for Amazon EMR at Amazon Web Services (AWS). He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost-optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.

Bezuayehu Wate

Bezuayehu is a Specialist Solutions Architect at AWS, specializing in big data analytics and AI solutions. She works closely with customers to modernize analytics platforms using AWS data and AI services. With a passion for emerging technologies and customer success, she thrives on designing innovative cloud solutions that deliver measurable business impact and drive organizational transformation.

Chuhan Liu

Chuhan is a Software Development Engineer at AWS.

Keerthi Chadalavada

Keerthi is a Senior Software Development Engineer in the AWS analytics organization. She focuses on combining generative AI and data integration technologies to design and build comprehensive solutions for customer data and analytics needs.

Pradeep Patel

Pradeep is a Sr. Software Engineer at AWS Glue. He is passionate about helping customers solve their problems by using the power of the AWS Cloud to deliver highly scalable and robust solutions. In his spare time, he loves to hike and play with web applications.