AWS Big Data Blog

Announcing Spark Connect on Amazon EMR Serverless: Interactive PySpark development, anywhere

Today, AWS is announcing support for Spark Connect on Amazon EMR Serverless with EMR release 7.13 (Apache Spark 3.5.6) and later versions. You can now build and debug Spark applications from your preferred local environment while running full-scale Spark operations on EMR Serverless.

Previously, code that worked on a local machine might break in production because of environment mismatches, dependency conflicts, or unexpected data patterns. The only way to catch it was a deploy-and-check cycle. With the Spark Connect feature, you can develop Spark code from a supported local environment, such as an IDE (for example, VS Code or PyCharm), Jupyter notebooks, Amazon SageMaker Unified Studio (SMUS) Data Notebooks, Amazon Q Developer, or Kiro. There are no clusters to provision, no code to repackage, and no deploy-and-check loop. Your local Python session can stay local as usual while Spark operations are automatically routed to a remote Spark server for execution.

Each Spark Connect session has its own AWS resource with a unique ARN, enabling per‑session AWS Identity and Access Management (AWS IAM) permissions, tag‑based cost allocation, audit through AWS CloudTrail, and session-specific configuration overrides. This gives teams finer control over who runs what, at what cost. You also get real-time visibility through the Spark UI, persistent session history, and a dedicated interface to monitor and manage active and completed sessions.

For more details, visit the EMR Serverless release notes or the EMR Serverless Developer Guide. For a quick look at the experience, here’s a demonstration of using Spark Connect in Amazon SageMaker Unified Studio Data Notebooks:

For a runnable end-to-end example, try the EMR Serverless Spark Connect sample notebook from your local IDE. See the following demonstration:

How Spark Connect works

Spark Connect uses a client-server architecture that separates application code from the Spark engine. The client, a lightweight PySpark library running on a local environment, sends Spark operations over a secure gRPC/TLS connection to a Spark Connect server running on EMR Serverless. Then the server runs that Spark code on EMR Serverless as compute. Finally, it returns results to your local session.

Spark Connect client-server architecture showing a local IDE connecting to a Spark Connect server on EMR Serverless

Your local machine doesn’t need Spark installed, doesn’t need direct access to the data, and doesn’t need to be sized for the workload. Because the client is a compact library, you can embed Spark operations in your Python applications that support PySpark. This includes web services, dashboards, and automation scripts. For example, a development team can add Spark-powered analytics directly into a FastAPI backend or a Streamlit dashboard, treating Spark like a database driver rather than a separate batch system. These capabilities extend Spark Connect use cases beyond traditional notebook and IDE development, since the compute-intensive processing happens on the server – EMR Serverless side. This allows you to use pandas, matplotlib, and your team’s internal Python libraries on your laptop or in your embedded clients, without installing those libraries on EMR Serverless.

With Spark Connect server sessions running on EMR Serverless, you pay for compute only while your session is active. When inactive, you’re not paying. EMR Serverless automatically scales compute up and down based on workload demands through dynamic resource allocation (DRA), eliminating the need to predict capacity ahead of time. For teams that run Spark Connect sessions regularly, you can configure pre-initialized capacity on your EMR Serverless application for faster session startup times. Additionally, your Spark Connect sessions have access to the full suite of EMR Serverless features, including AWS Graviton processors for cost optimization and secure VPC connectivity to your data sources. You also get access to custom images with flexibility and integrated observability through Amazon CloudWatch and the Spark UI.

Getting started

Getting started with Spark Connect on EMR Serverless takes three steps: create an application, start a session, and connect from your IDE.

Note: The resources created in this quick start incur charges while active. Make sure to follow the cleanup steps at the end of this tutorial to avoid ongoing charges.

Prerequisites

  • In addition to the required job runtime IAM role, these additional permissions are needed: emr-serverless:StartSession, GetSession, GetSessionEndpoint, TerminateSession, GetResourceDashboard, and iam:PassRole on the runtime role.
  • An existing EMR Serverless application running emr-7.13.0 or later, with interactiveConfiguration.sessionEnabled = true.
  • boto3 version 1.43.0 or later to access the latest EMR Serverless session APIs.

Step 1: Create an EMR Serverless application with Spark Connect enabled

Amazon EMR Serverless application creation page in the EMR console

  • Open the Amazon EMR console and navigate to EMR Serverless.
  • Choose Get started. A pop-up appears. Choose Create and launch EMR Studio.
  • This takes you to the Create application page.
  • Enter a Name for your application (for example, spark-connect-app).
  • For Type, select Spark.
  • For Release version, select emr-7.13.0 or later.
  • For Architecture, choose x86_64 (default). This is compatible with most third-party tools and libraries.
  • Under Application setup options, select Use default settings for interactive workloads. This automatically sets interactiveConfiguration.sessionEnabled = true.
  • Choose Create and start application.

Alternatively, using the CLI command:

# Create an application with Spark Connect enabled
APP_ID=$(aws emr-serverless create-application \
  --type "SPARK" \
  --name "spark-connect-app" \
  --release-label emr-7.13.0 \
  --interactive-configuration '{"sessionEnabled": true}' \
  --query 'applicationId' \
  --output text)
echo "Created application: $APP_ID"
# Start the application
aws emr-serverless start-application --application-id "$APP_ID"

Step 2: Start a session

Next, start a session and obtain the Spark Connect endpoint.

Provide an IAM execution role that grants the session access to your data, such as reading data from an Amazon S3 bucket or querying the AWS Glue Data Catalog. This is the same type of role used for EMR Serverless batch jobs.

# Start a session with your execution role
$ROLE_ARN="YOUR_ROLE" # example: arn:aws:iam::123456789012:role/EMRServerlessSessionRole
SESSION_ID=$(aws emr-serverless start-session \
  --application-id $APP_ID \
  --execution-role-arn $ROLE_ARN \
  --query sessionId \
  --output text)

# Get the session endpoint
aws emr-serverless get-session-endpoint \
  --application-id $APP_ID \
  --session-id $SESSION_ID

The get-session-endpoint response includes a secure endpoint URL and an authentication token. All communication between your local environment and EMR Serverless is encrypted using TLS. Treat the token as a sensitive credential. Consider using AWS Secrets Manager to store and retrieve tokens programmatically. The authentication token is time-limited to 1 hour, so for long-running sessions we recommend that you refresh it periodically.

Step 3: Connect from your local IDE

Use the returned endpoint URL and authentication token to connect to the Spark Connect server.

The connection URL uses the sc:// protocol, which is the Spark Connect standard. The use_ssl=true parameter supports encrypted communications over TLS, so your data and credentials are protected in transit.

from pyspark.sql import SparkSession

# Use the endpoint and auth token from get-session-endpoint
session_endpoint="<endpoint-from-get-session-endpoint>"
auth_token="<authToken-from-get-session-endpoint>"

spark_connect_url = (
    f"sc://{session_endpoint}:443/;use_ssl=true;x-aws-proxy-auth={auth_token}"
)

spark = SparkSession.builder \
    .remote(spark_connect_url) \
    .getOrCreate()

# Query data in your S3 data lake
df = spark.sql("SELECT * FROM my_catalog.my_database.my_table")
df.show()

# Run transformations at scale
df.groupBy("category").count().orderBy("count", ascending=False).show()
spark.stop()

Once connected, Spark operations you write in your IDE can be run on EMR Serverless. For debugging, you can pause the execution at breakpoints, inspect variables, and step through your transformations locally while EMR Serverless processes your data on remote, scalable infrastructure.

Sessions remain active for a configurable idle timeout (1 hour by default). If your connection drops, the session continues running, allowing you to reconnect without losing your work. You can also access the live Spark UI through the GetResourceDashboard API to monitor queries, stages, and executors in real time. After the session ends, the Spark History Server remains available for post-run analysis.

Clean up resources

If the 1-hour session idle timeout does not meet your needs, you can manually remove sessions to avoid ongoing costs. Note that terminating an active session will immediately stop you running Spark operations. Before doing that, verify all your critical data processing is completed, and results are saved.

# 1. Stop the active session
aws emr-serverless terminate-session \
  --application-id $APP_ID \
  --session-id $SESSION_ID

# 2. Stop the application
aws emr-serverless stop-application --application-id $APP_ID

Use cases

Spark Connect on EMR Serverless supports a wide range of development workflows. The following are some of the most popular use cases, including but not limited to:

  • Interactive ETL development — Build and test data pipelines interactively, validating transformations against full-scale datasets before promoting them to production as batch jobs.
  • SageMaker Unified Studio (SMUS) Data Notebooks — Run interactive PySpark sessions directly from SMUS Data Notebooks connected to EMR Serverless through Spark Connect.
  • Direct S3 and JDBC access without a catalog — Connect directly to S3 files and JDBC data sources without needing a metastore or catalog configuration.
  • Apache Iceberg Data Lakehouse analytics — Query and manage Iceberg tables through the AWS Glue Data Catalog, with full support for time travel, schema evolution, and partition management.
  • Amazon S3 Tables with federated catalog — Access S3 Tables as a federated Glue Data Catalog source, combining Iceberg features with serverless Spark execution.
  • dbt-spark — Run dbt-spark adapter against EMR Serverless via Spark Connect, allowing analytics engineers to develop and test transformations locally with dbt framework while using EMR Serverless as the remote Spark engine.
  • Exploratory data analysis and feature engineering — Analyze production-scale data from your preferred notebook environment instead of using sampled subsets, helping you catch data quality issues earlier.
  • Compute standardization — Standardize EMR Serverless as the Spark backend while giving you the flexibility to use preferred local tools, version control, and CI/CD workflows.

These use cases work across multiple client surfaces: IDEs, Jupyter notebooks, dbt-spark, and AI coding agents. Because Spark Connect is an open Apache Spark standard, the same PySpark code typically works across different Spark backends by changing the connection endpoint.

Availability and pricing

Spark Connect on EMR Serverless is now available with Apache Spark 3.5.6 on Amazon EMR release 7.13 and higher in all AWS Regions where EMR Serverless is available. There is no additional charge for using Spark Connect. You pay for the EMR Serverless compute resources (vCPU, memory, and storage) consumed during your session, the same pricing model as EMR Serverless batch jobs.

Conclusion

Spark Connect on EMR Serverless bridges the gap between local development and production-scale execution. Build and debug PySpark applications from your preferred environment (IDE, notebook, dbt, or AI coding agent) while EMR Serverless handles automatic scaling, per-session cost visibility, and infrastructure management behind the scenes. With ARN-addressable sessions, fine-grained IAM permissions, tag-based cost allocation, and per-session configuration overrides, your team gets the controls they need without sacrificing flexibility.

Get started today with EMR release 7.13.0 (Spark 3.5.6). Follow the step-by-step tutorial in the EMR Serverless Developer Guide to create your first Spark Connect session and experience interactive, serverless PySpark development firsthand.


About the authors

Al MS

Al MS

Al is a product manager for Amazon EMR at AWS.

Melody Yang

Melody Yang

Melody Yang is a Principal Analytics Specialist Solution Architect at AWS with expertise in Big Data technologies. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

KiKi Nwangwu

KiKi Nwangwu

Kiki is an Analytics and GenAI Specialist Solutions Architect at AWS. She specializes in helping customers architect, build, and modernize scalable data analytics and generative AI solutions. She enjoys traveling and exploring new cultures.