AWS Big Data Blog
Modernize Apache Spark workflows using Spark Connect on Amazon EMR on Amazon EC2
Apache Spark Connect, introduced in Spark 3.4, enhances the Spark ecosystem by offering a client-server architecture that separates the Spark runtime from the client application. Spark Connect enables more flexible and efficient interactions with Spark clusters, particularly in scenarios where direct access to cluster resources is limited or impractical.
A key use case for Spark Connect on Amazon EMR is to be able to connect directly from your local development environments to Amazon EMR clusters. By using this decoupled approach, you can write and test Spark code on your laptop while using Amazon EMR clusters for execution. This capability reduces development time and simplifies data processing with Spark on Amazon EMR.
In this post, we demonstrate how to implement Apache Spark Connect on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to build decoupled data processing applications. We show how to set up and configure Spark Connect securely, so you can develop and test Spark applications locally while executing them on remote Amazon EMR clusters.
Solution architecture
The architecture centers on an Amazon EMR cluster with two node types. The primary node hosts both the Spark Connect API endpoint and Spark Core components, serving as the gateway for client connections. The core node provides additional compute capacity for distributed processing. Although this solution demonstrates the architecture with two nodes for simplicity, it scales to support multiple core and task nodes based on workload requirements.
Spark Connect does not natively support TLS/SSL network encryption (this remains true as of Spark 4.x). We show you how to implement secure communications by deploying an Amazon EMR cluster with Spark Connect on Amazon EC2 behind an Application Load Balancer (ALB) that performs TLS termination as the secure interface. This approach enables encrypted data transmission between Spark Connect clients and Amazon Virtual Private Cloud (Amazon VPC) resources.
The operational flow is as follows:
- Bootstrap script – During Amazon EMR initialization, the primary node fetches and executes the start-spark-connect.sh file from Amazon Simple Storage Service (Amazon S3). This script starts the Spark Connect server.
- Server availability – When the bootstrap process is complete, the Spark Connect server enters a waiting state, ready to accept incoming connections. The Spark Connect API endpoint becomes available on the configured port (typically 15002), listening for gRPC connections from remote clients.
- Client interaction – Spark Connect clients can establish secure connections to an Application Load Balancer. These clients translate DataFrame operations into unresolved logical query plans, encode these plans using protocol buffers, and send them to the Spark Connect API using gRPC.
- Encryption in transit – The Application Load Balancer receives incoming gRPC or HTTPS traffic, performs TLS termination (decrypting the traffic), and forwards the requests to the primary node. The certificate is stored in AWS Certificate Manager (ACM).
- Request processing – The Spark Connect API receives the unresolved logical plans, translates them into Spark’s built-in logical plan operators, passes them to Spark Core for optimization and execution, and streams results back to the client as Apache Arrow-encoded row batches.
- (Optional) Operational access – Administrators can securely connect to both primary and core nodes through Session Manager, a capability of AWS Systems Manager, enabling troubleshooting and maintenance without exposing SSH ports or managing key pairs.
The following diagram depicts the architecture of this post’s demonstration for submitting Spark unresolved logical plans to EMR clusters using Spark Connect.
Apache Spark Connect on Amazon EMR solution architecture diagram
Prerequisites
To proceed with this post, ensure you have the following:
- An AWS account that provides access to AWS services.
- AWS Command Line Interface (AWS CLI) installed with the targeted Region set. Refer to Getting started with the AWS CLI for more information.
- Permissions to create AWS Identity and Access Management (IAM) roles, policies, security groups, and Amazon EMR clusters.
- An Amazon VPC set up with subnets and route tables.
Implementation steps
In this walkthrough, you use AWS CLI commands to:
- Prepare the bootstrap script, a bash script that starts Spark Connect on Amazon EMR.
- Set up the permissions for Amazon EMR to provision resources and perform service-level actions with other AWS services.
- Create the Amazon EMR cluster with the associated roles and permissions, attaching the prepared script as a bootstrap action.
- Deploy the Application Load Balancer with an ACM certificate to secure data in transit over the internet.
- Modify the primary node’s security group to allow Spark Connect clients to connect.
- Validate the setup with a test application that connects a client to the Spark Connect server.
Prepare the bootstrap script
To prepare the bootstrap script, follow these steps (a consolidated sketch of the commands appears after the list):
- Create an Amazon S3 bucket to host the bootstrap bash script.
- Open your preferred text editor and add the commands to a new file with a name such as start-spark-connect.sh. If the script runs on the primary node, it starts the Spark Connect server; if it runs on a core or task node, it does nothing.
- Upload the script to the bucket created in step 1.
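The following is a minimal sketch of these three steps. The bucket name is a placeholder, and the primary-node detection and the Spark startup path are assumptions based on common Amazon EMR conventions; adapt them to your environment and EMR release.

```bash
# Placeholder bucket name -- S3 bucket names must be globally unique.
aws s3 mb s3://my-spark-connect-demo-bucket

# Write the bootstrap script locally.
cat > start-spark-connect.sh <<'EOF'
#!/bin/bash
# Run the Spark Connect server only on the primary (master) node.
# /mnt/var/lib/info/instance.json is populated by Amazon EMR on each node.
if ! grep -q '"isMaster": *true' /mnt/var/lib/info/instance.json; then
  exit 0   # core or task node: do nothing
fi

# Bootstrap actions run before applications are installed, so wait for
# Spark in the background, then start the Spark Connect server
# (it listens on port 15002 by default). Depending on your release, you
# may need: --packages org.apache.spark:spark-connect_2.12:<spark-version>
sudo nohup bash -c '
  while [ ! -f /usr/lib/spark/sbin/start-connect-server.sh ]; do sleep 15; done
  /usr/lib/spark/sbin/start-connect-server.sh
' > /var/log/spark-connect-bootstrap.log 2>&1 &
exit 0
EOF

# Upload the script to the bucket created in step 1.
aws s3 cp start-spark-connect.sh s3://my-spark-connect-demo-bucket/
```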
Set up the permissions
Before creating the cluster, you must create the service role and the instance profile. A service role is an IAM role that Amazon EMR assumes to provision resources and perform service-level actions with other AWS services. An EC2 instance profile for Amazon EMR assigns a role to every EC2 instance in a cluster. The instance profile must specify a role that can access the resources for your bootstrap action. The commands for the following steps are sketched after the list.
- Create the IAM service role.
- Attach the necessary managed policies to the service role so that Amazon EMR can manage the underlying Amazon EC2 and Amazon S3 services on your behalf and, optionally, grant instances the ability to interact with Systems Manager.
- Create an Amazon EMR instance role to grant permissions to EC2 instances to interact with Amazon S3 or other AWS services.
- To allow the primary instance to read from Amazon S3, attach the AmazonS3ReadOnlyAccess policy to the Amazon EMR instance role. For production environments, review this access policy and replace it with a custom policy following the principle of least privilege, granting only the specific permissions needed for your use case.
- Attach the AmazonSSMManagedInstanceCore policy to enable the instances to use core Systems Manager features, such as Session Manager, and Amazon CloudWatch.
- To pass the IAM role information to the EC2 instances when they start, create the EMR_EC2_SparkClusterInstanceProfile Amazon EMR EC2 instance profile.
- Attach the EMR_EC2_SparkClusterNodesRole role created in step 3 to the newly created instance profile.
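A minimal sketch of these steps follows. The role, policy, and instance profile names match the ones referenced elsewhere in this post; the trust policies and the choice of AmazonEMRServicePolicy_v2 for the service role are assumptions you should review against your own security requirements.

```bash
# Trust policy letting Amazon EMR assume the service role.
cat > emr-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "elasticmapreduce.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role \
  --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
  --assume-role-policy-document file://emr-trust-policy.json

# Managed policy that lets Amazon EMR manage EC2 and S3 on your behalf.
aws iam attach-role-policy \
  --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2

# Trust policy letting EC2 instances assume the node role.
cat > ec2-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role \
  --role-name EMR_EC2_SparkClusterNodesRole \
  --assume-role-policy-document file://ec2-trust-policy.json

# Read-only S3 access for the bootstrap script; replace with a
# least-privilege custom policy in production.
aws iam attach-role-policy \
  --role-name EMR_EC2_SparkClusterNodesRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Core Systems Manager features (Session Manager) and CloudWatch.
aws iam attach-role-policy \
  --role-name EMR_EC2_SparkClusterNodesRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

# Instance profile that passes the node role to each EC2 instance.
aws iam create-instance-profile \
  --instance-profile-name EMR_EC2_SparkClusterInstanceProfile
aws iam add-role-to-instance-profile \
  --instance-profile-name EMR_EC2_SparkClusterInstanceProfile \
  --role-name EMR_EC2_SparkClusterNodesRole
```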
Create the Amazon EMR cluster
To create the Amazon EMR cluster, follow these steps:
- Set the environment variables that define where your EMR cluster and load balancer will be deployed.
- Create the EMR cluster with the latest Amazon EMR release, replacing the placeholder value with the S3 bucket name where the bootstrap action script is stored. A sketch of both steps follows:
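The Region, subnet, release label, and instance types below are placeholder assumptions; adjust them to your environment.

```bash
export AWS_REGION=us-east-1                      # placeholder Region
export SUBNET_ID=subnet-0123456789abcdef0        # placeholder subnet in your VPC
export BUCKET_NAME=my-spark-connect-demo-bucket  # bucket holding the bootstrap script

# emr-7.9.0 is assumed to ship Spark 3.5.5; check the Amazon EMR release
# guide for the latest release label.
aws emr create-cluster \
  --name "spark-connect-demo" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --service-role AmazonEMR-ServiceRole-SparkConnectDemo \
  --ec2-attributes InstanceProfile=EMR_EC2_SparkClusterInstanceProfile,SubnetId=$SUBNET_ID \
  --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
      InstanceGroupType=CORE,InstanceCount=1,InstanceType=m5.xlarge \
  --bootstrap-actions Path=s3://$BUCKET_NAME/start-spark-connect.sh \
  --region $AWS_REGION
```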
Next, modify the primary node’s security group to allow Systems Manager to start a session (the commands are sketched after the list):
- Get the primary node’s security group identifier. Record the identifier because you’ll need it for subsequent configuration steps wherever primary-node-security-group-id is mentioned.
- Find the EC2 Instance Connect prefix list ID for your Region. You can use the EC2_INSTANCE_CONNECT filter with the describe-managed-prefix-lists command. Using a managed prefix list provides a dynamic security configuration that authorizes the EC2 Instance Connect service to reach the primary and core nodes over SSH.
- Modify the primary node security group’s inbound rules to allow SSH access (port 22) to the EMR cluster’s primary node from the addresses in the Instance Connect prefix list.
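A sketch of these steps; CLUSTER_ID is assumed to hold the ID returned by create-cluster, and the prefix list is looked up by its well-known name.

```bash
export CLUSTER_ID=j-XXXXXXXXXXXXX   # placeholder: ID returned by create-cluster

# 1. Security group attached to the primary (master) node.
PRIMARY_SG=$(aws emr describe-cluster --cluster-id $CLUSTER_ID \
  --query 'Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup' \
  --output text)

# 2. EC2 Instance Connect managed prefix list for the Region.
PREFIX_LIST_ID=$(aws ec2 describe-managed-prefix-lists \
  --filters Name=prefix-list-name,Values=com.amazonaws.$AWS_REGION.ec2-instance-connect \
  --query 'PrefixLists[0].PrefixListId' --output text)

# 3. Allow SSH (port 22) from the Instance Connect prefix list.
aws ec2 authorize-security-group-ingress --group-id $PRIMARY_SG \
  --ip-permissions "IpProtocol=tcp,FromPort=22,ToPort=22,PrefixListIds=[{PrefixListId=$PREFIX_LIST_ID}]"
```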
Optionally, you can repeat the preceding steps 1–3 for the core (and task) nodes to allow Amazon EC2 Instance Connect to access those EC2 instances through SSH.
Deploy the Application Load Balancer and certificate
To deploy the Application Load Balancer and certificate, follow these steps (a consolidated sketch of the commands appears after the list):
- Create a security group for the load balancer.
- Add a rule to accept TCP traffic from a trusted IP on port 443. We recommend using the local development machine’s IP address, which you can check at https://checkip.amazonaws.com.
- Create a new target group with the gRPC protocol version, targeting the Spark Connect server instance and the port the server listens on.
- Create the Application Load Balancer.
- Get the load balancer DNS name.
- Retrieve the Amazon EMR primary node ID.
- (Optional) To encrypt and decrypt the traffic, the load balancer needs a certificate. You can skip this step if you already have a trusted certificate in ACM. Otherwise, create a self-signed certificate.
- Import the certificate into ACM.
- Create the load balancer listener.
- After the listener has been provisioned, register the primary node to the target group.
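The following sketch covers these steps end to end. The VPC and subnet IDs, resource names, and self-signed certificate parameters are assumptions, as is the target group protocol (plaintext HTTP to the node, since the ALB terminates TLS); the health check matcher is relaxed because Spark Connect does not expose a dedicated gRPC health service.

```bash
export VPC_ID=vpc-0123456789abcdef0           # placeholder
export SUBNET_ID_A=subnet-0123456789abcdef0   # placeholders: two subnets in
export SUBNET_ID_B=subnet-0fedcba9876543210   # different Availability Zones
export MY_IP=203.0.113.10                     # your trusted client IP

# 1. Security group for the load balancer.
ALB_SG=$(aws ec2 create-security-group --group-name spark-connect-alb-sg \
  --description "Spark Connect ALB" --vpc-id $VPC_ID \
  --query 'GroupId' --output text)

# 2. Accept HTTPS only from the trusted IP.
aws ec2 authorize-security-group-ingress --group-id $ALB_SG \
  --protocol tcp --port 443 --cidr ${MY_IP}/32

# 3. gRPC target group pointing at the Spark Connect port.
TG_ARN=$(aws elbv2 create-target-group --name spark-connect-tg \
  --protocol HTTP --protocol-version GRPC --port 15002 \
  --vpc-id $VPC_ID --target-type instance \
  --matcher GrpcCode=0-99 \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# 4. The Application Load Balancer.
ALB_ARN=$(aws elbv2 create-load-balancer --name spark-connect-alb \
  --subnets $SUBNET_ID_A $SUBNET_ID_B --security-groups $ALB_SG \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)

# 5. DNS name the Spark Connect client will use.
ALB_DNS=$(aws elbv2 describe-load-balancers --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].DNSName' --output text)

# 6. Instance ID of the EMR primary node.
PRIMARY_ID=$(aws emr list-instances --cluster-id $CLUSTER_ID \
  --instance-group-types MASTER \
  --query 'Instances[0].Ec2InstanceId' --output text)

# 7. Self-signed certificate (demo only; use a trusted certificate in production).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout alb-key.pem -out alb-cert.pem -subj "/CN=$ALB_DNS"

# 8. Import the certificate into ACM.
CERT_ARN=$(aws acm import-certificate --certificate fileb://alb-cert.pem \
  --private-key fileb://alb-key.pem --query 'CertificateArn' --output text)

# 9. HTTPS listener that terminates TLS and forwards to the target group.
aws elbv2 create-listener --load-balancer-arn $ALB_ARN \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn=$CERT_ARN \
  --default-actions Type=forward,TargetGroupArn=$TG_ARN

# 10. Register the primary node with the target group.
aws elbv2 register-targets --target-group-arn $TG_ARN \
  --targets Id=$PRIMARY_ID,Port=15002
```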
Modify the primary node’s security group to allow Spark Connect clients to connect
To connect to Spark Connect, amend only the primary node’s security group. Because client traffic reaches the primary node through the Application Load Balancer after TLS termination, add an inbound rule that accepts Spark Connect TCP connections on port 15002 from the load balancer’s security group rather than directly from your client IP address:
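This sketch reuses the PRIMARY_SG and ALB_SG variables from the earlier steps:

```bash
# Allow Spark Connect traffic (port 15002) from the load balancer only.
aws ec2 authorize-security-group-ingress --group-id $PRIMARY_SG \
  --protocol tcp --port 15002 --source-group $ALB_SG
```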
Connect with a test application
This example demonstrates that a client running a newer Spark version (4.0.1) can successfully connect to an older Spark version on the Amazon EMR cluster (3.5.5), showcasing Spark Connect’s version compatibility feature. This version combination is for demonstration only. Running older versions might pose security risks in production environments.
To test the client-to-server connection, we provide the following test Python application. We recommend that you create and activate a Python virtual environment (venv) before installing the packages. This helps isolate the dependencies for this specific project and prevents conflicts with other Python projects. To install packages, run the following command:
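For example, using the PySpark client version referenced in this post:

```bash
python3 -m venv spark-connect-venv
source spark-connect-venv/bin/activate
# The "connect" extra pulls in the gRPC and Arrow dependencies the client needs.
pip install "pyspark[connect]==4.0.1"
```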
In your integrated development environment (IDE), copy and paste the following code, replace the placeholder, and run it. The code creates a Spark DataFrame containing two rows and displays its data:
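A minimal sketch of such a test application; the load balancer DNS name is a placeholder, and the sample rows are illustrative:

```python
from pyspark.sql import SparkSession

# Placeholder: replace with your Application Load Balancer DNS name.
ALB_DNS = "<your-alb-dns-name>"

# use_ssl=true because the ALB terminates TLS on port 443.
spark = (
    SparkSession.builder
    .remote(f"sc://{ALB_DNS}:443/;use_ssl=true")
    .getOrCreate()
)

# Two-row DataFrame built on the client, executed on the remote cluster.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

spark.stop()
```

If you used the self-signed certificate, the client must trust it; one option is to point gRPC at the certificate file before running the script, for example with export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH=alb-cert.pem.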
The following shows the application output:
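With the sample rows used in the sketch above, df.show() prints:

```
+-----+---+
| name|age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+
```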
Clean up
When you no longer need the cluster, release the following resources to stop incurring charges:
- Delete the Application Load Balancer listener, target group, and the load balancer.
- Delete the ACM certificate.
- Delete the load balancer and Amazon EMR node security groups.
- Terminate the EMR cluster.
- Empty the Amazon S3 bucket and delete it.
- Remove the AmazonEMR-ServiceRole-SparkConnectDemo and EMR_EC2_SparkClusterNodesRole roles and the EMR_EC2_SparkClusterInstanceProfile instance profile.
Considerations
Security considerations with Spark Connect:
- Private subnet deployment – Keep EMR clusters in private subnets with no direct internet access, using NAT gateways for outbound connectivity only.
- Access logging and monitoring – Enable VPC Flow Logs, AWS CloudTrail, and bastion host access logs for audit trails and security monitoring.
- Security group restrictions – Configure security groups to allow Spark Connect port (15002) access only from bastion host or specific IP ranges.
Conclusion
In this post, we showed how you can adopt modern development workflows and debug Spark applications from local IDEs or notebooks, so you can step through code execution. With Spark Connect’s client-server architecture, the Spark cluster can run on a different version than the client applications, so operations teams can perform infrastructure upgrades and patches independently.
As the cluster operators gain experience, they can customize the bootstrap actions and add steps to process data. Consider exploring Amazon Managed Workflows for Apache Airflow (MWAA) for orchestrating your data pipeline.