AWS Big Data Blog

Implementing Kerberos authentication for Apache Spark jobs on Amazon EMR on EKS to access a Kerberos-enabled Hive Metastore

Many organizations run their Apache Spark analytics platforms on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), using Kerberos authentication to secure connectivity between Spark jobs and a centralized, shared Apache Hive Metastore (HMS). Amazon EMR on Amazon EKS gives these organizations an additional way to run Spark jobs, with the benefits of Kubernetes-based container orchestration, improved resource utilization, and faster job startup times. However, an HMS deployment supports only one authentication mechanism at a time, so Spark jobs on Amazon EMR on EKS must also be configured for Kerberos authentication to connect to the existing Kerberos-enabled HMS.

In this post, we show how to configure Kerberos authentication for Spark jobs on Amazon EMR on EKS, authenticating against a Kerberos-enabled HMS so you can run both Amazon EMR on EC2 and Amazon EMR on EKS workloads against a single, secure HMS deployment.

Overview of solution

Consider an enterprise data platform team that’s been running Spark jobs on Amazon EMR on EC2 for several years. Their architecture includes a Kerberos-enabled standalone HMS that serves as the centralized data catalog, with Microsoft Active Directory functioning as the Key Distribution Center (KDC). As the team evaluates Amazon EMR on EKS for new workloads, their existing HMS must continue serving Amazon EMR on EC2, with both platforms authenticating through the same Kerberos infrastructure. To address this, the platform team must configure their Spark jobs running on Amazon EMR on EKS to authenticate with the same KDC, so those jobs can obtain valid Kerberos tickets and establish authenticated connections to the HMS while maintaining a unified security posture across the data platform.

Architecture diagram showing two VPCs connected via VPC peering: the Active Directory VPC contains Microsoft Active Directory serving as the Kerberos Key Distribution Center (KDC) with ports 88 (Kerberos) and 749 (Admin). The Amazon EKS VPC contains two namespaces — the emr namespace runs Apache Spark jobs (each with a driver pod and executor pods) configured with krb5.conf, jaas.conf, and keytab files using a spark/analytics-team@CORP.KERBEROS principal; the hive-metastore namespace runs Hive Metastore pods (with deployment, replica set, and HPA) configured with Kerberos artifacts and the hive/hive-metastore@CORP.KERBEROS principal. Spark driver pods connect to the Hive Metastore service, which is backed by Amazon Aurora PostgreSQL for metadata storage and Amazon S3 for data storage. AWS Secrets Manager stores Kerberos keytabs and database credentials retrieved during deployment. Users submit Spark jobs via AWS Systems Manager Session Manager.

Scope of Kerberos in this solution

Kerberos authentication in this solution secures the connection between Spark jobs and the HMS. Other components in the architecture use AWS and Kubernetes security mechanisms instead.

Solution architecture

Our solution implements Kerberos authentication to secure the connection between Spark jobs and the HMS. The architecture spans two Amazon Virtual Private Clouds (Amazon VPCs) connected using VPC peering, with distinct components handling identity management, compute, and metadata services.

Identity and Authentication layer

A self-managed Microsoft Active Directory Domain Controller is deployed in a dedicated VPC and serves as the KDC for Kerberos authentication. The Active Directory server hosts service principals for both the HMS service and Spark jobs. This separate VPC deployment mirrors real-world enterprise architectures where Active Directory is typically managed by identity teams in their own network boundary, whether on-premises or in AWS.

Data Platform layer

The data platform components reside in a separate VPC, which includes an EKS cluster hosting both the HMS service and Amazon EMR on EKS based Spark jobs that persist data in an Amazon Simple Storage Service (Amazon S3) bucket.

Hive Metastore service

The HMS is deployed in the EKS hive-metastore namespace and simulates a pre-existing, standalone Kerberos-enabled HMS, a common enterprise pattern where HMS is managed independently of any data processing platform. You can learn more about other enterprise design patterns in the post Design patterns for implementing Hive Metastore for Amazon EMR or EKS. The HMS service authenticates with the KDC using its service principal and keytab mounted from a Kubernetes secret.

Apache Spark Execution layer

Apache Spark jobs are deployed using the Spark Operator on EKS. The Spark driver and executor pods are configured with Kerberos credentials through mounted ConfigMaps containing krb5.conf and jaas.conf, along with keytab files from Kubernetes secrets. When a Spark job must access Hive tables, the driver authenticates with the KDC and establishes a secure Simple Authentication and Security Layer (SASL) connection to the HMS.
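The credential mounting described above can be sketched in a SparkApplication manifest. The ConfigMap, Secret, and volume names here are hypothetical, and the actual manifests in the sample repository may differ:

```yaml
# Hypothetical excerpt of a SparkApplication: mounting Kerberos artifacts
# into the driver and executor pods (names are assumptions)
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-kerberos-job
  namespace: emr
spec:
  volumes:
    - name: krb5-config
      configMap:
        name: krb5-config        # contains krb5.conf and jaas.conf
    - name: kerberos-keytab
      secret:
        secretName: spark-keytab # keytab from a Kubernetes secret
  driver:
    volumeMounts:
      - name: krb5-config
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
      - name: kerberos-keytab
        mountPath: /etc/security/keytab
  executor:
    volumeMounts:
      - name: krb5-config
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
      - name: kerberos-keytab
        mountPath: /etc/security/keytab
```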

Authentication flow

The HMS runs as a long-running Kubernetes service that must be deployed and authenticated before Spark jobs can connect.

During HMS deployment:

  1. The HMS pod validates its Kerberos configuration; krb5.conf and jaas.conf are mounted from ConfigMaps
  2. The service authenticates with the KDC using its principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
  3. The keytab is mounted from a Kubernetes secret for credential access
  4. A secure Thrift endpoint is established on port 9083 with SASL authentication enabled

When a Spark job must interact with the HMS:

  1. Spark job submission:
    1. User submits Spark job through Spark Operator
    2. Driver and executor pods are created with Kerberos configuration mounted as volumes
    3. krb5.conf ConfigMap provides KDC connection details including realm and server addresses
    4. jaas.conf ConfigMap specifies a login module configuration with keytab path and principal
    5. Keytab secret contains encrypted credentials for Spark service principal spark/analytics-team@CORP.KERBEROS
  2. Authentication and connection:
    1. Spark driver authenticates with KDC using its principal and keytab to obtain a Ticket Granting Ticket (TGT)
    2. When connecting to HMS, Spark requests a service ticket from the KDC for the HMS principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
    3. KDC issues a service ticket encrypted with HMS’s secret key
    4. Spark presents this service ticket to the HMS over the Thrift connection on port 9083
    5. HMS decrypts the ticket using its keytab, verifies Spark’s identity, and establishes the authenticated SASL session
    6. Executor pods use the same configuration for authenticated operations
  3. Data access:
    1. Authenticated Spark job queries HMS for table metadata
    2. HMS validates Kerberos tickets before serving metadata requests
    3. Spark accesses underlying data in Amazon S3 using IAM Roles for Service Accounts (IRSA)

Sequence diagram illustrating the Kerberos authentication flow between a Spark job and the Hive Metastore. The flow proceeds in five phases: (1) Job Submission — a Data Engineer submits a SparkApplication via kubectl, and the Spark Operator creates a driver pod with krb5.conf, jaas.conf, and keytab mounted. (2) Kerberos Authentication — the Spark driver loads its keytab for the spark/analytics-team@CORP.KERBEROS principal and sends an AS-REQ to the Active Directory KDC, which validates the credentials and returns a TGT (Ticket Granting Ticket). (3) Service Ticket Request — the Spark driver sends a TGS-REQ to the KDC requesting a service ticket for the Hive Metastore principal, and the KDC returns a service ticket encrypted with the HMS key. (4) Authenticated Connection — the Spark driver connects to the Hive Metastore over Thrift (port 9083) using SASL with the service ticket; HMS decrypts the ticket using its own keytab, verifies the Spark identity, and establishes an authenticated session. (5) Data Operations — the Spark driver queries table metadata from HMS (backed by Aurora PostgreSQL) and reads/writes table data directly from Amazon S3 using IRSA credentials.

Implementation workflow

The implementation involves three key stakeholders working together to establish the Kerberos-enabled communication:

Microsoft Active Directory Administrator

The Active Directory Administrator creates service accounts that are used for HMS and Spark jobs. This involves setting up the service principal names using the setspn utility and generating keytab files using ktpass for secure credential storage. The administrator configures the appropriate Active Directory permissions and Kerberos AES256 encryption type. Finally, the keytab files are uploaded to AWS Secrets Manager for secure distribution to Kubernetes workloads.

Data Platform Team

The platform team handles the Amazon EMR on EKS and Kubernetes configurations. They retrieve keytabs from Secrets Manager and create Kubernetes secrets for the workloads. They configure Helm charts for HMS deployment with Kerberos settings and set up ConfigMaps for krb5.conf, jaas.conf, and core-site.xml.
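A minimal sketch of that keytab-to-secret step, with hypothetical secret and file names (the repository's scripts handle this for you). The Secrets Manager retrieval is shown as a comment because it requires live AWS credentials:

```shell
# Hypothetical sketch: package a keytab as a Kubernetes Secret manifest.
# In the real flow the keytab bytes come from AWS Secrets Manager, e.g.:
#   aws secretsmanager get-secret-value --secret-id <keytab-secret-id> \
#     --query SecretBinary --output text | base64 -d > hive.keytab
printf 'stand-in-keytab-bytes' > hive.keytab   # placeholder for the real keytab

# Kubernetes Secret data must be base64-encoded
KEYTAB_B64=$(base64 < hive.keytab | tr -d '\n')
cat > hive-keytab-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hive-keytab
  namespace: hive-metastore
type: Opaque
data:
  hive.keytab: ${KEYTAB_B64}
EOF
```

Applying the manifest (for example, with kubectl apply -f hive-keytab-secret.yaml) makes the keytab mountable at the path the HMS pod expects.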

Data Engineering Operations

Data engineers submit jobs using the configured service account with Kerberos authentication. They monitor job execution and verify authenticated access to HMS.

Deploy the solution

In the remainder of this post, you will explore the implementation details for this solution. You can find the sample code in the AWS Samples GitHub repository. For additional details, including verification steps for each deployment stage, refer to the README in the repository.

Prerequisites

Before you deploy this solution, make sure that the necessary prerequisites are in place; refer to the README in the repository for the full list.

Clone the repository and set up environment variables

Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.

# Clone the Git repository
git clone https://github.com/aws-samples/sample-emr-eks-spark-kerberos-hms.git
cd sample-emr-eks-spark-kerberos-hms

# Set environment variables
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>

Set up Microsoft Active Directory infrastructure

In this section, we deploy a self-managed Microsoft Active Directory, serving as the KDC, on a Windows Server EC2 instance in a dedicated VPC. This is an intentionally minimal implementation highlighting only the key components required for this blog post. Run the following script:

cd ${REPO_DIR}/microsoft-ad
./setup.sh

Set up EKS infrastructure

This section provisions the Amazon EMR on EKS infrastructure stack, including the VPC, EKS cluster, Amazon Aurora PostgreSQL database, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, Amazon EMR on EKS virtual clusters, and the Spark Operator. Run the following script:

cd ${REPO_DIR}/data-infra
./setup.sh

Set up VPC peering

This section establishes network connectivity between the Active Directory VPC and EKS VPC for Kerberos authentication. Run the following script:

cd ${REPO_DIR}/vpc-peering
./setup.sh

Deploy Hive Metastore with Kerberos authentication

This section deploys a Kerberos-enabled HMS service on the EKS cluster. Complete the following steps:

  1. Create Kerberos Service Principal for HMS service
cd ${REPO_DIR}/microsoft-ad/
# Create HMS service principal
./manage-ad-service-principals.sh create hive "hive/hive-metastore-svc.hive-metastore.svc.cluster.local"
# Verify the service principal was created
./manage-ad-service-principals.sh list
  2. Deploy HMS service with Kerberos authentication
cd ${REPO_DIR}/hive-metastore
./deploy.sh

Set up Amazon EMR on Amazon EKS with Kerberos authentication

This section configures Spark jobs to authenticate with the Kerberos-enabled HMS. This involves creating service principals for Spark jobs and generating the necessary configuration files. Complete the following steps:

  1. Create Service Principal for Spark jobs
cd ${REPO_DIR}/microsoft-ad/
# Create Spark service principal
./manage-ad-service-principals.sh create spark "spark/analytics-team"
# Verify the service principal was created
./manage-ad-service-principals.sh list
  2. Generate Kerberos configurations for Spark jobs
cd ${REPO_DIR}/spark-jobs/
./generate-spark-configs.sh --principal "spark/analytics-team@CORP.KERBEROS" --namespace emr

Submit Spark jobs

This section verifies Kerberos authentication by running a Spark job that connects to the Kerberized HMS. Complete the following steps:

  1. Submit the test Spark job
cd ${REPO_DIR}/spark-jobs
kubectl apply -f spark-job.yaml
  2. Monitor job execution
# Watch the SparkApplication status
kubectl get sparkapplications -n emr -w
# Check pod status
kubectl get pods -n emr | grep "spark-kerberos"
  3. Verify Kerberos authentication and HMS connection
# Check Spark driver logs for successful authentication
kubectl logs spark-kerberos-job-driver -n emr

The logs should confirm successful authentication, along with a listing of sample databases and tables.

Understanding Kerberos configuration

The HMS requires specific configuration parameters to enable Kerberos authentication, applied through the steps you completed earlier. The key configurations are outlined in the following sections.

HMS configuration (metastore-site.xml)

The following configurations are added to metastore-site.xml file.

Setting | Value | Purpose
hive.metastore.sasl.enabled | true | Enable SASL authentication
hive.metastore.kerberos.principal | hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS | HMS service principal
hive.metastore.kerberos.keytab.file | /etc/security/keytab/hive.keytab | Keytab path

Hadoop security (core-site.xml)

The following configurations are added to the core-site.xml file.

Setting | Value
hadoop.security.authentication | kerberos
hadoop.security.authorization | true

Spark configuration

The following configurations are applied to the Spark jobs.

Setting | Value | Purpose
spark.security.credentials.kerberos.enabled | true | Enable Kerberos for Spark
spark.hadoop.hive.metastore.sasl.enabled | true | SASL for HMS connection
spark.kerberos.principal | spark/analytics-team@CORP.KERBEROS | Spark service principal
spark.kerberos.keytab | local:///etc/security/keytab/analytics-team.keytab | Keytab path
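In the Spark Operator deployment, these settings typically land under sparkConf in the SparkApplication manifest. A hypothetical excerpt (the repository's spark-job.yaml may differ):

```yaml
# Hypothetical sparkConf excerpt for a SparkApplication manifest
sparkConf:
  "spark.security.credentials.kerberos.enabled": "true"
  "spark.hadoop.hive.metastore.sasl.enabled": "true"
  "spark.kerberos.principal": "spark/analytics-team@CORP.KERBEROS"
  "spark.kerberos.keytab": "local:///etc/security/keytab/analytics-team.keytab"
```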

Shared Kerberos files

Both HMS and Spark pods mount two common Kerberos configuration files, krb5.conf and jaas.conf, from ConfigMaps, with keytabs mounted from Kubernetes secrets. The krb5.conf file is identical across both services and defines how each component connects to the KDC. The jaas.conf file follows the same structure but differs in the principal and keytab path for each service.

  1. krb5 Configuration
[libdefaults]
	default_realm = CORP.KERBEROS
	dns_lookup_realm = false
	dns_lookup_kdc = false
	ticket_lifetime = 24h
	forwardable = true
	udp_preference_limit = 1
	default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
	default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
	permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96

[realms]
	CORP.KERBEROS = {
		kdc = <ad-server-ip>
		admin_server = <ad-server-ip>
	}

[domain_realm]
	.corp.kerberos = CORP.KERBEROS
	corp.kerberos = CORP.KERBEROS

For more information, see the online documentation for krb5.conf.
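The <ad-server-ip> placeholders are substituted at deployment time. A minimal sketch of that templating step, assuming a placeholder domain controller IP (the repository's scripts discover the real one):

```shell
# Assumption: 10.0.1.10 stands in for the Active Directory server IP
AD_SERVER_IP=10.0.1.10

# Render a minimal krb5.conf with the KDC address filled in
cat > krb5.conf <<EOF
[libdefaults]
  default_realm = CORP.KERBEROS
  dns_lookup_kdc = false
  udp_preference_limit = 1

[realms]
  CORP.KERBEROS = {
    kdc = ${AD_SERVER_IP}
    admin_server = ${AD_SERVER_IP}
  }
EOF
```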

  2. JAAS configuration
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="/etc/security/keytab/hive.keytab"
 principal="hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS"
 useTicketCache=false
 storeKey=true
 debug=false;
};
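The Spark pods use the same structure with the Spark principal and keytab path substituted (values taken from the Spark configuration above):

```
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="/etc/security/keytab/analytics-team.keytab"
 principal="spark/analytics-team@CORP.KERBEROS"
 useTicketCache=false
 storeKey=true
 debug=false;
};
```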

Additional security considerations

This post focuses on core Kerberos authentication mechanics between Spark and HMS. We recommend two additional security hardening steps based on your organization’s security posture and compliance requirements.

Protecting Keytabs at Rest with AWS KMS Envelope Encryption

Keytabs stored as Kubernetes Secrets are only base64-encoded by default, not encrypted at rest. We recommend enabling EKS envelope encryption using an AWS Key Management Service (AWS KMS) customer managed key. With envelope encryption, secret data is encrypted with a Data Encryption Key (DEK), which is encrypted by your customer managed key. This protects keytab content even if the etcd datastore is compromised. To enable this on an existing EKS cluster:

aws eks associate-encryption-config \
  --cluster-name <your-cluster> \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:<region>:<account-id>:key/<key-id>"}}]'

Refer to the Amazon EKS documentation on envelope encryption for full setup guidance.

Encrypting the Thrift Data Channel with TLS

SASL with Kerberos provides mutual authentication but doesn’t automatically encrypt data over the Thrift connection. Many deployments default to the auth quality of protection (QoP), leaving the data channel unencrypted. We recommend either of the following:

  • Set SASL QoP to auth-conf — enables SASL-layer encryption using Kerberos session keys
  • Layer TLS over Thrift (preferred) — enables transport-level encryption using modern cipher suites
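As a sketch of the first option, the QoP property on the HiveServer2 side looks like the following; the exact property name for a standalone metastore can vary by Hive version, so treat this as an assumption to verify:

```xml
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>
```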

Enabling TLS on HiveServer2 / Hive Metastore Thrift:

<property>
  <name>hive.server2.use.SSL</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.keystore.path</name>
  <value>/etc/tls/keystore.jks</value>
</property>

Refer to the Hive SSL/TLS configuration documentation for full details.

Cleaning up

To avoid incurring future charges, clean up all resources provisioned during this setup by running the following cleanup script:

cd ${REPO_DIR}/
./cleanup.sh

Conclusion

In this post, we demonstrated how to implement Kerberos authentication for Amazon EMR on EKS to securely connect to a Kerberos-enabled HMS. This solution addresses a common challenge faced by organizations with existing Kerberos-enabled HMS deployments who want to adopt Amazon EMR on EKS while maintaining their existing security posture.

This pattern applies whether you’re migrating from on-premises Hadoop, running hybrid Amazon EMR on EC2 and Amazon EMR on EKS environments, or building a new cloud-native platform: any scenario where Spark jobs on Kubernetes must authenticate with a shared, Kerberos-enabled HMS.

You can use this post as a starting point to implement this pattern and extend it further to suit your organization’s data platform needs.


About the authors

Headshot of Krishna Kumar Venkateswaran

Krishna Kumar Venkateswaran is a Cloud Infrastructure Architect at Amazon Web Services (AWS), passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.

Headshot of Sunil Chakrapani Sundararaman

Sunil Chakrapani Sundararaman is a DevOps Architect at Amazon Web Services (AWS), where he helps enterprise customers architect and implement Data and Machine Learning platforms in the AWS Cloud. He brings extensive experience in Data Platform engineering, MLOps, DevOps, and Kubernetes implementations. Sunil specializes in guiding organizations through their cloud transformation journey, focusing on building scalable and efficient solutions that drive business value.

Headshot of Avinash Desireddy

Avinash Desireddy is a Specialist Solutions Architect (Containers) at Amazon Web Services (AWS), passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers and partners containerize applications, streamline deployments, and optimize cloud-native environments.

Headshot of Suvojit Dasgupta

Suvojit Dasgupta is an Engineering Leader at Amazon Web Services (AWS). He leads engineering teams, guiding them in designing and implementing scalable, high-performance data platforms for AWS customers. With expertise spanning distributed systems, real-time and batch data architectures, and cloud-native infrastructure, he drives technical strategy and engineering excellence across teams. He is passionate about raising the bar on engineering practices, and solving large-scale problems at the intersection of data and business impact.