AWS Big Data Blog

Implementing Kerberos authentication for Apache Spark jobs on Amazon EMR on EKS to access a Kerberos-enabled Hive Metastore

Many organizations run their Apache Spark analytics platforms on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), using Kerberos authentication to secure connectivity between Spark jobs and a centralized, shared Apache Hive Metastore (HMS). Amazon EMR on Amazon EKS gives these organizations an additional way to run Spark jobs, with the benefits of Kubernetes-based container orchestration, improved resource utilization, and faster job startup times. However, an HMS deployment supports only one authentication mechanism at a time, so Spark jobs on Amazon EMR on EKS must also be configured for Kerberos authentication to connect to the existing Kerberos-enabled HMS.

In this post, we show how to configure Kerberos authentication for Spark jobs on Amazon EMR on EKS, authenticating against a Kerberos-enabled HMS so you can run both Amazon EMR on EC2 and Amazon EMR on EKS workloads against a single, secure HMS deployment.

Overview of solution

Consider an enterprise data platform team that’s been running Spark jobs on Amazon EMR on EC2 for several years. Their architecture includes a Kerberos-enabled standalone HMS that serves as the centralized data catalog, with Microsoft Active Directory functioning as the Key Distribution Center (KDC). As the team evaluates Amazon EMR on EKS for new workloads, their existing HMS must continue serving Amazon EMR on EC2, with both platforms authenticating through the same Kerberos infrastructure. To address this, the platform team must configure their Spark jobs running on Amazon EMR on EKS to authenticate with the same KDC, so those jobs can obtain valid Kerberos tickets and establish authenticated connections to the HMS while maintaining a unified security posture across the data platform.

Architecture diagram showing two VPCs connected via VPC peering: the Active Directory VPC contains Microsoft Active Directory serving as the Kerberos Key Distribution Center (KDC) with ports 88 (Kerberos) and 749 (Admin). The Amazon EKS VPC contains two namespaces — the emr namespace runs Apache Spark jobs (each with a driver pod and executor pods) configured with krb5.conf, jaas.conf, and keytab files using a spark/analytics-team@CORP.KERBEROS principal; the hive-metastore namespace runs Hive Metastore pods (with deployment, replica set, and HPA) configured with Kerberos artifacts and the hive/hive-metastore@CORP.KERBEROS principal. Spark driver pods connect to the Hive Metastore service, which is backed by Amazon Aurora PostgreSQL for metadata storage and Amazon S3 for data storage. AWS Secrets Manager stores Kerberos keytabs and database credentials retrieved during deployment. Users submit Spark jobs via AWS Systems Manager Session Manager.

Scope of Kerberos in this solution

Kerberos authentication in this solution secures the connection between Spark jobs and the HMS. Other components in the architecture use AWS and Kubernetes security mechanisms instead.

Solution architecture

Our solution implements Kerberos authentication to secure the connection between Spark jobs and the HMS. The architecture spans two Amazon Virtual Private Clouds (Amazon VPCs) connected using VPC peering, with distinct components handling identity management, compute, and metadata services.

Identity and Authentication layer

A self-managed Microsoft Active Directory Domain Controller is deployed in a dedicated VPC and serves as the KDC for Kerberos authentication. The Active Directory server hosts service principals for both the HMS service and Spark jobs. This separate VPC deployment mirrors real-world enterprise architectures where Active Directory is typically managed by identity teams in their own network boundary, whether on-premises or in AWS.

Data Platform layer

The data platform components reside in a separate VPC, which includes an EKS cluster hosting both the HMS service and Amazon EMR on EKS based Spark jobs that persist data in an Amazon Simple Storage Service (Amazon S3) bucket.

Hive Metastore service

The HMS is deployed in the EKS hive-metastore namespace and simulates a pre-existing, standalone Kerberos-enabled HMS, a common enterprise pattern where HMS is managed independently of any data processing platform. You can learn more about other enterprise design patterns in the post Design patterns for implementing Hive Metastore for Amazon EMR or EKS. The HMS service authenticates with the KDC using its service principal and keytab mounted from a Kubernetes secret.

Apache Spark Execution layer

Apache Spark jobs are deployed using the Spark Operator on EKS. The Spark driver and executor pods are configured with Kerberos credentials through mounted ConfigMaps containing krb5.conf and jaas.conf, along with keytab files from Kubernetes secrets. When a Spark job must access Hive tables, the driver authenticates with the KDC and establishes a secure Simple Authentication and Security Layer (SASL) connection to the HMS.
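The credential mounting described above can be sketched in a SparkApplication manifest. The ConfigMap, Secret, and volume names here are hypothetical, and the actual manifests in the sample repository may differ:

```yaml
# Hypothetical excerpt of a SparkApplication: mounting Kerberos artifacts
# into the driver and executor pods (names are assumptions)
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-kerberos-job
  namespace: emr
spec:
  volumes:
    - name: krb5-config
      configMap:
        name: krb5-config        # contains krb5.conf and jaas.conf
    - name: kerberos-keytab
      secret:
        secretName: spark-keytab # keytab from a Kubernetes secret
  driver:
    volumeMounts:
      - name: krb5-config
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
      - name: kerberos-keytab
        mountPath: /etc/security/keytab
  executor:
    volumeMounts:
      - name: krb5-config
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
      - name: kerberos-keytab
        mountPath: /etc/security/keytab
```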

Authentication flow

The HMS runs as a long-running Kubernetes service that must be deployed and authenticated before Spark jobs can connect.

During HMS deployment:

  1. The HMS pod validates its Kerberos configuration; krb5.conf and jaas.conf are mounted from ConfigMaps
  2. The service authenticates with the KDC using its principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
  3. The keytab is mounted from a Kubernetes secret for credential access
  4. A secure Thrift endpoint is established on port 9083 with SASL authentication enabled

When a Spark job must interact with the HMS:

  1. Spark job submission:
    1. User submits Spark job through Spark Operator
    2. Driver and executor pods are created with Kerberos configuration mounted as volumes
    3. krb5.conf ConfigMap provides KDC connection details including realm and server addresses
    4. jaas.conf ConfigMap specifies a login module configuration with keytab path and principal
    5. Keytab secret contains encrypted credentials for Spark service principal spark/analytics-team@CORP.KERBEROS
  2. Authentication and connection:
    1. Spark driver authenticates with KDC using its principal and keytab to obtain a Ticket Granting Ticket (TGT)
    2. When connecting to HMS, Spark requests a service ticket from the KDC for the HMS principal hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS
    3. KDC issues a service ticket encrypted with HMS’s secret key
    4. Spark presents this service ticket to the HMS over the Thrift connection on port 9083
    5. HMS decrypts the ticket using its keytab, verifies Spark’s identity, and establishes the authenticated SASL session
    6. Executor pods use the same configuration for authenticated operations
  3. Data access:
    1. Authenticated Spark job queries HMS for table metadata
    2. HMS validates Kerberos tickets before serving metadata requests
    3. Spark accesses underlying data in Amazon S3 using IAM Roles for Service Accounts (IRSA)

Sequence diagram illustrating the Kerberos authentication flow between a Spark job and the Hive Metastore. The flow proceeds in five phases: (1) Job Submission — a Data Engineer submits a SparkApplication via kubectl, and the Spark Operator creates a driver pod with krb5.conf, jaas.conf, and keytab mounted. (2) Kerberos Authentication — the Spark driver loads its keytab for the spark/analytics-team@CORP.KERBEROS principal and sends an AS-REQ to the Active Directory KDC, which validates the credentials and returns a TGT (Ticket Granting Ticket). (3) Service Ticket Request — the Spark driver sends a TGS-REQ to the KDC requesting a service ticket for the Hive Metastore principal, and the KDC returns a service ticket encrypted with the HMS key. (4) Authenticated Connection — the Spark driver connects to the Hive Metastore over Thrift (port 9083) using SASL with the service ticket; HMS decrypts the ticket using its own keytab, verifies the Spark identity, and establishes an authenticated session. (5) Data Operations — the Spark driver queries table metadata from HMS (backed by Aurora PostgreSQL) and reads/writes table data directly from Amazon S3 using IRSA credentials.

Implementation workflow

The implementation involves three key stakeholders working together to establish the Kerberos-enabled communication:

Microsoft Active Directory Administrator

The Active Directory Administrator creates service accounts that are used for HMS and Spark jobs. This involves setting up the service principal names using the setspn utility and generating keytab files using ktpass for secure credential storage. The administrator configures the appropriate Active Directory permissions and Kerberos AES256 encryption type. Finally, the keytab files are uploaded to AWS Secrets Manager for secure distribution to Kubernetes workloads.

Data Platform Team

The platform team handles the Amazon EMR on EKS and Kubernetes configurations. They retrieve keytabs from Secrets Manager and create Kubernetes secrets for the workloads. They configure Helm charts for HMS deployment with Kerberos settings and set up ConfigMaps for krb5.conf, jaas.conf, and core-site.xml.
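A minimal sketch of that keytab-to-secret step, with hypothetical secret and file names (the repository's scripts handle this for you). The Secrets Manager retrieval is shown as a comment because it requires live AWS credentials:

```shell
# Hypothetical sketch: package a keytab as a Kubernetes Secret manifest.
# In the real flow the keytab bytes come from AWS Secrets Manager, e.g.:
#   aws secretsmanager get-secret-value --secret-id <keytab-secret-id> \
#     --query SecretBinary --output text | base64 -d > hive.keytab
printf 'stand-in-keytab-bytes' > hive.keytab   # placeholder for the real keytab

# Kubernetes Secret data must be base64-encoded
KEYTAB_B64=$(base64 < hive.keytab | tr -d '\n')
cat > hive-keytab-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hive-keytab
  namespace: hive-metastore
type: Opaque
data:
  hive.keytab: ${KEYTAB_B64}
EOF
```

Applying the manifest (for example, with kubectl apply -f hive-keytab-secret.yaml) makes the keytab mountable at the path the HMS pod expects.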

Data Engineering Operations

Data engineers submit jobs using the configured service account with Kerberos authentication. They monitor job execution and verify authenticated access to HMS.

Deploy the solution

In the remainder of this post, you will explore the implementation details for this solution. You can find the sample code in the AWS Samples GitHub repository. For additional details, including verification steps for each deployment stage, refer to the README in the repository.

Prerequisites

Before you deploy this solution, make sure that the necessary prerequisites are in place; refer to the README in the repository for the full list.

Clone the repository and set up environment variables

Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.

# Clone the Git repository
git clone https://github.com/aws-samples/sample-emr-eks-spark-kerberos-hms.git
cd sample-emr-eks-spark-kerberos-hms

# Set environment variables
export REPO_DIR=$(pwd)
export AWS_REGION=<AWS_REGION>

Set up Microsoft Active Directory infrastructure

In this section, we deploy a self-managed Microsoft Active Directory, serving as the KDC, on a Windows Server EC2 instance in a dedicated VPC. This is an intentionally minimal implementation highlighting only the key components required for this blog post. Run the following script:

cd ${REPO_DIR}/microsoft-ad
./setup.sh

Set up EKS infrastructure

This section provisions the Amazon EMR on EKS infrastructure stack, including the VPC, EKS cluster, Amazon Aurora PostgreSQL database, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, Amazon EMR on EKS virtual clusters, and the Spark Operator. Run the following script:

cd ${REPO_DIR}/data-infra
./setup.sh

Set up VPC peering

This section establishes network connectivity between the Active Directory VPC and EKS VPC for Kerberos authentication. Run the following script:

cd ${REPO_DIR}/vpc-peering
./setup.sh

Deploy Hive Metastore with Kerberos authentication

This section deploys a Kerberos-enabled HMS service on the EKS cluster. Complete the following steps:

  1. Create Kerberos Service Principal for HMS service
cd ${REPO_DIR}/microsoft-ad/
# Create HMS service principal
./manage-ad-service-principals.sh create hive "hive/hive-metastore-svc.hive-metastore.svc.cluster.local"
# Verify the service principal was created
./manage-ad-service-principals.sh list
  2. Deploy HMS service with Kerberos authentication
cd ${REPO_DIR}/hive-metastore
./deploy.sh

Set up Amazon EMR on Amazon EKS with Kerberos authentication

This section configures Spark jobs to authenticate with the Kerberos-enabled HMS. This involves creating service principals for Spark jobs and generating the necessary configuration files. Complete the following steps:

  1. Create Service Principal for Spark jobs
cd ${REPO_DIR}/microsoft-ad/
# Create Spark service principal
./manage-ad-service-principals.sh create spark "spark/analytics-team"
# Verify the service principal was created
./manage-ad-service-principals.sh list
  2. Generate Kerberos configurations for Spark jobs
cd ${REPO_DIR}/spark-jobs/
./generate-spark-configs.sh --principal "spark/analytics-team@CORP.KERBEROS" --namespace emr

Submit Spark jobs

This section verifies Kerberos authentication by running a Spark job that connects to the Kerberized HMS. Complete the following steps:

  1. Submit the test Spark job
cd ${REPO_DIR}/spark-jobs
kubectl apply -f spark-job.yaml
  2. Monitor job execution
# Watch the SparkApplication status
kubectl get sparkapplications -n emr -w
# Check pod status
kubectl get pods -n emr | grep "spark-kerberos"
  3. Verify Kerberos authentication and HMS connection
# Check Spark driver logs for successful authentication
kubectl logs spark-kerberos-job-driver -n emr

The logs should confirm successful authentication, along with a listing of sample databases and tables.

Understanding Kerberos configuration

The HMS requires specific configuration parameters to enable Kerberos authentication, applied through the steps you completed earlier. The key configurations are outlined in the following sections.

HMS configuration (metastore-site.xml)

The following configurations are added to metastore-site.xml file.

Setting | Value | Purpose
hive.metastore.sasl.enabled | true | Enable SASL authentication
hive.metastore.kerberos.principal | hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS | HMS service principal
hive.metastore.kerberos.keytab.file | /etc/security/keytab/hive.keytab | Keytab path

Hadoop security (core-site.xml)

The following configurations are added to the core-site.xml file.

Setting | Value
hadoop.security.authentication | kerberos
hadoop.security.authorization | true

Spark configuration

The following configurations are applied to the Spark jobs.

Setting | Value | Purpose
spark.security.credentials.kerberos.enabled | true | Enable Kerberos for Spark
spark.hadoop.hive.metastore.sasl.enabled | true | SASL for HMS connection
spark.kerberos.principal | spark/analytics-team@CORP.KERBEROS | Spark service principal
spark.kerberos.keytab | local:///etc/security/keytab/analytics-team.keytab | Keytab path
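In the Spark Operator deployment, these settings typically land under sparkConf in the SparkApplication manifest. A hypothetical excerpt (the repository's spark-job.yaml may differ):

```yaml
# Hypothetical sparkConf excerpt for a SparkApplication manifest
sparkConf:
  "spark.security.credentials.kerberos.enabled": "true"
  "spark.hadoop.hive.metastore.sasl.enabled": "true"
  "spark.kerberos.principal": "spark/analytics-team@CORP.KERBEROS"
  "spark.kerberos.keytab": "local:///etc/security/keytab/analytics-team.keytab"
```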

Shared Kerberos files

Both HMS and Spark pods mount two common Kerberos configuration files, krb5.conf and jaas.conf, from ConfigMaps, with keytabs mounted from Kubernetes secrets. The krb5.conf file is identical across both services and defines how each component connects to the KDC. The jaas.conf file follows the same structure but differs in the principal and keytab path for each service.

  1. krb5 Configuration
[libdefaults]
	default_realm = CORP.KERBEROS
	dns_lookup_realm = false
	dns_lookup_kdc = false
	ticket_lifetime = 24h
	forwardable = true
	udp_preference_limit = 1
	default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
	default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
	permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96

[realms]
	CORP.KERBEROS = {
		kdc = <ad-server-ip>
		admin_server = <ad-server-ip>
	}

[domain_realm]
	.corp.kerberos = CORP.KERBEROS
	corp.kerberos = CORP.KERBEROS

For more information, see the online documentation for krb5.conf.
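The <ad-server-ip> placeholders are substituted at deployment time. A minimal sketch of that templating step, assuming a placeholder domain controller IP (the repository's scripts discover the real one):

```shell
# Assumption: 10.0.1.10 stands in for the Active Directory server IP
AD_SERVER_IP=10.0.1.10

# Render a minimal krb5.conf with the KDC address filled in
cat > krb5.conf <<EOF
[libdefaults]
  default_realm = CORP.KERBEROS
  dns_lookup_kdc = false
  udp_preference_limit = 1

[realms]
  CORP.KERBEROS = {
    kdc = ${AD_SERVER_IP}
    admin_server = ${AD_SERVER_IP}
  }
EOF
```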

  2. JAAS configuration
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="/etc/security/keytab/hive.keytab"
 principal="hive/hive-metastore-svc.hive-metastore.svc.cluster.local@CORP.KERBEROS"
 useTicketCache=false
 storeKey=true
 debug=false;
};
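The Spark pods use the same structure with the Spark principal and keytab path substituted (values taken from the Spark configuration above):

```
Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="/etc/security/keytab/analytics-team.keytab"
 principal="spark/analytics-team@CORP.KERBEROS"
 useTicketCache=false
 storeKey=true
 debug=false;
};
```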

Additional security considerations

This post focuses on core Kerberos authentication mechanics between Spark and HMS. We recommend two additional security hardening steps based on your organization’s security posture and compliance requirements.

Protecting Keytabs at Rest with AWS KMS Envelope Encryption

Keytabs stored as Kubernetes Secrets are only base64-encoded by default, not encrypted at rest. We recommend enabling EKS envelope encryption using an AWS Key Management Service (AWS KMS) customer managed key. With envelope encryption, secret data is encrypted with a Data Encryption Key (DEK), which is encrypted by your customer managed key. This protects keytab content even if the etcd datastore is compromised. To enable this on an existing EKS cluster:

aws eks associate-encryption-config \
  --cluster-name <your-cluster> \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:<region>:<account-id>:key/<key-id>"}}]'

Refer to the Amazon EKS documentation on envelope encryption for full setup guidance.

Encrypting the Thrift Data Channel with TLS

SASL with Kerberos provides mutual authentication but doesn’t automatically encrypt data over the Thrift connection. Many deployments default to the auth quality of protection (QoP), leaving the data channel unencrypted. We recommend either of the following:

  • Set SASL QoP to auth-conf — enables SASL-layer encryption using Kerberos session keys
  • Layer TLS over Thrift (preferred) — enables transport-level encryption using modern cipher suites
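As a sketch of the first option, the QoP property on the HiveServer2 side looks like the following; the exact property name for a standalone metastore can vary by Hive version, so treat this as an assumption to verify:

```xml
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>
```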

Enabling TLS on HiveServer2 / Hive Metastore Thrift:

<property>
  <name>hive.server2.use.SSL</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.keystore.path</name>
  <value>/etc/tls/keystore.jks</value>
</property>

Refer to the Hive SSL/TLS configuration documentation for full details.

Cleaning up

To avoid incurring future charges, clean up all resources provisioned during this setup by running the following cleanup script:

cd ${REPO_DIR}/
./cleanup.sh

Conclusion

In this post, we demonstrated how to implement Kerberos authentication for Amazon EMR on EKS to securely connect to a Kerberos-enabled HMS. This solution addresses a common challenge faced by organizations with existing Kerberos-enabled HMS deployments who want to adopt Amazon EMR on EKS while maintaining their existing security posture.

This pattern applies whether you’re migrating from on-premises Hadoop, running hybrid Amazon EMR on EC2 and Amazon EMR on EKS environments, or building a new cloud-native platform: any scenario where Spark jobs on Kubernetes must authenticate with a shared, Kerberos-enabled HMS.

You can use this post as a starting point to implement this pattern and extend it further to suit your organization’s data platform needs.


About the authors

Headshot of Krishna Kumar Venkateswaran

Krishna Kumar Venkateswaran is a Cloud Infrastructure Architect at Amazon Web Services (AWS), passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.

Headshot of Sunil Chakrapani Sundararaman

Sunil Chakrapani Sundararaman is a DevOps Architect at Amazon Web Services (AWS), where he helps enterprise customers architect and implement Data and Machine Learning platforms in the AWS Cloud. He brings extensive experience in Data Platform engineering, MLOps, DevOps, and Kubernetes implementations. Sunil specializes in guiding organizations through their cloud transformation journey, focusing on building scalable and efficient solutions that drive business value.

Headshot of Avinash Desireddy

Avinash Desireddy is a Specialist Solutions Architect (Containers) at Amazon Web Services (AWS), passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers and partners containerize applications, streamline deployments, and optimize cloud-native environments.

Headshot of Suvojit Dasgupta

Suvojit Dasgupta is an Engineering Leader at Amazon Web Services (AWS). He leads engineering teams, guiding them in designing and implementing scalable, high-performance data platforms for AWS customers. With expertise spanning distributed systems, real-time and batch data architectures, and cloud-native infrastructure, he drives technical strategy and engineering excellence across teams. He is passionate about raising the bar on engineering practices, and solving large-scale problems at the intersection of data and business impact.