AWS Big Data Blog

Secure Apache Spark writes to Amazon S3 on Amazon EMR with dynamic AWS KMS encryption

When processing data at scale, many organizations use Apache Spark on Amazon EMR to run shared clusters that handle workloads across tenants, business units, or classification levels. In such multi-tenant environments, different datasets often require distinct AWS Key Management Service (AWS KMS) keys to enforce strict access controls and meet compliance requirements. At the same time, operational efficiency might drive these organizations to consolidate their data pipelines. Instead of running separate Spark jobs for each dataset, it could be more efficient to run a single job on Amazon EMR that processes inputs once and writes multiple outputs to Amazon Simple Storage Service (Amazon S3), each encrypted with its own KMS key.

Although consolidating multiple datasets in one Spark job reduces orchestration overhead and simplifies code maintenance, you might encounter challenges with encryption configurations. By default, the EMRFS and S3A file system clients cache their settings, which can cause encryption keys to persist incorrectly across writes. This means if you change the encryption key between writes to Amazon S3 in an Apache Hadoop environment, some output files can end up encrypted with unintended keys, leading to possible security and compliance concerns.

In this post, we show how to securely write data to Amazon S3 from Spark jobs running on Amazon EMR, while dynamically managing different KMS keys for encryption. We discuss three approaches to solve this challenge and how to choose the right solution for your use case.

Amazon S3 server-side encryption options with Amazon EMR

When writing data to Amazon S3 from Amazon EMR, you can choose from multiple server-side encryption options. The two most commonly used options are:

  • Server-side encryption with Amazon S3 managed keys (SSE-S3) – Amazon S3 manages the encryption keys for you
  • Server-side encryption with KMS keys (SSE-KMS) – AWS KMS manages the keys, and you can use custom KMS keys with fine-grained access control

When running Spark jobs on Amazon EMR, data writes to Amazon S3 occur through one of the following file system implementations:

  • EMRFS – The default implementation for Amazon EMR versions below 7.10.0
  • S3A – The default implementation starting from Amazon EMR 7.10.0

Both implementations provide configuration properties to control server-side encryption. The following properties specify the KMS key for SSE-KMS encryption.

For EMRFS (the default in Amazon EMR versions below 7.10.0), use these properties:

  • fs.s3.enableServerSideEncryption – Enables server-side encryption. Defaults to SSE-S3 if no KMS key is provided.
  • fs.s3.serverSideEncryption.kms.keyId – Specifies the KMS key ID or ARN for SSE-KMS encryption.

For S3A (the default starting from Amazon EMR 7.10.0), use these properties:

  • fs.s3a.encryption.algorithm – Specifies the encryption algorithm (for example, SSE-KMS).
  • fs.s3a.encryption.key – Specifies the KMS key ID or ARN for SSE-KMS encryption.

Starting from the Amazon EMR 7.10.0 release, the S3A file system has replaced EMRFS as the default EMR S3 connector. For more information, refer to Migration Guide: EMRFS to S3A Filesystem.
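In Spark, you typically pass these Hadoop properties with the spark.hadoop. prefix so they are copied into the underlying Hadoop configuration. The following is a minimal sketch (the key ARN is a placeholder) of the equivalent Spark session configuration for each file system.

# Minimal sketch: passing the encryption properties through Spark (key ARN is a placeholder)
from pyspark.sql import SparkSession

kms_key_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

# EMRFS (s3:// scheme)
spark = SparkSession.builder \
    .appName("SSE-KMS with EMRFS") \
    .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true") \
    .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_arn) \
    .getOrCreate()

# S3A (s3a:// scheme) uses the equivalent settings:
#     .config("spark.hadoop.fs.s3a.encryption.algorithm", "SSE-KMS")
#     .config("spark.hadoop.fs.s3a.encryption.key", kms_key_arn)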

The challenge of preventing encryption key reuse caused by file system caching

In practice, a unified Spark job might write outputs for multiple tenants or classifications in a single run. In this situation, applying the correct encryption key for each output is critical to maintaining compliance and enforcing isolation in multi-tenant S3 buckets without the complexity of managing separate Spark jobs for each dataset.

When Spark executors write to Amazon S3, they use a file system (EMRFS or S3A) client that is cached and reused for performance optimization. The problem is that each file system instance keeps the encryption settings it was first created with. Each executor’s Java Virtual Machine (JVM) creates and caches a file system (and its underlying S3 client) for a given S3 bucket. This cached instance, along with its encryption configuration, persists throughout the executor’s lifecycle. If you change the encryption key in Spark after some data has been written, the existing cached client can’t pick up the new key.

For example, the following PySpark code first creates a Spark session with S3 server-side encryption enabled. It then writes a DataFrame to two different folders within the same S3 bucket amzn-s3-demo-bucket1, but with different KMS keys. The first write operation writes the data to folder1 using kmsKey1 for encryption, and the second write operation writes to folder2 using kmsKey2.

# Pseudo-code: setting different keys for successive writes to different folders within the same S3 bucket
from pyspark.sql import SparkSession

# Create a Spark session with S3 server-side encryption enabled
spark = SparkSession.builder \
    .appName("Write data to S3 with KMS") \
    .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true") \
    .getOrCreate()

# df is an existing DataFrame; kmsKey1 and kmsKey2 hold the KMS key ARNs
df.write.option('fs.s3.serverSideEncryption.kms.keyId', kmsKey1).save("s3://amzn-s3-demo-bucket1/folder1/")
df.write.option('fs.s3.serverSideEncryption.kms.keyId', kmsKey2).save("s3://amzn-s3-demo-bucket1/folder2/")

You might expect files in folder1/ to use kmsKey1 and files in folder2/ to use kmsKey2. But due to caching, the second write can still use the client configured with kmsKey1. This leads to mixed or incorrect encryption key usage across outputs.
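You can observe this caching behavior directly through the Hadoop FileSystem API, accessed here through PySpark's internal JVM gateway. The following is a minimal sketch, assuming an existing SparkSession named spark on an EMR cluster; it shows that two lookups for the same bucket return the same cached file system instance, even when the second configuration specifies a different key. The same per-JVM cache exists on each executor, which is where the writes actually happen.

# Minimal sketch: the Hadoop FileSystem cache returns the same instance per bucket
jvm = spark._jvm
FileSystem = jvm.org.apache.hadoop.fs.FileSystem
uri = jvm.java.net.URI("s3://amzn-s3-demo-bucket1/")

conf1 = spark._jsc.hadoopConfiguration()
fs1 = FileSystem.get(uri, conf1)

# A copy of the configuration that specifies a different KMS key (placeholder ARN)
conf2 = jvm.org.apache.hadoop.conf.Configuration(conf1)
conf2.set("fs.s3.serverSideEncryption.kms.keyId", "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID")
fs2 = FileSystem.get(uri, conf2)

# With the cache enabled, both lookups return the same object, so the encryption
# settings captured when the file system was first created are reused
print(fs1.equals(fs2))  # True when the cache is enabled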

Solution overview

Our objective is to achieve correct encryption of each output S3 object with its intended KMS key, even when a single Spark job writes multiple outputs. To implement this, you can use one of the following approaches:

  • Disable file system cache – Turn off S3 client caching so a new client is created for each write, picking up the current key
  • Separate Spark applications or sessions – Run a separate Spark application (or session) for each distinct encryption key, so each client is initialized fresh
  • Use S3 bucket default encryption – Configure bucket-level SSE-KMS with the desired key so Amazon S3 automatically applies the correct encryption key

Each method offers a different balance of implementation complexity, performance, and flexibility. The following sections provide detailed implementation steps and considerations for each approach.

Method 1: Disable file system cache

Disabling the file system cache forces Spark to create a new S3 client for each write, which applies the updated encryption settings. This can be done using a Spark configuration or EMR cluster settings.

The property name for disabling the cache depends on your URI scheme (s3:// or s3a://), not on your choice of file system (EMRFS or S3A). Use one of the following properties to disable the cache:

  • For the s3:// URI scheme – fs.s3.impl.disable.cache (in the core-site configuration) or spark.hadoop.fs.s3.impl.disable.cache (in the Spark configuration)
  • For the s3a:// URI scheme – fs.s3a.impl.disable.cache (in the core-site configuration) or spark.hadoop.fs.s3a.impl.disable.cache (in the Spark configuration)

To use this method, complete the following steps:

  1. Disable the file system cache in Spark configuration.
    For s3:// scheme, you can manually set spark.hadoop.fs.s3.impl.disable.cache=true, for example (PySpark):

    # PySpark example for "s3://"
    from pyspark.sql import SparkSession

    # Create Spark session with the file system cache disabled
    spark = SparkSession.builder \
        .appName("Write data to S3 with KMS") \
        .config("spark.hadoop.fs.s3.impl.disable.cache", "true") \
        .getOrCreate()

    Alternatively, you can use the following spark-defaults configuration classification:

    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.hadoop.fs.s3.impl.disable.cache": "true"
        }
      }
    ]

    For s3a:// scheme, you can manually set spark.hadoop.fs.s3a.impl.disable.cache=true, for example (PySpark):

    # PySpark example for "s3a://"
    from pyspark.sql import SparkSession

    # Create Spark session with the file system cache disabled
    spark = SparkSession.builder \
        .appName("Write data to S3 with KMS") \
        .config("spark.hadoop.fs.s3a.impl.disable.cache", "true") \
        .getOrCreate()

    Alternatively, you can use the following spark-defaults configuration classification:

    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.hadoop.fs.s3a.impl.disable.cache": "true"
        }
      }
    ]
  2. Alternatively, instead of disabling the cache only for Spark applications, you can disable the file system cache globally at the cluster level through the cluster’s core-site configuration (the /etc/hadoop/conf/core-site.xml file on your EMR cluster nodes). For example, when creating or modifying the cluster, use the following configuration classification.
    For s3:// scheme:

    [
      {
        "Classification": "core-site",
        "Properties": {
          "fs.s3.impl.disable.cache": "true"
        }
      }
    ]

    For s3a:// scheme:

    [
      {
        "Classification": "core-site",
        "Properties": { 
          "fs.s3a.impl.disable.cache": "true"
        }
      }
    ]
  3. Enable SSE-KMS encryption.
    For EMRFS, set spark.hadoop.fs.s3.enableServerSideEncryption=true for Spark applications only, or use the following configuration to enable encryption at the cluster level:

    [
      {
        "Classification": "emrfs-site",
        "Properties": {
          "fs.s3.enableServerSideEncryption": "true"
        }
      }
    ]

    For S3A, set spark.hadoop.fs.s3a.encryption.algorithm=SSE-KMS for Spark applications only, or use the following configuration to enable encryption at the cluster level:

    [
      {
        "Classification": "core-site",
        "Properties": {
          "fs.s3a.encryption.algorithm": "SSE-KMS"
        }
      }
    ]
  4. When using EMRFS with fs.s3.impl.disable.cache=true, you must also disable the EMRFS S3-optimized committer to avoid errors. You can do this by either manually setting spark.sql.parquet.fs.optimized.committer.optimization-enabled=false or using the following spark-defaults configuration classification:
    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.sql.parquet.fs.optimized.committer.optimization-enabled": "false"
        }
      }
    ]

For more information about configuring applications on EMR clusters, refer to Configure applications when you create a cluster and Reconfigure an instance group in a running cluster.
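Putting the steps together, the following is a minimal end-to-end sketch for the s3:// (EMRFS) scheme. It assumes df is an existing DataFrame and kmsKey1 and kmsKey2 hold the intended KMS key ARNs; with the cache disabled, each write creates a fresh file system client that picks up the key supplied for that write.

# Minimal Method 1 sketch for the s3:// (EMRFS) scheme
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Write data to S3 with per-write KMS keys") \
    .config("spark.hadoop.fs.s3.impl.disable.cache", "true") \
    .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true") \
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "false") \
    .getOrCreate()

# Each write now initializes a fresh file system client with the supplied key
df.write.option("fs.s3.serverSideEncryption.kms.keyId", kmsKey1) \
    .save("s3://amzn-s3-demo-bucket1/folder1/")
df.write.option("fs.s3.serverSideEncryption.kms.keyId", kmsKey2) \
    .save("s3://amzn-s3-demo-bucket1/folder2/")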

Considerations

Use this method when you need to write data to an S3 bucket using multiple KMS keys within a single Spark application. This is a quick, straightforward implementation that works well for the following use cases:

  • Testing environments and debugging sessions
  • Proof-of-concept demonstrations
  • Low-volume or one-time jobs where write performance is not critical
  • Workloads that frequently switch encryption keys to write data to different S3 prefixes within the same bucket

Before implementation, consider the following performance impacts:

  • Increased latency for each write
  • Additional S3 API operations
  • Extra connection overhead

Although this method provides a pragmatic solution when splitting work into separate Spark applications isn’t feasible, we don’t recommend it for high-throughput or latency-sensitive production workloads. The increased API traffic can lead to higher costs and potential throttling. For production implementations, consider Method 2 and Method 3.

Method 2: Use separate Spark applications or sessions

When writing data with multiple encryption keys, use a separate Spark application (or Spark session) for each distinct key. The file system must be initialized with the correct encryption key in a fresh JVM context, which you can achieve by submitting a separate Spark application or by starting a new Spark session for each key. Either way, the S3 client is created with the intended encryption key rather than reusing a cached configuration.

Complete the following steps:

  1. Divide the write tasks by KMS key. For example, prepare separate DataFrames or filter logic for each key.
  2. Submit separate jobs. Choose either of the following options:
    1. Use spark-submit commands. For example:
      # For EMRFS
      spark-submit --conf spark.hadoop.fs.s3.enableServerSideEncryption=true \
          --conf spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=kmsKey1 job1.py
      spark-submit --conf spark.hadoop.fs.s3.enableServerSideEncryption=true \
          --conf spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=kmsKey2 job2.py
      
      # For S3A
      spark-submit --conf spark.hadoop.fs.s3a.encryption.algorithm=SSE-KMS \
          --conf spark.hadoop.fs.s3a.encryption.key=kmsKey1 job1.py
      spark-submit --conf spark.hadoop.fs.s3a.encryption.algorithm=SSE-KMS \
          --conf spark.hadoop.fs.s3a.encryption.key=kmsKey2 job2.py
    2. Use Spark sessions in code (PySpark example):
      from pyspark.sql import SparkSession

      # For EMRFS
      for kmsKey in [kmsKey1, kmsKey2]:
          spark = SparkSession.builder \
              .appName("Write data to S3 with KMS") \
              .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true") \
              .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kmsKey) \
              .getOrCreate()
          write_df_for_key(kmsKey)  # Pseudocode for writing data for this key
          spark.stop()
      
      # For S3A
      for kmsKey in [kmsKey1, kmsKey2]:
          spark = SparkSession.builder \
              .appName("Write data to S3 with KMS") \
              .config("spark.hadoop.fs.s3a.encryption.algorithm", "SSE-KMS") \
              .config("spark.hadoop.fs.s3a.encryption.key", kmsKey) \
              .getOrCreate()
          write_df_for_key(kmsKey)  # Pseudocode for writing data for this key
          spark.stop()
  3. Use your preferred workflow (such as AWS Step Functions, Apache Airflow, or a wrapper script) to launch jobs in sequence or in parallel; a minimal wrapper sketch follows this list.
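As referenced in step 3, the following is a hypothetical wrapper sketch (not from the original post) that submits one Spark application per KMS key in sequence. The key ARNs, output paths, and job.py script are placeholders; the EMRFS properties match the spark-submit example above.

# Hypothetical wrapper: one spark-submit per KMS key (EMRFS shown; placeholders throughout)
import subprocess

jobs = [
    ("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-1", "s3://amzn-s3-demo-bucket1/folder1/"),
    ("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-2", "s3://amzn-s3-demo-bucket1/folder2/"),
]

for kms_key_arn, output_path in jobs:
    subprocess.run(
        [
            "spark-submit",
            "--conf", "spark.hadoop.fs.s3.enableServerSideEncryption=true",
            "--conf", f"spark.hadoop.fs.s3.serverSideEncryption.kms.keyId={kms_key_arn}",
            "job.py", output_path,
        ],
        check=True,  # stop if a job fails
    )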

Considerations

Use this method when you need a production-grade solution for applying different KMS keys at scale. This approach maintains file system caching benefits and works well for the following use cases:

  • High-throughput or latency-sensitive workloads with frequent write operations
  • Scenarios requiring strong isolation between different KMS keys
  • Multi-tenant environments with separate compliance boundaries

This method creates fresh S3 clients with each new Spark application or session. Compared to Method 1, it offers several advantages:

  • Avoids per-write connection and API overhead
  • Maintains full compatibility with the EMRFS S3-optimized committer
  • Enforces credential boundaries between workloads and improves operational and compliance isolation by assigning each Spark application or session a dedicated key

Before implementation, consider the following trade-offs:

  • Requires orchestration of multiple Spark applications and sessions or clusters
  • Involves higher resource overhead
  • Increases operational complexity

Choose this method when performance, cost predictability, and security isolation are more important than single-process simplicity.

Method 3: Use S3 bucket default encryption

Where possible, configure S3 bucket-level default encryption (SSE-KMS) with the desired KMS key to automatically encrypt objects written to that bucket.

Complete the following steps:

  1. On the Amazon S3 console or using the AWS Command Line Interface (AWS CLI), enable default SSE-KMS for the bucket with the desired key. For instructions on enabling SSE-KMS for S3 buckets, refer to Configuring default encryption. A minimal sketch using the AWS SDK for Python (Boto3) follows this list.
  2. With default encryption enabled for the S3 bucket, you can write without specifying a KMS key in Spark. Amazon S3 automatically encrypts each object with the bucket’s default key, so your Spark code only needs standard write operations.
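The following is a minimal Boto3 sketch of step 1 (the bucket name and key ARN are placeholders); it sets SSE-KMS as the bucket’s default encryption so that subsequent Spark writes to the bucket are encrypted with that key automatically.

# Minimal sketch: set SSE-KMS default encryption on a bucket (placeholders throughout)
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="amzn-s3-demo-bucket2",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
                }
            }
        ]
    },
)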

Considerations

Use this method when your workloads can use a single KMS key per bucket. This approach works well for the following use cases:

  • Production environments prioritizing operational simplicity
  • Workloads where all data in a bucket shares the same security requirements
  • Scenarios where encryption configuration should be managed at the bucket level
  • Use cases that map naturally to per-bucket separation

This method provides several advantages:

  • Alleviates the need to configure encryption in Spark applications
  • Automatically applies the default KMS key for all writes
  • Simplifies encryption management

Before implementation, consider the following limitations:

  • All data written to the bucket shares the bucket’s single default encryption key
  • You can’t apply different keys to different prefixes within the same bucket

Choose this method when you need a simple, reliable approach that provides strong security while simplifying operational management.

Choosing the right approach

Choose the method based on your workload’s security requirements, performance needs, and operational constraints:

  • Method 1 – Use when you need to apply multiple KMS keys within a single Spark job and can accept some performance impact
  • Method 2 – Use for production workloads that require different encryption keys within the same bucket and need optimal performance
  • Method 3 – Use when a single KMS key per bucket meets your encryption requirements and you want simplified operations

Conclusion

In this post, we demonstrated how to handle multiple KMS keys when writing to Amazon S3 from Spark jobs on Amazon EMR. When encrypting multiple outputs with different encryption keys in a single Spark application, it’s important to consider the file system caching behavior. We presented several practical solutions with their respective trade-offs. You can start implementing these solutions in your environment by first testing the file system cache-disable method, which provides a straightforward approach to handling multiple encryption keys. As your workload grows, consider evolving to separate Spark sessions or S3 bucket default encryption based on your specific requirements. After implementing a solution, verify that each S3 object’s SSE-KMS key is the intended one (for example, by checking S3 object metadata). We also recommend measuring job performance and S3 API usage, especially for the cache-disable approach.
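For example, the following is a minimal Boto3 sketch (bucket, prefix, and expected key ARN are placeholders) that checks the SSE-KMS key recorded in each output object’s metadata.

# Minimal verification sketch: confirm each object's SSE-KMS key (placeholders throughout)
import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-bucket1"
prefix = "folder1/"
expected_key_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        head = s3.head_object(Bucket=bucket, Key=obj["Key"])
        # SSE-KMS objects report ServerSideEncryption "aws:kms" and the key ARN in SSEKMSKeyId
        if head.get("SSEKMSKeyId") != expected_key_arn:
            print(f"Unexpected key for {obj['Key']}: {head.get('SSEKMSKeyId')}")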


About the authors

Pinxi Tai

Pinxi is a Hadoop Systems Engineer at AWS, specializing in big data technologies and Amazon EMR. He focuses on helping customers solve complex distributed computing challenges and is passionate about designing well-structured solutions for large-scale data processing. Outside of work, Pinxi enjoys swimming and football.