AWS for Industries

FSI Services Spotlight: Featuring Amazon EMR (Amazon EMR)

Welcome back to the Financial Services Industry (FSI) Service Spotlight monthly blog series. Each month we look at five key considerations that FSI customers should focus on to help streamline cloud service approval for one particular service. Each of the five key considerations includes specific guidance, suggested reference architectures, and technical code that can be used to streamline service approval for the featured service. This guidance should be adapted to suit your own specific use case and environment.

This month we are covering Amazon EMR. Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Amazon EMR Use Cases in the Financial Services Industry

Bridgewater Associates is the world’s largest hedge fund. Their core mission is to understand how the world works by analyzing the drivers of markets and turning that understanding into high-quality portfolios and investment advice for their clients. Bridgewater uses Amazon EMR in three separate use cases. EMR on Amazon Elastic Kubernetes Service (Amazon EKS) for Spark is used for running distributed large scale financial simulations on top of Peta-scale S3. EMR Trino is used as part of a modernization effort to augment their batch workflows with more real-time analysis of data. Bridgewater has also adopted EMR HBase as part of data intake from external sources which has led to improved durability as HBase on Amazon S3 is used to store the cluster’s HBase root directory and metadata.

Achieving Compliance with Amazon EMR

Amazon EMR is an AWS managed service and third-party auditors regularly assess the security and compliance of it as part of multiple AWS compliance programs. As part of the AWS shared responsibility model, Amazon EMR is in the scope of the following compliance programs.

  • SOC 1,2,3
  • PCI
  • CSA STAR CCM v3.0.1
  • ISO/IEC 27001:2013, 27017:2015, 27018:2019, 27701:2019, 22301:2019, 9001:2015, and CSA STAR CCM v4.0
  • ISMAP
  • FedRAMP (Moderate and High)
  • DoD CC SRG (IL2-IL6)
  • HIPAA
  • IRAP
  • MTCS (Regions: US-East, US-West, Singapore, Seoul)
  • C5
  • K-ISMS
  • ENS High
  • HITRUST CSF
  • FINMA
  • GSMA (Regions: US-East (Ohio) and Europe (Paris))
  • PiTuKri
  • CCCS Medium
  • IAR
  • DESC CSP

You can obtain corresponding compliance reports under an AWS non-disclosure agreement (NDA) through AWS Artifact. It is important to understand that Amazon EMR compliance status does not automatically apply to applications that you run in the AWS Cloud. You need to ensure that your use of AWS services complies with the standards.

Your scope of the shared responsibility model when using Amazon EMR is determined by the sensitivity of your data, your organization’s compliance objectives, and applicable laws and regulations. If your use of Amazon EMR is subject to compliance with standards like HIPAA, PCI, or FedRAMP, AWS provides resources to help.

Data Protection with Amazon EMR

At AWS, we recommend that encryption is applied to complement other access controls that are already in place. Data protection refers to protecting data while in-transit and at rest. You can protect data in transit using Secure Socket Layer/Transport Layer Security (SSL/TLS) or client-side encryption.

With Amazon EMR versions 4.8.0 and later, you can use a security configuration to specify settings for encrypting data at rest, data in transit, or both. When you enable at-rest data encryption, you can choose to encrypt EMRFS data in Amazon S3, data in local disks, or both. Each security configuration that you create is stored in Amazon EMR rather than in the cluster configuration, so you can easily reuse a configuration to specify data encryption settings whenever you create a cluster. For more information, see Create a security configuration.

The following diagram shows the different data encryption options available with security configurations.

The following encryption options are also available and are not configured using a security configuration:

Beginning with Amazon EMR version 5.24.0, you can use a security configuration option to encrypt EBS root device and storage volumes when you specify AWS Key Management Service (AWS KMS) as your key provider. For more information, see Local disk encryption.

Data encryption requires keys and certificates. A security configuration gives you the flexibility to choose from several options, including keys managed by AWS Key Management Service, keys managed by Amazon S3, and keys and certificates from custom providers that you supply. When using AWS KMS as your key provider, charges apply for the storage and use of encryption keys. For more information, see AWS KMS pricing.

Before you specify encryption options, decide on the key and certificate management systems you want to use, so you can first create the keys and certificates or the custom providers that you specify as part of encryption settings.

Encryption at rest for EMRFS data in Amazon S3

Amazon S3 encryption works with EMR File System (EMRFS) objects read from and written to Amazon S3. You specify Amazon S3 server-side encryption (SSE) or client-side encryption (CSE) as the Default encryption mode when you enable encryption at rest. Optionally, you can specify different encryption methods for individual buckets using Per bucket encryption overrides. Regardless of whether Amazon S3 encryption is enabled, Transport Layer Security (TLS) encrypts the EMRFS objects in transit between EMR cluster nodes and Amazon S3. For in-depth information about Amazon S3 encryption, see Protecting data using encryption in the Amazon Simple Storage Service User Guide.

Amazon S3 server-side encryption

When you set up Amazon S3 server-side encryption, Amazon S3 encrypts data at the object level as it writes the data to disk and decrypts the data when it is accessed. For more information about SSE, see Protecting data using server-side encryption in the Amazon Simple Storage Service User Guide.

You can choose between two different key management systems when you specify SSE in Amazon EMR:

  • SSE-S3 – Amazon S3 manages keys for you.
  • SSE-KMS – You use an AWS KMS key to set up with policies suitable for Amazon EMR. For more information about key requirements for Amazon EMR, see Using AWS KMS keys for encryption.
  • SSE with customer-provided keys (SSE-C) is not available for use with Amazon EMR.

Amazon S3 client-side encryption

With Amazon S3 client-side encryption, the Amazon S3 encryption and decryption takes place in the EMRFS client on your cluster. Objects are encrypted before being uploaded to Amazon S3 and decrypted after they are downloaded. The provider you specify supplies the encryption key that the client uses. The client can use keys provided by AWS KMS (CSE-KMS) or a custom Java class that provides the client-side root key (CSE-C). The encryption specifics are slightly different between CSE-KMS and CSE-C, depending on the specified provider and the metadata of the object being decrypted or encrypted. For more information about these differences, see Protecting data using client-side encryption in the Amazon Simple Storage Service User Guide.

Amazon S3 CSE only ensures that EMRFS data exchanged with Amazon S3 is encrypted; not all data on cluster instance volumes is encrypted. Furthermore, because Hue does not use EMRFS, objects that the Hue S3 File Browser writes to Amazon S3 are not encrypted.

Local disk encryption

The following mechanisms work together to encrypt local disks when you enable local disk encryption using an Amazon EMR security configuration.

Open-source HDFS encryption

HDFS exchanges data between cluster instances during distributed processing. It also reads from and writes data to instance store volumes and the EBS volumes attached to instances. The following open-source Hadoop encryption options are activated when you enable local disk encryption:

  • Secure Hadoop RPC is set to Privacy, which uses Simple Authentication Security Layer (SASL).
  • Data encryption on HDFS block data transfer is set to true and is configured to use AES 256 encryption.

You can activate additional Apache Hadoop encryption by enabling in-transit encryption. For more information, see Encryption in transit. These encryption settings do not activate HDFS transparent encryption, which you can configure manually. For more information, see Transparent encryption in HDFS on Amazon EMR.

Instance store encryption

For EC2 instance types that use NVMe-based SSDs as the instance store volume, NVMe encryption is used regardless of Amazon EMR encryption settings. For more information, see NVMe SSD volumes in the Amazon EC2 User Guide for Linux Instances. For other instance store volumes, Amazon EMR uses LUKS to encrypt the instance store volume when local disk encryption is enabled regardless of whether EBS volumes are encrypted using EBS encryption or LUKS.

EBS volume encryption

If you create a cluster in an AWS Region where Amazon EC2 encryption of EBS volumes is enabled by default for your account, EBS volumes are encrypted even if local disk encryption is not enabled. For more information, see Encryption by default in the Amazon EC2 User Guide for Linux Instances. With local disk encryption enabled in a security configuration, the Amazon EMR settings take precedence over the Amazon EC2 encryption-by-default settings for cluster EC2 instances.

The following options are available to encrypt EBS volumes using a security configuration:

  • EBS encryption – Beginning with Amazon EMR version 5.24.0, you can choose to enable EBS encryption. The EBS encryption option encrypts the EBS root device volume and attached storage volumes. The EBS encryption option is available only when you specify AWS Key Management Service as your key provider. We recommend using EBS encryption.
  • LUKS encryption – If you choose to use LUKS encryption for Amazon EBS volumes, the LUKS encryption applies only to attached storage volumes, not to the root device volume. For more information about LUKS encryption, see the LUKS on-disk specification.

For your key provider, you can set up an AWS KMS key with policies suitable for Amazon EMR, or a custom Java class that provides the encryption artifacts. When you use AWS KMS, charges apply for the storage and use of encryption keys.

To check if EBS encryption is enabled on your cluster, it is recommended that you use DescribeVolumes API call. Running lsblk on the cluster only checks the status of LUKS encryption, not of EBS encryption.

Isolation of environments with Amazon EMR

As a managed service, Amazon EMR is protected by the AWS global network security procedures that are described in the Amazon Web Services: Overview of security processes whitepaper.

You use the AWS API calls to access Amazon EMR through the network. Clients must support Transport Layer Security (TLS) 1.2. Clients must also support cipher suites with perfect forward secrecy (PFS) such as Ephemeral Diffie-Hellman (DHE) or Elliptic Curve Ephemeral Diffie-Hellman (ECDHE). Most modern systems such as Java 7 and later support these modes.

Additionally, requests must be signed by using an access key ID and a secret access key that is associated with an IAM principal. Preferably, you can use the AWS Security Token Service (AWS STS) to generate temporary security credentials to sign requests.

Connect to Amazon EMR using an interface VPC endpoint

You can connect directly to Amazon EMR using an interface VPC endpoint (AWS PrivateLink) in your Virtual Private Cloud (VPC) instead of connecting over the internet. When you use an interface VPC endpoint, communication between your VPC and Amazon EMR is conducted entirely within the AWS network. Each VPC endpoint is represented by one or more Elastic network interfaces (ENIs) with private IP addresses in your VPC subnets.

The interface VPC endpoint connects your VPC directly to Amazon EMR without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. The instances in your VPC don’t need public IP addresses to communicate with the Amazon EMR API.

To use Amazon EMR through your VPC, you must connect from an instance that is inside the VPC or connect your private network to your VPC by using an Amazon Virtual Private Network (VPN) or AWS Direct Connect. For information see VPN connections and Creating a connection in the AWS.

You can create an interface VPC endpoint to connect to Amazon EMR using the AWS console or AWS Command Line Interface (AWS CLI) commands. After you create an interface VPC endpoint, if you enable private DNS hostnames for the endpoint, the default Amazon EMR endpoint resolves to your VPC endpoint. The default service name endpoint for Amazon EMR is in the following format: elasticmapreduce.Region.amazonaws.com. If you do not enable private DNS hostnames, Amazon VPC provides a DNS endpoint name that you can use in the following format: VPC_Endpoint_ID.elasticmapreduce.Region.vpce.amazonaws.com.

Amazon EMR supports making calls to all of its API actions inside your VPC. You can attach VPC endpoint policies to a VPC endpoint to control access for IAM principals. You can also associate security groups with a VPC endpoint to control inbound and outbound access based on the origin and destination of network traffic, such as a range of IP addresses. For more information, see Controlling access to services with VPC endpoints.

Automating audits with APIs with Amazon EMR

There are different ways to gain insights into Amazon EMR activity in your AWS accounts.

Amazon EMR is integrated with AWS CloudTrail, a service that records user activity and API calls on resources within your AWS accounts. CloudTrail captures a subset of API calls for Amazon EMR as events, including calls from the Amazon EMR console and code calls to the Amazon EMR APIs. If you create a trail, you can enable continuous delivery of CloudTrail events to an Amazon S3 bucket, including events for Amazon EMR. If you don’t configure a trail, you can still view the most recent events in the CloudTrail console in Event history. CloudTrail logging provides identity information for events like:

  • Whether the request was made with root or IAM user credentials
  • Whether the request was made with temporary security credentials for a role or federated user
  • Whether the request was made by another AWS service

Operational access and security with Amazon EMR

Operational access to Amazon EMR Clusters can be controlled either by EC2 key pairs for SSH credentials or using Kerberos authentication.

Amazon EMR cluster nodes run on Amazon EC2 instances. You can connect to cluster nodes in the same way that you can connect to Amazon EC2 instances. You can use Amazon EC2 to create a key pair, or you can import a key pair. When you create a cluster, you can specify the Amazon EC2 key pair that will be used for SSH connections to all cluster instances. You can also create a cluster without a key pair. This is usually done with transient clusters that start, run steps, and then terminate automatically.

The SSH client that you use to connect to the cluster needs to use the private key file associated with this key pair. Suppose this is a .pem file for SSH clients using Linux, Unix and macOS. You must set permissions so that only the key owner has permission to access the file. On Windows, you might need to Convert your private key using PuTTYgen.

Amazon EMR release version 5.10.0 and later supports Kerberos, which is a network authentication protocol created by the Massachusetts Institute of Technology (MIT). Kerberos uses secret-key cryptography to provide strong authentication so that passwords or other credentials aren’t sent over the network in an unencrypted format.

In Kerberos, services and users that need to authenticate are known as principals. Principals exist within a Kerberos realm. Within the realm, a Kerberos server known as the key distribution center (KDC) provides the means for principals to authenticate. The KDC does this by issuing tickets for authentication. The KDC maintains a database of the principals within its realm, their passwords, and other administrative information about each principal. A KDC can also accept authentication credentials from principals in other realms, which is known as a cross-realm trust. In addition, an EMR cluster can use an external KDC to authenticate principals.

A common scenario for establishing a cross-realm trust or using an external KDC is to authenticate users from an LDAP domain (such as Active Directory). This allows users to access an EMR cluster with their domain account when they use SSH to connect to a cluster or work with big data applications. When you use Kerberos authentication, Amazon EMR configures Kerberos for the applications, components, and subsystems that it installs on the cluster so that they are authenticated with each other. More details can be found in the EMR Documentation.

Conclusion

In this post, we reviewed Amazon EMR and highlighted key information that can help FSI customers accelerate the approval of the service within these five categories: achieving compliance, data protection, isolation of compute environments, automating audits with APIs, and operational access and security. While not a one-size-fits-all approach, the guidance can be adapted to meet your organization’s security and compliance requirements and provide a consolidated list of key areas for Amazon EMR.

In the meantime, be sure to visit our AWS Financial Services Industry channel and stay tuned for more financial services news and best practices!

Peter Sideris

Peter Sideris

Peter Sideris is a Sr. Technical Account Manager at AWS. He works with some of our largest and most complex customers to ensure their success in the AWS Cloud. Peter enjoys his family, marine reef keeping, and volunteers his time to the Boy Scouts of America in several capacities.