AWS Big Data Blog

Dive deep into security management: The Data on EKS Platform

The construction of big data applications based on open source software has become increasingly uncomplicated since the advent of projects like Data on EKS, an open source project from AWS to provide blueprints for building data and machine learning (ML) applications on Amazon Elastic Kubernetes Service (Amazon EKS). In the realm of big data, securing data on cloud applications is crucial. This post explores the deployment of Apache Ranger for permission management within the Hadoop ecosystem on Amazon EKS. We show how Ranger integrates with Hadoop components like Apache Hive, Spark, Trino, Yarn, and HDFS, providing secure and efficient data management in a cloud environment. Join us as we navigate these advanced security strategies in the context of Kubernetes and cloud computing.

Overview of solution

The Amber Group’s Data on EKS Platform (DEP) is a Kubernetes-based, cloud-centered big data platform that revolutionizes the way we handle data in EKS environments. Developed by Amber Group’s Data Team, DEP integrates with familiar components like Apache Hive, Spark, Flink, Trino, HDFS, and more, making it a versatile and comprehensive solution for data management and BI platforms.

The following diagram illustrates the solution architecture.

Effective permission management is crucial for several key reasons:

  • Enhanced security – With proper permission management, sensitive data is only accessible to authorized individuals, thereby safeguarding against unauthorized access and potential security breaches. This is especially important in industries handling large volumes of sensitive or personal data.
  • Operational efficiency – By defining clear user roles and permissions, organizations can streamline workflows and reduce administrative overhead. This system simplifies managing user access, saves time for data security administrators, and minimizes the risk of configuration errors.
  • Scalability and compliance – As businesses grow and evolve, a scalable permission management system helps with smoothly adjusting user roles and access rights. This adaptability is essential for maintaining compliance with various data privacy regulations like GDPR and HIPAA, making sure that the organization’s data practices are legally sound and up to date.
  • Addressing big data challenges – Big data comes with unique challenges, like managing large volumes of rapidly evolving data across multiple platforms. Effective permission management helps tackle these challenges by controlling how data is accessed and used, providing data integrity and minimizing the risk of data breaches.

Apache Ranger is a comprehensive framework designed for data governance and security in Hadoop ecosystems. It provides a centralized framework to define, administer, and manage security policies consistently across various Hadoop components. Ranger specializes in fine-grained access control, offering detailed management of user permissions and auditing capabilities.

Ranger’s architecture is designed to integrate smoothly with various big data tools such as Hadoop, Hive, HBase, and Spark. The key components of Ranger include:

  • Ranger Admin – This is the central component where all security policies are created and managed. It provides a web-based user interface for policy management and an API for programmatic configuration.
  • Ranger UserSync – This service is responsible for syncing user and group information from a directory service like LDAP or AD into Ranger.
  • Ranger plugins – These are installed on each component of the Hadoop ecosystem (like Hive and HBase). Plugins pull policies from the Ranger Admin service and enforce them locally.
  • Ranger Auditing – Ranger captures access audit logs and stores them for compliance and monitoring purposes. It can integrate with external tools for advanced analytics on these audit logs.
  • Ranger Key Management Store (KMS) – Ranger KMS provides encryption and key management, extending Hadoop’s HDFS Transparent Data Encryption (TDE).

The following flowchart illustrates the priority levels for matching policies.

chartflow

The priority levels are as follows:

  • Deny list takes precedence over allow list
  • Deny list exclude has a higher priority than deny list
  • Allow list exclude has a higher priority than allow list

Our Amazon EKS-based deployment includes the following components:

  • S3 buckets – We use Amazon Simple Storage Service (Amazon S3) for scalable and durable Hive data storage
  • MySQL database – The database stores Hive metadata, facilitating efficient metadata retrieval and management
  • EKS cluster – The cluster is comprised of three distinct node groups: platform, Hadoop, and Trino, each tailored for specific operational needs
  • Hadoop cluster applications – These applications include HDFS for distributed storage and YARN for managing cluster resources
  • Trino cluster application – This application enables us to run distributed SQL queries for analytics
  • Apache Ranger – Ranger serves as the central security management tool for access policy across the big data components
  • OpenLDAP – This is integrated as the LDAP service to provide a centralized user information repository, essential for user authentication and authorization
  • Other cloud services resources – Other resources include a dedicated VPC for network security and isolation

By the end of this deployment process, we will have realized the following benefits:

  • A high-performing, scalable big data platform that can handle complex data workflows with ease
  • Enhanced security through centralized management of authentication and authorization, provided by the integration of OpenLDAP and Apache Ranger
  • Cost-effective infrastructure management and operation, thanks to the containerized nature of services on Amazon EKS
  • Compliance with stringent data security and privacy regulations, due to Apache Ranger’s policy enforcement capabilities

Deploy a big data cluster on Amazon EKS and configure Ranger for access control

In this section, we outline the process of deploying a big data cluster on AWS EKS and configuring Ranger for access control. We use AWS CloudFormation templates for quick deployment of a big data environment on Amazon EKS with Apache Ranger.

Complete the following steps:

  1. Upload the provided template to AWS CloudFormation, configure the stack options, and launch the stack to automate the deployment of the entire infrastructure, including the EKS cluster and Apache Ranger integration.

    cloudformation

    After a few minutes, you’ll have a fully functional big data environment with robust security management ready for your analytical workloads, as shown in the following screenshot.

  2. On the AWS web console, find the name of your EKS cluster. In this case, it’s dep-demo-eks-cluster-ap-northeast-1. For example:
    aws eks update-kubeconfig --name dep-eks-cluster-ap-northeast-1 --region ap-northeast-1
    
    ## Check pod status.
    
    kubectl get pods --namespace hadoop
    
    kubectl get pods --namespace platform
    
    kubectl get pods --namespace trino

  3. After Ranger Admin is successfully forwarded to port 6080 of localhost, go to localhost:6080 in your browser.
  4. Log in with user name admin and the password you entered earlier.

By default, you have already created two policies: Hive and Trino, and granted all access to the LDAP user you created (depadmin in this case).

Also, the LDAP user sync service is set up and will automatically sync all users from the LDAP service created in this template.

Example permission configuration

In a practical application within a company, permissions for tables and fields in the data warehouse are divided based on business departments, isolating sensitive data for different business units. This provides data security and orderly conduct of daily business operations. The following screenshots show an example business configuration.

The following is an example of an Apache Ranger permission configuration.

The following screenshots show users associated with roles.

When performing data queries, using Hive and Spark as examples, we can demonstrate the comparison before and after permission configuration.

The following screenshot shows an example of Hive SQL (running on superset) with privileges denied.

The following screenshot shows an example of Spark SQL (running on IDE) with privileges denied.

The following screenshot shows an example of Spark SQL (running on IDE) with permissions permitting.

Based on this example and considering your enterprise requirements, it becomes feasible and flexible to manage permissions in the data warehouse effectively.

Conclusion

This post provided a comprehensive guide on permission management in big data, particularly within the Amazon EKS platform using Apache Ranger, that equips you with the essential knowledge and tools for robust data security and management. By implementing the strategies and understanding the components detailed in this post, you can effectively manage permissions, implementing data security and compliance in your big data environments.


About the Authors

Yuzhu Xiao is a Senior Data Development Engineer at Amber Group with extensive experience in cloud data platform architecture. He has many years of experience in AWS Cloud platform data architecture and development, primarily focusing on efficiency optimization and cost control of enterprise cloud architectures.

Xin Zhang is an AWS Solutions Architect, responsible for solution consulting and design based on the AWS Cloud platform. He has a rich experience in R&D and architecture practice in the fields of system architecture, data warehousing, and real-time computing.