Using BlueTalon with Amazon EMR

This is a guest post by Pratik Verma, Founder and Chief Product Officer at BlueTalon. Leonid Fedotov, Senior Solution Architect at BlueTalon, also contributed to this post.

Amazon Elastic MapReduce (Amazon EMR) makes it easy to quickly and cost-effectively process vast amounts of data in the cloud. EMR is used for log, financial, fraud, and bioinformatics analysis, as well as many other big data use cases. Often, the data used in these analyses, such as customer information, transaction history, and other proprietary data, is sensitive from a business perspective and may even be subject to regulatory compliance.

BlueTalon is a leading provider of data-centric security solutions for Hadoop, SQL, and Big Data environments on-premises and in the cloud. BlueTalon keeps enterprises in control of their data by allowing them to give users access to the data they need, not a byte more. The BlueTalon solution works across AWS data services such as EMR, Redshift, and RDS.

In this blog post, we show how organizations can use BlueTalon to mitigate the risks associated with their use of sensitive data while taking full advantage of EMR.

BlueTalon provides capabilities for data-centric security:

Audits of user activity using a context-rich trail of queries users run that hit sensitive fields.
Precise control over data that is specific to each user identity or business role and to the data resource at the file, folder, table, column, row, cell, or partial-cell level.
Secure use of business data in policy decisions for real-world requirements, while maintaining complex access scenarios and relationships between users and data.

Using BlueTalon to enforce data security

BlueTalon’s data-centric security solution has three main components: a UI to create rules and visualize the real-time audit trail, a Policy Engine to make fast run-time authorization decisions, and a collection of Enforcement Points that transparently enforce the decisions made by the Policy Engine.

In a typical Hadoop cluster, users specify computations using SQL queries in Hive, scripts in Pig, or MapReduce programs. For applications accessing data via Hive, the BlueTalon Hive enforcement point transparently proxies HiveServer2 at the network level and provides policy-protected data. The BlueTalon Policy Engine makes sophisticated, fine-grained policy decisions based on user and content criteria in-memory at run-time by re-engineering SQL requests for Hive. With the query modification technique, BlueTalon is able to ensure that end users get the same data, whether raw data is coming from local HDFS or Amazon S3, and that only policy-compliant data is pulled from storage by Hive.
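The query modification idea can be pictured with a toy sketch. This is not BlueTalon's implementation: a real policy engine parses the SQL, whereas the snippet below does a plain string substitution. The table ‘people’ appears later in this post; the column ‘zip’, the predicate, and the function name `rewrite_query` are hypothetical.

```shell
#!/usr/bin/env bash
# Toy illustration only: string substitution stands in for real SQL parsing.
# The column "zip" and the predicate are hypothetical examples.

# rewrite_query QUERY TABLE PREDICATE
# Replaces "FROM TABLE" with a filtered subquery that keeps the same alias,
# so the rest of the query is unchanged but only sees permitted rows.
rewrite_query() {
  local query=$1 table=$2 predicate=$3
  printf '%s\n' "${query/FROM $table/FROM (SELECT * FROM $table WHERE $predicate) $table}"
}

rewrite_query "SELECT name FROM people" people "zip LIKE '9%'"
```

Because the filter is pushed into the query itself, the restricted rows never leave storage, and the filter column does not have to appear in the result set.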

For direct HDFS access, end users connect to the BlueTalon HDFS enforcement point, which transparently proxies the HDFS NameNode at the network level and returns policy-protected data. The BlueTalon Policy Engine makes policy decisions based on user and content criteria in-memory at run-time to provide folder-level and file-level control on HDFS. With the enforcement point for HDFS, BlueTalon ensures that end users can’t get around its security by going to HDFS to obtain data not accessible via Hive.

Using enforcement points, BlueTalon provides the following access controls for your data:

Field protection: Fields can be denied without breaking the application. As an example, a blank value compatible with the id field is returned instead of revealing the id values as they are stored on disk.

Record protection: The result set can be filtered to return a subset of the data, even when the field used in the filter criteria is not in the result set. In this example, the user is able to see only the 2 records with the East Coast zip codes, compared to 10 records on disk.

Cell protection: A specific field value for a specific record can be protected. In this example, the user is able to see the birthdate value for ‘Joyce McDonald’ but not ‘Kelly Adams’. Here as well, the date field is compatible with the format expected by the application.

Partial cell protection: Even portions of a cell may be protected. In this example, the user sees the last four digits of a Social Security number, rather than the number being hidden entirely.
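To make the four protections concrete, here is a small sketch that simulates their effect on a few made-up CSV rows with awk. It only imitates the results described above and says nothing about how BlueTalon produces them. The two names come from the examples in this post; the third row, all ids, dates, zip codes, and Social Security numbers, the "zip starts with 0 or 1" rule, and the function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: simulates the effect of the four protections with awk.
# Columns: id,name,birthdate,zip,ssn. All values except the two names from
# the post are invented sample data.

sample_rows() {
  cat <<'EOF'
101,Joyce McDonald,1975-03-14,10001,123-45-6789
102,Kelly Adams,1982-11-02,02108,987-65-4321
103,Sam Lee,1990-07-21,94105,555-12-3456
EOF
}

protect() {
  awk -F, -v OFS=, '
    $4 !~ /^[01]/       { next }                        # record protection: hypothetical East Coast zip rule
                        { $1 = "" }                     # field protection: id returned as a blank value
    $2 == "Kelly Adams" { $3 = "" }                     # cell protection: one birthdate hidden
                        { $5 = "XXX-XX-" substr($5, 8) } # partial cell: keep last four digits
                        { print $1, $2, $3, $5 }        # zip drives the filter but is not returned
  '
}

sample_rows | protect
```

The third row is dropped by the record filter even though the zip column never appears in the output, which mirrors the record protection case described above.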

The BlueTalon Policy Engine integrates with Active Directory for authenticating end-user credentials and mapping identities to business roles. It enforces authorization so that Hive provides only policy-compliant data to end users.

Deploying BlueTalon with Amazon EMR

In the following sections, you’ll learn how to deploy BlueTalon with EMR and configure the policies. A typical deployment looks like the following:

Prerequisites

You need an evaluation copy of BlueTalon (contact sales@bluetalon.com to obtain one), an Amazon EC2 Linux instance for installing BlueTalon, and an Amazon EMR cluster in the same VPC. BlueTalon recommends using an m3.large instance with CentOS.

To integrate BlueTalon with a directory, you can use a pre-existing directory in your VPC or launch a new Simple AD using AWS Directory Service. For more information, see Tutorial: Creating a Simple AD Directory.

Install the packages

On the EC2 instance, install the BlueTalon Policy Engine and Audit packages, available as rpm packages, using the yum commands:

> yum search bluetalon

bluetalon-audit.x86_64 : BlueTalon data security for Hadoop.
bluetalon-enforcementpoint.x86_64 : BlueTalon data security for Hadoop.
bluetalon-policy.x86_64 : BlueTalon data security for Hadoop.

> yum install bluetalon-audit -y

> yum install bluetalon-policy -y

Run the setup script

After the BlueTalon packages are installed, run the setup script to configure and turn on the run-time services and UI associated with the two packages.

> bluetalon-audit-setup

Starting bt-audit-server service:                          [  OK  ]
Starting bt-audit-zookeeper service:                       [  OK  ]
Starting bt-audit-kafka service:                           [  OK  ]
Starting bt-audit-activity-monitor service:                [  OK  ]

  BlueTalon Audit Product is installed....
  URL to access BlueTalon Audit UI
  ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8112/BlueTalonAudit

  Default Username : btadminuser
  Default Password : P@ssw0rd


> bluetalon-policy-setup

Starting bt-postgresql service:                            [  OK  ]
Starting bt-policy-engine service:                         [  OK  ]
Starting bt-sql-hooks-vds service:                         [  OK  ]
Starting bt-webserver service:                             [  OK  ]
Starting bt-HeartBeatService service:                      [  OK  ]

  BlueTalon Data Security Product for Hadoop is installed....
  You can create rules using the BlueTalon Policy UI
  URL to access BlueTalon Policy UI
  ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8111/BlueTalonConfig

  Default Username : btadminuser
  Default Password : P@ssw0rd
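Before opening a browser, you may want to confirm that both UIs respond. The sketch below builds the two addresses printed by the setup scripts and asks curl for the HTTP status code. The use of plain http, the `BT_HOST` variable, and the function names are assumptions; the ports and paths are the ones shown in the output above.

```shell
#!/usr/bin/env bash
# ui_url HOST PORT PATH: builds a UI address. Plain http is an assumption.
ui_url() {
  printf 'http://%s:%s/%s\n' "$1" "$2" "$3"
}

# http_status URL: prints the HTTP status code (curl prints 000 if the request fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Runs only when BT_HOST is set, e.g. BT_HOST=<public DNS of the EC2 instance>.
if [ -n "${BT_HOST:-}" ]; then
  for target in 8112/BlueTalonAudit 8111/BlueTalonConfig; do
    url=$(ui_url "$BT_HOST" "${target%%/*}" "${target#*/}")
    echo "$url -> $(http_status "$url")"
  done
fi
```

If a status of 000 comes back, check that the instance's security group allows inbound traffic on ports 8111 and 8112 from your address.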

Connecting to the BlueTalon UI

After the run-time services associated with the BlueTalon packages have started, you should be able to connect to the BlueTalon Policy Management and User Audit interfaces as displayed below.

Installing enforcement points

Install and configure the BlueTalon enforcement point packages for Hive and HDFS NameNode on the master node of the EMR cluster using the following commands:

> yum install bluetalon-enforcementpoint -y
> bluetalon-enforcementpoint-setup Hive 10011 HiveDS

Starting bt-enforcement-point-demods service:              [  OK  ]

The arguments to this command include:

Hive: The type of enforcement point to configure. Options include Hive, HDFS, and PostgreSQL.

10011: The port on which the enforcement point listens.

HiveDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

This command configures a Hive enforcement point for the local HiveServer2 and creates an iptables entry to re-route HiveServer2 traffic to the BlueTalon enforcement point first.
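If you want to look at the iptables entry that the setup command created, one option is to list the rules and filter for the HiveServer2 port. This is a sketch under assumptions: it guesses that the rule lives in the nat table, and the helper name `rules_for_port` and the `CHECK_IPTABLES` switch are made up for this example. The exact form of the rule BlueTalon writes is not documented here.

```shell
#!/usr/bin/env bash
# rules_for_port PORT: reads rule listings (as printed by `iptables -S`) on stdin
# and prints only the rules that match destination port PORT.
rules_for_port() {
  grep -E -- "--dport $1( |\$)"
}

# Runs only when CHECK_IPTABLES is set, since it needs root on the master node.
# Assumption: the re-routing rule lives in the nat table.
if [ -n "${CHECK_IPTABLES:-}" ]; then
  sudo iptables -t nat -S | rules_for_port 10000
fi
```

Run it on the master node as `CHECK_IPTABLES=1 bash check_rules.sh` (the file name is arbitrary). No output means no rule in that table matches port 10000.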

The following command restarts NameNode with the embedded BlueTalon enforcement point process:

> bluetalon-enforcementpoint-setup HDFS
Stopping NameNode process: [ OK ]
Starting NameNode process: [ OK ]

Adding data domains

Open the BlueTalon Policy Management UI using a browser and add Hive and HDFS as data domains so that BlueTalon can look up the data resources (databases, tables, columns, folders, files, etc.) to create data access rules. This requires connectivity information for HiveServer2 and NameNode.

For HiveServer2:

default: Database name associated with Hive warehouse. Typically, ‘default’.

10.0.0.1: Hostname of the machine where HiveServer2 is running. Typically, the DNS of the master node in Amazon EMR.

10000: Port that HiveServer2 is listening on. Typically, ‘10000’.

10011: Port on which the enforcement point listens. Typically, ‘10011’.

HiveDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

No Login: Credentials for connecting to HiveServer2 if required.

For HDFS:

10.0.0.1: Hostname of the machine where NameNode is running.

8020: Port on which NameNode is listening. Typically, ‘8020’.

HDFSDS: The name of the data domain in the BlueTalon UI to associate with this enforcement point.

Adding user domains

Using the BlueTalon Policy Management UI, add the directory as a user domain so that BlueTalon can authenticate user credentials and look up the business roles to which a user belongs. For more information about obtaining connectivity information, see Viewing Directory Information.

10.0.0.1: Hostname of the machine where Active Directory is running.

389: Port on which Active Directory is listening. Typically, ‘389’.

10011: Port that the Enforcement Point listens on. Typically, ‘10011’.

CN=hadoopadmin: Credentials for bind and query to Active Directory.

Creating rules for specifying data access

Using the BlueTalon UI, you can create rules specifying which users can access what data. This can be done using the Add Rule button on the Policy tab to open a tray. Two examples are shown below.

On the left is an example of a row-level rule that restricts access for user ‘admin1’ to records in the ‘people’ table for locations in West Coast zip codes only. On the right is an example of a masking rule on a sensitive field, ‘accounts.ssn’, which masks it completely.

Deploying policies

After the rules are created, deploy the policy to the BlueTalon Policy Engine using the Deploy button from the Deploy tab. After it’s deployed, the policy and rules become effective on the Policy Engine.

The screenshots below show BlueTalon data protection in action, using queries made through the ‘beeline’ client.

With BlueTalon

beeline> !connect jdbc:hive2://<hostname of masternode>:10011/default

Without BlueTalon

beeline> !connect jdbc:hive2://<hostname of masternode>:10000/default

With BlueTalon row-level protection, the record count is 249.

Without BlueTalon, the record count is 2499.

With BlueTalon protection, field ssn is masked with ‘XXXX’.
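The same comparison can be scripted instead of typed interactively. The sketch below runs one count query through each port using beeline's non-interactive options (`-u`, `-n`, `-e`). The ‘people’ table and the user ‘admin1’ come from the rule example earlier in this post; the `MASTER_HOST` variable and the `jdbc_url` helper are assumptions, and you may need to add `-p` with a password depending on how authentication is configured.

```shell
#!/usr/bin/env bash
# jdbc_url HOST PORT [DATABASE]: builds a HiveServer2 JDBC URL (database defaults to "default").
jdbc_url() {
  printf 'jdbc:hive2://%s:%s/%s\n' "$1" "$2" "${3:-default}"
}

# Runs only when MASTER_HOST is set to the hostname of the EMR master node.
# Port 10011 goes through the BlueTalon enforcement point; 10000 is HiveServer2 directly.
if [ -n "${MASTER_HOST:-}" ]; then
  for port in 10011 10000; do
    echo "Port ${port}:"
    beeline -u "$(jdbc_url "$MASTER_HOST" "$port")" -n admin1 \
      -e 'SELECT COUNT(*) FROM people;'
  done
fi
```

The two counts should differ whenever a row-level rule applies to the connecting user.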

Auditing access

All access through the BlueTalon enforcement points is authorized against the BlueTalon Policy Engine and audited. The audit can be visualized in the BlueTalon User Audit UI.

Try BlueTalon with data available from AWS

Generate sample data with table ‘books_1’ using instructions from http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/query-impala-generate-data.html

Create a policy for a user ‘alice’ that allows reads on table ‘books_1’ only for books with ‘price’ less than $30.00, masks the field ‘publisher’, and denies the ‘id’ field completely.

Run the query directly and through BlueTalon to see the effect of the policy rules created in BlueTalon.

Data as stored in Hive:

Result with BlueTalon protection:

Conclusion

BlueTalon enables organizations to efficiently protect access to data in HDFS or Amazon S3, give users the data they need, and leverage the full potential of Hadoop in a secure manner.

If you have questions or suggestions, please leave a comment below.
