How do I set up an SSL connection between Hive on Amazon EMR and a metastore on Amazon RDS MySQL?

Last updated: 2020-09-04

I want to set up an SSL connection between Apache Hive and a metastore that's on an Amazon Relational Database Service (Amazon RDS) MySQL DB instance. How can I do that?

Short description

Set up an encrypted connection between Hive and an external metastore, using an SSL certificate. You can set up this connection when you launch a new Amazon EMR cluster or after the cluster is running.

Resolution

Note: The following steps were tested with Amazon EMR release version 5.18.0 and Amazon RDS MySQL version 5.7.

Set up the SSL connection on a new Amazon EMR cluster

1.     Run a command similar to the following to create an Amazon RDS MySQL DB instance. Replace $RDS_MASTER_USER_NAME and $RDS_MASTER_PASSWORD with your user name and password. For more information, see create-db-instance.

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, be sure that you’re using the most recent version of the CLI.

aws rds create-db-instance --db-name hive --db-instance-identifier mysql-hive-meta --db-instance-class db.t2.micro \
    --engine mysql --engine-version 5.7.19 --master-username $RDS_MASTER_USER_NAME --master-user-password $RDS_MASTER_PASSWORD \
    --allocated-storage 20 --storage-type gp2 --vpc-security-group-ids $RDS_VPC_SG --publicly-accessible
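
The DB instance takes a few minutes to become available. You can wait for it and then look up its endpoint with commands similar to the following:

aws rds wait db-instance-available --db-instance-identifier mysql-hive-meta
aws rds describe-db-instances --db-instance-identifier mysql-hive-meta \
    --query 'DBInstances[0].Endpoint.Address' --output text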

2.     Connect to the Amazon RDS MySQL DB instance as the master user. Then, create a user for the Hive metastore, as shown in the following example.

Important: Be sure that you restrict access for this user to the DB instance that you created in step 1.

mysql -h mysql-hive-meta.########.us-east-1.rds.amazonaws.com -P 3306 -u $RDS_MASTER_USER_NAME -p
Enter password: $RDS_MASTER_PASSWORD

CREATE USER 'hive_user'@'%' IDENTIFIED BY 'hive_user_password';
REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive_user'@'%';
GRANT ALL PRIVILEGES ON hive.* TO 'hive_user'@'%' REQUIRE SSL;
FLUSH PRIVILEGES;
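
Optionally, confirm that the new account requires SSL. A quick check while you're still connected as the master user (MySQL 5.7):

SHOW CREATE USER 'hive_user'@'%';
-- The output should include the REQUIRE SSL clause. Alternatively:
SELECT user, host, ssl_type FROM mysql.user WHERE user = 'hive_user';
-- ssl_type should be ANY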

3.     Create a JSON configuration file similar to the following. Replace hive_user and hive_user_password with the values that you used in the SQL statements in step 2. Replace the endpoint in the JDBC URL with the endpoint for your RDS DB instance.

You will use this file to launch the Amazon EMR cluster in the next step. The file enables an SSL connection to the RDS DB instance. For more information, see Using SSL with a MySQL DB instance.

[
    {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": "jdbc:mysql://mysql-hive-meta.########.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true&useSSL=true&serverSslCert=/home/hadoop/rds-combined-ca-bundle.pem",
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "hive_user",
            "javax.jdo.option.ConnectionPassword": "hive_user_password"
        }
    }
]
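
Before you launch the cluster, you can optionally confirm that the file parses as valid JSON. For example, assuming that you saved the file as hive-ext-meta-mysql-ssl.json (the name used in the create-cluster command in step 5):

python -m json.tool hive-ext-meta-mysql-ssl.json > /dev/null && echo "Valid JSON"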

4.     In the security group that's associated with the Amazon RDS MySQL instance, create an inbound rule with the following parameters.
For Type, choose MYSQL/Aurora (3306).
For Protocol, TCP (6) is selected by default.
For Port Range, 3306 is selected by default.
For Source, enter the Group ID of the Amazon EMR-managed security group that's associated with the master node.

This rule allows the Amazon EMR cluster's master node to access the Amazon RDS instance. For more information, see VPC security groups.
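
If you prefer the AWS CLI, a command similar to the following adds an equivalent inbound rule. Here, $RDS_VPC_SG is the security group attached to the RDS DB instance (from step 1), and $EMR_MASTER_VPC_SG is the EMR-managed master security group that you plan to use in step 5:

aws ec2 authorize-security-group-ingress \
    --group-id $RDS_VPC_SG \
    --protocol tcp \
    --port 3306 \
    --source-group $EMR_MASTER_VPC_SG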

5.     Run the create-cluster command to launch an Amazon EMR cluster using the JSON file from step 3, along with a bootstrap action that downloads the SSL certificate to /home/hadoop/ on the master node. Example:

aws emr create-cluster --applications Name=Hadoop Name=Hive --tags Name="EMR Hive Metastore" \
    --ec2-attributes KeyName=$EC2_KEY_PAIR,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$EMR_SUBNET,EmrManagedSlaveSecurityGroup=$EMR_CORE_AND_TASK_VPC_SG,EmrManagedMasterSecurityGroup=$EMR_MASTER_VPC_SG \
    --service-role EMR_DefaultRole --release-label emr-5.18.0 --log-uri s3://awsdoc-example-bucket/emr-logs-path/ --name "Hive External Metastore RDS MySQL w/ SSL" \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge,Name="Master - 1" \
    --configurations file://hive-ext-meta-mysql-ssl.json \
    --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","cd /home/hadoop && wget -S -T 10 -t 5 https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem"]
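
The cluster takes several minutes to start. You can poll its state with a command similar to the following, replacing j-XXXXXXXXXXXXX with the cluster ID that create-cluster returned. Connect after the state is WAITING:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.State' --output text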

6.     Connect to the master node using SSH.
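
For example, assuming that the private key file for $EC2_KEY_PAIR is available on your local machine and that the master security group allows SSH from your IP address:

ssh -i /path/to/$EC2_KEY_PAIR.pem hadoop@<master-public-dns-name>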

7.     Open a hive session on the master node. Then, create a table. It can be any table—this is for testing purposes only. Example:

hive> create table tb_test (col1 STRING, col2 BIGINT);
OK
Time taken: 2.371 seconds

hive> describe tb_test;
OK
col1                    string
col2                    bigint
Time taken: 0.254 seconds, Fetched: 2 row(s)

8.     Connect to the RDS MySQL metastore using the mysql client on the master node. Then, verify the table metadata in the metastore. If the metadata corresponds to the table that you created in the previous step, the SSL connection is working. Example:

mysql -h mysql-hive-meta.########.us-east-1.rds.amazonaws.com -P 3306 -u $RDS_MASTER_USER_NAME -p
Enter password: $RDS_MASTER_PASSWORD

mysql> use hive;
Database changed

mysql> select t1.OWNER, t1.TBL_NAME, t1.TBL_TYPE, s1.INPUT_FORMAT, s1.OUTPUT_FORMAT, s1.LOCATION from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID where t1.TBL_NAME = 'tb_test'\G
*************************** 1. row ***************************
        OWNER: hadoop
     TBL_NAME: tb_test
     TBL_TYPE: MANAGED_TABLE
 INPUT_FORMAT: org.apache.hadoop.mapred.TextInputFormat
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
     LOCATION: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hive/warehouse/tb_test
1 row in set (0.23 sec)

mysql> select t1.OWNER, t1.TBL_NAME, c1.COLUMN_NAME, c1.TYPE_NAME from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID inner join COLUMNS_V2 c1 on c1.CD_ID = s1.CD_ID where t1.TBL_NAME = 'tb_test';
+--------+----------+-------------+-----------+
| OWNER  | TBL_NAME | COLUMN_NAME | TYPE_NAME |
+--------+----------+-------------+-----------+
| hadoop | tb_test  | col1        | string    |
| hadoop | tb_test  | col2        | bigint    |
+--------+----------+-------------+-----------+
2 rows in set (0.22 sec)
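
You can also confirm from the MySQL side that Hive's metastore connections are encrypted. A quick check, assuming MySQL 5.7 with the Performance Schema enabled (CONNECTION_TYPE shows SSL/TLS for encrypted sessions):

mysql> select PROCESSLIST_USER, PROCESSLIST_HOST, CONNECTION_TYPE from performance_schema.threads where PROCESSLIST_USER = 'hive_user';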

Set up the SSL connection on a running Amazon EMR cluster

Note: The following steps assume that you already have an Amazon RDS MySQL DB instance and a metastore user that requires SSL (see steps 1 and 2 in the previous section).

1.     Connect to the master node using SSH.

2.     Run the following command to download the SSL certificate to /home/hadoop/ on the master node:

cd /home/hadoop && wget -S -T 10 -t 5 https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
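
Optionally, inspect the downloaded bundle to confirm that it's a valid PEM file. Note that openssl prints only the first certificate in the bundle:

openssl x509 -in /home/hadoop/rds-combined-ca-bundle.pem -noout -subject -dates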

3.     In the /etc/hive/conf.dist directory, add or edit the following lines in the hive-site.xml file. This enables an SSL connection to the RDS DB instance. In the following example, replace the endpoint in the JDBC URL with the endpoint for your RDS DB instance.

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql-hive-meta.########.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=true&amp;serverSslCert=/home/hadoop/rds-combined-ca-bundle.pem</value>
    <description>JDBC URL for the metastore database</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive_user</value>
    <description>User name for the metastore database</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_user_password</value>
    <description>Password for the metastore database</description>
</property>

Important: The ampersand (&) is a special character in XML. To use an ampersand in hive-site.xml, such as in the JDBC string, you must use "&amp;" instead of "&." Otherwise, you get an error when you restart hive-hcatalog-server.
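
After you edit the file, you can quickly confirm that it's still well-formed XML. A minimal check, assuming that the xmllint tool is available on the master node:

xmllint --noout /etc/hive/conf.dist/hive-site.xml && echo "hive-site.xml is well-formed"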

4.     Run a command similar to the following to test the SSL connection:

mysql -h mysql-hive-meta.########.us-east-1.rds.amazonaws.com -P 3306 -u hive_user -p --ssl-ca ./rds-combined-ca-bundle.pem
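
If the connection succeeds, you can confirm that the session is encrypted from within the mysql client. A non-empty Ssl_cipher value indicates that SSL is in use:

mysql> SHOW SESSION STATUS LIKE 'Ssl_cipher';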

5.     Restart hive-server2 and hive-hcatalog-server on the master node. For more information, see Restarting processes. Example:

sudo stop hive-server2
sudo stop hive-hcatalog-server
sudo start hive-hcatalog-server
sudo start hive-server2

6.     Check the Hive logs in the /var/log/ directory to verify that the services restarted successfully.
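
For example, to check hive-server2 for startup errors after the restart (log file names and locations can vary by Amazon EMR release, so treat these paths as a starting point):

sudo tail -n 50 /var/log/hive/hive-server2.log
sudo grep -i error /var/log/hive/hive-server2.log | tail -n 20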

7.     Open a hive session on the master node. Then, create a table. It can be any table—this is for testing purposes only. Example:

hive> create table tb_test (col1 STRING, col2 BIGINT);
OK
Time taken: 2.371 seconds

hive> describe tb_test;
OK
col1                    string
col2                    bigint
Time taken: 0.254 seconds, Fetched: 2 row(s)

8.     Connect to the RDS MySQL metastore using the mysql client on the master node. Then, verify the table metadata in the metastore. If the metadata corresponds to the table that you created in the previous step, the SSL connection is working. Example:

$ mysql -h mysql-hive-meta.########.us-east-1.rds.amazonaws.com -P 3306 -u $RDS_MASTER_USER_NAME -p
Enter password: $RDS_MASTER_PASSWORD

mysql> use hive;
Database changed

mysql> select t1.OWNER, t1.TBL_NAME, t1.TBL_TYPE, s1.INPUT_FORMAT, s1.OUTPUT_FORMAT, s1.LOCATION from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID where t1.TBL_NAME = 'tb_test'\G
*************************** 1. row ***************************
        OWNER: hadoop
     TBL_NAME: tb_test
     TBL_TYPE: MANAGED_TABLE
 INPUT_FORMAT: org.apache.hadoop.mapred.TextInputFormat
OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
     LOCATION: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hive/warehouse/tb_test
1 row in set (0.23 sec)

mysql> select t1.OWNER, t1.TBL_NAME, c1.COLUMN_NAME, c1.TYPE_NAME from TBLS t1 inner join SDS s1 on s1.SD_ID = t1.SD_ID inner join COLUMNS_V2 c1 on c1.CD_ID = s1.CD_ID where t1.TBL_NAME = 'tb_test';
+--------+----------+-------------+-----------+
| OWNER  | TBL_NAME | COLUMN_NAME | TYPE_NAME |
+--------+----------+-------------+-----------+
| hadoop | tb_test  | col1        | string    |
| hadoop | tb_test  | col2        | bigint    |
+--------+----------+-------------+-----------+
2 rows in set (0.22 sec)

Troubleshoot hive-hcatalog-server restart errors

You might get an error message similar to the following when you try to restart hive-hcatalog-server. This usually happens when the RDS MySQL DB instance blocks connections from the EMR cluster's master node after too many connection errors (MySQL's max_connect_errors safeguard).

2020-08-20T14:18:50,750 WARN [main] org.apache.hadoop.hive.metastore.HiveMetaStore - Retrying creating default database after error: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://mariadb-hive-meta.ch1nkjp4jjbq.eu-west-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true&useSSL=true&serverSSlCert=/home/hadoop/rds-combined-ca-bundle.pem, username = masteruser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Host '172.31.41.187' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'

To resolve this error, connect from a different local machine or Amazon Elastic Compute Cloud (Amazon EC2) instance that has the mysqladmin tool installed. Then, run the following command to flush the DB instance's host cache and unblock the master node.

mysqladmin -h mysql-hive-meta.########.us-east-1.rds.amazonaws.com -P 3306 -u $RDS_MASTER_USER_NAME -p flush-hosts
Enter password: $RDS_MASTER_PASSWORD

You should now be able to successfully restart hive-hcatalog-server.
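
To make the block less likely to recur, you can also raise the max_connect_errors parameter in a custom DB parameter group that's attached to the instance (default parameter groups can't be modified). A sketch with the AWS CLI, assuming a custom parameter group named hive-meta-params:

aws rds modify-db-parameter-group \
    --db-parameter-group-name hive-meta-params \
    --parameters "ParameterName=max_connect_errors,ParameterValue=10000,ApplyMethod=immediate"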

