How do I use a PostgreSQL database as the external metastore for Hive on Amazon EMR?

Last updated: 2022-01-31

I want to use an Amazon Relational Database Service (Amazon RDS) for PostgreSQL DB instance as my external metastore for Apache Hive on Amazon EMR.

Resolution

Before you begin, note the following:

  • This solution assumes that you already have an active PostgreSQL database.
  • If you're using Amazon EMR release version 5.7 or earlier, download the PostgreSQL JDBC driver. Then, add the driver to the Hive library path (/usr/lib/hive/lib). Amazon EMR release versions 5.8.0 and later come with the PostgreSQL JDBC driver in the Hive library path.

To configure a PostgreSQL DB instance as the external metastore for Hive, do the following:

1.    Create an Amazon RDS for PostgreSQL DB instance and create the database. You can do this while creating the DB instance in the Amazon RDS console: specify the database name in the Initial database name field under Additional configuration. Or, you can connect to the PostgreSQL DB instance and then create the database.
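If you prefer the AWS CLI, the DB instance and the initial database can be created in one call. A minimal sketch: the instance identifier, instance class, storage size, and credentials below are placeholder assumptions, and the command is printed for review rather than executed.

```shell
# Sketch only: builds (and prints) an "aws rds create-db-instance" call.
# Every value below is a placeholder assumption -- substitute your own.
DB_INSTANCE_ID="mypostgresql"
DB_NAME="mypgdb"               # created as the initial database, like the
                               # "Initial database name" console field
MASTER_USER="database_username"
MASTER_PASS="database_password"

CMD="aws rds create-db-instance \
  --db-instance-identifier $DB_INSTANCE_ID \
  --engine postgres \
  --db-name $DB_NAME \
  --db-instance-class db.t3.medium \
  --allocated-storage 20 \
  --master-username $MASTER_USER \
  --master-user-password $MASTER_PASS"

echo "$CMD"
# After reviewing the printed command, run it with: eval "$CMD"
```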

2.    Modify the DB instance's security group to allow inbound connections on port 5432 from the ElasticMapReduce-master security group. For more information, see VPC security groups.
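If you manage security groups from the command line, the same rule can be added with the AWS CLI. A sketch in which both group IDs are placeholder assumptions; the command is printed for review rather than executed.

```shell
# Sketch: allow inbound PostgreSQL traffic (port 5432) to the RDS
# security group from the EMR master security group.
# Both group IDs are placeholders -- look up your own.
RDS_SG="sg-0123456789abcdef0"        # security group attached to the DB instance
EMR_MASTER_SG="sg-0fedcba9876543210" # ElasticMapReduce-master security group

CMD="aws ec2 authorize-security-group-ingress \
  --group-id $RDS_SG \
  --protocol tcp --port 5432 \
  --source-group $EMR_MASTER_SG"

echo "$CMD"
# After reviewing the printed command, run it with: eval "$CMD"
```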

3.    Launch an Amazon EMR cluster without an external metastore. Amazon EMR uses the default MySQL database in this case.

4.    Connect to the master node using SSH.

5.    Replace the Hive configuration with the following properties.
Replace the following values in the example:

  • mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com with the endpoint of your DB instance
  • mypgdb with the name of your PostgreSQL database
  • database_username with the DB instance user name
  • database_password with the DB instance password

[hadoop@ip-X-X-X-X ~]$ sudo vi /etc/hive/conf/hive-site.xml

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
    <description>PostgreSQL JDBC driver connection URL</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description>PostgreSQL metastore driver class name</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_username</value>
    <description>the username for the DB instance</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_password</value>
    <description>the password for the DB instance</description>
  </property>

6.    Run the following commands to create the PostgreSQL schema:

[hadoop@ip-X-X-X-X ~]$ cd /usr/lib/hive/bin/
[hadoop@ip-X-X-X-X bin]$ ./schematool -dbType postgres -initSchema  
Metastore connection URL:     jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb
Metastore Connection Driver :     org.postgresql.Driver
Metastore connection User:     test
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.postgres.sql
Initialization script completed
schemaTool completed

7.    Stop and start Hive services so that the updated settings take effect:

[hadoop@ip-X-X-X-X bin]$ sudo initctl list |grep -i hive
hive-server2 start/running, process 11818
hive-hcatalog-server start/running, process 12708
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-server2
hive-server2 stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-hcatalog-server
hive-hcatalog-server stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo start hive-server2
hive-server2 start/running, process 18798
[hadoop@ip-X-X-X-X bin]$ sudo start hive-hcatalog-server
hive-hcatalog-server start/running, process 19614

You can automate steps 5 through 7 by running the following bash script (hive_postgres_emr_step.sh) as a step job on the EMR cluster.

## Automated Bash script to update the hive-site.xml and restart Hive

## Parameters
rds_db_instance_endpoint='<rds_db_instance_endpoint>'
rds_db_instance_port='<rds_db_instance_port>'
rds_db_name='<rds_db_name>'
rds_db_instance_username='<rds_db_instance_username>'
rds_db_instance_password='<rds_db_instance_password>'

############################# Copying the original hive-site.xml
sudo cp /etc/hive/conf/hive-site.xml /tmp/hive-site.xml

############################# Changing the JDBC URL
old_jdbc=`grep "javax.jdo.option.ConnectionURL" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$old_jdbc|<value>jdbc:postgresql://$rds_db_instance_endpoint:$rds_db_instance_port/$rds_db_name</value>|g" /tmp/hive-site.xml

############################# Changing the Driver name
old_driver_name=`grep "javax.jdo.option.ConnectionDriverName" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$old_driver_name|<value>org.postgresql.Driver</value>|g" /tmp/hive-site.xml

############################# Changing the database user
old_db_username=`grep "javax.jdo.option.ConnectionUserName"  -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$old_db_username|<value>$rds_db_instance_username</value>|g" /tmp/hive-site.xml

############################# Changing the database password and description
connection_password=`grep "javax.jdo.option.ConnectionPassword" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$connection_password|<value>$rds_db_instance_password</value>|g" /tmp/hive-site.xml
old_password_description=`grep "javax.jdo.option.ConnectionPassword" -A +3 -B 1 /tmp/hive-site.xml | grep "<description>" | xargs`
new_password_description='<description>the password for the DB instance</description>'
sudo sed -i "s|$old_password_description|$new_password_description|g" /tmp/hive-site.xml

############################# Moving hive-site to backup
sudo mv /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml_bkup
sudo mv /tmp/hive-site.xml /etc/hive/conf/hive-site.xml

############################# Init Schema for Postgres
/usr/lib/hive/bin/schematool -dbType postgres -initSchema

############################# Restart Hive
## Check Amazon Linux version and restart Hive
OS_version=`uname -r`
if [[ "$OS_version" == *"amzn2"* ]]; then
    echo "Amazon Linux 2 instance, restarting Hive..."
    sudo systemctl stop hive-server2
    sudo systemctl stop hive-hcatalog-server
    sudo systemctl start hive-server2
    sudo systemctl start hive-hcatalog-server
elif [[ "$OS_version" == *"amzn1"* ]]; then
    echo "Amazon Linux 1 instance, restarting Hive"
    sudo stop hive-server2
    sudo stop hive-hcatalog-server
    sudo start hive-server2
    sudo start hive-hcatalog-server
else
    echo "ERROR: OS version is neither Amazon Linux 1 nor Amazon Linux 2."
fi
echo "--------------------COMPLETED--------------------"

Be sure to replace the following values in the script:

  • rds_db_instance_endpoint with the endpoint of your DB instance
  • rds_db_instance_port with the port of your DB instance
  • rds_db_name with the name of your PostgreSQL database
  • rds_db_instance_username with the DB instance user name
  • rds_db_instance_password with the DB instance password

Upload the script to Amazon S3. You can run the script as a step job using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the API. To use the Amazon EMR console to run the script, do the following:

1.    Open the Amazon EMR console.

2.    On the Cluster List page, select the link for your cluster.

3.    On the Cluster Details page, choose the Steps tab.

4.    On the Steps tab, choose Add step.

5.    In the Add step dialog box, retain the default values for Step type and Name.

6.    For JAR location, enter the following:

command-runner.jar

7.    For Arguments, enter the following:

bash -c "aws s3 cp s3://example_bucket/script/hive_postgres_emr_step.sh .; chmod +x hive_postgres_emr_step.sh; ./hive_postgres_emr_step.sh"

Replace the S3 location in the command with the location where you stored the script.

8.    Choose Add to run the step job.
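The same step can also be submitted from the AWS CLI. A sketch, assuming a placeholder cluster ID (j-XXXXXXXXXXXXX) and the example S3 path from step 7; a JSON steps file sidesteps the quoting pitfalls of the shorthand Args syntax, and the add-steps command is printed rather than executed so you can review it first.

```shell
# Sketch: submit the script as a step job with "aws emr add-steps".
# The cluster ID and S3 path are placeholders -- substitute your own.
cat > step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Configure Hive PostgreSQL metastore",
    "ActionOnFailure": "CONTINUE",
    "Jar": "command-runner.jar",
    "Args": ["bash", "-c",
      "aws s3 cp s3://example_bucket/script/hive_postgres_emr_step.sh .; chmod +x hive_postgres_emr_step.sh; ./hive_postgres_emr_step.sh"]
  }
]
EOF

echo aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://step.json
# Remove the echo to submit the step.
```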

After the step job is completed, do the following to verify the Hive configuration updates:

1.    Log in to the Hive shell and create a Hive table.

Note: Be sure to replace test_postgres in the example with the name of your Hive table.

[hadoop@ip-X-X-X-X bin]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
hive> show databases;
OK
default
Time taken: 0.569 seconds, Fetched: 1 row(s)
hive> create table test_postgres(a int,b int);
OK
Time taken: 0.708 seconds

2.    Install the PostgreSQL client:

[hadoop@ip-X-X-X-X bin]$ sudo yum install postgresql

3.    Connect to the PostgreSQL DB instance using the command line.

Replace the following values in the command:

  • mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com with the endpoint of your DB instance
  • mypgdb with the name of your PostgreSQL database
  • database_username with the DB instance user name

[hadoop@ip-X-X-X-X bin]$ psql --host=mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com --port=5432 --username=database_username --password --dbname=mypgdb

4.    When prompted, enter the password for the DB instance.

5.    Run the following command to confirm that you can access the Hive table that you created earlier:

mypgdb=>  select * from "TBLS";

 TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID |   TBL_NAME    |   TBL_TYPE    | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED 
--------+-------------+-------+------------------+--------+-----------+-------+---------------+---------------+--------------------+--------------------+--------------------
      1 |  1555014961 |     1 |                0 | hadoop |         0 |     1 | test_postgres | MANAGED_TABLE |                    |                    | f
(1 row)

Your Amazon EMR cluster is now using the PostgreSQL database as the external metastore for Hive.