I created an Amazon SageMaker notebook instance. Then I connected it to an Apache Spark Amazon EMR cluster using Apache Livy. I can see databases and tables in the AWS Glue Data Catalog. However, when I try to query the data, I get an error message like this:

An error occurred while calling o647.showString. : java.lang.AssertionError: assertion failed: No plan for HiveTableRelation 'mydatabase'.'cloudfront_logs', org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

In Amazon EMR release versions 5.10.0 to 5.17.0, HiveContext is not enabled by default when you use Livy. Enable HiveContext in Livy to resolve the error.
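To decide whether a given cluster is affected, the release range above can be checked programmatically. The following is a minimal sketch, not an official AWS utility: the function names are hypothetical, and it assumes the release label has the usual form "emr-5.12.1". It takes the 5.10.0 to 5.17.0 range literally.

```python
def parse_release(label):
    """Turn a release label such as 'emr-5.12.1' into the tuple (5, 12, 1)."""
    version = label[4:] if label.startswith("emr-") else label
    return tuple(int(part) for part in version.split("."))


def needs_hive_context_fix(label):
    """True when the release is within the 5.10.0 to 5.17.0 range (inclusive)."""
    return (5, 10, 0) <= parse_release(label) <= (5, 17, 0)


print(needs_hive_context_fix("emr-5.12.1"))  # True
print(needs_hive_context_fix("emr-5.20.0"))  # False
```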

You can enable HiveContext in Livy on a long-running cluster or when launching a new cluster. For a long-running cluster:

1.    Connect to the master node using SSH. For example:

ssh -i ~/mykeypair.pem hadoop@master-public-DNS

2.    Navigate to the directory where the livy.conf file is located:

cd /etc/livy/conf/

3.    Back up the livy.conf file:

sudo cp livy.conf livy.bk

4.    Open the livy.conf file in your preferred text editor. For example:

sudo vim livy.conf

5.    Add the following line to livy.conf, and then save the file:

livy.repl.enable-hive-context true

6.    Run the following commands to restart livy-server:

sudo stop livy-server
sudo start livy-server
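If you prefer to script the edit in step 5 instead of using a text editor, the change can be made idempotent so that running it twice does not add the line twice. The following is a minimal sketch that works on the text of livy.conf only; the function name and the sample input are hypothetical, and reading and writing /etc/livy/conf/livy.conf (which requires root privileges) is left to the caller.

```python
import re

KEY = "livy.repl.enable-hive-context"


def enable_hive_context(conf_text):
    """Return conf_text with exactly one active 'KEY true' line."""
    kept = []
    for line in conf_text.splitlines():
        # The property name is everything before the first space, '=' or ':'.
        name = re.split(r"[\s=:]+", line.strip(), maxsplit=1)[0]
        if name == KEY:
            continue  # drop any existing active setting for this key
        kept.append(line)  # comments and other settings are left untouched
    kept.append(KEY + " true")
    return "\n".join(kept) + "\n"


# Made-up sample content, only to show the behaviour.
sample = "livy.spark.master yarn\nlivy.repl.enable-hive-context false\n"
print(enable_hive_context(sample), end="")
```

After writing the result back to livy.conf, restart livy-server as shown in step 6.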

For a new cluster:

Use the following configuration object, which sets the livy-conf classification, to enable HiveContext when you launch the cluster:

[
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.repl.enable-hive-context": "true"
    }
  }
]
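If you launch clusters from a script, the same configuration object can be generated rather than typed by hand. The following is a minimal sketch; the function name is hypothetical. It only builds and prints the JSON shown above.

```python
import json


def livy_hive_configuration():
    """Build the livy-conf classification that enables HiveContext."""
    return [
        {
            "Classification": "livy-conf",
            "Properties": {"livy.repl.enable-hive-context": "true"},
        }
    ]


# Print the JSON so that it can be saved to a file or pasted into the console.
print(json.dumps(livy_hive_configuration(), indent=2))
```

One way to use the output is to save it to a file and pass that file to the --configurations option of aws emr create-cluster.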

Test PySpark query functionality in your Amazon SageMaker notebook

1.    Create a Spark context in your notebook by running the following in a cell:

sc

2.    Import HiveContext, and then preview the table. For example:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
cl_logs = hive_context.table("mydatabase.cloudfront_logs")
cl_logs.show()

3.    Run a simple query to confirm that the java.lang.AssertionError is resolved. For example:

result = hive_context.sql('SELECT * FROM mydatabase.cloudfront_logs WHERE method="GET" AND status=200 LIMIT 5')
result.show()
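The check in the steps above can also be wrapped in a small helper that reports whether the original error still occurs. The following is a minimal sketch, not part of PySpark: the function name is hypothetical, and it assumes that the text of the exception raised in the notebook contains the "No plan for HiveTableRelation" message shown at the top of this article.

```python
def table_is_queryable(hive_context, table_name):
    """Return True if one row can be fetched from table_name.

    Returns False when the 'No plan for HiveTableRelation' assertion is
    raised. Any other error is re-raised unchanged.
    """
    try:
        # Fetch a single row so that Spark has to plan and run the query.
        hive_context.table(table_name).limit(1).collect()
        return True
    except Exception as err:
        # Assumption: the Java assertion message appears in str(err).
        if "No plan for HiveTableRelation" in str(err):
            return False
        raise
```

With the hive_context object from step 2, calling table_is_queryable(hive_context, "mydatabase.cloudfront_logs") should return True after HiveContext is enabled in Livy.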

Published: 2019-02-12