Use Apache Spark and Hive on Amazon EMR with the AWS Glue Data Catalog

Posted on: Aug 14, 2017

You can now use the AWS Glue Data Catalog with Apache Spark and Apache Hive on Amazon EMR. The AWS Glue Data Catalog is a managed metadata repository that is integrated with Amazon EMR, Amazon Athena, Amazon Redshift Spectrum, and AWS Glue ETL jobs. It also provides automatic schema discovery and schema version history. You can use the AWS Glue Data Catalog to store external table metadata for Hive and Spark instead of using an on-cluster or self-managed Hive Metastore. This makes it easier to keep the metadata for your external tables on Amazon S3 outside of your cluster.

You can configure your Amazon EMR clusters to use the AWS Glue Data Catalog from the Amazon EMR console, the AWS Command Line Interface (CLI), or the AWS SDK with the Amazon EMR API. Amazon EMR release 5.8.0 and later can use the AWS Glue Data Catalog for Apache Spark and Apache Hive. For more information about using the AWS Glue Data Catalog with Apache Spark and Apache Hive, see the Amazon EMR documentation. For more information about AWS Glue, see the AWS Glue product page.
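
As a rough illustration of the CLI and SDK route, the sketch below builds the cluster configuration that points Hive and Spark at the AWS Glue Data Catalog. It assumes the `hive-site` and `spark-hive-site` configuration classifications and the `hive.metastore.client.factory.class` property described in the Amazon EMR documentation; the helper name `glue_catalog_configurations` is hypothetical and exists only for this example.

```python
import json

# Hive client factory class that directs metastore calls to the AWS Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(use_for_hive=True, use_for_spark=True):
    """Return EMR configuration objects enabling the Glue Data Catalog.

    Hypothetical helper for illustration; it only builds data and makes
    no AWS calls.
    """
    properties = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    configurations = []
    if use_for_hive:
        # Apache Hive reads its metastore settings from hive-site.
        configurations.append(
            {"Classification": "hive-site", "Properties": dict(properties)}
        )
    if use_for_spark:
        # Spark SQL reads its metastore settings from spark-hive-site.
        configurations.append(
            {"Classification": "spark-hive-site", "Properties": dict(properties)}
        )
    return configurations


if __name__ == "__main__":
    # Print JSON that can be saved to a file for the CLI.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

The printed JSON could be saved to a file and supplied to `aws emr create-cluster` through its `--configurations` option, or the returned list could be passed as the `Configurations` parameter of the boto3 EMR client's `run_job_flow` call, together with a release label of `emr-5.8.0` or later.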

AWS Glue Data Catalog charges are billed separately; pricing information is available on the AWS Glue pricing page. Amazon EMR supports the AWS Glue Data Catalog in the US East (N. Virginia) region.