Posted On: Nov 22, 2022

Amazon EMR customers can now use AWS Glue Data Catalog from their streaming and batch SQL workflows on Flink. The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog. You can configure your Flink jobs on Amazon EMR to use the Data Catalog as an external Apache Hive metastore. With this release, You can then directly run Flink SQL queries against the tables stored in the Data Catalog.

 Flink supports on-cluster Hive metastore as the out-of-box persistent catalog. This means that metadata had to be recreated when clusters were shutdown and it was hard for multiple clusters to share the same metadata information. Starting with Amazon EMR 6.9, your Flink jobs on Amazon EMR can manage Flink’s metadata in AWS Glue Data Catalog. You can use a persistent and fully managed Glue Data Catalog as a centralized repository. Each Data Catalog is a highly scalable collection of tables organized into databases. 

The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. You can then query the metadata and transform that data in a consistent manner across a wide variety of applications. With support for AWS Glue Data Catalog, you can use Apache Flink on Amazon EMR for unified BATCH and STREAM processing of Apache Hive Tables or metadata of any Flink tablesource such as Iceberg, Kinesis or Kafka. You can specify the AWS Glue Data Catalog as the metastore for Flink using the AWS Management Console, AWS CLI, or Amazon EMR API.

You can use this capability in all regions where Amazon EMR is available. To learn more about this feature, please refer to our documentation.