Posted On: Oct 27, 2022
Hive users run Metastore check command with the repair table option (MSCK REPAIR table) to update the partition metadata in the Hive metastore for partitions that were directly added to or removed from the file system (S3 or HDFS). When run, MSCK repair command must make a file system call to check if the partition exists for each partition. This step could take a long time if the table has thousands of partitions. In EMR 6.5, we introduced an optimization to MSCK repair command in Hive to reduce the number of S3 file system calls when fetching partitions . This feature improves performance of MSCK command (~15-20x on 10k+ partitions) due to reduced number of file system calls especially when working on tables with large number of partitions. Previously, you had to enable this feature by explicitly setting a flag. Starting with Amazon EMR 6.8, we further reduced the number of S3 filesystem calls to make MSCK repair run faster and enabled this feature by default.
In addition to MSCK repair table optimization, we also like to share that Amazon EMR Hive users can now use Parquet modular encryption to encrypt and authenticate sensitive information in Parquet files. It is a challenging task to protect the privacy and integrity of sensitive data at scale while keeping the Parquet functionality intact. Data protection solutions such as encrypting files or storage layer are currently used to encrypt Parquet files, however, they could lead to performance degradation. With Parquet modular encryption, you can not only enable granular access control but also preserve the Parquet optimizations such as columnar projection, predicate pushdown, encoding and compression. Using Parquet modular encryption, Amazon EMR Hive users can protect both Parquet data and metadata, use different encryption keys for different columns, and perform partial encryption of only sensitive columns. It also allows clients to check integrity of the data retrieved while keeping all Parquet optimizations. This feature is available from Amazon EMR 6.6 release and above.
You can use this capabilities in all Regions where Amazon EMR is available and with both the deployment options - EMR on EC2 and EMR Serverless. To learn more on these features, please refer our documentation.