Apache Hive on Amazon EMR

Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface.

Apache Hive is natively supported in Amazon EMR, and you can quickly create managed Apache Hive clusters from the AWS Management Console, the AWS CLI, or the Amazon EMR API. You can also take advantage of other Amazon EMR features: direct connectivity to Amazon DynamoDB or Amazon S3 for storage; an external metastore configured through the AWS Glue Data Catalog, AWS Lake Formation, Amazon RDS, or Amazon Aurora; and EMR Managed Scaling to add or remove instances from your cluster.

Features and benefits

High availability

You can launch an EMR cluster with multiple master nodes to support high availability for Apache Hive. Amazon EMR automatically fails over to a standby master node if the primary master node fails or if a critical process, such as the YARN ResourceManager or the HDFS NameNode, crashes. This means that you can run Apache Hive on EMR clusters without interruption.
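As a rough illustration of a multi-master launch, the sketch below builds the keyword arguments for boto3's EMR run_job_flow call. It is not a definitive template: the instance type, group names, role names, and subnet ID are placeholders, and the count of three master nodes is an assumption to check against the Amazon EMR documentation for your release.

```python
def build_ha_hive_cluster_request(name, subnet_id, release_label="emr-6.0.0", core_count=2):
    """Build keyword arguments for a run_job_flow request with several master nodes.

    All concrete values (instance type, roles, counts) are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master nodes",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # placeholder instance type
                    "InstanceCount": 3,           # assumed count for a multi-master cluster
                },
                {
                    "Name": "Core nodes",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            "Ec2SubnetId": subnet_id,             # placeholder subnet
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",     # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   request = build_ha_hive_cluster_request("hive-ha", "subnet-0123456789abcdef0")
#   boto3.client("emr").run_job_flow(**request)
```

Building the request as a plain dictionary keeps the high-availability choice (the master group size) visible and easy to review before anything is launched.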

Managed scaling

Amazon EMR lets you define EMR Managed Scaling for Apache Hive clusters to help you optimize resource usage. You specify the minimum and maximum compute limits for a cluster, and Amazon EMR automatically resizes it for best performance at the lowest possible cost. EMR Managed Scaling continuously samples key metrics associated with the workloads running on the cluster.
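The minimum and maximum compute limits described above can be expressed as a small policy document. The following is a minimal sketch, assuming the policy is later passed to boto3's put_managed_scaling_policy call; the helper name and the example limits are hypothetical.

```python
def build_managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Return a managed scaling policy with minimum and maximum compute limits.

    The helper and its validation rules are illustrative, not part of any AWS SDK.
    """
    if min_units < 1:
        raise ValueError("min_units must be at least 1")
    if max_units < min_units:
        raise ValueError("max_units must be greater than or equal to min_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Hypothetical usage (requires boto3, AWS credentials, and a real cluster ID):
#   import boto3
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX",
#       ManagedScalingPolicy=build_managed_scaling_policy(2, 10),
#   )
```

Validating the limits locally catches an inverted range before the request reaches the service.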

Fast performance

Amazon EMR 6.0.0 adds support for Hive LLAP, providing an average performance speedup of 2x over EMR 5.29.

You can also use S3 Select with Hive on Amazon EMR to improve performance. S3 Select allows applications to retrieve only a subset of data from an object, which reduces the amount of data transferred between Amazon EMR and Amazon S3.

Amazon EMR also enables fast performance on complex Apache Hive queries. EMR uses Apache Tez by default, which is significantly faster than Apache MapReduce. Because Apache MapReduce runs work in multiple phases, a complex Apache Hive query is broken down into four or five jobs. Apache Tez is designed for more complex queries, so the same query runs as a single job.
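To make these options concrete, here is a hedged sketch of how they might be switched on. The classification names, the hive.llap.enabled property, the S3 Select input format class, and the s3select.filter setting are written as recalled from the Amazon EMR release guide and should be verified against it; the helper names, table name, columns, and S3 location are hypothetical. Since EMR already uses Tez by default, the explicit engine setting is shown only for clarity.

```python
def build_hive_performance_configurations(enable_llap=True):
    """Return EMR configuration classifications for Tez and, optionally, LLAP."""
    configurations = [
        # Tez is already the default engine on EMR; set explicitly for clarity.
        {"Classification": "hive-site", "Properties": {"hive.execution.engine": "tez"}},
    ]
    if enable_llap:
        # Assumed property name for enabling Hive LLAP on EMR 6.0.0 and later.
        configurations.append(
            {"Classification": "hive", "Properties": {"hive.llap.enabled": "true"}}
        )
    return configurations


def s3_select_table_statements(table, columns, s3_location):
    """Return HiveQL statements that define a CSV table read through S3 Select.

    `columns` is a list of (name, hive_type) pairs.
    """
    column_sql = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    create_table = (
        f"CREATE TABLE {table} ({column_sql}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        "STORED AS INPUTFORMAT "
        "'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' "
        "OUTPUTFORMAT "
        "'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' "
        f"LOCATION '{s3_location}'"
    )
    # Session setting that asks Hive to push filters down to S3 Select.
    return [create_table, "SET s3select.filter=true"]
```

For example, s3_select_table_statements("trades", [("symbol", "string"), ("qty", "int")], "s3://example-bucket/trades/") yields two statements that could be submitted through a Hive client of your choice.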

Flexible metastore options

With Amazon EMR, you can keep the Hive metastore local to the cluster or externalize it. EMR integrates with the AWS Glue Data Catalog and AWS Lake Formation, so it can pull information directly from either service to populate the metastore.
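The choice between a local and an external metastore usually comes down to a hive-site configuration supplied at cluster creation. The sketch below assumes the Glue client factory class and the JDBC property names as recalled from the Amazon EMR documentation; the helper itself, its `kind` values, and the example connection details are hypothetical.

```python
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_metastore_configuration(kind, jdbc_url=None, username=None, password=None):
    """Return EMR configuration classifications for the chosen Hive metastore.

    kind: "local" (cluster default), "glue" (AWS Glue Data Catalog), or
    "jdbc" (an external database such as Amazon RDS or Amazon Aurora).
    """
    if kind == "local":
        return []  # no override: the metastore stays on the cluster
    if kind == "glue":
        properties = {"hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY}
    elif kind == "jdbc":
        if not (jdbc_url and username and password):
            raise ValueError("jdbc metastore needs jdbc_url, username, and password")
        properties = {
            "javax.jdo.option.ConnectionURL": jdbc_url,
            # Assumed driver class; confirm for your database engine.
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": username,
            "javax.jdo.option.ConnectionPassword": password,
        }
    else:
        raise ValueError(f"unknown metastore kind: {kind!r}")
    return [{"Classification": "hive-site", "Properties": properties}]
```

The returned list is shaped to be passed as the Configurations argument of a cluster creation request. In practice, a database password would come from a secrets store rather than a literal.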

Customer success

Airbnb connects people with places to stay and things to do around the world, with 2.9 million hosts listed and 800k nightly stays supported. Airbnb uses Amazon EMR to run Apache Hive on an S3 data lake. Running Hive on the EMR clusters enables Airbnb analysts to perform ad hoc SQL queries on data stored in the S3 data lake. By migrating to an S3 data lake, Airbnb reduced expenses, gained the ability to do cost attribution, and tripled the speed of its Apache Spark jobs.

Guardian gives 27 million members the security they deserve through insurance and wealth management products and services. Guardian uses Amazon EMR to run Apache Hive on an S3 data lake. Apache Hive is used for batch processing to enable fast queries on large datasets. The S3 data lake fuels Guardian Direct, a digital platform that allows consumers to research and purchase both Guardian products and third-party products in the insurance sector.

FINRA – the Financial Industry Regulatory Authority – is the largest independent securities regulator in the United States, and monitors and regulates financial trading practices. FINRA uses Amazon EMR to run Apache Hive on an S3 data lake. Running Hive on the EMR clusters enables FINRA to process and analyze trade data of up to 90 billion events using SQL. The cloud data lake resulted in cost savings of up to $20 million compared to FINRA’s on-premises solution, and drastically reduced the time needed for recovery and upgrades.

Vanguard, an American registered investment advisor, is the largest provider of mutual funds and the second largest provider of exchange traded funds. Vanguard uses Amazon EMR to run Apache Hive on an S3 data lake. Data is stored in S3 and EMR builds a Hive metastore on top of that data. The Hive metastore contains all the metadata about the data and tables in the EMR cluster, which allows for easy data analysis. Hive also enables analysts to perform ad hoc SQL queries on data stored in the S3 data lake. Migrating to an S3 data lake with Amazon EMR has enabled 150+ data analysts to realize operational efficiency and has reduced EC2 and EMR costs by $600k.