Posted On: Dec 12, 2013
We are pleased to announce support for Impala with Amazon EMR. Impala is an open source tool for real-time, ad hoc querying using a familiar SQL-like language. By using Impala on Amazon EMR, you can perform fast interactive analytics on unstructured data. For many types of queries, Impala is much faster than Hive. This performance makes it a great engine for iterative queries and for many popular BI tools. With Amazon EMR, you can use Impala as a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence. Here are three use cases:
- Use Impala instead of Hive on long-running clusters to perform ad hoc queries. Impala reduces interactive queries to seconds, making it an excellent tool for fast investigation. You could run Impala on the same cluster as your batch MapReduce workflows, use Impala on a long-running analytics cluster with Hive and Pig, or create a cluster specifically tuned for Impala queries.
- Use Impala instead of Hive for batch ETL jobs on transient Amazon EMR clusters. Impala is faster than Hive for many queries, which provides better performance for these workloads. Like Hive, Impala uses SQL, so Hive queries can easily be adapted for Impala.
- Use Impala in conjunction with a third-party business intelligence tool. Connect a client ODBC or JDBC driver to your cluster to use Impala as an engine for powerful visualization tools and dashboards.
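
As a rough illustration of the third use case, the sketch below builds the kind of JDBC connection string a BI tool might ask for. It assumes the cluster exposes Impala's HiveServer2-compatible endpoint on port 21050 without authentication; that is a common Impala default but is not stated in this announcement, so check it against your cluster. The function name and the host name are hypothetical.

```python
def build_impala_jdbc_url(master_host: str, port: int = 21050, database: str = "") -> str:
    """Return a HiveServer2-style JDBC URL pointing at an Impala daemon.

    Assumptions (not taken from the announcement):
      - Impala listens for JDBC clients on port 21050.
      - The cluster accepts unauthenticated connections (auth=noSasl).
    """
    if not master_host:
        raise ValueError("master_host must not be empty")
    # An empty database name leaves the choice to the server default.
    return f"jdbc:hive2://{master_host}:{port}/{database};auth=noSasl"


if __name__ == "__main__":
    # Hypothetical master node DNS name; replace with your cluster's own.
    print(build_impala_jdbc_url("ec2-master.example.com"))
```

The resulting string would be pasted into the BI tool's connection dialog together with a compatible JDBC driver; the port and authentication settings may differ on your cluster.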
To get started, you can launch an Amazon EMR cluster with Impala using the EMR Command Line Interface (CLI), Management Console, APIs, or SDKs. You must also use an AMI that includes the Amazon distribution of Hadoop 2.x (AMI 3.0.2 or later). To learn more about using Impala with Amazon EMR, please visit the Developer's Guide or FAQ.