Apache HBase on Amazon EMR

Amazon EMR natively supports Apache HBase, giving you real-time access to tables that can scale to billions of rows and millions of columns. Amazon EMR combines the benefits of open-source Apache HBase, a column-oriented data store for distributed systems, with the durability, performance, integration, and tooling capabilities of Amazon EMR. You get strongly consistent reads and writes, and you can query petabytes of data within milliseconds to power mission-critical workloads in financial services, ad tech, web analytics, and applications that use time-series data. Your existing Apache HBase applications will work on Amazon EMR without any code changes.

Features and benefits

Durability

Amazon EMR enables you to use Amazon S3 as a data store for Apache HBase through the EMR File System (EMRFS). Using Amazon S3 as a data store decouples your compute from your storage and provides several advantages over the on-cluster Hadoop Distributed File System (HDFS) from Apache Hadoop. You can save costs by sizing your cluster for your compute requirements instead of your HDFS data storage requirements, while getting the availability and durability of Amazon S3 for your data. You can scale compute nodes without impacting your underlying storage, terminate your cluster when your job finishes to save costs, and quickly restore your cluster when you need it. You can also create and configure a read-replica cluster in the Amazon EC2 Availability Zone where the primary cluster resides. The replica provides read-only access to the same data and keeps that data accessible even if the primary cluster becomes unavailable. Amazon EMR also persists Apache HBase data files (HFiles) to Amazon S3.
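As an illustration of the S3-backed storage and read-replica setup described above, the following sketch builds the configuration classifications that a cluster launch request would carry. It assumes the `hbase` and `hbase-site` classifications with the `hbase.emr.storageMode`, `hbase.emr.readreplica.enabled`, and `hbase.rootdir` properties from the Amazon EMR documentation. The helper name `hbase_on_s3_configurations` and the bucket path are hypothetical, and nothing here calls AWS.

```python
import json


def hbase_on_s3_configurations(root_dir: str, read_replica: bool = False) -> list:
    """Build EMR configuration classifications for HBase with S3 storage.

    Hypothetical helper for illustration; it only assembles a data structure.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")

    # Storage mode "s3" tells HBase on EMR to keep its root directory in S3.
    hbase_props = {"hbase.emr.storageMode": "s3"}
    if read_replica:
        # A read-replica cluster points at the same root directory as the
        # primary cluster and serves read-only requests.
        hbase_props["hbase.emr.readreplica.enabled"] = "true"

    return [
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
        {"Classification": "hbase", "Properties": hbase_props},
    ]


# Example bucket name is a placeholder, not a real resource.
primary = hbase_on_s3_configurations("s3://example-bucket/hbase-root/")
replica = hbase_on_s3_configurations("s3://example-bucket/hbase-root/", read_replica=True)
print(json.dumps(replica, indent=2))
```

The primary and replica configurations differ only in the read-replica flag; both reference the same root directory, which is what lets the replica read the same data.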

Performance

Apache HBase is designed to maintain performance while scaling out to hundreds of nodes, supporting random access to billions of rows and millions of columns. It uses Amazon S3 (with EMRFS) or the Hadoop Distributed File System (HDFS) as a fault-tolerant data store. Amazon EMR supports a wide variety of instance types and Amazon EBS volumes, so you can customize the hardware of your cluster to optimize for cost and performance.

Integration

You can launch a fully configured Amazon EMR cluster running Apache HBase and other Apache Hadoop and Apache Spark ecosystem applications in minutes. Amazon EMR automatically replaces poorly performing nodes, and you can resize your cluster to meet your requirements. You can manage tables and browse data in Apache HBase using the Hue UI, and back up and restore tables to Amazon S3 using EMRFS and Hadoop MapReduce. Additionally, Apache HBase on Amazon EMR can use Amazon EMR's authorization, Kerberos authentication, and encryption feature sets.
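One common way to back up a table to Amazon S3 with MapReduce is HBase's `ExportSnapshot` tool. The sketch below only assembles the command line and does not run it. It assumes a snapshot has already been created (for example with `snapshot 'table', 'name'` in the HBase shell). The function name, snapshot name, and bucket path are hypothetical.

```python
import shlex


def export_snapshot_command(snapshot: str, s3_uri: str, mappers: int = 2) -> list:
    """Assemble an HBase ExportSnapshot command that copies a snapshot to S3.

    Hypothetical helper for illustration; it returns the argument list only.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must be an s3:// URI")
    if mappers < 1:
        raise ValueError("mappers must be at least 1")

    return [
        "hbase",
        "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,    # name of an existing HBase snapshot
        "-copy-to", s3_uri,       # destination directory, written through EMRFS
        "-mappers", str(mappers), # number of MapReduce map tasks doing the copy
    ]


# Placeholder snapshot and bucket names.
cmd = export_snapshot_command("orders-snap", "s3://example-bucket/hbase-backups/", mappers=4)
print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps quoting explicit if the command is later passed to a process launcher on the cluster's master node.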

Tooling

Amazon EMR enables you to use Amazon S3 as a data store for Apache HBase through the EMR File System (EMRFS). Separating your cluster's storage from its compute nodes by using Amazon S3 as a data store provides several advantages over on-cluster HDFS. You can save costs by sizing your cluster for your compute requirements instead of your HDFS data storage requirements, get the availability and durability of Amazon S3 storage, scale compute nodes without impacting your underlying storage, and terminate your cluster to save costs and then quickly restore it. You can also create and configure a read-replica cluster in another Amazon EC2 Availability Zone that provides read-only access to the same data as the primary cluster, ensuring uninterrupted access to your data even if the primary cluster becomes unavailable.