Posted On: Nov 21, 2016

You can now use Amazon S3 as a data store for Apache HBase on Amazon EMR using the EMR File System (EMRFS). Apache HBase is a distributed, non-relational database built for random, strictly consistent, real-time access to tables with billions of rows and millions of columns. By using Amazon S3 as a data store for Apache HBase, you can separate your cluster’s storage from its compute nodes. This lets you save costs by sizing your cluster for your compute requirements instead of paying to store your entire dataset with 3x replication in the on-cluster Hadoop Distributed File System (HDFS).
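To make the 3x replication point concrete, here is a minimal sketch of the raw on-cluster capacity HDFS would need for a given dataset. The function name and the 10 TB figure are illustrative assumptions, and the calculation ignores compression, headroom, and temporary files.

```python
def hdfs_raw_capacity_tb(dataset_tb: float, replication: int = 3) -> float:
    """Raw disk capacity (TB) needed to hold a dataset in HDFS.

    Hypothetical helper for illustration: it multiplies the logical dataset
    size by the HDFS replication factor and nothing else.
    """
    if dataset_tb < 0 or replication < 1:
        raise ValueError("dataset_tb must be >= 0 and replication >= 1")
    return dataset_tb * replication


if __name__ == "__main__":
    # Assumed example: a 10 TB dataset with the 3x replication mentioned above.
    print(hdfs_raw_capacity_tb(10.0))  # 30.0
```

With Amazon S3 as the data store, the dataset no longer has to fit on cluster disks at all, so node count and instance type can follow compute needs instead of this figure.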

Amazon EMR configures Apache HBase on Amazon S3 to cache data in memory and on disk in your cluster, so that frequently read data is served quickly from the active compute nodes. You can quickly and easily scale out or scale in compute nodes without impacting your underlying storage, or terminate your cluster to save costs and quickly restore it in another Availability Zone.

Apache HBase with support for Amazon S3 is available on Amazon EMR release 5.2.0, and it can be launched using release label “emr-5.2.0” from the AWS Management Console, the AWS CLI, or an AWS SDK. To use Amazon S3 as a data store, configure the storage mode and specify a root directory in your Apache HBase configuration. We also recommend enabling EMRFS consistent view. Please visit the Amazon EMR documentation for more information about Apache HBase on Amazon S3.
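As a rough illustration of the configuration step, the sketch below builds an EMR configurations list that sets the storage mode, the root directory, and EMRFS consistent view, then prints it as JSON. The classification and property names (`hbase` / `hbase.emr.storageMode`, `hbase-site` / `hbase.rootdir`, `emrfs-site` / `fs.s3.consistent`) are given here as we understand them from the Amazon EMR documentation and should be verified there before use. The helper name and the bucket path are placeholders, not part of any AWS API.

```python
import json


def hbase_on_s3_configurations(root_dir: str, consistent_view: bool = True) -> list:
    """Build an EMR configurations list for HBase backed by Amazon S3.

    Hypothetical helper: property names are assumptions to check against
    the Amazon EMR documentation for your release.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")

    configs = [
        # Storage mode: tell HBase on EMR to use S3 instead of HDFS.
        {"Classification": "hbase",
         "Properties": {"hbase.emr.storageMode": "s3"}},
        # Root directory: where HBase keeps its data in S3.
        {"Classification": "hbase-site",
         "Properties": {"hbase.rootdir": root_dir}},
    ]
    if consistent_view:
        # Recommended in the announcement: EMRFS consistent view.
        configs.append({"Classification": "emrfs-site",
                        "Properties": {"fs.s3.consistent": "true"}})
    return configs


if __name__ == "__main__":
    # Placeholder bucket and prefix; substitute your own.
    configs = hbase_on_s3_configurations("s3://my-example-bucket/hbase-root")
    print(json.dumps(configs, indent=2))
```

The printed JSON could be saved to a file and supplied as the cluster's configuration when launching with release label “emr-5.2.0”. How it is passed depends on whether you use the console, the CLI, or an SDK.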