Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database which runs on top of Amazon S3 (using EMRFS) or the Hadoop Distributed Filesystem (HDFS), and it is built for random, strictly consistent realtime access for tables with billions of rows and millions of columns. Apache Phoenix integrates with Apache HBase for low-latency SQL access over Apache HBase tables and secondary indexing for increased performance. Additionally, Apache HBase has tight integration with Apache Hadoop, Apache Hive, and Apache Pig, so you can easily combine massively parallel analytics with fast data access. Apache HBase's data model, throughput, and fault tolerance are a good match for workloads in ad tech, web analytics, financial services, applications using time-series data, and many more.
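To make the "non-relational, versioned" data model more concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the HBase client API: the class name `ToyTable`, its methods, and the `max_versions` parameter are hypothetical. It models a table as a sorted map from row key to column to timestamped values, which is the general shape of the model described above.

```python
from collections import defaultdict


class ToyTable:
    """Toy illustration of a sparse, versioned, row-sorted table (not HBase itself)."""

    def __init__(self, max_versions=3):
        # Hypothetical retention setting for this sketch only.
        self.max_versions = max_versions
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell and drop versions beyond max_versions."""
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        """Random access by row key: return up to `versions` newest (timestamp, value) pairs."""
        return self._rows.get(row, {}).get(column, [])[:versions]

    def scan(self, start_row, stop_row):
        """Yield rows in sorted key order within [start_row, stop_row), newest values only."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {col: cells[0][1] for col, cells in self._rows[row].items()}


if __name__ == "__main__":
    table = ToyTable(max_versions=2)
    table.put("user1", "stats:clicks", 1, timestamp=100)
    table.put("user1", "stats:clicks", 2, timestamp=200)
    print(table.get("user1", "stats:clicks", versions=2))
```

Rows only hold the columns that were actually written, which is why a table can have a very large number of possible columns without storing empty cells.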
Apache HBase is natively supported in Amazon EMR, so you can quickly and easily create managed Apache HBase clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API. You can also take advantage of additional Amazon EMR features, including using Amazon S3 as a data store to reduce costs, creating read-replica clusters for increased availability, choosing from a wide variety of Amazon EC2 instances and Amazon EBS volumes for your cluster's hardware, backing up and restoring to Amazon S3 using the Amazon EMR File System (EMRFS), automatic node replacement, and easy resize commands to add or remove instances from your cluster. Also, you can use Hue to visualize your HBase tables and explore your data. Learn more about Apache HBase on Amazon EMR.
Features and benefits
Performance at scale
Apache HBase is designed to maintain performance while scaling out to hundreds of nodes, supporting billions of rows and millions of columns. It uses Amazon S3 (with EMRFS) or the Hadoop Distributed Filesystem (HDFS) as a fault-tolerant datastore. Amazon EMR supports a wide variety of instance types and Amazon EBS volumes, so you can customize the hardware of your cluster to optimize for cost and performance. Additionally, you can use Apache Phoenix to run low-latency SQL over massive HBase tables or to create secondary indexes for increased performance.
Through tight integration with projects in the Apache Hadoop ecosystem, you can easily run massively parallel analytics workloads on data stored in HBase tables. You can easily install Apache Phoenix, Apache Hadoop, Apache Hive, Apache Pig, and other open-source big data applications on your Amazon EMR cluster alongside Apache HBase, and utilize these tools to run reporting, SQL queries, or other analytics workloads on your data in Apache HBase. Also, you can use these tools to bulk import/export data into Apache HBase tables, or use Apache Hive to join data from Apache HBase with external tables on Amazon S3.
Integration with Amazon EMR
You can launch a fully configured Amazon EMR cluster running Apache HBase and other Apache Hadoop and Apache Spark ecosystem applications in minutes. Amazon EMR automatically replaces poorly performing nodes, and you can easily resize your cluster to meet your requirements. You can manage tables and browse data in Apache HBase using the Hue UI, and easily back up and restore tables to Amazon S3 using EMRFS and Hadoop MapReduce. Additionally, Apache HBase on Amazon EMR can use Amazon EMR’s authorization, Kerberos authentication, and encryption feature sets. More details are available in the Amazon EMR features documentation.
Amazon S3 storage for HBase
Amazon EMR enables you to use Amazon S3 as a data store for Apache HBase using the EMR File System. Separating your cluster’s storage and compute nodes by using Amazon S3 as a data store provides several advantages over on-cluster HDFS. You can save costs by sizing your cluster for your compute requirements instead of HDFS data storage, get the availability and durability of S3 storage, scale compute nodes without impacting your underlying storage, and terminate your cluster to save costs and quickly restore it later. You can also create and configure a read-replica cluster in another Amazon EC2 Availability Zone that provides read-only access to the same data as the primary cluster, ensuring uninterrupted access to your data even if the primary cluster becomes unavailable.
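As a rough sketch of what this configuration can look like, the Python helper below builds the list of EMR configuration classifications for an HBase cluster that stores its data in Amazon S3, with an optional read-replica flag. The helper name `hbase_on_s3_configurations` and the bucket path are hypothetical. The classification and property names (`hbase-site`/`hbase.rootdir`, `hbase`/`hbase.emr.storageMode`, `hbase.emr.readreplica.enabled`) are given as recalled from the Amazon EMR Release Guide and should be checked against the guide for your EMR release before use.

```python
import json


def hbase_on_s3_configurations(root_dir, read_replica=False):
    """Build EMR configuration classifications for HBase on S3.

    root_dir: S3 URI for the HBase root directory (hypothetical example value below).
    read_replica: when True, mark the cluster as a read-only replica of the same root.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")

    # Property names are assumptions based on the EMR Release Guide; verify for your release.
    hbase_props = {"hbase.emr.storageMode": "s3"}
    if read_replica:
        hbase_props["hbase.emr.readreplica.enabled"] = "true"

    return [
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
        {"Classification": "hbase", "Properties": hbase_props},
    ]


if __name__ == "__main__":
    # Print the JSON for a primary cluster; the bucket name is a placeholder.
    configs = hbase_on_s3_configurations("s3://example-bucket/hbase-root")
    print(json.dumps(configs, indent=2))
```

The resulting list is plain data, so it could be saved as a JSON file for the AWS CLI or passed as the `Configurations` argument when creating a cluster through the Amazon EMR API. A read-replica cluster would use the same `root_dir` with `read_replica=True`.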
Customer success with HBase and Amazon EMR
FINRA, the Financial Industry Regulatory Authority, is the largest independent securities regulator in the United States, and monitors and regulates financial trading practices. FINRA uses Amazon EMR to run Apache HBase on Amazon S3 for random access on 3 trillion records (growing by billions per day) for an interactive application to search and display related market events. By decoupling their storage and compute, FINRA can store a single copy of their data in Amazon S3 and size their cluster for the compute capacity needed, rather than size their cluster for storing data in HDFS with 3x replication. This amounts to cost savings of over 60% per year, easy scalability of compute, and a reduction in the restoration time of a cluster in a new EC2 Availability Zone from days to less than 30 minutes.
Monster, a global leader in connecting people and jobs, utilizes Apache HBase on Amazon EMR to store clickstream and advertising campaign data for downstream analytics. This enables them to monitor how different customer segments are performing in a given campaign at the granularity of a single impression. Monster’s analytics team can easily scan through rows to aggregate the number of views and clicks per user to identify campaign activity. Additionally, they utilize Apache HBase’s tight integration with the Apache Hadoop ecosystem. Monster runs Apache Hive on a separate Amazon EMR cluster to query their HBase table with SQL, which is useful for additional analytics and exporting data from Apache HBase to Amazon Redshift.