What is Apache HBase?
What is HBase?
Apache HBase is an open-source, NoSQL, distributed big data store. It enables random, strictly consistent, real-time access to petabytes of data. HBase is very effective for handling large, sparse datasets.
HBase integrates seamlessly with Apache Hadoop and the Hadoop ecosystem. It runs on top of the Hadoop Distributed File System (HDFS), or on Amazon S3 through the EMR File System (EMRFS) provided by Amazon Elastic MapReduce (EMR). HBase serves as a direct input and output for the Hadoop MapReduce framework, and works with Apache Phoenix to enable SQL-like queries over HBase tables.
How does HBase work?
HBase is a column-oriented, non-relational database. Data is stored by column, with columns grouped into column families, and every row is indexed by a unique row key. This architecture allows rapid retrieval of individual rows and columns, and efficient scans over individual columns within a table. Both data and requests are distributed across all servers in an HBase cluster, so individual reads and writes can be served within milliseconds even when a table holds petabytes of data. HBase is most effective for storing non-relational data that is accessed through the HBase API. Apache Phoenix is commonly used as a SQL layer on top of HBase, allowing you to use familiar SQL syntax to insert, delete, and query data stored in HBase.
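To make the data model more concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the HBase API: the `ToyTable` class, its method names, and the sample row keys are all hypothetical. It assumes rows are kept sorted by row key and that each cell is addressed by a column family and a qualifier, which is enough to show point lookups by row key and range scans restricted to one family.

```python
from bisect import bisect_left, insort


class ToyTable:
    """In-memory illustration of a wide-column table (hypothetical, not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)   # column families are declared up front
        self.row_keys = []              # row keys kept in sorted order
        self.rows = {}                  # row key -> {(family, qualifier): value}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self.rows:
            insort(self.row_keys, row_key)
            self.rows[row_key] = {}
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key, family=None):
        """Point lookup by row key, optionally restricted to one family."""
        cells = self.rows.get(row_key, {})
        return {f"{f}:{q}": v for (f, q), v in cells.items()
                if family is None or f == family}

    def scan(self, start, stop, family=None):
        """Yield (row_key, cells) for start <= row_key < stop, in key order."""
        i = bisect_left(self.row_keys, start)
        while i < len(self.row_keys) and self.row_keys[i] < stop:
            key = self.row_keys[i]
            cells = self.get(key, family)
            if cells:                   # skip rows with no cells in this family
                yield key, cells
            i += 1


table = ToyTable(families=["profile", "stats"])
table.put("user#001", "profile", "name", "Ada")
table.put("user#001", "stats", "clicks", 7)
table.put("user#002", "stats", "clicks", 3)      # sparse row: no profile cells
table.put("user#010", "profile", "name", "Lin")

print(table.get("user#001"))
print(list(table.scan("user#001", "user#003", family="stats")))
```

Because row keys are sorted, the scan locates its start position with a binary search and then reads only the adjacent keys. Rows that have no cells in the requested family take up no space in the result, which is the property that makes this layout a good fit for sparse datasets.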
Benefits of HBase
HBase is designed to scale across thousands of servers and to manage access to petabytes of data. With the elasticity of Amazon EC2 and the scalability of Amazon S3, HBase can handle online access to massive datasets.
HBase provides low-latency random read and write access to petabytes of data by distributing requests from applications across a cluster of hosts. Each host has access to data in HDFS or S3, and serves read and write requests in milliseconds.
HBase splits data stored in tables across multiple hosts in the cluster and is built to withstand individual host failures. Because data is stored on HDFS or S3, healthy hosts are automatically chosen to take over the data previously served by the failed host, and that data is brought back online without manual intervention.
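The idea of splitting a table across hosts and recovering from a host failure can be sketched with a toy model. The Python below is an assumption-laden illustration, not HBase internals: the `ToyCluster` class, the split keys, the host names, and the round-robin assignment policy are all hypothetical. It assumes the key space is divided into contiguous row-key ranges, each served by one host, and that the data itself lives on shared storage, so recovery only needs to update the assignment.

```python
from bisect import bisect_right


class ToyCluster:
    """Toy model of key-range partitioning with failover (hypothetical, not HBase internals)."""

    def __init__(self, split_keys, hosts):
        # split_keys divide the key space into len(split_keys) + 1 ranges.
        self.split_keys = sorted(split_keys)
        self.live_hosts = list(hosts)
        n_ranges = len(self.split_keys) + 1
        # Assign key ranges to hosts round-robin (an arbitrary policy for the sketch).
        self.assignment = {r: self.live_hosts[r % len(self.live_hosts)]
                           for r in range(n_ranges)}

    def range_for(self, row_key):
        """Index of the key range that contains row_key (range starts are inclusive)."""
        return bisect_right(self.split_keys, row_key)

    def host_for(self, row_key):
        """Host that currently serves requests for row_key."""
        return self.assignment[self.range_for(row_key)]

    def fail_host(self, host):
        """Reassign the failed host's key ranges to the remaining live hosts."""
        self.live_hosts.remove(host)
        if not self.live_hosts:
            raise RuntimeError("no healthy hosts left")
        orphaned = [r for r, h in self.assignment.items() if h == host]
        for i, key_range in enumerate(orphaned):
            self.assignment[key_range] = self.live_hosts[i % len(self.live_hosts)]


cluster = ToyCluster(split_keys=["g", "n", "t"], hosts=["host-a", "host-b"])
print(cluster.host_for("apple"))    # served by host-a before the failure
print(cluster.host_for("kiwi"))     # served by host-b
cluster.fail_host("host-a")
print(cluster.host_for("apple"))    # now served by a healthy host
```

Note that `fail_host` moves no data. In this sketch, as in the description above, the stored data remains on shared storage (HDFS or S3), and only the mapping from key range to serving host changes.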
HBase Use Cases
FINRA, the Financial Industry Regulatory Authority, is the largest independent securities regulator in the United States, and monitors and regulates financial trading practices. FINRA uses Amazon EMR to run Apache HBase on Amazon S3 for random access on 3 trillion records (growing by billions per day) for an interactive application to search and display related market events. By decoupling their storage and compute, FINRA can store a single copy of their data in Amazon S3 and size their cluster for the compute capacity needed, rather than size their cluster for storing data in HDFS with 3x replication. This yields cost savings of over 60% per year, easy scaling of compute, and a reduction in the restoration time of a cluster in a new EC2 Availability Zone from days to less than 30 minutes.
Monster, a global leader in connecting people and jobs, utilizes Apache HBase on Amazon EMR to store clickstream and advertising campaign data for downstream analytics. This enables them to monitor how different customer segments are performing in a given campaign at the granularity of a single impression. Monster’s analytics team can easily scan through rows to aggregate the number of views and clicks per user to identify campaign activity. Additionally, they utilize Apache HBase’s tight integration with the Apache Hadoop ecosystem. Monster runs Apache Hive on a separate Amazon EMR cluster to query their HBase table with SQL, which is useful for additional analytics and exporting data from Apache HBase to Amazon Redshift.
HBase and Hadoop on AWS
Amazon EMR provides the easiest, fastest, and most cost-effective managed Hadoop framework, enabling customers to process vast amounts of data across dynamically scalable EC2 instances. Customers can also run other popular distributed frameworks such as Apache HBase, Hive, Spark, Presto, and Flink in EMR.