AWS Big Data Blog

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3

John Hitchingham is Director of Performance Engineering at FINRA

The Financial Industry Regulatory Authority (FINRA) is a private-sector regulator responsible for analyzing 99% of equities activity and 65% of options activity in the US. To look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust set of big data tools in the AWS Cloud to support these activities.

One particular application, which requires low-latency retrieval of items from a data set that contains trillions of records, enables FINRA analysts to investigate particular sets of related trade activity. FINRA’s new architecture for this application, Apache HBase on Amazon EMR using Amazon S3 for data storage, has resulted in cost savings of over 60%, drastically reduced the time for recovery or upgrades, and alleviated resource contention.

Original application architecture

Early in the 2 ½ year migration of FINRA’s Market Regulation Portfolio to the AWS Cloud, FINRA developed a system on AWS to replace an on-premises solution that allowed analysts to query this trade activity. This solution provided fast random access across trillions of trade records, which would quickly grow to over 700 TB of data.

FINRA selected Apache HBase, which is optimized for random access over large data sets, to store and serve this data. Our initial Apache HBase cluster used a commercial Hadoop distribution running on Amazon EC2 with data stored in on-cluster HDFS. To hold 700 TB of compressed data with 3x replication for HDFS, we required over 2 PB of storage on the cluster with 60 hs1.8xlarge instances. We updated our HBase table after the market close each day using the HBase bulk load API operation, which leverages Apache Hadoop MapReduce. This provided a simple, performant way to load billions of records each night.
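
The post doesn’t detail the exact job that generated the HFiles, but a minimal sketch of a MapReduce-backed bulk load, using the stock ImportTsv and LoadIncrementalHFiles tools with hypothetical table, column, and path names, looks roughly like this:

```python
import subprocess

# Sketch of a nightly bulk load; the table name, column mapping, and paths are
# hypothetical. Step 1: a MapReduce job (here, the stock ImportTsv tool) writes
# HFiles to a staging directory instead of issuing individual Puts.
subprocess.run([
    "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
    "-Dimporttsv.columns=HBASE_ROW_KEY,cf:trade_details",
    "-Dimporttsv.bulk.output=hdfs:///tmp/trade_hfiles",
    "trade_activity",                       # target HBase table
    "hdfs:///incoming/trades/2016-11-01/",  # that night's input files
], check=True)

# Step 2: hand the generated HFiles to the region servers, which adopt them
# directly without going through the normal write path.
subprocess.run([
    "hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
    "hdfs:///tmp/trade_hfiles", "trade_activity",
], check=True)
```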

FINRA’s analysts were thrilled with the performance: queries that took minutes or even hours on the old on-premises system now returned in sub-seconds to minutes with Apache HBase. However, there were several operational challenges with our new system:

  • Disaster recovery: Because of the data size (700+ TB), it would take us days to move this data and restore our cluster after a failure. The same would apply if we needed to rebuild the cluster in another Availability Zone because of problems in a single zone.
  • Resource contention: Sometimes the batch load processing started late due to upstream data availability and ran into the window when users were executing their queries. This impacted query performance, and balancing these workloads on the same cluster proved challenging.
  • Cluster maintenance: Upgrading Apache HBase and other components on the cluster was difficult. Creating new, parallel clusters sized for the data volume was cost prohibitive, and rolling updates were operationally risky and time-consuming because they required rebalancing over 2 PB of HDFS blocks.
  • Cost: Because we had to combine storage and compute by using on-cluster HDFS for our storage, we were paying for compute capacity that sat nearly idle most of the time; much of the cluster existed only to store our data.

Decoupling storage and compute for HBase

Elsewhere in FINRA’s portfolio of analytic applications in AWS, we increasingly used an architecture that separated storage from compute. FINRA stores data on Amazon S3 for low-cost, durable, scalable storage and uses Amazon EMR for scalable compute workloads with Hive, Presto, and Spark. With EMR and EMRFS, these engines can directly query data in S3 as if it were stored in HDFS. This has several advantages, including eliminating the need to load data into on-cluster HDFS, making it easy to checkpoint processing and preserve job state if Amazon EC2 Spot Instances are lost, and keeping our data durable and available across zones with no extra management.
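
As a minimal illustration of that pattern (the bucket and paths are made up for this sketch), a Spark job on EMR can read and query S3 data directly through EMRFS:

```python
from pyspark.sql import SparkSession

# Minimal sketch: on EMR, EMRFS lets Spark (and Hive/Presto) address S3 paths
# directly, so no copy into on-cluster HDFS is needed. The bucket and prefix
# below are hypothetical.
spark = SparkSession.builder.appName("query-trades-on-s3").getOrCreate()

trades = spark.read.parquet("s3://example-finra-data-lake/trades/date=2016-11-01/")
trades.createOrReplaceTempView("trades")

# Query the S3-backed data as if it were a local table.
spark.sql("SELECT symbol, COUNT(*) AS trade_count FROM trades GROUP BY symbol").show()
```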

We wondered if it would be possible to leverage EMR to run HBase with storage on S3 instead of in HDFS. With the new support for using S3 as a storage layer with HBase on EMR, we were excited to work with AWS on evaluating this new architecture for our cluster. With this new configuration, HBase uses S3 to store table data and metadata, and still uses a small footprint in HDFS to store the HBase write-ahead log. In addition to the HBase Region Server in-memory cache, EMR configures the HBase bucket cache to cache data on the local disks of each node, giving faster read performance than directly accessing S3 for each request.
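
As an illustration of this configuration (the bucket, release label, and instance counts below are placeholders, not FINRA’s actual values), an HBase-on-S3 cluster can be launched by setting the HBase storage mode to S3 and pointing the HBase root directory at a bucket, for example with boto3:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Illustrative sketch of launching an HBase-on-S3 cluster with EMR.
response = emr.run_job_flow(
    Name="hbase-on-s3",
    ReleaseLabel="emr-5.2.0",            # first release with S3 storage mode for HBase
    Applications=[{"Name": "HBase"}],
    Configurations=[
        {"Classification": "hbase",
         "Properties": {"hbase.emr.storageMode": "s3"}},
        {"Classification": "hbase-site",
         "Properties": {"hbase.rootdir": "s3://example-bucket/hbase-root/"}},
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.2xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m3.2xlarge", "InstanceCount": 10},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```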

[Architecture diagram: HBase on Amazon EMR with table data stored in Amazon S3]

It was simple to migrate from our old cluster to HBase on S3. We exported an HBase snapshot from our old cluster and stored it on S3, which included both the table metadata and actual data. Next, we quickly started an EMR cluster with HBase, and configured our snapshot directory on S3 as the HBase root directory. Because we were moving from an older version of HBase, we also upgraded the file format and performed a major compaction of the data after the HBase on S3 cluster was brought up for the first time. These latter steps were unique to our upgrade situation, and would not need to be performed if migrating from a newer version of HBase on an already-compacted data set.
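
A rough sketch of those migration steps, with hypothetical snapshot, bucket, and table names, might look like the following; the exact invocation will differ by HBase version:

```python
import subprocess

# Run on the old cluster: export an existing HBase snapshot (assumed to have been
# taken earlier with the hbase shell 'snapshot' command) to S3.
subprocess.run([
    "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
    "-snapshot", "trade_activity_snapshot",
    "-copy-to", "s3://example-bucket/hbase-root/",
    "-mappers", "48",
], check=True)

# Run on the new HBase-on-S3 cluster once its root directory points at that S3
# location: a major compaction rewrites the store files in the current HFile format.
subprocess.run(
    "echo \"major_compact 'trade_activity'\" | hbase shell",
    shell=True, check=True,
)
```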

Advantages to running HBase on S3

Because the data (HFiles) stays on S3, we were able to launch a new HBase cluster on EMR and be ready to accept queries in less than 30 minutes, without needing to transfer the dataset into HDFS. On our old cluster, this would have taken almost two days, due to the time needed to move 700 TB of data from S3 to HDFS before we could even run a query.

Also, with storage offloaded to S3, we can pick the EC2 instance types that match our compute requirements instead of being constrained to instance types with enough disk space for HDFS. Instead of 60 storage-optimized hs1.8xlarge nodes (or the newer d2.8xlarge nodes), we were able to save significant costs by switching to 100 m3.2xlarge nodes. In addition, we only need to store, and pay for, a single copy of the data in S3. When we ran our application’s query benchmark against the cluster, we saw no degradation in concurrency. Query response time for our workload was slightly slower but still acceptable, with most queries returning in less than 3 seconds.

In addition, HBase on S3 gives us increased resiliency. With the old configuration, losing 3 nodes in the cluster put us at risk of HDFS data loss and a multi-day restore. With the new solution, EMR automatically rebalances the HBase Region Servers to other nodes (and replaces the lost node), and our data stored in S3 isn’t impacted. We can now lose more than 3 nodes and continue processing without interruption. This lets us run our development environments, such as DEV and TEST, on Spot Instances for savings, and turn those environments off when we don’t need them. Also, because data in S3 is available across all zones in an AWS region, we can recreate our cluster in another zone in less than 30 minutes. This enables us to meet our cross-zone disaster recovery objective with a restoration time of minutes instead of days.

Another operational benefit is that the batch load processing is now isolated from the interactive query processing. In our new architecture, this process is run on a separate EMR cluster so there is no impact on query traffic if the bulk load continues into the daytime hours.

Lastly, we are able to scale the cluster down over the weekend when usage is very light.
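
For example (with placeholder cluster and instance group IDs), the weekend scale-down can be scripted against the EMR API:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Illustrative sketch: shrink a core/task instance group over the weekend and
# grow it back before Monday. The cluster and instance group IDs are placeholders.
emr.modify_instance_groups(
    ClusterId="j-EXAMPLECLUSTER",
    InstanceGroups=[
        {"InstanceGroupId": "ig-EXAMPLEGROUP", "InstanceCount": 40},  # weekday count: 100
    ],
)
```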

Additional considerations

After migrating to our new HBase on EMR cluster, we had to make some changes to our operations as a result of using EMRFS/S3 instead of HDFS.

Certain HBase operations assume filesystem-like behavior, such as fast hierarchical directory listings and ‘move’ as a cheap metadata update, and these operations are significantly slower on S3. For example, the HFile cleaner, which by default runs every 5 minutes, relies on these operations; with EMRFS and data on S3, it can take more than 5 minutes just to traverse all of the HFiles. We turn this process off and run it manually during a maintenance window. We also had to increase a few of the timeout settings for management operations related to dropping a table. However, these changes are minor given the benefits of our new architecture.
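
As an illustration only, the kinds of hbase-site overrides involved might look like the following; the exact property names and values depend on the HBase version and are assumptions rather than FINRA’s actual settings:

```python
# Hypothetical EMR configuration classification for hbase-site; property names and
# values are illustrative assumptions, not FINRA's configuration.
hbase_site_overrides = {
    "Classification": "hbase-site",
    "Properties": {
        # Run the cleaner chore far less often so it can instead be triggered manually
        # during a maintenance window (interval is in milliseconds).
        "hbase.master.cleaner.interval": "86400000",
        # Longer timeouts for slow administrative operations such as dropping a table.
        "hbase.rpc.timeout": "300000",
        "hbase.client.operation.timeout": "600000",
    },
}
```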

Conclusion

By migrating to HBase on EMR using S3 for storage, we have been able to lower our costs by 60%. Additionally, we have decreased operational complexity, increased durability and availability, and have created a more scalable architecture. Overall, HBase on S3 has been a great success for us.

If you have questions or suggestions, please comment below.