Best Practices from Rackspace for Modernizing a Legacy HBase/Solr Architecture Using AWS Services

By Daniel Quach, Professional Services Data Practice Manager – Rackspace
By Anil Kumar, Professional Services Big Data Engineer – Rackspace
By Angel Conde Manjon, Sr. Partner Solutions Architect – AWS

Rackspace

As technology advances and business requirements change, organizations may find themselves needing to migrate away from legacy data processing systems like HBase, Solr, and HBase Indexer.

While these systems may have served their purpose in the past, they may no longer be sufficient for the needs of modern data platforms because of the challenges of scalability and maintenance.

To help modernize legacy systems, organizations can turn to newer technologies like Amazon Kinesis, AWS Lambda, and Amazon OpenSearch Service. These services offer improved security and scalability, along with regular maintenance updates that you do not have to plan and apply yourself.

In this post, we’ll explore the advantages of migrating from HBase, Solr, and HBase Indexer to a modern data ecosystem based on Amazon Web Services (AWS). We’ll also discuss architecture, design, and a pathway for implementation.

HBase, once a foundational component of early data infrastructures, now grapples with issues such as challenging operational maintenance, budgetary constraints, and a decreasing number of professionals well-versed in its specifics. The imperative of business continuity, combined with the advantages of cloud-based infrastructure, underscores the need for organizations to transition to more agile platforms like Amazon Web Services (AWS).

This post offers insights and guidance for those looking to embark on this intricate migration journey. Rackspace is an AWS Premier Tier Services Partner and Managed Service Provider (MSP) that combines specialized expertise with advanced operational tooling to help businesses realize the power of the AWS Cloud.

Legacy State

Legacy data platforms often have a workflow such as this:

Figure 1 – Architecture workflow of HBase and Solr.

1. Extract, transform, load (ETL) processes are defined in the Hadoop platform.
2. ETL processes perform a create, update, or delete on HBase.
3. HBase Indexer captures all of the changes.
4. Changes are sent to a Solr index using HBase Indexer.

Since HBase is a NoSQL store, data is replicated to Solr to make it easier to query.

Pathways Forward

If you want to move your data off a legacy on-premises system, there are three options you can follow:

#1: Migration to Amazon OpenSearch Service (Recommended)

In this approach, major legacy components are modernized:

Upgrade Apache HBase to version 2.x on top of Amazon EMR 6.x.
HBase Indexer is replaced by:
- Streaming HBase endpoint on Amazon EMR.
- AWS Lambda and Amazon Kinesis push changes to Amazon OpenSearch Service.
Solr 5.5 is migrated to Amazon OpenSearch Service.
Any applications connecting to Solr are refactored to be compatible with OpenSearch.

#2: Migration to Elasticsearch (Not Recommended)

This approach is similar to Option 1, but the challenge on AWS is that Elasticsearch is frozen on version 6.8/7.10, so eventually you must migrate to Amazon OpenSearch Service anyway.

#3: Lift and Shift Solr 5.x (Not Recommended)

In this approach, Solr is lifted and shifted. However, there are several issues with this approach:

Lifting and shifting a legacy platform (like Solr 5.5) exposes you to unpatched security vulnerabilities.
Legacy HBase Indexers support only up to HBase version 1.1.12, while Amazon EMR 6.x uses HBase 2.x.

Given these three options, we are going to focus on Option 1 in this post. It gives you the most modern data stack, one that will be supported with future maintenance updates and that eliminates technical debt.

High-Level Steps

When pursuing a migration like this, great care should be taken to address business continuity and to identify all dependent interfaces. The project plan includes:

Provision new architecture in an AWS account.
Handle the replication workflow previously handled by HBase Indexer.
Perform a full load migration from Solr to OpenSearch.
Refactor any old applications connecting to legacy Solr.

Replication High-Level Design

Before reading the rest of the post, please reference the AWS blog post about streaming Apache HBase edits for real-time analytics.

In this architecture, we are building a system to ensure changes performed against HBase (Step 3 in Figure 2 below) are replicated to an Amazon OpenSearch Service domain (Step 6). To define a bit of nomenclature, Slowly Changing Dimension (SCD) Type 1 refers to current-state data: each change overwrites the previous value and no history is kept. From a data integrity standpoint, whatever can be queried in HBase should always exist in OpenSearch.

Figure 2 – SCD Type 1 replication workflow.

1. Amazon EMR 6.x is provisioned with HBase 2.x.
2. HBase changes are committed to the data store.
3. A special replication endpoint deployed on EMR listens for inserts, updates, and deletes. This is also known as change data capture (CDC).
4. These changes are streamed to an Amazon Kinesis endpoint.
5. An AWS Lambda function runs to apply any applicable transformations.
6. Changes are applied to OpenSearch.

Full Load Design

As part of the deployment, a one-time full load of the existing data needs to be run.

Figure 3 – Full load design.

Depending on your source system, scripts will have to be created to perform the full load. The following approaches can serve as references:

Solr > Amazon Kinesis Data Firehose > OpenSearch
- GitHub migration example for Elasticsearch 5.6 to OpenSearch 1.2
Solr > OpenSearch
- Use the Elasticsearch and requests Python libraries.
- Create an instance of the Elasticsearch object pointing at the OpenSearch cluster.
- Create a custom class to iterate through the Solr collection via the HTTPS interface.
- Iterate through one batch at a time and then write it to OpenSearch.
- Compare record counts for data quality checking.
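
The Solr > OpenSearch steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the migration script itself: it pages through Solr with cursorMark and builds request bodies for the OpenSearch _bulk API, it assumes the Solr uniqueKey field is named id, and it drops Solr's internal _version_ field. To keep the sketch dependency-free, it uses the standard library and takes the HTTP calls as injected callables (fetch and write_bulk are hypothetical names); in practice those would be backed by the requests and Elasticsearch libraries mentioned above.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List


def fetch_solr_page(base_url: str, collection: str, cursor: str, rows: int) -> Dict:
    """Fetch one page from Solr's /select handler using cursorMark paging."""
    params = urllib.parse.urlencode({
        "q": "*:*",
        "sort": "id asc",  # cursorMark requires a sort that includes the uniqueKey
        "rows": rows,
        "cursorMark": cursor,
        "wt": "json",
    })
    with urllib.request.urlopen(f"{base_url}/{collection}/select?{params}") as resp:
        return json.load(resp)


def iter_solr_batches(fetch: Callable[[str], Dict]) -> Iterator[List[Dict]]:
    """Yield one batch of documents per page until the cursor stops advancing."""
    cursor = "*"
    while True:
        page = fetch(cursor)
        docs = page["response"]["docs"]
        if docs:
            yield docs
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # Solr signals the end by repeating the cursor
            return
        cursor = next_cursor


def to_bulk_body(docs: List[Dict], index: str, id_field: str = "id") -> str:
    """Build the newline-delimited body expected by the _bulk API."""
    lines = []
    for doc in docs:
        source = {k: v for k, v in doc.items() if k != "_version_"}
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


def full_load(fetch: Callable[[str], Dict],
              write_bulk: Callable[[str], None],
              index: str) -> int:
    """Copy every Solr document to OpenSearch; return the number copied."""
    copied = 0
    for batch in iter_solr_batches(fetch):
        write_bulk(to_bulk_body(batch, index))
        copied += len(batch)
    return copied
```

For the data quality check, the count returned by full_load can be compared with numFound in the Solr response and with the document count that OpenSearch reports for the target index.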

Replication Detailed Design

The following workflow describes what it takes to replace the HBase Indexer.

Figure 4 – Full workflow from upstream application to OpenSearch.

1. Upstream applications perform an insert, update, or delete on HBase inside Amazon EMR.
2. EMR is bootstrapped with a special HBase streaming endpoint JAR, which listens for the changes and pushes them to Kinesis. See GitHub for deployment details.
- The class StreamingReplicationEndpoint listens to HBase WALEdits (write-ahead log edits).
- WALEdit (CDC) changes are replicated.
3. Changes are passed to a Kinesis data sink class implementation (KinesisDataSinkImpl), which pushes the new changes to the Kinesis stream.
4. A Kinesis stream is provisioned with the applicable number of shards:
- Firehose is connected to the analytics stream for backup.
- A raw landing zone records all the changes in case any data needs to be reprocessed.
5. Lambda is attached to the Kinesis stream:
- The payload received by Lambda contains the WALEdit information, with byte values (such as row, family, qualifier, and value) Base64-encoded, as in this example:

{"key":{"writeTime":1682464672527,"sequenceId":32151290,"tablename":"SALES_SLIP_HISTORY","nonce":0,"nonceGroup":0,"origLogSeqNum":0,"encodedRegionName":"ODZkMGU0NTY3ZWVjMzRjYWUyMGYzOGYxOThlMmViNDM=","writeEntry":null},"edit":{"cells":[{"qualifier":"VEVTVDE=","value":"WFl6MQ==","type":"Put","family":"SQ==","timeStamp":1682464672527,"row":"MDAwMDAwMDkzNTRCNTdDMjgwODkzMTNFNDY2QUVEM0N8Mjg1MzV8MTgzNXwzNTI5NTQ0MjlfMDAwMDAwMDAwS2F5Xzc1"}],"families":["SQ=="],"replay":false,"metafamily":"TUVUQUZBTUlMWQ=="}}

  - Each cell in the WALEdit has the keys qualifier, value, type, family, timeStamp, and row.
  - The only keys we need to examine are:
    - Type, which contains one of the values put, delete, deleteAll, deleteColumn, or deleteFamily

"type":"Put"

    - Row, which is the Base64-encoded HBase row key

"row":"MDAwMDAwMDkzNTRCNTdDMjgwODkzMTNFNDY2QUVEM0N8Mjg1MzV8MTgzNXwzNTI5NTQ0MjlfMDAwMDAwMDAwS2F5Xzc1"

- If Type = “Put” (an insert or update):
  - Look up the HBase document by row key.
    - See GitHub for an example of retrieving a record.
  - With the OpenSearch API, update the entire record (or only selected fields, depending on your business case).
- If Type = “Delete”:
  - Look up the HBase document by row key.
  - Delete the document via the OpenSearch API.
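
The Lambda logic described above might look roughly like the following sketch. It is an illustration under stated assumptions rather than the production function: it assumes row keys are UTF-8 text, that the row key doubles as the OpenSearch document ID (so a delete needs no HBase lookup), and that Kinesis record aggregation is disabled. The callables get_hbase_row, upsert_doc, and delete_doc are hypothetical placeholders for your own HBase and OpenSearch client code. Only Put and Delete are handled; the other delete variants are returned as "other" and left for you to map to your own rules.

```python
import base64
import json
from typing import Callable, Dict, List, Tuple


def decode_wal_edit(kinesis_data: str) -> Dict:
    """Kinesis hands Lambda a Base64-encoded body; inside it is the WALEdit JSON."""
    return json.loads(base64.b64decode(kinesis_data))


def plan_actions(wal_edit: Dict) -> List[Tuple[str, str]]:
    """Return de-duplicated (action, row_key) pairs for the cells of one WALEdit."""
    actions: List[Tuple[str, str]] = []
    for cell in wal_edit["edit"]["cells"]:
        # Assumption: row keys are UTF-8 text once Base64-decoded.
        row_key = base64.b64decode(cell["row"]).decode("utf-8")
        cell_type = cell["type"].lower()
        if cell_type == "put":
            action = "upsert"
        elif cell_type == "delete":
            action = "delete"
        else:
            action = "other"  # deleteColumn, deleteFamily, ...: not covered here
        if (action, row_key) not in actions:
            actions.append((action, row_key))
    return actions


def make_handler(get_hbase_row: Callable[[str], Dict],
                 upsert_doc: Callable[[str, Dict], None],
                 delete_doc: Callable[[str], None]):
    """Build a Lambda handler around injected (hypothetical) client functions."""
    def handler(event, context):
        for record in event["Records"]:
            wal_edit = decode_wal_edit(record["kinesis"]["data"])
            for action, row_key in plan_actions(wal_edit):
                if action == "upsert":
                    # Re-read the current row so OpenSearch mirrors HBase state.
                    upsert_doc(row_key, get_hbase_row(row_key))
                elif action == "delete":
                    delete_doc(row_key)
        return {"processed": len(event["Records"])}
    return handler
```

Re-reading the row from HBase on every Put, instead of applying individual cells, keeps the OpenSearch document equal to the current HBase state, which matches the SCD Type 1 goal at the cost of one extra read per change.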

Set Up Amazon EMR

When Amazon EMR is provisioned, the following steps need to be taken:

Update the HBASE_CLASSPATH to reference this JAR:

"HBASE_CLASSPATH": "${HBASE_CLASSPATH}${HBASE_CLASSPATH:+:}$(ls -1 /usr/lib/phoenix/phoenix-server*.jar):/usr/lib/hbase-extra/kinesis-sink-alpha-0.1.jar"

Set up the HBase configuration files:
- Set up hbase-site.xml

"hbase.replication.kinesis.aggregation-enabled": "false",
"hbase.replication.bulkload.enabled": "false",
"hbase.replication.cluster.id": "hbase1",
"hbase.replication.sink-factory-class": "com.amazonaws.hbase.datasink.KinesisDataSinkImpl",
"hbase.replication.kinesis.stream-table-map": "tablename1:kinesis-stream1",
"hbase.replication.compression-enabled": "false"

Run the following command at the HBase shell prompt to add the replication peer to HBase:
- add_peer 'Kinesis_endpoint', ENDPOINT_CLASSNAME => 'com.amazonaws.hbase.StreamingReplicationEndpoint'
Run the following command at the HBase shell prompt to enable table replication:
- enable_table_replication "YOUR_BASE_TABLE"

During final deployment, these steps can be bootstrapped in Amazon EMR for further automation.
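
As one possible way to automate the classpath and hbase-site.xml settings above, they could be supplied as an EMR configurations list when the cluster is created. The sketch below builds that list in Python. Placing HBASE_CLASSPATH under the hbase-env classification with a nested export block is an assumption based on how EMR handles environment variables for other applications, and the function name build_emr_configurations is hypothetical. The add_peer and enable_table_replication commands are not covered here and would still need to be run, for example from an EMR step.

```python
import json

# JAR path taken from the HBASE_CLASSPATH setting shown above.
KINESIS_SINK_JAR = "/usr/lib/hbase-extra/kinesis-sink-alpha-0.1.jar"


def build_emr_configurations(table: str, stream: str) -> list:
    """Return an EMR configurations list for one HBase table and one Kinesis stream."""
    classpath = (
        "${HBASE_CLASSPATH}${HBASE_CLASSPATH:+:}"
        "$(ls -1 /usr/lib/phoenix/phoenix-server*.jar):" + KINESIS_SINK_JAR
    )
    return [
        {
            # Assumption: environment variables go in a nested "export" block.
            "Classification": "hbase-env",
            "Properties": {},
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"HBASE_CLASSPATH": classpath},
                }
            ],
        },
        {
            "Classification": "hbase-site",
            "Properties": {
                "hbase.replication.kinesis.aggregation-enabled": "false",
                "hbase.replication.bulkload.enabled": "false",
                "hbase.replication.cluster.id": "hbase1",
                "hbase.replication.sink-factory-class":
                    "com.amazonaws.hbase.datasink.KinesisDataSinkImpl",
                "hbase.replication.kinesis.stream-table-map": f"{table}:{stream}",
                "hbase.replication.compression-enabled": "false",
            },
        },
    ]


if __name__ == "__main__":
    # Print JSON that could be saved to a file and passed at cluster creation.
    print(json.dumps(build_emr_configurations("tablename1", "kinesis-stream1"), indent=2))
```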

Refactoring Legacy Code

Any legacy code connecting to Solr can be modified using this repository as a reference.
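
To give a feel for the kind of change involved, the sketch below translates a few common Solr /select parameters into an OpenSearch _search request body. It is an illustration only and is not taken from the referenced repository. The function name solr_params_to_opensearch is hypothetical, and the sketch assumes the Solr queries use plain Lucene syntax, which the OpenSearch query_string query can parse. Solr-specific features such as local params or function queries would need separate handling.

```python
from typing import Dict


def solr_params_to_opensearch(params: Dict) -> Dict:
    """Map Solr q, fq, start, rows, and fl onto an OpenSearch _search body."""
    body: Dict = {
        "query": {
            "bool": {
                # q is scored, so it goes in "must".
                "must": [{"query_string": {"query": params.get("q", "*:*")}}]
            }
        },
        "from": int(params.get("start", 0)),   # Solr default start is 0
        "size": int(params.get("rows", 10)),   # Solr default rows is 10
    }

    # fq does not affect scoring in Solr, so it maps to a non-scoring filter.
    fq = params.get("fq", [])
    if isinstance(fq, str):
        fq = [fq]
    if fq:
        body["query"]["bool"]["filter"] = [{"query_string": {"query": f}} for f in fq]

    # fl (field list) maps to source filtering; only comma-separated lists are handled.
    if "fl" in params:
        body["_source"] = [name.strip() for name in params["fl"].split(",")]

    return body
```

For example, a Solr request with q=status:open, fq=region:EU, rows=25, and start=50 becomes a bool query with one must clause, one filter clause, size 25, and from 50.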

Summary

The migration from legacy systems such as HBase, Solr, and HBase Indexer to a modern data platform offers myriad advantages, primarily in security, scalability, and ongoing maintenance. Making this transition is vital for organizational growth and business continuity.

Yet, migrating and modernizing your data doesn’t have to be a solo journey. As a pioneer in the realm of data solutions, Rackspace Technology is poised to be your trusted partner through this transition. Its broad portfolio of data solutions is designed to facilitate your transition and ensure you’re equipped to handle modern data architectures.

Whether you’re looking to migrate and modernize your extract, transform, load (ETL) processes, adopt AI for smarter decision making, or modernize your databases, Rackspace’s diverse suite of offerings has you covered. Discover more by exploring Rackspace data solutions, which can help guide your transformation journey.

Rackspace – AWS Partner Spotlight

Rackspace is an AWS Premier Tier Services Partner and MSP that combines specialized expertise with advanced operational tooling to help businesses realize the power of the AWS Cloud.

AWS Partner Network (APN) Blog