Tips for Migrating to Apache HBase on Amazon S3 from HDFS

Starting with Amazon EMR 5.2.0, you have the option to run Apache HBase on Amazon S3. Running HBase on S3 gives you several added benefits, including lower costs, data durability, and easier scalability.

HBase provides several options that you can use to migrate and back up HBase tables. The steps to migrate to HBase on S3 are similar to the steps for HBase on the Apache Hadoop Distributed File System (HDFS). However, the migration can be easier if you are aware of some minor differences and a few “gotchas.”

In this post, I describe how to use some of the common HBase migration options to get started with HBase on S3.

HBase migration options

Selecting the right migration method and tools is an important step in ensuring a successful HBase table migration. However, choosing the right ones is not always an easy task.

The following HBase utilities can help you migrate to HBase on S3:

Snapshots
Export and Import
CopyTable

The following diagram summarizes the steps for each option.

Various factors determine the HBase migration method that you use. For example, EMR offers HBase version 1.2.3 as the earliest version that you can run on S3. Therefore, the HBase version that you’re migrating from can be an important factor in helping you decide. For more information about HBase versions and compatibility, see the HBase version number and compatibility documentation in the Apache HBase Reference Guide.

If you’re migrating from an older version of HBase (for example, HBase 0.94), you should test your application to make sure it’s compatible with newer HBase API versions. You don’t want to spend several hours migrating a large table only to find out that your application and API have issues with a different HBase version.

The good news is that HBase provides utilities that you can use to migrate only part of a table. This lets you test your existing HBase applications without having to fully migrate entire HBase tables. For example, you can use the Export, Import, or CopyTable utilities to migrate a small part of your table to HBase on S3. After you confirm that your application works with newer HBase versions, you can proceed with migrating the entire table using HBase snapshots.

Option 1: Migrate to HBase on S3 using snapshots

You can create table backups easily by using HBase snapshots. HBase also provides the ExportSnapshot utility, which lets you export snapshots to a different location, like S3. In this section, I discuss how you can combine snapshots with ExportSnapshot to migrate tables to HBase on S3.

For details about how you can use HBase snapshots to perform table backups, see Using HBase Snapshots in the Amazon EMR Release Guide and HBase Snapshots in the Apache HBase Reference Guide. These resources provide additional settings and configurations that you can use with snapshots and ExportSnapshot.
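
As an illustration of those additional settings, here is a minimal sketch of a helper that prints an ExportSnapshot command with two optional tuning flags, -mappers and -bandwidth. The function name build_export_snapshot_cmd and its argument order are my own assumptions for this example. The flags are standard ExportSnapshot options, but older HBase releases might not support -bandwidth, so check the utility's usage output on your source cluster first.

```shell
# Hypothetical helper (not part of HBase): prints an ExportSnapshot command
# so that you can review it before running it on the source cluster.
build_export_snapshot_cmd() {
  snapshot_name=$1   # existing snapshot on the source cluster
  root_dir=$2        # S3 HBase root directory of the destination cluster
  mappers=${3:-}     # optional: number of map tasks used for the copy
  bandwidth=${4:-}   # optional: bandwidth limit in MB/second

  cmd="hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $snapshot_name -copy-to $root_dir"
  if [ -n "$mappers" ]; then
    cmd="$cmd -mappers $mappers"
  fi
  if [ -n "$bandwidth" ]; then
    cmd="$cmd -bandwidth $bandwidth"
  fi
  printf '%s\n' "$cmd"
}
```

Printing the command instead of running it keeps the sketch safe to try anywhere. For example, calling build_export_snapshot_cmd with a snapshot name, the S3 root directory, and 16 prints the export command with -mappers 16 appended.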

The following example shows how to use snapshots to migrate HBase tables to HBase on S3.

Note: Earlier HBase versions, like HBase 0.94, have a different snapshot structure than HBase 1.x, which is what you’re migrating to. If you’re migrating from HBase 0.94 using snapshots, you get a TableInfoMissingException error when you try to restore the table. For details about migrating from HBase 0.94 using snapshots, see the Migrating from HBase 0.94 section.

From the source HBase cluster, create a snapshot of your table:

$ echo "snapshot '<table_name>', '<snapshot_name>'" | hbase shell

Export the snapshot to an S3 bucket:
```
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot_name> -copy-to s3://<HBase_on_S3_root_dir>/
```
For the -copy-to parameter in the ExportSnapshot utility, specify the S3 location that you are using for the HBase root directory of your EMR cluster. If your cluster is already up and running, you can find its S3 hbase.rootdir value by viewing the cluster’s Configurations in the EMR console, or by using the AWS CLI. Here’s the command to find that value:
```
$ aws emr describe-cluster --cluster-id <cluster_id> | grep hbase.rootdir
```
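The grep filter prints the whole matching line of JSON. If you want only the S3 path (for example, to store it in a variable), a small sed filter is one way to do it. The following is a sketch: extract_hbase_rootdir is a hypothetical helper name, and it assumes that the key and its value appear on the same line of the describe-cluster output.

```shell
# Hypothetical helper: reads describe-cluster JSON on stdin and prints only
# the value of hbase.rootdir (first match). Assumes key and value share a line.
extract_hbase_rootdir() {
  sed -n 's/.*"hbase\.rootdir"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

To use it, pipe the output of the aws emr describe-cluster command into extract_hbase_rootdir instead of grep.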
Launch an EMR cluster that uses the S3 storage option with HBase (skip this step if you already have one up and running). For detailed steps, see Creating a Cluster with HBase Using the Console in the Amazon EMR Release Guide. When launching the cluster, ensure that the HBase root directory is set to the same S3 location as your exported snapshots (that is, the location used in the -copy-to parameter in the previous step).
Restore or clone the HBase table from that snapshot.
- To restore the table and keep the same table name as the source table, use restore_snapshot:
```
$ echo "restore_snapshot '<snapshot_name>'" | hbase shell
```
- To restore the snapshot into a table with a different name, use clone_snapshot:
```
$ echo "clone_snapshot '<snapshot_name>', '<table_name>'" | hbase shell
```
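If you script this step, the choice between the two commands can be reduced to whether a target table name is supplied. Here is a minimal sketch of that idea. The helper name snapshot_restore_statement is an assumption for illustration, and it only prints the HBase shell statement.

```shell
# Hypothetical helper: prints the HBase shell statement for restoring a
# snapshot. With one argument it restores under the original table name;
# with a second argument it clones the snapshot into that table name.
snapshot_restore_statement() {
  snapshot_name=$1
  target_table=${2:-}
  if [ -n "$target_table" ]; then
    printf "clone_snapshot '%s', '%s'\n" "$snapshot_name" "$target_table"
  else
    printf "restore_snapshot '%s'\n" "$snapshot_name"
  fi
}
```

On the destination cluster, you can pipe the printed statement into hbase shell, in the same way as the echo commands above.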

Migrating from HBase 0.94 using snapshots

If you’re migrating from HBase version 0.94 using the snapshot method, you get an error if you try to restore from the snapshot. This is because the structure of a snapshot in HBase 0.94 is different from the snapshot structure in HBase 1.x.

The following steps show how to fix an HBase 0.94 snapshot so that it can be restored to an HBase on S3 table.

Complete steps 1 through 3 in the previous example to create and export a snapshot.

From your destination cluster, follow these steps to repair the snapshot:

Copy the snapshot data (archive) directory into a new directory. The archive directory contains your snapshot data and, depending on your table size, might be large, so use s3-dist-cp to make this step faster:
```
$ s3-dist-cp --src s3://<HBase_on_S3_root_dir>/.archive/<table_name> --dest s3://<HBase_on_S3_root_dir>/archive/data/default/<table_name>
```

Create and fix the snapshot descriptor file:

$ hdfs dfs -mkdir s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc

$ hdfs dfs -mv s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tableinfo.<*> s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc

Restore the snapshot:

$ echo "restore_snapshot '<snapshot_name>'" | hbase shell
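
Because the repair involves several long S3 paths, it can help to generate the commands from a few variables and review them before running anything. The following is a dry-run sketch under my own assumptions: print_094_snapshot_repair is a hypothetical helper, and its fourth argument is the exact .tableinfo file name that you find by listing the snapshot directory.

```shell
# Hypothetical dry-run helper: prints the repair commands from the steps
# above, one per line, without running them.
print_094_snapshot_repair() {
  root_dir=${1%/}      # S3 HBase root directory (a trailing slash is removed)
  table_name=$2        # table that the snapshot was taken from
  snapshot_name=$3     # name of the exported snapshot
  tableinfo_file=$4    # exact .tableinfo file name in the snapshot directory
  snap_dir="$root_dir/.hbase-snapshot/$snapshot_name"

  # Copy the 0.94-style archive into the layout used by HBase 1.x.
  printf '%s\n' "s3-dist-cp --src $root_dir/.archive/$table_name --dest $root_dir/archive/data/default/$table_name"
  # Create the descriptor directory and move the table info file into it.
  printf '%s\n' "hdfs dfs -mkdir $snap_dir/.tabledesc"
  printf '%s\n' "hdfs dfs -mv $snap_dir/$tableinfo_file $snap_dir/.tabledesc"
  # Restore the repaired snapshot.
  printf '%s\n' "echo \"restore_snapshot '$snapshot_name'\" | hbase shell"
}
```

After checking the printed commands against your bucket layout, you can run them one at a time on the destination cluster.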

Option 2: Migrate to HBase on S3 using Export and Import

As I discussed in the earlier sections, HBase snapshots and ExportSnapshot are great options for migrating tables. But sometimes you want to migrate only part of a table, so you need a different tool. In this section, I describe how to use the HBase Export and Import utilities.

The steps to migrate a table to HBase on S3 using Export and Import are not much different from the steps provided in the HBase documentation. Those docs also provide detailed information about the utilities, including how you can use them to migrate part of a table.

The following steps show how you can use Export and Import to migrate a table to HBase on S3.

From your source cluster, export the HBase table:

$ hbase org.apache.hadoop.hbase.mapreduce.Export <table_name> s3://<table_s3_backup>/<location>/

In the destination cluster, create the target table into which to import data. Ensure that the column families in the target table are identical to the exported/source table’s column families.

From the destination cluster, import the table using the Import utility:

$ hbase org.apache.hadoop.hbase.mapreduce.Import '<table_name>' s3://<table_s3_backup>/<location>/

HBase snapshots are usually the recommended method to migrate HBase tables. However, the Export and Import utilities can be useful for test use cases in which you migrate only a small part of your table and test your application. They’re also handy if you’re migrating from an HBase cluster that does not have the HBase snapshots feature.
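
One way to export only part of a table is to pass the optional versions, start time, and end time arguments that Export accepts after the output directory. The time range is given in epoch milliseconds and applies to cell timestamps. The following sketch builds such a command for a recent window of data. The helper names days_ago_ms and build_partial_export_cmd are assumptions for this example, not HBase tools.

```shell
# Hypothetical helper: epoch milliseconds for N days before a reference time.
# The reference time is in epoch seconds and defaults to the current time.
days_ago_ms() {
  days=$1
  now_s=${2:-$(date +%s)}
  echo $(( (now_s - days * 86400) * 1000 ))
}

# Hypothetical helper: prints an Export command limited to a time range.
build_partial_export_cmd() {
  table_name=$1   # source table
  output_dir=$2   # S3 location for the exported files
  versions=$3     # number of cell versions to export
  start_ms=$4     # start of the time range, epoch milliseconds
  end_ms=$5       # end of the time range, epoch milliseconds
  printf '%s\n' "hbase org.apache.hadoop.hbase.mapreduce.Export $table_name $output_dir $versions $start_ms $end_ms"
}
```

For a quick compatibility test, you might export one version of each cell written in the last day, import it into a test table on the destination cluster, and run your application against that table.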

Option 3: Migrate to HBase on S3 using CopyTable

Similar to the Export and Import utilities, CopyTable is an HBase utility that you can use to copy part of an HBase table. However, keep in mind that CopyTable doesn’t work if you’re copying or migrating tables between HBase versions that are not wire compatible (for example, copying from HBase 0.94 to HBase 1.x).

For more information and examples, see CopyTable in the HBase documentation.
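
To give a rough idea of what a partial copy looks like, here is a sketch that prints a CopyTable command restricted to a row range. The helper name build_copytable_cmd is hypothetical. The --peer.adr, --startrow, and --stoprow options are CopyTable options, but verify that your source HBase version supports them. The sketch also assumes that the target table already exists on the destination cluster with the same column families.

```shell
# Hypothetical helper: prints a CopyTable command that copies an optional
# row range to the cluster identified by its ZooKeeper address.
build_copytable_cmd() {
  table_name=$1     # table to copy
  peer_adr=$2       # destination as <zk_quorum>:<zk_port>:<znode_parent>
  start_row=${3:-}  # optional: first row key to copy
  stop_row=${4:-}   # optional: row key at which to stop

  cmd="hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=$peer_adr"
  if [ -n "$start_row" ]; then
    cmd="$cmd --startrow=$start_row"
  fi
  if [ -n "$stop_row" ]; then
    cmd="$cmd --stoprow=$stop_row"
  fi
  printf '%s\n' "$cmd $table_name"
}
```

You would run the printed command on the source cluster, which needs network access to the ZooKeeper quorum of the destination cluster.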

Conclusion

In this post, I demonstrated how you can use common HBase backup utilities to migrate your tables easily to HBase on S3. By using HBase snapshots, you can migrate entire tables to HBase on S3. To test HBase on S3 by migrating or copying only part of your tables, you can use the HBase Export, Import, or CopyTable utilities.

If you have questions or suggestions, please comment below.

About the Author

Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.

AWS Big Data Blog