AWS Big Data Blog

Running Apache Accumulo on Amazon EMR

Manjeet Chayel is a Solutions Architect with Amazon Web Services

This post was co-authored by Matt Yanchyshyn, a Principal Solutions Architect with Amazon Web Services

Apache Accumulo is a sorted, distributed key-value store built on top of Apache Hadoop, ZooKeeper, and Thrift. Accumulo was originally modeled after Google’s Bigtable and can scale to trillions of records and hundreds of petabytes. Some features that make it stand out include:

  • User permissions and cell-based access control
  • Server-side iterators for additional data management capabilities
  • An extensible balancing algorithm
  • High-performance ingest
  • Data management features including table merge, efficient data deletion, table renaming, table cloning and fast, efficient table splitting

Like Amazon DynamoDB, Accumulo has granular access control. It lets you restrict access to individual cells (key/value pairs) using visibility labels and table access control lists (ACLs). Each cell carries a label that determines which users may see it, and users who lack access to a cell do not even know it exists.

For more information about Apache Accumulo and how to configure it, see http://accumulo.apache.org/. For more information about Accumulo features, see https://accumulo.apache.org/docs/2.x/getting-started/features.

Accumulo is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. In this article, we walk through how to provision an Accumulo cluster instance using Amazon EMR.

To install Accumulo on Amazon EMR, you can use Amazon EMR bootstrap actions. Bootstrap action scripts are stored on Amazon Simple Storage Service (Amazon S3) and allow you to install custom applications or libraries on Amazon EMR nodes. They can contain configuration settings and arguments related to Hadoop or Amazon EMR. Bootstrap actions run before Hadoop starts and before the node begins processing data. The bootstrap action we have provided for this post installs both Apache Accumulo and ZooKeeper on your Amazon EMR cluster. Accumulo uses ZooKeeper to keep your cluster running, coordinate settings between processes, and help finalize TabletServer failures.

Installing Accumulo

We will use the AWS Command Line Interface (CLI) to launch a small Amazon EMR cluster consisting of three m3.xlarge instances.  The CLI command references a bootstrap action script in a shared Amazon S3 bucket.  Before running the following command, replace <YOURKEY> with the name of your AWS key pair, not including the “.pem” suffix.

aws emr create-cluster --name Accumulo --no-auto-terminate \
  --bootstrap-actions Path=s3://elasticmapreduce.bootstrapactions/accumulo/1.6.1/install-accumulo_mj,Name=Install_Accumulo \
  --ami-version 3.3.1 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --ec2-attributes KeyName=<YOURKEY>

The above command should have an output similar to the following:

{
    "ClusterId": "j-12345abcde"
}
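
If you script the launch, the cluster ID can be read out of that JSON and reused in later commands. The following is a minimal Python sketch, assuming the CLI output has already been captured as a string. The function names and the key file name mykey.pem are hypothetical and exist only for illustration.

```python
import json


def extract_cluster_id(cli_output: str) -> str:
    """Return the ClusterId field from the create-cluster JSON output."""
    return json.loads(cli_output)["ClusterId"]


def ssh_command(cluster_id: str, key_file: str) -> list:
    """Build the argument list for the `aws emr ssh` command."""
    return ["aws", "emr", "ssh",
            "--cluster-id", cluster_id,
            "--key-pair-file", key_file]


# Sample output copied from the example above.
sample_output = '{\n    "ClusterId": "j-12345abcde"\n}'

cluster_id = extract_cluster_id(sample_output)
print(cluster_id)
# "mykey.pem" is a placeholder; use the path to your own key pair file.
print(" ".join(ssh_command(cluster_id, "mykey.pem")))
```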

Running a Basic Sample

  1. Wait for your Amazon EMR cluster to deploy and run the bootstrap scripts. It will transition to the “Waiting” state, meaning it’s ready to accept connections and jobs.
  2. Connect to the master node of your Amazon EMR cluster. In the command below, replace <YOURCLUSTER> with the ClusterId returned when you created the cluster and <YOURKEY> with the AWS key pair that you used to launch the cluster, this time including the “.pem” suffix:
    aws emr ssh --cluster-id <YOURCLUSTER> --key-pair-file <YOURKEY>
  3. Log into the Accumulo shell:
    ~/accumulo/bin/accumulo shell -u root -p secret
  4. Create a table called ‘hellotable’:
    root@instance> createtable hellotable
    root@instance> quit
  5. Launch a Java program that inserts data with a BatchWriter:
    ~/accumulo/bin/accumulo \
      org.apache.accumulo.examples.simple.helloworld.InsertWithBatchWriter \
      -i instance -z 127.0.0.1 -u root -p secret -t hellotable
  6. To view the entries, log into the Accumulo shell again and scan the table:
    ~/accumulo/bin/accumulo shell -u root -p secret
    root@instance> table hellotable
    root@instance> scan

Cell-Level Security Example

Data in Accumulo is represented as key-value pairs, but also includes additional elements such as visibility and timestamp.  In the example above, you may have noticed empty brackets [ ] in each row of data returned. For example:

row_1154 colfam:colqual_3 []    value_1154_3

The brackets [ ] hold the cell’s visibility label. Since none was set, they are empty for this row. If you had run the scan command with the -st option, you would also have seen a timestamp for each item.
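
To make the parts of an entry concrete, here is a small Python model of a key-value pair and of how entries are ordered. It is an illustrative sketch only, not the Accumulo API: the Entry class and its methods are hypothetical, and the rendered line only approximates the shell’s layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Illustrative model of one key-value pair (not the real Accumulo API)."""
    row: str
    colfam: str
    colqual: str
    visibility: str  # empty string means the cell has no label
    timestamp: int
    value: str

    def sort_key(self):
        # Keys sort ascending by row, family, qualifier, and visibility,
        # then by timestamp descending so the newest version comes first.
        return (self.row, self.colfam, self.colqual,
                self.visibility, -self.timestamp)

    def render(self, show_timestamp=False):
        # Approximates one line of scan output; the timestamp is included
        # only when requested, similar to scanning with -st.
        ts = f" {self.timestamp}" if show_timestamp else ""
        return (f"{self.row} {self.colfam}:{self.colqual} "
                f"[{self.visibility}]{ts}    {self.value}")


older = Entry("row_1154", "colfam", "colqual_3", "", 1, "value_1154_3")
newer = Entry("row_1154", "colfam", "colqual_3", "", 2, "value_1154_3b")

# The newer version of the same cell sorts ahead of the older one.
for entry in sorted([older, newer], key=Entry.sort_key):
    print(entry.render(show_timestamp=True))
```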

Let’s experiment with using Accumulo Security Labels to restrict data visibility:

  1. Log back into the Accumulo shell (see above).
  2. Create a new table called ‘hellosecurity’, insert some dummy data, and run a scan:
    root@instance> createtable hellosecurity
    root@instance> insert row1 cola colb value1
    root@instance> scan

    The scan should return the single row you just inserted, with empty brackets [ ] because it has no label.

  3. Now insert a new cell, this time with the label restricted, and run a scan:
    root@instance> insert row2 cola colb value2 -l restricted
    root@instance> scan

  4. Note that you still see only the first row returned. This is because the user (root in this case) has not been explicitly authorized to view cells with the restricted label. Let’s make that change and try another scan:
    root@instance> setauths -s restricted
    root@instance> scan

    Both rows should now be returned, the second with restricted in its brackets.

  5. Let’s create a third row with two labels and grant permission to the root user to see both. The & character in the label requires that the user be authorized for both labels for the row to be visible.
    root@instance> insert row3 cola colb value3 -l restricted&forrealthistime
    root@instance> setauths -s restricted,forrealthistime
    root@instance> scan

    Three rows should be returned.

  6. Next, we’ll create a new user who may only read from our new table and who holds only one of the two labels, then switch to that user and run a scan. The shell prompts for a password when you create the user and again when you switch to it.
    root@instance> createuser test
    root@instance> setauths -s restricted -u test
    root@instance> grant Table.READ -t hellosecurity -u test
    root@instance> user test
    test@instance> scan

    Only two rows will be returned, because row3 requires both labels and the test user holds only restricted.
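
The visibility check behind these results can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Accumulo’s implementation: the is_visible function is hypothetical, and it assumes labels are unquoted terms combined with &, |, and parentheses, with mixed operators requiring parentheses.

```python
import re

# A label term, or one of the operators and parentheses.
TOKEN = re.compile(r"[A-Za-z0-9_\-.:/]+|[&|()]")


def is_visible(expression, authorizations):
    """Return True if a cell labeled `expression` is visible to a user
    holding the given set of authorizations (simplified sketch)."""
    if expression == "":
        return True  # unlabeled cells are visible to everyone
    tokens = TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError(f"invalid visibility expression: {expression!r}")
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token: {tok!r}")
        return tok in authorizations

    def parse_expr():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if operator == "&" else (result or rhs)
        return result

    visible = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing token")
    return visible


# The three cells inserted in the steps above, keyed by row.
cells = {
    "row1": "",
    "row2": "restricted",
    "row3": "restricted&forrealthistime",
}
test_auths = {"restricted"}

visible_to_test = [row for row, label in cells.items()
                   if is_visible(label, test_auths)]
print(visible_to_test)  # ['row1', 'row2']
```

With both authorizations, as root holds after step 5, the same check passes for all three rows.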

Conclusion

The examples above demonstrate only a fraction of what Apache Accumulo can do, but hopefully it’s enough to show you how easy it is to launch an Apache Accumulo cluster running on Amazon EMR and also give you a taste of its advanced security controls.

Running Apache Accumulo on Amazon EMR requires a well-defined backup strategy. To avoid data loss, it is highly recommended that you regularly back up your data to Amazon S3 using Amazon EMR’s S3DistCp or Hadoop’s DistCp tool. In addition, if an Amazon EMR node fails, there may be a delay before the data is fully recovered by other healthy nodes, so it is important to architect for database recovery delays.

Please let us know in the comments if you have any questions or if you’re using Amazon EMR today to run Apache Accumulo.
