AWS Big Data Blog

Building and Deploying Custom Applications with Apache Bigtop and Amazon EMR

Hernan Vivani is a Hadoop Systems Engineer for Amazon Web Services.

When you launch a cluster, Amazon EMR lets you choose the applications that run on it. But what if you want to deploy your own custom application? This post shows you how to build a custom application for Apache Bigtop-based EMR releases (4.x and later). EMR nodes are based on the Amazon Linux AMI, so I will package the application as RPMs and use Elasticsearch as the example application.

What is Apache Bigtop?

Apache Bigtop is a community-maintained repository that supports a wide range of components and projects, including, but not limited to, Hadoop, HBase, and Spark. Bigtop supports various Linux packaging systems, such as RPM and Deb, for packaging applications, and it uses Puppet for application deployment and configuration on clusters.

Walkthrough

The following diagram represents the Bigtop package creation process.

To create a Bigtop package for EMR, follow these steps:

  1. Launch a development EMR cluster.
  2. Clone the Bigtop public repository.
  3. Add the application definition to bigtop.bom.
  4. Create directories and configuration files for the application.
    • Create an RPM package.
    • Create a Yum repository.
  5. Move the output repository to S3 to make it available for any new cluster where you want to install the new application.
  6. Test the application.
  7. Create a bootstrap script.
  8. Launch an EMR cluster with the bootstrap script.

You will create an EMR cluster for development purposes. This cluster provides the tools needed to create and test the Bigtop application, such as Maven and Gradle.

 

Launch a development EMR cluster

Using CLI tools, run the following command to get the development cluster up and running:

aws emr create-cluster --name "EMR_Bigtop_Dev" --release-label emr-4.7.2 --instance-type=m3.xlarge --instance-count 1 --ec2-attributes KeyName=<YOUR-KEY-PAIR> --log-uri s3://<YOUR-BUCKET>/ --no-auto-terminate --use-default-roles --bootstrap-action Name="Install EMR DEV Tools",Path=s3://us-west-2.awssupportdatasvcs.com/bootstrap-actions/EMR_Dev/setup_EMR_Dev.sh
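The create-cluster command returns a cluster ID, and the cluster takes several minutes to start. The following is a minimal sketch of how you might wait for it and look up the master node's public DNS name. The helper name emr_master_dns is hypothetical and not part of the original walkthrough.

```shell
# Sketch: block until the cluster is running, then print the master node's
# public DNS name. emr_master_dns is a hypothetical helper name.
emr_master_dns() (
  cluster_id="$1"
  # "aws emr wait cluster-running" polls until the cluster is running.
  aws emr wait cluster-running --cluster-id "$cluster_id" || exit 1
  aws emr describe-cluster --cluster-id "$cluster_id" \
    --query 'Cluster.MasterPublicDnsName' --output text
)
```

For example, pass the cluster ID that create-cluster printed, then use the result as the host name in your SSH command (EMR nodes use the hadoop user).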

 

Clone the Bigtop public repository

After the cluster is running, SSH to the EMR Bigtop dev master node and clone the Bigtop public repository:

git clone https://github.com/apache/bigtop.git

 

Add the application definition to bigtop.bom

In the directory created by the clone command in the previous section (/home/hadoop/bigtop/), you will find a file called bigtop.bom. This file holds the definitions of all applications available in the current version of Bigtop.

In the components section, add an ‘elasticsearch’ section as follows:

   'elasticsearch' {
      name    = 'elasticsearch'
      relNotes = 'Search and Analytics engine'
      version { base = '1.6.0'; pkg = base; release = 1 }
      tarball { destination = "$name-${version.base}.tar.gz"
                source      = "v${version.base}.zip" }
      url     { site = "https://github.com/elastic/elasticsearch/archive"
                archive = site }
    }

This should look like the following screenshot.

The application you are defining here describes the following:

  • Application name
  • Application version
  • Tarball:
    • destination: The tarball name to build with the downloaded source code.
    • source: The source code file name. In this case, you are downloading the source code from GitHub and choosing a specific release, tag v1.6.0.
  • url: The URL you are downloading the code from.
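To make the relationship between these fields concrete, here is a small sketch that reproduces the values from the definition above. It assumes the build joins url.archive and tarball.source with a slash to form the download URL; the function names are illustrative only.

```shell
# Sketch: combine the bigtop.bom fields shown above.
# Assumption: download URL = url.archive + "/" + tarball.source.
bom_download_url() (
  base="$1"                                                   # version.base
  archive='https://github.com/elastic/elasticsearch/archive'  # url.archive
  printf '%s/v%s.zip\n' "$archive" "$base"
)

# tarball.destination: the local file name used for the downloaded source.
bom_tarball_name() (
  name="$1"
  base="$2"
  printf '%s-%s.tar.gz\n' "$name" "$base"
)
```

With the values from this post, bom_download_url 1.6.0 prints https://github.com/elastic/elasticsearch/archive/v1.6.0.zip and bom_tarball_name elasticsearch 1.6.0 prints elasticsearch-1.6.0.tar.gz.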

 

Test the repository

To test whether Gradle and all the tools needed for building a Bigtop application are installed, run the following command:

gradle tasks | grep elasticsearch

The first run of this command can take a little time. You should get a final output like the following:

 

Create directories and configuration files for the application

Deploying an application for Bigtop involves two major tasks: creating RPM packages for the application and creating the Puppet script.

  • Creating RPM packages for the application

For Elasticsearch, the example application, you use a customized version of the RPM SPEC definition. If the application that you want to include in Bigtop already provides an RPM, you can customize its SPEC definition for Bigtop. Otherwise, you need to create an RPM SPEC definition file from scratch. The default directory location for these files is:

bigtop-packages/src/rpm/<application-name>/SPECS

Common scripts are executed by the package build process to create the final package. On a Red Hat-based distribution, the package format is RPM; on a Debian-based distribution, it is Deb. The default directory location for these common scripts is:

bigtop-packages/src/common/<application-name>/

Some common scripts are:

  • do-component-build: This file contains the environment configuration and build commands to use when creating a package. For example: mvn clean install -DskipTests -Dhadoop.version=$HADOOP_VERSION "$@"
  • install-<application-name>.sh: This script defines the package directory structure and how the files are distributed on that structure.
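As a rough illustration of what an install script does, the sketch below stages build output into a package directory tree. The directory names (usr/lib/<application-name>, etc/<application-name>/conf) and the function name are assumptions chosen for the example; they are not taken from the Elasticsearch scripts used in this post.

```shell
# Sketch of the layout step of an install-<application-name>.sh script.
# All paths and the function name are illustrative assumptions.
install_app_layout() (
  build_dir="$1"   # directory holding the build output
  prefix="$2"      # staging root that the package build picks up
  app="$3"         # application name, for example elasticsearch
  lib_dir="$prefix/usr/lib/$app"
  conf_dir="$prefix/etc/$app/conf"

  mkdir -p "$lib_dir" "$conf_dir" || exit 1
  # Copy binaries and libraries into the library directory.
  for d in bin lib; do
    if [ -d "$build_dir/$d" ]; then
      cp -r "$build_dir/$d" "$lib_dir/" || exit 1
    fi
  done
  # Copy configuration files, if the build produced any.
  if [ -d "$build_dir/config" ]; then
    cp -r "$build_dir/config/." "$conf_dir/" || exit 1
  fi
)
```

The SPEC file would then list the staged directories so that they end up in the final RPM.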

For general guidance, see How to create an RPM package in the Fedora documentation. If you are just getting started, see How to create a GNU Hello RPM package in the Fedora documentation.

  • Creating the Puppet scripts

Puppet is responsible for the installation and configuration process of the application. Each application defines a main ‘init.pp’ script where you declare how to install the application, how the configuration files are populated, and how the service is handled, among other tasks. The default directory location for the init.pp script is:

bigtop-deploy/puppet/modules/<application-name>/manifests/

Another important directory in the Puppet module structure is ‘templates’. You usually use templates in Bigtop to deploy configuration files that combine code and data. The default directory location for templates is:

bigtop-deploy/puppet/modules/<application-name>/templates/

For more information about Puppet templates, see Language: Using templates in the Puppet documentation. If you are just starting with Puppet, see Puppet Hello World.
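If you start a module from scratch rather than copying an existing one, the two directories described above can be scaffolded with a few commands. This is only a sketch: the helper name is hypothetical, and the init.pp it creates is an empty placeholder that you still have to fill in.

```shell
# Sketch: create the manifests/ and templates/ directories for a new
# Puppet module inside a Bigtop checkout. Hypothetical helper name.
scaffold_puppet_module() (
  bigtop_root="$1"   # path to the cloned Bigtop repository
  app="$2"           # application name
  module_dir="$bigtop_root/bigtop-deploy/puppet/modules/$app"

  mkdir -p "$module_dir/manifests" "$module_dir/templates" || exit 1
  # Create an empty init.pp only if one does not exist yet.
  [ -e "$module_dir/manifests/init.pp" ] || : > "$module_dir/manifests/init.pp"
  printf '%s\n' "$module_dir"
)
```

For example, scaffold_puppet_module ~/bigtop myapp prints the path of the new module directory.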

 

Create the file and directory structure

For this example, create the required file and directory structure with the following commands:

cd ~
git clone https://github.com/awslabs/aws-big-data-blog.git

After you clone the needed structure for the application, use the following commands to copy it to the local Bigtop repository that you created in "Clone the Bigtop public repository," so that you can build the application from there:

cd aws-big-data-blog/aws-blog-bigtop-application-emr/
cp -r bigtop-packages/* ~/bigtop/bigtop-packages/
cp -r bigtop-deploy/* ~/bigtop/bigtop-deploy/

 

Create an RPM package for the new application

Now that you have all the configuration files in place, run the command to build the new application. This command downloads the source code (as defined in bigtop.bom), compiles the source code, and builds a new RPM as per the specification in the SPEC file.

cd /home/hadoop/bigtop
gradle realclean elasticsearch-rpm --stacktrace

The final output should look something like the following:

You should be able to create the package by just executing gradle elasticsearch-rpm, but I am adding a couple of extra parameters:

  • --stacktrace provides the complete stack trace in case of an exception during the build process.
  • “realclean” cleans previous build outputs in case you have to build more than one time.
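After the build finishes, you may want to confirm that an RPM was actually produced. The sketch below searches a directory tree for packages whose names start with the application name. It assumes the build writes its packages somewhere under the Bigtop output directory; the function name is hypothetical.

```shell
# Sketch: list RPMs for an application under a build output directory.
# Assumption: the packages land somewhere below the directory you pass in.
list_built_rpms() (
  output_dir="$1"
  app="$2"
  rpms="$(find "$output_dir" -type f -name "${app}*.rpm" | sort)"
  if [ -z "$rpms" ]; then
    echo "no RPMs found for $app under $output_dir" >&2
    exit 1
  fi
  printf '%s\n' "$rpms"
)
```

For example: list_built_rpms /home/hadoop/bigtop/output elasticsearch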

 

Create the repository for the new application

After the RPM packages are created, create the Yum repository to host the new RPMs so that Puppet can call yum to install the new application when needed. The RPM repository is created at /home/hadoop/bigtop/output/ after you run the following command:

	gradle yum

 

Move the output repository to S3

Now that you have the Yum repository created, move it to S3 so it can be used for any EMR cluster at launch time. This repository is referenced later when you create the bootstrap script to launch EMR clusters with Elasticsearch.

	aws s3 sync /home/hadoop/bigtop/output s3://<your-bucket>/bigtop/output --acl public-read

Now, create a Yum repository definition file pointing to the newly created repository. In /etc/yum.repos.d/, create a file called bigtop_custom.repo with the following content:

[bigtop_custom]
name=bigtop_custom_repo
baseurl=https://<your-s3-endpoint>.amazonaws.com/<your-bucket>/bigtop/output
enabled=1
gpgcheck=0

Remember to set “baseurl” to the S3 location where you synced the repository that you created.

As an example, if your bucket is in eu-west-1, then the base URL line looks like:

baseurl=https://s3-eu-west-1.amazonaws.com/<your-bucket>/bigtop/output

For more information, see Working with Amazon S3 Buckets.
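If you prefer to generate the repository definition instead of editing it by hand, a sketch like the following fills in the endpoint and bucket. The function name is hypothetical; the file content is the same as shown above.

```shell
# Sketch: print a bigtop_custom.repo definition for a given S3 endpoint
# prefix (for example s3-eu-west-1) and bucket name.
render_bigtop_repo() (
  endpoint="$1"
  bucket="$2"
  cat <<EOF
[bigtop_custom]
name=bigtop_custom_repo
baseurl=https://${endpoint}.amazonaws.com/${bucket}/bigtop/output
enabled=1
gpgcheck=0
EOF
)
```

For example, render_bigtop_repo s3-eu-west-1 my-bucket | sudo tee /etc/yum.repos.d/bigtop_custom.repo writes the file in place, where my-bucket stands in for your own bucket name.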

Upload this repository definition file to S3 to also have it available when you reference the package on any new cluster.

aws s3 cp /etc/yum.repos.d/bigtop_custom.repo s3://<your-bucket>/bigtop/repos/

 

Test the repository

Run the following command to test the repository:

	sudo yum list elasticsearch

You should see output like the following:

 

Test the application

Now that you have everything in place, run puppet apply to test the application on the master node:

	sudo puppet apply --verbose -d --modulepath=/home/hadoop/bigtop/bigtop-deploy/puppet/modules:/etc/puppet/modules  -e 'include elasticsearch::client' 2>&1 | tee ~/puppet_apply.log

You will see a lot of output on the console while Puppet applies the configuration scripts. The same output is saved in your home directory as “puppet_apply.log”. If the puppet apply command was successful, your cluster should have an instance of Elasticsearch running.

You can run a health check by using the REST API; run the following command:

	curl 'localhost:9200/_cluster/health?pretty=true'
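Elasticsearch can take a few seconds to start after Puppet finishes, so a single request may fail at first. The following sketch retries the same health check until the cluster reports a green or yellow status. The function name, retry count, and delay are illustrative choices, not part of the original procedure.

```shell
# Sketch: poll the cluster health endpoint until the status is green or
# yellow. wait_for_elasticsearch is a hypothetical helper name.
wait_for_elasticsearch() (
  url="${1:-localhost:9200/_cluster/health?pretty=true}"
  attempts="${2:-30}"   # maximum number of requests
  delay="${3:-2}"       # seconds to wait between requests
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s hides curl's progress output; a refused connection prints nothing.
    if curl -s "$url" | grep -Eq '"status" *: *"(green|yellow)"'; then
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Elasticsearch did not report green or yellow status" >&2
  exit 1
)
```

Calling wait_for_elasticsearch with no arguments checks the local node with the default settings and returns a non-zero exit code if the cluster never becomes healthy.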

Elasticsearch should return a result that looks like the following:

 

Create a bootstrap script to deploy the new application in a new cluster

Now that you have the application created, create the bootstrap script needed to launch any EMR cluster with the application. You can find this script in one of the previously cloned directories:

vi ~/aws-big-data-blog/aws-blog-bigtop-application-emr/puppet_install_elasticsearch.sh

Edit the script and ensure that you point to your own bucket on line 5:

#!/bin/bash
set -x

echo "creating custom Bigtop repository"
sudo aws s3 cp s3://<your-bucket>/bigtop/repos/bigtop_custom.repo  /etc/yum.repos.d/bigtop_custom.repo

Upload the script to S3 so it can be invoked when you launch a new EMR cluster:

	aws s3 cp ~/aws-big-data-blog/aws-blog-bigtop-application-emr/puppet_install_elasticsearch.sh s3://<your-bucket>/bigtop/scripts/

 

Launch a new cluster with the new Bigtop application

Because of the roles you are using, you cannot launch an EMR cluster from the master node of the development cluster. From another EC2 instance, or from your own machine with the AWS command line tools installed, you can launch a cluster with a command like the following:

	aws emr create-cluster --name "EMR_Bigtop_Application_Test" --release-label emr-4.7.2 --instance-type=m3.xlarge --instance-count 3 --ec2-attributes KeyName=<your-key> --log-uri s3://<your-bucket>/logs/ --no-auto-terminate --use-default-roles --bootstrap-action Name="Install Elasticsearch",Path=s3://<your-bucket>/bigtop/scripts/puppet_install_elasticsearch.sh

You can also use the EMR console to launch the cluster.

After the cluster is up and running, you should be able to SSH into the master node and check that the application is running properly.

Conclusion

In this post, I demonstrated how to create an Apache Bigtop application, and install and run it on an EMR cluster.

If you have any questions or suggestions, please leave a comment below.


 
