Accelerate Amazon EMR Spark, Presto, and Hive with the Alluxio AMI

Data analytics workloads are increasingly being migrated to the cloud. Amazon EMR is a cloud-native big data platform that makes it easy to process vast amounts of data quickly and cost effectively at scale.

Amazon EMR, along with Amazon Simple Storage Service (Amazon S3) provides a flexible storage platform. With the click of a few buttons or the run of a single command, you can create a 5-node or 500-node cluster. However, for certain workloads, data engineers and data platform teams may want an added performance boost for Apache Spark and Presto jobs. In particular, they may want to reuse the same data over and over again.

In this blog post, I show how to accelerate Amazon EMR workloads. To do this, I show how to add a caching layer on each Amazon EMR node using the Alluxio AMI, available in AWS Marketplace. A caching layer prioritizes the most relevant data and deprioritizes older data in order to increase workload performance.

Using Alluxio for data orchestration on Amazon EMR

Alluxio is a data orchestration layer. Data orchestration technology brings your data closer to compute across clusters, regions, clouds, and countries. With a data orchestration platform, you get data locality for memory-speed access for your big data and AI/ML workloads, data accessibility regardless of where data resides, and data on-demand so you can abstract and independently scale compute and storage.

It enables users to increase the performance and flexibility of analytic workloads running on Amazon EMR while using S3 for storage.

Alluxio helps accelerate workloads that read and write data running on Amazon S3. It does this by providing a tiered caching layer using memory, Solid State Disks (SSDs), and disks. Alluxio provides Hadoop Distributed File System (HDFS) and S3 API compatibility for compute frameworks like Apache Spark, Presto and Hive that execute on top of Alluxio. It is also compatible with storage systems integrated below Alluxio.

Data locality with intelligent tiering

Alluxio has a cache that includes multi-tiered storage. With tiered storage, data is stored in different tiers based on performance, availability, and recovery requirements. This gives users the ability to lower costs based on performance. With Alluxio, the fastest input/output tier is typically on the top. An eviction process removes old or “cold” data out to make room for newer, “hot” data. Users can choose an eviction policy like least recently used (LRU). It also intelligently accounts for the ordering of storage tiers. For example, users often specify the following storage tiers:

MEM (Memory)
SSD (Solid State Drives)
HDD (Hard Disk Drives)

The following diagram shows how Alluxio provides tiered caching for data within an Amazon EMR instance. The tiers include disk, SSD, and RAM and integrate with Apache Spark, Presto, and Hive. Other compute frameworks can be automatically integrated to use Alluxio using a bootstrap action mechanism that Amazon EMR provides.

Alluxio data orchestration Amazon EMR

Prerequisites

An AWS account
AWS Identity and Access Management (AWS IAM)
Account with the default Amazon EMR roles
Key pair for Amazon Elastic Compute Cloud (Amazon EC2)
An S3 bucket
AWS Command Line Interface (AWS CLI): Make sure that the AWS CLI is also set up and ready with the required AWS Access/Secret key

STEP 1: Subscribe to the Alluxio AMI in AWS Marketplace

Subscribe to the Alluxio AMI in AWS Marketplace by navigating to the product detail page and selecting Continue to Subscribe.
Review the pricing and terms. Select Accept Terms.

The Alluxio AMI is now associated with your account. The subscription includes a seven-day free trial. After the trial, you will automatically be billed at an hourly rate based on your instance type.

STEP 2: Set up the IAM roles required for Amazon EMR

Open AWS command line interface (CLI) in a terminal on your laptop. Run the following command to set up your default Amazon EMR roles.

$ aws emr create-default-roles

This command will not return anything if the role already exists. It will return information about the roles created if they don’t exist as shown in the AWS CLI documentation.

STEP 3: Create the Amazon EMR cluster with this script

Amazon EMR enables you to bring up a cluster with Alluxio pre-installed on the cluster using the bootstrap action mechanism.

On the AWS CLI, I entered the following Amazon EMR command. This creates a cluster using the Alluxio bootstrap script, the path of which is included in the following command (s3://alluxio-public/emr/2.0.1/alluxio-emr.sh). It uses the Alluxio AMI from AWS Marketplace.

aws emr create-cluster \

--release-label emr-5.23.0 \

--custom-ami-id ami-0a53794238d399ab6 \

--instance-count 3 \

--instance-type r4.2xlarge \

--applications Name=Spark Name=Presto Name=Hive \

--name try-Alluxio \

--bootstrap-actions \

Path=s3://alluxio-public/emr/2.0.1/alluxio-emr.sh,\ Args=[s3://my-test-bucket/mount/,\ -p,"alluxio.user.block.size.bytes.default=122M|alluxio.user.file.writetype.default=ASYNC_THROUGH",\

-s,"|"] \

--configurations https://alluxio-public.s3.amazonaws.com/emr/2.0.1/alluxio-emr.json \

--ec2-attributes KeyName=admin-key

This command will return the cluster ID if successful.

{

"ClusterId": "j-BZOH1F6EYEGE"

}

You can customize the preceding command by changing the following elements according to your needs.

release-label – the version of EMR that you want installed.
custom-ami-id – the ID of the AMI you want to use with pre-installed software. In this example, the AMI ID points to the Alluxio Marketplace Enterprise Edition AMI ID ami-0a53794238d399ab6, which you’ll see after you subscribe. Then you can select it for the Region where you want to bring up your cluster.
instance-count – the number of nodes you want in the EMR cluster.
instance-type – the type of EC2 instance you want to use. Make sure you choose an instance that the Alluxio AMI supports. Also verify your instance limits on your AWS Account for the instance type you want to use. You can check your EC2 instance limits here.
applications – the services or frameworks you want to bring up. These may include Spark, Hive, or Presto.
name – the name you want to give to this cluster to identify it.
Bootstrap-actions – the script that pre-installs software and configures the cluster. The script takes the following arguments:
- The root under file system path This should be an s3:// URI designating the root mount of the Alluxio file system. This is a mandatory property. It would be in the form of s3://<bucket-name-for-ufs>/<mount-point>/. The mount point should be a folder. To create a folder in your AWS account, follow these instructions.
- user.block.size.bytes.default=122M – the size of the cache you want to allocate to Alluxio on a per-node basis.
- user.file.writetype.default=ASYNC_THROUGH tells Alluxio to write files asynchronously to the storage system underneath. Learn more about the Alluxio write type option.

STEP 4: Log in to the new EMR cluster

Log into the Amazon EMR Console.
Once the cluster is in the Waiting stage as indicated in the upper center, select the Hardware tab to see the master and worker details.
In the Hardware tab, select the master instance and note the Public DNS name. In the following screenshot, the Public DNS name is ec2-54-205-158-201.compute-1.amazonaws.com.
Go back to the Terminal application and exit the AWS CLI. Next, on the command line in Terminal, connect to the AWS EMR master instance. Use the Secure Shell (SSH) command and the key pair in number 4 of the Prerequisites. Note that if a security group isn’t specified via CLI, the default EMR security group will not allow inbound SSH. You can learn how to create a security group to allow SSH here. Here’s an example command:

$ ssh -i ~/admin-key.pem hadoop@ec2-54-205-158-201.compute-1.amazonaws.com

When you SSH into the master node, you get access to the entire EMR cluster and Alluxio to run a variety of commands. You will be logging in as the Hadoop user, which gives you access to administration of the cluster.

STEP 5: Verify Alluxio is installed and working

Now that you have an SSH shell running and are connected to the AWS EMR master, enter the following command on the SSH shell to verify that Alluxio is running as expected.

$ sudo runuser -l alluxio -c "/opt/alluxio/bin/alluxio runTests"

When running this command, Alluxio runs a set of scripts to test various different configurations and indicate that the test passed or failed. The following output shows that the tests passed, and Alluxio is running as expected.

runTest BASIC CACHE_PROMOTE MUST_CACHE

2019-08-30 21:27:23,798 INFO BasicOperations - writeFile to file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 427 ms.

2019-08-30 21:27:23,895 INFO BasicOperations - readFile file /default_tests_files/BASIC_CACHE_PROMOTE_MUST_CACHE took 97 ms.

Passed the test!

Congratulations! Your Spark, Hive, and Presto workloads are now using Alluxio to access and store data. You’ll experience faster performance for your queries and jobs. Data is also asynchronously written back to S3. This means that the application does not need to wait for data to be written to S3 to continue, which leads to faster workloads.

STEP 6: Create a table in Alluxio

Now that your cluster is up and running, try using the cache within the AWS EMR instance for a faster workload. To do this, create a table using Presto stored in Alluxio and backed by S3.

In your EMR master node that you SSHed into, create a directory named user in Alluxio by entering the following commands:

/opt/alluxio/bin/alluxio fs mkdir /user

/opt/alluxio/bin/alluxio fs chown hadoop:hadoop /user

Running these commands will create a new user directory and allow the Hadoop user to read and write to this directory.

Open the Presto Command Line Interface (CLI) and enter the following default schema.

presto-cli --catalog hive

Use default;

This command will bring up the Presto CLI and allow you to type in Structured Query Language (SQL) queries to read data from Alluxio.

On the Presto CLI, enter the following commands to create a table with location Alluxio.

CREATE TABLE user (userid INT, age INT, gender CHAR(1), occupation CHAR(16), zipcode CHAR(5)) WITH (external_location = 'alluxio:///user' ) ;

This creates a new table with name user and schema mentioned in the statement, with the location as Alluxio. This means that any data added to this table will be automatically cached in Alluxio.

On the Presto CLI, enter the following commands to insert data into the table created in the previous step.

Insert into user VALUES (12345, 21, 'M', 'STUDENT', '94403’);

Insert into user VALUES (23456, 21, 'F', 'STUDENT', '94403’);

This command will insert two rows of records into the table and store this data in the Alluxio cache.

On the Presto CLI, enter the following commands to query data from the user table:

select * from user;

This command returns the two records you just inserted into Alluxio. The larger the dataset you have cached, the better the performance of your queries and workload.

Conclusion

In this tutorial, I showed how you can bootstrap an Amazon EMR Cluster with Alluxio. Alluxio caches metadata and data for your jobs to accelerate them. By using this cache, Presto, Spark, and Hive queries that run in Amazon EMR can run up to five times faster and run more concurrent queries. As a result, workloads run more efficiently, and users can run more complex workloads with the same number of resources.

About the author

Dipti Borkar is Vice President of Products at Alluxio. Dipti’s interests like in technology areas like databases, cloud computing, and advanced analytics.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

AWS Marketplace