AWS Big Data Blog

Custom Log Presto Query Events on Amazon EMR for Auditing and Performance Insights

We find that AWS customers often require that every query submitted to Presto running on Amazon EMR be logged. They want to track what query was submitted, when it was submitted, and who submitted it.

In this blog post, we will demonstrate how to implement and install a Presto event listener for purposes of custom logging, debugging and performance analysis for queries executed on an EMR cluster. An event listener is a plugin to Presto that is invoked when an event such as query creation, query completion, or split completion occurs.

Presto also provides a system connector that exposes information and metrics about a running cluster through SQL. In particular, it surfaces details about currently running and recently run queries in the system.runtime.queries and system.runtime.tasks tables. From the Presto command line interface (CLI), you can retrieve this data with the queries SELECT * FROM system.runtime.queries; and SELECT * FROM system.runtime.tasks;. This connector is available out of the box.

In addition to providing custom logging and debugging, the Presto event listener lets you enhance the information provided by the system connector, giving you a mechanism to capture detailed query context, query statistics, and failure information at query creation, query completion, or split completion.

We will begin by providing a detailed walkthrough of the implementation of the Presto event listener in Java followed by its deployment on the EMR cluster.

Implementation

We use the Eclipse IDE to create a Maven project.

Once you have created the Maven project, modify the pom.xml file to add the dependency for Presto.
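For reference, a presto-spi dependency stanza might look like the following. The version shown is an assumption based on the Presto release that ships with emr-5.10.0; match it to the Presto version on your cluster:

```xml
<dependency>
    <!-- Presto SPI: provides the EventListener, EventListenerFactory,
         and Plugin interfaces used in this post. The version below is
         an assumption; align it with your cluster's Presto release. -->
    <groupId>com.facebook.presto</groupId>
    <artifactId>presto-spi</artifactId>
    <version>0.187</version>
    <!-- "provided" scope: the SPI classes are supplied by the Presto
         server at runtime and should not be bundled into the plugin JAR. -->
    <scope>provided</scope>
</dependency>
```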

After you add the Presto dependency to the pom.xml file, create a Java package under the src/main/java folder. In our project, we named the package com.amazonaws.QueryEventListener, but you can choose the naming convention that best fits your organization. Within this package, create three Java files: one each for the EventListener, the EventListenerFactory, and the EventListenerPlugin.

As the Presto website says: “EventListenerFactory is responsible for creating an EventListener instance. It also defines an EventListener name, which is used by the administrator in a Presto configuration. Implementations of EventListener implement methods for the event types they are interested in handling. The implementation of EventListener and EventListenerFactory must be wrapped as a plugin and installed on the Presto cluster.”

In our project, we have named these Java files QueryEventListener, QueryEventListenerFactory, and QueryEventListenerPlugin:

Now we write our code for the Java files.

QueryEventListener – The QueryEventListener class implements the Presto EventListener interface. Its constructor creates five rotating log files of 524 MB each. After creating QueryEventListener, we implement the query creation, query completion, and split completion methods and log the events relevant to us. You can choose to log more events based on your needs.

You can find the code for the QueryEventListener class in this Amazon repository.
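To illustrate the rotating-log setup described above, here is a minimal, self-contained sketch using java.util.logging's FileHandler. This is an assumption about the mechanism; the actual QueryEventListener in the repository may configure rotation differently, and the constants and names here are for illustration only:

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class QueryLogDemo {
    // Hypothetical constants mirroring the listener's constructor:
    // five rotating files of 524 MB each (524 MB interpreted as 524 * 1024 * 1024 bytes).
    static final int LOG_FILE_COUNT = 5;
    static final int LOG_FILE_SIZE_BYTES = 524 * 1024 * 1024;

    static Logger createRotatingLogger(String pattern) throws IOException {
        Logger logger = Logger.getLogger("QueryLog");
        logger.setUseParentHandlers(false); // log only to the files, not the console
        // %g in the pattern is replaced with the rotation generation number (0..4).
        FileHandler handler =
                new FileHandler(pattern, LOG_FILE_SIZE_BYTES, LOG_FILE_COUNT, true);
        handler.setFormatter(new SimpleFormatter());
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) throws IOException {
        Logger logger = createRotatingLogger("queries-demo.%g.log");
        // The real listener would log fields from QueryCreatedEvent here.
        logger.info("Query Created: queryId=demo_query");
    }
}
```

When the active file reaches the size limit, FileHandler rolls over to the next generation, keeping at most five files.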

QueryEventListenerFactory – The QueryEventListenerFactory class implements the Presto EventListenerFactory interface. We implement the method getName, which provides a registered EventListenerFactory name to Presto. We also implement the create method, which creates an EventListener instance.

You can find the code for the QueryEventListenerFactory class in this Amazon repository.

QueryEventListenerPlugin – The QueryEventListenerPlugin class implements the Presto EventListenerPlugin interface. This class has a getEventListenerFactories method that returns an immutable list containing the EventListenerFactory. Basically, in this class we are wrapping QueryEventListener and QueryEventListenerFactory.

You can find the code for the QueryEventListenerPlugin class in this Amazon repository.
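To make the relationship among the three classes concrete, the following compact sketch mirrors the wiring with simplified local stand-in interfaces. These are not the real com.facebook.presto.spi types (which take event objects and a configuration map); they only show how the plugin exposes a factory, the factory is matched by name, and the factory creates the listener:

```java
import java.util.List;

public class ListenerWiringDemo {
    // Simplified stand-ins for the Presto SPI -- illustration only.
    interface EventListener { void queryCreated(String queryId); }
    interface EventListenerFactory {
        String getName();       // the name referenced in event-listener.properties
        EventListener create(); // the real SPI method also receives a config map
    }
    interface EventListenerPlugin { List<EventListenerFactory> getEventListenerFactories(); }

    static class QueryEventListener implements EventListener {
        public void queryCreated(String queryId) {
            System.out.println("Query Created: " + queryId);
        }
    }

    static class QueryEventListenerFactory implements EventListenerFactory {
        public String getName() { return "event-listener"; }
        public EventListener create() { return new QueryEventListener(); }
    }

    static class QueryEventListenerPlugin implements EventListenerPlugin {
        public List<EventListenerFactory> getEventListenerFactories() {
            // Wraps the factory in an immutable list, as the real plugin does.
            return List.of(new QueryEventListenerFactory());
        }
    }

    public static void main(String[] args) {
        // Presto locates the plugin via ServiceLoader, then matches the
        // event-listener.name property against each factory's getName().
        EventListenerPlugin plugin = new QueryEventListenerPlugin();
        for (EventListenerFactory factory : plugin.getEventListenerFactories()) {
            if (factory.getName().equals("event-listener")) {
                factory.create().queryCreated("demo_query_id");
            }
        }
    }
}
```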

Next, we add the META-INF folder to our project, with a services subfolder inside it. In the services subfolder, create a file called com.facebook.presto.spi.Plugin.

As the Presto documentation describes: “Each plugin identifies an entry point: an implementation of the plugin interface. This class name is provided to Presto via the standard Java ServiceLoader interface: the classpath contains a resource file named com.facebook.presto.spi.Plugin in the META-INF/services directory”.

We add the name of our plugin class, com.amazonaws.QueryEventListener.QueryEventListenerPlugin, to the com.facebook.presto.spi.Plugin file.
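The META-INF/services/com.facebook.presto.spi.Plugin file consists of that single fully qualified class name:

```
com.amazonaws.QueryEventListener.QueryEventListenerPlugin
```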

Finally, we create a JAR file to deploy to the EMR cluster by following the instructions provided on the Eclipse IDE website.

We saved our JAR file in our project directory.

While creating the JAR file, we save the MANIFEST.MF file in the services directory.

The entire Java project is available for your reference in this Amazon repository.

Next we will show you how to deploy the Presto plugin we created to Amazon EMR for custom logging.

Presto logging overview on Amazon EMR

By default, Presto produces three log files that capture the configuration properties and overall operational events of the components that make up Presto, and that log end-user access to the Presto UI.

On Amazon EMR, these log files are written to /var/log/presto, and the log files in this directory are pushed to the Amazon S3 log location configured for the cluster. The new custom log files produced by our event listener are written to the same directory, so they are pushed to S3 as well.

Steps to deploy Presto on Amazon EMR with custom logging

To deploy Presto on Amazon EMR with custom logging, we use a bootstrap action. The bootstrap action script is available in this Amazon repository.

#!/bin/bash
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
    if grep isMaster /mnt/var/lib/info/instance.json | grep -q true
    then
        IS_MASTER=true
    else
        IS_MASTER=false
    fi
fi
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
if [ "$IS_MASTER" = true ]; then
    sudo mkdir -p /usr/lib/presto/etc
    sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi

First, upload the JAR file created in the last section to an S3 bucket, and update the S3 location in the bootstrap script, s3://replace-with-your-bucket/QueryEventListener.jar, with the name of the bucket where you placed the JAR.

You can also use the JAR we have generated in this Amazon repository.

After updating the bootstrap script with the S3 location of your JAR, upload the script to your own bucket.

The bootstrap action copies the JAR file with the custom EventListener implementation to every node in the cluster. It also creates a file named event-listener.properties on the Amazon EMR master node, which configures the coordinator to enable the custom logging plugin by setting the event-listener.name property to event-listener. As the Presto documentation explains, Presto uses this property to find a registered EventListenerFactory based on the name returned by EventListenerFactory.getName().

Now that the bootstrap is ready, the following AWS CLI command can be used to create a new EMR cluster with the bootstrap:

aws emr create-cluster --name "ClusterWithPrestoLogging" --release-label emr-5.10.0 --applications Name=Hive Name=Presto --use-default-roles --instance-count 2 --instance-type m3.xlarge --ec2-attributes KeyName=replace-with-your-key  --log-uri s3://replace-with-your-bucket/ --bootstrap-actions Path=s3://replace-with-your-bucket-name/eventlistenerbootstrap.sh,Name=BootstrapActionPrestoLogging,Args=[]  --region us-east-1

A quick example with results

After we have implemented the Presto custom logging plugin and deployed it, we can run a test and see the output of the log files.

We first create a Hive table with the DDL below from the Hive command line (type hive in the SSH terminal on the EMR master node to access it):

CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrieved_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';

Then, we access the Presto CLI by typing the following command in the terminal:

presto-cli --server localhost:8889 --catalog hive --schema default

Finally, we run a simple query:

SELECT * FROM wikistats LIMIT 10;

We then go to the /var/log/presto directory and look at the contents of the log file queries-YYYY-MM-DDTHH\:MM\:SS.0.log. Our QueryEventListener plugin captures the logged fields for the Query Created and Query Completed events. Moreover, if there are splits, the plugin also captures split completion events.

Note: If you want to include the query text executed by the user for auditing and debugging purposes, add the field appropriately in the QueryEventListener class methods, as shown below:

For the queryCreatedEvent method, add it like so:

msg.append("Query Text: ");
msg.append(queryCreatedEvent.getMetadata().getQuery().toString());

For the queryCompletedEvent method, add it like so:

msg.append("Query Text: ");
msg.append(queryCompletedEvent.getMetadata().getQuery().toString());

Because this is custom logging, you can capture as many fields as are available for the particular events. To find out which fields are available for each event, see the Java classes provided by Presto at this GitHub location.

Summary

In this post, you learned how to add custom logging to Presto on EMR to enhance your organization’s auditing capabilities and provide insights into performance.

If you have questions or suggestions, please leave a comment below.

Additional Reading

If you found this post useful, be sure to check out Visualize AWS CloudTrail Logs Using AWS Glue and Amazon QuickSight.


About the Authors

Zafar Kapadia is a Cloud Application Architect for AWS. He works on Application Development and Optimization projects. He is also an avid cricketer and plays in various local leagues.


Francisco Oliveira is a Big Data Engineer with AWS Professional Services. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.