Visualize Big Data with Amazon QuickSight, Presto, and Apache Spark on Amazon EMR
Last December, we introduced the Amazon Athena connector in Amazon QuickSight, in the Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight post.
The connector allows you to visualize your big data easily in Amazon S3 using Athena’s interactive query engine in a serverless fashion. This turned out to be a very popular combination, as customers benefit from the speed, agility, and cost benefit that serverless business intelligence (BI) and analytics architecture brings.
Today, we’re excited to announce two new native connectors in QuickSight for big data analytics: Presto and Spark. With the Presto and SparkSQL connector in QuickSight, you can easily create interactive visualizations over large datasets using Amazon EMR.
EMR provides a simple and cost effective way to run highly distributed processing frameworks such as Presto and Spark when compared to on-premises deployments. EMR provides you with the flexibility to define specific compute, memory, storage, and application parameters and optimize your analytic requirements.
In this post, I walk you through connecting QuickSight to an EMR cluster running Presto. If you’d like a walkthrough with Spark, let us know in the comments section!
Presto is an open source, distributed SQL query engine for running interactive analytic queries against data sources ranging from gigabytes to petabytes. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can run on multiple data sources, including Amazon S3.
Presto’s execution framework is fundamentally different from that of Hive/MapReduce. Presto has a custom query and execution engine where the stages of execution are pipelined, similar to a directed acyclic graph (DAG), and all processing occurs in memory to reduce disk I/O. This pipelined execution model can run multiple stages in parallel and streams data from one stage to another as the data becomes available. This reduces end-to-end latency and makes Presto a great tool for ad hoc data exploration over large data sets.
Use the following steps to connect QuickSight to an EMR cluster running Presto:
- Create an EMR cluster with the latest 5.5.0 release.
- Configure LDAP for user authentication in QuickSight.
- Configure SSL using a QuickSight supported certificate authority (CA).
- Create tables for Presto in the Hive metastore.
- Whitelist the QuickSight IP address range in your EMR master security group rules.
- Connect QuickSight to Presto and create some visualizations.
You need run Presto version 0.167, at a minimum, which is the first release that supports LDAP authentication. LDAP authentication is a requirement for the Presto and Spark connectors and QuickSight refuses to connect if LDAP is not configured on your cluster.
Create an EMR cluster with release version 5.5.0
In the EMR console, use the Quick Create option to create a cluster. For this post, use most of the default settings with a few exceptions. To install both Presto and Spark on your cluster (and customize other settings), create your cluster from the Advanced Options wizard instead.
Make sure that EMR release 5.5.0 is selected and under Applications, choose Presto. If you have an EC2 key pair, you can use it. Otherwise, create a key pair (.PEM file) and then return to this page to create the cluster.
Make sure that you configure your cluster’s security group inbound rules to allow SSH from your machine’s IP address range.
Configure LDAP for user authentication in QuickSight
After your cluster is in a running state, connect using SSH to your cluster to configure LDAP authentication.
To SSH into your EMR cluster, use the following commands in the terminal:
After you log in, install OpenLDAP, configure it, and create users in the directory. For more about configuring LDAP, see Editing /etc/openldap/slapd.conf in the OpenLDAP documentation.
After LDAP is installed and restarted, you issue a couple of commands to change the LDAP password. First, generate a hash for the LDAP root password and save the output hash that looks like this:
Issue the following command and set a root password for LDAP when prompted:
Now, prepare the commands to set the password for the LDAP root. Make sure to replace the hash below with the one that you generated in the previous step:
Run the following command to execute the above commands against LDAP:
Next, create a user account with password in the LDAP directory with the following commands. When prompted for a password, use the LDAP root password that you created in the previous step.
You now have OpenLDAP configured on your EMR cluster running Presto and a user that you later use to authenticate against when connecting to Presto.
Configure SSL using a QuickSight supported certificate authority
To ensure that any communication between QuickSight and Presto is secured, QuickSight requires that the connection to be established with SSL enabled. You need to obtain a certificate from a certificate authority (CA) that QuickSight trusts. You can find the full list of public CAs accepted by QuickSight in the Network and Database Configuration Requirements topic.
To set up SSL on LDAP and Presto, obtain the following three SSL certificate files from your CA and store them in the /home/hadoop/ directory.
- Certificate key file
- Certificate file
- CA certificate
Configure the keys in LDAP with the following commands:
Now, enable SSL in LDAP by editing the /etc/sysconfi/ldap file and set SLAPD_LDAPS=yes:
Use the following commands to generate keystore. You will be prompted to provide a password for the keystore.
Edit the configuration files for Presto in EMR.
Create tables for Presto in the Hive metastore
Now that you have a running EMR cluster with Presto and LDAP set up, you can load some sample data into the cluster for analysis. Use the same CloudFront log sample data set that is available for Athena. The following SQL query creates a table in EMR and loads the sample data set into it:
Try to query the data using the Presto CLI with the following commands:
You should see an output from Presto like the following:
Whitelist the QuickSight IP address range in your EMR master security group rules
Now you’re ready to connect QuickSight to Presto. For QuickSight to connect to Presto, you need to make sure that Presto is reachable by QuickSight’s public endpoints by adding QuickSight’s IP address ranges to your EMR master node security group.
Connect QuickSight to Presto and create some visualizations
If you have not already signed up for QuickSight, you can do so at https://quicksight.aws. QuickSight offers a 1 user and 1 GB perpetual free tier.
After you’re signed up for QuickSight, navigate to the New Analysis page and the New Data Set page. You see the new Presto and Spark connector as in the following screenshot.
Open the Presto connector, provide the connection details in the modal window, and choose Create data source.
Select the default schema and choose the cloudfront_logs table that you just created.
In QuickSight, you can choose between importing the data in SPICE for analysis or directly querying your data in Presto. SPICE is an in-memory optimized columnar engine in QuickSight that enable fast, interactive visualization as you explore your data. For this post, choose to import the data into SPICE and choose Visualize.
In the analysis view, you can see the notification that shows import is complete with 4996 rows imported. On the left, you see the list of fields available in the data set and below, the various types of visualizations from which you can choose.
QuickSight makes it easy for you to create visualizations and analyze data with AutoGraph, a feature that automatically selects the best visualization for you based on selected fields.
To create a visualization, select the fields on the left panel. In this case, look at the number of connections to CloudFront ordered by the various OS types, by selecting the OS field. Additionally, you can select the bytes fields to look at total bytes transferred by OS instead of count.
You just finished creating an EMR cluster, setting up Presto and LDAP with SSL, and using QuickSight to visualize your data. I hope this post was helpful. Feel free to reach out if you have any questions or suggestions.
To learn more about these capabilities and start using them in your dashboards, check out the QuickSight User Guide.
If you have questions and suggestions, you can post them on the QuickSight forum.
Not a QuickSight user
Go to the QuickSight website to get started for FREE.