
Accelerate self-service analytics by integrating on-premises and third-party data with AWS Data Exchange and Dremio

Businesses face growing volumes and types of customer data, but privacy rules often limit how that data can be collected, stored, and analyzed. Using AWS Data Exchange and Dremio, customers can now join and query data across sources, enriching business data with relevant third-party information. This helps organizations move from descriptive to diagnostic analytics and enables a 360-degree customer view.

AWS Data Exchange lets Amazon Web Services (AWS) customers securely exchange file-based datasets, access Amazon Redshift datashares, and call data APIs in the AWS Cloud. Subscribers can find and access thousands of products from qualified data providers. Data managed via Amazon Simple Storage Service (Amazon S3) is readily available across various AWS analytics and machine learning (ML) services. Anyone with an AWS account can be an AWS Data Exchange subscriber.

Dremio is a member of the AWS Partner Network (APN) and is available in AWS Marketplace. Dremio Software is an open data lakehouse platform that can be deployed on AWS and provides self-service access to a broad range of data consumers, from business analysts to data scientists, who can access data on Amazon S3, databases, data warehouses, and data lakes.

In this blog post, we explore how businesses can use Dremio to integrate third-party data from AWS Data Exchange with an on-premises, Hive-compatible data source without moving or copying data. We also demonstrate how customers can use the consolidated data for business intelligence (BI) and exploratory analytics.

Prerequisites

  1. An AWS account with appropriate IAM roles and permissions.
  2. Using your AWS account, in AWS Marketplace, subscribe to Dremio (Dremio AWS – Community Edition).
    1. To configure the software, choose Continue to Subscribe, and then Continue to Configuration.
    2. Select the default options for Fulfillment Option, Software Version, and Region.
    3. For the Action, choose Launch Configuration to launch the AWS CloudFormation stack in your AWS account.
  3. An on-premises data warehouse with data cataloged using an Apache Hive metastore.
  4. Subscribe to the Datamam Eventbrite Events Dataset (Sample Data) on AWS Data Exchange.
  5. Deploy Tableau Desktop by downloading a free trial. In this example, we use Tableau Desktop version 2020.4 or later.

Solution overview

The following architecture diagram shows how Dremio integrates with AWS services such as AWS Data Exchange, Amazon S3, and Apache Hive to visualize the results with a BI application such as Tableau.

Figure 1: Dremio integration with AWS Data Exchange, Amazon S3, an on-premises data warehouse, and BI applications

First, subscribe to the Eventbrite Events Dataset product from Datamam in your AWS account using AWS Data Exchange, and export the data into your Amazon S3 bucket. After deploying Dremio Community Edition as outlined in the prerequisites, connect Dremio to Amazon S3 and to your on-premises Hive metastore and create a view joining both datasets. Finally, create a BI dashboard with Tableau to derive insights from the view stored in Dremio.

Solution walkthrough

Step 1: Subscribe to a third-party data product via AWS Data Exchange in your AWS account

In your AWS account, associate the AWSDataExchangeSubscriberFullAccess IAM policy with your IAM user or role to perform the following steps. Note that we use the eu-central-1 Region for this section.

  1. Open the AWS Data Exchange console. From the left navigation pane, choose Browse catalog. From the search bar, enter Eventbrite Events Dataset (Sample Data) and press Enter. Choose the product to view its detail page.
  2. On the product detail page, choose Continue to Subscribe on the top right corner.
  3. Choose whether to activate auto-renewal for the subscription. Review the offer details, including the data subscription agreement (DSA). Note that the dataset is available in eu-central-1 (Frankfurt) Region. Review the pricing information, choose the pricing offer and, if you and your organization agree to the DSA, pricing, and support information, choose Subscribe.
  4. Once the subscription has successfully completed, you will be able to see the product in the Subscriptions section.
  5. Create an Amazon S3 bucket (for example, dataexchange-dremio-poc) in the same AWS Region (eu-central-1), to which the subscribed data will be exported.
  6. Open the Entitled data section of the AWS Data Exchange console in the Region that hosts the dataset. Choose the Eventbrite Events Dataset (Sample Data) product, and then choose the dataset of the same name.
  7. Scroll down, select the latest revision, and then choose Export to Amazon S3. You will see that the Simple option is pre-selected. This option has a default path of {Revision.CreatedAt}/{Asset.Name}. For this post, let’s choose the Advanced option.
    1. Under Select Amazon S3, choose the Amazon S3 bucket where you want your data to be exported.
    2. Select a Dynamic reference and choose Append to create your key naming pattern. You can find documentation on key naming patterns and dynamic references in the AWS Data Exchange documentation. Note that you must include either Asset.Name or Asset.Id as part of your key naming pattern.
    3. Set the encryption options that you want on the data.
    4. Review the Amazon S3 pricing section to determine if cross-Region data transfer charges will be applied to your export. Choose Export to complete your setup.
    5. Navigate to the Amazon S3 bucket and verify that the Eventbrite Events Dataset has been exported.

Step 2: Add sample data to your on-premises Hive metastore

Run the following script to create sample data on your on-premises Hive metastore.

CREATE TABLE customers (name VARCHAR(100), zip INT);
INSERT INTO customers VALUES ('Matt Park', 32805);
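
To confirm the row was written, you can run a quick check in the same Hive session (a minimal verification step that is not part of the original script):

-- Verify that the sample row exists in the on-premises Hive table
SELECT name, zip FROM customers;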

Step 3: Sign in to Dremio and connect to Amazon S3 and Hive

Using your Dremio Community Edition deployment, connect Dremio to your Amazon S3 data lake storage, which holds the Eventbrite Events Dataset (Sample Data), and to Hive, which holds the customer information. You can then query the physical datasets and create views by joining the data sources as required.

In your AWS account, create an IAM role with a name of your choosing (for example, dremio-adx-hive-user), attach the required IAM policies, and note the ARN of the role.

Navigate to the bottom left corner of the Dremio home page and select the + icon next to Object Storage. To configure access to Amazon S3, select the Amazon S3 option and enter the following information under General:

  • Name: s3-dataexchange
  • Authentication method: Select Data Source Credentials
  • Choose an IAM Role: Enter the Role ARN you noted previously.

In the bottom left corner under Object Storage, select the data connection to verify that Dremio is connected to Amazon S3. You will see a list of the Amazon S3 buckets you have access to, including the bucket that contains the AWS Data Exchange data and the exported file (for example, eventbrite_events_2023-08-02.json or the latest file there).

Once you select the dataset, a query window will open to let you run a SQL query on the data in the Amazon S3 bucket.

From the query execution window, you can run a SELECT * query. Refer to the following example. (The path and file name may differ depending on your Amazon S3 configuration.)

SELECT * FROM "s3-dataexchange"."66d13e330a60e2c0cad79f3a932fae51".ff257e98e489d6b7d3e3649f15433978."eventbrite_events_2023-08-02.json";

Figure 2: Dremio query interface depicting the query and results
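
If you are not sure of the exact folder path under the source (the folder names above come from the export's key naming pattern), one option is to list the datasets Dremio can see by querying its information schema. This is a sketch and assumes the exported JSON file has already been promoted to a dataset:

-- List datasets under the s3-dataexchange source (requires the file to be formatted as a dataset)
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA."TABLES"
WHERE TABLE_SCHEMA LIKE 's3-dataexchange%';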

Similarly, you can configure your Hive connection by adding a Hive source and entering the required information. For more information, refer to connecting a Hive data source to Dremio.

Once successfully configured, you can see the customers table inside your Hive source.

When you select the customers table, it launches a query window, where you can run this query to verify the data:

SELECT * FROM hive.customers

Step 4: Create a view by joining data in Amazon S3 and Hive

Let’s write a query to join the on-premises customers table from Hive with the third-party data in Amazon S3. This query indicates which events occurred in the towns where customers live. Because postal_code in the Eventbrite data is a string while zip in the Hive table is an integer, the query casts zip to a string with TO_CHAR before joining. Adjust the query to reflect the path and file names of your AWS Data Exchange data in your Amazon S3 bucket.

SELECT events.event_category, events.datetime_local, events.title, events.postal_code
FROM "s3-dataexchange"."66d13e330a60e2c0cad79f3a932fae51".ff257e98e489d6b7d3e3649f15433978."eventbrite_events_2023-08-02.json" AS events
JOIN hive.customers AS customers ON events.postal_code = TO_CHAR(customers.zip, '#')
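
As an optional follow-up on the same join (using the same paths as above), you can aggregate the result, for example to count the matching events for each customer:

-- Count events occurring in each customer's postal code (optional exploration)
SELECT customers.name, COUNT(*) AS event_count
FROM "s3-dataexchange"."66d13e330a60e2c0cad79f3a932fae51".ff257e98e489d6b7d3e3649f15433978."eventbrite_events_2023-08-02.json" AS events
JOIN hive.customers AS customers ON events.postal_code = TO_CHAR(customers.zip, '#')
GROUP BY customers.name;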

With the join query open in the query window, choose the dropdown at the top right, select Save View as, give the view an appropriate name (for example, orlando_event), and save it.
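
If you prefer to script this step instead of using the UI, Dremio also supports creating views with SQL. The following sketch assumes a space named analytics exists in your Dremio instance; the space name and dataset path are assumptions, so adjust them to your environment:

-- Sketch: create the joined view with SQL ("analytics" is an assumed space name)
CREATE VIEW analytics.orlando_event AS
SELECT events.event_category, events.datetime_local, events.title, events.postal_code
FROM "s3-dataexchange"."66d13e330a60e2c0cad79f3a932fae51".ff257e98e489d6b7d3e3649f15433978."eventbrite_events_2023-08-02.json" AS events
JOIN hive.customers AS customers ON events.postal_code = TO_CHAR(customers.zip, '#');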

Step 5: Creating a BI dashboard with Tableau

For this section, use the free trial of Tableau Desktop that you downloaded as part of the prerequisites. In this example, we use Tableau Desktop version 2020.4 or later.

To connect Tableau with Dremio, refer to Connecting Client applications to Dremio.

Now that you have created and saved the view in Dremio, you can open it in Tableau to create a BI dashboard.

  1. Edit the view you created by choosing the pen icon to the right of the table listing on the Dremio dashboard.
  2. Choose the Tableau logo in the upper right corner.
  3. Open the downloaded .tds file to launch Tableau.
  4. Enter your Dremio credentials to establish a connection directly to Dremio.

Figure 3: Visualizing results in a Tableau dashboard

You can now run your BI dashboard directly against the data in your data lake using Dremio. For better performance, enable aggregate reflections on your view so that queries from your BI dashboards stay fast and near real time.
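
As a rough sketch of what enabling an aggregate reflection can look like in SQL (the exact syntax varies by Dremio version, and the reflection name, view path, and column choices below are assumptions; reflections can also be enabled from the dataset settings in the Dremio UI):

-- Sketch: define an aggregate reflection on the saved view (adjust path, name, and columns)
ALTER DATASET analytics.orlando_event
CREATE AGGREGATE REFLECTION event_counts
USING DIMENSIONS (event_category, postal_code) MEASURES (title (COUNT));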

Connecting Tableau using a personal access token (for non-OAuth setups)

  1. Choose Account settings, which is accessible in the bottom left corner of the Dremio UI.
  2. Create a PAT (personal access token).
  3. Open Tableau. In the Connect menu on the left, under To a Server, choose More… and then select Dremio from the list.
  4. Select the version of Dremio that you are using and select PAT/token authentication. Enter your credentials, using the PAT you generated earlier as your password, and choose Submit to establish the connection.
  5. Select your data source table and work with Tableau normally from there with a live connection.

Cleanup

As part of this tutorial, Amazon EC2 instances were deployed through the Dremio CloudFormation stack and an Amazon S3 bucket was created in your AWS account. Remove these resources to prevent further charges.

You may delete the stack on the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI).

Conclusion

AWS Data Exchange and Dremio’s open data lakehouse platform provide businesses with a toolkit for accessing, integrating, and analyzing data. The ability to leverage third-party data through AWS Data Exchange, combined with on-premises and cloud-based sources, offers significant opportunities for businesses to enhance their customer data. Whether you are a business analyst or a data scientist, AWS and Dremio can transform your data strategy.

For more information on how to get started, refer to the Dremio documentation. You can also learn more about Dremio in AWS Marketplace.

About the Authors

Sunil Ramachandra

Sunil is a Senior Solutions Architect enabling hyper-growth SaaS ISVs to innovate and accelerate on AWS. He partners with customers to design highly scalable and resilient cloud architectures. When not collaborating with customers, Sunil enjoys spending time with family, running, meditating, and watching Prime Video.

Sathisan Vannadil

Sathisan Vannadil is a Senior Partner Solutions Architect at Amazon Web Services (AWS). His primary focus is on helping independent software vendor (ISV) partners design and build solutions at scale on AWS. Prior to AWS, Sathisan held diverse technical positions and has over 20 years of experience in the field of data and analytics.

Chidambaram Muthappan

Chidambaram Muthappan is a Specialist Solutions Architect at AWS, based in Seattle, Washington. He works with enterprise customers on designing resilient end-to-end data architectures and on launching third-party data and machine learning products in AWS Marketplace.

Alex Merced

Alex Merced is a developer advocate at Dremio. In his role at Dremio, Alex hosts the weekly podcast “Gnarly Data Waves,” creates technical content, and speaks at events about data lakehouses and related technologies. Alex is a co-author of O’Reilly’s “Apache Iceberg: The Definitive Guide.”