Ingest Excel data automatically into Amazon QuickSight
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to deliver insights to everyone in your organization. This post demonstrates how to build a serverless data ingestion pipeline that automatically imports frequently changing data into the SPICE (Super-fast, Parallel, In-memory Calculation Engine) dataset behind your Amazon QuickSight dashboards.
It is sometimes difficult to be agile in BI development. For example, end-users who perform self-service analytics may want to add their own ad hoc data to an existing dataset and see the corresponding updated dashboards and reports in a timely fashion. However, dashboards and reports are usually built on top of a single online analytic processing (OLAP) data warehouse with a rigid schema, so an end-user who doesn't have permission to update the dataset directly has to go through a complicated and time-consuming procedure to get their data into the warehouse. Alternatively, they could open a ticket for you to edit the dataset manually, but that is still inconvenient and involves significant repetitive manual effort, especially if they need to update the data frequently.
Therefore, an automated data processing tool that can perform real-time data ingestion is very useful. This post discusses a tool that, when an end-user uploads Excel files into Amazon S3 or any other data file sharing location, performs the following end-to-end process:
- Cleans the raw data from the Excel files, which might contain a lot of formatting and redundant information.
- Ingests the cleaned data.
- Performs a status check to monitor the data cleaning and ingestion process.
- Sends a notification of the results to the end-user and BI development team.
With the recently launched feature cross data source joins, you can join across all data sources that Amazon QuickSight supports, including file-to-file, file-to-database, and database-to-database joins. For more information, see Joining across data sources on Amazon QuickSight.
In addition to cross data source joins, Amazon QuickSight has also launched new APIs for SPICE ingestion. For more information, see Importing Data into SPICE and Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities.
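As a quick orientation, starting a SPICE refresh through this API is a single boto3 call. The following is a minimal sketch; the account ID and dataset ID are placeholders rather than values from this post:

```python
import uuid

import boto3

quicksight = boto3.client('quicksight')

# Placeholders: substitute your AWS account ID and SPICE dataset ID.
response = quicksight.create_ingestion(
    AwsAccountId='111122223333',
    DataSetId='your-spice-dataset-id',
    IngestionId=str(uuid.uuid4()),  # every refresh needs a unique ingestion ID
)
print(response['IngestionStatus'])  # e.g. INITIALIZED
```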
This post shows how you can combine these features to build an agile solution that cleans and ingests an Excel file into a SPICE dataset of Amazon QuickSight automatically. In SPICE, the real-time data from Excel joins with the Amazon Redshift OLAP data warehouse, and end-users receive Amazon SNS messages about its status throughout the process.
Solution overview
The following diagram illustrates the workflow of the solution.
The workflow includes the following steps:
- An end-user uploads an Excel file into an on-premises shared folder.
- The Excel files upload to the Amazon S3 bucket excel-raw-data. Alternatively, the end-user can skip this step and upload the Excel file into this Amazon S3 bucket directly.
- This upload event triggers the SNS message Excel-Raw-Data-Uploaded.
- Both the end-user and the BI team receive a message about the new upload event.
- The upload event also triggers the AWS Lambda function DataClean to process the Excel data.
- The Lambda function removes the formatting and redundant information of the Excel file, saves the cleaned data as a CSV file into the S3 bucket autoingestionqs, and publishes an SNS message to notify end-users about the data cleansing status.
- This cleaned CSV file is mapped as an Amazon Athena table.
- In the Amazon QuickSight SPICE dataset, this table joins with the Amazon Redshift table through the cross data source join functionality.
- The CSV file creation event in the S3 bucket autoingestionqs triggers the Lambda function qsAutoIngestion.
- This function calls the data ingestion API of Amazon QuickSight and checks the data ingestion status.
- When the data ingestion is complete, end-users receive the Ingestion-Finished SNS message.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- The following AWS services: Amazon QuickSight, Amazon Athena, AWS Lambda, Amazon SNS, and Amazon S3
- Basic knowledge of Python
Creating resources
Create your resources by launching the provided AWS CloudFormation stack.
During the stack creation process, you have to provide a valid email address as the endpoint of Amazon SNS services. After the stack creation is successful, you have three SNS topics, two S3 buckets, and the corresponding IAM policies.
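If you prefer to script the launch rather than use the console, a boto3 sketch like the following works; the stack name, template URL, and email parameter key are placeholders (check the template for the actual parameter name):

```python
import boto3

cloudformation = boto3.client('cloudformation')

# Placeholders: the template URL and parameter key are assumptions; check the
# stack template for the real parameter name of the SNS email endpoint.
cloudformation.create_stack(
    StackName='quicksight-auto-ingestion',
    TemplateURL='https://s3.amazonaws.com/your-bucket/auto-ingestion.yaml',
    Parameters=[
        {'ParameterKey': 'EmailAddress', 'ParameterValue': 'you@example.com'},
    ],
    Capabilities=['CAPABILITY_IAM'],  # the stack creates IAM policies
)
```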
Walkthrough
To implement this solution, complete the following steps:
- Enable SNS notification of new object creation events in the S3 bucket excel-raw-data. For more information, see How Do I Enable and Configure Event Notifications for an S3 Bucket? When an end-user uploads an Excel file into the excel-raw-data S3 bucket, the event triggers an Amazon SNS message. The following screenshot shows the example Excel file that this post uses. The following screenshot shows the SNS message Excel-Raw-Data-Uploaded, which includes details of the upload event.
- Download the sample code DataClean.py in Python 3.7 from the GitHub repo.
- Create a Lambda function named DataClean.
- Configure the function to be a subscriber of the SNS topic Excel-Raw-Data-Uploaded.
- Edit the SNS topic Cleaning-is-Done, and add a statement like the following sketch to the access policy. The policy allows the Lambda function DataClean to publish the SNS message Cleaning-is-Done.
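This is a representative policy statement only; the account ID, Region, and execution role name are placeholders to replace with values from your own environment:

```json
{
  "Sid": "AllowDataCleanToPublish",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:role/DataClean-execution-role"
  },
  "Action": "sns:Publish",
  "Resource": "arn:aws:sns:us-east-1:111122223333:Cleaning-is-Done"
}
```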
The function DataClean saves the CSV file of the cleaned data into the S3 bucket autoingestionqs. You should see the new CSV file in this bucket, as in the following screenshot. When the Lambda function ends, it triggers the SNS message Cleaning-is-Done. The following screenshot shows the text of the notification message.
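The full cleaning logic is in DataClean.py in the GitHub repo. The following is a minimal sketch of the same idea, not the actual implementation; it assumes the raw workbook's header starts after two rows of formatting, and uses the bucket and topic names from this post with a placeholder account ID:

```python
import json
import urllib.parse

import boto3
import pandas as pd  # packaged with the function or supplied via a Lambda layer

s3 = boto3.client('s3')
sns = boto3.client('sns')

# Assumptions: placeholder account ID; topic and bucket names from this post.
CLEANING_DONE_TOPIC_ARN = 'arn:aws:sns:us-east-1:111122223333:Cleaning-is-Done'
TARGET_BUCKET = 'autoingestionqs'

def lambda_handler(event, context):
    # The function subscribes to Excel-Raw-Data-Uploaded, so the S3 event
    # arrives wrapped inside an SNS message.
    s3_event = json.loads(event['Records'][0]['Sns']['Message'])
    record = s3_event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])

    # Download the raw Excel file; skiprows=2 is an assumption about where
    # the real header row starts in the sample workbook.
    local_path = '/tmp/' + key.split('/')[-1]
    s3.download_file(bucket, key, local_path)
    df = pd.read_excel(local_path, skiprows=2)

    # Drop fully empty rows and columns left over from spreadsheet formatting.
    df = df.dropna(how='all').dropna(axis=1, how='all')

    # Write the cleaned data as CSV into the bucket that feeds Athena/QuickSight.
    csv_path = '/tmp/cleaned.csv'
    df.to_csv(csv_path, index=False)
    s3.upload_file(csv_path, TARGET_BUCKET, key.rsplit('.', 1)[0] + '.csv')

    # Notify subscribers that cleaning finished.
    sns.publish(
        TopicArn=CLEANING_DONE_TOPIC_ARN,
        Subject='Cleaning-is-Done',
        Message=f'Cleaned {key} and saved the CSV into {TARGET_BUCKET}.',
    )
```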
- Add an event notification to the S3 bucket autoingestionqs to trigger a Lambda function named qsAutoIngestion. This function calls the Amazon QuickSight data API to ingest data into the SPICE dataset.

The cleaned CSV file in the S3 bucket autoingestionqs is mapped as an Athena table. The following screenshot shows the sample data in the CSV file. In the Amazon QuickSight SPICE dataset, the Athena table joins with the Amazon Redshift table through the cross data source join functionality.
- Create the SPICE dataset. For more information, see Joining across data sources on Amazon QuickSight.

The following screenshot shows the Data page in Amazon QuickSight, where your dataset details appear. The Athena table joins a Redshift table. The new object creation event in the Amazon S3 bucket autoingestionqs triggers the Lambda function qsAutoIngestion. This function calls the data ingestion API of Amazon QuickSight and checks the data ingestion status. If the data ingestion is completed successfully, end-users receive the SNS message Ingestion-Finished. You can download the sample code of qsAutoIngestion from the GitHub repo.
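For reference, the following minimal sketch shows the shape of such a function: start a SPICE refresh, poll the ingestion status until it reaches a terminal state, and publish the result. The account ID, dataset ID, and topic ARN are placeholders; the authoritative implementation is the qsAutoIngestion code in the repo.

```python
import time
import uuid

import boto3

quicksight = boto3.client('quicksight')
sns = boto3.client('sns')

# Assumptions: placeholder account ID, dataset ID, and topic ARN; the real
# values come from your environment and the CloudFormation stack outputs.
AWS_ACCOUNT_ID = '111122223333'
DATASET_ID = 'your-spice-dataset-id'
INGESTION_FINISHED_TOPIC_ARN = 'arn:aws:sns:us-east-1:111122223333:Ingestion-Finished'

def lambda_handler(event, context):
    # Kick off a SPICE refresh; each ingestion needs a unique ID.
    ingestion_id = str(uuid.uuid4())
    quicksight.create_ingestion(
        AwsAccountId=AWS_ACCOUNT_ID,
        DataSetId=DATASET_ID,
        IngestionId=ingestion_id,
    )

    # Poll until the ingestion reaches a terminal state. Note: this simple
    # loop must fit within the Lambda timeout for long refreshes.
    while True:
        response = quicksight.describe_ingestion(
            AwsAccountId=AWS_ACCOUNT_ID,
            DataSetId=DATASET_ID,
            IngestionId=ingestion_id,
        )
        status = response['Ingestion']['IngestionStatus']
        if status in ('COMPLETED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(10)

    # Tell subscribers how the ingestion ended.
    sns.publish(
        TopicArn=INGESTION_FINISHED_TOPIC_ARN,
        Subject='Ingestion-Finished',
        Message=f'SPICE ingestion {ingestion_id} finished with status {status}.',
    )
```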
Cleaning up
To avoid incurring future charges, delete the resources you created: the two Lambda functions, three SNS topics, two S3 buckets, and corresponding IAM policies.
Conclusion
This post discussed how BI developers and architects can use the Amazon QuickSight data API, Lambda functions, and other AWS services to complete an end-to-end automation process. End-users can have their real-time data ingested and joined with OLAP data warehouse tables, and can visualize their data in a timely fashion, without waiting for nightly or hourly ETL jobs or needing to understand complex technical development steps. You should now be fully equipped to construct a solution in a development environment and demo it to non-technical business end-users.
About the Author
Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.