AWS Business Intelligence Blog
Enhance data governance through column-level lineage in Amazon QuickSight
Amazon QuickSight is a cloud-powered, serverless, and embeddable business intelligence (BI) service that makes it easy to deliver insights to everyone in your organization. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that can then be accessed from any device and embedded into your applications, portals, and websites.
A QuickSight dataset can be reused across many analyses and dashboards. One advantage of creating centralized, reusable datasets is that changes to the data within a dataset will be applied to multiple dashboards. This reduces the maintenance efforts compared to managing individual datasets for each dashboard.
Nevertheless, it is important to consider the potential effects on connected analyses and dashboards when a dataset is modified. It is therefore crucial to understand how datasets and data fields are used across analyses and dashboards before making schema changes. Tracking column-level lineage provides a clear view of each column’s path through different parts of QuickSight.
This insight helps optimize data processing, improve query performance, ensure accuracy, and meet regulatory requirements. A good data lineage solution shows exactly how data moves and is used, helping organizations make better decisions. It also helps identify and remove unnecessary data fields from QuickSight SPICE, speeding up data ingestion, saving space, and simplifying data pipelines. Better visibility into data lineage leads to improved data governance and regulatory compliance.
In this post, we explore how to create a simple serverless architecture using AWS Lambda, Amazon Athena, and QuickSight to establish column-level lineage.
Solution Overview
This solution uses QuickSight along with the following AWS services to prepare data for analysis.
- AWS Lambda – A serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. You can trigger Lambda from over 200 AWS services and software as a service (SaaS) applications, and only pay for what you use.
- Amazon Simple Storage Service (Amazon S3) – An object storage service that offers industry-leading scalability, data availability, security, and performance.
- AWS Glue Data Catalog – A centralized repository that stores metadata about your organization’s datasets. It acts as an index to the location, schema, and runtime metrics of your data sources. The metadata is stored in metadata tables, where each table represents a single data store.
- Amazon Athena – A serverless, interactive analytics service built on open source frameworks that supports open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.
The following architecture diagram has four main steps, which map to a standard extract, transform, load (ETL) process. If your use case calls for replacing any of these services with one that offers similar functionality, that is entirely possible.
- Invoke QuickSight APIs – The Lambda function invokes API calls to request information for all assets in the QuickSight account without filtering. The APIs invoked are describe-analysis-definition, describe-data-set, and describe-dashboard-definition, which extract details of analyses, datasets, and dashboards in JSON format.
- Write response as .csv – After extracting QuickSight analysis, dataset, and dashboard details in JSON format, Lambda functions extract required fields, such as ColumnIdentifier and DatasetIdentifier, from these JSON objects and parse them into .csv format. These .csv files are then stored in an S3 bucket in separate folders named analyses_details, datasets_details, and dashboards_details.
- Create tables from .csv – Once the .csv files are stored in the S3 bucket, the next step involves creating AWS Glue Data Catalog tables that point to the .csv files. These tables are created through data definition language (DDL) statements using Athena.
- Create a consolidated view – Finally, a consolidated Athena view is created. This view joins all three tables together and includes logic to define column usage in QuickSight. It has multiple common table expressions (CTEs) to create a column-level lineage view for QuickSight.
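As a rough illustration of the first two steps, the following Python sketch flattens a describe-data-set API response into .csv rows. The actual Lambda code in the repository may differ; the helper name and sample response are ours, though the DataSet.OutputColumns shape (with Name and Type per column) follows the QuickSight API.

```python
import csv
import io

def dataset_columns_to_csv(response: dict) -> str:
    """Flatten a QuickSight describe-data-set response into .csv rows of
    (DatasetIdentifier, ColumnName, ColumnType)."""
    dataset = response["DataSet"]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["DatasetIdentifier", "ColumnName", "ColumnType"])
    for col in dataset.get("OutputColumns", []):
        writer.writerow([dataset["DataSetId"], col["Name"], col["Type"]])
    return buf.getvalue()

# Sample response trimmed to only the fields used above (illustrative)
sample = {
    "DataSet": {
        "DataSetId": "sales-dataset",
        "OutputColumns": [
            {"Name": "order_id", "Type": "STRING"},
            {"Name": "revenue", "Type": "DECIMAL"},
        ],
    }
}
print(dataset_columns_to_csv(sample))
```

In the real pipeline, the resulting string would be uploaded to the datasets_details folder of the S3 bucket; the same flattening pattern applies to the analysis and dashboard responses.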
Prerequisites
Before proceeding with this walkthrough, make sure you have the following:
- AWS Account – An active AWS account is required. If you don’t have one yet, you can sign up on the AWS console.
- QuickSight subscription – Obtain a subscription to QuickSight
- AWS CLI Installation – Install the AWS Command Line Interface (AWS CLI) by following these instructions
- Terraform installation – Make sure Terraform is installed on your local machine. You can install by following these instructions
- AWS Identity and Access Management (IAM) permissions – Make sure you have the necessary IAM permissions required to create AWS resources using Terraform
Optionally, you may also want to consider:
- Makefile – Having the make command available is not mandatory, but it is recommended. The target commands within the provided Makefile can be executed independently
- Checkov – Install Checkov to analyze your Terraform configurations for security and compliance issues. Use the following command to install Checkov using pip:
$ pip install checkov
Deployment
To simplify setting up the prerequisites, we have enabled deployment using Terraform. The resources being deployed are:
- AWS Lambda functions
- AWS Lambda layer for utils
- AWS Glue database
- AWS Glue catalog tables:
  - Dashboard details
  - Analysis details
  - Dataset details
- S3 bucket for storing Lambda results
- Necessary Lambda execution IAM roles
Clone the GitHub repository
Clone the GitHub repository to your local machine. This repository contains all the code needed for the walkthrough. Use the following command to clone the repository.
$ git clone git@github.com:aws-samples/column-level-lineage-for-amazon-quicksight.git
Terraform deployment
Navigate to the infrastructure directory to initialize and deploy the AWS resources:
$ cd infrastructure
To use the makefile provided in the repository:
- Terraform Initialize
$ make init
- Terraform Plan
$ make plan
- Terraform Apply
$ make apply
To initialize and deploy without using the provided makefile:
- Terraform Initialize
$ terraform init
- Terraform Plan
$ terraform plan
- Terraform Apply
$ terraform apply
Create query in Athena
- In the Athena console, choose Query editor.
- Under Saved queries, find the query named qs_blog-column-level-lineage-view. Choose the query, as shown in the following screenshot.
- Choose Run in the Query editor to create the view.
- When the query has completed, it creates a view in Athena named qs_column_level_lineage_view. This view comprises the QuickSight dataset, column name, column type, and the related analysis and dashboard, as shown in the following screenshot.
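Conceptually, the view left-joins each dataset column against the analyses and dashboards that reference it. The following plain-Python analogue sketches that join logic; the sample rows and names are illustrative, not the view's actual DDL.

```python
# Rows as they might land in the three Data Catalog tables (illustrative)
datasets = [
    ("sales-dataset", "revenue"),
    ("sales-dataset", "order_id"),
]
analyses = [
    ("sales-analysis", "sales-dataset", "revenue"),
]
dashboards = [
    ("sales-dashboard", "sales-dataset", "revenue"),
]

# Analogue of the view's left joins: every dataset column, with any
# analysis or dashboard that references it (empty lists mean unused)
lineage = []
for ds_id, col in datasets:
    in_analyses = [a for a, d, c in analyses if (d, c) == (ds_id, col)]
    in_dashboards = [b for b, d, c in dashboards if (d, c) == (ds_id, col)]
    lineage.append((ds_id, col, in_analyses, in_dashboards))

for row in lineage:
    print(row)
```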
Potential use cases
The following use cases are some examples of where this solution can be used for process, storage, and cost optimization.
Identify used columns in your QuickSight datasets
When ingesting data into SPICE, a best practice is to limit datasets to only the columns used in your dashboards. This practice optimizes storage costs and improves SPICE ingestion speeds because fewer columns need to be ingested and stored on each refresh. Additionally, selecting fewer columns in the Athena query reduces data scanning costs and lowers storage costs for Athena query results.
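With the lineage view in hand, finding removal candidates reduces to a set difference between the columns a dataset contains and the columns the view reports as referenced. A minimal sketch, with made-up column names for illustration:

```python
def unused_columns(dataset_columns, used_columns):
    """Columns present in a SPICE dataset but referenced by no analysis
    or dashboard according to the lineage view."""
    return sorted(set(dataset_columns) - set(used_columns))

# Illustrative inputs: all columns in SPICE vs. columns the view reports as used
spice_columns = ["order_id", "revenue", "internal_notes", "etl_batch_id"]
referenced = ["order_id", "revenue"]

print(unused_columns(spice_columns, referenced))  # candidates for removal
```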
Create an impact analysis for a specific column
Using this view helps build an impact analysis of the columns in the dataset. Because dataset schemas often change, it’s crucial to know which columns are used in which datasets in order to understand the impact analysis of a schema change.
Typically, it would require enormous manual effort to identify usage of each column from a dataset in QuickSight dashboards or analyses. The column-level lineage view makes it straightforward to find out the usage pattern of the columns and document impact analysis. This decreases maintenance efforts and avoids potential manual errors, resulting in decreasing unexpected downtime in the dashboards.
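An impact analysis for a single column then becomes a simple reverse lookup over the lineage rows. The sketch below uses hypothetical row tuples and names; the real view exposes the same relationships as query results.

```python
# Lineage rows as (dataset, column, dashboard) tuples -- illustrative data
lineage_rows = [
    ("sales-dataset", "revenue", "exec-dashboard"),
    ("sales-dataset", "revenue", "finance-dashboard"),
    ("sales-dataset", "order_id", "ops-dashboard"),
]

def impacted_dashboards(rows, dataset, column):
    """Dashboards affected if the given (dataset, column) changes."""
    return sorted({dash for ds, col, dash in rows if (ds, col) == (dataset, column)})

print(impacted_dashboards(lineage_rows, "sales-dataset", "revenue"))
```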
The following image is an example of a dashboard that can be built using the Athena view qs_column_level_lineage_view as the data source. The summary sheet shows the number of analyses and dashboards powered by each dataset. Users can see how many and which dashboards will be impacted if a specific dataset is updated.
The following image of the dashboard shows the Details sheet, where you can observe the lineage between analysis and column or dashboard and column.
You can also drill down from the dashboard to see which datasets and columns are used in each analysis and dashboard.
Data integrity and data quality
In a complex enterprise setup where many business users can build dashboards, it can be difficult to maintain data integrity because users may misinterpret fields while building dashboards. With data lineage, administrators and data curators can have better visibility to identify the data fields and their usage. Data governance teams can validate and ensure that appropriate data fields are used to maintain data integrity across the BI landscape.
Cleanup
It’s important to clean up the resources to avoid incurring unnecessary costs. Follow these steps to clean up:
- Manually delete the data stored in the S3 bucket.
- Delete other AWS resources using Terraform:
To use the makefile provided in the repository:
$ make destroy
To clean up without using the provided makefile:
$ cd infrastructure
$ terraform destroy
These commands will delete all AWS resources created during the deployment, including Lambda functions, AWS Glue databases, catalog tables, and IAM roles.
By following these steps, you can verify that all resources are properly cleaned up and unnecessary costs are avoided.
Conclusion
The solution provided in this post helps organizations improve operational efficiency, storage and cost management, and data governance in QuickSight. With this solution, you can quickly see how data is connected and used across various projects, making it easier to fix issues and improve data processes. This means better, richer, and accurate data insights with less effort, so you can make informed decisions. To better manage QuickSight dashboards and data refreshes, you can refer to Monitor your Amazon QuickSight deployments using the new Amazon CloudWatch integration.
If you have any questions or feedback, please leave a comment. For additional discussions and help getting answers to your questions, check out the QuickSight Community.
About the Authors
Selman Ay is a Data Architect in the AWS Professional Services team. He has worked with customers from various industries such as ecommerce, pharma, automotive, and finance to build scalable data architectures and generate insights from the data. Outside of work, he enjoys playing tennis and engaging in outdoor activities.
Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for building in the cloud and DevOps. In his free time, he enjoys contributing to open source projects.
Roy Yung is a Specialist Solutions Architect for Amazon QuickSight. Roy has over 10 years of experience implementing enterprise business intelligence (BI) solutions. Prior to AWS, Roy delivered BI and data platform solutions in the insurance, banking, aviation, and retail industries.
Srikanth Baheti is a Specialized World Wide Principal Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications, highly scalable and maintainable data pipelines for reporting platforms using AWS services and Serverless computing.