AWS Big Data Blog
Automate dataset monitoring in Amazon QuickSight
Amazon QuickSight is an analytics service that you can use to create datasets, perform one-time analyses, and build visualizations and dashboards. In an enterprise deployment of QuickSight, you can have multiple dashboards, and each dashboard can have multiple visualizations based on multiple datasets. Checking the status and latest refresh timestamp of every dataset can quickly become a management overhead.
This post demonstrates how to visualize datasets associated with all the dashboards in your account, with their latest refresh status and refresh time.
Solution overview
The following screenshot illustrates the architecture of the solution.
The architecture includes the following steps:
- You create the datasets and tag them via an AWS Lambda function.
- A second function gets the refresh status from the tagged datasets.
- The function stores the refresh status in Amazon Simple Storage Service (Amazon S3).
- You query the refresh status in Amazon Athena.
- You visualize the refresh status in QuickSight.
A QuickSight deployment can have multiple dashboards, and each dashboard can have multiple datasets associated with it. You can end up having hundreds of datasets. It’s difficult to know if all the underlying datasets are refreshing as required unless you check them manually. QuickSight does send email notifications to the dataset owner when a dataset refresh fails, but this solution goes further and provides a holistic view of all datasets’ refreshes.
The aim is to create a dashboard that monitors the existing datasets and reports their refresh status.
To implement the solution, you must create the following:
- A Lambda execution role for QuickSight.
- A scheduled Lambda function to tag the datasets.
- A scheduled Lambda function to get the last refresh status of the datasets and store it in Amazon S3.
- An external table in Athena on top of the S3 bucket.
- A QuickSight dashboard using Athena as the data source, which provides the datasets’ last refresh status.
This post assumes that you have existing analyses and dashboards with numerous datasets.
Creating a Lambda execution role for QuickSight
Your first step is to create a Lambda execution role that allows the functions to tag resources and to create QuickSight analyses, datasets, and data sources. The role should also be able to describe and update them. The following code is an example role policy (replace the bucket name with the bucket for storing the QuickSight ingestion results):
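A minimal sketch of such a policy, built as a Python dictionary so the bucket name can be substituted in one place. It is an assumption rather than the exact policy from this post: it grants only the read, tagging, Amazon S3 write, and logging actions that the monitoring functions call, so add the QuickSight create and update actions if your role needs them. The helper name build_policy is hypothetical.

```python
import json


def build_policy(bucket_name: str) -> dict:
    """Return an IAM policy document for the monitoring Lambda functions.

    Assumption: read, tag, and write-results permissions only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "QuickSightDatasetAccess",
                "Effect": "Allow",
                "Action": [
                    "quicksight:ListDataSets",
                    "quicksight:DescribeDataSet",
                    "quicksight:DescribeDataSource",
                    "quicksight:ListIngestions",
                    "quicksight:TagResource",
                    "quicksight:UntagResource",
                    "quicksight:ListTagsForResource",
                ],
                "Resource": "*",
            },
            {
                # Needed by the Resource Groups Tagging API lookup.
                "Sid": "TagLookup",
                "Effect": "Allow",
                "Action": ["tag:GetResources"],
                "Resource": "*",
            },
            {
                # Where the refresh-status .csv file is written.
                "Sid": "WriteResults",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                "Sid": "LambdaLogging",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM console.
    print(json.dumps(build_policy("my-quicksight-status-bucket"), indent=2))
```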
Creating a scheduled Lambda function to tag the datasets
The next step is to identify all the datasets required for your dashboard and tag them. It’s easier to do this right after you create the dataset. Complete the following steps:
- On the QuickSight console, choose Manage data.
- Choose your dataset and choose Edit dataset.
- Record the dataset ID from the URL (data-sets/<dataset ID>/prepare).
Alternatively, you can use a Lambda function to find the dataset name and ID. See the following code (replace the AwsAccountID with your ID):
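A minimal sketch of such a function, assuming the boto3 QuickSight client and its list_data_sets call. The helper name list_all_datasets and the account ID 111122223333 are placeholders for illustration. The client is passed in as an argument so the helper can be tried with a stub before it runs against a real account.

```python
def list_all_datasets(quicksight, account_id):
    """Return (name, dataset_id) pairs for every dataset in the account."""
    datasets = []
    kwargs = {"AwsAccountId": account_id}
    while True:
        page = quicksight.list_data_sets(**kwargs)
        for summary in page.get("DataSetSummaries", []):
            datasets.append((summary["Name"], summary["DataSetId"]))
        token = page.get("NextToken")
        if not token:
            return datasets
        # More results are available; request the next page.
        kwargs["NextToken"] = token


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    client = boto3.client("quicksight")
    # Placeholder account ID; replace it with your own.
    for name, dataset_id in list_all_datasets(client, "111122223333"):
        print(name, dataset_id)
```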
The function provides all the datasets in your account. Make sure to record the dataset IDs specific to your dashboard.
- Create your Lambda function.
- Tag the datasets per your individual dashboards. See the following code (use the target dashboard name and ID to create the tagging key, and replace the dataset_ids and account number with your own):
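A minimal sketch of the tagging step, assuming the boto3 QuickSight client and its tag_resource call. The tag key DashboardName follows the key this post uses later for the lookup; using the dashboard name as the tag value, and the helper names dataset_arn and tag_datasets, are assumptions for illustration.

```python
TAG_KEY = "DashboardName"


def dataset_arn(region, account_id, dataset_id):
    """Build the ARN of a QuickSight dataset."""
    return f"arn:aws:quicksight:{region}:{account_id}:dataset/{dataset_id}"


def tag_datasets(quicksight, region, account_id, dataset_ids, dashboard_name):
    """Tag each dataset with the dashboard it belongs to; return the tagged ARNs."""
    tagged = []
    for dataset_id in dataset_ids:
        arn = dataset_arn(region, account_id, dataset_id)
        quicksight.tag_resource(
            ResourceArn=arn,
            Tags=[{"Key": TAG_KEY, "Value": dashboard_name}],
        )
        tagged.append(arn)
    return tagged
```

In a Lambda handler you would pass boto3.client("quicksight") as the first argument, together with your own Region, account number, and dataset IDs.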
You can do this for all your dashboards. The only limitation is that a dataset can be tagged to only one dashboard name per tag key.
If you tag a dataset with the wrong key, you can remove the tag with an untag call; replace the ResourceArn with the specific dataset ARN. See the following code:
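A minimal sketch of the untag call, assuming the boto3 QuickSight client and its untag_resource call. The helper name untag_dataset and the default key are illustrative.

```python
def untag_dataset(quicksight, dataset_arn, tag_key="DashboardName"):
    """Remove a single tag key from a QuickSight dataset."""
    quicksight.untag_resource(ResourceArn=dataset_arn, TagKeys=[tag_key])
```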
Creating a Lambda function to get the last refresh status
The next step is to configure a Lambda function that gets the last refresh status of the tagged datasets and loads it into Amazon S3. You use resourcegroupstaggingapi to get back all the resources with a particular key. For this post, the key is the DashboardName. From the response of the ResourceTagMappingList, you filter out the dataset ID and dataset ARN. You also get the data source ARN and name for each dataset associated with the particular key value. Finally, you list the ingestions for all the datasets and classify them as one of the following:
- Failed – The last refresh failed.
- Did not run within last 24 hours – No ingestion ID in the last 24 hours (the window is configurable). You use this status regardless of whether the most recent run before that window succeeded or failed, which makes sure the datasets adhere to a refresh schedule. For this post, you want the datasets to refresh once a day.
- Error – No ingestion ID for more than 90 days.
See the following code (replace the placeholder text with your specific values):
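A minimal sketch of the lookup and classification logic, assuming the boto3 resourcegroupstaggingapi client (get_resources) and QuickSight client (list_ingestions), both passed in as arguments. The function names, the row field names, and the choice to report the ingestion's own status (for example, COMPLETED) when none of the three cases applies are assumptions. The data source ARN and name lookup described above is left out for brevity.

```python
from datetime import datetime, timedelta, timezone

TAG_KEY = "DashboardName"


def tagged_dataset_ids(tagging, tag_key=TAG_KEY):
    """Return {dataset_id: dashboard_name} for every dataset carrying the tag key."""
    found = {}
    kwargs = {"TagFilters": [{"Key": tag_key}]}
    while True:
        page = tagging.get_resources(**kwargs)
        for mapping in page.get("ResourceTagMappingList", []):
            arn = mapping["ResourceARN"]
            if ":dataset/" not in arn:
                continue  # skip tagged resources that are not datasets
            tags = {t["Key"]: t["Value"] for t in mapping.get("Tags", [])}
            found[arn.split("/", 1)[1]] = tags.get(tag_key, "")
        token = page.get("PaginationToken")
        if not token:
            return found
        kwargs["PaginationToken"] = token


def list_ingestions(quicksight, account_id, dataset_id):
    """Return all ingestions that QuickSight reports for one dataset."""
    ingestions = []
    kwargs = {"AwsAccountId": account_id, "DataSetId": dataset_id}
    while True:
        page = quicksight.list_ingestions(**kwargs)
        ingestions.extend(page.get("Ingestions", []))
        if not page.get("NextToken"):
            return ingestions
        kwargs["NextToken"] = page["NextToken"]


def classify_refresh(ingestions, now, window=timedelta(hours=24)):
    """Map a dataset's ingestion history to a refresh status."""
    if not ingestions:
        return "Error"  # assumption: no history at all counts as the 90-day case
    latest = max(ingestions, key=lambda i: i["CreatedTime"])
    age = now - latest["CreatedTime"]
    if age > timedelta(days=90):
        return "Error"
    if age > window:
        return "Did not run within last 24 hours"
    if latest["IngestionStatus"] == "FAILED":
        return "Failed"
    return latest["IngestionStatus"]


def collect_statuses(tagging, quicksight, account_id, now=None):
    """Build one status row per tagged dataset."""
    now = now or datetime.now(timezone.utc)
    rows = []
    for dataset_id, dashboard in tagged_dataset_ids(tagging).items():
        ingestions = list_ingestions(quicksight, account_id, dataset_id)
        latest = max(ingestions, key=lambda i: i["CreatedTime"], default=None)
        rows.append({
            "dashboard_name": dashboard,
            "dataset_id": dataset_id,
            "status": classify_refresh(ingestions, now),
            "last_refresh_time": latest["CreatedTime"].isoformat() if latest else "",
        })
    return rows
```

Passing the clients and the current time as arguments keeps the classification rules easy to check with stubbed data; adjust the window argument if your datasets follow a different refresh schedule.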
The status from the last run is now stored in a .csv file in the bucket specified in the Lambda function (see the following screenshot).
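A minimal sketch of how the status rows could be written to that .csv file, assuming the boto3 S3 client and its put_object call. The column names and the helper names rows_to_csv and upload_status are hypothetical, and the file is written with a header row.

```python
import csv
import io

# Hypothetical column layout for the status file.
COLUMNS = ["dashboard_name", "dataset_id", "status", "last_refresh_time"]


def rows_to_csv(rows):
    """Render status rows (a list of dicts) as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def upload_status(s3, bucket, key, rows):
    """Overwrite the status file in Amazon S3 with the latest rows."""
    s3.put_object(Bucket=bucket, Key=key, Body=rows_to_csv(rows).encode("utf-8"))
```

Writing to the same key on every run keeps only the latest status in the bucket; use a dated key instead if you want to retain history.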
You can also schedule your function to run at a certain frequency, depending on when you want to check the status.
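One way to set up such a schedule is an Amazon EventBridge rule that invokes the function. The following is a minimal sketch, assuming the boto3 events client (put_rule, put_targets) and Lambda client (add_permission); the rule name, target ID, and daily rate are placeholders chosen for illustration.

```python
def schedule_daily(events, lambda_client, function_name, function_arn,
                   rule_name="dataset-refresh-status-daily",
                   expression="rate(1 day)"):
    """Create a scheduled rule that invokes the Lambda function; return the rule ARN."""
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=expression,
        State="ENABLED",
    )["RuleArn"]
    # Allow EventBridge to invoke the function. Calling this a second time with
    # the same StatementId raises a conflict error, so run the setup once.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "refresh-status-target", "Arn": function_arn}],
    )
    return rule_arn
```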
Creating an external table in Athena on top of Amazon S3
Now you can create an external Athena table on top of the .csv file you stored in Amazon S3 and query it. Use the following table definition for reference (replace the location with the location of your S3 bucket):
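A minimal sketch of such a table definition, held in a Python helper so the table name and location can be substituted. The column names and the skipped header row are assumptions about the .csv layout; adjust them to match the file your function writes. The helper name build_create_table_sql is hypothetical.

```python
def build_create_table_sql(table, location):
    """Return Athena DDL for an external table over the status .csv file.

    Assumption: comma-delimited file with a header row and four string columns.
    """
    return f"""CREATE EXTERNAL TABLE IF NOT EXISTS {table} (
  dashboard_name string,
  dataset_id string,
  status string,
  last_refresh_time string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{location}'
TBLPROPERTIES ('skip.header.line.count' = '1')"""


if __name__ == "__main__":
    # The location is the S3 folder that holds the file, not the file itself.
    print(build_create_table_sql("dataset_refresh_status", "s3://my-bucket/status/"))
```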
You can get the latest status of the dataset refreshes by querying the table with SQL in Athena.
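If you prefer to run the query programmatically rather than in the Athena console, a minimal sketch follows, assuming the boto3 Athena client and its start_query_execution call. The helper name and the database, table, and output location arguments are placeholders.

```python
def start_status_query(athena, database, table, output_location):
    """Start an Athena query over the status table; return the query execution ID."""
    response = athena.start_query_execution(
        QueryString=f'SELECT * FROM "{table}"',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The call returns as soon as the query is submitted, so a caller would still need to poll the execution state before fetching results.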
Creating a QuickSight dashboard using Athena as the data source
To visualize this data and share it with others, build a dashboard on top of the data in QuickSight. The following screenshot shows the listed dashboards.
You first create a dataset for the Athena table.
- On the QuickSight console, choose Manage data.
- Choose Create dataset.
You use Athena as the source for your dataset. If you don’t have an existing Athena data source, you can create a new one. For instructions, see Creating a Data Source.
- Choose the table you just created.
- Select Import to SPICE for quicker analysis.
Depending on the size of your dataset and expected latency, you can choose Directly query your data instead. If you use SPICE, remember to add a refresh schedule for the dataset.
- Create an analysis from the dataset.
For this post, choose a table visual type and drag all the columns to the Value field well.
You can create the visualization as in the following screenshot, with conditional formatting to highlight failed and successful loads.
- To publish the dashboard, choose Share on the application bar of the analysis.
- Choose Publish dashboard.
- For Publish new dashboard as, enter a name for your dashboard.
You can now share the dashboard with end-users.
Conclusion
In this post, we described how to create a QuickSight dashboard that tracks the last refresh status of all the datasets in your account. The dashboard provides a single-pane view of the status of all the datasets and avoids the manual effort of opening and checking each dataset individually.
About the authors
Ginni Malik is an Associate Cloud Developer with AWS.
Rohan Jamadagni is a Solutions Architect with AWS.