AWS Partner Network (APN) Blog

How Pariveda Enables Operational Data Observability Across Your AWS Data Lake at Scale

By Suchi Patel, Sr. Associate – Pariveda
By Gene Masin, Sr. Partner Solutions Architect, HCLS – AWS
By Kalyan Kumar Neelampudi, Partner Solutions Architect, Analytics – AWS

The variety and volume of data are ever-growing, with some organizations reporting their data volume increases by 63% each month. Moreover, this data is complex and originates from many disjointed sources. Processing it requires spinning up numerous services to scale to meet an enterprise’s needs.

Resources for ingesting, processing, and storing data drive platform costs. This creates an opportunity to invest in cost-effective capabilities that enable platform engineers, operators, and administrators to observe what’s happening within the platform to quickly detect, diagnose, and resolve issues as they arise in a given pipeline.

Pariveda’s data observability solution helps data platform teams optimize pipeline performance, remove bottlenecks, reduce costs, and reinforce trust in the system.

Pariveda Solutions is an AWS Premier Tier Services Partner and AWS Marketplace Seller with several AWS Competencies, including the Data and Analytics Consulting Competency. Pariveda is dedicated to solving complex business problems by aligning people-development focus with the mission of the clients.

In this post, we demonstrate a solution that builds operational dashboards of AWS Glue job metadata using Amazon QuickSight. This solution uses an Amazon CloudWatch metric stream to deliver data to an Amazon Simple Storage Service (Amazon S3) bucket.

AWS Lake Formation and the AWS Glue Data Catalog are used to build a table on top of the metadata store for querying via Amazon Athena. Finally, QuickSight is connected to an Athena data source to create a dashboard that displays job details, such as runtimes, status, and computational load.

Customer Requirements

A healthcare client engaged Pariveda to design and build the data platform for a greenfield analytics-as-a-service offering. The solution leverages a data mesh pattern via AWS Lake Formation to enable a hub-and-spoke model between the platform, which produces the data, and the future customers who will consume it.

Over the past year, Pariveda has developed hundreds of extract, transform, load (ETL) jobs using AWS Glue to ingest and standardize data for the platform. As the number of jobs and job runs increased, the volume of metrics data also grew rapidly. Based on client-specific requirements to increase visibility into job performance, Pariveda had to create a solution to collect, parse, and visualize this metrics data within a QuickSight dashboard.

The dashboard had to provide at-a-glance monitoring of key AWS Glue job metrics like job run status, duration, and number of processed records. Operators needed to be able to quickly identify trends, outliers, and anomalies to optimize job performance. The solution also needed to automatically incorporate new Glue jobs without additional configuration, providing immediate visibility into new jobs.

The customer requested that the data be provided in near real-time, and access to the system had to follow least-privilege requirements to comply with the customer’s data governance guidelines.

While addressing these requirements, Pariveda had to consider a few challenges. First, QuickSight needed access to the AWS Glue metrics data stored in Amazon CloudWatch. Second, metric records are pushed from Glue every 30 seconds. This rate can lead to a massive amount of data as the number and duration of ETL jobs increase.

The following sections explain how, by leveraging AWS Glue’s detailed metrics and building a centralized dashboard, Pariveda delivered a solution that enables end-to-end observability of ETL workloads.

Operators can now gain actionable insights to proactively manage Glue jobs, rather than reactively troubleshooting failures. The automated onboarding also streamlines monitoring as the Glue job catalog evolves. Overall, this solution unlocks additional value from Glue job metrics for optimized workload monitoring.

Solution Overview

To address the customer’s requirements, Pariveda built a business intelligence (BI) dashboard to provide a single-pane-of-glass for viewing the operational status of resources running in the data platform.

Pariveda’s data platform architecture, built on AWS-native technologies, leverages AWS Glue to create, run, and monitor ETL pipelines with real-time analytics. CloudWatch metrics, both standard and custom, are sent in near real-time to a data lake, a key part of the platform, for subsequent analysis with QuickSight.

To maintain the security principle of least-privilege, it’s best practice to limit which principals are granted write access. The solution integrates seamlessly with AWS Lake Formation so all permissions to AWS services are managed from a centralized location. This includes all access to underlying metadata and the ability to write to or read from tables.

Figure 1 – Solution architecture.

This solution can be applied to any service that uses CloudWatch as a metrics store. To adapt it to a different use case, create a new metric stream for the desired namespace and a schema definition based on the expected structure of the incoming data.

Core Solution

Breaking the solution down, Pariveda leverages various AWS services to execute the core task: bringing metrics data from AWS Glue jobs into QuickSight for analysis via a CloudWatch metric stream.

The AWS Glue Job Profiler collects metadata from Glue jobs as near real-time metrics, and these arrive in CloudWatch every 30 seconds. As ETL pipelines scale, so will the volume of metadata that needs to be analyzed and processed. CloudWatch metric streams, backed by Amazon Data Firehose, enable the delivery of these metrics to the data store of your choosing.

CloudWatch metric streams are highly configurable, with options for the output format, namespaces, and desired metrics. Such filters are an important cost-management measure when configuring a metric stream.

Pariveda implemented namespace filters (for example, “Glue”) and metric filters to limit the volume of data that travels through the stream. The stream also scales automatically, ingesting metrics from new Glue jobs as they are created, without manual intervention.
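As an illustration of the filtering described above, the following minimal sketch builds the request parameters for the CloudWatch PutMetricStream API with a namespace filter and an optional metric-name filter. The function name, stream name, and ARNs are hypothetical placeholders, not values from the customer's environment.

```python
from __future__ import annotations

from typing import Any


def build_metric_stream_request(
    name: str,
    firehose_arn: str,
    role_arn: str,
    namespace: str = "Glue",
    metric_names: list[str] | None = None,
) -> dict[str, Any]:
    """Build keyword arguments for the CloudWatch PutMetricStream API.

    Only the given namespace is streamed. If metric_names is provided,
    only those metrics within the namespace are streamed.
    """
    include_filter: dict[str, Any] = {"Namespace": namespace}
    if metric_names:
        include_filter["MetricNames"] = list(metric_names)
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,
        "RoleArn": role_arn,
        "OutputFormat": "json",
        "IncludeFilters": [include_filter],
    }


# Placeholder ARNs for illustration only.
request = build_metric_stream_request(
    name="glue-metrics-stream",
    firehose_arn="arn:aws:firehose:us-east-1:123456789012:deliverystream/example",
    role_arn="arn:aws:iam::123456789012:role/example-metric-stream-role",
    metric_names=["glue.driver.aggregate.elapsedTime"],
)
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("cloudwatch").put_metric_stream(**request)`. Because the filter is defined at the namespace level, metrics from newly created Glue jobs flow through the stream without any change to this configuration.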

Amazon S3 was chosen as the destination to maintain consistency with the rest of the customer’s data lake. The data is streamed to S3 in time-based partitions and stored as GZIP-compressed JSON files, which enables more efficient querying in Amazon Athena, an interactive query service that makes it easy to analyze data directly in S3 using standard SQL.

Lifecycle policies ensure data is moved dynamically across storage tiers, enabling cost savings. To learn more, see this AWS blog post about optimizing storage costs with new S3 lifecycle filters and actions. A table is created on top of the S3 data within the AWS Glue Data Catalog using the known schema of the incoming metrics data; Amazon Athena is then used to query this table.
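The lifecycle policy mentioned above could look something like the sketch below, which builds the configuration document accepted by the S3 PutBucketLifecycleConfiguration API. The post does not state which storage tiers or retention periods were used, so the prefix, day counts, storage class, and rule ID here are assumptions chosen for illustration.

```python
from typing import Any


def build_lifecycle_configuration(
    prefix: str,
    transition_days: int = 30,
    expiration_days: int = 365,
) -> dict[str, Any]:
    """Build an S3 lifecycle configuration for the metrics prefix.

    Objects move to an infrequent-access tier after transition_days
    and are deleted after expiration_days (both values are assumed).
    """
    if transition_days >= expiration_days:
        raise ValueError("transition must happen before expiration")
    return {
        "Rules": [
            {
                "ID": "tier-glue-metrics",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


lifecycle = build_lifecycle_configuration("glue-metrics/")
```

With boto3, this document would be supplied as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.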

Figure 2 – Metrics data.

The raw metrics data output by CloudWatch is nested and the dimensions should be flattened to improve performance and scalability. This can be accomplished by creating a view in Athena, resulting in more efficient querying and removing the need for QuickSight to unnest tables to display visualizations.

Figure 3 – Creating Amazon Athena view.
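To make the flattening concrete, the sketch below performs the equivalent transformation on a single record in Python. In the solution itself this is done by the Athena view, not by Python code. The record layout assumes the JSON output format of CloudWatch metric streams, and the sample values (job name, run ID, numbers) are invented for illustration.

```python
import json
from typing import Any

# One line of metric stream output, with invented sample values.
SAMPLE_LINE = json.dumps({
    "metric_stream_name": "glue-metrics-stream",
    "account_id": "123456789012",
    "region": "us-east-1",
    "namespace": "Glue",
    "metric_name": "glue.driver.aggregate.elapsedTime",
    "dimensions": {"JobName": "raw-to-staged", "JobRunId": "jr_example", "Type": "count"},
    "timestamp": 1714568400000,
    "value": {"max": 42.0, "min": 42.0, "sum": 42.0, "count": 1.0},
    "unit": "Milliseconds",
})


def flatten_metric_record(line: str) -> dict[str, Any]:
    """Promote nested 'dimensions' and 'value' fields to top-level columns."""
    record = json.loads(line)
    flat = {k: v for k, v in record.items() if k not in ("dimensions", "value")}
    for name, val in record.get("dimensions", {}).items():
        flat[f"dimension_{name.lower()}"] = val
    for stat, val in record.get("value", {}).items():
        flat[f"value_{stat}"] = val
    return flat


flat_record = flatten_metric_record(SAMPLE_LINE)
```

After flattening, each dimension and statistic is an ordinary column, so a dashboard can filter on the job name or aggregate a statistic without unnesting anything at query time.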

Once the data is transformed into a usable state, Athena is configured as a QuickSight data source for the creation of the operational dashboard.

The enterprise software environment also needs automation to ensure the dashboard always displays the latest status of the processing jobs as the platform scales. While the CloudWatch metric stream handles the near real-time transfer of data from CloudWatch to S3, the AWS Glue Data Catalog does not immediately recognize when new data is added to the metric bucket.

To account for this, an event-driven architecture is used to trigger a Lambda function, which parses S3 event notifications to create tables and partitions in the data catalog to prepare the data for querying in QuickSight.

This process ensures any new data sent by the CloudWatch metric stream is automatically recognized by Athena as a new partition. The flattened view, created on top of the glue_metrics table, is always up-to-date based on the latest data available in the table. By connecting the QuickSight dashboard to a view with automation, the visualizations will always use the latest metrics available.
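The event parsing described above can be sketched as follows. The example assumes S3 event notifications arrive wrapped in SQS message bodies and that objects are written under an hourly `YYYY/MM/dd/HH/` key layout; the bucket name and prefix in the usage lines are hypothetical. It returns the distinct partitions found and renders an Athena `ALTER TABLE ... ADD PARTITION` statement for each.

```python
import json
import re
from urllib.parse import unquote_plus

# Matches the assumed hourly layout at the end of an object key.
_PARTITION_RE = re.compile(
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/[^/]+$"
)


def partitions_from_sqs_event(event: dict) -> list[dict]:
    """Extract distinct partitions from S3 notifications wrapped in SQS messages."""
    found: list[dict] = []
    for message in event.get("Records", []):
        body = json.loads(message["body"])
        # S3 test events carry no "Records" key and are skipped here.
        for rec in body.get("Records", []):
            if not rec.get("eventName", "").startswith("ObjectCreated"):
                continue
            bucket = rec["s3"]["bucket"]["name"]
            key = unquote_plus(rec["s3"]["object"]["key"])  # keys are URL-encoded
            match = _PARTITION_RE.search(key)
            if match is None:
                continue
            entry = {
                "values": match.groupdict(),
                "location": f"s3://{bucket}/{key[:match.end('hour')]}/",
            }
            if entry not in found:
                found.append(entry)
    return found


def add_partition_ddl(table: str, partition: dict) -> str:
    """Render an idempotent Athena statement that registers one partition."""
    spec = ", ".join(f"{k}='{v}'" for k, v in partition["values"].items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{partition['location']}'"
    )
```

Several objects landing in the same hour produce a single partition entry, and `IF NOT EXISTS` keeps the statement safe to repeat when SQS delivers a message more than once.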

The solution is flexible enough to be integrated into various orchestration tools, including AWS Step Functions or Amazon Managed Workflows for Apache Airflow (MWAA). The implemented solution uses event triggers for simple Lambda workflows and MWAA for the more complex chaining of various processing pipelines.

Figure 4 – Integration with Amazon MWAA.

The steps outlined below detail the solution workflow:

  1. AWS Glue Job Profiler sends near real-time metric data to CloudWatch.
  2. CloudWatch uses a metric stream (backed by Amazon Data Firehose) to send the metric data to an S3 bucket.
  3. Amazon S3 sends ObjectCreated event notifications to an Amazon Simple Queue Service (SQS) queue.
  4. SQS queue triggers a Lambda function to perform the following actions in the data catalog:
    • Check if an AWS Glue Data Catalog table exists in the catalog already.
    • If the table does not exist, the function will:
      • Retrieve the expected schema from an S3 bucket to create the table. You can also leverage an AWS Glue Crawler to infer the schema. For this use case, we decided to stay consistent with the existing architecture and leveraged the former pattern to reduce the number of changes needed in our AWS CloudFormation templates.
      • Create a view to flatten the nested metric data for downstream consumption by QuickSight.
    • Add a partition to the table based on the S3 path provided in the event trigger.
  5. The Glue table and view from Step 4 can now be queried via Amazon Athena.
  6. The Athena table and view are configured as a data source in QuickSight and then used for the custom operational dashboard.
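The decision logic in Step 4 can be sketched as below. The `Catalog` interface, the `InMemoryCatalog` stand-in, and the `_flattened` view suffix are hypothetical names introduced for illustration; a real implementation would back these methods with AWS Glue Data Catalog and Athena calls.

```python
from __future__ import annotations

from typing import Callable, Protocol


class Catalog(Protocol):
    """Hypothetical wrapper around the catalog operations Step 4 needs."""

    def table_exists(self, table: str) -> bool: ...
    def create_table(self, table: str, schema: list[dict]) -> None: ...
    def create_view(self, view: str, source_table: str) -> None: ...
    def add_partition(self, table: str, location: str) -> None: ...


def register_new_data(
    catalog: Catalog,
    load_schema: Callable[[], list[dict]],
    table: str,
    location: str,
) -> list[str]:
    """Create the table and view on first use, then add the partition."""
    steps: list[str] = []
    if not catalog.table_exists(table):
        # The schema is only fetched when the table has to be created.
        catalog.create_table(table, load_schema())
        catalog.create_view(f"{table}_flattened", table)
        steps += ["create_table", "create_view"]
    catalog.add_partition(table, location)
    steps.append("add_partition")
    return steps


class InMemoryCatalog:
    """Test double that records calls instead of talking to AWS."""

    def __init__(self) -> None:
        self.tables: dict[str, list[dict]] = {}
        self.views: dict[str, str] = {}
        self.partitions: dict[str, set[str]] = {}

    def table_exists(self, table: str) -> bool:
        return table in self.tables

    def create_table(self, table: str, schema: list[dict]) -> None:
        self.tables[table] = schema
        self.partitions[table] = set()

    def create_view(self, view: str, source_table: str) -> None:
        self.views[view] = source_table

    def add_partition(self, table: str, location: str) -> None:
        self.partitions[table].add(location)
```

Keeping the catalog behind a small interface lets the branching be exercised without AWS access: the first event creates the table and view, and every later event only adds a partition.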

Access Management

The permissions for QuickSight to access the underlying S3 data are controlled and tracked using AWS Lake Formation’s native capabilities.

AWS Lake Formation enables all permissions management to be federated and centralized through the service. This includes all access to underlying metadata and the ability to write to or read from tables.

As a result, any principal that interacts with the data catalog must be granted permission through Lake Formation. Principals can be AWS Identity and Access Management (IAM) roles or QuickSight entities. To maintain the principle of least-privilege, it’s best practice to limit which principals are granted write access. For example, in most cases a QuickSight user should not have permissions to write to an underlying data store.

AWS Lake Formation access can also be provisioned through CloudFormation templates to maintain consistency across customers and environments.
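As a rough illustration of a least-privilege grant, the sketch below builds the parameters for the Lake Formation GrantPermissions API, giving a principal read-only access to a single table. The database name, principal ARN, and helper names are hypothetical; the same grant could equally be declared in a CloudFormation template.

```python
from typing import Any

# Table permissions that allow reading but not modifying data or metadata.
READ_ONLY_PERMISSIONS = ("SELECT", "DESCRIBE")
# Table permissions that a dashboard user should normally not hold.
WRITE_PERMISSIONS = frozenset({"ALL", "ALTER", "DELETE", "DROP", "INSERT"})


def build_read_only_grant(principal_arn: str, database: str, table: str) -> dict[str, Any]:
    """Build GrantPermissions parameters for read-only access to one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(READ_ONLY_PERMISSIONS),
        "PermissionsWithGrantOption": [],  # the principal cannot re-grant access
    }


def is_read_only(grant: dict[str, Any]) -> bool:
    """Return True if the grant contains no write-capable permission."""
    requested = set(grant["Permissions"]) | set(grant["PermissionsWithGrantOption"])
    return not (requested & WRITE_PERMISSIONS)


grant = build_read_only_grant(
    "arn:aws:quicksight:us-east-1:123456789012:user/default/example-analyst",
    database="observability",  # hypothetical database name
    table="glue_metrics",
)
```

With boto3, the result could be passed as `boto3.client("lakeformation").grant_permissions(**grant)`. A check like `is_read_only` can run in a deployment pipeline to catch grants that would give a QuickSight principal write access.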

Figure 5 – Access management architecture.

Supplying Custom Metrics

Some metrics, such as the status of a given JobRunId, are not available natively through the AWS Glue metrics that CloudWatch collects. To close this gap, a solution was created to automatically collect and publish custom metrics to CloudWatch.

Figure 6 – Capturing AWS Glue JobRunId metrics.

The steps below outline the solution workflow:

  1. Once an AWS Glue job finishes or fails, MWAA is configured to invoke AWS Lambda.
  2. The Lambda function will poll the Glue service for that JobRunId and get the latest run metadata.
  3. These attributes get published as custom metrics to the Glue namespace in CloudWatch.
  4. These are collected by the newly-created CloudWatch metric stream and sent to the metric S3 bucket for downstream analysis.
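Steps 2 and 3 above can be sketched as a function that converts the `JobRun` structure returned by the Glue GetJobRun API into entries for the CloudWatch PutMetricData API. The metric names and the 1/0 encoding of the run status are assumptions made for this example; the post does not specify how the custom metrics were named or encoded.

```python
from datetime import datetime, timezone
from typing import Any


def job_run_to_metric_data(job_run: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert a Glue JobRun structure into CloudWatch MetricData entries."""
    dimensions = [
        {"Name": "JobName", "Value": job_run["JobName"]},
        {"Name": "JobRunId", "Value": job_run["Id"]},
    ]
    # Fall back to the current time if the run has no completion timestamp.
    timestamp = job_run.get("CompletedOn") or datetime.now(timezone.utc)
    metrics = [
        {
            "MetricName": "custom.jobRunSucceeded",  # hypothetical metric name
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            # Assumed encoding: 1 for a successful run, 0 for any other state.
            "Value": 1.0 if job_run["JobRunState"] == "SUCCEEDED" else 0.0,
            "Unit": "Count",
        }
    ]
    if "ExecutionTime" in job_run:
        metrics.append(
            {
                "MetricName": "custom.executionTimeSeconds",  # hypothetical
                "Dimensions": dimensions,
                "Timestamp": timestamp,
                "Value": float(job_run["ExecutionTime"]),
                "Unit": "Seconds",
            }
        )
    return metrics
```

In a Lambda function, the input would come from `glue.get_job_run(JobName=..., RunId=...)["JobRun"]`, and the output would be published with `cloudwatch.put_metric_data(Namespace="Glue", MetricData=...)`. Publishing to the same namespace means the existing metric stream filter picks up the custom metrics automatically.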

Examples of a QuickSight Dashboard

The dashboard examples below show the displayed job details, including runtimes, status, and computational load. The dashboard is connected to the Amazon Athena view of the flattened data that was created in the previous section, ensuring the data in the dashboard is always up to date.

This specific example shows an “at-a-glance view” of a single dataset as it’s processed from the raw layer through the curated layer. In the example, a filter is put in place to search for all of the jobs that correspond to a specific dataset: the raw-to-staged job, the staged-to-curated job, and the data quality row-by-row logging job. This filter could be removed to show the aggregated metrics for all data processing jobs within the data platform.

Figure 7 – QuickSight dashboard examples.

Conclusion

AWS Glue provides powerful features to build scalable and secure data platforms. As these platforms grow, having visibility into key job metrics becomes critical for optimizing reliability, performance, and cost.

The AWS Glue console dashboard surfaces helpful insights into job run times, data processing unit (DPU) usage, and data processed. As data volumes increase, digging deeper into metrics at a job level is essential. Pariveda’s solution enhances AWS Glue’s built-in monitoring by providing granular observability tailored to customer-specific needs. It also integrates a single-pane-of-glass BI dashboard for visibility into Glue job runs.

This additional telemetry enables users to pinpoint optimization opportunities, troubleshoot issues faster, and ultimately drive greater efficiency across Glue data pipelines.

By leveraging both AWS Glue’s native monitoring and Pariveda’s enhanced observability, data engineers can maintain high standards for pipeline uptime, performance, and cost as workloads scale. The combination of these tools provides a powerful way to unlock greater value from AWS Glue deployments.

The Pariveda solution automates the delivery of large volumes of metrics data from Amazon CloudWatch into QuickSight in near real-time for self-service operational observability. By leveraging federated access models and automated orchestration, the solution scales with both the volume of data and the number of access permissions.

Alerting capabilities could be employed at any point of the data processing pipeline to inform operations teams of issues that require additional mitigation. Additionally, QuickSight has embedded artificial intelligence (AI) and machine learning (ML) functionality that would further enhance observability.


Pariveda Solutions – AWS Partner Spotlight

Pariveda is an AWS Premier Tier Services Partner that’s dedicated to solving complex business problems by aligning people-development focus with the mission of the clients.

Contact Pariveda | Partner Overview | AWS Marketplace | Case Studies