AWS Partner Network (APN) Blog

Achieve Complete Data Observability on AWS with TCS Approach to Observability Challenges

By Ramesh Srinivasan, Head, Data and Analytics AI.Cloud – TCS
By Gopakumar S, Solution Architect, AWS Data Integration and Analytics – TCS
By Nirmal Singh Tomar, Sr. Solutions Architect – AWS

Data observability refers to the ability to monitor, understand, and troubleshoot the flow of data across your infrastructure, ensuring that data pipelines are running smoothly, accurately, and efficiently. It’s a critical aspect of data engineering projects for any workload.

By maintaining strong data observability, data engineers can quickly identify and address issues, optimize performance, ensure data quality, and ultimately make informed decisions based on the data being processed.

Modern data engineering platforms vary widely in complexity. Data flows into these platforms in large volumes, and as the number of data sources grows, the observability data becomes difficult to handle and understand. Each additional layer or technology component adds to that complexity.

There’s a need for an observability solution that is commercially viable, fits your existing technology stack, and doesn’t require additional skills or training to achieve your data observability goals on Amazon Web Services (AWS).

In this post, you will learn about various data observability challenges and the Tata Consultancy Services (TCS) approach to addressing them with AWS-native services, starting with event ingestion, then aggregation to create actionable insights, and finally visualization with dashboards.

An IT services, consulting, and business solutions organization, TCS is an AWS Premier Tier Services Partner and Managed Service Provider (MSP) that has been partnering with many of the world’s largest businesses in their transformation journeys for the last 50 years.

Data Observability Dimensions

Unlike other systems, data platforms have attributes that help you determine the health of your data. These attributes form the core of data observability: lineage, freshness, volume, distribution, and schema.

  • Lineage: Examines which teams have been responsible for generating and accessing the data, as well as any impact on upstream sources or downstream dashboards.
  • Freshness: Measure of how recently data tables were updated.
  • Volume: Relates to the completeness of the data tables.
  • Distribution: Measure to check if the data is within an anticipated range.
  • Schema: Considers changes in underlying table structures and how the data is organized.
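
As an illustration, four of these dimensions can be expressed as simple checks over a table summary. The Python sketch below is a minimal example under stated assumptions: the `TableSnapshot` structure, the function names, and the 10% volume tolerance are hypothetical and are not taken from the TCS solution. Lineage is left out because it depends on pipeline metadata rather than on a single table.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableSnapshot:
    """Hypothetical summary of one table at one point in time."""
    last_updated: datetime
    row_count: int
    columns: dict[str, str]   # column name -> declared type
    values: list[float]       # sample of one numeric column


def is_fresh(snap: TableSnapshot, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the table was updated within the allowed age."""
    return now - snap.last_updated <= max_age


def volume_ok(snap: TableSnapshot, expected_rows: int, tolerance: float = 0.1) -> bool:
    """Volume: the row count is within a relative tolerance of the expected count."""
    return abs(snap.row_count - expected_rows) <= tolerance * expected_rows


def out_of_range(snap: TableSnapshot, low: float, high: float) -> list[float]:
    """Distribution: return sampled values outside the anticipated range."""
    return [v for v in snap.values if not low <= v <= high]


def schema_drift(snap: TableSnapshot, expected: dict[str, str]) -> dict[str, list[str]]:
    """Schema: report columns that were added, removed, or changed type."""
    current, wanted = set(snap.columns), set(expected)
    return {
        "added": sorted(current - wanted),
        "removed": sorted(wanted - current),
        "retyped": sorted(c for c in current & wanted if snap.columns[c] != expected[c]),
    }
```

In practice the snapshot would be filled from catalog or job metadata, and the expected values would come from a baseline agreed with the data owners.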

The TCS data observability approach discussed in this post takes these attributes into account and derives inferences from the collected events.

What Makes Data Observability Challenging?

The fundamental requirement of any observability solution is the generation of observability data. The major challenge is the pace at which new workloads are deployed: in a dynamic environment, the volume and rate of emitted data can become unmanageable, making it difficult for IT teams to consistently grasp the full context of every situation.

Another challenge is to ensure systems are emitting enough events. The following are factors affecting the flow of events:

  • All components in the data platform may not be instrumented.
  • Observability data can be large in volume, and the aggregation layer may not have enough storage.
  • Logs may not contain enough information about the incident.
  • Selection of the right set of tools and services for your observability stack.
  • Selection of the right metrics and key performance indicators (KPIs).

TCS Approach to Observability Challenges

A common pattern for handling observability data is a balanced combination of white box and black box observability.

Black box observability is focused on collecting and analyzing standard metrics that are emitted from the systems, such as CPU and memory. White box observability focuses on collecting some internals of the system, like queries or response codes. Black box observability relies on metrics emitted out-of-the-box, whereas white box observability relies heavily on instrumentation to collect metrics.

Both patterns are significant here. TCS instruments all of the components that are running data workloads so it can collect data and job-related metrics. TCS also collects out-of-the-box metrics and logs regarding the utilization of infrastructure. The optimal blend of these two approaches can significantly enhance the effectiveness of your data observability solution.
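
To show what white box instrumentation can look like, here is a minimal Python sketch that wraps a job and emits one structured event with its status, duration, and row count. The `observe_job` helper, the event fields, and the `sink` callable are assumptions made for illustration; the post does not describe how TCS instruments its jobs.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def observe_job(job_name, sink):
    """Emit one JSON event per job run to `sink` (any callable taking a string)."""
    event = {"job": job_name, "status": "succeeded", "rows_processed": 0}
    start = time.monotonic()
    try:
        # The job body can update the event, for example the row count.
        yield event
    except Exception as exc:
        event["status"] = "failed"
        event["error"] = repr(exc)
        raise  # the failure is recorded, not swallowed
    finally:
        event["duration_seconds"] = round(time.monotonic() - start, 3)
        sink(json.dumps(event, sort_keys=True))
```

A job would use it as `with observe_job("daily_orders_load", print) as event: event["rows_processed"] = 1200`. Black box metrics such as CPU and memory need no code of this kind because the platform emits them out of the box.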

Architecture for Data Observability

Figure 1 – Architecture for data observability.

Collection, aggregation, and visualization are the building blocks for observability. Each plays a specific role in system design:

  • Collection: Events are collected from different data workloads as a first step.
  • Aggregation: Collected events are stored in an intermediate layer, where they are aggregated and analyzed. Logs should be optimized for running queries as required.
  • Visualization: Provides insights to the operation teams through different dashboards on the state of your data in the platform.
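
The three building blocks can be sketched end to end in a few lines of Python. In this hypothetical example, collected events are plain dictionaries with `workload`, `status`, and `duration_seconds` fields (names chosen for illustration), `aggregate` stands in for the aggregation layer, and `render` stands in for a dashboard by printing a text table.

```python
from collections import defaultdict
from statistics import mean


def aggregate(events):
    """Aggregation: summarize collected events per workload."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["workload"]].append(event)
    return {
        workload: {
            "runs": len(items),
            "failures": sum(1 for e in items if e["status"] == "failed"),
            "avg_duration_seconds": mean(e["duration_seconds"] for e in items),
        }
        for workload, items in grouped.items()
    }


def render(summary):
    """Visualization: a plain-text stand-in for a dashboard panel."""
    lines = [f"{'workload':<10}{'runs':>6}{'failed':>8}{'avg_s':>8}"]
    for workload in sorted(summary):
        s = summary[workload]
        lines.append(
            f"{workload:<10}{s['runs']:>6}{s['failures']:>8}"
            f"{s['avg_duration_seconds']:>8.1f}"
        )
    return "\n".join(lines)
```

In the reference architectures that follow, these roles are filled by managed services rather than by in-process code.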

Let’s look at some of the data components that TCS observes in today’s data landscape, and how this approach can help enterprises build an observability solution.

Reference Architecture Using AWS Services

This solution is suitable for organizations that want to leverage AWS-native services:

Figure 2 – Observability solution using AWS-native services.

Highlights of the architecture are:

  1. Workload account is an individual AWS account where resources like Apache NiFi, Amazon EMR, and Apache Airflow run and share observability data and resources with observability accounts.
  2. Observability account is a central AWS account that runs the observability components on top of data shared by other workload accounts.
  3. Every component should be instrumented with the unified Amazon CloudWatch agent. Alternatively, AWS Distro for OpenTelemetry can be used for this instrumentation.
  4. A Prometheus agent collects job-related metrics from the Apache NiFi instance. AWS Lambda then writes them to an encrypted Amazon Simple Storage Service (Amazon S3) bucket, using cross-account access in AWS Identity and Access Management (IAM).
  5. Once the metrics and logs are collected in Amazon CloudWatch, they are exported to an S3 bucket by a Lambda function, which is scheduled to scrape metrics and logs from CloudWatch at fixed intervals.
  6. Amazon S3 is added as a data source in Amazon QuickSight, and dashboards are refreshed at a fixed interval per the requirement.
  7. IAM cross-account roles are used to access workload account data from the observability account.
  8. All data stores used in the solution are encrypted at rest using AWS Key Management Service (AWS KMS).
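
The scheduled export in step 5 could look roughly like the Python sketch below. It is a simplified illustration, not the TCS implementation: the `DataPlatform/Jobs` namespace, the `RowsProcessed` metric, the `EXPORT_BUCKET` environment variable, and the object key layout are all hypothetical. Pagination of `get_metric_data`, log export, and the cross-account role assumption from step 7 are omitted for brevity.

```python
import json
import os
from datetime import datetime, timedelta, timezone

# Hypothetical query against a custom namespace; adjust to your own metrics.
EXAMPLE_QUERIES = [{
    "Id": "rows",
    "MetricStat": {
        "Metric": {"Namespace": "DataPlatform/Jobs", "MetricName": "RowsProcessed"},
        "Period": 300,
        "Stat": "Sum",
    },
}]


def export_metrics(cloudwatch, s3, bucket, queries, now, window=timedelta(minutes=5)):
    """Copy one time window of CloudWatch metric data to S3 as JSON lines."""
    response = cloudwatch.get_metric_data(
        MetricDataQueries=queries,
        StartTime=now - window,
        EndTime=now,
    )
    lines = []
    for result in response["MetricDataResults"]:
        for timestamp, value in zip(result["Timestamps"], result["Values"]):
            lines.append(json.dumps({
                "metric": result["Label"],
                "timestamp": timestamp.isoformat(),
                "value": value,
            }))
    key = f"metrics/{now:%Y/%m/%d/%H%M}.jsonl"  # hypothetical key layout
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body="\n".join(lines).encode("utf-8"),
        ServerSideEncryption="aws:kms",  # encrypt at rest with AWS KMS
    )
    return key


def lambda_handler(event, context):
    """Entry point for a Lambda function invoked on a fixed schedule."""
    import boto3  # provided by the AWS Lambda Python runtime

    return export_metrics(
        boto3.client("cloudwatch"),
        boto3.client("s3"),
        os.environ["EXPORT_BUCKET"],  # hypothetical environment variable
        EXAMPLE_QUERIES,
        datetime.now(timezone.utc),
    )
```

Passing the clients into `export_metrics` keeps the export logic testable without AWS credentials. Writing one object per window gives Amazon QuickSight a predictable prefix to read from.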

Below are the dashboards for different metrics generated as a part of this solution:

Figure 3 – Sample dashboards using Amazon QuickSight created by the solution.

Reference Architecture Using AWS Managed Open-Source Services

This solution is suitable for organizations that want to combine AWS-native services with managed open-source services such as Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Figure 4 – Observability solution using AWS-native and AWS managed open-source services.

Highlights of the architecture are:

  1. Workload account is an individual AWS account where resources like Apache NiFi, Amazon EMR, and Apache Airflow run and share observability data and resources with observability accounts.
  2. Observability account is a central AWS account that runs the observability components on top of data shared by other workload accounts.
  3. Metrics are exported to Amazon Managed Service for Prometheus for aggregation.
  4. A log collector gathers the logs and sends them to Amazon OpenSearch Service.
  5. A single unified dashboard is provided by Grafana, which shares the observability dashboards with the operations team.
  6. Amazon Simple Notification Service (SNS) notifications can be triggered from Amazon Managed Grafana to alert the operations team.
  7. Alerts can also be sent from Grafana to IT service management (ITSM) tools to automate event response and remediation.
  8. IAM cross-account roles are used to access workload account data in observability accounts.
  9. All data stores used in the solution are encrypted at rest using AWS KMS.

Below are the dashboards for different metrics generated as a part of this solution:

Figure 5 – Sample dashboards using Amazon Managed Grafana created by the solution.

Business Benefits

By implementing the above architectures and leveraging native AWS services for data observability, organizations can ensure their data engineering workloads operate efficiently, maintain data quality, and provide valuable insights to their stakeholders. Here are a few additional benefits:

  • Centralized dashboard for easy monitoring: Create a centralized dashboard that displays key metrics from various AWS services.
  • Automated alerts: Alarms can automatically trigger notifications when predefined thresholds are breached.
  • Regular monitoring: Monitor regularly to identify trends, anomalies, and performance issues.
  • Continuous improvement: Use the insights gained from observability to continuously optimize your data pipelines, enhancing performance and data quality.
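
The automated alerts benefit comes down to comparing recent metric values with predefined thresholds. The Python sketch below shows that comparison in isolation. The `Threshold` type, the metric names, and the message format are hypothetical; in the architectures above, this evaluation is performed by the managed alerting features and delivery goes through Amazon SNS, not by custom code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert rule for a single metric."""
    metric: str
    limit: float
    direction: str = "above"  # "above" or "below"


def breached(thresholds, latest):
    """Return one message per threshold that the latest values violate.

    `latest` maps metric name -> most recent value. Metrics with no value
    are skipped here; a real rule set may treat missing data as an alert.
    """
    alerts = []
    for rule in thresholds:
        value = latest.get(rule.metric)
        if value is None:
            continue
        too_high = rule.direction == "above" and value > rule.limit
        too_low = rule.direction == "below" and value < rule.limit
        if too_high or too_low:
            alerts.append(f"{rule.metric}={value} is {rule.direction} threshold {rule.limit}")
    return alerts
```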

Conclusion

The TCS approach to observability can be suitable for various use cases, regardless of the industry. The reference architectures discussed in this post are built on top of AWS-native services and provide a single-pane-of-glass view into systems running enterprise applications.

The approach is flexible enough to incorporate further enhancements, and organizations can adopt it as is or with slight customization according to their requirements. Contact TCS to learn more about its consulting experience with data analytics and observability solutions.

TCS – AWS Partner Spotlight

TCS is an AWS Premier Tier Services Partner and MSP that has been partnering with many of the world’s largest businesses in their transformation journeys for the last 50 years.