AWS Partner Network (APN) Blog
Unveiling the Power of Full-Stack Observability
By Austin Sequeira, Head of Cloud Innovation – LTIMindtree
By Francois van Rensburg, Principal Partner Solution Architect – AWS
Enterprises operating in advanced, heterogeneous, and distributed cloud environments face growing challenges in monitoring their complex infrastructures. Manually maintaining visibility into an ever-growing, dynamic environment is demanding and time consuming. The complexity is compounded when enterprises operate across multiple cloud accounts, which makes it hard to monitor and manage operational data effectively. Gaining visibility into overall business health, and assessing the business impact of ongoing issues, adds to the challenge.
As a result, the business, operations, and delivery teams are increasingly turning to full-stack observability solutions that provide unified insights for driving effective cloud operations at scale.
Observability enables enterprises to gain actionable insights into the behaviour, performance, and interactions of their systems and infrastructure by correlating metrics, events, logs, and traces. The more observable a system is, the better equipped the enterprise is to understand application interdependencies and proactively identify and resolve issues.
The infusion of artificial intelligence (AI) into observability solutions is an emerging trend that holds the promise of reshaping how enterprises gain insights and manage their systems. AI brings a new dimension to observability by enhancing capabilities in the following areas:
- Anomaly detection: Machine learning algorithms automatically detect deviations from normal operations and surface potential issues before they escalate.
- Predictive analysis and forecasting: AI-based observability solutions leverage historical data to understand patterns and trends, predict potential future issues, and enable proactive remediation.
- Faster root-cause analysis (RCA): AI can analyse complex and interconnected data and streamline the RCA process by correlating data from multiple sources to identify the origin of issues more accurately and swiftly.
- Generative AI (GenAI)-driven recommendations: GenAI uses system-specific documentation and historical data to generate remediation steps for the Site Reliability Engineering (SRE) team, reducing time to resolution.
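As a rough illustration of the anomaly detection idea in the list above, the sketch below flags points that deviate sharply from a trailing window of recent values. It assumes a simple z-score rule; the post does not describe the algorithm Infinity Watch actually uses, and the function name, window size, threshold, and sample latencies are all illustrative assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations (illustrative rule only)."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response-time samples in milliseconds.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 450, 101, 99]
print(detect_anomalies(latency_ms))  # prints [7], the index of the 450 ms spike
```

A production system would more likely use seasonality-aware or learned baselines; the point here is only that "deviation from normal operations" needs an explicit baseline and a threshold.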
LTIMindtree provides comprehensive business observability solutions, including KPI monitoring, to help customers gain a holistic understanding of their operations. As an AWS Premier Tier Services partner, LTIMindtree leverages its domain expertise and accelerators to deliver strategic business value to enterprises. The company’s success in developing impactful business KPIs and process mapping across industries demonstrates the effectiveness of its observability solutions.
In this post, we will look at how Infinity Watch, LTIMindtree’s observability solution, enables business-driven full-stack visibility across cloud environments. We will also see how the platform offers actionable recommendations for issue resolution while providing a single-pane view across business, applications, infrastructure, and network.
Solution Overview
LTIMindtree’s Infinity Watch is a full-stack observability solution built on AWS, providing cognitive insights on business impact, resiliency, and health. It uses telemetry data to facilitate end-to-end visibility across the cloud lifecycle. The platform integrates with a suite of monitoring tools (like New Relic, AppDynamics, Dynatrace, Prometheus, and Amazon CloudWatch), offering enterprises a comprehensive understanding of the entire business ecosystem.
Additionally, Infinity Watch facilitates metrics correlation, detects anomalies, and harnesses the power of augmented AI to deliver insightful recommendations.
Infinity Watch is a highly extensible, tool-agnostic solution built on a plug-and-play model that integrates with the tools and services customers already use and prefer.
The Infinity Watch solution includes three major building blocks:
Discovery
This module connects with monitoring services like Amazon CloudWatch and AWS Resilience Hub to gather telemetry data from the application landscape. The discovered data is aligned to standardized, structured open telemetry formats. The ingested metrics, events, logs, and traces are stored in a central knowledge base for further correlation.
The platform also integrates with the customer environment for AWS Identity and Access Management (IAM) access. This access is encrypted and secured through the secrets management architecture within the solution.
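The normalization step described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `TelemetryRecord` shape and made-up raw payload fields; it is not the schema Infinity Watch uses, nor the response format of any monitoring tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TelemetryRecord:
    """Hypothetical common shape for every ingested signal."""
    signal: str                      # "metric", "event", "log", or "trace"
    source: str                      # tool that produced the data
    resource: str                    # application or infrastructure component
    name: str
    timestamp: datetime
    value: Optional[float] = None
    attributes: dict = field(default_factory=dict)


def normalize_metric(raw: dict, source: str) -> TelemetryRecord:
    # Assumed raw fields: resource_id, metric_name, epoch_seconds, value, unit.
    return TelemetryRecord(
        signal="metric",
        source=source,
        resource=raw["resource_id"],
        name=raw["metric_name"],
        timestamp=datetime.fromtimestamp(raw["epoch_seconds"], tz=timezone.utc),
        value=float(raw["value"]),
        attributes={"unit": raw.get("unit", "")},
    )


def normalize_log(raw: dict, source: str) -> TelemetryRecord:
    # Assumed raw fields: host, level, time (ISO 8601 with offset), message.
    return TelemetryRecord(
        signal="log",
        source=source,
        resource=raw["host"],
        name=raw.get("level", "INFO"),
        timestamp=datetime.fromisoformat(raw["time"]),
        attributes={"message": raw["message"]},
    )


def build_knowledge_base(records):
    """Group records by resource, in time order, ready for correlation."""
    kb = defaultdict(list)
    for record in sorted(records, key=lambda r: r.timestamp):
        kb[record.resource].append(record)
    return dict(kb)


records = [
    normalize_log(
        {"host": "i-0abc", "level": "ERROR",
         "time": "2024-01-01T00:00:05+00:00", "message": "disk nearly full"},
        source="log-agent",
    ),
    normalize_metric(
        {"resource_id": "i-0abc", "metric_name": "disk_used_percent",
         "epoch_seconds": 1704067200, "value": 87.5, "unit": "Percent"},
        source="metrics-agent",
    ),
]
kb = build_knowledge_base(records)
print([r.signal for r in kb["i-0abc"]])
```

Once every signal shares a resource key and a timezone-aware timestamp, correlating a metric spike with nearby log lines becomes a simple lookup and time-window filter.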
Insights
The Insights module provides a single-pane view of both business and technology observability across the entire stack. It works on the principle of pattern matching for anomaly detection. A cognitive engine analyses the correlated data from the Discovery module to identify anti-patterns, which are then reported through real-time alerts and notifications.
To accelerate diagnosis, end users get a correlated view of metrics, events, logs, and traces. Most importantly, the Insights module provides a mechanism to associate alerts with affected business KPIs, offering complete transparency across resiliency metrics and Service Level Objective (SLO) insights. This enables business stakeholders to make informed decisions.
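One way to picture the alert-to-KPI association is a dependency map from each business KPI to the components it relies on. The sketch below assumes such a map exists and is maintained by hand; the KPI names, component names, and alert fields are hypothetical, and the post does not say how Infinity Watch stores this mapping.

```python
# Hypothetical mapping: business KPI -> technical components it depends on.
KPI_DEPENDENCIES = {
    "checkout_conversion_rate": {"api-gateway", "payment-api", "cart-service"},
    "order_fulfilment_time": {"inventory-db", "shipping-service"},
    "login_success_rate": {"api-gateway", "auth-service"},
}


def affected_kpis(alert, dependencies=KPI_DEPENDENCIES):
    """Return the KPIs whose dependencies include the alerting component."""
    component = alert["component"]
    return sorted(
        kpi for kpi, components in dependencies.items()
        if component in components
    )


def enrich_alert(alert, dependencies=KPI_DEPENDENCIES):
    """Attach the affected KPIs to a copy of the alert."""
    return {**alert, "affected_kpis": affected_kpis(alert, dependencies)}


alert = {"component": "api-gateway", "pattern": "latency_spike"}
print(enrich_alert(alert)["affected_kpis"])
# prints ['checkout_conversion_rate', 'login_success_rate']
```

With this enrichment, a single infrastructure alert can be presented to business stakeholders in terms of the KPIs at risk rather than the failing component.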
Actions
The Actions module provides recommendations and automated workflows across the stack. The recommendations are aligned to the patterns configured for anomaly detection. Remediations can be implemented as defined by your standard operating procedure (SOP) and include actions such as:
- Fix the issue automatically
- Create an incident in ServiceNow
- Generate a resolution runbook with the help of GenAI
These actions help quickly remediate issues and optimize business ecosystems to prevent SLO breaches.
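The SOP-driven choice between these actions can be sketched as a lookup from detected pattern to handler. The handlers below are placeholders that only return a description of what they would do; the pattern names, the `SOP` table, and the fallback choice are assumptions for illustration, and no real ServiceNow or GenAI API is called.

```python
def auto_fix(alert):
    # Placeholder: a real handler would run an automated remediation workflow.
    return f"auto-fix triggered for {alert['pattern']} on {alert['component']}"


def create_incident(alert):
    # Placeholder: a real handler would call the ITSM tool's API.
    return f"incident requested for {alert['pattern']} on {alert['component']}"


def generate_runbook(alert):
    # Placeholder: a real handler would prompt a GenAI model with system docs.
    return f"runbook draft requested for {alert['pattern']} on {alert['component']}"


# Hypothetical SOP: which action applies to which detected pattern.
SOP = {
    "disk_full": auto_fix,
    "latency_spike": generate_runbook,
}


def remediate(alert, sop=SOP, fallback=create_incident):
    """Pick the SOP-defined action, or open an incident for unknown patterns."""
    handler = sop.get(alert["pattern"], fallback)
    return handler(alert)


print(remediate({"pattern": "disk_full", "component": "i-0abc"}))
print(remediate({"pattern": "memory_leak", "component": "cart-service"}))
```

Falling back to an incident for unrecognized patterns keeps a human in the loop whenever the SOP has no automated answer.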
Key Features
Business Impact, Business Resiliency, and Application Health
The platform’s correlation engine provides comprehensive health insights across heterogeneous systems encompassing applications, infrastructure, and business processes. This allows for early identification and troubleshooting through a cognitive engine that uses pattern matching for anomaly detection and faster root-cause analysis.
GenAI-led Runbooks for Automated Remediation
Step-by-step remediation guidance provided by the GenAI-augmented runbooks helps engineers reduce overall time to remediation. It also helps enforce standardization of best practices and guardrails within the business operations.
SLO Management to Measure Service Health Compliance
Infinity Watch helps enterprises reduce the risk of potential breaches, non-compliance, and service level agreement (SLA) or SLO violations by up to 25%. This is achieved by measuring IT performance across the four golden signals: latency, traffic, errors, and saturation.
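To show how an SLO can be measured against one of the golden signals, here is a minimal error-budget calculation based on request errors. It is a generic SRE-style sketch, not Infinity Watch's implementation; the function names, the 25% warning level, and the sample numbers are assumptions.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative means breached.

    slo_target is the required success ratio, e.g. 0.999 for "99.9%".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been consumed
    return 1 - failed_requests / allowed_failures


def slo_status(remaining, warn_at=0.25):
    """Classify the SLO using an assumed warning level for the budget."""
    if remaining < 0:
        return "breached"
    if remaining < warn_at:
        return "at_risk"
    return "healthy"


# Hypothetical 30-day window: 1,000,000 requests against a 99.9% SLO.
for failed in (400, 900, 1200):
    remaining = error_budget_remaining(0.999, 1_000_000, failed)
    print(failed, slo_status(remaining))
```

Alerting on the "at_risk" state, rather than only on a breach, is what gives operations teams time to act before the SLO is violated.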
How Infinity Watch Works
Infinity Watch is a cloud-based platform running within a private subnet in a customer’s Amazon Virtual Private Cloud (VPC).
Figure 1. Infinity Watch Architecture Diagram
AWS Architecture Components
Infinity Watch is deployed in the customer’s environment to comply with customer region, security, and high availability requirements.
Security and compliance are achieved through fine-grained access control using AWS IAM with least-privilege policies, in line with security best practices. The platform is integrated with Active Directory for user authentication and uses role-based access control (RBAC) to ensure controlled access across the platform.
All user requests are routed through an Application Load Balancer, which handles user traffic to the frontend service as it dynamically scales to accommodate demand.
The presentation and service layers of the application are deployed as containers on Amazon Elastic Kubernetes Service (Amazon EKS). This allows for simplified management, routing (using the AWS Load Balancer Controller), and scalability.
Metadata and transaction data are stored in Amazon Relational Database Service (Amazon RDS) for better scalability, availability, and lower operational overhead. Data privacy is ensured through encryption of data both at rest and in transit.
Infinity Watch uses HashiCorp Vault, which is deployed within the customer environment, to manage and retrieve all sensitive data such as database credentials, API keys, and secrets needed within the application.
Key Takeaways
LTIMindtree’s Infinity Watch platform has helped customers drive end-to-end observability with real-time insights to understand the impact on business outcomes and enhance operational efficiency. The key takeaways include:
- Enhanced business resiliency
- 70% reduction in false alerts
- 25% operational cost savings
- 60% faster remediation with AI-driven Root Cause Analysis (RCA)
Summary
In this post, we have touched upon the intricate challenges enterprises face in monitoring, managing, and ensuring seamless performance of modern applications in dynamic and elastic cloud environments. The evolving nature of these applications has led to growing demand for full-stack observability solutions that help teams navigate, and derive insights from, ever-changing business operations.
Platforms like Infinity Watch focus on addressing the need for consolidated visibility into application efficiency, performance, security, and business KPIs on the cloud. The goal is to give stakeholders the insights they need to make informed decisions, fostering an optimized enterprise ecosystem that aligns with evolving business needs.
To learn more about Infinity Watch, reach out to LTIMindtree at infinity.cloud@ltimindtree.com.
LTIMindtree – AWS Partner Spotlight
LTIMindtree is an AWS Advanced Technology Partner and AWS Competency Partner. It provides an advanced monitoring solution for cloud apps and modern infrastructure that aggregates metrics across distributed services and alerts you to service-wide issues and trends in real time.