How CloudWatch cross-account observability helps JPMorgan Chase improve Federated Data Lake Monitoring
AWS best practices guide customers to deploy their applications across multiple AWS accounts to establish security and billing boundaries between teams and to reduce the impact of operational events. As enterprises grow and the number of resources they run increases, customers often need a unified observability experience to help them search, visualize, and analyze their telemetry data, including metrics, logs, and traces, across multiple AWS accounts. JPMorgan Chase, the largest global investment bank and financial services company in the US, wanted to use Amazon CloudWatch to improve monitoring of its federated data lake, which spans thousands of accounts. In this post, we show how to set up cross-account observability in Amazon CloudWatch to view telemetry data across connected AWS accounts, and how JPMorgan Chase uses it in its centralized monitoring.
A data mesh is a large-scale, interconnected pool of data. It marks a move from a monolithic data lake to a loosely coupled data architecture in which data is organized into products that can be onboarded, consumed, and managed independently of each other.
Using Data Mesh at JPMorgan Chase
JPMorgan Chase uses AWS Lake Formation to support its multiple lines of business, maximize data reuse, and ensure data governance. It created a data mesh, known internally as the Federated Data Lake, in which multiple teams and applications use separate AWS accounts (Producer) to onboard data. That data is registered in the AWS Glue Data Catalog, and access to it is entitled through AWS Lake Formation in a central account (Governor). Each line of business can create as many data producer and consumer accounts as it needs, all linked together by the central account. Teams that manage data products can share them with consumer accounts under fine-grained entitlements, using an automated, self-service pipeline that runs through the controlled central account.
Amazon CloudWatch Cross-Account Observability
Amazon Web Services recently launched cross-account observability in Amazon CloudWatch to help customers monitor and troubleshoot applications that span multiple AWS accounts within an AWS Region. With it, customers can search, visualize, and analyze their logs, metrics, and traces across account boundaries. They can start with an aggregated cross-account view of their application to visually identify the resources exhibiting errors, then dive deep into correlated traces, metrics, and logs to find the root cause. This cross-account data access and navigation reduces the manual effort required to troubleshoot issues and shortens time to resolution. Cross-account observability is an addition to CloudWatch’s unified observability capability.
What We Had Before
Prior to CloudWatch cross-account observability, each application team's account owner had to monitor their own account individually. While Cross-Account Cross-Region Dashboards made it possible to share metrics dashboards between teams, we still needed a way to trace requests across the collaborating infrastructure and to dig into logs across accounts. The result was fragmented responsibility and visibility for a product that is, at its heart, a collaborative, cross-business effort. Resolving an issue required arm's-length coordination between teams to answer basic questions that the data could have answered easily, had there been a way to stitch it together across the affected accounts.
A change within the central account can affect the ability of a producer account to add data or of a consumer account to read data. However, it is hard to know that a change has had a negative effect without verifying it in each of those accounts. We can set up monitoring or health checks in the producer and consumer accounts, but that data is not viewable by the central team from the central account.
Unless the producer or consumer account team reports a problem, the central account team is not aware of it. This lengthens the Root Cause Analysis (RCA) needed to identify and fix problems. In an improved state, the central team gets immediate feedback from all connected accounts on whether the platform is functional and data is still accessible. To achieve this, the central account needs a way to collect, correlate, aggregate, and analyze telemetry data from participants’ accounts, which reduces the Mean Time to Resolution (MTTR) when there is a problem.
How Cross-Account Observability helps improve JPMorgan Chase Federated Data Lake Monitoring
The new CloudWatch cross-account observability feature provides a unified observability experience across Amazon CloudWatch, giving you the ability to monitor and troubleshoot applications that span multiple AWS accounts. You can search, visualize, and analyze metrics, logs, and traces with a bird's-eye view, as if you were operating in a single account. Using these capabilities, our central team and business line teams collaborate more effectively, the central team can observe the impact of its changes in real time without assistance, our MTTR is shorter, and our RCAs are simpler.
The goal is to create a system where changes to the central account setup can be immediately verified as non-detrimental to all producer and consumer accounts. This means the central team should be able to view telemetry data for all connected accounts. However, there could be many accounts, so it is not practical for the central team to log in to each one to check telemetry data and verify that everything is working.
We automated health checks in the producer and consumer accounts and share their telemetry data in real time with the central account using cross-account observability. This lets us visualize and alert on faults in the platform by combining metrics, logs, and traces. The process is broken down into a few steps.
- Create a Lambda function that checks the functionality of the data lake. For example, in the consumer account, a given IAM role should be able to query data from a shared Glue Data Catalog table using Athena. The Lambda function sends logs to CloudWatch and has tracing enabled so that it sends segment data to X-Ray.
- Create a schedule using EventBridge to run the Lambda function periodically.
- Set up the central account as a Sink. (Cross-Account Setup)
- Set up the consumer account as a source account and Link it with the Sink. (Cross-Account Setup)
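The two cross-account setup steps above can be sketched with the CloudWatch Observability Access Manager (OAM) API. This is a minimal sketch, not the setup JPMorgan Chase actually used: the function names, the sink name, and the account IDs are hypothetical, and the `oam` argument is assumed to be a boto3 client created with `boto3.client("oam")` using credentials for the account named in each docstring.

```python
import json

# Telemetry types shared in this sketch: metrics, log groups, and traces.
RESOURCE_TYPES = [
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace",
]


def build_sink_policy(source_account_ids):
    """Return a sink policy (JSON string) that lets the given accounts link to the sink."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": list(source_account_ids)},
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": "*",
                # Only allow links that share the telemetry types listed above.
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "oam:ResourceTypes": RESOURCE_TYPES
                    }
                },
            }
        ],
    }
    return json.dumps(policy)


def create_monitoring_sink(oam, sink_name, source_account_ids):
    """Run in the central (monitoring) account: create the sink and attach its policy."""
    sink_arn = oam.create_sink(Name=sink_name)["Arn"]
    oam.put_sink_policy(
        SinkIdentifier=sink_arn,
        Policy=build_sink_policy(source_account_ids),
    )
    return sink_arn


def link_source_account(oam, sink_arn, label_template="$AccountName"):
    """Run in a producer or consumer (source) account: link it to the sink."""
    response = oam.create_link(
        LabelTemplate=label_template,
        ResourceTypes=RESOURCE_TYPES,
        SinkIdentifier=sink_arn,
    )
    return response["Arn"]
```

The sink is created once in the central account, and `link_source_account` is then run once per producer or consumer account. With thousands of accounts, the sink policy would more likely grant access by organization than by listing account IDs; the explicit list here just keeps the sketch short.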
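In the resulting architecture, an EventBridge schedule in each source account periodically runs the health-check Lambda function, which sends logs to CloudWatch and trace segments to X-Ray. The Link then shares this telemetry with the Sink in the central account.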
Once the steps above are complete, AWS X-Ray in the central account shows a successful end-to-end health check for a consumer account. Note that in this trace, the AWS::Lambda::Function elements have an Acct# underneath, indicating that this data comes from a source account.
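A health check like the one traced above could be written as a Lambda handler roughly as follows. This is a sketch under stated assumptions rather than the function JPMorgan Chase runs: the function names and the `HEALTH_CHECK_*` environment variables are hypothetical, the Lambda execution role stands in for the IAM role being checked, and active tracing is assumed to be enabled on the function so that each invocation is recorded in X-Ray.

```python
import os
import time


def run_health_check(athena, query, database, output_location,
                     poll_seconds=1.0, max_polls=30):
    """Run one Athena query and return its ID, raising if it does not succeed."""
    query_id = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a terminal state or we give up.
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            # print() output from a Lambda function goes to CloudWatch Logs.
            print(f"health check passed: query {query_id}")
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason reported")
            raise RuntimeError(
                f"health check failed: query {query_id} {state}: {reason}"
            )
        time.sleep(poll_seconds)
    raise TimeoutError(f"health check timed out: query {query_id}")


def lambda_handler(event, context):
    """Lambda entry point; configuration comes from hypothetical env variables."""
    import boto3  # provided by the Lambda Python runtime

    query_id = run_health_check(
        boto3.client("athena"),
        query=os.environ["HEALTH_CHECK_QUERY"],
        database=os.environ["HEALTH_CHECK_DATABASE"],
        output_location=os.environ["HEALTH_CHECK_OUTPUT"],
    )
    return {"status": "ok", "queryExecutionId": query_id}
```

Letting the exception propagate out of the handler marks the invocation as failed, so the failure shows up in the function's logs, its error metric, and its trace without any extra reporting code.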
When the health check fails, the trace in the central account shows where the request failed.
We can set up alerts in the monitoring account to notify us when this happens.
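One way to set up such an alert is a CloudWatch alarm in the monitoring account that watches the health-check function's `Errors` metric in a source account. The sketch below only builds the arguments for `put_metric_alarm`; the function name, the alarm name, the period, and the threshold are illustrative assumptions, and it assumes the source account already shares metrics with the monitoring account.

```python
def build_health_check_alarm(source_account_id, function_name, sns_topic_arn):
    """Build put_metric_alarm arguments for a source account's Lambda Errors metric."""
    return {
        # Hypothetical naming scheme: one alarm per source account.
        "AlarmName": f"datalake-health-check-{source_account_id}",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "EvaluationPeriods": 1,
        # Treat missing data as a fault, so a health check that stops
        # running also raises the alarm.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            {
                "Id": "errors",
                # AccountId tells CloudWatch which source account owns the metric.
                "AccountId": source_account_id,
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Lambda",
                        "MetricName": "Errors",
                        "Dimensions": [
                            {"Name": "FunctionName", "Value": function_name}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            }
        ],
    }
```

In the monitoring account, the result would be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm_args)`, once for each source account to be watched.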
Cross-account observability in CloudWatch delivers a holistic operational view in just a few steps without requiring additional data pipelines, saving customers time, effort, and cost in managing infrastructure and applications. In this blog post, we showed you how to automate health checks in producer and consumer accounts while sharing telemetry data in real time with the central monitoring account. Amazon CloudWatch cross-account observability is now generally available in all commercial AWS Regions. To learn more about cross-account observability, please refer to the Amazon CloudWatch documentation.