AWS Cloud Operations & Migrations Blog

How CloudWatch cross-account observability helps JPMorgan Chase improve Federated Data Lake Monitoring

AWS best practices guide customers to deploy their applications across multiple AWS accounts to establish security and billing boundaries between teams and to reduce the impact of operational events. As enterprises grow and their resource counts scale, customers often need a unified observability experience to help them search, visualize, and analyze their telemetry data, including metrics, logs, and traces, across multiple AWS accounts. JPMorgan Chase, the biggest global investment bank and financial services company in the US, wanted to use Amazon CloudWatch to improve monitoring of its federated data lake, which spans thousands of accounts. In this post, we show how to set up cross-account observability in Amazon CloudWatch to view telemetry data across connected AWS accounts and how JPMorgan Chase leverages this cross-account observability in its centralized monitoring.

Data Mesh

A data mesh is a large-scale, interconnected pool of data. It moves away from a monolithic data lake toward a loosely coupled data architecture. Data is organized into products that can be onboarded, consumed, and managed independently of each other.

Using Data Mesh at JPMorgan Chase 

JPMorgan Chase uses AWS Lake Formation to support its multiple lines of business, maximizing data reuse and ensuring data governance. JPMorgan Chase created a data mesh, internally called the Federated Data Lake, in which multiple teams and applications use different AWS accounts (Producers) to onboard data. That data is registered in the AWS Glue Data Catalog, and access to it is entitled through AWS Lake Formation in a central account (Governor). Each line of business can create as many data producer and consumer accounts as it needs, all linked together by the central account. Teams managing data products can share them with consumer accounts, with fine-grained entitlements, using an automated, self-service pipeline through the controlled central account.

Figure 1. Data Mesh architecture at JPMorgan Chase, also called the Federated Data Lake
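
To make the entitlement step concrete, here is a minimal sketch of how a central (Governor) account could grant a consumer account read access to a shared table. It is not JPMorgan Chase's actual pipeline. The function name, account IDs, database, table, and column names are hypothetical; only the Lake Formation GrantPermissions request shape comes from the AWS API.

```python
def build_table_grant(governor_account_id, consumer_account_id,
                      database, table, columns=None):
    """Build keyword arguments for Lake Formation's GrantPermissions call.

    All argument values are illustrative placeholders. When `columns` is
    given, the grant is limited to those columns (a fine-grained entitlement).
    """
    if columns:
        resource = {
            "TableWithColumns": {
                "CatalogId": governor_account_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        resource = {
            "Table": {
                "CatalogId": governor_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        }
    return {
        # The consumer AWS account is the principal receiving access.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": resource,
        "Permissions": ["SELECT"],
    }


grant = build_table_grant(
    governor_account_id="111111111111",   # hypothetical central account
    consumer_account_id="222222222222",   # hypothetical consumer account
    database="sales_product",             # hypothetical data product
    table="daily_orders",
    columns=["order_id", "order_date"],
)
# With credentials for the central account, the request could be sent as:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Keeping the request builder separate from the API call makes the entitlement easy to review and test before anything is granted.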

Amazon CloudWatch Cross-Account Observability

Amazon Web Services recently launched cross-account observability in Amazon CloudWatch to help customers monitor and troubleshoot applications that span multiple AWS accounts within an AWS Region. Using cross-account observability in CloudWatch, customers can seamlessly search, visualize, and analyze their logs, metrics, and traces without any account boundaries. Customers can start with an aggregated cross-account view of their application to visually identify the resources exhibiting errors, then dive deep into correlated traces, metrics, and logs to find the root cause of the issue. The seamless cross-account data access and navigation helps customers reduce the manual effort required to troubleshoot issues and shorten time to resolution. Cross-account observability is an addition to CloudWatch’s unified observability capability.
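
As one illustration of searching logs without account boundaries, the sketch below builds a single CloudWatch Logs Insights query that targets the same Lambda log group in several source accounts. It assumes the monitoring account is already linked to those accounts. The account IDs, Region, function name, and helper names are hypothetical; `logGroupIdentifiers` is the StartQuery parameter that accepts log group ARNs from linked accounts.

```python
def lambda_log_group_arn(region, account_id, function_name):
    """Return the ARN of a Lambda function's default log group."""
    return (
        f"arn:aws:logs:{region}:{account_id}:"
        f"log-group:/aws/lambda/{function_name}"
    )


def build_error_query(region, source_account_ids, function_name,
                      start_epoch, end_epoch, limit=50):
    """Build keyword arguments for a cross-account Logs Insights StartQuery call."""
    return {
        # One log group ARN per source account, all queried together.
        "logGroupIdentifiers": [
            lambda_log_group_arn(region, account_id, function_name)
            for account_id in source_account_ids
        ],
        "startTime": start_epoch,
        "endTime": end_epoch,
        "queryString": (
            "fields @timestamp, @logStream, @message "
            "| filter @message like /ERROR/ "
            "| sort @timestamp desc"
        ),
        "limit": limit,
    }


query = build_error_query(
    region="us-east-1",                                   # hypothetical Region
    source_account_ids=["222222222222", "333333333333"],  # hypothetical accounts
    function_name="datalake-health-check",                # hypothetical function
    start_epoch=1700000000,
    end_epoch=1700003600,
)
# From the monitoring account, the query could be started with:
#   boto3.client("logs").start_query(**query)
```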

What We Had Before

Prior to CloudWatch cross-account observability, each application team had to monitor its own accounts individually. While Cross-Account Cross-Region Dashboards made it possible to share dashboards between teams for metrics, we needed a way to trace requests across the collaborating infrastructure and dig into logs across accounts. This led to a balkanization of responsibility and visibility across a product that was, at its heart, a collaborative cross-business task. Issue resolution required arm's-length coordination between teams to answer basic questions that the data could have answered easily, had there been a way to stitch it together across the impacted accounts.

A change within the central account can affect the ability of a producer account to add data or a consumer account to read data. However, it is hard to know that a change has had a negative effect on these accounts without verifying it in each of those accounts. We can set up some monitoring or health checks in the producer and consumer accounts but that data is not viewable by the central team via the central account.

Without reporting by the producer account team or consumer account team, the central account team is not aware of a negative effect. This can increase the Root Cause Analysis (RCA) time needed to identify and fix problems. An improved state is one where the central team immediately gets feedback from all connected accounts on whether the platform is functional and data is still accessible. To achieve this, the central account needs a way to collect, correlate, aggregate, and analyze telemetry data from participants’ accounts, reducing the Mean Time to Resolution (MTTR) when there is a problem.

How Cross-Account Observability helps improve JPMorgan Chase Federated Data Lake Monitoring

The new CloudWatch cross-account observability feature is a unified observability experience across Amazon CloudWatch that lets you monitor and troubleshoot applications that span multiple AWS accounts. You can seamlessly search, visualize, and analyze metrics, logs, and traces with a bird's-eye view, as if you were operating in a single account without account boundaries. Using these capabilities, our central team and business line teams collaborate more effectively, the central team can observe the impact of its changes in real time without assistance, our MTTR is shorter, and our RCAs are simpler.

Task

The goal is to create a system where changes to the central account setup can be immediately verified as non-detrimental to all producer and consumer accounts. This means the central team should be able to view telemetry data for all connected accounts. However, because there could be many accounts, it is not practical for the central team to log in to each one to check telemetry data and verify that all functions are working.

Action

We automated health checks in the producer and consumer accounts and share their telemetry data in real time with the central account using cross-account observability. This lets us visualize and alert on faults in the platform, combining the power of metrics, logs, and traces. The process is broken down into a few steps.

  1. Create a Lambda function that checks the functionality of the data lake. For example, in the consumer account, a given IAM role should be able to query data from a shared Glue Data Catalog table using Athena. The Lambda function sends logs to CloudWatch and has tracing enabled to send segment data to X-Ray.
  2. Create a schedule using EventBridge to run the Lambda function periodically.
  3. Set up the central account as a Sink. (Cross-Account Setup)
  4. Set up the consumer account as a source account and Link it with the Sink. (Cross-Account Setup)
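
Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the health check JPMorgan Chase runs: the environment variable names, the rule and target names, and the five-minute rate are hypothetical. The Athena and EventBridge calls (`start_query_execution`, `get_query_execution`, `put_rule`, `put_targets`) are real boto3 operations, and the clients are passed in so that the logic can be exercised without AWS access.

```python
import json
import os
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_health_check(athena, query, database, output_location,
                     poll_seconds=1.0, max_polls=30):
    """Run one Athena query and report whether it finished successfully."""
    started = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    state, reason = "UNKNOWN", ""
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        reason = status.get("StateChangeReason", "")
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return {"healthy": state == "SUCCEEDED", "state": state,
            "reason": reason, "query_id": query_id}


def lambda_handler(event, context):
    """Lambda entry point; the environment variable names are assumptions."""
    import boto3  # provided by the Lambda Python runtime

    result = run_health_check(
        boto3.client("athena"),
        query=os.environ["HEALTH_CHECK_QUERY"],
        database=os.environ["HEALTH_CHECK_DATABASE"],
        output_location=os.environ["HEALTH_CHECK_OUTPUT"],
    )
    print(json.dumps(result))  # stdout is delivered to CloudWatch Logs
    if not result["healthy"]:
        # Raising marks the invocation as failed, which counts toward the
        # function's Errors metric.
        raise RuntimeError(f"Data lake health check failed: {result['state']}")
    return result


def schedule_health_check(events, rule_name, function_arn,
                          rate="rate(5 minutes)"):
    """Create an EventBridge rule that invokes the function periodically."""
    rule = events.put_rule(Name=rule_name, ScheduleExpression=rate,
                           State="ENABLED")
    events.put_targets(Rule=rule_name,
                       Targets=[{"Id": "health-check", "Arn": function_arn}])
    return rule["RuleArn"]
```

Two pieces are left out of the sketch: the function needs active tracing enabled in its configuration for segment data to reach X-Ray, and EventBridge needs permission to invoke the function.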

The resulting architecture will look like this.

Figure 2. JPMorgan Chase Federated Lake monitoring after CloudWatch Cross-Account Observability
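
The Sink and Link setup in steps 3 and 4 can be sketched with the CloudWatch Observability Access Manager (OAM) API. The sink name, account IDs, and helper function names below are hypothetical, and this is one possible way to script the setup rather than the exact procedure used at JPMorgan Chase. The `create_sink`, `put_sink_policy`, and `create_link` operations and the policy shape follow the OAM API.

```python
import json

# Telemetry types shared from source accounts to the monitoring account.
RESOURCE_TYPES = [
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace",
]


def build_sink_policy(source_account_ids, resource_types=RESOURCE_TYPES):
    """Build a sink policy that lets the listed accounts link to the sink."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": list(source_account_ids)},
            "Action": ["oam:CreateLink", "oam:UpdateLink"],
            "Resource": "*",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "oam:ResourceTypes": list(resource_types)
                }
            },
        }],
    }


def create_sink(oam, name, source_account_ids):
    """Run in the central (monitoring) account; returns the sink ARN."""
    sink_arn = oam.create_sink(Name=name)["Arn"]
    oam.put_sink_policy(
        SinkIdentifier=sink_arn,
        Policy=json.dumps(build_sink_policy(source_account_ids)),
    )
    return sink_arn


def create_link(oam, sink_arn, resource_types=RESOURCE_TYPES):
    """Run in each producer or consumer (source) account; returns the link ARN."""
    response = oam.create_link(
        LabelTemplate="$AccountName",  # label source data by account name
        ResourceTypes=list(resource_types),
        SinkIdentifier=sink_arn,
    )
    return response["Arn"]


# Example wiring (needs credentials for the respective accounts):
#   sink_arn = create_sink(boto3.client("oam"), "datalake-sink", ["222222222222"])
#   create_link(boto3.client("oam"), sink_arn)   # in the source account
```

Because the sink policy names the allowed source accounts explicitly, new producer or consumer accounts must be added to it before they can create a link.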

Result

After completing the above steps, AWS X-Ray in the central account shows a successful end-to-end health check for a consumer account. Note that in this trace, the AWS::Lambda::Function elements have an Acct# underneath, indicating that this data comes from a source account.

Figure 3. End-to-end trace view for cross-account Lambda functions

When the health check fails, the trace looks like this.

Figure 4. How anomalies look in cross-account tracing, which can be managed through alerts
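
Beyond the console view, failing health checks could also be listed programmatically. The sketch below groups trace summaries that contain a fault or error by the accounts that took part in each trace. It assumes input shaped like the `TraceSummaries` list returned by X-Ray's GetTraceSummaries operation, and it assumes that a call made from the monitoring account includes traces from linked source accounts. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def failing_traces_by_account(trace_summaries):
    """Map each participating account ID to the IDs of its failing traces.

    A trace is listed under every account that appears in its ServiceIds,
    so the mapping shows which accounts were involved, not which one failed.
    """
    by_account = defaultdict(list)
    for summary in trace_summaries:
        if not (summary.get("HasFault") or summary.get("HasError")):
            continue
        accounts = {
            service["AccountId"]
            for service in summary.get("ServiceIds", [])
            if service.get("AccountId")
        }
        for account_id in sorted(accounts):
            by_account[account_id].append(summary["Id"])
    return dict(by_account)


# Hypothetical sample in the shape of GetTraceSummaries output.
sample = [
    {"Id": "1-aaa", "HasFault": True,
     "ServiceIds": [{"Name": "health-check", "AccountId": "222222222222",
                     "Type": "AWS::Lambda::Function"}]},
    {"Id": "1-bbb", "HasFault": False, "HasError": False,
     "ServiceIds": [{"Name": "health-check", "AccountId": "333333333333",
                     "Type": "AWS::Lambda::Function"}]},
]
failing = failing_traces_by_account(sample)
# Live data could come from, for example:
#   boto3.client("xray").get_trace_summaries(
#       StartTime=start, EndTime=end, FilterExpression="fault OR error")
```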

We can set up alerts in the monitoring account to notify us when this happens.

Figure 5. A cross-account alert that is set up in the monitoring account
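
An alert like this could be defined as a CloudWatch alarm in the monitoring account that watches a metric from a source account. The sketch below is one possible definition, not the alarm shown in the figure: the alarm name, account ID, function name, SNS topic ARN, and threshold are hypothetical. The `AccountId` field on the metric query is what points the alarm at a linked source account.

```python
def build_cross_account_error_alarm(source_account_id, function_name,
                                    sns_topic_arn, period_seconds=300):
    """Build keyword arguments for a PutMetricAlarm call in the monitoring account."""
    return {
        "AlarmName": f"datalake-health-{source_account_id}-{function_name}",
        "AlarmDescription": "Health check Lambda reported errors",
        "Metrics": [{
            "Id": "errors",
            "AccountId": source_account_id,  # source account that owns the metric
            "ReturnData": True,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Errors",
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name}
                    ],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
        }],
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "EvaluationPeriods": 1,
        # Design choice: missing data means the check did not run, so alarm.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


alarm = build_cross_account_error_alarm(
    source_account_id="222222222222",          # hypothetical consumer account
    function_name="datalake-health-check",     # hypothetical function
    sns_topic_arn="arn:aws:sns:us-east-1:111111111111:datalake-alerts",
)
# In the monitoring account, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Treating missing data as breaching is a deliberate choice in this sketch: if the scheduled health check stops running, the central team is notified instead of seeing a silently healthy alarm.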

Conclusion

Cross-account observability in CloudWatch delivers a holistic operational view in just a few steps without requiring additional data pipelines, saving customers time, effort, and cost in managing infrastructure and applications. In this blog post, we showed you how to automate health checks in producer and consumer accounts while sharing telemetry data in real time with the central monitoring account. Amazon CloudWatch cross-account observability is now generally available in all commercial AWS Regions. To learn more about cross-account observability, please refer to the Amazon CloudWatch documentation.

About the authors:

Marc Rosenthal

Marc Rosenthal is an architect in the Cloud Engineering Center of Excellence for Corporate Technology at JPMorgan Chase. Marc works closely with teams to modernize their applications and migrate them to AWS. He is an engineer on the Federated Data Lake and contributes to multiple infrastructure blueprints. Outside of work, he enjoys live music, sports, and spending time with his family.

Ayobami Taiwo

Ayobami Taiwo is a senior software engineer in the Corporate Technology SRE Center of Excellence at JPMorgan Chase. Based in Glasgow, Scotland, Ayobami is passionate about researching cloud topics, which inspires her to keep creating innovative solutions that accelerate SRE teams and drive strategic decisions at JPMorgan Chase. In the last decade, Ayobami has worked as a presales engineer, product engineer, software programmer, web developer, and cloud infrastructure and DevOps engineer. Outside of work, she spends time travelling with family, exploring nature with her beautiful daughter, fashion styling, and reading a new book.

Anthony Giles

Anthony Giles is Head of the Site Reliability Engineering Center of Excellence for Employee Experience Corporate Technology at JPMorgan Chase. Anthony represents or leads multiple firmwide groups across all lines of business with a passion for SRE, AWS, and Observability. Anthony holds a Bachelor of Science in Computer Science and has more than 20 years’ experience as an IT leader in the financial and healthcare industries in various development, database, architecture, engineering, and SRE roles. Prior to joining JPMorgan Chase, Anthony developed the cloud-native strategy for monitoring the core banking system at a prior financial services firm. Aside from spending time with his family in Delaware, he coaches track for Delaware youth, a passion that draws on his 13 years as a college track coach.

Robert Waugh

Robert Waugh is a Managing Director at JPMorgan Chase focused on firmwide observability strategy. Robert leads its client engineering functions and oversees its APM, Synthetics, and Visualization services across all business lines. Before he joined JPMorgan Chase, Robert developed the cloud-native strategy for his previous employer and developed the CloudWatch Logs and Logs Insights service at AWS. He also has experience developing quantitative trading and finance platforms, and he started his career at Netscape Communications, where he developed browser and server features of the Internet that are still in use today. At home in Seattle, he modifies his custom 3D printer and plays Minecraft with his daughter, who aspires to resurrect extinct dog breeds.

Krishna Kannan

Krishna is the product owner of the Federated Data Lake, a governed data mesh implementation using AWS Lake Formation. He leads the team responsible for migrating Data Lake, Machine Learning, and Analytics workloads in his LOB to the AWS Cloud. His responsibilities include cloud and data strategy, solutions architecture, governance, and engineering support. Prior to joining JPMC, he consulted for large financial institutions and for retail and healthcare companies. He has a bachelor’s degree in electronics and communication engineering and is currently pursuing his master’s in Analytics at Georgia Institute of Technology. Outside of work, he is interested in travelling, watching movies, and spending time with his family.

Omur Kirikci

Omur Kirikci is a Senior Product Manager for Amazon CloudWatch based in Dublin, Ireland. He is passionate about creating new products and looks for ideas from everywhere in order to deliver solutions with the right quality in a timely fashion. Before he joined AWS, Omur spent more than 15 years in product management, program management, go-to-market strategy, and product development. Outside of work, he enjoys being outdoors and hiking, spending time with his family, tasting different cuisines, and watching soccer with friends.