AWS Big Data Blog
How JPMorgan Chase built a data mesh architecture to drive significant value and enhance their enterprise data platform
April 2024: This post was reviewed for accuracy.
This is a joint blog post co-authored with Anu Jain, Graham Person, and Paul Conroy from JPMorgan Chase.
Most modern organizations recognize that their data benefits their entire enterprise. Data has value to the individual business process that produces it, but data’s additional potential can be realized when it’s shared and combined with other data assets.
Unlike most resources, data doesn’t diminish in value as it’s used. You can use the same data in many places, and the more combinations of data an organization creates, such as between reference data and data from business processes across the enterprise, the more value you can extract via enterprise-wide visibility, real-time analytics, and more accurate AI and machine learning (ML) predictions. Organizations that are good at sharing data internally, where legally permissible, can realize more value from their data resources than organizations that aren’t.
But like any resource, data carries risks that must be managed, particularly in regulated industries. Controls help to mitigate such risks, so organizations that have strong controls around their data are exposed to less risk than those that don’t.
This presents a paradox: data that is permitted to be freely shareable across the enterprise has the potential to add tremendous value for stakeholders, but the more freely shareable the data is, the greater the possible risk to the organization. To unlock the value of our data, we must solve this paradox. We must make data easy to share across the organization, while maintaining appropriate control over it.
JPMorgan Chase Bank, N.A. (JPMC) is taking a two-pronged approach to addressing this paradox. The first prong is defining data products, which are curated by people who understand the data and its management requirements, permissible uses, and limitations. The second is implementing a data mesh architecture, which allows us to align our data technology to those data products.
This combined approach achieves the following:
- Empowers data product owners to make management and use decisions for their data
- Enforces those decisions by sharing data, rather than copying it
- Provides clear visibility of where data is being shared across the enterprise
Let’s first look at what the data mesh is, and then at how the data mesh architecture supports our data product strategy, and how both enable our businesses.
Aligning our data architecture to our data product strategy
JPMC comprises multiple lines of business (LOBs) and corporate functions (CFs) that span the organization. To enable data consumers across JPMC’s LOBs and CFs to more easily find and obtain the data they need, while providing the necessary control around the use of that data, we’re adopting a data product strategy.
Data products are broad but cohesive collections of related data from the systems that support our business operations. We store the data for each data product in its own product-specific data lake, and provide physical separation between each data product lake. Each lake has its own cloud-based storage layer, and we catalog and schematize the data in each lake using cloud services. One can use cloud-based storage services such as Amazon Simple Storage Service (Amazon S3) and data integration services such as AWS Glue to provide these capabilities.
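As a minimal, hedged sketch of how one such product-specific lake might be provisioned, the following Python (boto3) snippet creates an S3 bucket for the storage layer and registers a database and table in AWS Glue for the catalog and schema. All bucket, database, table, and column names are hypothetical placeholders, not JPMC’s actual configuration.

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Hypothetical names for illustration only.
BUCKET = "example-payments-data-product-lake"
DATABASE = "payments_data_product"

# Storage layer: one S3 bucket per data product lake.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},  # example Region
)

# Catalog layer: a Glue database holds the product's schemas.
glue.create_database(DatabaseInput={"Name": DATABASE})

# Register a table so the data is schematized and discoverable.
glue.create_table(
    DatabaseName=DATABASE,
    TableInput={
        "Name": "transactions",
        "StorageDescriptor": {
            "Location": f"s3://{BUCKET}/transactions/",
            "Columns": [
                {"Name": "transaction_id", "Type": "string"},
                {"Name": "counterparty_id", "Type": "string"},
                {"Name": "lob", "Type": "string"},
                {"Name": "amount", "Type": "decimal(18,2)"},
                {"Name": "transaction_date", "Type": "date"},
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```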
The services that consume data are hosted in consumer application domains. These consumer applications are physically separated from each other and from the data lakes. When a data consumer needs data from one or more of the data lakes, we use cloud services to make the lake data visible to the data consumers, and provide other cloud services to query the data directly from the lakes. One could use services such as the AWS Glue Data Catalog to make data visible, AWS Lake Formation to securely share data, and Amazon Athena to interactively query the data.
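As a minimal sketch of that pattern, assuming hypothetical role ARNs, database names, and an S3 output location, a producer could grant a consumer application in-place access with AWS Lake Formation, and the consumer could then query the shared table with Amazon Athena:

```python
import boto3

lakeformation = boto3.client("lakeformation")
athena = boto3.client("athena")

# Hypothetical consumer application role; not a real principal.
CONSUMER_ROLE = "arn:aws:iam::111122223333:role/reporting-app-role"

# Share in place: grant SELECT on the cataloged table rather than
# copying the data into the consumer's domain.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ROLE},
    Resource={
        "Table": {
            "DatabaseName": "payments_data_product",
            "Name": "transactions",
        }
    },
    Permissions=["SELECT"],
)

# The consumer then queries the shared table directly from the lake.
response = athena.start_query_execution(
    QueryString="SELECT lob, SUM(amount) AS total FROM transactions GROUP BY lob",
    QueryExecutionContext={"Database": "payments_data_product"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results-bucket/"},
)
print(response["QueryExecutionId"])
```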
The data product-specific lakes that hold data, and the application domains that consume lake data, are interconnected to form the data mesh. A data mesh is a network of distributed data nodes linked together to ensure that data is secure, highly available, and easily discoverable. The following diagram illustrates this architecture.
Empower the right people to make control decisions
Our data mesh architecture allows each data product lake to be managed by a team of data product owners who understand the data in their domain, and who can make risk-based decisions regarding the management of their data.
When a consumer application needs data from a product lake, the team that owns the consumer application locates the data they need in our enterprise-wide data catalog (see the following diagram). The entries in the catalog are maintained by the processes that move data to the lakes, so the catalog always reflects what data is currently in the lakes.
The catalog allows the consumption team to discover and request the data. Because each lake is curated by a team that understands the data in their domain and can help facilitate rapid, authoritative decisions by the right decision-makers, the consumption team’s wait time is minimized.
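As an illustrative sketch, a consumption team could also search the catalog programmatically with the AWS Glue search_tables API before requesting access; the search term here is an assumption, and results depend on what producers have registered.

```python
import boto3

glue = boto3.client("glue")

# Discover candidate tables across the enterprise catalog.
response = glue.search_tables(SearchText="transactions", MaxResults=25)
for table in response["TableList"]:
    print(f'{table["DatabaseName"]}.{table["Name"]}')
```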
Enforce control decisions through in-place consumption
The data mesh allows us to share data from the product lakes, rather than copying it to the consumer applications that use it. In addition to keeping storage costs down, sharing minimizes discrepancies between the data in the system that produced it and the data seen by the systems that consume it. That helps ensure that the data being consumed for analytics, AI/ML, and reporting is up to date and accurate.
Additionally, because the data doesn’t physically leave the lake, it’s easier to enforce the decisions that the data product owners make about their data. For example, if the data product owners decide to tokenize certain types of data in their lake, data consumers can only access the tokenized values. There are no copies of the untokenized data outside of the lake to create a control gap.
In-place consumption, however, requires more sophisticated access control mechanisms than those needed to control access to copied data. When data is consumed in place, we need to restrict visibility at a very granular level: to specific columns, records, and even individual values (see the following diagram). For example, when a system from one of our LOBs queries a pool of firm-wide reference data shared through a lake, that system may only be granted permission to access the reference data that pertains to that line of business.
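One way to express this kind of fine-grained control is with Lake Formation data cell filters, which restrict both the rows and the columns a consumer sees. The sketch below is an assumption-laden illustration: the account IDs, the `lob` column, the tokenized column name, and the filter values are all hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Restrict a consumer to its own line of business (rows) and to the
# tokenized columns (no untokenized values leave the lake).
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",  # hypothetical producer account
        "DatabaseName": "reference_data_product",
        "TableName": "counterparties",
        "Name": "consumer-banking-lob-filter",
        "RowFilter": {"FilterExpression": "lob = 'consumer-banking'"},
        "ColumnNames": ["counterparty_id", "name_tokenized", "lob"],
    }
)

# Grant the consumer SELECT through the filter, not on the raw table.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/consumer-banking-app"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "123456789012",
            "DatabaseName": "reference_data_product",
            "TableName": "counterparties",
            "Name": "consumer-banking-lob-filter",
        }
    },
    Permissions=["SELECT"],
)
```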
Provide cross-enterprise visibility of data consumption
Historically, data exchanges between systems were either direct system-to-system transfers or via message queues. Because we didn’t have a central, automated repository of all data flows, data product owners couldn’t easily see when their data was flowing between systems.
Our data mesh architecture addresses the visibility challenge by using a cloud-based mesh catalog to facilitate data visibility between the lakes and the data consumers. One could use AWS Glue Data Catalog or a similar cloud-based data cataloging service to enable this.
This catalog doesn’t hold any data, but it does have visibility of which lakes are sharing data with which data consumers. This offers a single point of visibility into the data flows across the enterprise, and gives the data product owners confidence that they know where their data is being used (see the following diagram).
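As a sketch of how that visibility could be surfaced to data product owners, Lake Formation’s list_permissions API can enumerate every grant on lake tables, showing exactly which consumers data is shared with; the output formatting here is illustrative.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Walk every table-level grant so data product owners can see which
# consumers their data is flowing to.
next_token = None
while True:
    kwargs = {"ResourceType": "TABLE"}
    if next_token:
        kwargs["NextToken"] = next_token
    page = lakeformation.list_permissions(**kwargs)
    for grant in page["PrincipalResourcePermissions"]:
        principal = grant["Principal"]["DataLakePrincipalIdentifier"]
        table = grant["Resource"].get("Table", {})
        print(f'{principal} -> {table.get("DatabaseName")}.{table.get("Name", "*")}')
    next_token = page.get("NextToken")
    if not next_token:
        break
```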
Data mesh in action
Here’s an example to illustrate how the data mesh architecture enables our business.
In the past, teams producing firm-wide reports extracted and joined data from multiple systems in multiple data domains to produce reports.
Through the data mesh architecture, the data product owners for those data domains make their data available in lakes. The enterprise data catalog allows reporting teams to find and request the lake-based data to be made available in their reporting application. The mesh catalog makes it possible to audit the data flows from the lakes to the reporting application, so it’s clear where the data in the reports originates.
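To make this concrete, here is a hedged sketch of such a report query: with the appropriate Lake Formation grants in place, a reporting application can join data from two product lakes in a single Amazon Athena query, with no extracts or copies. The database, table, and column names are illustrative assumptions carried over from the earlier sketches.

```python
import boto3

athena = boto3.client("athena")

# Join reference data and transaction data in place, across two
# product lakes, instead of extracting copies into the reporting app.
QUERY = """
SELECT c.lob,
       c.name_tokenized,
       SUM(t.amount) AS total_amount
FROM reference_data_product.counterparties AS c
JOIN payments_data_product.transactions AS t
  ON t.counterparty_id = c.counterparty_id
GROUP BY c.lob, c.name_tokenized
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://example-reporting-results-bucket/"},
)
print(response["QueryExecutionId"])
```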
Conclusion
JPMC’s data mesh architecture aligns our data technology solutions to our data product strategy. By providing a blueprint for instantiating data lakes that implements the mesh architecture in a standardized way using a defined set of cloud services, we enable data sharing across the enterprise while giving data owners the control and visibility they need to manage their data effectively.
About the Authors
Anu Jain has more than 20 years of technology leadership experience and is JPMorgan Chase’s Head of Enterprise Data Technology. As a member of the Chief Technology Office (CTO) leadership team, Anu drives the bank’s data technology strategy, including data architecture and big data practices, to enable the lines of business to make informed decisions and drive value from our data assets. She is an innovator and an expert in developing and delivering large-scale distributed software that solves real-world problems, with demonstrated success envisioning and implementing a broad range of highly scalable data platforms, products, and architectures.
Graham Person is a senior technology leader at JPMorgan Chase Bank (“JPMC”) with over 30 years of engineering execution, operational excellence, risk management, and leadership experience in the enterprise IT industry. Currently, he works in the Central Technology Organization at JPMC as a key engineering lead. Graham is driving the acceleration and adoption of public cloud technologies in support of JPMC’s expanding AI/ML demands. By leveraging the latest technologies, the bank is able to deliver data to businesses far more quickly and with greater accuracy and flexibility, generating improved insights and capabilities for business innovation and growth.
Paul Conroy is the product owner for Data Governance capabilities in JPMC’s Enterprise Cloud Data Platform. Paul has over 25 years’ experience in financial services technology architecture, engineering, operations, information security, cloud, and data management. Paul delivers Data Governance capabilities that are built into the data platform, and that are integrated with existing information architecture processes and business metadata. These capabilities help teams use shared data to produce new business insights, while reducing data risk by helping teams more easily meet their legal, regulatory, contractual, and policy obligations for sharing and managing that data.
Nivas Shankar is a Principal Data Architect at Amazon Web Services. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a Master’s degree in physics and is passionate about theoretical physics concepts. When not building and designing data lakes, Nivas enjoys spending time with his wife and kids, travelling and exploring new places.
Audit History
Last reviewed and updated in April 2024 by Fabrizio Napolitano | Principal Analytics Solutions Architect