Modernizing Data Platforms, Accelerating Innovation, and Unlocking Business Value with Data Mesh on AWS
By Tony Giordano, Sr. Partner, Global Leader Data Platform Services – IBM
By Ryan Keough, Manager, Solution Architecture – AWS
By Amit Chowdhury, Sr. Partner Solutions Architect – AWS
Organizations are constantly generating and collecting large amounts of data. Storing, analyzing, and acting on this ever-growing data becomes a challenge with traditional techniques like data lake and data warehouse approaches, which focus on centralizing data and supporting data ecosystem components.
To sustain innovation and drive business value, organizations are realizing the importance of decentralization of data products using a data mesh architectural design approach.
In this post, we will discuss how to strike the right balance of business ownership using data mesh to modernize data platforms and accelerate innovation. We'll also describe referenceable techniques to build data mesh solutions using Amazon Web Services (AWS) native services including Amazon DataZone, AWS Lake Formation, and AWS Glue.
IBM Consulting is an AWS Premier Tier Services Partner and is recognized as a Global Systems Integrator (GSI) for many competencies including Data and Analytics Consulting, which positions IBM to help customers who use AWS to harness the power of innovation and drive their business transformation.
What is Data Mesh?
Data mesh is the latest evolution in the ongoing data management discussion of how best to enable businesses to leverage their data investment on AWS, while ensuring data is treated as an enterprise asset that is governed, managed, secured, and monetized.
Data mesh addresses the fact that while there are centralized enterprise systems, data analytics are local or domain-driven. For example, the data analytics needs for a finance organization are radically different than those of a marketing organization. Those differences have only become wider in the digital age.
The era of the data warehouse, where the primary use case was business intelligence (BI) and reporting, and the era of the data lake, where the primary use cases were big data, predictive analytics, and data science development, have both given way to the modern data platform.
The digital age, which requires analytics data to serve business intelligence and reporting, data science modeling, and operational and digital use cases, has spawned the discussion of data mesh: who builds these analytic applications, and where and how they are built.
A data mesh is based on the core concept of decentralized data development and ownership by business domain. Data mesh views data as a “product,” and each domain designs, develops, and uses their data products related to their area of business for their analytic needs.
In a data mesh model, domains are not only responsible for the full lifecycle of their data products but the integrity and quality as well. The view of data mesh principles is that since business domains own their data products, they can quickly respond to new needs with their own prioritization vs. waiting for a centralized IT function. This is meant to foster an accelerated data-driven innovation discipline by allowing greater autonomy and flexibility for data owners.
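To make the "data as a product" idea concrete, a domain team might describe each product with an explicit contract covering ownership, schema, and quality expectations. The sketch below is illustrative only; the field names are assumptions for this post, not part of any AWS service or data mesh standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Minimal, hypothetical descriptor a domain team might publish so
    consumers know who owns a product and what it guarantees."""
    name: str
    domain: str            # owning business domain, e.g. "marketing"
    owner_email: str       # accountable data owner
    schema: dict           # column name -> type
    freshness_sla_hours: int = 24
    quality_checks: list = field(default_factory=list)

    def is_owned_by(self, domain: str) -> bool:
        """Ownership check: the owning domain answers for lifecycle and quality."""
        return self.domain == domain

# Example: a marketing-owned product (names are hypothetical)
clickstream = DataProduct(
    name="clickstream_sessions",
    domain="marketing",
    owner_email="marketing-data@example.com",
    schema={"session_id": "string", "started_at": "timestamp"},
    quality_checks=["session_id is unique", "started_at not null"],
)
```

Publishing such a contract alongside the data is one way a domain can signal that it, not a central IT function, is accountable for the product's integrity.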
Challenges of Data Mesh Implementation
The reality of data mesh implementations has been largely mixed, however. Many organizations are on their second or even third attempt at data mesh, and many have spent millions of dollars trying to implement data mesh in their organizations with little return on investment.
Some of the common points of failure have included:
- Lack of coordination (and agreement) on how to collaborate with enterprise data teams: Many organizations do not define which processes domain teams will perform vs. the central team. Their domain teams have either built data integration, data governance, and cataloging capabilities that duplicate the enterprise's (unnecessary organizational cost), or have made assumptions about which capabilities they're inheriting from enterprise systems and are finding it difficult to conform and use data for their products.
- Lack of coordination (and agreement) on how to collaborate with other data domain teams: Several organizations have taken on the responsibility (without the central enterprise IT unit) of creating data domain masters for core areas such as customer, product, and other subject areas. In these organizations, each team has built its stores in isolation, with no broader blueprint for how all this data will "connect" across the differently designed master data stores. They are now trying to develop complex semantic models or virtualization engines to connect data stores that use different consolidation approaches and data design patterns, reflecting the pitfalls of bottom-up data modeling for what were traditionally enterprise responsibilities.
When an organization plans to leverage a data mesh strategy, there needs to be tight alignment on what the core enterprise should still be responsible for in an organization's data, and what the data domain teams should provision.
Data Mesh: Guiding Principles
Fortunately, there is a fairly simple set of guiding principles that will pragmatically ensure the enterprise can deliver a set of data capabilities and provide the services and guardrails needed for a properly managed data mesh environment for the data domain owners.
These design patterns work elegantly in the AWS data environment and follow commonly agreed upon data platform architectural patterns:
Guiding Principle #1: Data Management
An enterprise should still provision certain types of data for the domains. Fully independent data domain environments will inevitably create unnecessary duplication of data, redundant data management processes, and poor data quality. What should be enterprise-owned, and what should be domain-owned?
Typically, there are two use cases for enterprise data management:
- Data lake layer: The first use case is for raw data which is used for data science experimentation, downstream staging, and history. These data storage layers, often called the data lakes, are easily built into Amazon Simple Storage Service (Amazon S3) with highly efficient, cost-effective storage.
Enterprise teams tend to take a broader view of ingesting data into an enterprise data source and will build highly efficient teams that can provision data across the entire enterprise from internal and external sources, both real-time and batch, using AWS capabilities such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) and AWS Glue. It makes sense to have a centralized team provide data for the enterprise that all the data domains can use.
- Data warehouse (enterprise) layer: The days of attempting to load all of an enterprise's data into a data warehouse are long past. However, the need for an integrated, conformed, 360-degree view of core subject areas such as customer, product, and events is still as relevant as it was 15 years ago. In fact, in the digital age, with inbound and outbound digital events streaming through most modern data platforms, the need for that 360-degree master data view is even more important. These much thinner enterprise data stores are ideal for Amazon Redshift.
- Domain-specific consumption layers: Data domain owners should have their own data environments that leverage raw and enterprise data to build out BI-specific data marts in Amazon Redshift or PostgreSQL, data science sandboxes on AWS Deep Learning AMIs, and other domain-specific analytic applications the business unit needs to run its business.
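One way to keep the three layers above distinct in practice is a consistent Amazon S3 prefix convention, so raw, enterprise, and domain-owned data never blur together. The sketch below is one possible convention, assuming hypothetical layer and domain names; it is not an AWS requirement.

```python
# Illustrative S3 key convention for the three layers described above.
# Layer, domain, and dataset names are hypothetical.

def data_lake_key(layer: str, domain: str, dataset: str, partition_date: str) -> str:
    """Build an S3 object key prefix for a dataset in a given layer.

    layer: 'raw' (enterprise data lake), 'enterprise' (conformed 360 views),
           or 'domain' (domain-specific consumption marts).
    """
    layers = {"raw", "enterprise", "domain"}
    if layer not in layers:
        raise ValueError(f"layer must be one of {sorted(layers)}")
    return f"{layer}/{domain}/{dataset}/dt={partition_date}/"

# Raw clickstream data landed by the central ingestion team:
print(data_lake_key("raw", "marketing", "clickstream", "2024-01-15"))
# A marketing-owned BI mart built from it:
print(data_lake_key("domain", "marketing", "campaign_attribution", "2024-01-15"))
```

The `dt=` partition style follows the common Hive partitioning convention that AWS Glue and Amazon Athena recognize, which keeps the layers queryable without extra configuration.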
Guiding Principle #2: Data Governance
The enterprise should still provide automated data management processes to facilitate (not hinder) the development of domain-specific data products.
One of the most important lessons from the data lake era was data cataloging. Those organizations that spun up large Hadoop environments and simply “dumped” data into environments quickly realized that there could be tens of thousands of files in this environment with little knowledge of what was on those platforms; hence the term “data swamps.”
These data swamps were an expensive lesson for the need for data cataloging. This traditional data governance function should be provided as a service to data domains. This includes managing the technical metadata in AWS Glue.
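As a sketch of cataloging-as-a-service, a central governance team could register each domain dataset's technical metadata through the AWS Glue `create_table` API. The helper below only builds the `TableInput` payload so it can be shown standalone; the database, table, and bucket names are hypothetical.

```python
# Illustrative sketch: building the TableInput payload for the AWS Glue
# Data Catalog so each domain dataset is cataloged rather than "dumped".
# Table and bucket names are hypothetical.

def glue_table_input(table_name: str, s3_location: str, columns: list) -> dict:
    """Build a TableInput dict for glue.create_table() describing an
    external Parquet table stored on Amazon S3."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": columns,  # e.g. [{"Name": "member_id", "Type": "string"}]
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }

payload = glue_table_input(
    "member_claims",
    "s3://enterprise-lake/raw/claims/",  # hypothetical bucket
    [{"Name": "member_id", "Type": "string"}, {"Name": "claim_amount", "Type": "double"}],
)
# In a real environment this would be registered with:
#   boto3.client("glue").create_table(DatabaseName="healthcare_raw", TableInput=payload)
```

Registering every dataset this way is what keeps a lake on Amazon S3 discoverable instead of drifting into a data swamp.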
These guiding principles are best portrayed in the architectural diagram below, which divides a data environment between core, necessary, central data management and business-driven data innovation.
Figure 1 – Balanced data mesh environment.
AWS and IBM Joint Industry Solutions
AWS and IBM have worked together on building joint industry solutions using data mesh architecture that portray the value of this approach and accelerate business innovation.
A large healthcare company in North America was spending more time writing queries against a traditional data warehouse than developing analytics. IBM worked with this customer to reengineer their environment with a focus on providing data in multiple formats, allowing them to use it in more productive ways. This resulted in the development of 14 new predictive models for member care, improving the member experience, within the first eight weeks.
In another example, a pharmaceutical company whose central IT function had been paralyzed by ever-changing prioritization from different stakeholders eliminated its analytics backlog by adopting a data mesh mindset.
IBM built a data mesh solution on AWS by establishing core centralized services to support data governance and business analytics, and each of the core business domains (sales, marketing, R&D, production) established and owned their data domains to help accelerate innovation.
Following is a high-level design built on the data mesh pattern, separating consumers, producers, and central governance. A data domain may act as a data consumer, a data producer, or both.
Four core principles of data mesh design are illustrated in the figure below:
- Data domain ownership: A data mesh features data domains as nodes, which exist in data lake accounts. It is founded on decentralizing and distributing data responsibility to the people closest to the data. Data owners are responsible for their data products' reliability, availability, and accuracy.
- Federated computational governance: Federated data governance defines how data products are shared, delivering discoverability, metadata, and auditability based on decision-making and accountability structures led by a federation of data product owners.
- Data as a product (DaaP): This refers to leveraging domain-driven design techniques to formulate and establish a bounded context. A data producer contributes one or more data products to a central catalog in a data mesh account, and each data product must be autonomous, discoverable, secure, correct, and usable.
- Self-serve sharing: The platform streamlines the experience of data users to discover, access, and use data products. It also enables the experience of data providers to build, deploy, and maintain data products through the self-serve data infrastructure with open protocols.
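As a sketch of the self-serve sharing principle, a producer account can grant a consumer domain read access to a cataloged table through AWS Lake Formation. The helper below only builds the request parameters for the `grant_permissions` API; the account ID, database, and table names are hypothetical.

```python
def lf_grant(consumer_account_id: str, database: str, table: str,
             permissions=("SELECT", "DESCRIBE")) -> dict:
    """Build keyword arguments for lakeformation.grant_permissions(),
    sharing one cataloged table with a consumer domain's AWS account."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }

# Hypothetical example: the sales domain shares its orders product
grant = lf_grant("111122223333", "sales_domain", "orders_product")
# A producer would then call:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Centralizing these grants in AWS Lake Formation (surfaced through Amazon DataZone for discovery) is what lets consumers subscribe to data products without a ticket to the producing team.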
Figure 2 – Data mesh architectural design on AWS.
Below is a data mesh architecture pattern with data product sharing that employs both data lake and data warehouse techniques. The central organization helps enforce data governance and compliance policies using Amazon DataZone and AWS Lake Formation.
Business domains can each have their own data domain with a data lake on Amazon S3 and a data warehouse on Amazon Redshift, with AWS Glue providing the data catalog and running ETL workloads for data transformation. This approach supports data democratization and leverages the benefits of both the data lake and the data warehouse.
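A minimal sketch of the kind of conformance step a domain's ETL job might run between the S3 lake and the Redshift warehouse is shown below: normalizing raw records and keeping the latest row per key, a simple 360-style consolidation. Field names are hypothetical, and a real AWS Glue job would use DynamicFrames or Spark rather than plain Python lists.

```python
# Illustrative lake-to-warehouse conformance step; field names are hypothetical.

def conform_customers(raw_rows: list) -> list:
    """Normalize and de-duplicate raw customer rows, keeping the latest
    record per customer_id (a simple 360-style consolidation)."""
    latest = {}
    for row in raw_rows:
        cid = str(row["customer_id"]).strip()
        cleaned = {
            "customer_id": cid,
            "email": row.get("email", "").strip().lower(),
            "updated_at": row["updated_at"],
        }
        # Keep only the most recently updated record for each customer
        if cid not in latest or cleaned["updated_at"] > latest[cid]["updated_at"]:
            latest[cid] = cleaned
    return sorted(latest.values(), key=lambda r: r["customer_id"])
```

The conformed output would then be written to curated Parquet on Amazon S3 or loaded into Amazon Redshift for the domain's BI marts.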
Figure 3 – Data mesh architecture patterns on AWS.
It’s crucial to find the right balance of business ownership for your data, data mesh techniques, and architecture patterns on AWS to help you modernize data platforms, accelerate innovation, and unlock business value for your organization.
The key to a successful data mesh implementation is the right mix of enterprise data management implemented on AWS, resulting in robust, useful domain-based data products. Implementing a data mesh on AWS is made simpler by using managed and serverless services that provide a well-understood, performant, scalable, and cost-effective solution to integrate, prepare, and serve data.
To learn more about designing and building applications based on event-driven architecture, see the AWS Event-Driven Architecture page. To dive deeper into data mesh concepts, see the AWS blog post called Design a Data Mesh Architecture using AWS Lake Formation and AWS Glue.
To get familiar with end-to-end governance capabilities, refer to the Amazon DataZone page. To build an event-driven data mesh design, refer to the AWS blog post titled Use an Event-Driven Architecture to Build a Data Mesh on AWS.
If you wish to explore this topic further, please contact your representatives from IBM Consulting or AWS.
IBM – AWS Partner Spotlight
IBM Consulting is an AWS Premier Tier Services Partner that helps customers who use AWS to harness the power of innovation and drive their business transformation.