AWS Marketplace

Integrating third-party data into your data mesh

Data-driven organizations are increasingly using third-party data to gain insights into their competitive landscape, track changing consumer behaviors, develop real-time responses to market dynamics, and optimize their operations with sustainability in mind. The concept of a data mesh that promotes data product thinking and sharing data across domains has resonated across industries. This plays a significant role in the success of data-driven organizations.

In this blog post, I will show how data-driven organizations can easily integrate third-party data products procured via AWS Data Exchange into their data mesh using the new AWS Data Exchange for AWS Lake Formation feature.

AWS Data Exchange data delivery methods

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data from a wide range of providers. You can seamlessly ingest third-party data available in Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Lake Formation, and APIs from over 250 providers. You can quickly analyze it with a wide variety of AWS analytics and machine learning (ML) services.

The following diagram shows the different delivery mechanisms available with AWS Data Exchange. It shows how subscribers can ingest the dataset as Amazon S3 files, query Amazon Redshift tables as datashares, or call APIs.

Delivery mechanisms available with AWS Data Exchange

Solution Overview: Accessing third-party data through AWS Lake Formation

Now, with the new AWS Data Exchange for AWS Lake Formation feature, subscribers can access the Lake Formation objects shared directly from the provider’s data lake. As a subscriber, you no longer have to build extract-transform-load (ETL) or extract-load-transform (ELT) pipelines to move the data. This saves valuable engineering effort, especially during early data testing phases. As soon as the subscription is approved, data lake administrators can give data consumers fine-grained access to these tables. Business analysts or data scientists can then use analytics services such as Amazon Athena, AWS Glue DataBrew, Amazon QuickSight, or Amazon SageMaker to query, analyze, and visualize data or run machine learning algorithms.
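To make the fine-grained access step more concrete, here is a minimal sketch in Python of how a data lake administrator might grant column-level SELECT on a shared table. It assumes boto3's Lake Formation client and its grant_permissions call. The helper names (build_column_grant, grant_column_access) and any role, database, table, or column names you pass in are illustrative and not part of AWS Data Exchange.

```python
from typing import Any, Dict, List


def build_column_grant(
    principal_arn: str,
    database: str,
    table: str,
    columns: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for Lake Formation's GrantPermissions call.

    The grant covers SELECT on only the listed columns of one table.
    """
    if not columns:
        raise ValueError("at least one column is required for a column-level grant")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_column_access(lakeformation_client: Any, **grant_args: Any) -> None:
    """Send the grant through a boto3 Lake Formation client (or a stand-in)."""
    lakeformation_client.grant_permissions(**build_column_grant(**grant_args))
```

In a real account you would pass boto3.client("lakeformation") as the first argument. Depending on how the shared table appears in your catalog, the request may need additional fields, such as a catalog ID, so treat this as a starting point rather than a complete recipe.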

The following diagram illustrates how AWS Data Exchange for AWS Lake Formation brings first-party and third-party data to the same data catalog without the need for any ETL/ELT processes.

  • It shows first-party data from different sources, including devices, web applications, sensors, social media platforms, and databases being ingested to Amazon S3 in the subscriber account.
  • The first-party data is then registered in the AWS Lake Formation catalog.
  • From the third-party data provider account, the data products are published to AWS Data Exchange for AWS Lake Formation.
  • Subscribers consuming the data product can see the third-party data in their AWS Lake Formation catalog along with the first-party data.
  • AWS analytics and AI/ML services, including Amazon Athena, AWS Glue DataBrew, Amazon QuickSight, and Amazon SageMaker, can then consume the AWS Lake Formation tables directly or through Amazon Athena tables. Refer to the following diagram.

AWS Data Exchange for AWS Lake Formation brings first-party and third-party data to the same data catalog
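Because first-party and third-party tables sit in the same catalog, a single Amazon Athena query can join them. The following Python sketch shows one way this could look. It assumes boto3's Athena client and its start_query_execution call. The helper names and every database, table, and column name you pass in are hypothetical.

```python
import re
from typing import Any

# Accept only simple identifiers so names can be safely embedded in SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quoted(name: str) -> str:
    """Validate an identifier and wrap it in double quotes for Athena SQL."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unexpected identifier: {name!r}")
    return f'"{name}"'


def build_match_count_query(
    first_db: str, first_table: str,
    third_db: str, third_table: str,
    join_key: str,
) -> str:
    """Count rows where a first-party table matches a third-party table."""
    return (
        "SELECT COUNT(*) AS matched_rows "
        f"FROM {quoted(first_db)}.{quoted(first_table)} AS fp "
        f"JOIN {quoted(third_db)}.{quoted(third_table)} AS tp "
        f"ON fp.{quoted(join_key)} = tp.{quoted(join_key)}"
    )


def start_query(athena_client: Any, query: str, output_location: str) -> str:
    """Submit the query and return the Athena query execution ID."""
    response = athena_client.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The query runs asynchronously, so a caller would still need to poll for completion and fetch results. That part is omitted here to keep the sketch short.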

Conceptual Workflow: Building a data mesh on AWS

A data mesh is an approach to sourcing, managing, and accessing analytical data. It is developed on the idea of decentralizing data by pushing the data storage and management responsibilities into different domains, such as business units. Taking this approach ensures stakeholders who best understand the data are responsible for managing it. These stakeholders can then share the data product with other interested parties through a marketplace with a federated governance structure to govern it from a central location. This approach has particularly resonated with large organizations with disparate data houses across the business.

A data product is a central concept to the data mesh. Data products are broad, cohesive collections of related data aligned to specific business cases or goals intended to drive a well-defined, quantified business outcome. A producer domain is responsible for creating a data product, maintaining it, upgrading it, and registering it in the central catalog. The data mesh aims to enable data consumers to self-serve, that is, discover, learn, and consume a mesh of data products, in a secure way that is compliant with the mesh policies. Zhamak Dehghani calls this the “Mesh Experience Plane” in her book Data Mesh: Delivering Data-Driven Value at Scale (Part III, Chapter 9, “The Logical Architecture”).

The following diagram illustrates the conceptual data mesh workflow.

  1. Data products created by producer domains are registered with a catalog in the central account.
  2. The central catalog is also shared with the producer domain for local governance.
  3. Data consumers find and request access to a data product through the mesh experience plane.
  4. The request is sent to the data product manager in the producer domain that owns the product.
  5. The data product manager verifies the request.
  6. Once the request is approved, the data product is shared with the consumer domain from the central catalog with relevant permissions.

Steps 4, 5, and 6 can be automated for common requests using a workflow management tool. Manual intervention might be required only in certain cases, for example, requests for sensitive data or custom access requirements.

  7. Data consumers can access the data and start querying it using analytics services. Refer to the following diagram.

Conceptual data mesh workflow
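The split between automated and manual approval in steps 4 through 6 can be sketched as a small routing function. This is an illustration only, written in plain Python with no AWS dependency. The AccessRequest fields and the manual_review callback are assumptions made for the example, not part of any AWS service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class AccessRequest:
    consumer_domain: str
    data_product: str
    sensitive: bool = False            # the product holds sensitive data
    custom_requirements: bool = False  # the consumer asked for non-standard access


def needs_manual_review(request: AccessRequest) -> bool:
    """Common requests are automated; sensitive or custom ones are not."""
    return request.sensitive or request.custom_requirements


def handle_request(
    request: AccessRequest,
    manual_review: Callable[[AccessRequest], bool],
) -> Decision:
    """Route a request to automatic approval or to the data product manager."""
    if needs_manual_review(request):
        approved = manual_review(request)  # the data product manager decides
    else:
        approved = True  # common request: approve without intervention
    return Decision.APPROVED if approved else Decision.REJECTED
```

In practice the approval branch would go on to share the data product from the central catalog, and the routing rules would live in whichever workflow management tool your organization uses.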

For a deeper dive into the concept of a data mesh on AWS, read Design a data mesh architecture using AWS Lake Formation and AWS Glue. You can also read Build a data sharing workflow with AWS Lake Formation for your data mesh to understand how the end-to-end workflow could work in a data mesh on AWS using AWS Lake Formation.

Integrating third-party data into your data mesh on AWS

Organizations operating in a data mesh or building a data mesh on AWS must share data products between different AWS accounts. In such cases, subscribing to third-party data from multiple accounts or domains within the organization might pose governance challenges in the long run. If several domains subscribe to different third-party data products from separate accounts within the organization, it might be difficult to comply with global policies and standards on third-party data usage, and the organization also risks duplicating licenses and costs. Moreover, in a data mesh, it becomes necessary to subscribe and share from the central account to comply with the federated data governance principle.

The mesh experience plane can be used to facilitate the search and discovery of third-party data products through the AWS Marketplace Discovery APIs.

In the following diagrams, I show one possible configuration for incorporating third-party data into your data mesh.

Producer domain accounts

  1. Producer domains register their data products in the central AWS Lake Formation catalog. Any changes to the data products in the producer domain are first reflected in the central account. The data is shared with other accounts using resource share links provided by AWS Resource Access Manager (AWS RAM).
  2. The Lake Formation catalog in the central account is shared back with the producer domain for the product owners to manage and access their own data.
  3. AWS Amplify hosts the website that provides the search and discovery experience for consumers looking for data products across the organization. It queries the catalog for first-party data products. Refer to the following diagram.

Producer domains register first-party data products
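As a rough illustration of how the central account might share a catalog table with a domain account, the following Python sketch builds a cross-account Lake Formation grant. It assumes boto3's Lake Formation client and its grant_permissions call. The helper names, the default permission list, and the choice to include the grant option are assumptions for this example and should be adapted to your governance model.

```python
import re
from typing import Any, Dict, List, Optional


def build_cross_account_grant(
    central_account_id: str,
    target_account_id: str,
    database: str,
    table: str,
    permissions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build GrantPermissions arguments that share a table with another account."""
    for account_id in (central_account_id, target_account_id):
        if not re.fullmatch(r"\d{12}", account_id):
            raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    granted = list(permissions) if permissions else ["SELECT", "DESCRIBE"]
    return {
        # For a cross-account grant, the principal is the target account ID.
        "Principal": {"DataLakePrincipalIdentifier": target_account_id},
        "Resource": {
            "Table": {
                "CatalogId": central_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": granted,
        # Assumption: the domain's administrator should be able to grant
        # access onward to principals inside the domain account.
        "PermissionsWithGrantOption": granted,
    }


def share_table(lakeformation_client: Any, **grant_args: Any) -> None:
    """Send the grant through a boto3 Lake Formation client (or a stand-in)."""
    lakeformation_client.grant_permissions(**build_cross_account_grant(**grant_args))
```

The same request shape would apply whether the target is a producer domain (step 2) or a consumer domain. Only the account ID and the permission list change.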

Third-party provider accounts

  1. Data providers create datasets using Lake Formation tag-based permissions.
  2. They publish these datasets as products on AWS Data Exchange. Find more information and step-by-step guidance on providing data products through AWS Data Exchange. Refer to the following diagram.

Third-party data providers register locations in AWS Lake Formation and publish products on AWS Data Exchange
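On the provider side, tag-based permissions start with defining a Lake Formation tag and attaching it to the tables that make up the dataset. The Python sketch below shows the two requests involved. It assumes boto3's Lake Formation client and its create_lf_tag and add_lf_tags_to_resource calls. The helper names and any tag key, tag value, database, or table you pass in are hypothetical.

```python
from typing import Any, Dict, Tuple


def build_lf_tag_requests(
    database: str,
    table: str,
    tag_key: str,
    tag_value: str,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the arguments for creating a tag and attaching it to a table."""
    create_tag = {"TagKey": tag_key, "TagValues": [tag_value]}
    attach_tag = {
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "LFTags": [{"TagKey": tag_key, "TagValues": [tag_value]}],
    }
    return create_tag, attach_tag


def tag_table(lakeformation_client: Any, **tag_args: str) -> None:
    """Create the tag, then attach it to the table."""
    create_tag, attach_tag = build_lf_tag_requests(**tag_args)
    # Note: this sketch does not handle the case where the tag already exists.
    lakeformation_client.create_lf_tag(**create_tag)
    lakeformation_client.add_lf_tags_to_resource(**attach_tag)
```

Publishing the tagged dataset as a product on AWS Data Exchange is a separate step that is not covered by this sketch.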

Consumer domain accounts

  1. Data consumers from different domains search for data products on a web portal designed to provide search and discoverability capabilities. A consumer persona finds and requests access to a third-party data product on this portal. Refer to the following diagram.

The AWS Amplify website facilitates search and discovery of data products for data consumers in a data mesh

Central account

  1. The web application is hosted in AWS Amplify within the central account. It provides the mesh experience presenting a catalog of first-party, second-party, and third-party data products that consumers can search and request access to.
  2. When a user chooses a third-party data product, the web application uses the AWS Marketplace Discovery APIs to query AWS Data Exchange and returns results based on the search. For further reference, read how Domo implemented such an integration. You can also follow the publicly available labs on Discovery APIs for AWS Data Exchange to set this up.
  3. Access is granted by the AWS Data Exchange service automatically unless the provider enables subscription verification to verify the subscriber’s identity.
  4. An administrator in the central account uses AWS License Manager to share the license for the data product with the other accounts in the organization. Refer to the following diagram.

Web application hosted in AWS Amplify uses AWS Marketplace Discovery APIs to fetch data products on AWS Data Exchange.
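The license-sharing step can be sketched with AWS License Manager's grant mechanism. The Python example below assumes boto3's License Manager client and its create_grant call. The helper names are hypothetical, and the home Region and allowed operations are deliberately left as required arguments because the right values depend on the license being shared. Confirm them against the License Manager documentation for your subscription.

```python
import uuid
from typing import Any, Dict, List


def build_license_grant(
    license_arn: str,
    principal_arn: str,
    grant_name: str,
    home_region: str,
    allowed_operations: List[str],
) -> Dict[str, Any]:
    """Build CreateGrant arguments that share one license with one principal."""
    if not allowed_operations:
        raise ValueError("at least one allowed operation is required")
    return {
        "ClientToken": str(uuid.uuid4()),  # idempotency token for the request
        "GrantName": grant_name,
        "LicenseArn": license_arn,
        "Principals": [principal_arn],
        "HomeRegion": home_region,
        "AllowedOperations": list(allowed_operations),
    }


def share_license(license_manager_client: Any, **grant_args: Any) -> str:
    """Create the grant and return its ARN."""
    response = license_manager_client.create_grant(**build_license_grant(**grant_args))
    return response["GrantArn"]
```

The recipient typically has to accept and activate the grant before it takes effect. That handshake is outside the scope of this sketch.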

Consumer domain accounts

  1. Third-party data delivered as AWS Lake Formation tables is cataloged along with first-party data in the consumer account.
  2. The Lake Formation catalog in the central account is shared with the consumer domain, and first-party data is now available to the consumer. The data is shared with other accounts using resource share links provided by AWS RAM.
  3. Data consumers can now access the data using analytics and ML services like Amazon Athena, AWS Glue DataBrew, Amazon QuickSight, or Amazon SageMaker to build solutions. Refer to the following diagram.

Consumer domain accounts in a data mesh use various AWS analytics and machine learning services to query data through the AWS Lake Formation catalog.
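One common way for a consumer account to reference a table shared from another account is a resource link in its local AWS Glue Data Catalog. The Python sketch below builds such a link. It assumes boto3's Glue client and its create_table call with a TargetTable entry. The helper names and all database, table, and account values you pass in are illustrative.

```python
from typing import Any, Dict


def build_resource_link(
    local_database: str,
    link_name: str,
    owner_account_id: str,
    shared_database: str,
    shared_table: str,
) -> Dict[str, Any]:
    """Build CreateTable arguments for a resource link to a shared table."""
    return {
        "DatabaseName": local_database,  # database in the consumer account
        "TableInput": {
            "Name": link_name,
            # TargetTable points at the shared table in the owning account.
            "TargetTable": {
                "CatalogId": owner_account_id,
                "DatabaseName": shared_database,
                "Name": shared_table,
            },
        },
    }


def create_resource_link(glue_client: Any, **link_args: str) -> None:
    """Create the resource link through a boto3 Glue client (or a stand-in)."""
    glue_client.create_table(**build_resource_link(**link_args))
```

Once the link exists, analysts can query it by its local name in services such as Amazon Athena, subject to the Lake Formation permissions granted to them.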

A consumer domain that accesses third-party data can also create its own data products (playing the role of a producer) and share them with other consumer domains through the central account.

Conclusion

In this blog post, I showed how you can integrate third-party data easily into the data mesh using AWS Data Exchange and AWS Lake Formation.

Data-driven organizations are transitioning to a data product thinking mindset and are adopting approaches like the data mesh to drive value at scale. The domains responsible for building data products use third-party data to enrich the first-party data to drive business decisions. For example, you can use specific third-party data products to derive location intelligence for better positioning of your products or services. With AWS Data Exchange for AWS Lake Formation, it becomes easier for you to integrate third-party data products in your data mesh.

About the author

Sandipan Bhaumik (Sandi) is a Senior Analytics Specialist Solutions Architect based in London. He has worked with customers in industries such as Banking & Financial Services, Healthcare, Power & Utilities, Manufacturing, and Retail, helping them solve complex challenges with large-scale data platforms. At AWS, he focuses on strategic accounts in the UK and Ireland and helps customers accelerate their journey to the cloud and innovate using AWS analytics and machine learning services. He loves playing badminton and reading books.