AWS Big Data Blog

Introducing data products in Amazon DataZone: Simplify discovery and subscription with business use case based grouping

We are excited to announce a new feature in Amazon DataZone that allows data producers to group data assets into well-defined, self-contained packages (data products) tailored for specific business use cases. For example, a marketing analysis data product can bundle various data assets such as marketing campaign data, pipeline data, and customer data. This simplifies the process for data consumers to find datasets, understand their context through shared metadata, and access comprehensive datasets for specific use cases through a single workflow. With the grouping capabilities of data products, data producers can manage and control access to the underlying data assets with just a few steps.

Customers often face challenges in locating and accessing the fragmented data they need, expending time and resources in the process. With Amazon DataZone, they can use data products to enhance data cataloging and subscription processes, aligning these more closely with business objectives while eliminating redundancy in handling individual assets.

In this post, we highlight the key benefits of data products, outline their essential features and workflows, and demonstrate how customers can use these features for easier publishing, discovery, and subscription.

Key benefits of data products

Customers use Amazon DataZone to create data meshes and adopt a culture that emphasizes data as a product. Amazon DataZone facilitates the publication of data assets from diverse sources that are enriched with their business context. It is crucial to organize assets into cohesive units with relational context to maximize the potential of data as a product and drive business use cases.

Amazon DataZone now offers the capability to group data assets with shared metadata into cohesive, business use case based data products, enhancing both the publishing and subscription processes. Data products provide three core benefits that help customers address their business challenges:

  • Simplified discovery – Data consumers can quickly identify interconnected data assets by searching for and finding them as a single unit. This reduces the time and effort required to find all relevant information and lowers the risk of missing important data.
  • Unified access model – Data products simplify access to data with a single request by implementing a unified access model. This eliminates the need for multiple permissions, speeding up the initiation of data analysis.
  • Reduced administrative overhead – By cataloging assets as data product units, data producers reduce administrative overhead by enabling metadata and access control management at the product level rather than individually. This makes access governance and data utilization more efficient, ensuring alignment with business goals and easy accessibility for its intended use. Data governance teams can monitor consumption rates for these data products, providing valuable insights into data literacy maturity.

For example, one of our customers, Natera, uses Amazon DataZone to create tailored datasets for their specific needs. Mirko Buholzer, VP of software engineering at Natera, says

“At Natera, our mission to revolutionize precision medicine depends on managing and leveraging our vast clinical and genomic data. With the Amazon DataZone data products feature, we can create tailored datasets for specific uses like reproductive health, oncology, or organ transplantation. This streamlines data discovery and access for our researchers and data scientists, enabling quick analysis of relevant data. Additionally, it will help physicians and patients gain deeper insights in combination with our clinical tests, ultimately improving patient outcomes.”

With data products, Amazon DataZone now supports business use case based grouping, enhancing data publishing, discovery, and subscription. This feature enables the following capabilities, as shown in the following image:

  • Data product creation and publishing – Producers can create data products by selecting assets from their project’s inventory, setting up shared metadata, and publishing these products to make them discoverable to consumers.
  • Data discovery and subscription – Consumers can search for and subscribe to data product units. Subscription requests are sent within a single workflow to producers for approval. Subscription approval processes, such as approve, reject, and revoke, ensure that access is managed securely. Once approved, access grants for the individual assets within the data product are automatically managed by the system.
  • Data product lifecycle management – Producers have control over the lifecycle of data products, including the ability to edit them and remove them from the catalog. When a producer edits product metadata or adds or removes assets from a data product, they republish it as a new version, and subscriptions are updated without any reapproval.

Solution overview

To demonstrate these capabilities and workflows, consider a use case where a product marketing team wants to drive a campaign on product adoption. To be successful, they need access to sales data, customer data, and review data of similar products. The sales data engineer, acting as the data producer, owns this data and understands the common requests from customers to access these different data assets for sales-related analysis. The data producer’s objective is to group these assets so consumers, such as the product marketing team, can find them together and seamlessly subscribe to perform analysis.

The following high-level implementation steps show how to achieve this use case with data products in Amazon DataZone and are detailed in the following sections.

  1. Data publisher creates and publishes data product
    1. Create data product – The data publisher (the project contributor for the producing project) provides a name and description and adds assets to the data product.
    2. Curate data product – The data publisher adds a readme, glossaries, and metadata forms to the data product.
    3. Publish data product – The data publisher publishes the data product to make it discoverable to consumers.
  2. Data consumer discovers and subscribes to data product
    1. Search data product – The data consumer (the project member of the consuming project) looks for the desired data product in the catalog.
    2. Request subscription – The data consumer submits a request to access the data product.
    3. Data owner approves subscription request – The data owner reviews and approves the subscription request.
    4. Review access approval and grant – The system manages access grants for the underlying assets.
    5. Query subscribed data – The data consumer receives approval and can now access and query the data assets within the subscribed data product.
  3. Data owner maintains lifecycle of data product
    1. Revise data product – The data owner (the project owner for the producing project) updates the data product as needed.
    2. Unpublish data product – The data owner removes the data product from the catalog if necessary.
    3. Delete data product – The data owner permanently deletes the data product if it is no longer needed.
    4. Revoke subscription – The data owner manages subscriptions and revokes access if required.

Prerequisites

To follow along with this post, ensure the publisher of the product sales data asset has ingested individual data assets into Amazon DataZone. In our use case, a data engineer in sales owns the following AWS Glue tables: customers, order_items, orders, products, reviews, and shipments. The data engineer has added a data source to bring these six data assets into the sales producer project inventory, ingesting the metadata in Amazon DataZone. For instructions on ingesting metadata for AWS Glue tables, refer to Create and run an Amazon DataZone data source for the AWS Glue Data Catalog. For Amazon Redshift, see Create and run an Amazon DataZone data source for Amazon Redshift.

On the producer side, a sales product project has been created with a data lake environment. A data source was created to ingest the technical metadata from the AWS Glue salesdb database, which contains the six AWS Glue tables mentioned previously. On the consumer side, a marketing consumer project with a data lake environment has been established.

Data publisher creates and publishes data product

Sign in to Amazon DataZone data portal as a data publisher in the sales producer project. You can now create a data product to group inventory assets relevant to the sales analysis use case. Use the following steps to create and publish a data product, as shown in the following screenshot.

  1. Select DATA in the top ribbon of the Sales Product Project
  2. Select Inventory data in the navigation pane
  3. Choose DATA PRODUCTS to create a data product

Create data product

Follow these steps to create a data product:

  1. Choose Create new data product. Under Details, in the name field, enter “Sales Data Product.” In the description, enter “A data product containing the following 6 assets: Product, Shipments, Order Items, Orders, Customers, and Reviews,” as shown in the following screenshot.
  2. Select Choose assets to add the data assets. Select CHOOSE on the right side next to each of the six data products. Be sure to go to the second page to select the sixth asset. After all are selected, choose the blue CHOOSE button at the bottom of the page, as shown in the following screenshot. Then choose Create to create the data product.

Curate data product

You can curate the sales data product by adding a readme, glossary term, and metadata forms to provide business context to the data product, as shown in the following screenshot.

  1. Choose Add terms under GLOSSARY TERMS. Select a glossary term that you have added to your glossary, for example, Sales. Refer to Create, edit, or delete a business glossary for how to create a business glossary.
  2. Choose Add metadata form to add a form such as a business owner. Refer to Create, edit, or delete metadata forms for how to create a metadata form. In this example, we added Ownership as a metadata form.

Publish data product

Follow these steps to publish a data product.

  1. Once all the necessary business metadata has been added, choose Publish to publish the data product to the business catalog, as shown in the following screenshot.
  2. In the pop-up, choose Publish data product.

The six data assets in the data product will also be published but will only be discoverable through the data product unless published individually. Consumers cannot subscribe to the individual data assets unless they are published and made discoverable in the catalog separately.

Data consumer discovers and subscribes to data product

Now, as the marketing user, inside of the marketing project, you can find and subscribe to the sales data product.

Search data product

Sign in to the Amazon DataZone data portal as a marketing user in the marketing consumer project. In the search bar, enter “sales” or any other metadata that you added to the sales data product.

Once you find the appropriate data product, select it. You can view the metadata added and see which data assets are included in the data product by selecting the DATA ASSETS tab, as shown in the following screenshot.

Request subscription

Choose Subscribe to bring up the Subscribe to Sales Data Product modal. Make sure the project is your consumer project, for example, Marketing Consumer Project. In Reason for request, enter “Running a marketing campaign for the latest sales play.” Choose SUBSCRIBE.

The request will be routed to the sales producer project for approval.

Data owner approves subscription request

Sign in to Amazon DataZone as the project owner for the sales producer project to approve the request. You will see an alert in the task notification bar. Choose the notification icon on the top right to see the notifications, then choose Subscription Request Created, as shown in the following screenshot.

You can also view incoming subscription requests by choosing DATA in the blue ribbon at the top. Then choose Incoming requests in the navigation pane, REQUESTED under Incoming requests, and then View request, as shown in the following screenshot.

On the Subscription request pop-up, you will see who requested access to the Sales Data Product, from which project, the requested date and time, and their reason for requesting it. You can enter a Decision comment and then choose APPROVE.

Review access approval and grant

The marketing consumer is now approved to access the six assets included in the sales data product. Sign in to Amazon DataZone as a marketing user in the marketing consumer project. A new event will appear, showing that the SUBSCRIPTION REQUEST APPROVED has been completed.

You can view this in two different ways. Choose the notification icon on the top right and then EVENTS under Notifications, as shown in the first following screenshot. Alternatively, select DATA in the blue ribbon bar, then Subscribed data, and then Data products, as shown in the second following screenshot.

Choose the Sales Data Product and then Data assets. Amazon DataZone will automatically add the six data assets to the AWS Glue tables that the marketing consumer can use. Wait until you see that all six assets have been added to one environment, as shown in the following screenshot, before proceeding.

Query subscribed data

Once you complete the previous step, return to the main page of the marketing consumer project by choosing Marketing Consumer Project in the top left pull-down project selector, then choose OVERVIEW. The data can now be consumed through the Amazon Athena deep link on the right side. Choose Query data to open Athena, as shown in the following screenshot. In the Open Amazon Athena window, choose Open Amazon Athena.

A new window will open where the marketing consumer has been federated into the role that Amazon DataZone uses for granting permissions to the marketing consumer project data lake environment. The workgroup defaults to the appropriate workgroup that Amazon DataZone manages. Make sure that the Database under Data is the sub_db for the marketing consumer data lake environment. There will be six tables listed that correspond to the original six data assets added to the sales data product. Run your query. In this case, we used a query that looked for the top five best-selling products, as shown in the following code snippet and screenshot.

SELECT p.product_name, SUM(oi.quantity) AS total_quantity FROM order_items oi JOIN products p ON oi.product_id = p.product_idGROUP BY p.product_nameORDER BY total_quantity DESC 
LIMIT 5;

Data owner maintains lifecycle of data product

Follow these steps to maintain the lifecycle of the data product.

Revise data product

The data owner updates the data product, which includes editing metadata and adding or removing assets as needed. For detailed instructions, refer to Republish data products.

The sales data engineer has been tasked with removing one of the assets, the reviews table, from the sales data product.

  1. Open the SALES PRODUCER PROJECT by selecting it from the top project selector.
  2. Select DATA in the top ribbon.
  3. Select Published data in the navigation pane.
  4. Choose DATA PRODUCTS on the right side.
  5. Choose Sales Data Product.

The following screenshot shows these steps.

Once in the data product, the data engineer can add and remove metadata or assets. In To change any of the assets in the data product, follow these steps, as shown in the following screenshot.

  1. Select ASSETS in Sales Data Product.
  2. Select any of the assets. For this example, we remove the Reviews
  3. Select the three dots on the right side.
  4. Select Remove asset.
  5. A pop-up will appear confirming that you want to remove the asset. Choose Remove. The Reviews asset will now have a status of Removing asset: This asset is still available to subscribers.
  6. Republish the data product to remove access to this asset from all subscribers. Choose REPUBLISH and REPUBLISH DATA PRODUCT in the pop-up.
  7. To confirm the asset has been removed, sign in to the marketing project as the consumer. Open the Amazon Athena deep link on the OVERVIEW After selecting the sub_db associated with the marketing consumer data lake environment, only five tables are visible because the Reviews table was removed from the data product, as shown in the following screenshot.

The consumer doesn’t have to take any action after a data product has been republished. If the data engineer had changed any of the business metadata, such as by adding a metadata form, updating the readme, or adding glossary terms and republishing, the consumer would see those changes reflected when viewing the data product under the subscribed data.

Unpublish data product

The data owner removes the data product from the catalog, making it no longer discoverable to the organization. You can choose to retain existing subscription access for the underlying assets. For detailed instructions, refer to refer to Unpublish data product.

Delete data product

The data owner permanently deletes the data product if it is no longer needed. Before deletion, you need to revoke all subscriptions. This action will not delete the underlying data assets. For detailed instructions, refer to Delete Data Product.

Revoke subscription

The data owner manages subscriptions and may revoke a subscription after it has been approved. For detailed instructions, refer to Revoke subscription.

Cleanup

To ensure no additional charges are incurred after testing, be sure to delete the Amazon DataZone domain. Refer to Delete domains for the process.

Conclusion

Data products are crucial for improving decision-making accuracy and speed in modern businesses. Beyond making raw data available, they offer strategic packaging, curation, and discoverability. Data products help customers address the difficulty of locating and accessing fragmented data, which reduces the time and resources needed to perform this important task.

Amazon DataZone already facilitates data cataloging from various sources. Building on this capability, this new feature streamlines data utilization by bundling data into purpose-built data products aligned with business goals. As a result, customers can unlock the full potential of their data.

The feature is supported in all the AWS commercial Regions where Amazon DataZone is currently available. To get started, check out the Working with data products.


About the authors

Jason Hines is a Senior Solutions Architect, at AWS, specializing in serving global customers in the Healthcare and Life Sciences industries. With over 25 years of experience, he has worked with numerous Fortune 100 companies across multiple verticals, bringing a wealth of knowledge and expertise to his role. Outside of work, Jason has a passion for an active lifestyle. He enjoys various outdoor activities such as hiking, scuba diving, and exploring nature. Maintaining a healthy work-life balance is essential to him.

Ramesh H Singh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon DataZone team. He is passionate about building high-performance ML/AI and analytics products that enable enterprise customers to achieve their critical goals using cutting-edge technology. Connect with him on LinkedIn.

Leonardo Gomez is a Principal Analytics Specialist Solutions Architect at AWS. He has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.