Overview
DataHub is an AI & Data Context Platform adopted by over 3,000 enterprises including Apple, CVS Health, Netflix, and Visa. Innovated jointly with a thriving open-source community of 13,000+ members, DataHub's metadata graph provides in-depth context of AI and data assets with best-in-class scalability and extensibility. The company's enterprise SaaS offering, DataHub Cloud, delivers a fully-managed solution with AI-powered discovery, observability, and governance capabilities. Organizations rely on DataHub solutions to accelerate time-to-value from their data investments, ensure AI system reliability, and implement unified governance - enabling AI & data to work together and bring order to data chaos.
For Data Analysts, developers, data scientists, and automated workflows:
Easily find trusted datasets with the most current data
- Access data where you work with a Chrome extension for BI tools
- Discover data your way - personalization for multiple business and technical user profiles
- Support AI models and automations with a metadata graph that keeps up with today's data volume and velocity
- Understand data provenance with table, column, and job level lineage graphs
- Auto-enrich metadata with no-code automation
- Use AI-generated documentation and propagation to better understand context
- Always stay up-to-date with subscriptions to assets, activity and notifications
For Data Engineers:
Deliver reliable data quality
- Provide end-to-end observability with user-created data quality checks and reports
- Surface data quality results and impact analysis across all points in lineage
- Monitor freshness SLAs, data volume, table schemas, column quality, and custom SQL
- Use AI Anomaly Detection for freshness, volume, and column stats
- Easily keep an eye on data quality with assertions and AI-based smart assertions
- Evaluate data contracts and quality checks on-demand with API
- Get notified where you work (Slack, email, and more)
- Easily manage data quality with a data health dashboard
For Data Governance:
Ensure continuous AI & data governance in production rather than episodic compliance checks
- Ensure every AI & data asset is accounted for by defining and enforcing documentation standards
- Integrate governance practices early with automated shift-left governance
- Automatically classify your data as it moves and transforms with lineage-driven compliance
- Keep tags harmonized with seamless metadata flow between DataHub and source systems
- Deliver continuous compliance monitoring with forms, impact analysis, and reporting
- Create and implement bespoke compliance approval workflows
Highlights
- Search All Corners of Your Data Stack: DataHub's unified search experience surfaces results across databases, data lakes, BI platforms, ML feature stores, orchestration tools, and more.
- Trace End-to-End Lineage: Quickly understand the end-to-end journey of data by tracing lineage across platforms, datasets, ETL/ELT pipelines, charts, dashboards, and beyond.
- View Metadata 360 at a Glance: Combine technical, operational, and business metadata to provide a 360-degree view of your data entities. Generate dataset stats to understand the shape and distribution of the data.
Details
Pricing
| Dimension | Description | Cost/12 months |
|---|---|---|
| Discover & Govern | Up to 20 Monthly Active Users | $75,000.00 |
Vendor refund policy
All fees are non-cancellable and non-refundable except as required by law.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Software as a Service (SaaS)
SaaS delivers cloud-based software applications directly to customers over the internet. You can access these applications through a subscription model. You will pay recurring monthly usage fees through your AWS bill, while AWS handles deployment and infrastructure management, ensuring scalability, reliability, and seamless integration with other AWS services.
Resources
Vendor resources
Support
Vendor support
Email support is offered Monday - Friday during regular business hours.
marketplace@datahub.com
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.

Standard contract
Customer reviews
Centralized lineage and catalog have transformed how we track incidents and classify sensitive data
What is our primary use case?
My main use case for Data Hub is to catalog the datasets across my company and to trace the lineage of data through my company's pipelines.
To give an example of how I use Data Hub in my day-to-day work, suppose data is flowing from a source to Kafka and then on to several data stores. If another team wants to use the data but there is a problem at the Kafka level, we are not sure who is consuming that data. Data Hub is very useful for us in this scenario. It can generate the lineage from source to destination, so when there is an issue on the Kafka side, we know which downstream results and data sources are impacted.
I would add that sometimes we do not want to share certain data, or a customer or another team wants to consume data and we are not sure what kind of data is there, so we have to look at the schema. Because we catalog all the datasets across my company in Data Hub, teams can later look up the table and schema information and identify which data is PII and which is not.
What is most valuable?
The best features Data Hub offers are cataloging and lineage, both of which it supports very well, along with connectors for the different systems across my company's dataset pipelines. Apart from that, the GraphQL APIs provided by Data Hub are very good, allowing us to get all the information we need programmatically whenever we need it.
Regarding how the GraphQL APIs help my team in day-to-day tasks, we sometimes use custom logic to check whether a dataset contains PII. We have an AI model running on top of it, which requires classification. Based on the dataset URL, we get information about the dataset using the GraphQL APIs. The GraphQL APIs are very handy, allowing us to customize properties and pass on the necessary information. For example, if we need structured properties, we can get them. If we need tags or owners, we can retrieve those as well.
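As a rough illustration of the kind of lookup described above, the sketch below builds (but does not send) a GraphQL request for a dataset's tags and owners. It is an assumption-laden example, not the reviewer's code: the helper name `build_dataset_request`, the host, the token, and the example dataset identifier are all hypothetical, and the query fields and the `/api/graphql` path should be checked against the GraphQL schema of your own deployment.

```python
import json

# Assumed query shape: a dataset looked up by identifier, returning its tags
# and owners. Verify the field names against your deployment's schema.
DATASET_QUERY = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    urn
    tags { tags { tag { urn } } }
    ownership {
      owners {
        owner {
          ... on CorpUser { urn }
          ... on CorpGroup { urn }
        }
      }
    }
  }
}
"""


def build_dataset_request(base_url, token, dataset_urn):
    """Return (url, headers, body) for a GraphQL POST. Nothing is sent."""
    url = base_url.rstrip("/") + "/api/graphql"  # assumed endpoint path
    headers = {
        "Authorization": f"Bearer {token}",  # assumed bearer-token auth
        "Content-Type": "application/json",
    }
    body = json.dumps({"query": DATASET_QUERY, "variables": {"urn": dataset_urn}})
    return url, headers, body


# Hypothetical host, token, and dataset identifier, for illustration only.
url, headers, body = build_dataset_request(
    "https://datahub.example.com/",
    "MY_TOKEN",
    "urn:li:dataset:(urn:li:dataPlatform:kafka,example.topic,PROD)",
)
```

The returned tuple can be handed to any HTTP client; keeping request construction separate from sending makes the query easy to inspect and test.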
Data Hub positively impacts my organization by enhancing collaboration. Previously, we had to ask the owning team to provide schema information. My company operates across regions, so a person in India could wait a day to receive schema information from someone in the US. With Data Hub, we have a centralized place where we can access the schemas of all datasets, which is very helpful. Additionally, whenever there is a problem, the lineage helps us quickly identify the impacted team or dataset.
Whenever there is an incident, we first go to Data Hub to see the downstream teams impacted and stop any jobs running on those datasets. It helps us save around eighty percent of time, as we previously had to track down information manually to find the owners, but using Data Hub, we can tag the owners of the datasets directly in the tool.
What needs improvement?
For improvements to Data Hub, I feel the security is a bit on the weaker side. We have ingestion jobs that require exact permissions for different owners, but this setup does not align with my company's grouping system. We need to create some custom grouping to manage those permissions. I would appreciate it if there were a method to consolidate all the information on a single page, which would simplify sharing permissions for running ingestion jobs.
Additionally, I do feel that the metadata test we run daily takes too long. Initially, it takes one day, which I find excessive. Ideally, we should get information within one hour. These are the two main issues that would benefit from improvement for our use case.
For how long have I used the solution?
I have been using Data Hub for one and a half years.
What do I think about the stability of the solution?
Data Hub is stable in my experience. However, there are times when we attempt to upgrade it, and it may go down for a couple of minutes, but not more than that.
What do I think about the scalability of the solution?
Data Hub handles scalability effectively, accommodating growing data and users.
How are customer service and support?
I have had to reach out to Data Hub customer support multiple times. For example, when we were setting up a private link to connect to Data Hub GraphQL APIs, we required our account to be whitelisted. I have also requested some future features for our use cases. For instance, when working with a metadata test scenario, I needed to have a range date column, which was not available. I requested the Data Hub team to make it public so we could use it.
What was our ROI?
I have seen a return on investment with Data Hub. For instance, I have noticed time savings during incidents and while looking up schemas. In terms of resources, Data Hub centralizes data cataloging and classification, saving us from having to disclose PII column information to teams not utilizing it. Regarding financial metrics, I do not have specific metrics available.
Which other solutions did I evaluate?
Before choosing Data Hub, we looked into Unity Catalog from Databricks, but we ultimately decided to stick with Data Hub.
What other advice do I have?
My advice for others looking into using Data Hub is to use it for cataloging, classification, and centralizing all your schema. Data Hub supports a variety of connectors and has excellent lineage options. Additionally, make sure to utilize the well-written documentation that can guide you in building your product solutions. I would rate this product a nine out of ten.
Centralized data library has boosted discovery, collaboration, and time savings across teams
What is our primary use case?
My main use case for Data Hub is that we use it as a library for all the data assets that we generate. It serves as an internal data mart where people can search for whatever data they need, and they can search by tags, by roles, and then add more metadata to it. This provides visibility to the data.
A specific example of how my team uses Data Hub in a real-world scenario is that we collect and manipulate a bunch of data layers. Because we have huge teams, the exposure to data that we have already manipulated can sometimes be hindered when using traditional systems. Data Hub acts as a search engine for all of the data. One example would be when the marketing team was looking for specific data around marketing. They discovered that once they searched it on Data Hub, it was easily visible. They did not have to retrieve it from the raw layer and manipulate it for their usage because another team had already built it.
Regarding how my teams interact with Data Hub, we use Data Hub with a self-hosted system. We have connectors which look into multiple data sources, manipulation engines, and orchestration layers to gather the metadata, and then that is pulled into Data Hub. This is how we get data assets in Data Hub.
What is most valuable?
The best features that Data Hub offers include primarily data discovery and data governance. Data Hub has data catalogs, which helps with the business glossary, ownership tracking, and lineage. Lineage is something that we are strongly using at this point in time. It helps us understand the impact analysis, such as what breaks if I change this column. Data Hub also provides data observability, helping us understand what data is fresh, what is not, and what has changed schema recently. Additionally, it makes our system AI and LLM ready.
The lineage feature has changed the way my team works and collaborates significantly. Because we now have data lineage through Data Hub, if we have a really huge dependent pipeline with multiple layers of upstream and downstream dependency, and something breaks in the downstream system, we can exactly pinpoint what all data assets would be affected. Having that lineage functionality helps us drill down what needs to be debugged and fixed and what exact part is breaking. It saves us time in remedying the issue.
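The impact analysis described above amounts to a graph traversal over lineage edges. The following is a minimal sketch of that idea only, not Data Hub's implementation or API: the function name `downstream_assets`, the adjacency-map representation, and the asset names are all hypothetical.

```python
from collections import deque


def downstream_assets(lineage, start):
    """Return every asset reachable downstream of `start`, in breadth-first order.

    `lineage` maps an asset name to the list of assets that read from it.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against diamonds and cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: each key feeds the assets in its list.
lineage = {
    "source_db.orders": ["kafka.orders"],
    "kafka.orders": ["warehouse.orders_raw"],
    "warehouse.orders_raw": ["warehouse.orders_clean", "warehouse.orders_audit"],
    "warehouse.orders_clean": ["dashboard.revenue"],
}

impacted = downstream_assets(lineage, "kafka.orders")
```

Here `impacted` lists everything that depends, directly or indirectly, on the broken asset, which is the set of jobs and owners to check during an incident.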
I really like the integrations that Data Hub provides. Data Hub has a very large set of integrations, including Snowflake, Databricks, BigQuery, Redshift, dbt, and Airflow.
Data Hub has positively impacted my organization as teams can now be directly dependent on one source of truth for all their data needs. The time spent finding information has become significantly smaller, which is the real productivity improvement that I have seen, impacting multiple teams throughout the organization. I estimate that we save about thirty to forty percent of the time now since we do not have to read documents or message people for specific data assets. This results in a productivity increase of around thirty to forty percent in terms of time and efficiency.
What needs improvement?
I think Data Hub can be improved by supporting the open source version better. Many features have moved to the paid version now, making it difficult for small-scale companies to operate on Data Hub because we are required to pay, even though it started as an open source project that is now essentially behind a paywall.
One needed improvement for Data Hub would be stronger AI-powered metadata discovery. I understand Data Hub has been investing in AI, but the natural language processing behind Data Hub search is not that good, and the search results are often inaccurate. Another improvement could be enhancing the dbt developer experience, such as surfacing dbt test failures directly in lineage. Additionally, when we change a schema, it would be beneficial if Data Hub could provide some form of risk scoring. Lastly, automated cleanup recommendations would help, because managing orphaned data assets in Data Hub currently takes a lot of manual time.
For how long have I used the solution?
I have been using Data Hub for a year.
What do I think about the stability of the solution?
Data Hub is pretty stable in my experience with no downtime or issues.
What do I think about the scalability of the solution?
Data Hub's scalability has been effective, handling our organization's growth and data volume well.
How are customer service and support?
I have not had to reach out to customer support.
Which solution did I use previously and why did I switch?
I did not previously use a different solution before Data Hub.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing has been pleasant, and I have no complaints.
Which other solutions did I evaluate?
Before choosing Data Hub, we evaluated Atlan and decided on Data Hub because it has a cleaner UI and also a decent open source community to support it.
What other advice do I have?
Data Hub does most of the job it is designed to do, but there could still be improvement as the industry progresses, particularly around metadata discovery. Regarding Data Hub's AI capabilities, its governance and security do the job really well as of right now. I do not have any complaints, especially around data classification, as it allows us to have control over whatever data we are displaying, including customization for PII, sensitive, and financial data. Data Hub has met our expectations regarding its accuracy and reliability of output, and there have not been any issues.
My advice to others looking into using Data Hub is that it is a pretty nice product right now with easy integration. The pricing model could be negotiated, so it is essential to keep that in mind. I would rate Data Hub a solid eight on a scale of one to ten.
Centralized lineage has reduced onboarding time and improved tracking of complex data flows
What is our primary use case?
Our main use case for Data Hub is data lineage and metadata governance for our UFC project, where we use multiple databases such as SQL Server, Databricks, and Snowflake. We have adopted Data Hub to create centralized metadata for all these databases.
A specific example of how we use Data Hub for metadata governance in our UFC project involves getting data from multiple sources, including Excel files, CSV files, APIs, and external databases, storing that data first in Amazon S3 buckets, and then loading it into Snowflake staging areas. We transform the raw data using a dbt model, create a silver layer, and then load data into the gold layer for reporting. With Data Hub, we have a centralized view of the data flow, which makes it easier to track issues in downstream applications such as Power BI reporting.
We also use Data Hub for onboarding new team members, as it was previously hectic to provide complete metadata details from our seven to eight data sources and over two hundred tables in our Snowflake database. Now, new team members can refer to the lineage of any table or column to understand the complete flow without relying solely on others.
What is most valuable?
One of the best features Data Hub offers is its ability to identify schema changes on the source side efficiently, especially when we pull data from multiple external databases such as SQL Server. It helps us quickly pinpoint necessary updates when columns are added or removed, streamlining what was previously a time-consuming manual process.
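To make the idea of schema change detection concrete, here is a small sketch that compares two snapshots of a table's columns. It only illustrates the comparison itself and is not how Data Hub detects changes; the function name `diff_columns`, the column names, and the types are hypothetical.

```python
def diff_columns(previous, current):
    """Compare two {column_name: type} snapshots of the same table."""
    prev_cols, curr_cols = set(previous), set(current)
    return {
        "added": sorted(curr_cols - prev_cols),
        "removed": sorted(prev_cols - curr_cols),
        # Columns present in both snapshots whose declared type changed.
        "retyped": sorted(
            c for c in prev_cols & curr_cols if previous[c] != current[c]
        ),
    }


# Hypothetical snapshots of a source table before and after a change.
before = {"id": "int", "amount": "decimal(10,2)", "region": "varchar"}
after = {"id": "bigint", "amount": "decimal(10,2)", "currency": "varchar"}

changes = diff_columns(before, after)
```

Sorting each list keeps the output stable, which makes such a report easier to compare from one run to the next.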
I find Data Hub quite manageable in the downstream application within the UFC data mart, mainly when issues are reported in Power BI. It provides a complete view of the data lineage, allowing us to backtrace the source of any discrepancies easily.
Data Hub has positively impacted our organization by reducing the knowledge transition period from three months to one month for new team members, enabling them to refer to the complete lineage without depending heavily on others, which is a substantial improvement.
What needs improvement?
In terms of improvements for Data Hub, it seems more useful for critical or large data pipelines, as small data architectures can be straightforward to understand without it.
Regarding enhancements for complex projects, I have noticed that sometimes Data Hub does not provide a complete picture of the lineage, particularly in complex data pipelines such as when we fetch data from an API to S3 and subsequently to Snowflake. We have to review the metadata in Data Hub closely.
For how long have I used the solution?
I have been working in the data engineering field for over twelve years.
What do I think about the stability of the solution?
Data Hub is stable in my experience.
What do I think about the scalability of the solution?
Data Hub's scalability is advantageous, as we onboard data from over one hundred fifty tables in SQL Server to Snowflake, and adding new tables to Data Hub is not time-consuming.
How are customer service and support?
Customer support for Data Hub is quite good; our infrastructure team received ample support during the initial setup within the given timelines.
Which solution did I use previously and why did I switch?
Previously, we used the Snowflake inbuilt lineage graph to identify data flow, but we switched to Data Hub for its centralized governance capabilities across multiple databases.
How was the initial setup?
The initial setup of Data Hub was completed by our infrastructure team, and I do not have complete visibility of how they made the purchase.
What about the implementation team?
Regarding pricing, setup cost, and licensing for Data Hub, it was handled by our client infrastructure team, so I lack visibility into those aspects.
What was our ROI?
I have seen a return on investment with Data Hub, notably in reducing the knowledge transition period and improving our ability to troubleshoot production issues in Power BI, thus saving time.
Which other solutions did I evaluate?
We did not evaluate other options before choosing Data Hub since we were solely relying on the lineage functionalities of Databricks and Snowflake.
What other advice do I have?
My advice for others considering Data Hub is to utilize it, as it is free and can significantly reduce time for production support and addressing data issues, while simpler data models can benefit from the inbuilt functionalities of their respective databases. I would rate this product eight point five out of ten.
Metadata governance has improved data lineage visibility but still needs simpler integrations
What is our primary use case?
I work with Data Hub as a user, but I also have some administrative responsibilities there. I'm not a final user; the final users are business users, and I play some administrative roles in the tool to have the metadata information available for all Uber users.
I'm a Data Quality