DataHub
Data mapping has improved metadata completeness and now supports faster business data discovery
What is our primary use case?
Within Renault's engineering teams, much of the data lacked sufficient metadata, such as descriptions of tables and columns. The objective was to complete the definitions and descriptions of business data objects in the glossary and map them to the tables and columns that make up the engineering department's data sets, so that anyone searching for data finds adequate definitions and descriptions of the data used in this department.
I use Data Hub with two of my clients. Renault, the car manufacturer, changed its data catalog from Zeenea to Data Hub, and my mission is to contribute to enriching this catalog by conducting workshops with data providers, data stewards, and all the other stakeholders involved. The aim is to map real data to the definitions and descriptions of business data objects available in the company's glossary. My second mission was with Hitachi Rail, a company that provides rail services, where I benchmarked several data catalogs, including OpenMetadata and Data Galaxy. Data Hub was chosen for its available functionalities; the task was to implement it within a specific scope and then extend its usage if everything worked well.
What is most valuable?
The best feature that Data Hub offers, in my experience, is the ability to map business glossary terms to the actual data sets, tables, and columns.
The mapping feature helps my team and clients significantly because it addresses the lack of metadata information about the tables and columns used in the company's data lake, enriching the data catalog considerably through this mapping.
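As an illustration of what "enriching the catalog" can mean in numbers, here is a minimal Python sketch that computes how many tables and columns carry a description or a glossary mapping. The `Table` and `Column` classes, the field names, and the sample data are all hypothetical; this is not Data Hub's data model or API, only one way to track such a completeness figure.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    description: str = ""
    glossary_term: str = ""  # hypothetical link to a business glossary term


@dataclass
class Table:
    name: str
    description: str = ""
    columns: list[Column] = field(default_factory=list)


def completeness(tables: list[Table]) -> dict[str, float]:
    """Return the share of tables and columns that are documented or mapped."""
    columns = [c for t in tables for c in t.columns]
    described_tables = sum(1 for t in tables if t.description.strip())
    described_columns = sum(1 for c in columns if c.description.strip())
    mapped_columns = sum(1 for c in columns if c.glossary_term.strip())
    return {
        "tables_described": described_tables / len(tables) if tables else 0.0,
        "columns_described": described_columns / len(columns) if columns else 0.0,
        "columns_mapped_to_glossary": mapped_columns / len(columns) if columns else 0.0,
    }


# Made-up sample catalog, for illustration only.
catalog = [
    Table("vehicle_tests", "Bench test results per prototype", [
        Column("test_id", "Unique test identifier", "Test"),
        Column("vin", "", "Vehicle Identification Number"),
        Column("torque_nm"),
        Column("run_date", "Date the test was executed"),
    ]),
    Table("parts", "", [
        Column("part_no"),
        Column("supplier", "Supplier name", "Supplier"),
    ]),
]

report = completeness(catalog)
```

Tracking a figure like this before and after each workshop would give a rough measure of how much the mapping work has improved the catalog.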
Data Hub positively impacts my organization and clients by making it easier to search for data. It facilitates easier collaboration and helps save time. However, concerning data quality, it is not sufficiently equipped as it lacks components to evaluate the data quality level, which is a feature available in other data catalogs, indicating an area for improvement.
What needs improvement?
Data Hub can be improved in several ways, primarily by enhancing the data quality evaluation capabilities. Additionally, I would suggest improving the hierarchy of business glossary terms, as understanding the characteristics of each business data object can be challenging within the current structure of business glossary terms in Data Hub.
What other advice do I have?
I have conducted benchmarks with OpenMetadata and Data Galaxy, but I have never used them for a mission with my clients. Before choosing Data Hub, I evaluated all the principal tools on the market, including Castor, Data Galaxy, and OpenMetadata.
I have no experience with pricing as I used the free license. My advice for others looking into using Data Hub is to consider the paid version for enhanced options related to data quality and the availability of KPIs regarding the completeness and accuracy of metadata, which results in a superior experience with this tool. I would rate this product an eight out of ten.
Cataloging data and business terms has reduced questions and speeds up KPI tracking
What is our primary use case?
My main use case for Data Hub is as a catalog system: we are integrating all of our data sources into Snowflake, and we want to catalog the data and share business glossary terms with our company's employees.
A specific example from my daily workflow: all of our data is in Snowflake, but the employees using Snowflake did not know what kind of data it holds. They did not know which tables, columns, metrics, and KPI definitions exist, so we use Data Hub to search the data in Snowflake and to identify who is using it.
My main use case is covered.
How has it helped my organization?
Data Hub has positively impacted my organization because there are many data analysts in each team, and the time to Q&A has significantly decreased since we started using Data Hub. This improvement is also seen in our KPI tracking.
I cannot provide specific time savings, but as an example, we used to receive around 100 user questions that required searching Snowflake to determine which tables should be used; now that is down to almost 10.
What is most valuable?
In my opinion, the best features Data Hub offers are the search function and the tagging function. If I add a tag to tables or columns, it is very easy for the people who need that information to find it.
I am trying to apply tags to all of our data. This work is still in progress, and we have covered almost 70% of our data so far.
What needs improvement?
We are using the free version of Data Hub with Docker Compose, so it is somewhat difficult to work out the lineage. With the free version we can only see table-level lineage and cannot search column-level lineage, which is why I would like column-level lineage to be added.
Besides column-level lineage, I think more example documentation showing what is essential for a company like ours would be very useful: there are many glossary terms and features in Data Hub, and I did not know which of them matter most for us.
I also have one more concern about running Data Hub with Docker Compose: it sometimes runs into memory issues and consumes a lot of memory, so I need a more efficient way to run Data Hub without these issues.
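To make the difference between the two lineage levels concrete, here is a minimal Python sketch. The table and column names are made up, and the structures are not Data Hub's lineage model; the point is only that table-level lineage can be derived from column-level edges, but not the other way round.

```python
# Column-level lineage edges:
# (upstream_table, upstream_column) -> (downstream_table, downstream_column)
# All names below are hypothetical.
column_edges = [
    (("raw.orders", "amount"), ("mart.revenue", "total_amount")),
    (("raw.orders", "currency"), ("mart.revenue", "total_amount")),
    (("raw.customers", "id"), ("mart.revenue", "customer_id")),
]


def table_lineage(edges):
    """Collapse column-level edges into distinct table-level edges."""
    return sorted({(src[0], dst[0]) for src, dst in edges})


def upstream_columns(edges, table, column):
    """List the upstream columns that feed one specific downstream column."""
    return sorted(src for src, dst in edges if dst == (table, column))


tables_only = table_lineage(column_edges)
feeds_total = upstream_columns(column_edges, "mart.revenue", "total_amount")
```

With only `tables_only` available, it is possible to see that `mart.revenue` depends on `raw.orders`, but not which source columns feed `total_amount`; that second question needs the column-level edges.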
For how long have I used the solution?
I have been using Data Hub for almost one year.
What other advice do I have?
We are using private clouds in AWS, and we have deployed Data Hub on the AWS EC2 server with Docker Compose.
The cloud provider we use is AWS.
I did not purchase Data Hub through the AWS Marketplace; I am just using the EC2 server and deploying it with Docker Compose.
My advice for others looking into using Data Hub is that if there is no catalog system or data dictionary system and if there are many KPIs or metrics within their company, then I recommend Data Hub to those kinds of teams.
I give Data Hub an overall rating of 8.
Metadata lineage tracking has improved governance and currently supports clear data observability
What is our primary use case?
My main use case for Data Hub is data lineage tracking. With Data Hub, we track multiple ingestion sources and the different places where the data resides in S3. We bring all that metadata into Data Hub to track lineage across the ingestion patterns and transformations we perform, and to see how data moves between tables, assets, and data pipelines. Whatever transformations we do with Spark, S3, and Snowflake are tracked via Data Hub. We have S3 buckets and Snowflake tables, and all of that lineage tracking is managed through the platform.
My main use case is mostly covered as we used Data Hub for metadata tracking and lineage for whatever transformations that we do so that we can track each transformation down the line.
What is most valuable?
In my experience, the best features Data Hub offers include lineage tracking, which is mostly on the asset level, a good glossary, and good connector support.
Regarding the asset level and the glossary: we need a glossary of our products so that, for each product, it is easy to track what happened and when, how many assets are related to it, and so on. For asset integrations, Data Hub makes it quite easy to ingest the metadata of those assets from S3 via connectors. It has good connector support, although it is limited in some cases.
Overall, Data Hub is a good tool. For lineage, metadata, and high-level observability, including domain descriptions, PII classification, and keeping datasets in one place along with their policies, it does the job well. Our usage is decided project by project, and we do use Data Hub in some of our projects.
What needs improvement?
I would like to add that the connectors sometimes have limited support for wildcards when selecting the items or assets to ingest from sources like S3; the wildcard filters are not very good. Additionally, Data Hub has a problem with column-level lineage support, especially for non-pro users or those without a paid plan. Among the free features of open-source Data Hub, those are the two I found could be improved for my use case.
As for other improvements, I have already mentioned the limitations on wildcards in ingestion and connectors; that can be worked on, especially in the open-source part of Data Hub. Beyond that, I find the UI quite good.
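One way to work around limited wildcard filtering is to pre-select object keys before handing them to an ingestion job. The sketch below uses Python's standard `fnmatch` module; the `select_keys` helper and the sample keys are hypothetical, and this is not how Data Hub's S3 connector evaluates its own patterns. Note that in `fnmatch` a `*` also matches `/`, which differs from path-aware globbing.

```python
from fnmatch import fnmatchcase


def select_keys(keys, include, exclude=()):
    """Keep keys matching any include pattern and no exclude pattern."""
    selected = []
    for key in keys:
        if not any(fnmatchcase(key, pattern) for pattern in include):
            continue
        if any(fnmatchcase(key, pattern) for pattern in exclude):
            continue
        selected.append(key)
    return selected


# Made-up S3 object keys, for illustration only.
keys = [
    "sales/2024/01/orders.parquet",
    "sales/2024/01/_tmp/orders.parquet",
    "sales/2024/02/orders.csv",
    "hr/2024/01/payroll.parquet",
]

picked = select_keys(
    keys,
    include=["sales/*/*.parquet"],  # any Parquet file under sales/
    exclude=["*/_tmp/*"],           # skip temporary folders
)
```

Here `picked` contains only `sales/2024/01/orders.parquet`: the CSV and the `hr/` file fail the include pattern, and the `_tmp` copy is dropped by the exclude pattern.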
For how long have I used the solution?
I used Data Hub for one and a half years.
What other advice do I have?
My advice for others looking into using Data Hub is that it is a good tool if you want to capture all that metadata, lineage, keep track of governance, security, and observability. It just depends on how you want to use it; you can choose the open-source version or the paid version and subscription-based model. The paid versions have more features, but open-source Data Hub, which most people will try to go for, has some limitations, such as the missing column-level lineage with Spark. You need to consider those points, but overall, it is good. I would rate this product an 8 out of 10.
Data mesh has connected 2,000 colleagues and has made cross‑team collaboration transparent
What is our primary use case?
My main use case for Data Hub involves integrating our HR system or Active Directory, which automatically pulls in all 2,000 workers and groups them into their respective project squads and R&D teams. Each team gets its own team profile page in Data Hub, which helps solve the classic corporate headache of determining who to ask for specific information.
When a team builds a data pipeline, a Kafka topic for telecom signals, or a dashboard, it is tagged explicitly with their team profile as the owner in Data Hub. This means that if a developer in Split, working in the same company, needs to find a specific network dataset, they do not waste days spamming Slack channels; they can simply look it up in Data Hub and find the team profile that owns it along with the direct contact info or Slack channel.
Additionally, it enables us to run a data mesh model with 2,000 people, allowing one central IT team to manage everything while Data Hub facilitates splitting the company into logical domains such as electronic health, telecom networks, IoT, or smart cities.
What is most valuable?
The best features that Data Hub offers include the ability to centralize everything in one platform, such as creating profiles and organizing them into separate domains like engineering, health teams, supporting teams, and HR teams. This allows information to be shared across different domains.
Utilizing the data mesh model enables the company to maximize functionality using a single solution. Data Hub supports collaboration between different teams and departments significantly, as evidenced when we created various data mesh modules and established different domains such as E-Health, telecom networks, and IoT. This allowed us to share datasets effectively, and with authenticated users, the communication and responses were much quicker.
Among those features, I find the collaborative aspects the most valuable in my work because it has greatly improved our operations over the past year. We evaluated various licenses and methods to integrate data catalog platforms, ultimately deciding to move forward with Data Hub since it was more compatible with our company's security requirements. Compared to other tools, it received better support from the community, which is updated daily, allowing us to collaborate effectively through contact sharing.
Data Hub has positively impacted my organization by functioning as an all-in-one solution. It uses data mesh and separates domains to manage privileged access based on user validation, allowing us to share data sets across the company, which informs everyone about internal regulations. Furthermore, it significantly aids new joiners in understanding the operations and knowing who works on specific projects, while also providing updates on changes occurring within various sectors and domains.
The frequency and quality of updates or new features released for Data Hub have been impressive. This extensive community support was a key factor for us at Ericsson Nikola Tesla to choose Data Hub as our data catalog.
What needs improvement?
Regarding how Data Hub can be improved, I believe they should focus on enhancing their marketing efforts. Within our company, we were unaware of the Data Hub platform while searching for data catalog options that offered strong security and collaboration. Better marketing would help other companies learn about this effective solution.
My rating of eight rather than a nine or ten pertains to the connections with different systems. Specifically, the integration with Slack and Azure, as well as how we link our HR system to Data Hub, could be improved for better compatibility.
Integrating Data Hub with our existing tools and systems was not very easy, which is why my rating is an eight. We attempted to incorporate our HR system with Data Hub, aiming to set governance status for the 2,000 employees in our organization, but I did not complete this aspect before leaving the organization.
For how long have I used the solution?
I have been using Data Hub for at least six months at the company called Ericsson Nikola Tesla in Zagreb, which has a massive operation with an entire ICT and R&D division of around 2,000 workers.
What do I think about the scalability of the solution?
In terms of scalability, I believe Data Hub performs exceptionally well as more teams come on board, making it efficient for large organizations with approximately 2,000 employees. It adequately supports the scalability of data sets and the implementation of data mesh models.
How was the initial setup?
During implementation, the documentation and support resources from Data Hub were very helpful. I followed the guidelines, accessed each section, and understood the platform effectively, which made the initial setup easy.
What other advice do I have?
Data Hub is flexible, optimistic, and user-friendly in terms of its interface and experience. I rate Data Hub an eight on a scale of one to ten.
The learning curve for new users adopting Data Hub is addressed through their learning section that guides users on how to navigate the platform. I found it quite simple and effective to follow.
We purchased Data Hub through the AWS Marketplace.
As for specific outcomes or metrics, I currently do not possess numbers since we are still in the early stages of implementing Data Hub within our company. However, the HR department reported significant time savings in completing tasks before and after adopting Data Hub, which has resulted in faster completion and better collaboration without interrupting others.
Data Hub has worked for me personally: after we began rolling it out across the Ericsson Nikola Tesla company network, it proved incredibly helpful for easier access to information. By positioning team profiles at the center of Data Hub, it prevents the duplication of data sets, accelerates onboarding for new engineers, and fosters more connected and collaborative teams within our large employee base. Personally, it has helped me specify tasks and has contributed to the company's progress with the data catalog we chose.
My advice for others considering using Data Hub is to understand how it works and explore its integration potential within their organization. Engaging with community support can also be beneficial, as the team's collaborative approach is impressive.
Centralizes data lineage and ownership and has improved our organization-wide data governance
What is our primary use case?
I use Data Hub for our data lineage, data management, data heritage, and our data dictionary. Our organization is quite large, with about 2,000 people working on different initiatives, and everyone wants to connect to a database somehow. As the data engineering team, we are responsible for connecting every single data source that we have, defining each one, and providing an accurate single source of truth for the data so everyone can have the same understanding of the data they are discussing. Since we are ingesting every database that the company has into our own data infrastructure through different tools, we needed to have a clear understanding of data quality, data lineage, and the data discovery part of the process.
What is most valuable?
The Helm chart of Data Hub is designed really well, which makes our deployment strategies lean and operational. The UI is excellent, and I really appreciated the ability to treat data gathering and data ingestion as GitHub workflows. Data Hub is one of the services that was truly scalable, at least in the open-source version, which is one of the things we valued since you could scale every part of the system it used, including its internal MySQL or metadata database, Elasticsearch, and the search capabilities. Everything about Data Hub was quite scalable due to its excellent Helm chart, as they really focused on the Kubernetes aspect.
What needs improvement?
We encountered some issues when we wanted to connect our streaming infrastructure to Data Hub, which was somewhat problematic.
In our data streaming infrastructure, we had a database CDC'd through Kafka Connect to a Kafka topic, and at the end of the pipeline, it would go to either an OLAP or a data lakehouse. However, the problem with visualizing this data lineage was that while the connection between MySQL and Kafka worked, when we wanted to track data from Kafka to other services, we couldn't track everything back because the IDs were generated randomly and couldn't be connected. We had to fix this manually by stating where the data had gone, which was tedious.
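The manual fix described above amounts to giving every hop a stable identifier and declaring the edges explicitly. Here is a minimal Python sketch of that idea. The URN layout follows Data Hub's dataset URN format as I understand it; the dataset names, the `clickhouse` target, and the `trace_back` helper are hypothetical, and the sketch assumes each dataset has at most one declared upstream.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a deterministic dataset identifier from platform, name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Manually declared hops (upstream, downstream). Because the identifiers are
# derived from names rather than generated randomly, the Kafka topic written by
# the CDC step and the topic read by the sink resolve to the same node.
hops = [
    (dataset_urn("mysql", "shop.orders"), dataset_urn("kafka", "cdc.shop.orders")),
    (dataset_urn("kafka", "cdc.shop.orders"), dataset_urn("clickhouse", "analytics.orders")),
]


def trace_back(urn, hops):
    """Walk upstream from `urn` until no further upstream is declared."""
    upstream_of = {dst: src for src, dst in hops}
    path = [urn]
    seen = {urn}
    while path[-1] in upstream_of:
        parent = upstream_of[path[-1]]
        if parent in seen:  # guard against accidental cycles
            break
        path.append(parent)
        seen.add(parent)
    return path


lineage_path = trace_back(dataset_urn("clickhouse", "analytics.orders"), hops)
```

With stable identifiers, `lineage_path` runs from the analytics table through the Kafka topic back to the MySQL source, which is the chain that could not be reconstructed when the IDs were random.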
Data Hub's GMS, or Generalized Metadata Service, is a good service that I used regularly, but the CLI changed considerably across versions. When I installed a different version, there wasn't enough consistency to ensure that commands I used would still work in future versions of the Data Hub CLI, which was frustrating. I also recall that setting up Kafka without ZooKeeper was not possible, which was inconvenient, though I should verify this as I don't remember whether they fixed it. At least from my recollection, when I wanted to set it up one and a half years ago, there was no direct support for KRaft in the Helm chart.
For how long have I used the solution?
I have been using Data Hub for approximately one and a half years.
Which solution did I use previously and why did I switch?
Before adopting Data Hub, we considered moving forward with OpenMetadata but decided against it since it couldn't support MySQL version 5.
How was the initial setup?
The setup of Data Hub was quite straightforward. One aspect of the architecture I appreciated is that Data Hub relies heavily on CronJobs and Jobs in Kubernetes. Whenever it needs to fix something, it starts a Job to repair its MySQL database or its Elasticsearch index. Operationally, I find it to be an excellent service, as they worked well on that aspect in the open-source version. However, the lack of out-of-the-box support for KRaft was somewhat problematic.
Which other solutions did I evaluate?
I evaluated OpenMetadata before choosing Data Hub. OpenMetadata's lack of support for more databases and data sources was the deciding factor, whereas with Data Hub we didn't encounter any such problems; it worked really well.
What other advice do I have?
Data Hub helped us by making it clear who owned which data and who needed to make changes to clean up the deprecated data models and infrastructure we had, which was the most significant benefit. The tooling Data Hub provided made the faults and bugs in our different data sources visible to us.
I would recommend that organizations considering Data Hub adopt GitOps practices. In our implementation, every single ingestion or transformation was triggered by GitLab CI/CD, which made it straightforward for everyone to use. Running every ingestion job as a CronJob in Kubernetes through GitOps was the most innovative approach we took.
I would rate this product a nine out of ten.
Data catalog has unified business terms and democratized access to our data lake
What is our primary use case?
My main use case for Data Hub is to implement a data catalog for one of the clients that the consultancy I work at is serving.
A specific example of how the data catalog was used for that client is that it was used to define business terms and to explore the terms from the data glossary by adding definitions. It was also used to capture all the tables and fields that were connected to a data lake, allowing me to explore the entire production data lake and tag the tables and fields, segmenting these tables by domains such as sales tables and marketing tables.
What is most valuable?
Data Hub offers several best features including the tagging capability, domain segmentation, data exploration, and creation of a data glossary, which was very interesting to me. Additionally, the ease of plugging in new data sources is exceptional. Data Hub can be easily integrated with a data lake, and the environment can be explored through the metadata via Data Hub. I found the connection part straightforward.
Data Hub had a positive impact on my organization by disclosing to the organization and to business users what existed in the data lake. The interface that the technical team has with the tables and fields is designed for professionals in the technical area. Having a data catalog helps provide a better interface for data discovery and data democratization within the organization since everyone should have access to what types of data the organization has, and that was the biggest impact.
What needs improvement?
I started using the quality part for consistency, but I had limited contact with it and we did not progress much.
I believe the data quality module can always be improved by examining what is available in the market and making appropriate improvements to the tool. The data quality part is very important and it is not always fully leveraged as it should be. I also think that providing consulting or support with professionals who are qualified to use Data Hub would be interesting, along with providing training and certifications for the tool so that those who are implementing it can specialize increasingly in its features.
For how long have I used the solution?
I have been using Data Hub for around one year.
What do I think about the stability of the solution?
Data Hub is stable, and I did not have any stability problems when I was working with the tool.
What do I think about the scalability of the solution?
Data Hub's scalability is very easy, as we were able to add users and new datasets very quickly and smoothly.
Which solution did I use previously and why did I switch?
I was not previously using a different solution. The implementation was already directly part of a data governance initiative and it was done directly with Data Hub, meaning there was no previous solution.
What about the implementation team?
I believe the consultancy has some kind of commercial relationship with Data Hub to promote and offer Data Hub as a data catalog solution.
Which other solutions did I evaluate?
Before choosing Data Hub, the consultancy worked with tools such as Google's Dataplex and Microsoft Purview.
What other advice do I have?
My advice for others thinking about using Data Hub is to have the governance initiative well-structured and to have all the documentation for data owners and data stewardship so you know who will be the points of contact when the tool starts being configured, ensuring that you have people responsible for doing reviews and approvals in the tool. I would rate this product an eight out of ten.
Data governance has unified domains and now supports conversational discovery for all teams
What is our primary use case?
My main use case for Data Hub is to build data products inside Natura; primarily, I built data products for CRM, which is customer relationship management, and also for some data products for the product field, such as analytic fields. I used it by dividing the company into domains, and each domain has its own functionality and its own structure. With that approach, I used it extensively for building domains. I also used it to build data lineage across the entire data journey, from the ingestion of the data to the use of the data in the final part, such as in a data product or in a dashboard.
A specific example of how I used Data Hub for building domains and data lineage is the domain called GenAI, which is primarily built for products based on AI, mainly generative AI. To accomplish this, I used Data Hub to track the data from the ingestion field. I used some CDP tools such as Segment and I also have data in an S3 bucket that was ingested to Databricks using Airflow. With that setup, I track this lineage from the origin system. After that, I performed a lot of transformation of the data inside Databricks to clean the data and conduct some data augmentation. After that, this data is used to train some models using Databricks LLM. With that, I ingest all this metadata into Data Hub and I can see from where the data is coming from and to where the data is going. This is primarily for LLMs to help consultants at the end of the product.
What is most valuable?
The best features that Data Hub offers include the ability to ask conversational questions inside the platform, which I believe is the best thing they have built in the past year. It is also easy to connect different data sources. Besides data lakes, there are connectors for databases, business analysis tools, and other tools. I can connect many types of data to Data Hub and see what is going on and how we govern the data. Data Hub is a pretty good tool for that. I also greatly value the open-source version because it is free and everyone can use it.
I do not have much experience with the conversational questions feature, but with it I do not need to open an asset to see where the data comes from and where it goes. I can simply ask, 'How can we calculate the sales in this month?' and Data Hub will identify which table to use, where its data comes from, and where it goes. This is very effective.
Data Hub has impacted my organization positively by helping us build a data governance environment and share the knowledge about the data for the entire company. As we used the open-source version, we have no limitation in how many people can use the tool, which is excellent. I can conduct many tests and test as quickly as possible. It is very good for building POCs, for example. Data Hub also helps to give this understanding of the data for the entire company. Everyone in the company can see the data and know where the data is coming from and where it is going. I believe it is very effective and all the people in the organization, not only the data field personnel, can understand more about the data and also help to build better products.
What needs improvement?
I believe Data Hub could provide more functionalities in the free version. I understand that we have to pay the persons who build the platform, but the free version has some limitations. Some capabilities of the paid version being included in the free version would be beneficial. Another improvement that is needed in Data Hub is how I can get data from Data Hub to build some metrics. I know that I have the API and the GraphQL API, but I believe it could be better. If this is improved, it would be very helpful.
For how long have I used the solution?
I used Data Hub for over two years, mostly in the open-source version.
What other advice do I have?
I do not have precise metrics to provide; I only have rough figures. I certainly improved the data discovery part of building a data product, because it is really fast to find out whether a data product already exists. In the past, I had many products that were the same, which meant doing the work two or three times in different parts of the process. I do not know the exact amount of time saved, but I certainly saved time. One figure I do have: before Data Hub, I believe 20 to 30 people used and understood the data. Currently, almost 250 people use Data Hub.
I did not use many AI features in Data Hub, as I stopped using Data Hub before it started offering these functionalities.
My advice to others looking into using Data Hub is to start by trying the tool using the free version. If it is sufficient and you already understand how to use it, you can transition to the paid version. However, you can accomplish everything in the free version. I would rate this product an eight out of ten.
Metadata management has streamlined lineage tracking and data discovery for our teams
What is our primary use case?
My main use case for Data Hub involves connecting a lot of data that is available and coming from upstream data points or data lakes like Kafka, or in BigQuery itself. We usually connect this data to Data Hub as it is a modern data catalog designed to streamline metadata management. We can put all the metadata of our data inside Data Hub, setting who the owner is and tracking where this data is coming from and where it is consumed downstream. We can have data discovery and governance as well.
My specific example of using Data Hub in my daily workflow involves an orders table, which is very large and is joined with several other tables. This data is populated by a Kafka consumer that consumes messages from a specific topic, and thereafter, a batch that runs once a day transfers this data to a history table in BigQuery. This allows us to manage visualizations and data management tasks. We usually put all this metadata in Data Hub to track the data lineage, profile datasets, and establish data contracts. This way, we know the lineage of each field, and if any batch fails the data contract check, it sends an email notification to the responsible person. We can add more contracts such as validations to the data as necessary.
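The contract check described above can be sketched in a few lines of Python. Everything here is hypothetical: the `Rule` and `Contract` classes, the rules themselves, and the address are stand-ins, and the sketch only builds the notification payload rather than sending an email. It is not Data Hub's data contract feature, just the general shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    column: str
    description: str
    check: Callable[[object], bool]


@dataclass
class Contract:
    dataset: str
    owner_email: str
    rules: list[Rule]


def validate(contract: Contract, rows: list[dict]) -> list[str]:
    """Return one message per rule violation found in the batch."""
    violations = []
    for index, row in enumerate(rows):
        for rule in contract.rules:
            if not rule.check(row.get(rule.column)):
                violations.append(f"row {index}: {rule.column} {rule.description}")
    return violations


def notification(contract: Contract, violations: list[str]):
    """Build the message for the responsible person, or None if the batch passed."""
    if not violations:
        return None
    return {
        "to": contract.owner_email,
        "subject": f"Data contract failed: {contract.dataset}",
        "body": "\n".join(violations),
    }


# Made-up contract and batch, for illustration only.
contract = Contract(
    dataset="orders_history",
    owner_email="owner@example.com",
    rules=[
        Rule("order_id", "must not be null", lambda v: v is not None),
        Rule("amount", "must be a non-negative number",
             lambda v: isinstance(v, (int, float)) and v >= 0),
    ],
)
batch = [{"order_id": 1, "amount": 10.0}, {"order_id": None, "amount": -5}]

violations = validate(contract, batch)
message = notification(contract, violations)
```

Adding another validation is a matter of appending a `Rule` to the contract, which mirrors the idea of extending the contract as new requirements appear.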
What is most valuable?
The best features Data Hub offers include its integration with many popular tools such as Apache Airflow, Snowflake, dbt, Looker, Apache Kafka, and BigQuery. Our data lives in several of these places: we use Apache Airflow for our DAGs, BigQuery as our database, and Apache Kafka for message queues. Data Hub connects easily with all of these tools and has excellent data discovery and visualization capabilities. We can see where data comes from and its upstream and downstream relationships. If we remove a column, we can assess the impact of that change. Furthermore, if different teams that do not communicate regularly are using duplicate datasets, onboarding all data to Data Hub allows us to identify these duplicates easily.
Out of all those features, I believe data discovery and impact analysis are the most valuable for my team because when we want to add or drop a column, we can run an impact analysis to understand the downstream effects. Data Hub also shows us who owns a dataset, so we can easily contact the owner. Tracking the data lineage back to the source table is also a key benefit.
Data Hub has positively impacted my organization by significantly reducing the manual work that was previously needed to identify upstream and downstream data relationships and to recognize duplicate datasets. If a data contract is broken, we are now notified right away, which makes the process much easier and more efficient. It is particularly useful for data engineers and platform teams, who can check for problems directly within Data Hub.
Data Hub has saved our team a lot of time. For example, in a large company like Porch, if I want to know whether a specific dataset exists, I can check Data Hub, as it serves as a centralized point for managing the metadata of our data. While it does not contain the data itself, it does contain the metadata necessary for understanding a dataset's origin. If the dataset does exist, I can simply see who the owner is and reach out to them, which reduces the dependency on others by providing direct access to information in Data Hub.
What needs improvement?
Regarding improvements for Data Hub, I think there is no scope for improvement. It is the best tool in the market currently. I have reviewed some other tools as well, but Data Hub stands out.
I do not see anything lacking. Data Hub offers both cloud and self-hosted deployment options, and it has a robust community. They hold open Slack community sessions as well as webinars, typically once or twice a month, to share knowledge and updates, which is a significant benefit. I have not encountered any major issues with Data Hub.
For how long have I used the solution?
I have been using Data Hub for around three years.
What do I think about the stability of the solution?
I have not seen any downtime within Data Hub.
What do I think about the scalability of the solution?
In my experience, Data Hub's scalability is impressive. We have around 300 datasets from BigQuery, 400 from Kafka, and many more from other sources, over 1,000 in total, and we onboarded all of them without any issues or downtime.
How are customer service and support?
Customer support for Data Hub is responsive and attentive. If I raise a ticket today, they usually respond by the next day. Additionally, they host monthly webinars to discuss new features and updates. They also have an open Slack community where responses tend to be immediate.
Which solution did I use previously and why did I switch?
I used OpenMetadata before adopting Data Hub, but I found Data Hub to be more user-friendly and easier to use.
How was the initial setup?
Data Hub exceeds expectations in user-friendliness and functionality. It has a great user interface, an SDK, APIs, and GraphQL previews, all complemented by a responsive Slack community and helpful customer support. The clear documentation, usable website, and easy setup all contribute to its overall effectiveness.
What other advice do I have?
We also use several of Data Hub's other data governance features. We can add a domain to any dataset, for example to mark it as part of the orders domain or the customer domain. We can add tags, manage data ownership by indicating which team owns specific data, and create glossary terms, which act as labels for different datasets.
I find myself relying on Data Hub for lineage checks and data contracts once a week.
Regarding Data Hub's AI capabilities, it exposes several MCP servers that integrate easily with LLM clients and frameworks such as Claude, Cursor, Gemini, and LangChain, along with the Agent Development Kit (ADK) from Google. In terms of security, Data Hub ensures that no company data is exposed externally, and they keep the company's metadata strictly confidential, under NDA-style terms that prevent sensitive information from being revealed.
In terms of accuracy and reliability of output with Data Hub's AI capabilities, I find it exceeds 95% accuracy. Having utilized the MCP connectors with Claude and the ADK, I can confidently say that it performs flawlessly and retrieves data effectively.
My advice for others considering Data Hub is to add glossary terms and categorize datasets by domain from the start. With a small number of datasets you can manage without them, but as the amount of data scales, glossary terms and domains become immensely helpful. Initially, we did not use them, but we found their value as we scaled up and needed to filter data efficiently. I would rate Data Hub a perfect 10 overall.
Centralized lineage and catalog have transformed how we track incidents and classify sensitive data
What is our primary use case?
My main use case for Data Hub is to catalog the datasets across my company and to get the lineage of data in my company's pipelines.
To give an example of how I use Data Hub in my day-to-day work, suppose data flows from a source to Kafka and then on to several data stores. If other teams consume that data and there is a problem at the Kafka level, we would not otherwise know who all the consumers are. Data Hub is very useful for us in this scenario. It can generate the lineage from source to destination, so when there is an issue on the Kafka side, we know which downstream data sources and end results are impacted.
I would add that when a customer or another team wants to consume data, we are not always sure what kind of data it contains or whether we should share it, so we have to look at the schema. Data Hub is useful here because we catalog all the datasets across my company in it, which lets us look up the table and schema information later so that the team can identify which data is PII and which is not.
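The incident scenario above comes down to walking a lineage graph downstream from the failing node. The Python sketch below shows that idea with a plain breadth-first search. It is not how Data Hub implements lineage, and the asset names in `LINEAGE` are made up for illustration.

```python
from collections import deque


def downstream_of(lineage, start):
    """Breadth-first walk returning every asset reachable downstream of `start`."""
    seen = {start}
    impacted = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: each key feeds the assets in its list.
LINEAGE = {
    "source_db.orders": ["kafka.orders_topic"],
    "kafka.orders_topic": ["warehouse.orders_raw", "search.orders_index"],
    "warehouse.orders_raw": ["warehouse.orders_daily"],
    "warehouse.orders_daily": ["dashboard.revenue"],
}

# Everything that could be affected by a problem on the Kafka topic.
impacted = downstream_of(LINEAGE, "kafka.orders_topic")
```

The `seen` set keeps the walk from looping if the lineage ever contains a cycle, and the result lists the closest consumers first.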
What is most valuable?
The best features Data Hub offers are its strong support for cataloging and lineage, with connectors for all the different systems we use across my company's dataset pipelines. Apart from that, the GraphQL APIs provided by Data Hub are very good, allowing us to get all the information we need programmatically whenever we need it.
As for how the GraphQL APIs help my team in day-to-day tasks, we sometimes use custom logic to check whether data is PII or non-PII. We have an AI model running on top of Data Hub for this classification. Given a dataset URN, we fetch information about the dataset through the GraphQL APIs. They are very handy because we can choose exactly which properties to request and pass on only the necessary information. For example, if we need structured properties, we can get them. If we need tags or owners, we can retrieve those as well.
Data Hub positively impacts my organization by enhancing collaboration. Previously, we had to ask another team to provide schema information. My company operates across regions, so a person in India could wait a day to receive schema information from someone in the US. With Data Hub, we have a centralized place where we can access the schemas of all the datasets, which is very helpful. Additionally, whenever there is a problem, the lineage helps us quickly identify the impacted team or dataset.
Whenever there is an incident, we first go to Data Hub to see which downstream teams are impacted and stop any jobs running on those datasets. It saves us around eighty percent of the time, as we previously had to track down the owners manually, whereas with Data Hub the owners are tagged on the datasets directly in the tool.
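To make the GraphQL usage above more concrete, here is a minimal Python sketch that builds a request for one dataset's owners and tags and flattens the response. It assumes the GraphQL endpoint is served at `/api/graphql` and accepts a bearer token. The host name, token, and URN are placeholders, the field names reflect my understanding of Data Hub's schema, and structured properties are left out, so verify the query against your own deployment before relying on it.

```python
import json
import urllib.request

# Query for owners and tags of one dataset (field names are an assumption
# to be checked against the GraphQL schema of your Data Hub version).
DATASET_QUERY = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    urn
    ownership { owners { owner {
      ... on CorpUser { urn }
      ... on CorpGroup { urn }
    } } }
    tags { tags { tag { urn } } }
  }
}
"""


def build_request(base_url, token, urn):
    """Build (but do not send) the HTTP POST for the dataset query."""
    body = json.dumps({"query": DATASET_QUERY, "variables": {"urn": urn}})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/graphql",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def summarize(response):
    """Flatten a decoded GraphQL response into owner and tag URN lists."""
    dataset = (response.get("data") or {}).get("dataset") or {}
    owners = (dataset.get("ownership") or {}).get("owners") or []
    tags = (dataset.get("tags") or {}).get("tags") or []
    return {
        "urn": dataset.get("urn"),
        "owners": [o["owner"]["urn"] for o in owners],
        "tags": [t["tag"]["urn"] for t in tags],
    }


# Placeholder values for illustration only.
URN = "urn:li:dataset:(urn:li:dataPlatform:bigquery,shop.sales.orders,PROD)"
request = build_request("https://datahub.example.com/", "PLACEHOLDER_TOKEN", URN)

# To send it: json.load(urllib.request.urlopen(request))
sample_response = {
    "data": {
        "dataset": {
            "urn": URN,
            "ownership": {"owners": [{"owner": {"urn": "urn:li:corpuser:jdoe"}}]},
            "tags": {"tags": [{"tag": {"urn": "urn:li:tag:PII"}}]},
        }
    }
}
summary = summarize(sample_response)
```

A flattened summary like this is the kind of input that could be handed to a PII classifier, with the dataset's tags telling it what has already been labeled.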
What needs improvement?
For improvements to Data Hub, I feel the security is a bit on the weaker side. We have ingestion jobs that require exact permissions for different owners, but this setup does not align with my company's grouping system, so we need to create custom groups to manage those permissions. I would appreciate a way to consolidate all of this information on a single page, which would simplify sharing permissions for running ingestion jobs.
Additionally, I feel that the metadata test we run daily takes too long. At present a run takes a full day, which I find excessive. Ideally, we should get the results within one hour. These are the two main issues that would benefit from improvement for our use case.
For how long have I used the solution?
I have been using Data Hub for one and a half years.
What do I think about the stability of the solution?
Data Hub is stable in my experience. When we upgrade it, it may go down for a couple of minutes, but not more than that.
What do I think about the scalability of the solution?
Data Hub handles scalability effectively, accommodating growing data and users.
How are customer service and support?
I have had to reach out to Data Hub customer support multiple times. For example, when we were setting up a private link to connect to the Data Hub GraphQL APIs, we needed our account to be whitelisted. I have also submitted feature requests for our use cases. For instance, when working on a metadata test scenario, I needed a date range column, which was not available. I asked the Data Hub team to make it public so we could use it.
What was our ROI?
I have seen a return on investment with Data Hub. For instance, I have noticed time savings during incidents and while looking up schemas. In terms of resources, Data Hub centralizes data cataloging and classification, saving us from having to disclose PII column information to teams not utilizing it. Regarding financial metrics, I do not have specific metrics available.
Which other solutions did I evaluate?
Before choosing Data Hub, we looked into Unity Catalog from Databricks, but we ultimately decided to stick with Data Hub.
What other advice do I have?
My advice for others looking into using Data Hub is to use it for cataloging, classification, and centralizing all your schema. Data Hub supports a variety of connectors and has excellent lineage options. Additionally, make sure to utilize the well-written documentation that can guide you in building your product solutions. I would rate this product a nine out of ten.
Centralized data library has boosted discovery, collaboration, and time savings across teams
What is our primary use case?
My main use case for Data Hub is that we use it as a library for all the data assets that we generate. It serves as an internal data mart where people can search for whatever data they need, and they can search by tags, by roles, and then add more metadata to it. This provides visibility to the data.
A specific example of how my team uses Data Hub in a real-world scenario is that we collect and transform data across many layers. Because we have huge teams, data that one team has already prepared can be hard for others to find with traditional systems. Data Hub acts as a search engine for all of the data. One example is when the marketing team was looking for specific marketing data. Once they searched for it on Data Hub, it was easy to find. They did not have to retrieve it from the raw layer and transform it for their usage because another team had already built it.
Regarding how my teams interact with Data Hub, we use Data Hub with a self-hosted system. We have connectors which look into multiple data sources, manipulation engines, and orchestration layers to gather the metadata, and then that is pulled into Data Hub. This is how we get data assets in Data Hub.
What is most valuable?
The best features that Data Hub offers are primarily data discovery and data governance. Data Hub has a data catalog, which helps with the business glossary, ownership tracking, and lineage. Lineage is something that we rely on heavily at this point. It helps us with impact analysis, such as understanding what breaks if I change a given column. Data Hub also provides data observability, helping us understand which data is fresh, which is not, and which schemas have changed recently. Additionally, it makes our system ready for AI and LLM use.
The lineage feature has significantly changed the way my team works and collaborates. Because we now have data lineage through Data Hub, if we have a very large pipeline with multiple layers of upstream and downstream dependencies and something breaks, we can pinpoint exactly which data assets would be affected. That lineage functionality helps us narrow down what needs to be debugged and fixed and which exact part is breaking. It saves us time in remedying the issue.
I really like the integrations that Data Hub provides. Data Hub has a very large set of integrations, including Snowflake, Databricks, BigQuery, Redshift, dbt, and Airflow.
Data Hub has positively impacted my organization as teams can now depend directly on one source of truth for all their data needs. The time spent finding information has become significantly smaller, which is the real productivity improvement that I have seen, and it affects multiple teams throughout the organization. I estimate that we save about thirty to forty percent of that time now since we do not have to read documents or message people to find specific data assets.
What needs improvement?
I think Data Hub can be improved by supporting the open source version better. Many features have moved to the paid version now, making it difficult for small-scale companies to operate on Data Hub because we are required to pay, even though it started as an open source project that is now essentially behind a paywall.
One needed improvement for Data Hub would be stronger AI-powered metadata discovery. I understand Data Hub has been investing in AI, but the natural language handling in Data Hub search is not that good, and the search results are often inaccurate. Another improvement could be enhancing the dbt developer experience, such as surfacing dbt test failures directly in lineage. Additionally, it would be beneficial if it could provide some kind of risk score when we change a schema. Lastly, automated cleanup recommendations would help because managing orphaned data assets on Data Hub currently takes a lot of manual time.
For how long have I used the solution?
I have been using Data Hub for a year.
What do I think about the stability of the solution?
Data Hub is pretty stable in my experience with no downtime or issues.
What do I think about the scalability of the solution?
Data Hub's scalability has been effective, handling our organization's growth and data volume well.
How are customer service and support?
I have not had to reach out to customer support.
Which solution did I use previously and why did I switch?
I did not use a different solution before Data Hub.
What's my experience with pricing, setup cost, and licensing?
My experience with pricing, setup cost, and licensing has been pleasant, and I have no complaints.
Which other solutions did I evaluate?
Before choosing Data Hub, we evaluated Atlan and decided on Data Hub because it has a cleaner UI and also a decent open source community to support it.
What other advice do I have?
Data Hub does most of the job it is designed to do, but there could still be improvement as the industry progresses, particularly around metadata discovery. Regarding Data Hub's AI capabilities, its governance and security do the job really well as of right now. I do not have any complaints, especially around data classification, as it allows us to have control over whatever data we are displaying, including customization for PII, sensitive, and financial data. Data Hub has met our expectations regarding its accuracy and reliability of output, and there have not been any issues.
My advice to others looking into using Data Hub is that it is a pretty nice product right now with easy integration. The pricing can be negotiated, so it is worth keeping that in mind. I would rate Data Hub a solid eight on a scale of one to ten.