Alation Data Catalog API integration has mostly been smooth. We have seen issues with the LLM API pieces for lineage, but there is no clarity around what the costs are, how many requests are involved, and what constitutes a request to this API. One issue we found previously, which I believe is now solved, was using the same refresh token in multiple processes that could run in parallel. The refresh token creates an API token, and that API token is then invalidated when another process uses the same refresh token to create its own. We are now starting to use service accounts provided by Alation Data Catalog, with a separate service account for each process, so this no longer happens.
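The race described above can be sketched with a small simulation. This is a hypothetical model, not Alation's actual API (which mints API access tokens from a refresh token over HTTP); the `TokenService` class and token strings here are illustrative assumptions showing why a shared refresh token breaks parallel processes and why one service account per process avoids it.

```python
import itertools

class TokenService:
    """Hypothetical model of a token service where minting a new API token
    from a refresh token invalidates that refresh token's previous API token."""
    def __init__(self):
        self._counter = itertools.count(1)
        self._active = {}  # refresh_token -> currently valid API token

    def create_api_token(self, refresh_token):
        token = f"api-token-{next(self._counter)}"
        self._active[refresh_token] = token  # prior token for this refresh token is now invalid
        return token

    def is_valid(self, refresh_token, api_token):
        return self._active.get(refresh_token) == api_token

svc = TokenService()

# Two parallel processes sharing one refresh token: the second mint
# silently invalidates the first process's API token mid-run.
shared = "refresh-shared"
tok_a = svc.create_api_token(shared)
tok_b = svc.create_api_token(shared)
assert not svc.is_valid(shared, tok_a)  # process A's calls now fail
assert svc.is_valid(shared, tok_b)

# One service account (hence one refresh token) per process: each
# process keeps a valid token regardless of what the other does.
tok_a = svc.create_api_token("refresh-process-a")
tok_b = svc.create_api_token("refresh-process-b")
assert svc.is_valid("refresh-process-a", tok_a)
assert svc.is_valid("refresh-process-b", tok_b)
```

The fix is simply to remove the shared mutable state: each process owns its own refresh token, so no process can invalidate another's API token.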
The current search functionality in Alation Data Catalog is not great. It does not do natural language search as we would prefer; it mostly searches titles, while we would prefer it to also search descriptions and some of the source comments to find the answers we need. Additionally, we would like to be able to search common queries, or the queries used most frequently for certain areas, so people can get ideas for which queries to use when they want to find a specific metric.
We have been able to find downstream impacts more easily with Alation Data Catalog. Our machine learning team uses it quite often to find certain data. Our analytics team has yet to adopt it as much as we would prefer because it lacks easy ways to search for and find usable or endorsed queries. They want to figure out how to use the tables and join them to find other metrics, which is difficult in the current state, though I believe there are improvements coming. The other aspect is holding data producers accountable and being able to see who owns tables. Currently, that is a manual process, but we are building an automated process to add owners to tables. If anybody has a question about a table that its description does not answer, we can then reach the owner of the data through the table page in Alation Data Catalog.
Anecdotally, we do not have as many P1 incidents anymore. Previously, a change might happen without people knowing about its downstream impacts, which caused a lot of issues. Now it is easier to mitigate P1 incidents, or to avoid them in the first place.
The search feature of Alation Data Catalog could be improved. Alation Data Catalog Compose is also limited in that we cannot search for queries, or see unpublished queries on the table page, unless we go to the query history. We do not allow Compose on many items right now due to information security: our security requirements do not allow Alation Data Catalog to access the underlying connections because we do not want people to pull in data. From a security standpoint that is an issue, and we would like workarounds in certain cases. The other issue we have found recently is along the same security lines. We do not want to automatically sample tables in Alation Data Catalog, because there could be some issue, or because we do not want that data stored on a different server. However, if a schema is enabled for sampling, then any new table in it automatically gets enabled for sampling as well. We have had to work around this by figuring out the correct permissions and setup on the Databricks side to prevent Alation Data Catalog from sampling certain tables, because it is not feasible to do that on the Alation Data Catalog side.