AWS and Data Minded Power Data Lake for Belgian Unicorn Collibra
Collibra, a data intelligence company, enabled its employees to collaborate around data and build their own data products using Amazon Web Services (AWS). AWS Partner Data Minded deployed its Datafy service to help Collibra build a self-service data infrastructure based on a range of AWS components.
AWS Powers Belgian Unicorn Collibra’s Data Lake
As Belgium’s first billion-dollar unicorn scale-up, Collibra delivers an end-to-end, integrated data intelligence platform that’s purpose-built to automate data workflows and deliver data insights to users. The Collibra Data Intelligence Cloud is the system of record for data, helping enterprises to gain visibility into their data, collaborate, and generate actionable insights.
Founded in 2008, the company employs more than 650 people worldwide and is ranked number 32 in the Forbes 2020 Cloud 100 list of the top private cloud companies.
“Drinking Our Own Champagne”
Cloud services for data have progressed so far in recent years that modern data infrastructures are almost inevitably cloud-native. That makes it the perfect architecture for Collibra’s data lake, according to Stijn Christiaens, Co-Founder and CTO of Collibra.
Collibra worked with Data Minded, a Belgian data consultancy and AWS Partner, to bring together a number of AWS services to centralize its data consumption and accelerate the creation of a data lake. This provided multiple stakeholders within Collibra, including product, sales, and marketing, with the ability to collaborate around the data.
“We said, ‘Let’s drink our own champagne!’ How do we get meaning out of data? How do we get value out of data? How do we ensure all Collibra data consumers are using the same, trusted data?” explains Christiaens, who also leads the company’s data office.
A Plethora of Use Cases
To tap into the value of its data, Collibra’s employees can now use the new data infrastructure to build internal data products, such as dashboards of information around service usage or financial metrics.
Interviews with stakeholders across the business surfaced a broad set of use cases. For example, the product team was interested in gaining new insights into how customers use Collibra software, to inform improvements to the product. Data scientists in the same team were keen on various machine learning (ML) use cases, such as how to improve search results and recommendations. Marketing, meanwhile, wanted to dig into pipeline metrics and use ML to provide more accurate forecasts for business performance.
Creating a Data Product
In September 2019, Data Minded created a self-service data infrastructure for Collibra. Data Minded’s Datafy service brought together several AWS components to provide a repository of production-ready data for building in-house data products.
The data lake ingests raw data from a range of source systems, such as Salesforce and Marketo, which is then distributed into raw and refined zones of Amazon Simple Storage Service (Amazon S3) buckets. Datafy executes data pipelines that load data into Amazon Redshift for further consumption.
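The raw and refined zones described above can be pictured with a small sketch. It is illustrative only: the key layout, the dataset names, and the cleaning rules are assumptions made for this example, and the article does not describe how Datafy actually organizes or cleans the data.

```python
from datetime import date

# Hypothetical zone names, taken from the "raw" and "refined" zones in the text.
ZONES = ("raw", "refined")


def zone_key(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key such as raw/salesforce/accounts/2021-02-01/part-0.json.

    The layout is an assumption for illustration, not Collibra's real layout.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{dataset}/{day.isoformat()}/{filename}"


def refine(records: list[dict]) -> list[dict]:
    """Toy cleaning step: trim strings, drop rows without an id, de-duplicate by id."""
    seen = set()
    cleaned = []
    for record in records:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        if not row.get("id") or row["id"] in seen:
            continue
        seen.add(row["id"])
        cleaned.append(row)
    return cleaned


# Made-up sample rows standing in for an ingested source extract.
raw_rows = [
    {"id": " 001 ", "name": "Acme "},
    {"id": "001", "name": "Acme"},
    {"id": "", "name": "No id"},
]
day = date(2021, 2, 1)
raw_key = zone_key("raw", "salesforce", "accounts", day, "part-0.json")
refined_key = zone_key("refined", "salesforce", "accounts", day, "part-0.json")
refined_rows = refine(raw_rows)
```

In a real pipeline the refined rows would be written back to Amazon S3 and loaded into Amazon Redshift; those steps are left out here to keep the sketch self-contained.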
The service uses Amazon EC2 Spot Instances to provide the compute needed to run the data pipelines. Jobs are based on Python or Spark scripts and run on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. Business analysts can also perform analytics engineering in Amazon Redshift with the command-line tool dbt.
“Datafy takes care of the data infrastructure, making sure the data is in the right place and the lights stay on. When people need insights, they can focus on the data product without having to worry about any of the underlying data infrastructure,” explains Christiaens.
When a Collibra employee wants to build a data product, they use Collibra’s Data Intelligence Cloud to shop for data in the data lake. The Collibra Data Catalog provides insight into available data sets: what data they contain, where they came from, who owns them, and whether they’ve been certified. Upon "check-out," the platform’s data governance capabilities automate the coordination of approvals (for example, by the data owner) and access. Once access is approved, Collibra connects with the user-authentication service Okta and with Amazon Redshift so employees can sign in through single sign-on and start consuming the data they need.
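The shop, check-out, approve, and access flow can be sketched as a small state model. This is a simplified illustration, not Collibra’s API: the class names, fields, and the rule that only the data owner may decide are assumptions based on the example in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DataSet:
    # Catalog attributes named in the text: owner and certification status.
    name: str
    owner: str
    certified: bool


@dataclass
class AccessRequest:
    # Created at "check-out"; hypothetical structure for illustration only.
    dataset: DataSet
    requester: str
    status: Status = Status.REQUESTED

    def decide(self, approver: str, approve: bool) -> None:
        """Record the owner's decision; anyone else is refused."""
        if approver != self.dataset.owner:
            raise PermissionError(f"{approver} does not own {self.dataset.name}")
        if self.status is not Status.REQUESTED:
            raise ValueError("request has already been decided")
        self.status = Status.APPROVED if approve else Status.REJECTED


def grants(requests: list[AccessRequest]) -> list[tuple[str, str]]:
    """Return (user, data set) pairs that an access layer would provision."""
    return [(r.requester, r.dataset.name) for r in requests if r.status is Status.APPROVED]


# Made-up users and data sets.
usage = DataSet(name="product_usage", owner="alice", certified=True)
forecast = DataSet(name="sales_forecast", owner="carol", certified=False)

req_ok = AccessRequest(dataset=usage, requester="bob")
req_no = AccessRequest(dataset=forecast, requester="bob")
req_ok.decide(approver="alice", approve=True)
req_no.decide(approver="carol", approve=False)

approved = grants([req_ok, req_no])
```

In the setup the article describes, the approved pairs would be handed to Okta and Amazon Redshift; here they are simply returned as a list.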
“We are serious about connecting our data intelligence service as deep as we can into our data infrastructure,” says Christiaens.
Datafy runs around 20,000 internal data-processing jobs per month for Collibra, spread over 42 data pipelines in six development environments and one production environment. The jobs range from ingestion and data-cleaning jobs to feature-engineering and machine-learning pipelines.
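A quick back-of-the-envelope calculation puts these figures in perspective. It assumes a 30-day month and an even spread of jobs across pipelines, neither of which the article states, so the results are rough averages only.

```python
JOBS_PER_MONTH = 20_000   # "around 20,000" in the text
PIPELINES = 42            # from the text
DAYS_PER_MONTH = 30       # assumption: a 30-day month

# Averages only; real pipelines and environments will differ widely.
jobs_per_pipeline = JOBS_PER_MONTH / PIPELINES
jobs_per_day = JOBS_PER_MONTH / DAYS_PER_MONTH
runs_per_pipeline_per_day = jobs_per_pipeline / DAYS_PER_MONTH

print(f"jobs per pipeline per month: {jobs_per_pipeline:.0f}")
print(f"jobs per day overall: {jobs_per_day:.0f}")
print(f"jobs per pipeline per day: {runs_per_pipeline_per_day:.1f}")
```

On these assumptions, each pipeline accounts for roughly 476 jobs a month, or about 16 a day, counted across all seven environments.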
Collaboration Across the Business
Data Minded’s engineers brought technical expertise that complemented Collibra’s own teams. Christiaens was also impressed that the partner’s engineers were equally comfortable interacting with different parts of Collibra’s business.
Data Minded works with several Collibra teams on field operations, product analytics, and machine learning. Collibra currently has 20 data products in production, and it plans to reuse data products both internally and externally.
The fact that Data Minded’s engineers are AWS-certified is another benefit for Collibra, one that is particularly important given the rapid release of new AWS products and services.
The end result of the collaboration between Data Minded and Collibra is a trusted, easily accessible data lake. “It’s a self-service, production-ready, enterprise-grade infrastructure that people can rely on for their data work,” says Christiaens.
About Data Minded
Data consultancy Data Minded combines deep data engineering skills with years of experience in diverse industries. It offers consulting, training, and managed services in data collection, data analysis, machine learning, and AI.
Published February 2021