Data Minded and BioStrand Rely on AWS Big Data for Genomic Insights

Executive Summary

BioStrand is a Belgian startup bringing a unique analytical approach to genomic data. Partnering with Data Minded, the company uses Amazon Web Services (AWS) to combine its insights into DNA, RNA, and proteins with a pipelined approach to massively scalable analytics, promising a new way of understanding and using sequenced genomes.

Much of modern medicine draws on genomics, the study of genomes. Because any genome can be expressed as pure data, data analysis is at the heart of genomics.

BioStrand, a Belgian startup, discovered and is commercializing a technique for indexing cellular blueprints and building blocks, bringing an entirely new way to handle the large data sets of genomic science efficiently. This technique promises not only scalability but also new insights into genome functionality.

Founded in February 2019, BioStrand is run by husband-and-wife team Dr. Dirk Van Hyfte and Dr. Ingrid Brands. “It started when I was doing some data mining and experimental analytics in an AWS cluster I ran from home,” says Van Hyfte. “The results were interesting and we saw the potential for scientific and commercial exploitation, so BioStrand came into being. We knew Data Minded, as we’d worked together before, so we reached out to get the necessary technical help to turn our ideas into something that could be positioned on the market.”

Scaling for Millions of Code Sequences

BioStrand’s technology searches coding and non-coding sequences to find patterns it calls HYFTs that are specific for biological structures and functions. The company has found and indexed more than half a billion such patterns and can look for alignment, similarities, and differences in sequences, all based on indexing principles rather than letter-by-letter sequence matching.

One early challenge was that the existing processes didn’t scale, a challenge that BioStrand and AWS Partner Data Minded overcame with a container-based architecture. Built with Amazon Elastic Container Service (Amazon ECS), the architecture scales automatically as needed, improving business efficiency. It is most efficient when running continuously, because the data structures needed for effective indexing are large and dynamic and take time to build.
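
To make the contrast with letter-by-letter matching concrete, here is a minimal sketch of index-based comparison. It is not BioStrand’s HYFT method, which this article does not describe. The fixed-length substrings used as patterns, the `build_index` and `shared_patterns` helpers, and the toy sequences are all assumptions made purely for illustration.

```python
from collections import defaultdict

K = 4  # assumed pattern length; how HYFT patterns are defined is not stated here


def patterns(seq: str, k: int = K) -> set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(seqs: dict[str, str]) -> dict[str, set[str]]:
    """Map each pattern to the IDs of the sequences that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for seq_id, seq in seqs.items():
        for p in patterns(seq):
            index[p].add(seq_id)
    return index


def shared_patterns(index: dict[str, set[str]], query: str) -> dict[str, int]:
    """Count, per indexed sequence, how many of the query's patterns it shares."""
    hits: dict[str, int] = defaultdict(int)
    for p in patterns(query):
        for seq_id in index.get(p, ()):
            hits[seq_id] += 1
    return dict(hits)


# Toy data, invented for this example.
sequences = {"seq_a": "ACGTACGTGA", "seq_b": "TTTTGGGGCC"}
index = build_index(sequences)
result = shared_patterns(index, "CGTACG")
print(result)
```

Once the index exists, a query costs one lookup per pattern instead of a scan over every stored sequence, which is why building and keeping the index is the expensive part.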

Says Van Hyfte, “It’s the equivalent of looking at the internet not letter by letter, but as words and the relationship between those words. This reveals a lot of new information, including novel 3D patterns, which informs not only genome function but also the relationships between different genomes and their functional commonalities.”

Genomics data is huge and messy. Data Minded has built a platform that is expected to eventually index hundreds of petabytes of data, handling the normalization, storage, analysis, cross-comparison, and presentation of the data economically and at scale. The service is based on highly parallelized indexing. BioStrand has made datasets available in multiple formats. In some, sequences are already identified and indexed.

Using AWS, BioStrand can index a million sequences with an average length of 320 characters in about three minutes, matching against 660 million HYFT patterns. This is only one part of the ingest, normalize, index, and load pipeline, whose overall performance depends on the data and its format.
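
As a rough back-of-the-envelope check, the quoted figures can be turned into throughput rates. The only assumption in this sketch is that “about three minutes” is taken as exactly 180 seconds.

```python
sequences = 1_000_000   # sequences indexed, from the article
avg_length = 320        # average characters per sequence, from the article
elapsed_s = 3 * 60      # "about three minutes", assumed to be exactly 180 s

total_chars = sequences * avg_length
seqs_per_second = sequences / elapsed_s
chars_per_second = total_chars / elapsed_s

print(f"{total_chars:,} characters indexed")
print(f"{seqs_per_second:,.0f} sequences per second")
print(f"{chars_per_second:,.0f} characters per second")
```

That works out to roughly 5,600 sequences, or about 1.8 million characters, per second for the indexing step alone.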

Starting with a small set of tools and public datasets, BioStrand is building out a self-service SaaS platform. Eventually, it will incorporate its own and customers’ machine learning and AI tools and let customers work securely on their own and third-party datasets, all with the unique insight that the company describes as moving from the syntax to the semiotics of genomes.

Towards a Universal Standard

The genomics industry poses unique barriers to entry, incorporating data and techniques from thousands of projects going back decades. The sector has also gained new impetus from the COVID-19 pandemic: with an estimated size of $17.2B in 2019 and growth of around eight percent per year, the genomics market is one of the few to have been boosted by it.

Data Minded established a data lake approach, using microservices to build a parallelized pipeline for ingestion, normalization, indexing, and storage, with autoscaling provided by Amazon ECS.

This is managed using AWS Step Functions in combination with AWS Batch, with messaging handled by Amazon Simple Queue Service (Amazon SQS). The data is housed in Amazon Simple Storage Service (Amazon S3). The SaaS platform is created using AWS Fargate, Amazon Elastic Compute Cloud (Amazon EC2), AWS App Mesh, and gRPC. It’s a classic compute task of applying algorithms to data, where efficiency and capability come to the fore as business requirements.
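
The article names the services but not how they are wired together. As one possible reading, the sketch below builds an AWS Step Functions state machine definition (in Amazon States Language) that runs three AWS Batch jobs in sequence and then sends a completion message to Amazon SQS. The stage names, job definitions, job queue, and queue URL are hypothetical placeholders, and the real pipeline may be structured quite differently.

```python
import json

STAGES = ["ingest", "normalize", "index"]  # hypothetical stage names
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/pipeline-events"  # placeholder


def batch_state(stage: str, next_state: str) -> dict:
    """Build a Task state that submits an AWS Batch job and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": f"{stage}-job",
            "JobDefinition": f"{stage}-job-definition",  # placeholder name
            "JobQueue": "genomics-job-queue",            # placeholder name
        },
        "Next": next_state,
    }


def build_definition() -> dict:
    """Chain the Batch stages, then notify an SQS queue as the final state."""
    names = [stage.capitalize() for stage in STAGES]
    states = {}
    for i, stage in enumerate(STAGES):
        next_state = names[i + 1] if i + 1 < len(names) else "NotifyDone"
        states[names[i]] = batch_state(stage, next_state)
    states["NotifyDone"] = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {"QueueUrl": QUEUE_URL, "MessageBody": "pipeline complete"},
        "End": True,
    }
    return {"StartAt": names[0], "States": states}


definition = build_definition()
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document that could be supplied as a state machine definition; nothing here calls AWS, so it can be inspected locally.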

Genomic data formats are varied. Normalizing to a common format prior to indexing is a major part of the early work, but one that has the potential to establish new industry norms. To that end, Data Minded is developing what it hopes will become a new universal schema for storing genomic data and metadata for use across multiple platforms, techniques, and tools.
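
The schema itself is not published in this article, so the following is only a sketch of what normalizing to a common record might look like. It parses FASTA text, a widely used sequence format, into a hypothetical `SequenceRecord`. The field names and the `parse_fasta` helper are assumptions for illustration, not Data Minded’s schema.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SequenceRecord:
    """Hypothetical normalized record; the field names are illustrative only."""
    identifier: str
    description: str
    sequence: str
    source_format: str
    metadata: dict[str, str] = field(default_factory=dict)


def _make_record(header: str, chunks: list[str]) -> SequenceRecord:
    """Split the header into ID and description, and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return SequenceRecord(identifier, description, "".join(chunks).upper(), "fasta")


def parse_fasta(text: str) -> Iterator[SequenceRecord]:
    """Yield one normalized record per FASTA entry."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Toy input, invented for this example.
fasta = """>seq1 example entry
acgt
TTGA
>seq2
GGCC
"""
records = list(parse_fasta(fasta))
for record in records:
    print(record)
```

A parser for each additional input format would emit the same record type, so that the indexing stage only ever has to handle one shape of data.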

Data Minded is currently evaluating different AWS components for efficiency and scalability in conjunction with feedback from early users at universities and in industry.

The Future of Genomics Analytics

The company has loaded a couple of terabytes of data into a database with 20 TB available, but once the platform has evolved into a full SaaS offering where customers upload their own data, that is expected to grow to between 100 and 200 petabytes. However, much of the data targeted by BioStrand is already in AWS, which eases on-ramping.

By using AWS, Data Minded can easily adopt new techniques such as AI and machine learning. This gives BioStrand options on its roadmap and builds a community of users evolving and sharing techniques based on such widely understood and available toolsets and standards.

BioStrand plans to expose an API in time, and AWS’s mature security and regulatory frameworks will encourage compliant, managed deployment at scale across different regions. Biotechnology may be cutting-edge science, but it still needs the same fundamental competence and range of robust options on which to build the cloud services all users expect.

About BioStrand

BioStrand is a Belgian startup that has created a unique and revolutionary cloud-based solution, enabling faster and more accurate research into genomics.

About Data Minded

Data consultancy Data Minded combines deep data engineering skills with years of experience in diverse industries. It offers consulting, training, and managed services in the field of data collection, data analysis, machine learning, and AI.

Published February 2021