BioNTech Accelerates Data Processing for Proteomics Workflows by 500x Using AWS
Learn how BioNTech used parallelized workflows on AWS to process mass spectrometry data up to 500 times faster.
Headquartered in Germany, BioNTech is a global company that develops immunotherapies and vaccines for cancer and infectious diseases, including the Pfizer-BioNTech COVID-19 vaccine. Mass spectrometry (MS) is a powerful technology for direct identification of peptides bound to human leukocyte antigen (HLA) molecules from patient-derived tumor tissue or cell lines. These HLA immunopeptidomes can be interrogated as a source for antigen discovery for cell-based therapies and used to train machine learning models to inform vaccine development.
BioNTech wanted to make its workflows for storing, organizing, and processing terabytes of MS data more efficient and scalable. It decided to migrate its on-premises MS software and data storage to Amazon Web Services (AWS) for scalable, secure data handling. Now, BioNTech has accelerated its time to insights and made it simpler for researchers to share and collaborate on MS data using AWS Storage Gateway, a service that provides on-premises applications with access to virtually unlimited cloud storage.
Opportunity | Using AWS Storage Gateway to Further Streamline and Accelerate the Processing of BioNTech’s Mass Spectrometry Data
Mass spectrometry is a powerful methodology for immunopeptidomics because it can detect and identify thousands of unique HLA-bound peptides in a single analysis of clinically relevant tissues and cell lines. The raw data set produced in a single acquisition is a large collection of spectra that can be searched against a reference proteome database to yield peptide and protein identifications. In proteomics and immunopeptidomics workflows, software packages such as Spectrum Mill MS Proteomics Software are vital components in processing and analyzing the large volumes of MS data that are routinely collected.
Until 2022, the company ran this software on local servers. Scientists had to move data manually from instrument computers to local workstations running Spectrum Mill, and these devices would fill up quickly, requiring additional steps to archive the data. “Our total data was easily 10–15 terabytes, and moving it to the on-premises device was time consuming and challenging,” says Akhil Chaudhary, data engineer at BioNTech. “As our research activities were growing, our MS data collection was also significantly increasing,” says Michael McCarthy, solutions architect at BioNTech. “The local hardware could no longer support our scale.”
To accelerate data processing and access to the interpreted results, BioNTech’s computational biology team needed a way to process hundreds of requests simultaneously with different search parameters and protein sequence databases as part of their effort to maximize the peptide and protein information for novel discoveries. The department approached the BioNData team—a central data and analytics group within the company—to build tools to scale the data processing capabilities horizontally. The team chose AWS to build a hybrid lab data model and create horizontally scaling APIs. “In the US, we have a long history of using AWS successfully in products,” says McCarthy. “It was the natural choice.”
Solution | Massively Accelerating Data Processing Using Parallelized Workflows
In the first phase, BioNTech focused on moving data seamlessly from the MS instrument computers to the cloud and hosting Spectrum Mill on AWS. The second phase involved building a system for running the search requests simultaneously.
To move the MS raw data to the cloud, BioNTech installed the AWS Storage Gateway agent on every instrument computer. Following acquisition, MS raw data is quickly and automatically moved to Amazon Simple Storage Service (Amazon S3), an object storage service built to retrieve any amount of data from anywhere. “The speed is extremely fast. A file of 5 GB takes only 5–10 seconds to appear on Amazon S3,” says Chaudhary. With multiple instruments generating large data sets, this MS data pipeline moves the data efficiently to a central location, where it is readily accessible for processing and archiving.
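As a rough check on the figure quoted above, the transfer rate implied by a 5 GB file appearing in 5–10 seconds can be worked out directly. The sketch below assumes decimal gigabytes and treats the quoted time as pure transfer time, which is a simplification; the function name is illustrative.

```python
def implied_throughput(size_gb: float, seconds: float) -> tuple[float, float]:
    """Return (GB/s, Gbit/s) for a file of size_gb that takes `seconds` to arrive."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    gb_per_s = size_gb / seconds
    return gb_per_s, gb_per_s * 8  # 8 bits per byte


# Figures quoted in the text: a 5 GB file in 5-10 seconds.
slow = implied_throughput(5, 10)
fast = implied_throughput(5, 5)
print(f"{slow[0]:.1f}-{fast[0]:.1f} GB/s, about {slow[1]:.0f}-{fast[1]:.0f} Gbit/s")
# prints "0.5-1.0 GB/s, about 4-8 Gbit/s"
```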
BioNTech’s computational biology team quickly adopted the new workflow. “Everyone’s using the cloud-based system, and the researchers find it much simpler,” says McCarthy. “We automate data management in AWS, letting scientists focus on the science.”
Next, the team installed Spectrum Mill on Amazon Elastic Compute Cloud (Amazon EC2), which provides secure and resizable compute capacity for virtually any workload. “By running Spectrum Mill on the cloud, we cut individual search times by 50–75 percent,” says Chaudhary. In addition, BioNTech runs Amazon EC2 Spot Instances, which can run fault-tolerant workloads at discounts of up to 90 percent compared with On-Demand prices. Because the company pays only for the time it’s using the instances, it has reduced compute costs significantly.
To scale the number of workflows it can run at a time, the team uses Amazon Machine Images, which provide the information required to launch an instance, and Amazon EC2 Auto Scaling, which can add or remove compute capacity to meet changing demand. “Now, we run our searches 50–75 percent faster, and with Amazon EC2 Auto Scaling, we can run hundreds of instances in parallel, massively accelerating data processing up to 500 times,” says McCarthy.
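The figures in this paragraph combine two separate effects: each search finishes faster, and many searches run at once. The following back-of-the-envelope sketch shows how the two multiply, assuming ideal linear scaling with no queueing or instance startup overhead. The instance counts are illustrative and are not BioNTech's actual fleet size.

```python
def per_search_speedup(time_reduction: float) -> float:
    """Speedup factor from a fractional cut in run time (0.75 -> 4x)."""
    if not 0 <= time_reduction < 1:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1 / (1 - time_reduction)


def overall_speedup(time_reduction: float, instances: int) -> float:
    """Throughput gain versus one machine running searches one at a time."""
    return per_search_speedup(time_reduction) * instances


# A 75 percent cut in search time is a 4x speedup; 125 parallel instances give 500x.
print(overall_speedup(0.75, 125))  # 500.0
# A 50 percent cut is a 2x speedup; 250 parallel instances give the same 500x.
print(overall_speedup(0.50, 250))  # 500.0
```

Under these assumptions, "up to 500 times" corresponds to somewhere between roughly 125 and 250 instances running in parallel, which is consistent with the "hundreds of instances" mentioned in the quote.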
BioNTech manages Spectrum Mill workflows using Amazon Simple Queue Service (Amazon SQS), a fully managed message queuing service. The company uses Amazon API Gateway, a service for creating, maintaining, and securing APIs at any scale, to execute Spectrum Mill searches. It then pulls the data from a data warehouse on Amazon Redshift, which offers excellent price performance for cloud data warehousing. The scientific teams use these datasets to identify therapeutic targets and build artificial intelligence algorithms for vaccine design.
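The queue-driven pattern described here can be illustrated without any AWS dependencies. The sketch below is a local stand-in and not BioNTech's implementation: Python's queue.Queue plays the role of Amazon SQS, a small pool of threads plays the role of the Amazon EC2 fleet, and run_search is a hypothetical placeholder for a Spectrum Mill search. The request fields (raw_file, database, enzyme) and their values are likewise assumptions chosen for illustration.

```python
import queue
import threading
from itertools import product


def run_search(request: dict) -> dict:
    """Hypothetical placeholder for one search job; a real worker would invoke the search software."""
    return {**request, "status": "done"}


def worker(jobs: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Pull requests until the queue is empty, like an instance polling a message queue."""
    while True:
        try:
            request = jobs.get_nowait()
        except queue.Empty:
            return
        result = run_search(request)
        with lock:  # protect the shared results list
            results.append(result)
        jobs.task_done()


def process_all(requests: list, n_workers: int = 4) -> list:
    """Enqueue every request, then drain the queue with n_workers parallel workers."""
    jobs: queue.Queue = queue.Queue()
    for request in requests:
        jobs.put(request)
    results: list = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# One request per combination of raw file, sequence database, and search parameter.
search_requests = [
    {"raw_file": f, "database": db, "enzyme": enz}
    for f, db, enz in product(
        ["run1.raw", "run2.raw"],
        ["human_ref", "human_plus_variants"],
        ["none", "trypsin"],
    )
]
results = process_all(search_requests)
print(len(results), "searches completed")  # prints "8 searches completed"
```

Decoupling submission from execution this way means the number of workers can change without touching the code that submits requests. In a cloud deployment, queue depth is one signal that a scaling policy could use to add or remove workers.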
The team connects processed results with data consumers across the company with data.all, an open-source tool for sharing datasets across AWS accounts. As a result, researchers no longer need to spend time on data management. “On AWS, our scientists are generating and sharing exponentially more data with the aim of finding effective, targeted, and personalized therapies for patients,” says McCarthy.
Outcome | Expanding Speed and Scalability to More Workflows
BioNTech has quickly seen the benefits of its new workflows on AWS. “We could redo all the work from the past 7 years in 60 hours for a fraction of the price,” says Chaudhary. In its next phase, the team is looking to improve and automate mass spectrometry analysis tools to lower the false discovery rate of peptides. It’s also creating a graphical wrapper around its API so that all teams at BioNTech can benefit from the API in their day-to-day workflows.
“The Spectrum Mill project is just the first of many we’re planning,” says McCarthy. “This project inspired confidence that we can solve similar problems for our global teams. It’s really the imagination that limits you, and I haven’t yet found something that I couldn’t build in AWS.”
BioNTech is a global immunotherapy research and development company that creates and manufactures active immunotherapies and performs clinical trials of treatments and vaccines for cancer and other diseases.
AWS Services Used
AWS Storage Gateway
AWS Storage Gateway is a set of hybrid cloud storage services that provide on-premises access to virtually unlimited cloud storage.
Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.
Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.
Amazon SQS
Amazon Simple Queue Service (Amazon SQS) lets you send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be available.