AWS HPC Blog

Dataset of protein-ligand complexes now available in the Registry of Open Data on AWS

by Deva Priyakumar, Beryl Rabindran, Alex Iankoulski, Prathit Chatterjee, Rakesh Srivastava, Ramanathan Sethuraman, Vladimir Aladinskiy, and Yusong Wang on in High Performance Computing Permalink Share

Dataset of protein-ligand complexes now available in the Registry of Open Data on AWSThis post was contributed by U. Deva Priyakumar, Rakesh Srivatsava, Prathit Chatterjee, Vladimir Aladinskiy, Ramanathan Sethuraman, Yusong Wang, Alex Iankoulski, and Beryl Rabindran

Today, we’re excited to announce the release of a comprehensive dataset featuring molecular dynamics (MD) trajectories for over 16,000 protein-ligand complexes (PLCs). This dataset, now available on AWS as part of the Registry of Open Data on AWS, encapsulates time-resolved data on atomic and molecular interactions within each protein-ligand system.

This represents a valuable resource for research at the intersection of machine learning and structural biology. This dataset can be applied to addressing essential challenges in modern drug design, potentially aiding the discovery of new therapeutics.

Why is this needed?

The rapid advancement of computing technologies, particularly artificial intelligence (AI), has revolutionized various domains, including drug discovery. Curated datasets are crucial for developing reliable, generalizable, and accurate models for practical applications.

However, generating experimental data on a large scale is a cost-prohibitive and arduous process. On the other hand, simulations have proven extremely valuable in life sciences and help to compensate for the lack of experimental studies. To address this need, teams from IIIT-Hyderabad, Intel, AWS, and Insilico Medicine (AWS-IIITH-Intel-Insilico – AI3), collaborated to perform massive physics-based calculations on about 20,000 protein-ligand complexes (PLCs).

This generated dataset provides dynamic, data-enriched protein-ligand coordinates and can be instrumental for training machine-learning solutions, such as artificial neural networks, to predict various aspects of ligand-protein binding.

Available now

The AI3 dataset is now accessible to the public at no cost through the AWS Open Data Sponsorship Program. AWS users will not need to pay transfer fees or long-term storage costs to access this data, or to maintain a personal copy of this data. By democratizing access to the PLC data through this collaboration, we aim to inspire innovation in computational drug discovery, thereby transform health outcomes around the world.

What’s included

The AI3 dataset contains molecular dynamics snapshots and binding energy estimates, along with individual energy components, computed for 16,692 PLCs from RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank, 2023).

The code bases come from IIIT-H (for initial input preparations), AWS (for undertaking all-atomistic simulations in GROMACS simulations package, followed by MMPBSA calculations), and Insilico Medicine (for undertaking Alchemical calculations). They have been publicly released in a GitHub repository.

As a next step, the team plans to augment the dataset by adding 10 metastable intermediates for each original PLC – making 200,000 PLC conformations in total. We will write about this again when the work is complete.

How we generated this data

Simulation of a large number of PLCs on a molecular level requires vast computational resources. Using GROMACS, a single PLC can be simulated on one compute instance using either multiple GPUs or multiple Intel Xeon CPU cores.

Depending on the number of atoms in the PLC (within a range from a few thousand to 500,000) it takes between 2 and 12 hours to complete a single run. Simulating hundreds to hundreds of thousands of PLCs goes beyond the boundaries of a single compute instance. Individual PLCs can, however, be simulated independently, making it possible to run these in parallel.

We designed the technology stack for this architecture for performance optimization and cost-efficiency, and aimed for execution times measured in hours or days. We used Intel Xeon CPUs to minimize cost and facilitate both vertical and horizontal scaling – only possible because of the large selection and availability of Amazon Elastic Compute Cloud (Amazon EC2) instance types based on Intel architectures.

For data storage, we used Amazon FSx for Lustre for short-term storage needs, and Amazon Simple Storage Service (Amazon S3) for long-term storage. We also used standard Elastic Network Interface networking in our cluster configuration.

We built our software stack on Amazon Linux 2 and encapsulated it in the Amazon Machine Image (AMI) we created for the cluster nodes. This included  . To create the infrastructure, we used an open-source project called aws-do-pcluster, which allowed us to spin up AWS ParallelCluster using the do-framework.

ParallelCluster allowed us to scale to an impressive 300,000 vCPUs concurrently within a single Availability Zone, which can be further expanded by incorporating additional Availability Zones in the same region for a larger workload. This scalability enabled us to manage and process vast amounts of data efficiently, ensuring that our applications remain responsive and performant. The auto-scaling capabilities of AWS ParallelCluster allowed us to automatically adjust the number of vCPUs based on demand, ensuring cost efficiency.

Figure 1 - PLC Deployment Architecture. The deployment architecture uses a standard Parallel Cluster within an AWS VPC. A scratch FSx for Lustre volume is mounted on the head node and all compute nodes. To minimize cost, the architecture uses primarily SPOT instances with “lowest price” allocation strategy. A partition of on-demand instances is also added to the cluster primarily for development and experimentation purposes. Each of the AWS ParallelCluster partitions shown in the private subnet corresponds to a SLURM queue. When the queue contains no jobs, the partition is automatically scaled to 0 nodes. When jobs are submitted to the SLURM queues, compute nodes of the corresponding EC2 instance types are automatically added to the cluster, thus creating an elastic compute environment.

Figure 1 – PLC Deployment Architecture. The deployment architecture uses a standard Parallel Cluster within an AWS VPC. A scratch FSx for Lustre volume is mounted on the head node and all compute nodes. To minimize cost, the architecture uses primarily SPOT instances with “lowest price” allocation strategy. A partition of on-demand instances is also added to the cluster primarily for development and experimentation purposes. Each of the AWS ParallelCluster partitions shown in the private subnet corresponds to a SLURM queue. When the queue contains no jobs, the partition is automatically scaled to 0 nodes. When jobs are submitted to the SLURM queues, compute nodes of the corresponding EC2 instance types are automatically added to the cluster, thus creating an elastic compute environment.

Conclusion

Computational researchers working on drug discovery can use this data for their own workflows. That could be the analysis of changes in the shapes of proteins, comparison of binding sites, energy landscape mapping, studying the dynamics of protein-ligand interactions, or estimating binding affinity.

We’re looking forward to seeing what new and interesting questions the global computational drug discovery community will be able to answer by bringing these, and other datasets together in the cloud.

Contact us to let us know about your insights and breakthroughs using the AI3 dataset, and if the Registry of Open Data helps you along the way.

Contributors:
Insilico Medicine: Vladimir Aladinskiy, Evgeny Kirilin, Igor Diankin, Georgiy Andreev, Arkadii Lin, Eugene Babin, and Petrina Kamya
IIITH: Rakesh Srivastava, Prathit Chatterjee, Indhu Ramachandran, and U. Deva Priyakumar
Intel: Farid El Chalouhi, Chethan Rao, Arun Karthi Subramaniyan, Akanksha R Bilani, Varma S Konala, and Ramanathan Sethuraman
AWS: Alex Iankoulski, Yusong Wang, Stephen Litster, and Srinivas Tadepalli

Deva Priyakumar

Deva Priyakumar

Deva Priyakumar is a Professor at the Center for Computational Natural Sciences and Bioinformatics (CCNSB), International Institute of Information Technology Hyderabad (IIITH). He also leads the Technology Innovation Hub (IHub-Data, IIITH) on Data Driven Technologies. Deva completed his PhD in Computational Chemistry and was a Postdoctoral Fellow at the Computer-Aided Drug Design Center, School of Pharmacy, University of Maryland, before starting his own lab at IIIT-H. His areas of research are developing AI/ML based methods for molecule/material design and Healthcare; and applying computational methods to investigate chemical/biological systems and processes.

Beryl Rabindran

Beryl Rabindran

Beryl is the life sciences lead on the Amazon Web Services (AWS) open data team. Beryl is a geneticist and cell biologist by training, and led clinical research for a med-tech AI startup in the breast cancer imaging space before joining AWS. She is passionate about bringing science and technology closer to people.

Alex Iankoulski

Alex Iankoulski

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source [Do framework](https://bit.ly/do-framework) and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.

Prathit Chatterjee

Prathit Chatterjee

Dr. Prathit Chatterjee received his PhD in physical chemistry (computational biophysics) from CSIR-NCL, Pune, in 2016. He did his post-doc under Prof. Sihyun Ham in the same field, and joined the admixture-based industry, in close collaboration with Prof. Raja Shunmugam, Polymer Science Division, IISER Kolkata in 2017. In 2020, he went back to South Korea for his post-doc under Prof. Sihyun Ham, followed by 2-year post-doc (2021-2023) with Dr. Juyong Lee (Kangwon National University, lab shifted later to Seoul National University). In 2023, he joined CCNSB as a research scientist at IIIT-Hyderabad, funded by IIT-H DATA I-HUB FOUNDATION (academic head Prof. U. Deva Priyakumar).

Rakesh Srivastava

Rakesh Srivastava

Dr. Rakesh Srivastava is an Assistant Professor in the School of Computing Science Engineering and Artificial Intelligence (SCAI) at VIT Bhopal University, India. He was a Postdoctoral Fellow at I-HUB Data, International Institute of Information Technology (IIIT), Hyderabad in the area of Health Informatics and worked on the development of Computational Datasets for the application of Machine Learning (ML) techniques. He has obtained his Ph.D. in Computational Biology and Bioinformatics from Jawaharlal Nehru University (JNU), New Delhi. Dr. Srivastava completed his M. Tech. in Computational and Systems Biology from JNU New Delhi, and B. Tech. in Computer Science and Engineering from Institute of Engineering and Technology (IET), Lucknow, India. Dr. Srivastava’s research area includes Machine Learning, Health Informatics, Computational Drug Discovery and Computational Biology.

Ramanathan Sethuraman

Ramanathan Sethuraman

Ramanathan Sethuraman has 28 years of R&D expertise in multimedia and healthcare analytics. After his Ph.D. from the Indian Institute of Science in Digital Signal Processor architectures, he was a Research Scientist at the Philips Research Labs (PRL) in The Netherlands, a Director and Product Manager at Silicon Hive, and a Media/Healthcare Architect and Principal Engineer at Intel. In healthcare analytics, his research focus has been on understanding multimodal Health and Life Sciences (HLS) data, leading to novel insights that point towards disease causation rather than just correlation. His research premise is centred around AI-ML-DL based learning from HLS Big Data (OMICS of hologenome) of metaorganisms (organisms that coexist with their microbiota), which reveals more comprehensive disease manifestation pathways, allowing for earlier disease detection and more efficacious interventions.

Vladimir Aladinskiy

Vladimir Aladinskiy

Vladimir Aladinskiy has more than 10 years of experience in drug discovery, specializing in Computer-Aided Drug Design and the use of Artificial Intelligence in chemistry. He is the Head of Computational Chemistry at Insilico Medicine, where he has contributed to multiple drug discovery projects. Notably, he directed the chemistry efforts for the Insilico IPF program, leading to a drug candidate that reached Phase II clinical trials. Currently, he oversees an interdisciplinary сomputational сhemistry team at Insilico, extensively working on developing new computational solutions for drug discovery, including the SaaS platform Chemistry42. His main research interest is applying innovative technologies to help patients around the world.

Yusong Wang

Yusong Wang

Dr. Yusong Wang is a Senior High-Performance Computing (HPC) Specialist Solutions Architect at AWS. He has over 18 years of extensive experience spanning national research institutes in the United States and large financial enterprises. Yusong obtained his PhD in Mathematical Computer Science from the University of Illinois Chicago.