
OpenProteinSet
Provided by: OpenFold, part of the AWS Open Data Sponsorship Program
This product is part of the AWS Open Data Sponsorship Program and contains data sets that are publicly available for anyone to access and use. No subscription is required. Unless specifically stated in the applicable data set documentation, data sets available through the AWS Open Data Sponsorship Program are not provided and maintained by AWS.
Description
Multiple sequence alignments (MSAs) for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. Template hits are also provided for the PDB chains and for 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth. MSAs were generated with HHBlits (-n 3) and JackHMMER against MGnify, BFD, UniRef90, and UniClust30, while templates were identified from PDB70 with HHSearch, all according to the procedures outlined in the supplement to the AlphaFold 2 Nature paper (Jumper et al. 2021). We expect the database to be broadly useful to structural biologists training or validating deep learning models for protein structure prediction and related tasks.
License
How to cite
OpenProteinSet was accessed on DATE from https://registry.opendata.aws/openfold.
Additionally, please cite our manuscript.
Update frequency
Never
Support information
Managed by: OpenFold
General AWS Data Exchange support
Resources on AWS
Description
A repository of MSAs and template hits.
Resource type
S3 Bucket
Amazon Resource Name (ARN)
arn:aws:s3:::openfold
AWS Region
us-east-1
AWS CLI Access (No AWS account required)
aws s3 ls --no-sign-request s3://openfold/
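For readers without the AWS CLI installed, the same anonymous listing can be approximated over plain HTTPS. The following is a minimal sketch, assuming the bucket accepts unsigned S3 ListObjectsV2 requests at its virtual-hosted endpoint (consistent with the --no-sign-request access above). The helper names build_list_url, parse_listing, and list_bucket are hypothetical and are not part of OpenFold or any AWS tooling. Only the first page of results is returned; pagination is not handled.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 response bodies.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(bucket, prefix="", max_keys=100):
    """Build an unsigned ListObjectsV2 URL for a public bucket (hypothetical helper)."""
    query = urllib.parse.urlencode({
        "list-type": "2",      # select the ListObjectsV2 API
        "delimiter": "/",      # group keys into "directories"
        "prefix": prefix,
        "max-keys": str(max_keys),
    })
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (sub_prefixes, object_keys) from a ListObjectsV2 XML body."""
    root = ET.fromstring(xml_text)
    prefixes = [el.findtext(f"{S3_NS}Prefix")
                for el in root.findall(f"{S3_NS}CommonPrefixes")]
    keys = [el.findtext(f"{S3_NS}Key")
            for el in root.findall(f"{S3_NS}Contents")]
    return prefixes, keys


def list_bucket(bucket="openfold", prefix=""):
    """Fetch and parse one page of a listing (requires network access)."""
    with urllib.request.urlopen(build_list_url(bucket, prefix)) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling list_bucket() with no arguments should return the top-level prefixes and objects of s3://openfold/, roughly what the CLI command above prints. Passing one of the returned prefixes as the prefix argument descends one level.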
Usage examples
Publications
- OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O'Donnell, Timothy J, et al.
- OpenProteinSet: Training data for structural biology at scale by Ahdritz, Gustaf; Bouatta, Nazim; Kadyan, Sachin; Jarosch, Lukas; Berenberg, Daniel; Fisk, Ian, et al.
Tutorials
- Run inference at scale for OpenFold, a PyTorch-based protein folding ML model, using Amazon EKS by Shubha Kumbadakone, Ankur Srivastava, and Sachin Kadyan