Generative artificial intelligence for de novo protein design
Current Opinion in Structural Biology
@article{winnifrith2023generative,
title={Generative artificial intelligence for de novo protein design},
author={Winnifrith, Adam and Outeiral, Carlos and Hie, Brian},
journal={arXiv preprint arXiv:2310.09685},
year={2023}
}
Covers sequence-based and structure-based protein generative models. Inverse folding: going from structure to sequence. LLM-based protein models use masked language modelling: parts of the sequence are hidden and the model is trained to predict them correctly. Seq-to-prop models take the latent variable from a sequence model and feed it into various downstream ML tasks.
Symmetry groups relevant to protein modelling:
1. E(d): the Euclidean group, which includes translations, rotations, and reflections; it represents all rigid transformations of d-dimensional Euclidean space.
2. O(d): the orthogonal group of rotations and reflections. O(d) ≤ E(d).
3. SO(d): the special orthogonal group of rotations without reflections in d-dimensional space. SO(d) ≤ O(d).
4. SE(d): the special Euclidean group, combining translations and rotations. SE(d) ≤ E(d).
5. T(d): the translation group of pure spatial translations in d dimensions, without any rotation or reflection. T(d) ≤ SE(d).
Most protein structure models work in SE(3).
Structure representation methods:
Graphs: nodes represent atoms (often backbone atoms such as Cα, the alpha carbon) or residues; edges connect nodes that are spatially close, typically within a distance threshold (e.g., 8-10 Å), and store the distance and relative orientation between atoms. Good for inverse folding, since they capture local relationships.
Coordinates: proteins represented as atomic coordinates, typically backbone atoms (e.g., Cα) or all atoms. Used in diffusion models such as RFDiffusion and Chroma, which add noise to the coordinates and learn to reverse it.
Frames: each residue carries a local coordinate frame, used to model interactions and binding to other proteins. This captures both where residues are and how they are oriented, which is crucial for understanding side-chain interactions and detailed structural arrangements; SE(3) belongs here, and frames are used in diffusion models to represent rotations.
There are also other multi-dimensional array representations.
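As a concrete illustration of the graph representation above, a minimal sketch (my own, not from the paper) that builds a residue-level graph from Cα coordinates using a distance threshold, storing per-edge distances and unit direction vectors as orientation features:

```python
import numpy as np

def build_residue_graph(ca_coords: np.ndarray, cutoff: float = 8.0):
    """ca_coords: (N, 3) array of C-alpha positions, one per residue.
    Returns an edge list plus per-edge distance and unit direction vectors."""
    diff = ca_coords[None, :, :] - ca_coords[:, None, :]   # (N, N, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)                    # (N, N) pairwise distances
    src, dst = np.nonzero((dist < cutoff) & (dist > 0.0))   # edges within cutoff, no self-loops
    edge_dist = dist[src, dst]
    edge_dir = diff[src, dst] / edge_dist[:, None]          # orientation feature per edge
    return src, dst, edge_dist, edge_dir
```

A full frame-based representation would additionally attach a rotation matrix per residue (built from the N, Cα, C backbone atoms); the sketch keeps only distances and directions.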
A Survey on Graph Diffusion Models: Generative AI in Science for Molecule, Protein and Material
@article{zhang2023survey,
title={A survey on graph diffusion models: Generative ai in science for molecule, protein and material},
author={Zhang, Mengchun and Qamar, Maryam and Kang, Taegoo and Jung, Yuna and Zhang, Chenshuang and Bae, Sung-Ho and Zhang, Chaoning},
journal={arXiv preprint arXiv:2304.01565},
year={2023}
}
Graph-based diffusion models: proteins are represented as graphs.
Nodes represent atoms (often backbone atoms such as Cα, the alpha carbon) or residues; edges connect nodes that are spatially close, typically within a distance threshold (e.g., 8-10 Å), and store the distance and relative orientation between atoms.
Adding noise differs from images (continuous noise) because graph properties must be preserved, so discrete noise is added instead, e.g., masking residue identities (R groups), nodes, or edges.
Categorical diffusion: for discrete features, a transition matrix "diffuses" one discrete value into another (e.g., an amino acid type transitioning to a "mask" token).
Cross-entropy loss: the denoising model learns to predict the original discrete values with a cross-entropy loss rather than predicting continuous noise. The denoising model (often a GNN) is itself designed to be equivariant, so rotations and translations pose no issues.
DDPMs (Denoising Diffusion Probabilistic Models): add noise following a Markov chain (the forward process), and a neural network learns the noise that was added at each step.
Score-based models (SGMs): model the score function with a neural network, which estimates s(x) = ∇_x log p(x), the gradient of the log data density.
SDEs (stochastic differential equations): recast the discrete diffusion steps as a continuous-time process by defining the differential equation underlying the diffusion.
Evaluation:
Statistical metrics: comparing graph statistics (degree distribution, clustering coefficient, orbit counts) of generated graphs to real ones.
Validity: are the generated molecules/proteins chemically and structurally valid (correct valence, bond lengths, no steric clashes)? This is paramount for biological applications.
Uniqueness/novelty: are the generated samples new and diverse, not just copies of the training data?
Application-specific metrics: for drug discovery, metrics like binding affinity are crucial.
Challenges:
Scalability: diffusing edges or many components of large graphs (many proteins are large) can be computationally very expensive.
Irregularity: graphs have varying numbers of nodes and edges, making it tricky to define a diffusion process that effectively captures their dynamics.
Interpretability: understanding why a diffusion model generates a particular graph structure can be difficult, as the learned score functions are complex.
Diversity of graphs and domain-specific needs: proteins have unique characteristics (e.g., specific biochemical properties, the need for proper folding) that general graph generation models might not fully capture. Much research focuses on general molecular graphs, but proteins require specialized considerations.
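A hedged sketch of the discrete "masking" diffusion idea described above: the forward process stochastically replaces residue identities with a mask token, and the denoiser is trained with cross-entropy to recover the originals. The linear masking schedule and the `denoiser` interface are illustrative assumptions, not any specific paper's implementation:

```python
import torch
import torch.nn.functional as F

NUM_AA = 20
MASK = NUM_AA  # extra "absorbing" mask category

def forward_mask(seq: torch.Tensor, t: float) -> torch.Tensor:
    """seq: (B, L) integer amino acid types; t in [0, 1] sets the corruption level."""
    corrupt = torch.rand(seq.shape, device=seq.device) < t
    return torch.where(corrupt, torch.full_like(seq, MASK), seq)

def denoising_loss(denoiser, seq: torch.Tensor, t: float) -> torch.Tensor:
    noisy = forward_mask(seq, t)
    logits = denoiser(noisy)  # assumed to return (B, L, NUM_AA) predictions
    return F.cross_entropy(logits.reshape(-1, NUM_AA), seq.reshape(-1))
```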
Protein Large Language Models: A Comprehensive Survey
@article{xiao2025protein,
title={Protein large language models: A comprehensive survey},
author={Xiao, Yijia and Zhao, Wanjia and Zhang, Junkai and Jin, Yiqiao and Zhang, Han and Ren, Zhicheng and Sun, Renliang and Wang, Haixin and Wan, Guancheng and Lu, Pan and others},
journal={arXiv preprint arXiv:2502.17504},
year={2025}
}
LLMs are used for,
How a protein folds into a 3D shape (structure). What a protein does (function). How to design new proteins. Models can be divided into 4 main types,
1. Sequence-to-Property Prediction
2. Sequence-to-Label Prediction: mapping sequences to categorical labels, including secondary structure types, contact maps, or functional annotations.
3. Sequence-to-Structure Prediction
4. Sequence-to-Text Understanding
Protein engineering methods:
Protein Engineering: fθ : (S, T) → S′ modifies protein S toward the desired attributes T, yielding the engineered protein S′.
Protein Generation: fθ : (T, R) → P generates proteins with attributes T by sampling from the protein space using random seeds R.
Protein Translation: fθ : (P, T) → P′ translates a protein P into an alternative representation P′ based on the target translation parameters T.
Metrics used:
Root Mean Square Deviation (RMSD): measures the distance between predicted and actual atomic coordinates, with lower values indicating better accuracy.
Global Distance Test (GDT-TS): calculates the percentage of alpha-carbon (Cα) atoms within 1, 2, 4, and 8 Å thresholds, reflecting structural similarity. GDT-TS is preferred over RMSD in many cases because it is less sensitive to outliers: RMSD averages the distance over all corresponding atoms after alignment, so it can be heavily influenced by a few poorly predicted regions, such as flexible loops.
Template Modeling (TM) score: evaluates global structural similarity (scores between 0 and 1) by superimposing the predicted and real structures and measuring similarity over the alpha carbons; it focuses on overall shape and is not affected by protein size.
Local Distance Difference Test (lDDT): quantifies local accuracy by comparing interatomic distances.
Predicted Local Distance Difference Test (pLDDT): provides per-residue confidence scores (0-100) without a reference structure, as used in AlphaFold.
Note on experimental structures: NMR has a limited size range, being mostly suitable for proteins smaller than 30-50 kDa (larger proteins become challenging due to signal overlap).
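A minimal sketch of the RMSD metric defined above, computed after optimal superposition via the standard Kabsch algorithm (my own implementation, assuming pre-matched Cα coordinate arrays):

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """P, Q: (N, 3) corresponding C-alpha coordinates of predicted and true structures."""
    P = P - P.mean(axis=0)                      # center both structures
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))          # guard against improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt         # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=-1))))
```

This also makes the outlier sensitivity visible: the mean of squared deviations lets a few badly placed loop residues dominate, which is why GDT-TS and TM-score are often preferred.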
Pre-training datasets, used to train the language models in a self-supervised manner:
UniRef Clusters (Suzek et al., 2015): a collection of clustered protein sequences designed to reduce data redundancy and improve computational efficiency.
PDB (Bank, 1971): the Protein Data Bank, a repository for the 3D structural data of large biological molecules, such as proteins and nucleic acids.
AlphaFoldDB (Tunyasuvunakool et al., 2021): the AlphaFold Protein Structure Database, offering predicted protein structures generated by the AlphaFold model and containing over 200 million entries.
Benchmark datasets, which contain labeled sequences for supervised fine-tuning and evaluation on specific biological tasks:
CASP: the long-running structure prediction competition (still active, not defunct).
ProteinGym (Notin et al., 2023): a large-scale benchmark platform for protein design and fitness prediction.
ProteinLMBench (Shen et al., 2024a): a benchmark dataset comprising 944 manually verified multiple-choice questions aimed at assessing the protein understanding capabilities of LLMs.
De novo design of protein structure and function with RFdiffusion
@article{watson2023novo,
title={De novo design of protein structure and function with RFdiffusion},
author={Watson, Joseph L and Juergens, David and Bennett, Nathaniel R and Trippe, Brian L and Yim, Jason and Eisenach, Helen E and Ahern, Woody and Borst, Andrew J and Ragotte, Robert J and Milles, Lukas F and others},
journal={Nature},
volume={620},
number={7976},
pages={1089--1100},
year={2023},
publisher={Nature Publishing Group UK London}
}
Before this work, Denoising Diffusion Probabilistic Models (DDPMs) for protein monomer design showed limited success in generating sequences that folded into the intended structures in silico, and had not been experimentally validated; until this paper, diffusion models had been used mainly for image and text generation. RoseTTAFold diffusion (RFdiffusion) is a generative model of protein backbones obtained by fine-tuning the RoseTTAFold (RF) structure prediction network on protein structure denoising tasks. RFdiffusion is validated through experimental characterization of hundreds of designed symmetric assemblies, metal-binding proteins, and protein binders. The model is trained by minimizing a mean-squared error (m.s.e.) loss between predicted frames and the true protein structure; the MSE loss promotes continuity of the structure across time steps.
In silico validation uses AF2: a design counts as a "success" under stringent criteria, including high AF2 confidence (mean predicted aligned error (PAE) below 5), global backbone root mean-squared deviation (r.m.s.d.) within 2 Å of the designed structure, and r.m.s.d. within 1 Å on any scaffolded functional site.
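A trivial sketch of that in silico success filter (thresholds as stated above; the metric values themselves would come from AF2 and a structural alignment tool):

```python
from typing import Optional

def design_succeeds(mean_pae: float, backbone_rmsd: float,
                    motif_rmsd: Optional[float] = None) -> bool:
    ok = mean_pae < 5.0 and backbone_rmsd < 2.0   # AF2 confidence + global refolding check
    if motif_rmsd is not None:                    # scaffolded functional site, if any
        ok = ok and motif_rmsd < 1.0
    return ok
```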
Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models
@article{anand2022protein,
title={Protein structure and sequence generation with equivariant denoising diffusion probabilistic models},
author={Anand, Namrata and Achim, Tudor},
journal={arXiv preprint arXiv:2205.15019},
year={2022}
}
3D generation is a challenging problem; recent methods had been limited to generating small molecules or small, single-domain topologies. This paper applies an equivariant DDPM to generate 3D protein structures together with sequences.
Large language models generate functional protein sequences across diverse families
@article{madani2023large,
title={Large language models generate functional protein sequences across diverse families},
author={Madani, Ali and Krause, Ben and Greene, Eric R and Subramanian, Subu and Mohr, Benjamin P and Holton, James M and Olmos Jr, Jose Luis and Xiong, Caiming and Sun, Zachary Z and Socher, Richard and others},
journal={Nature biotechnology},
volume={41},
number={8},
pages={1099--1106},
year={2023},
publisher={Nature Publishing Group US New York}
}
An LLM-based de novo protein design paper. ProGen is a language model capable of generating protein sequences with predictable functions. It is augmented with control tags, enabling specification of protein properties during sequence generation. The model successfully generated functional proteins in the "twilight zone" of sequence identity (below 40%), demonstrating its ability to design sequences distant enough not to be considered traditional homologs. The "twilight zone" refers to a range of sequence similarity (typically 20-35% identity) where sequence alignment becomes unreliable and the probability of detecting true evolutionary relationships becomes uncertain. Pfam IDs are used as control tags. Training minimizes a loss evaluating the model's ability to predict the next amino acid in a sequence given the preceding ones. Sequence identity quantifies the similarity between generated artificial proteins and known natural proteins, particularly when evaluating designs in the "twilight zone" (e.g., less than 40% identity to any known natural protein).
Trained on 280 million protein sequences from more than 19,000 protein families, sourced from UniProtKB, UniParc, NCBI Taxonomy, Pfam, UniRef30, the NCBI nr database, and InterPro. Fine-tuning used curated sequences and tags from five lysozyme families, chorismate mutase, and malate dehydrogenase, with artificial proteins showing low sequence identity (31.4%) to natural proteins. ProGen is a 1.2-billion-parameter neural network based on the Transformer architecture, similar to those used in natural language processing. A key feature is conditional generation: protein properties such as family, biological process, or molecular function are provided as "control tags" (inputs) to guide sequence generation, allowing controlled design of proteins with desired characteristics.
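A hedged sketch of this conditional, autoregressive training setup: control tags (e.g., tokens for a Pfam family ID) are prepended to the amino acid sequence, and the model minimizes next-token cross-entropy over the combined token stream. The token layout and `model` interface are illustrative assumptions, not ProGen's actual code:

```python
import torch
import torch.nn.functional as F

def conditional_lm_loss(model, tag_ids: torch.Tensor, seq_ids: torch.Tensor) -> torch.Tensor:
    """tag_ids: (B, T) control-tag tokens; seq_ids: (B, L) amino acid tokens."""
    tokens = torch.cat([tag_ids, seq_ids], dim=1)  # condition by prefixing control tags
    logits = model(tokens[:, :-1])                 # assumed to return (B, T+L-1, vocab) logits
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

At generation time, sampling starts from the control tags alone, so the tags steer which family/function the decoded sequence belongs to.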
Protein structure generation via folding diffusion
@article{wu2024protein,
title={Protein structure generation via folding diffusion},
author={Wu, Kevin E and Yang, Kevin K and van den Berg, Rianne and Alamdari, Sarah and Zou, James Y and Lu, Alex X and Amini, Ava P},
journal={Nature communications},
volume={15},
number={1},
pages={1059},
year={2024},
publisher={Nature Publishing Group UK London}
}
A diffusion-based generative model that designs protein backbone structures via a procedure mirroring the native folding process, operating on backbone angles. The model is trained by minimizing a modified smooth L1 loss designed to handle periodic angular values by wrapping differences into the range [−π, π), behaving like an L1 loss for large errors and an L2 loss for small errors. The model can generate a new protein backbone on a single GPU in about one minute.
Trained on the CATH dataset, which provides a "de-duplicated" set of protein structural folds spanning a wide range of functions, in which no two chains share more than 40% sequence identity over 60% overlap (Sillitoe et al.). It is a denoising diffusion probabilistic model with a simple transformer backbone, and the resulting model unconditionally generates highly realistic protein structures.
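A sketch of the wrapped angular loss described above: differences between predicted and true backbone angles are wrapped into [−π, π) before a smooth L1 penalty, so the loss respects angle periodicity (L2-like for small errors, L1-like for large ones). The beta value is an illustrative assumption:

```python
import math
import torch
import torch.nn.functional as F

def wrapped_smooth_l1(pred_angles: torch.Tensor, true_angles: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    diff = pred_angles - true_angles
    wrapped = torch.remainder(diff + math.pi, 2 * math.pi) - math.pi  # wrap into [-pi, pi)
    return F.smooth_l1_loss(wrapped, torch.zeros_like(wrapped), beta=beta)
```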
Illuminating protein space with a programmable generative model
@article{ingraham2023illuminating,
title={Illuminating protein space with a programmable generative model},
author={Ingraham, John B and Baranov, Max and Costello, Zak and Barber, Karl W and Wang, Wujie and Ismail, Ahmed and Frappier, Vincent and Lord, Dana M and Ng-Thow-Hing, Christopher and Van Vlack, Erik R and others},
journal={Nature},
volume={623},
number={7989},
pages={1070--1078},
year={2023},
publisher={Nature Publishing Group UK London}
}
Uses a diffusion-based model over protein backbones with scalable molecular neural networks for backbone synthesis and all-atom design, modelling the joint, all-atom likelihood of sequences and three-dimensional structures of full protein complexes. Generating full complexes is important because proteins function by interacting with other molecules. The computation scales sub-quadratically with the size of the protein system, and the model supports conditional sampling under diverse design constraints without retraining, i.e., sampling toward new target functions without retraining the model.
Models full complexes with quasi-linear computational scaling and allows arbitrary conditional sampling at generation time. Chroma generates high-quality, diverse, and innovative structures that refold both in silico and in crystallographic experiments. It is a scalable generative model that can be adapted to other tasks, trained by optimizing a likelihood. Experimental structure evaluation used a split-GFP solubility assay and CD spectra.
DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints
@article{gao2023diffsds,
title={DiffSDS: a language diffusion model for protein backbone inpainting under geometric conditions and constraints},
author={Gao, Zhangyang and Tan, Cheng and Li, Stan Z},
journal={arXiv preprint arXiv:2301.09642},
year={2023}
}
ICLR 2024 Conference Withdrawn Submission
A text-guided protein design framework
@article{liu2025text,
title={A text-guided protein design framework},
author={Liu, Shengchao and Li, Yanjing and Li, Zhuoxinran and Gitter, Anthony and Zhu, Yutao and Lu, Jiarui and Xu, Zhao and Nie, Weili and Ramanathan, Arvind and Xiao, Chaowei and others},
journal={Nature Machine Intelligence},
pages={1--12},
year={2025},
publisher={Nature Publishing Group UK London}
}
A natural-language-based protein design task.
Whether text data (i.e., textual descriptions) can help in protein design tasks had not been explored. The authors propose ProteinDT, a multimodal framework that leverages textual descriptions for protein design. It comprises ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates protein sequences from that representation. ProteinDT can learn a robust protein representation spanning text and protein sequences. ProteinCLAP is trained with the InfoNCE contrastive loss; the sequence decoder is trained with a cross-entropy loss. Evaluation: structural similarity metrics such as TM-score or root mean square deviation (RMSD) where predicted or experimental structures are involved; functional assays (binding affinity, catalytic activity, stability) for experimental validation.
The two modalities exploited are the protein sequence and the textual description. SwissProtCLAP, a text-protein pair dataset, was built from UniProt for text representations. UniProt is the most comprehensive knowledge base for protein sequences, containing extensive annotations including function descriptions, domains, sub-cellular localization, post-translational modifications, and functionally characterized variants.
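A minimal InfoNCE sketch for the ProteinCLAP-style alignment described above: matched text/protein embedding pairs are pulled together while mismatched pairs in the batch act as negatives. Encoder outputs are assumed precomputed; the temperature is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, prot_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """text_emb, prot_emb: (B, D); row i of each is a matched text-protein pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    prot_emb = F.normalize(prot_emb, dim=-1)
    logits = text_emb @ prot_emb.T / tau          # (B, B) cosine similarities
    labels = torch.arange(text_emb.size(0), device=logits.device)  # diagonal = positives
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```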
Protein Sequence and Structure Co-Design with Equivariant Translation
@article{shi2022protein,
title={Protein sequence and structure co-design with equivariant translation},
author={Shi, Chence and Wang, Chuanrui and Lu, Jiarui and Zhong, Bozitao and Tang, Jian},
journal={arXiv preprint arXiv:2210.08761},
year={2022}
}
Motivation: high inference computation cost, and a sequential generation strategy fails to cross-condition on sequence and structure, which can lead to inconsistent proteins.
Frames the co-design task as a translation problem in the joint sequence-structure space based on context features; context features represent prior knowledge encoding the constraints that biologists want to impose on the protein to be designed.
Tasks: protein sequence-structure co-design; fixed-backbone sequence design.
Dataset: Structural Antibody Database (SAbDab).
SE(3) diffusion model with application to protein backbone generation
@article{yim2023se,
title={SE (3) diffusion model with application to protein backbone generation},
author={Yim, Jason and Trippe, Brian L and De Bortoli, Valentin and Mathieu, Emile and Doucet, Arnaud and Barzilay, Regina and Jaakkola, Tommi},
journal={arXiv preprint arXiv:2302.02277},
year={2023}
}
Proteins are represented as rigid bodies in 3D (referred to as frames) using SE(3); no diffusion framework had been defined on SE(3) frames until this paper.
The diffusion model RFdiffusion generated novel protein binders with high, experimentally verified affinities, but relied on a heuristic denoising loss and required pretraining on protein structure prediction. The goal here is to bridge this theory-practice gap and develop a principled method without pretraining.
Develops the theoretical foundations of SE(3)-invariant diffusion models on multiple frames, followed by a novel framework, FrameDiff; applied to monomer backbone generation, FrameDiff can generate designable monomers of up to 500 amino acids.
Protein Data Bank (PDB)
The model comprises 17.4 million parameters and was trained for one week on two NVIDIA A100 GPUs.
Multistate and functional protein design using RoseTTAFold sequence space diffusion
@article{lisanza2024multistate,
title={Multistate and functional protein design using RoseTTAFold sequence space diffusion},
author={Lisanza, Sidney Lyayuga and Gershon, Jacob Merle and Tipps, Samuel WK and Sims, Jeremiah Nelson and Arnoldt, Lucas and Hendel, Samuel J and Simma, Miriam K and Liu, Ge and Yase, Muna and Wu, Hongwei and others},
journal={Nature biotechnology},
pages={1--11},
year={2024},
publisher={Nature Publishing Group US New York}
}
Sequence-based diffusion.
Diffusion models have been explored less in categorical domains such as text and protein sequences. Hallucination-style approaches can generate sequence-structure pairs without additional training, but their solutions can be adversarial, require a large number of steps to converge, and robust experimental success requires subsequent sequence design on the hallucinated backbone. This model performs diffusion in sequence space instead of structure space. It designed thermostable proteins with varying amino acid compositions and internal sequence repeats, and caged bioactive peptides such as melittin.
A sequence-space diffusion model based on RoseTTAFold that simultaneously generates protein sequences and structures.
An all-atom protein generative model
@article{chu2024all,
title={An all-atom protein generative model},
author={Chu, Alexander E and Kim, Jinho and Cheng, Lucy and El Nesr, Gina and Xu, Minkai and Shuai, Richard W and Huang, Po-Ssu},
journal={Proceedings of the National Academy of Sciences},
volume={121},
number={27},
pages={e2311500121},
year={2024},
publisher={National Academy of Sciences}
}
Addresses multi-chain protein complexes (one of the papers tackling this issue). Single-chain protein modeling (e.g., AlphaFold2 and ESM) falls short in capturing the inter-chain interactions at the atomic level that are essential for multi-chain proteins. Offers native multi-chain protein modeling with efficient all-atom structure generation, and achieves superior performance in designing bioactive protein complexes such as antibodies and binding peptides. APM can be fine-tuned for specific tasks (e.g., antibody design) via supervised fine-tuning (SFT) while also supporting zero-shot sampling for certain applications, demonstrating its flexibility and generalizability. Metrics: RMSD (root mean square deviation), self-consistency TM-score (scTM), AAR (amino acid recovery), multi-chain protein generation quality, and ΔG (binding affinity).
PDB Biological Assemblies: 11,620 samples, filtered to exclude
APM employs a flow-matching approach to generate protein sequences and backbone structures. Flow matching is related to normalizing flows: the model learns to transform a simple distribution into a complex one representing the target data (here, protein sequences and backbone configurations). Training uses a flow-matching loss incorporating a discrete loss for sequences and an SE(3) loss for structures, ensuring accurate modeling of both sequence and spatial properties. Phase I: the sequence-and-backbone generation module (Seq&BB Module) and the sidechain generation module (Sidechain Module) are trained separately with the flow-matching objective. Phase II: these modules, together with a refinement module (Refine Module), are trained jointly and iteratively to fine-tune the all-atom structure of the protein complexes.
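A hedged sketch of a basic flow-matching objective of the kind described above, restricted to continuous Euclidean coordinates: interpolate between noise and data along a straight line and regress the model's predicted velocity onto the interpolation's constant velocity. APM's actual objective additionally covers discrete sequence tokens and SE(3) frames; this only illustrates the core idea:

```python
import torch

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """x1: (B, N, 3) target coordinates (e.g., backbone atoms)."""
    x0 = torch.randn_like(x1)                          # sample from the simple base distribution
    t = torch.rand(x1.size(0), 1, 1, device=x1.device) # one time value per sample
    xt = (1 - t) * x0 + t * x1                         # linear interpolant between noise and data
    target_v = x1 - x0                                 # straight-line (constant) target velocity
    pred_v = model(xt, t.squeeze())                    # model predicts the velocity field
    return ((pred_v - target_v) ** 2).mean()
```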
AbODE: Ab Initio Antibody Design using Conjoined ODEs
@inproceedings{verma2023abode,
title={Abode: Ab initio antibody design using conjoined odes},
author={Verma, Yogesh and Heinonen, Markus and Garg, Vikas},
booktitle={International Conference on Machine Learning},
pages={35037--35050},
year={2023},
organization={PMLR}
}
AbODE models the antibody-antigen complex as a joint 3D graph, extending graph PDEs to simultaneously generate the CDR sequence and structure. Unlike autoregressive methods, it uses a single round of full-shot decoding. Beyond antibody design, AbODE performs competitively on tasks like fixed-backbone protein sequence design. Metrics: perplexity (PPL), measuring how well the model predicts the sequence distribution, adapted here for antibody sequence generation (lower is better); amino acid recovery (AAR), the percentage of correctly predicted amino acids in the generated sequence compared to the ground truth (higher is better); and root mean square deviation (RMSD).
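For reference, minimal implementations of the two sequence metrics just defined (AAR as exact-match fraction, perplexity as exp of the mean per-residue negative log-likelihood); both are standard definitions, not code from the paper:

```python
import math

def amino_acid_recovery(pred: str, true: str) -> float:
    assert len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def perplexity(per_residue_nll: list[float]) -> float:
    """per_residue_nll: negative log-likelihood (in nats) of each ground-truth residue."""
    return math.exp(sum(per_residue_nll) / len(per_residue_nll))
```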
Uses the Structural Antibody Database (SAbDab) as its main dataset for antibody-related tasks; additionally employs the CATH 4.2 dataset for the fixed-backbone sequence design task.
A graph-based generative model.
Protein generation with evolutionary diffusion: sequence is all you need
@article{alamdari2023protein,
title={Protein generation with evolutionary diffusion: sequence is all you need},
author={Alamdari, Sarah and Thakkar, Nitya and van den Berg, Rianne and Tenenholtz, Neil and Strome, Bob and Moses, Alan and Lu, Alex Xijie and Fusi, Nicolo and Amini, Ava Pardis and Yang, Kevin K},
journal={BioRxiv},
pages={2023--09},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
Traditional state-of-the-art models like RFdiffusion generate protein structures, which restricts their training data to the limited set of solved protein structures. Proteins are fundamentally defined by their amino acid sequences, which dictate both their structure and function, motivating a sequence-first generative model. The model supports flexible conditioning, facilitating tasks such as evolution-guided design. Evaluation of generated sequences uses pLDDT (Predicted Local Distance Difference Test) of their predicted structures.
Datasets: UniRef50 and the OpenFold dataset.
Generalized biomolecular modeling and design with RoseTTAFold All-Atom
@article{krishna2024generalized,
title={Generalized biomolecular modeling and design with RoseTTAFold All-Atom},
author={Krishna, Rohith and Wang, Jue and Ahern, Woody and Sturmfels, Pascal and Venkatesh, Preetham and Kalvet, Indrek and Lee, Gyu Rie and Morey-Burrows, Felix S and Anishchenko, Ivan and Humphreys, Ian R and others},
journal={Science},
volume={384},
number={6693},
pages={eadl2528},
year={2024},
publisher={American Association for the Advancement of Science}
}
There is a need for a generalizable model capable of handling diverse and chemically complex biomolecular systems at the all-atom level, including proteins, nucleic acids, ligands, cofactors, and metals. Existing methods often lack the accuracy required for high experimental success rates in de novo design, particularly for functional sites and varied architectural forms. This paper introduces RoseTTAFold All-Atom (RFAA), a generalized deep learning framework for all-atom biomolecular modeling and design. The model can accurately design amino acid sequences for given protein and biomolecular structures (inverse folding), and RFAA supports de novo design of monomers, oligomers, and protein-protein complexes. Training uses the Frame Aligned Point Error (FAPE) loss, which measures the deviation between predicted and true atomic coordinates or residue frames, effectively guiding the model to learn accurate three-dimensional structures at the all-atom level.
Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2
@article{lin2024out,
title={Out of many, one: Designing and scaffolding proteins at the scale of the structural universe with genie 2},
author={Lin, Yeqing and Lee, Minji and Zhang, Zhao and AlQuraishi, Mohammed},
journal={arXiv preprint arXiv:2405.15489},
year={2024}
}
There is a continuous need to expand the ability of generative models to capture a larger and more diverse protein structural space. Existing methods often face limitations in designing complex proteins that require multiple interacting partners or specific functions, particularly in motif scaffolding where the positions and orientations between motifs are not predetermined. This paper presents Genie 2, an evolution of the pioneering Genie model, designed to explore a broader and more varied protein structure space. A key innovation is multi-motif scaffolding: a new framework for designing co-occurring motifs with unspecified inter-motif positions and orientations, enabling complex proteins that can engage multiple interaction partners and perform diverse functions. Training minimizes a reconstruction (L2) loss; evaluation uses 1 − TM-score(x, y) and inference time.
Genie asymmetrically represents protein structures during the forward and backward processes, using simple Gaussian noising for the former and expressive SE(3)-equivariant attention for the latter.
Deep Learning Methods for Small Molecule Drug Discovery: A Survey
IEEE Transactions on Artificial Intelligence
@article{hu2023deep,
title={Deep learning methods for small molecule drug discovery: A survey},
author={Hu, Wenhao and Liu, Yingying and Chen, Xuanyu and Chai, Wenhao and Chen, Hangyue and Wang, Hongwei and Wang, Gaoang},
journal={IEEE Transactions on Artificial Intelligence},
volume={5},
number={2},
pages={459--479},
year={2023},
publisher={IEEE}
}
Unified rational protein engineering with sequence-based deep representation learning
@article{alley2019unified,
title={Unified rational protein engineering with sequence-based deep representation learning},
author={Alley, Ethan C and Khimulya, Grigory and Biswas, Surojit and AlQuraishi, Mohammed and Church, George M},
journal={Nature methods},
volume={16},
number={12},
pages={1315--1322},
year={2019},
publisher={Nature Publishing Group US New York}
}
Motivation: a principled way to model the relationship between protein sequence and function. There was a need for a unified approach that could distill fundamental protein features into a semantically rich statistical representation directly from large, unlabeled amino-acid sequence datasets. The paper presents UniRep (Unified Representation), a deep-learning method that learns a versatile and semantically rich statistical representation of fundamental protein features from unlabeled amino-acid sequences, grounded in structural, evolutionary, and biophysical properties, providing a holistic understanding of proteins. This is more of a protein representation learning paper. The model is trained on unlabeled protein data with a cross-entropy loss on next-amino-acid prediction. Downstream properties predicted include enzyme activity, binding affinity, melting temperatures, and free energy changes of unfolding.
Learning protein sequence embeddings using information from structure
@article{bepler2019learning,
title={Learning protein sequence embeddings using information from structure},
author={Bepler, Tristan and Berger, Bonnie},
journal={arXiv preprint arXiv:1902.08661},
year={2019}
}
Representation Learning using LSTM
Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins.
The framework maps any protein sequence to a sequence of vector embeddings (one per amino acid position) that encode structural information. Bidirectional long short-term memory (LSTM) models are trained on protein sequences with a two-part feedback mechanism incorporating (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. A novel similarity measure between arbitrary-length sequences of vector embeddings is defined via a soft symmetric alignment (SSA) between them. Empirically, the multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method at predicting structural similarity, the stated goal. Evaluation: structure similarity prediction on the ASTRAL 2.06 test set, and 8-class secondary structure prediction on a 40% sequence identity filtered dataset of 22,086 protein sequences from the Protein Data Bank (PDB) [37], a repository of experimentally determined protein structures. Secondary structure prediction is a sequence labeling problem: every position of a protein sequence is classified into one of eight classes describing the local 3D structure at that residue, so the model predicts secondary structure as well. Metrics: perplexity, measuring how well the model predicts the secondary structure class at each residue position (lower is better); accuracy, the fraction of correctly predicted secondary structure classes across all residue positions (e.g., 0.630 for the full SSA model, Table 2); and a structural similarity loss.
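A hedged sketch of the soft symmetric alignment (SSA) similarity described above: pairwise L1 distances between the two embedding sequences are softly aligned in both directions, and the similarity is the negative alignment-weighted mean distance. The details follow my reading of the paper and may differ from the authors' implementation:

```python
import torch

def ssa_similarity(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """z1: (N, D), z2: (M, D) per-residue embeddings of two proteins."""
    d = torch.cdist(z1, z2, p=1)      # (N, M) pairwise L1 distances
    a = torch.softmax(-d, dim=1)      # soft alignment of each residue of z1 onto z2
    b = torch.softmax(-d, dim=0)      # soft alignment of each residue of z2 onto z1
    alpha = a + b - a * b             # symmetric combination of the two alignments
    return -(alpha * d).sum() / alpha.sum()
```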
PDB dataset
Architecture: bidirectional long short-term memory (LSTM) networks, with 3 biLSTM layers of 512 hidden units each and a final output embedding dimension of 100 (Appendix Figure 2). Language model hidden states are projected into a 512-dimensional vector before being fed into the encoder. The contact prediction module uses a hidden layer of dimension 50.
ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning
@article{elnaggar2021prottrans,
title={Prottrans: Toward understanding the language of life through self-supervised learning},
author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and others},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={44},
number={10},
pages={7112--7127},
year={2021},
publisher={IEEE}
}
Application of transfer learning to protein language models (PLMs).
Self-supervised models for protein feature extraction: a masked language model is trained so the LLM learns protein features, which are then transferred to downstream tasks. The advantage of using the embeddings as exclusive input was validated on several tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3 = 81%-87%); (2) per-protein (pooled) predictions of protein sub-cellular location (ten-state accuracy Q10 = 81%) and membrane versus water-soluble (2-state accuracy Q2 = 91%).
BFD (Big Fantastic Database)
Learning the language of viral evolution and escape
@article{hie2021learning,
title={Learning the language of viral evolution and escape},
author={Hie, Brian and Zhong, Ellen D and Berger, Bonnie and Bryson, Bryan},
journal={Science},
volume={371},
number={6526},
pages={284--288},
year={2021},
publisher={American Association for the Advancement of Science}
}
Models viral protein mutation (virology). The ability of viruses to mutate, evade the human immune system, and cause infection is known as viral escape.
Viral mutations that evade neutralizing antibodies (viral escape) can impede the development of vaccines. The authors developed a single machine learning model that simultaneously captures both the evolutionary fitness (ability to replicate and infect) and the functional or semantic similarity (antigenic properties) of viral proteins, applying language models to viral sequences; three separate unsupervised language models were constructed, one each for influenza A hemagglutinin, HIV-1 envelope glycoprotein, and SARS-CoV-2 spike glycoprotein. Constrained semantic change search (CSCS), a novel method introduced here for viral escape prediction, searches for mutations to a viral sequence that preserve fitness while being antigenically different, thus enabling escape from immune recognition. The models were also assessed on structural similarity prediction and secondary structure prediction. Metrics: area under the curve (AUC) of ROC curves, assessing the model's ability to distinguish escape mutations from non-escape mutations; a cross-entropy loss was used for training.
Influenza A hemagglutinin (HA): training sequences likely from the Influenza Virus Resource at the National Center for Biotechnology Information. HIV-1 envelope glycoprotein (Env): most likely derived from the Los Alamos HIV Sequence Database or similar repositories. SARS-CoV-2 spike: probably obtained from GISAID or GenBank.
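A hedged sketch of CSCS scoring as summarized above: each candidate mutation is ranked by its semantic change in embedding space and by its language-model probability (a fitness/grammaticality proxy), and the two ranks are combined. The rank-sum form and beta weight are assumptions based on the paper's description:

```python
import numpy as np

def cscs_scores(semantic_change: np.ndarray, grammaticality: np.ndarray,
                beta: float = 1.0) -> np.ndarray:
    """Both inputs: (num_mutations,); higher combined score = stronger escape candidate."""
    rank_sem = semantic_change.argsort().argsort()   # ascending ranks of semantic change
    rank_gram = grammaticality.argsort().argsort()   # ascending ranks of LM probability
    return rank_sem + beta * rank_gram
```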
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
@article{chen2024xtrimopglm,
title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
journal={arXiv preprint arXiv:2401.06199},
year={2024}
}
Antibody
Most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently; xTrimoPGLM addresses this with a unified pre-training framework at the 100B-parameter scale.
Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning
@article{kroll2023turnover,
title={Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning},
author={Kroll, Alexander and Rousset, Yvan and Hu, Xiao-Pan and Liebrand, Nina A and Lercher, Martin J},
journal={Nature communications},
volume={14},
number={1},
pages={4139},
year={2023},
publisher={Nature Publishing Group UK London}
}
Enzyme turnover number (kcat), a measure of enzyme efficiency, is central to understanding cellular physiology and resource allocation.
Experimental kcat estimates are unavailable for the vast majority of enzymatic reactions, so the development of accurate computational prediction methods is highly desirable. Existing machine learning models are limited to a single, well-studied organism, or provide inaccurate predictions except for enzymes highly similar to proteins in the training set. This paper presents a deep learning approach for predicting in vitro kcat values for natural reactions of wild-type enzymes. To capture enzyme properties, fine-tuned state-of-the-art protein representations are used as additional model inputs. A web interface is provided to calculate kcat. Trained as a regression task.
Uses the ESM-1b model to represent protein sequences.
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
@article{rives2021biological,
title={Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences},
author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and others},
journal={Proceedings of the National Academy of Sciences},
volume={118},
number={15},
pages={e2016239118},
year={2021},
publisher={National Academy of Sciences}
}
Fixed-backbone sequence design; a sequence-only model.
Language models enable zero-shot prediction of the effects of mutations on protein function
@article{meier2021language,
title={Language models enable zero-shot prediction of the effects of mutations on protein function},
author={Meier, Joshua and Rao, Roshan and Verkuil, Robert and Liu, Jason and Sercu, Tom and Rives, Alex},
journal={Advances in neural information processing systems},
volume={34},
pages={29287--29303},
year={2021}
}
Antibody; effects of mutations on protein function.
They frame protein function prediction as an unsupervised learning task, noting there may be functional categories not yet discovered, and therefore use zero-shot prediction for function.
Experiments use the state-of-the-art protein language models ESM-1b [12] and MSA Transformer [13]. A new protein language model, ESM-1v, is introduced, with zero-shot performance comparable to state-of-the-art mutational effect predictors.
Learning inverse folding from millions of predicted structures
@inproceedings{hsu2022learning,
title={Learning inverse folding from millions of predicted structures},
author={Hsu, Chloe and Verkuil, Robert and Liu, Jason and Lin, Zeming and Hie, Brian and Sercu, Tom and Lerer, Adam and Rives, Alexander},
booktitle={International conference on machine learning},
pages={8946--8970},
year={2022},
organization={PMLR}
}
Fixed-backbone sequence design; inverse folding.
Predicting a protein sequence from its backbone atom coordinates. Inverse folding is approached as a sequence-to-sequence problem (Ingraham et al., 2019), using an autoregressive encoder-decoder architecture in which the model recovers the native sequence of a protein from the coordinates of its backbone atoms. Metrics: TM-score (via Foldseek) and perplexity.
AlphaFold2-predicted structures are used to expand the training data far beyond experimentally solved structures. Architecture: Geometric Vector Perceptron (GVP) layers within a GVP-Transformer.
Evolutionary-scale prediction of atomic-level protein structure with a language model
@article{lin2023evolutionary,
title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
journal={Science},
volume={379},
number={6637},
pages={1123--1130},
year={2023},
publisher={American Association for the Advancement of Science}
}
A 15B-parameter protein language model for structure prediction, from Facebook AI (Meta).
Protein sequences encode evolutionary patterns that reflect their 3D structures and biological functions; the paper explores emergent capabilities in language models and atomic-level structure prediction. Metrics: long-range contact precision, assessing the accuracy of predicted contacts between residues far apart in the sequence but close in the 3D structure; and root mean square deviation (RMSD).
Trained on 43 million UniRef50 clusters; a Transformer-based masked language model.
MSA Transformer
@inproceedings{rao2021msa,
title={MSA transformer},
author={Rao, Roshan M and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
booktitle={International Conference on Machine Learning},
pages={8844--8856},
year={2021},
organization={PMLR}
}
Traditional unsupervised methods, such as Potts models, fit a separate model to each protein family using multiple sequence alignments (MSAs). While effective, this approach is computationally intensive, requiring a multi-step pipeline beginning with sequence search. With protein sequence databases growing exponentially, there is a pressing need for scalable and efficient methods that can process large datasets without compromising accuracy. The MSA Transformer is a novel transformer-based model that operates directly on multiple sequence alignments rather than individual sequences, performing inference from sets of aligned sequences. To manage the 2D structure of MSAs (sequences as rows, positions as columns), the model employs axial attention, alternating between row and column attention, with tied row attention: a single attention map is shared across all sequences in the MSA. Evaluation: long-range contact precision, supervised contact prediction, and secondary structure prediction.
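A hedged sketch of tied row attention as described above: query-key logits are computed per MSA row, summed across rows so that all sequences share a single position-by-position attention map, which is then applied to every row's values. The exact normalization is an assumption (the paper discusses square-root scaling over the number of rows):

```python
import math
import torch

def tied_row_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (R, C, D) = (MSA rows, alignment columns, head dimension)."""
    R, C, D = q.shape
    logits = torch.einsum('rid,rjd->ij', q, k) / (math.sqrt(D) * math.sqrt(R))
    attn = torch.softmax(logits, dim=-1)         # one (C, C) map shared by all rows
    return torch.einsum('ij,rjd->rid', attn, v)  # apply the shared map to every row
```

Sharing one map across rows both cuts memory (one C×C map instead of R of them) and matches the intuition that residue-residue contacts are a property of the family, not of any single sequence.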
Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-Time Retrieval
@inproceedings{notin2022tranception,
title={Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval},
author={Notin, Pascal and Dias, Mafalda and Frazer, Jonathan and Marchena-Hurtado, Javier and Gomez, Aidan N and Marks, Debora and Gal, Yarin},
booktitle={International Conference on Machine Learning},
pages={16990--17017},
year={2022},
organization={PMLR}
}
MSAs are not always available or reliable, particularly for proteins that are difficult to align (such as disordered proteins) or that have shallow alignments (few homologous sequences), since building an MSA requires finding similar sequences in databases; this restricts the applicability of alignment-based models, which work only for specific protein families. Existing models also struggle to accurately predict the effects of complex mutations, such as multiple amino acid substitutions or insertions/deletions (indels). Tranception is an autoregressive transformer designed to enhance specialization across attention heads and explicitly capture patterns from contiguous subsequences (k-mers); it combines autoregressive predictions with retrieval of homologous sequences (MSAs) at inference time. Metrics: Spearman's rank correlation between predictions and experimental results; AUC for the model's ability to classify mutants as functional or non-functional; and the Matthews correlation coefficient, which complements AUC by providing a balanced measure of classification performance, especially for imbalanced datasets.
Uses grouped ALiBi position encoding. The large model, Tranception L, has 700 million parameters, with 36 layers, 20 attention heads, and an embedding size of 1,280.
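A minimal sketch of the headline evaluation above: Spearman's rank correlation between model scores and experimental fitness measurements (using SciPy; variable names are illustrative):

```python
from scipy.stats import spearmanr

def fitness_spearman(model_scores, experimental_fitness) -> float:
    rho, _pvalue = spearmanr(model_scores, experimental_fitness)
    return float(rho)
```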
PoET: A generative model of protein families as sequences-of-sequences
@article{truong2023poet,
title={Poet: A generative model of protein families as sequences-of-sequences},
author={Truong Jr, Timothy and Bepler, Tristan},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={77379--77415},
year={2023}
}
The goal is to develop a model that can generate proteins from a specific family without needing a large MSA, and to enable transfer learning across protein families to improve generalization, particularly for data-scarce families. Sequence-of-sequences approach: PoET models entire protein families as sequences-of-sequences, treating a set of related protein sequences as a single long sequence in which each protein is a subsequence. Order-invariant architecture: a unique Transformer layer processes tokens within sequences sequentially (preserving order dependence) while attending across different sequences in an order-invariant manner. Evaluation includes structural conservation (TM-score).
Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes
@article{lin2023deep,
title={Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes},
author={Lin, Peicong and Yan, Yumeng and Tao, Huanyu and Huang, Sheng-You},
journal={Nature Communications},
volume={14},
number={1},
pages={4935},
year={2023},
publisher={Nature Publishing Group UK London}
}
Transmembrane protein (TMP) complexes are underrepresented in structural databases, with only about 350 non-redundant homo-oligomeric TMP complexes in the PDBTM database. Transmembrane proteins differ from soluble proteins in their structural characteristics (e.g., predominantly α-helical structures driven by hydrophobic interactions) and physicochemical environment (the membrane milieu), yet existing models are trained on soluble proteins. DeepTMP, a novel deep transfer learning method, is proposed specifically for predicting inter-chain contacts in TMP complexes, addressing the unique challenges posed by these proteins. Its geometric triangle-aware module captures many-body effects and ensures geometric consistency in inter-chain interaction predictions, enhancing accuracy by reducing inconsistencies that arise in traditional models.
TMP complexes from the PDBTM database.
A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
@article{zhang2023systematic,
title={A systematic study of joint representation learning on protein sequences and structures},
author={Zhang, Zuobai and Wang, Chuanrui and Xu, Minghao and Chenthamarakshan, Vijil and Lozano, Aur{\'e}lie and Das, Payel and Tang, Jian},
journal={arXiv preprint arXiv:2303.06275},
year={2023}
}
A multi-modality protein model: structure and sequence. Combining sequence and structure data could enhance representation learning, but prior attempts have not consistently outperformed single-modality approaches, leaving this integration underexplored. Three methods to combine sequence and structure representations: serial fusion, which feeds sequence representations from PLMs into structure encoders as initial features; parallel fusion, which concatenates sequence and structure representations; and cross fusion, which uses multi-head self-attention to integrate the two modalities. The authors pair PLMs with three structure encoders: GearNet, GVP, and CDConv. Tasks: Enzyme Commission (EC) number prediction and Protein Structure Ranking (PSR).
AlphaFold Protein Structure Database v1
The architecture combines ESM-2-650M with structure encoders such as GearNet.
Multi-level Protein Structure Pre-training via Prompt Learning
@inproceedings{wang2022multi,
title={Multi-level protein structure pre-training via prompt learning},
author={Wang, Zeyuan and Zhang, Qiang and Shuang-Wei, HU and Yu, Haoran and Jin, Xurui and Gong, Zhichen and Chen, Huajun},
booktitle={The Eleventh International Conference on Learning Representations},
year={2022}
}
Most existing methods focus solely on the primary sequence or the tertiary structure, overlooking the other structural levels; this incomplete representation limits the ability to fully understand protein functionality. Traditional multi-task learning, where multiple objectives are trained simultaneously, often suffers from negative knowledge transfer, and existing models are good only at structure prediction. The paper's prompt-aware attention module modifies the Transformer architecture with:
Attention masks: prevent prompt tokens from attending to input sequence tokens, ensuring task-specific focus.
Masked language modeling (MLM): predicts masked amino acids to learn the primary structure.
Downstream tasks: function annotation (EC numbers and GO terms) and protein engineering (e.g., stability, fluorescence).
Structure-informed protein language models are robust predictors for variant effects
@article{sun2024structure,
title={Structure-informed protein language models are robust predictors for variant effects},
author={Sun, Yuanfei and Shen, Yang},
journal={Human Genetics},
pages={1--17},
year={2024},
publisher={Springer}
}
A structure-aware protein language model. Unsupervised fine-tuning of sequence-only pLMs on family-specific sequences can lead to overfitting, reducing performance in variant effect prediction. The authors observe a trade-off between sequence and structure awareness, necessitating a method to balance these aspects effectively; they propose a novel framework that extends masked sequence denoising to cross-modality denoising, integrating sequence and structural information. Metrics: Spearman's rank correlation coefficient and area under the precision-recall curve (AUPRC).
Datasets: RP15 (12 million sequences clustered at 15% co-membership), RP75 (68 million sequences clustered at 75% co-membership), and 35 deep mutational scanning (DMS) datasets. A BERT-based protein language model.
OntoProtein: Protein Pretraining With Gene Ontology Embedding
@article{zhang2022ontoprotein,
title={Ontoprotein: Protein pretraining with gene ontology embedding},
author={Zhang, Ningyu and Bi, Zhen and Liang, Xiaozhuan and Cheng, Siyuan and Hong, Haosen and Deng, Shumin and Lian, Jiazhang and Zhang, Qiang and Chen, Huajun},
journal={arXiv preprint arXiv:2201.11147},
year={2022}
}
Knowledge-Enhanced Models
Sequence-based models struggle to incorporate biological knowledge beyond the sequence data. Gene Ontology (GO) provides a structured knowledge graph with rich biological facts about proteins (e.g., molecular functions, cellular components); while this could enhance protein representations, integrating it with sequence-based PLMs is challenging due to the differing data types. OntoProtein is the first general framework to integrate Gene Ontology knowledge into protein pre-training, using a contrastive learning method that jointly optimizes embeddings for the knowledge graph and protein sequences. It contributes a new, large-scale knowledge graph dataset with 612,483 entities (565,254 proteins and 47,229 GO terms) and 4,990,097 triples. Evaluated on the TAPE benchmark: secondary structure prediction, protein-protein interaction (PPI), and protein function prediction.
Datasets: ProteinKG25 (the knowledge graph used for pre-training) and STRING for protein-protein interaction.
ProteinBERT: a universal deep-learning model of protein sequence and function
@article{brandes2022proteinbert,
title={ProteinBERT: a universal deep-learning model of protein sequence and function},
author={Brandes, Nadav and Ofer, Dan and Peleg, Yam and Rappoport, Nadav and Linial, Michal},
journal={Bioinformatics},
volume={38},
number={8},
pages={2102--2110},
year={2022},
publisher={Oxford University Press}
}
Knowledge-Enhanced Models
Goal: increase the efficiency of protein language models, which neglect functional information such as Gene Ontology (GO) annotations, and build an LLM with transfer learning capabilities. ProteinBERT is a novel deep language model designed specifically for proteins. Its architecture uses global attention mechanisms instead of standard self-attention, reducing computational complexity from quadratic to linear in sequence length, and it uses functional (GO) information in addition to sequence data. Evaluation tasks: secondary structure prediction, and signal peptide prediction (accuracy for binary classification of signal peptides).
Code: https://github.com/nadavbra/protein_bert
Protein Representation Learning via Knowledge Enhanced Primary Structure Reasoning
@inproceedings{zhou2023protein,
title={Protein representation learning via knowledge enhanced primary structure reasoning},
author={Zhou, Hong-Yu and Fu, Yunxiang and Zhang, Zhicheng and Cheng, Bian and Yu, Yizhou},
booktitle={The Eleventh International Conference on Learning Representations},
year={2023}
}
Knowledge-Enhanced Models
Biological context is missing from most PLMs, and previous methods (e.g., OntoProtein) fail to capture fine-grained relationships between individual amino acids and specific knowledge terms. KeAP is a novel model that performs token-level knowledge graph exploration, allowing amino acids to iteratively query and integrate relevant knowledge from associated terms (e.g., molecular functions, biological processes) using cross-attention mechanisms, with a simplified training scheme: a single masked language modeling (MLM) objective. KeAP explores knowledge in a cascaded manner, first querying relation terms and then attribute terms, which proves more effective for knowledge encoding than non-cascaded approaches. Tasks: amino acid contact prediction and protein homology detection.
Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers
@inproceedings{abdine2024prot2text,
title={Prot2text: Multimodal protein’s function generation with gnns and transformers},
author={Abdine, Hadi and Chatzianastasis, Michail and Bouyioukos, Costas and Vazirgiannis, Michalis},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={10},
pages={10757--10765},
year={2024}
}
Protein Description and Annotation Models
Conventional protein function prediction is treated as a multi-label classification task in which proteins are assigned predefined labels; this oversimplifies the complex, multifaceted nature of protein function and limits the depth of insight. Free-text descriptions offer a more nuanced and comprehensive account, which is particularly valuable for drug discovery and protein engineering, where detailed functional knowledge is critical. Prot2Text is a multimodal model that generates free-text descriptions of protein function from 3D structures and textual annotations: a Relational Graph Convolutional Network (RGCN) encodes the 3D structure, the ESM protein language model encodes the sequence, and a GPT-2 decoder generates the detailed description. Metrics: BLEU (n-gram overlap between generated and reference texts); Rouge-1, Rouge-2, and Rouge-L (overlap of unigrams, bigrams, and the longest common subsequence, respectively); and BioBERT-based semantic similarity, a more nuanced evaluation of meaning beyond exact word matches (metric sketch below).
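The surface-overlap metrics above are easy to reproduce; a small example with the nltk and rouge-score packages (the candidate/reference strings are invented):

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "catalyzes the hydrolysis of ATP".split()
candidate = "catalyzes hydrolysis of ATP".split()

# bigram BLEU for this tiny example (default 4-gram BLEU degenerates on short strings)
bleu = sentence_bleu([reference], candidate, weights=(0.5, 0.5))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(" ".join(reference), " ".join(candidate))
print(bleu, scores["rougeL"].fmeasure)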
Multilingual translation for zero-shot biomedical classification using BioTranslator
@article{xu2023multilingual,
title={Multilingual translation for zero-shot biomedical classification using BioTranslator},
author={Xu, Hanwen and Woicik, Addie and Poon, Hoifung and Altman, Russ B and Wang, Sheng},
journal={Nature Communications},
volume={14},
number={1},
pages={738},
year={2023},
publisher={Nature Publishing Group UK London}
}
Protein Description and Annotation Models
A text-to-biology translator: BioTranslator takes user-written text as input and translates it into non-text biological data (e.g., a meaningful protein or gene representation). The framework maps multiple biological data modalities (gene sequences, drug SMILES, phenotype networks) into a shared text-based embedding space, enabling zero-shot classification: instances can be assigned to previously unseen classes without annotated training data for those classes (see the sketch after the dataset line below). Applied to tasks such as marker gene identification. Metrics: AUROC for distinguishing positive from negative instances, evaluating discriminative power on both seen and unseen cell types, plus BLEU scores for the biological translation itself.
Datasets: Tabula Muris for single-cell analysis and cell type annotation; also gene sequences, drug SMILES strings, phenotype networks, and pathway gene sets. The text encoder is a BERT-type model.
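A hedged sketch of zero-shot classification in a shared text space: embed free-text class descriptions and the non-text instance with trained encoders, then pick the nearest description. The two encode_* functions are placeholders standing in for BioTranslator's trained encoders.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
encode_text = lambda s: rng.standard_normal(64)       # placeholder text encoder
encode_instance = lambda x: rng.standard_normal(64)   # placeholder data encoder

class_descriptions = [
    "a T cell expressing CD3",        # classes never seen during training
    "an epithelial cell of the lung",
]
class_vecs = [encode_text(c) for c in class_descriptions]
instance_vec = encode_instance("expression profile")

pred = int(np.argmax([cosine(instance_vec, v) for v in class_vecs]))
print(class_descriptions[pred])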
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
@article{pei2023biot5,
title={Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations},
author={Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui},
journal={arXiv preprint arXiv:2310.07276},
year={2023}
}
Protein Description and Annotation Models
Current molecule-text models exhibit several limitations: generation of invalid molecular SMILES, underutilization of contextual information, and identical treatment of structured and unstructured knowledge. Prior models treat structured data (e.g., molecule-text pairs from databases) and unstructured data (e.g., free text from the literature) in the same manner, missing opportunities to exploit their distinct contributions. BioT5 is a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations, effectively a crossover between the two. The framework distinguishes structured data (database pairs from PubChem and SwissProt) from unstructured data (literature text). Unlike prior models that use SMILES, BioT5 employs SELFIES (Self-referencing Embedded Strings), which guarantees robust, chemically valid molecular representations (round-trip example below). Tasks include molecule captioning (generating text from SELFIES) and text-based molecule generation (generating SELFIES from text), both evaluated on the ChEBI-20 dataset.
Datasets: 33 million PubMed articles; 339,000 molecule-text pairs from PubChem; annotated samples from SwissProt.
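The SELFIES choice is easy to demonstrate with the selfies package (pip install selfies): every SELFIES string decodes to a syntactically valid molecule, which is the robustness property BioT5 relies on.

import selfies as sf

smiles = "C1=CC=CC=C1"                # benzene
s = sf.encoder(smiles)                # e.g. "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
back = sf.decoder(s)                  # decoding can never produce an invalid molecule
print(s, back)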
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
@article{pei2024biot5+,
title={Biot5+: Towards generalized biological understanding with iupac integration and multi-task tuning},
author={Pei, Qizhi and Wu, Lijun and Gao, Kaiyuan and Liang, Xiaozhuan and Fang, Yin and Zhu, Jinhua and Xie, Shufang and Qin, Tao and Yan, Rui},
journal={arXiv preprint arXiv:2402.17810},
year={2024}
}
Protein Description and Annotation Models
Existing biology LLMs do not incorporate IUPAC names, missing that information, and they struggle to generalize across diverse biological entities, particularly in how they process textual representations of bio-sequences, which limits their applicability to varied biological contexts. By pairing IUPAC names with SELFIES (a robust molecular string representation), BioT5+ bridges the gap between formal molecular structures and their textual descriptions, improving comprehension of molecular properties and functions. It incorporates bio-text from bioRxiv and PubMed and molecular data from PubChem, enriching its knowledge base and contextual understanding of biological entities. Metrics: Mean Absolute Error (MAE) for tasks like molecular property prediction, and BLEU for text generation.
Datasets: SELFIES paired with IUPAC names from PubChem; FASTA sequences from UniRef50.
ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
@inproceedings{xu2023protst,
title={Protst: Multi-modality learning of protein sequences and biomedical texts},
author={Xu, Minghao and Yuan, Xinyu and Miret, Santiago and Tang, Jian},
booktitle={International Conference on Machine Learning},
pages={38749--38767},
year={2023},
organization={PMLR}
}
Protein Description and Annotation Models
Adds functional knowledge, such as subcellular location, to protein sequence modeling. Biomedical texts are underutilized: textual descriptions of protein properties, widely available in databases like Swiss-Prot, contain rich functional information that existing PLMs do not exploit. The paper introduces ProtDescribe, a novel dataset pairing protein sequences with textual descriptions of their functions, and a multi-modal learning framework over sequence and text with a multimodal mask-prediction objective plus contrastive alignment of the two modalities (sketch after the architecture note below). Evaluations: protein localization prediction (accuracy for binary and subcellular localization), fitness landscape prediction, protein function annotation, and zero-shot text-to-protein classification.
A transformer-based model (e.g., ESM-1b) encodes protein sequences into contextualized representations, capturing sequence-based features. A transformer-based model (e.g., PubMedBERT) encodes textual descriptions into semantically meaningful representations.
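A minimal sketch of the contrastive half of such sequence-text training, written as a CLIP-style InfoNCE loss; the projection dimensions and batch are invented, and ProtST additionally uses multimodal mask prediction, which is not shown.

import torch
import torch.nn.functional as F

def info_nce(seq_emb, txt_emb, temperature=0.07):
    seq_emb = F.normalize(seq_emb, dim=-1)   # (B, d) from e.g. ESM-1b + projection
    txt_emb = F.normalize(txt_emb, dim=-1)   # (B, d) from e.g. PubMedBERT + projection
    logits = seq_emb @ txt_emb.T / temperature
    labels = torch.arange(len(logits))       # i-th sequence matches i-th description
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))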
Large language models generate functional protein sequences across diverse families
@article{madani2023large,
title={Large language models generate functional protein sequences across diverse families},
author={Madani, Ali and Krause, Ben and Greene, Eric R and Subramanian, Subu and Mohr, Benjamin P and Holton, James M and Olmos Jr, Jose Luis and Xiong, Caiming and Sun, Zachary Z and Socher, Richard and others},
journal={Nature biotechnology},
volume={41},
number={8},
pages={1099--1106},
year={2023},
publisher={Nature Publishing Group US New York}
}
Generative Models (Protein Decoder)
Structure-based de novo design methods depend on scarce experimental structural data and on complex biophysical simulations that can be computationally expensive or intractable. ProGen, introduced in this paper, instead generates protein sequences with predictable functions across diverse protein families, analogous to generating coherent text in NLP (conditional-sampling sketch below). The model was trained on a massive dataset of 280 million protein sequences from over 19,000 families, letting it learn broad sequence patterns. For lysozymes, the catalytic efficiency ($k_{cat}/K_M$) of generated proteins was measured and compared with natural lysozymes such as hen egg white lysozyme (HEWL) using assays like the EnzChek Lysozyme kit; structural agreement was assessed with root mean square deviation (RMSD).
Universal Protein Sequence Dataset
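A hedged sketch of control-tag-conditioned autoregressive sampling, the mechanism ProGen-style models use to steer generation toward a family; next_token_logits is a random stand-in for the trained model, and the tag string is hypothetical.

import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

def next_token_logits(prefix):
    return rng.standard_normal(len(AMINO_ACIDS))  # placeholder for the trained LM

def sample(control_tag, length=30, temperature=0.8):
    prefix = [control_tag]                  # e.g. a family keyword like "<lysozyme>"
    for _ in range(length):
        p = np.exp(next_token_logits(prefix) / temperature)
        p /= p.sum()
        prefix.append(rng.choice(AMINO_ACIDS, p=p))
    return "".join(prefix[1:])

print(sample("<lysozyme>"))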
ProtGPT2 is a deep unsupervised language model for protein design
@article{ferruz2022protgpt2,
title={ProtGPT2 is a deep unsupervised language model for protein design},
author={Ferruz, Noelia and Schmidt, Steffen and H{\"o}cker, Birte},
journal={Nature communications},
volume={13},
number={1},
pages={4348},
year={2022},
publisher={Nature Publishing Group UK London}
}
Generative Models (Protein Decoder)
A GPT-style model for proteins: prompt it and it generates new sequences (novel protein sequence generation). Simulations showed that ProtGPT2-generated structures are stable, with root mean square deviation (RMSD) distributions comparable to natural proteins.
De novo generation of SARS-CoV-2 antibody CDRH3 with a pre-trained generative large language model
@article{he2024novo,
title={De novo generation of SARS-CoV-2 antibody CDRH3 with a pre-trained generative large language model},
author={He, Haohuai and He, Bing and Guan, Lei and Zhao, Yu and Jiang, Feng and Chen, Guanxing and Zhu, Qingge and Chen, Calvin Yu-Chian and Li, Ting and Yao, Jianhua},
journal={Nature Communications},
volume={15},
number={1},
pages={6867},
year={2024},
publisher={Nature Publishing Group UK London}
}
Generative Models (Protein Decoder)
antibody design
Conventional antibody design relies heavily on isolating antigen-specific antibodies from serum, a process that is both time-consuming and resource-intensive. The rapid evolution of SARS-CoV-2, exemplified by variants such as XBB, demands faster and more adaptable methods for designing antibodies against new mutations, so this paper applies generative AI to antibody design. PALM-H3 is a pre-trained generative large language model that generates artificial heavy-chain complementarity-determining region 3 (CDRH3) sequences with the desired antigen-binding specificity. AbBinder is a high-precision companion model that pairs antigen epitope sequences with antibody sequences, predicting their binding specificity and affinity. Metrics: neutralization potency via the half-maximal inhibitory concentration (IC50), the antibody concentration required to neutralize 50% of the virus; and, for AbBinder, the correlation between predicted and actual binding affinities.
Datasets: Observed Antibody Space (OAS); fine-tuned on a SARS-CoV-2-specific dataset.
Integrating protein language models and automatic biofoundry for enhanced protein evolution
@article{zhang2025integrating,
title={Integrating protein language models and automatic biofoundry for enhanced protein evolution},
author={Zhang, Qiang and Chen, Wanyi and Qin, Ming and Wang, Yuhao and Pu, Zhongji and Ding, Keyan and Liu, Yuyue and Zhang, Qunfeng and Li, Dongfang and Li, Xinjia and others},
journal={Nature Communications},
volume={16},
number={1},
pages={1553},
year={2025},
publisher={Nature Publishing Group UK London}
}
Generative Models (Protein Decoder)
Traditional directed-evolution methods struggle particularly with enzymes that have complex functions or polyspecificity, and they are limited by the vastness of protein sequence space, requiring extensive experimental iterations. The Protein Language Model-enabled Automatic Evolution (PLMeAE) platform is a closed-loop system that integrates PLMs with an automatic biofoundry to automate the Design-Build-Test-Learn (DBTL) cycle. The platform demonstrates rapid protein evolution, completing four rounds in just 10 days and yielding tRNA synthetase mutants with up to 2.4-fold improved enzyme activity. Key indicators include the number of variants tested per round (96) and the total evolution time (10 days for four rounds), highlighting the platform's throughput and speed.
Datasets: UBC9 and ubiquitin mutational datasets. ESM-2 (specifically ESM2_t33_650M_UR50D) is used for zero-shot prediction of mutation effects and to compute amino acid probabilities at masked positions (scoring sketch below).
Architecture: Built on a transformer framework with self-attention mechanisms, ESM-2 captures long-range dependencies in protein sequences, leveraging pre-trained knowledge from vast evolutionary data.
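A sketch of zero-shot mutation scoring with that checkpoint via HuggingFace transformers: mask the mutated position and compare the log-probabilities of the mutant and wild-type residues. The sequence and mutation below are invented, and masked-marginal scoring is one common recipe, not necessarily the paper's exact procedure.

import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t33_650M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name).eval()

seq, pos, wt, mut = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10, "Q", "A"
masked = seq[:pos] + tok.mask_token + seq[pos + 1:]
inputs = tok(masked, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
mask_idx = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
logp = logits[0, mask_idx].log_softmax(-1)
score = (logp[tok.convert_tokens_to_ids(mut)] -
         logp[tok.convert_tokens_to_ids(wt)]).item()   # >0 favors the mutant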
ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning
@article{elnaggar2021prottrans,
title={Prottrans: Toward understanding the language of life through self-supervised learning},
author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and others},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={44},
number={10},
pages={7112--7127},
year={2021},
publisher={IEEE}
}
Generative Models (Protein Decoder)
Labeled data (e.g., experimentally determined protein structures) is scarce and costly, and there is a significant disparity between the abundance of known protein sequences and the limited number of experimentally determined structures. The authors therefore use self-supervised learning to extract meaningful representations from vast unlabeled protein sequence corpora, reducing dependency on annotations. Several large transformer language models (both auto-regressive and auto-encoding) are trained on very large datasets, and the resulting embeddings transfer to multiple tasks (extraction sketch below): secondary structure prediction, subcellular location prediction, and membrane vs. water-soluble classification. Metrics include 3-state accuracy (Q3), the percentage of residues correctly classified as helix, sheet, or coil.
BFD (Big Fantastic Database) - Contains 393 billion amino acids, making it one of the largest protein sequence corpora.
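A short example of the transfer-learning workflow these models enable: pull per-residue embeddings from a public ProtTrans checkpoint and hand them to a downstream head. Rostlab/prot_bert expects space-separated residues; the tiny sequence here is invented.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = AutoModel.from_pretrained("Rostlab/prot_bert").eval()

seq = "M K T A Y I A K Q R"                      # residues separated by spaces
inputs = tok(seq, return_tensors="pt")
with torch.no_grad():
    emb = model(**inputs).last_hidden_state       # (1, len+2, 1024) per-residue features
# emb[:, 1:-1] (dropping [CLS]/[SEP]) would feed e.g. a secondary-structure head.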
Artificial intelligence to solve the X-ray crystallography phase problem: a case study report
@article{barbarin2021artificial,
title={Artificial intelligence to solve the X-ray crystallography phase problem: a case study report},
author={Barbarin-Bocahu, Ir{\`e}ne and Graille, Marc},
journal={BioRxiv},
pages={2021--12},
year={2021},
publisher={Cold Spring Harbor Laboratory}
}
A case study of how AI is entering traditional structure determination. Molecular replacement (MR) is the primary technique used: a known protein structure serves as a search model to estimate phases for the target protein's diffraction data, and it typically requires a search model with significant structural similarity to the target. The underlying problem is that X-ray crystallography provides diffraction intensity data but lacks phase information, which is necessary to reconstruct electron density maps and determine atomic positions, and obtaining phases traditionally requires many additional experiments. The authors instead use AI-generated models from AlphaFold and RoseTTAFold as MR search models, bypassing the need for homologous experimental structures, and successfully solve the crystal structure of the Kluyveromyces lactis Nmd4 protein (KlNmd4). AI-generated models, particularly from AlphaFold, are accurate enough to serve as MR templates even when traditional methods fail, validating the integration of AI into crystallographic workflows. Metrics: the MOLREP contrast score (difference between the highest and mean solution scores after the translation-function search); the translation function Z-score, which assesses the statistical significance of MR solutions in PHASER (higher scores suggest greater reliability); and pLDDT, AlphaFold's predicted Local Distance Difference Test score, a per-residue confidence measure that is not a direct crystallographic metric but correlates with model quality. The KlNmd4 search model was retrieved from this database.
FID-Net: A versatile deep neural network architecture for NMR spectral reconstruction and virtual decoupling
Journal of Biomolecular NMR
@article{karunanithy2021fid,
title={FID-Net: A versatile deep neural network architecture for NMR spectral reconstruction and virtual decoupling},
author={Karunanithy, Gogulan and Hansen, D Flemming},
journal={Journal of biomolecular NMR},
volume={75},
number={4},
pages={179--191},
year={2021},
publisher={Springer}
}
CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks
@article{zhong2021cryodrgn,
title={CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks},
author={Zhong, Ellen D and Bepler, Tristan and Berger, Bonnie and Davis, Joseph H},
journal={Nature methods},
volume={18},
number={2},
pages={176--185},
year={2021},
publisher={Nature Publishing Group US New York}
}
This paper addresses cryo-electron microscopy (cryo-EM) reconstruction of heterogeneous structures. Techniques such as multibody refinement in RELION assume heterogeneity arises from rigid-body movements, which is insufficient for flexible molecules with non-rigid, continuous motions. CryoDRGN's algorithm learns a low-dimensional latent space that encodes the heterogeneity, allowing structural variability to be visualized and analyzed in an intuitive, reduced-dimensional framework; it reconstructs a continuous distribution of 3D density maps, and the trained model can generate density maps from arbitrary points in the latent space, enabling exploration of conformational trajectories and rare structural states. Metric: Fourier Shell Correlation (FSC), the standard cryo-EM resolution measure, compares reconstructed density maps against ground-truth maps (for simulated data) or consensus reconstructions (for real data); a numpy sketch follows the dataset line below.
Electron Microscopy Public Image Archive (EMPIAR)
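A minimal numpy sketch of FSC between two equal-size cubic maps, binning Fourier coefficients into integer-radius shells; thresholds like 0.143 are the usual convention for reading off resolution.

import numpy as np

def fsc(vol1, vol2):
    F1, F2 = np.fft.fftn(vol1), np.fft.fftn(vol2)
    n = vol1.shape[0]
    freq = np.fft.fftfreq(n) * n
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    shell = np.round(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int)
    curve = []
    for k in range(1, n // 2):
        m = shell == k
        num = np.sum(F1[m] * np.conj(F2[m]))
        den = np.sqrt(np.sum(np.abs(F1[m])**2) * np.sum(np.abs(F2[m])**2))
        curve.append((num / den).real)
    return np.array(curve)   # resolution is often read off where FSC drops below 0.143

v = np.random.rand(32, 32, 32)
print(fsc(v, v)[:3])         # identical maps give FSC = 1 in every shell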
CryoGAN: A New Reconstruction Paradigm for Single-Particle Cryo-EM Via Deep Adversarial Learning
@article{gupta2021cryogan,
title={CryoGAN: A new reconstruction paradigm for single-particle cryo-EM via deep adversarial learning},
author={Gupta, Harshit and McCann, Michael T and Donati, Laurene and Unser, Michael},
journal={IEEE Transactions on Computational Imaging},
volume={7},
pages={759--774},
year={2021},
publisher={IEEE}
}
Addresses limitations of traditional single-particle cryo-electron microscopy (cryo-EM) reconstruction, which struggles with heterogeneous samples (particles exhibiting structural variability) and with datasets containing significant noise and artifacts. CryoGAN is a new paradigm that uses deep adversarial learning to reconstruct 3D biomolecular structures: unlike traditional methods that depend on iterative refinement from an initial volume estimate, it directly learns to generate realistic 3D volumes from 2D projections. A standout feature is that CryoGAN initializes with a zero-valued volume rather than a preliminary low-resolution estimate, simplifying the reconstruction process and potentially reducing bias from initial models. Evaluation: Fourier Shell Correlation (FSC), the widely used cryo-EM resolution metric, plus visual comparisons of CryoGAN's 3D maps against those from traditional methods.
Deep learning-based mixed-dimensional Gaussian mixture model for characterizing variability in cryo-EM
@article{chen2021deep,
title={Deep learning-based mixed-dimensional Gaussian mixture model for characterizing variability in cryo-EM},
author={Chen, Muyuan and Ludtke, Steven J},
journal={Nature methods},
volume={18},
number={8},
pages={930--936},
year={2021},
publisher={Nature Publishing Group US New York}
}
Tackles reconstruction of heterogeneous structures from cryo-EM datasets, where variability can be both continuous and discrete, aiming to improve reconstruction and interpretation of complex biological systems such as assembling ribosomes and the SARS-CoV-2 spike protein. A neural network architecture is combined with a mixed-dimensional Gaussian mixture model to capture the heterogeneity present in the data used by traditional methods. It performs unsupervised classification of particles into conformational or compositional states without prior knowledge of the number or nature of those states, then reconstructs a 3D structure for each identified state. Evaluation: 3D structures are visually inspected for biological plausibility and for differences between states, plus classification accuracy.
Datasets: EMPIAR-10076 (ribosome assembly), EMPIAR-10180 (spliceosome), EMPIAR-10493 (SARS-CoV-2 spike protein).
Protein structure generation via folding diffusion
@article{wu2024protein,
title={Protein structure generation via folding diffusion},
author={Wu, Kevin E and Yang, Kevin K and van den Berg, Rianne and Alamdari, Sarah and Zou, James Y and Lu, Alex X and Amini, Ava P},
journal={Nature communications},
volume={15},
number={1},
pages={1059},
year={2024},
publisher={Nature Publishing Group UK London}
}
The motivation is to computationally generate novel yet physically foldable protein structures, which could unlock significant advances in biological discovery and therapeutic development; novel structures could reveal new biological mechanisms or pathways, potentially improving our understanding of diseases such as Alzheimer's, Parkinson's, Huntington's, and cystic fibrosis. The authors introduce FoldingDiff, a generative model that runs a diffusion process over protein backbone dihedral angles rather than Cartesian coordinates. This shift- and rotation-invariant internal-angle representation eliminates the need for complex equivariant networks and aligns with the biophysical picture of protein folding (angle-noising sketch after the architecture note below). Evaluation: visual inspection of generated proteins; Ramachandran plots, which visualize the co-occurrence of $\phi$ and $\psi$ dihedral angles and identify secondary structure elements like $\alpha$-helices and $\beta$-sheets; and structural similarity via TM-score.
generative model employs a simple bidirectional transformer with relative positional embeddings as its backbone.
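A sketch of the internal-angle trick: dihedrals live on a torus, so each forward diffusion step must wrap its noise back into [-pi, pi). The schedule value and shapes below are invented; this mirrors a generic DDPM-style forward step, not FoldingDiff's exact code.

import numpy as np

def wrap(theta):
    return np.mod(theta + np.pi, 2 * np.pi) - np.pi   # map any angle into [-pi, pi)

def noise_step(angles, beta):
    eps = np.random.randn(*angles.shape) * np.sqrt(beta)
    return wrap(np.sqrt(1.0 - beta) * angles + eps)   # one DDPM-style forward step

phi_psi = np.random.uniform(-np.pi, np.pi, size=(128, 2))  # (residues, [phi, psi])
noised = noise_step(phi_psi, beta=0.02)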
Equivariant 3D-conditional diffusion model for molecular linker design
@article{igashov2024equivariant,
title={Equivariant 3D-conditional diffusion model for molecular linker design},
author={Igashov, Ilia and St{\"a}rk, Hannes and Vignac, Cl{\'e}ment and Schneuing, Arne and Satorras, Victor Garcia and Frossard, Pascal and Welling, Max and Bronstein, Michael and Correia, Bruno},
journal={Nature Machine Intelligence},
volume={6},
number={4},
pages={417--427},
year={2024},
publisher={Nature Publishing Group UK London}
}
Drug design: a linker is designed to connect molecular fragments.
The setting is fragment-based drug discovery (FBDD), a key strategy in early-stage drug development: small molecular fragments are identified and then connected with linkers to create larger, pharmacologically relevant molecules.
Goals: automate linker design, generating chemically valid linkers that connect an arbitrary number of fragments (many prior approaches only link pairs), and enable conditional generation based on conditions supplied to the generator. DiffLinker is an E(3)-equivariant 3D-conditional diffusion model tailored for molecular linker design; it can link an arbitrary number of fragments, broadening its applicability to complex drug design scenarios, and it can condition on protein pocket information, yielding linkers tailored to specific biological targets. Metrics: Quantitative Estimate of Drug-likeness (QED), Synthetic Accessibility (SA), and validity of the designed linker.
E(3)-Equivariant Graph Neural Network (EGNN): The core denoising model is an EGNN, which processes the molecular graph (atoms as nodes, bonds as edges)
Learning from Protein Structure with Geometric Vector Perceptrons
@article{jing2020learning,
title={Learning from protein structure with geometric vector perceptrons},
author={Jing, Bowen and Eismann, Stephan and Suriana, Patricia and Townshend, Raphael JL and Dror, Ron},
journal={arXiv preprint arXiv:2009.01411},
year={2020}
}
Structural representation learning: learning representations of protein structure.
The goal is a single network architecture that bridges the strengths of graph neural networks (GNNs), which excel at relational reasoning, and convolutional neural networks (CNNs), which are adept at geometric reasoning. Geometric Vector Perceptrons (GVPs) are combined with GNN architectures, with proofs that GVPs maintain equivariance for vector outputs and invariance for scalar outputs under 3D rotations and reflections (minimal sketch below). The architecture also enables visualization of learned vector features, offering some interpretability of the model's decisions. Metrics: perplexity, which measures how well the model predicts amino acid sequences given a structure (analogous to uncertainty in language modeling; lower is better), and Model Quality Assessment (MQA).
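A minimal numpy sketch of a single GVP: vector channels are transformed linearly (equivariant), their norms feed the scalar channel (invariant), and only the scalar path gets a pointwise nonlinearity. Channel sizes are arbitrary, and the vector gating used in later GVP variants is omitted.

import numpy as np

def gvp(s, V, Wh, Wmu, Ws):
    # s: (ns,) scalar features; V: (nv, 3) vector features; weights are plain matrices.
    Vh = Wh @ V                                   # (h, 3) rotates with the input
    vnorm = np.linalg.norm(Vh, axis=-1)           # (h,) rotation-invariant norms
    s_out = np.maximum(Ws @ np.concatenate([s, vnorm]), 0.0)   # ReLU scalar path
    V_out = Wmu @ Vh                              # (mu, 3) equivariant vector path
    return s_out, V_out

ns, nv, h, mu = 8, 4, 6, 5
rng = np.random.default_rng(0)
s_out, V_out = gvp(rng.standard_normal(ns), rng.standard_normal((nv, 3)),
                   rng.standard_normal((h, nv)), rng.standard_normal((mu, h)),
                   rng.standard_normal((ns, ns + h)))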
EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
@article{stark2022equibind,
title={EquiBind: Geometric deep learning for drug binding structure prediction},
author={St{\"a}rk, Hannes and Ganea, Octavian-Eugen and Pattanaik, Lagnajit and Barzilay, Regina and Jaakkola, Tommi},
journal={arXiv preprint arXiv:2202.05146},
year={2022}
}
Drug binding: how drug-like molecules bind to proteins.
Addresses the challenge of predicting how drug-like molecules (ligands) bind to specific protein targets (receptors), a core problem in drug discovery; traditional docking methods are computationally expensive. EquiBind is an SE(3)-equivariant geometric deep learning model that predicts binding structures directly, offering a much faster route through this part of drug discovery. It incorporates an efficient mechanism for ligand flexibility, adjusting torsion angles of rotatable bonds while preserving local structure (bond lengths and angles) to ensure chemical realism; a torsion-rotation sketch follows. Metric: root mean square deviation (RMSD) of the predicted ligand pose.
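A sketch of why torsion updates preserve chemistry: rotating all atoms downstream of a rotatable bond about the bond axis (Rodrigues' formula) changes the conformer but leaves bond lengths and angles intact. Atom indices here are invented, and EquiBind's actual update is learned; this only shows the geometric operation.

import numpy as np

def rotate_about_bond(coords, i, j, downstream, angle):
    # coords: (N, 3); bond i-j is the axis; `downstream` indexes atoms past j.
    axis = coords[j] - coords[i]
    k = axis / np.linalg.norm(axis)
    p = coords[downstream] - coords[j]            # work in the bond-end frame
    rot = (p * np.cos(angle)
           + np.cross(k, p) * np.sin(angle)
           + np.outer(p @ k, k) * (1 - np.cos(angle)))   # Rodrigues' rotation
    out = coords.copy()
    out[downstream] = rot + coords[j]
    return out

xyz = np.random.rand(6, 3)
new = rotate_about_bond(xyz, 1, 2, [3, 4, 5], np.pi / 3)
# distances within {3,4,5} and to atom 2 are preserved; only the torsion changed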
State-specific protein–ligand complex structure prediction with a multiscale deep generative model
@article{qiao2024state,
title={State-specific protein--ligand complex structure prediction with a multiscale deep generative model},
author={Qiao, Zhuoran and Nie, Weili and Vahdat, Arash and Miller III, Thomas F and Anandkumar, Animashree},
journal={Nature Machine Intelligence},
volume={6},
number={2},
pages={195--208},
year={2024},
publisher={Nature Publishing Group UK London}
}
Tools like AlphaFold2 (AF2) excel at predicting static protein structures but often fail to model the dynamic conformational changes induced by ligand binding. The goal is a method that directly predicts the 3D structures of protein-ligand complexes, including the ligand's binding pose and the protein's state-specific conformational changes. NeuralPLexer is a novel multiscale deep generative model that predicts protein-ligand complex structures directly from protein sequences and ligand molecular graphs, incorporating biophysical constraints as inductive biases. Case studies on biologically relevant targets demonstrate practical utility, suggesting potential to accelerate drug discovery and enzyme engineering. Metrics: ligand root mean square deviation (RMSD); lDDT-BS (binding-site local Distance Difference Test), which evaluates the accuracy of the protein's binding-site structure against the reference; and pLDDT, a per-residue confidence score for gauging prediction reliability.
DiSCO: Diffusion Schrodinger Bridge for Molecular Conformer Optimization
@inproceedings{lee2024disco,
title={Disco: Diffusion schr{\"o}dinger bridge for molecular conformer optimization},
author={Lee, Danyeong and Lee, Dohoon and Bang, Dongmin and Kim, Sun},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={12},
pages={13365--13373},
year={2024}
}
Deep learning models for generating energetically optimal 3D molecular conformers operate in Euclidean space, which is vast and hard to search, so current models make simplifications. DiSCO is a framework that uses a diffusion Schrödinger bridge to adjust the Euclidean coordinates of pre-existing conformers, aligning their approximate distribution with the ground-truth energy landscape. An SE(3)-equivariant design inside the Schrödinger bridge ensures roto-translational equivariance, a critical property for accurate conformer generation. Metrics: ensemble root mean square deviation (RMSD) and mean absolute error (MAE).
SE(3)-equivariant neural network
DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model
@article{lu2024dynamicbind,
title={DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model},
author={Lu, Wei and Zhang, Jixian and Huang, Weifeng and Zhang, Ziqiao and Jia, Xiangyu and Wang, Zhenyu and Shi, Leilei and Li, Chengtao and Wolynes, Peter G and Zheng, Shuangjia},
journal={Nature Communications},
volume={15},
number={1},
pages={1071},
year={2024},
publisher={Nature Publishing Group UK London}
}
Existing protein structure methods are static: they cannot capture the dynamic nature of proteins, so conventional docking treats proteins as rigid or only partially flexible, which makes binding prediction problematic. DynamicBind is a novel deep learning framework that predicts ligand-specific protein-ligand complex structures by dynamically adjusting protein conformations from initial predictions (e.g., AlphaFold) toward ligand-bound (holo-like) states. The model efficiently manages substantial conformational shifts, such as the DFG-in to DFG-out transition in kinases. Unlike system-specific Boltzmann generators, DynamicBind is a generalizable model that can predict structures for new proteins and ligands, and it does not rely on holo structures or predefined binding pockets. Metrics: ligand RMSD (root mean square deviation); a clash score quantifying steric clashes between protein and ligand; contact-LDDT (cLDDT), inspired by AlphaFold's LDDT, which assesses conservation of intermolecular protein-ligand contacts; and, for virtual screening, binding-affinity prediction measured by the area under the receiver operating characteristic curve (auROC).
SE(3)-equivariant neural network
An E(3)-equivariant representation of the protein is combined with a diffusion process that generates the new ligand-bound structures.
DiffBP: generative diffusion of 3D molecules for target protein binding
Royal Society of Chemistry
@article{lin2025diffbp,
title={Diffbp: Generative diffusion of 3d molecules for target protein binding},
author={Lin, Haitao and Huang, Yufei and Zhang, Odin and Ma, Siqi and Liu, Meng and Li, Xuanjing and Wu, Lirong and Wang, Jishui and Hou, Tingjun and Li, Stan Z},
journal={Chemical Science},
volume={16},
number={3},
pages={1417--1431},
year={2025},
publisher={Royal Society of Chemistry}
}
Generating molecules that bind effectively to specific proteins is a critical yet challenging task in drug discovery; traditional methods struggle to produce molecules with high affinity and appropriate properties. Previous generative models are sequential and autoregressive, but the authors argue that molecular probability should be modeled with joint distributions rather than sequential conditional ones, since global interactions dictate molecular behavior. DiffBP is a generative diffusion model that produces 3D molecular structures tailored for target protein binding, generating the entire molecule in one go by denoising both atom types and 3D coordinates (an EGNN-style denoiser sketch follows the dataset notes below). Metrics: MPBG (mean percentage binding gap); ΔBinding (the difference in binding scores); QED (quantitative estimate of drug-likeness); and LPSK, the ratio of generated molecules satisfying Lipinski's rule of five, a standard drug-likeness heuristic.
CrossDocked2020 Dataset
Equivariant Graph Neural Network (EGNN): The denoising step utilizes an EGNN, which ensures that the model respects the symmetries of 3D space
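A minimal numpy sketch of one EGNN message-passing step, the denoiser family these pocket-conditioned diffusion models use: features update from invariant inputs (squared distances), while coordinates move along relative-position vectors, so the update commutes with rotations and translations. Weights and sizes are toy values, not the paper's architecture.

import numpy as np

def egnn_step(h, x, w_e, w_h, w_x):
    # h: (N, d) atom features; x: (N, 3) coordinates; w_*: toy weight matrices.
    n, d = h.shape
    m_agg, dx = np.zeros((n, d)), np.zeros_like(x)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)              # rotation/translation invariant
            m = np.tanh(w_e @ np.concatenate([h[i], h[j], [d2]]))
            m_agg[i] += m
            dx[i] += (x[i] - x[j]) * (w_x @ m).item()    # scale along relative vector
    h_new = np.tanh(np.concatenate([h, m_agg], axis=1) @ w_h.T)
    return h_new, x + dx                                  # coordinate update is equivariant

n, d = 5, 8
rng = np.random.default_rng(1)
h, x = rng.standard_normal((n, d)), rng.standard_normal((n, 3))
w_e, w_h, w_x = (rng.standard_normal((d, 2 * d + 1)),
                 rng.standard_normal((d, 2 * d)),
                 rng.standard_normal((1, d)))
h2, x2 = egnn_step(h, x, w_e, w_h, w_x)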
Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation
@article{jin2023unsupervised,
title={Unsupervised protein-ligand binding energy prediction via neural euler's rotation equation},
author={Jin, Wengong and Sarkizova, Siranush and Chen, Xun and Hacohen, Nir and Uhler, Caroline},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={33514--33528},
year={2023}
}
Binding energy prediction across ligand types, including antibodies.
Supervised binding-affinity models work for small molecules, but applying the same strategy to other ligand types, such as antibodies, is challenging because labeled data is scarce: the largest antibody binding-affinity dataset contains only 566 data points, insufficient for training effective supervised models, and existing supervised models tailored to one ligand type generalize poorly to others due to structural differences. The authors reformulate binding energy prediction as a generative modeling task, training an energy-based model (EBM) on unlabeled protein-ligand complexes from the Protein Data Bank, which makes the approach unsupervised. Neural Euler's Rotation Equation (NERE) is a novel equivariant rotation prediction network used for SE(3) denoising score matching (DSM). Evaluation: Pearson correlation between predicted and experimental binding energies, with comparisons against unsupervised baselines such as MM/GBSA, ESM-1v, and ESM-IF.
Structural Antibody Database (SAbDab)
Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem
@article{trippe2022diffusion,
title={Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem},
author={Trippe, Brian L and Yim, Jason and Tischer, Doug and Baker, David and Broderick, Tamara and Barzilay, Regina and Jaakkola, Tommi},
journal={arXiv preprint arXiv:2206.04119},
year={2022}
}
A major part of protein design is building a scaffold (a supporting protein backbone) that stabilizes and correctly positions functional motifs while ensuring the entire structure folds into a stable, functional form. The authors present ProtDiff, a diffusion probabilistic model (DPM) that generates realistic 3D protein backbone structures and, unlike previous methods, operates directly in 3D coordinate space. A second procedure, SMCDiff, generates scaffolds conditioned on a given motif: using a sequential Monte Carlo (SMC) approach, it samples from the conditional distribution, effectively handling the intricate dependencies within protein structures. Together, ProtDiff and SMCDiff form a robust framework for motif-scaffolding. Metrics: self-consistency TM-score (scTM) and root-mean-square deviation (RMSD).
Filtering and Preprocessing
Structure-based Drug Design with Equivariant Diffusion Models
@article{schneuing2024structure,
title={Structure-based drug design with equivariant diffusion models},
author={Schneuing, Arne and Harris, Charles and Du, Yuanqi and Didi, Kieran and Jamasb, Arian and Igashov, Ilia and Du, Weitao and Gomes, Carla and Blundell, Tom L and Lio, Pietro and others},
journal={Nature Computational Science},
volume={4},
number={12},
pages={899--909},
year={2024},
publisher={Nature Publishing Group}
}
Conventional SBDD methods rely on high-throughput screening of large chemical databases, which is expensive and time-consuming, and restricts exploration to previously studied, often commercially available molecules. Prior generative models were either not conditioned on protein structure or were autoregressive, imposing an artificial sequential ordering on generation that fails to capture the global context of molecular structures, which is critical for modeling ligand-protein interactions; there is a growing demand for holistic, versatile models that place all atoms of a ligand simultaneously. DiffSBDD is an SE(3)-equivariant diffusion model that generates novel ligands conditioned on the 3D structure of protein pockets and supports multiple SBDD tasks with a single pretrained model: de novo design, property optimization, explicit negative design, and partial molecular design (inpainting). An SE(3)-equivariant graph neural network (EGNN) ensures the generated structures respect the symmetries of 3D space, which is crucial for physical realism. Metrics: QED (quantitative estimate of drug-likeness) and SA (synthetic accessibility, where higher scores indicate greater synthetic feasibility).
SE(3)-Equivariant Graph Neural Network (EGNN)
Aligning Protein Conformation Ensemble Generation with Physical Feedback
@inproceedings{lu2025aligning,
title={Aligning Protein Conformation Ensemble Generation with Physical Feedback},
author={Jiarui Lu and Xiaoyin Chen and Stephen Zhewen Lu and Aurelie Lozano and Vijil Chenthamarakshan and Payel Das and Jian Tang},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
month={Jul.},
url={https://openreview.net/forum?id=Asr955jcuZ}
}
Diffusion on Language Model Encodings for Protein Sequence Generation
Enhancing Ligand Validity and Affinity in Structure-Based Drug Design with Multi-Reward Optimization