Yijia-Xiao/Protein-LLM-Survey

Large Language Models in Protein: A Comprehensive Survey

Awesome Survey License: MIT

LLM Methods for Protein Understanding and Prediction

Protein Sequence Models

| Paper | Published in | Resources |
| --- | --- | --- |
| Unified rational protein engineering with sequence-based deep representation learning | Nature Methods, 2019 | Code |
| Learning protein sequence embeddings using information from structure | ICLR, 2019 | Code |
| Mutation effect estimation on protein–protein interactions using deep contextualized representation learning | NAR Genomics and Bioinformatics, 2020 | Code |
| ProtTrans: Toward understanding the language of life through self-supervised learning | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021 | Code |
| Modeling protein using large-scale pretrain language model | arXiv preprint, 2021 | Code |
| Single-sequence protein structure prediction using a language model and deep learning | Nature Biotechnology, 2022 | Code |
| BERTology meets biology: Interpreting attention in protein language models | arXiv preprint, 2020 | Code |
| Learning the language of viral evolution and escape | Science, 2021 | Code |
| TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses | bioRxiv, 2021 | Code |
| xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein | arXiv e-prints, 2024 | Model&Data |
| Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning | Nature Communications, 2023 | Code |
| Enzyme function prediction using contrastive learning | Science, 2023 | Code |

Evolutionary Scale Modeling (ESM) Series

| Paper | Published in | Resources |
| --- | --- | --- |
| Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences | PNAS, 2021 | Code |
| Language models enable zero-shot prediction of the effects of mutations on protein function | Advances in Neural Information Processing Systems, 2021 | Code |
| Learning inverse folding from millions of predicted structures | ICML, 2022 | Code |
| Evolutionary-scale prediction of atomic-level protein structure with a language model | Science, 2023 | Code |
| Simulating 500 million years of evolution with a language model | bioRxiv, 2024 | Code |

MSA-based Models

| Paper | Published in | Resources |
| --- | --- | --- |
| MSA transformer | ICML, 2021 | Code |
| Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval | ICML, 2022 | Code |
| Leveraging protein language models for accurate multiple sequence alignments | Genome Research, 2023 | Code |
| PoET: A generative model of protein families as sequences-of-sequences | NeurIPS, 2023 | Code |
| Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes | Nature Communications, 2023 | Code |

Structure-Integrated Models

| Paper | Published in | Resources |
| --- | --- | --- |
| A systematic study of joint representation learning on protein sequences and structures | arXiv preprint, 2023 | Code |
| SaProt: Protein language modeling with structure-aware vocabulary | bioRxiv, 2023 | Code |
| Simple, Efficient and Scalable Structure-aware Adapter Boosts Protein Language Models | arXiv preprint, 2024 | Code |
| Multi-level Protein Structure Pre-training via Prompt Learning | ICLR, 2023 | Code |
| Structure-informed protein language models are robust predictors for variant effects | Human Genetics, 2024 | N/A |
| Integration of pre-trained protein language models into geometric deep learning networks | Communications Biology, 2023 | Code |
| Structure-Informed Protein Language Model | arXiv preprint, 2024 | Code |
| S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure | Advanced Science, 2024 | Code |
| CCPL: Cross-modal Contrastive Protein Learning | Pattern Recognition, 2024 | N/A |

Knowledge-Enhanced Models

| Paper | Published in | Resources |
| --- | --- | --- |
| OntoProtein: Protein Pretraining With Gene Ontology Embedding | ICLR, 2022 | Code |
| ProteinCLIP: enhancing protein language models with natural language | bioRxiv, 2024 | Code |
| ProteinBERT: a universal deep-learning model of protein sequence and function | Bioinformatics, 2022 | Code |
| Protein Representation Learning via Knowledge Enhanced Primary Structure Reasoning | ICLR, 2023 | Code |
| MolBind: Multimodal Alignment of Language, Molecules, and Proteins | arXiv preprint, 2024 | N/A |

Protein Description and Annotation Models

| Paper | Published in | Resources |
| --- | --- | --- |
| Prot2Text: Multimodal protein’s function generation with GNNs and transformers | AAAI, 2024 | Code |
| ProTranslator: zero-shot protein function prediction using textual description | International Conference on Research in Computational Molecular Biology, 2022 | Code |
| Multilingual translation for zero-shot biomedical classification using BioTranslator | Nature Communications, 2023 | Code |
| BioT5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations | EMNLP, 2023 | Code |
| BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning | ACL, 2024 | Code |
| ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts | ICML, 2023 | Code |
| ProtChatGPT: Towards Understanding Proteins with Large Language Models | arXiv, 2024 | N/A |
| ProteinChat: Towards Achieving ChatGPT-Like Functionalities on Protein 3D Structures | TechRxiv, 2023 | N/A |

LLM Methods for Protein Engineering and Generation

Generative Models (Protein Decoder)

| Paper | Published in | Resources |
| --- | --- | --- |
| Large language models generate functional protein sequences across diverse families | Nature Biotechnology, 2023 | Code |
| ProtGPT2: Deep Unsupervised Language Model for Protein Design | Nature Communications, 2022 | Code |
| ProGen2: Exploring the Boundaries of Protein Language Models | Cell Systems, 2023 | Code |
| IgLM: Infilling Language Modeling for Antibody Sequence Design | Cell Systems, 2023 | Code |
| PALM-H3: Targeted Antibody Generation for SARS-CoV-2 | Nature Communications, 2024 | Code |
| Integrating protein language models and automatic biofoundry for enhanced protein evolution | Nature Communications, 2025 | Code |

Protein Encoder Models

| Paper | Published in | Resources |
| --- | --- | --- |
| ProtST: Multi-modality Learning of Protein Sequences and Biomedical Texts | ICML, 2023 | Code |
| ProteinBERT: a universal deep-learning model of protein sequence and function | Bioinformatics, 2022 | Code |
| BERTology meets biology: Interpreting attention in protein language models | arXiv preprint, 2020 | Code |
| ProtTrans: Toward understanding the language of life through self-supervised learning | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021 | Code |
| Modeling protein using large-scale pretrain language model | arXiv preprint, 2021 | Code |

Encoder-Decoder Models

| Paper | Published in | Resources |
| --- | --- | --- |
| ProstT5: Bilingual Modeling of Protein Sequence and Structure | bioRxiv, 2023 | Code |
| Fold2Seq: A Joint Sequence–Fold Embedding-based Generative Model for Protein Design | ICML, 2021 | Code |
| Ankh: Optimized Protein Language Model for Efficient Generation | arXiv, 2023 | Code |

Interactive and Multimodal Models

| Paper | Published in | Resources |
| --- | --- | --- |
| ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding | arXiv, 2024 | Code |
| ProteinChat: ChatGPT-like Functionalities on Protein 3D Structures | Authorea Preprints, 2023 | Code |
| ProtChatGPT: Towards Understanding Proteins with Large Language Models | arXiv, 2024 | Code |
| ProteinDT: A Text-guided Protein Design Framework | arXiv, 2023 | Code |

Traditional Experimental Methods for Protein

X-ray Crystallography

| Paper | Published in | Resources |
| --- | --- | --- |
| Artificial intelligence to solve the X-ray crystallography phase problem: a case study report | bioRxiv, 2021 | N/A |

Nuclear Magnetic Resonance (NMR) Spectroscopy

| Paper | Published in | Resources |
| --- | --- | --- |
| FID-Net: A versatile deep neural network architecture for NMR spectral reconstruction and virtual decoupling | Journal of Biomolecular NMR, 2021 | Code |
| Accelerated Nuclear Magnetic Resonance Spectroscopy with Deep Learning | Angewandte Chemie, 2020 | Code |

Cryo-EM

| Paper | Published in | Resources |
| --- | --- | --- |
| CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks | Nature Methods, 2021 | Code |
| CryoGAN: A New Reconstruction Paradigm for Single-Particle Cryo-EM Via Deep Adversarial Learning | IEEE Transactions on Computational Imaging, 2021 | Code |
| Deep learning-based mixed-dimensional Gaussian mixture model for characterizing variability in cryo-EM | Nature Methods, 2021 | Code |
| 3DFlex: determining structure and motion of flexible proteins from cryo-EM | Nature Methods, 2023 | Code |
| CryoSTAR: leveraging structural priors and constraints for cryo-EM heterogeneous reconstruction | Nature Methods, 2024 | Code |

Protein Datasets: Training Data and Benchmarks

Pretraining Dataset

| Dataset Name | Description | Resources |
| --- | --- | --- |
| UniProtKB/Swiss-Prot | Manually curated protein database with detailed functional annotations | Link |
| UniProtKB/TrEMBL | Automatically annotated protein database with computational analysis | Link |
| UniRef Clusters | Clustered protein sequences for reduced redundancy and efficient searches | Link |
| Pfam | Database of protein families and domains | Link |
| PDB | Database of 3D structures of biological macromolecules | Link |
| BFD | Large database of clustered protein sequences | Link |
| UniParc | Non-redundant archive of protein sequences from public databases | Link |
| PIR | Comprehensive annotated protein sequence database | Link |
| AlphaFoldDB | Database of predicted protein structures using AI | Link |

Benchmark

| Paper | Published in | Resources |
| --- | --- | --- |
| Critical assessment of methods of protein structure prediction (CASP)—Round XV | Proteins: Structure, Function, and Bioinformatics | Link |
| ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design | NeurIPS, 2023 | Code |
| Evaluating protein transfer learning with TAPE | NeurIPS, 2019 | Code |
| CATH–a hierarchic classification of protein domain structures | Structure, 1997 | Link |
| PEER: a comprehensive and multi-task benchmark for protein sequence understanding | NeurIPS, 2022 | Code |
| ExplorEnz: the primary source of the IUBMB enzyme list | Nucleic Acids Research, 2009 | Link |
| HIPPIE: Integrating protein interaction networks with experiment based quality scores | PLoS One, 2012 | Link |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | arXiv, 2024 | Code |
