[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-agemagician--ProtTrans":3,"tool-agemagician--ProtTrans":65},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 
既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[13,37],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":83,"owner_website":79,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":99,"env_deps":100,"category_tags":107,"github_topics":79,"view_count":108,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":109,"updated_at":110,"faqs":111,"releases":141},164,"agemagician\u002FProtTrans","ProtTrans","ProtTrans is providing state of the art pretrained language models for proteins. 
ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformers Models.","ProtTrans 是一套专为蛋白质序列设计的预训练语言模型，借助强大的 Transformer 架构，在 Summit 超算的数千块 GPU 和 Google 的数百个 TPU 上完成训练。它把蛋白质序列当作“生物语言”来理解，帮助科研人员从海量未标注数据中自动学习蛋白质结构与功能的深层特征，从而省去大量人工标注成本。无论是预测蛋白质结构、功能位点，还是生成新序列或辅助药物研发（如新冠相关研究），ProtTrans 都能提供高质量的向量表示和迁移学习基础。适合生物信息学研究人员、计算生物学家及 AI+生命科学交叉领域的开发者使用，尤其对希望用深度学习加速蛋白分析但缺乏大规模标注数据的团队非常友好。技术亮点包括支持多种下游任务（如残基级预测、蛋白分类、序列生成）、提供可视化注意力机制，并持续更新适配 LoRA 微调等高效训练方法。模型已集成到 Hugging Face 平台，安装便捷，部分预计算嵌入甚至可通过 UniProt 直接下载使用。","\u003Cbr\u002F>\n\u003Ch1 align=\"center\">ProtTrans\u003C\u002Fh1>\n\u003Cbr\u002F>\n\n\u003Cbr\u002F>\n\n[ProtTrans](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002F) is providing **state of the art pre-trained models for proteins**. ProtTrans was trained on **thousands of GPUs from Summit** and **hundreds of Google TPUs** using various **Transformer models**.\n\nHave a look at our paper [ProtTrans: cracking the language of life’s code through self-supervised deep learning and high performance computing](https:\u002F\u002Fdoi.org\u002F10.1109\u002FTPAMI.2021.3095381) for more information about our work. 
\n\n\u003Cbr\u002F>\n\u003Cp align=\"center\">\n    \u003Cimg width=\"70%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_aa204f208b8a.png\" alt=\"ProtTrans Attention Visualization\">\n\u003C\u002Fp>\n\u003Cbr\u002F>\n\n\nThis repository will be updated regularly with **new pre-trained models for proteins** as part of supporting the **bioinformatics** community in general, and **Covid-19 research** specifically through our [Accelerate SARS-CoV-2 research with transfer learning using pre-trained language modeling models](https:\u002F\u002Fcovid19-hpc-consortium.org\u002Fprojects\u002F5ed56e51a21132007ebf57bf) project.\n\nTable of Contents\n=================\n* [ ⌛️&nbsp; News](#news)\n* [ 🚀&nbsp; Installation](#install)\n* [ 🚀&nbsp; Quick Start](#quick)\n* [ ⌛️&nbsp; Models Availability](#models)\n* [ ⌛️&nbsp; Dataset Availability](#datasets)\n* [ 🚀&nbsp; Usage ](#usage)\n  * [ 🧬&nbsp; Feature Extraction (FE)](#feature-extraction)\n  * [ 🚀&nbsp; Logits extraction](#logits-extraction)\n  * [ 💥&nbsp; Fine Tuning (FT)](#fine-tuning)\n  * [ 🧠&nbsp; Prediction](#prediction)\n  * [ ⚗️&nbsp; Protein Sequences Generation ](#protein-generation)\n  * [ 🧐&nbsp; Visualization ](#visualization)\n  * [ 📈&nbsp; Benchmark ](#benchmark)\n* [ 📊&nbsp; Original downstream Predictions  ](#results)\n* [ 📊&nbsp; Followup use-cases  ](#inaction)\n* [ 📊&nbsp; Comparisons to other tools ](#comparison)\n* [ ❤️&nbsp; Community and Contributions ](#community)\n* [ 📫&nbsp; Have a question? ](#question)\n* [ 🤝&nbsp; Found a bug? 
](#bug)\n* [ ✅&nbsp; Requirements ](#requirements)\n* [ 🤵&nbsp; Team ](#team)\n* [ 💰&nbsp; Sponsors ](#sponsors)\n* [ 📘&nbsp; License ](#license)\n* [ ✏️&nbsp; Citation ](#citation)\n\n\n\u003Ca name=\"news\">\u003C\u002Fa>\n## ⌛️&nbsp; News\n* **2025\u002F01\u002F22: [Continue pre-training & evo-tuning]( https:\u002F\u002Fgithub.com\u002FRSchmirler\u002FProtT5-EvoTuning\u002Ftree\u002Fmain?tab=readme-ov-file ) shows how to continue pre-training of ProtT5 on new protein sequences using ProtT5's original pre-training task. This includes continued pre-training on a set of homologous sequences (aka evo-tuning).**\n* 2023\u002F07\u002F14: [FineTuning with LoRA]( https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning) provides notebooks on how to fine-tune ProtT5 on both per-residue and per-protein tasks, using Low-Rank Adaptation (LoRA) for efficient fine-tuning (thanks @0syrys!).\n* 2022\u002F11\u002F18: Availability: [LambdaPP](https:\u002F\u002Fembed.predictprotein.org\u002F) offers a simple web-service to access ProtT5-based predictions and UniProt now offers to download [pre-computed ProtT5 embeddings](https:\u002F\u002Fwww.uniprot.org\u002Fhelp\u002Fembeddings) for a subset of selected organisms. 
\n\n\u003Ca name=\"install\">\u003C\u002Fa>\n## 🚀&nbsp; Installation\nAll our models are available via huggingface\u002Ftransformers:\n```console\npip install torch\npip install transformers\npip install sentencepiece\n```\nFor more details, please follow the instructions for [transformers installations](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Finstallation).\n\nA recently introduced [change in the T5-tokenizer](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F24565) results in `UnboundLocalError: cannot access local variable 'sentencepiece_model_pb2'` and can be fixed either by installing [this PR](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F25684) or by manually installing:\n```console\npip install protobuf\n```\nIf you are using a transformers version after [this PR](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F24565), you will see [this warning](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Ft5\u002Ftokenization_t5.py#L167).\nExplicitly setting `legacy=True` will result in the expected behavior and will avoid the warning. 
You can also safely ignore the warning as `legacy=True` is [the default](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Ft5\u002Ftokenization_t5.py#L175).\n\n\u003Ca name=\"quick\">\u003C\u002Fa>\n## 🚀&nbsp; Quick Start\nExample for how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as [colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing):\n```python\nfrom transformers import T5Tokenizer, T5EncoderModel\nimport torch\nimport re\n\ndevice = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')\n\n# Load the tokenizer\ntokenizer = T5Tokenizer.from_pretrained('Rostlab\u002Fprot_t5_xl_half_uniref50-enc', do_lower_case=False)\n\n# Load the model\nmodel = T5EncoderModel.from_pretrained(\"Rostlab\u002Fprot_t5_xl_half_uniref50-enc\").to(device)\n\n# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)\nif device == torch.device(\"cpu\"):\n    model.to(torch.float32)\n\n# prepare your protein sequences as a list\nsequence_examples = [\"PRTEINO\", \"SEQWENCE\"]\n\n# replace all rare\u002Fambiguous amino acids by X and introduce white-space between all amino acids\nsequence_examples = [\" \".join(list(re.sub(r\"[UZOB]\", \"X\", sequence))) for sequence in sequence_examples]\n\n# tokenize sequences and pad up to the longest sequence in the batch\nids = tokenizer(sequence_examples, add_special_tokens=True, padding=\"longest\")\n\ninput_ids = torch.tensor(ids['input_ids']).to(device)\nattention_mask = torch.tensor(ids['attention_mask']).to(device)\n\n# generate embeddings\nwith torch.no_grad():\n    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)\n\n# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens ([0,:7]) \nemb_0 = 
embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)\n# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:8])\nemb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)\n\n# if you want to derive a single representation (per-protein embedding) for the whole protein\nemb_0_per_protein = emb_0.mean(dim=0) # shape (1024)\n```\n\n\nWe also have a [script](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fblob\u002Fmaster\u002FEmbedding\u002Fprott5_embedder.py) which simplifies deriving per-residue and per-protein embeddings from ProtT5 for a given FASTA file:\n```\npython prott5_embedder.py --input sequences\u002Fsome.fasta --output embeddings\u002Fresidue_embeddings.h5\npython prott5_embedder.py --input sequences\u002Fsome.fasta --output embeddings\u002Fprotein_embeddings.h5 --per_protein 1\n```\n\n\u003Ca name=\"models\">\u003C\u002Fa>\n## ⌛️&nbsp; Models Availability\n\n|          Model                |                              Hugging Face                                  |                         Zenodo                | Colab |\n| ----------------------------- | :------------------------------------------------------------------------: |:---------------------------------------------:|---------------------------------------------:|\n| ProtT5-XL-UniRef50 (also **ProtT5-XL-U50**)            |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xl_uniref50\u002Ftree\u002Fmain)  | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4644188) | [**Colab**](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing)|\n| ProtT5-XL-BFD                 |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xl_bfd\u002Ftree\u002Fmain)       | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633924) |\n| ProtT5-XXL-UniRef50           |  
[Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xxl_uniref50\u002Ftree\u002Fmain) | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4652717) |\n| ProtT5-XXL-BFD                |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xxl_bfd\u002Ftree\u002Fmain)      | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4635302) |\n| ProtBert-BFD                  |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_bert_bfd\u002Ftree\u002Fmain)        | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633647) |\n| ProtBert                      |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_bert\u002Ftree\u002Fmain)            | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633691) |\n| ProtAlbert                    |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_albert\u002Ftree\u002Fmain)          | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633687) |\n| ProtXLNet                     |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_xlnet\u002Ftree\u002Fmain)           | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633987) |\n| ProtElectra-Generator-BFD     |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_electra_generator_bfd\u002Ftree\u002Fmain)           | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633813) |\n| ProtElectra-Discriminator-BFD |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_electra_discriminator_bfd\u002Ftree\u002Fmain)           | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633717) |\n\n\n\u003Ca name=\"datasets\">\u003C\u002Fa>\n## ⌛️&nbsp; Datasets Availability\n|          Dataset              |                                    Dropbox                                    |  \n| ----------------------------- | :---------------------------------------------------------------------------: 
|\n|\tNEW364\t\t\t|      [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fg49lb352ij4cnt7\u002FNEW364.csv?dl=1)    |\n|\tNetsurfp2       \t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F98hovta9qjmmiby\u002FTrain_HHblits.csv?dl=1)  |\n|\tCASP12\t\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fte0vn0t7ocdkra7\u002FCASP12_HHblits.csv?dl=1) |\n|\tCB513\t\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F9mat2fqqkcvdr67\u002FCB513_HHblits.csv?dl=1) |\n|\tTS115\t\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F68pknljl9la8ax3\u002FTS115_HHblits.csv?dl=1) |\n|\tDeepLoc Train\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fvgdqcl4vzqm9as0\u002Fdeeploc_per_protein_train.csv?dl=1) |\n|\tDeepLoc Test\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fjfzuokrym7nflkp\u002Fdeeploc_per_protein_test.csv?dl=1) |\n\n\u003Ca name=\"usage\">\u003C\u002Fa>\n## 🚀&nbsp; Usage  \n\nHow to use ProtTrans:\n\n\u003Ca name=\"feature-extraction\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Feature Extraction (FE):\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Embedding Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FEmbedding). [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing) example for feature extraction via ProtT5-XL-U50 \n\n\u003Ca name=\"logits-extraction\">\u003C\u002Fa>\n * \u003Cb>🚀&nbsp; Logits Extraction:\u003C\u002Fb>\u003Cbr\u002F>\n For ProtT5-logits extraction, please check:\n [VESPA logits script](https:\u002F\u002Fgithub.com\u002FRostlab\u002FVESPA#step-3-log-odds-ratio-of-masked-marginal-probabilities). \n\n\u003Ca name=\"fine-tuning\">\u003C\u002Fa>\n * \u003Cb>💥&nbsp; Fine Tuning (FT):\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Fine Tuning Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning). 
More information coming soon.\n\n\u003Ca name=\"prediction\">\u003C\u002Fa>\n * \u003Cb>🧠&nbsp; Prediction:\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Prediction Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FPrediction). [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing) example for secondary structure prediction via ProtT5-XL-U50 and [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1W5fI20eKLtHpaeeGDcKuXsgeiwujeczX?usp=sharing) example for subcellular localization prediction as well as differentiation between membrane-bound and water-soluble proteins via ProtT5-XL-U50.\n  \n\u003Ca name=\"protein-generation\">\u003C\u002Fa>\n * \u003Cb>⚗️&nbsp; Protein Sequences Generation:\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Generate Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FGenerate). More information coming soon.\n \n\u003Ca name=\"visualization\">\u003C\u002Fa>\n* \u003Cb>🧐&nbsp; Visualization:\u003C\u002Fb>\u003Cbr\u002F> \nPlease check:\n [Visualization Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FVisualization). More information coming soon.\n \n\u003Ca name=\"benchmark\">\u003C\u002Fa>\n* \u003Cb>📈&nbsp; Benchmark:\u003C\u002Fb>\u003Cbr\u002F> \nPlease check:\n [Benchmark Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FBenchmark). 
More information coming soon.\n\n\u003Ca name=\"results\">\u003C\u002Fa>\n## 📊&nbsp; Original downstream Predictions \n\n\u003Ca name=\"q3\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Secondary Structure Prediction (Q3):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model             |       CASP12       |       TS115      |       CB513      |\n| -------------------------- | :----------------: | :-------------:  | :-------------:  |\n| ProtT5-XL-UniRef50         |         81         |        87        |        86        |\n| ProtT5-XL-BFD              |         77         |        85        |        84        |\n| ProtT5-XXL-UniRef50        |         79         |        86        |        85        |\n| ProtT5-XXL-BFD             |         78         |        85        |        83        |\n| ProtBert-BFD               |         76         |        84        |        83        |\n| ProtBert                   |         75         |        83        |        81        |\n| ProtAlbert                 |         74         |        82        |        79        |\n| ProtXLNet                  |         73         |        81        |        78        |\n| ProtElectra-Generator      |         73         |        78        |        76        |\n| ProtElectra-Discriminator  |         74         |        81        |        79        |\n| ProtTXL                    |         71         |        76        |        74        |\n| ProtTXL-BFD                |         72         |        75        |        77        |\n\n🆕 Predict your sequence live on [predictprotein.org](https:\u002F\u002Fpredictprotein.org).\n\n\u003Ca name=\"q8\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Secondary Structure Prediction (Q8):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model             |       CASP12       |       TS115      |       CB513      |\n| -------------------------- | :----------------: | :-------------:  | :-------------:  |\n| ProtT5-XL-UniRef50         |         70         |        77        |      
  74        |\n| ProtT5-XL-BFD              |         66         |        74        |        71        |\n| ProtT5-XXL-UniRef50        |         68         |        75        |        72        |\n| ProtT5-XXL-BFD             |         66         |        73        |        70        |\n| ProtBert-BFD               |         65         |        73        |        70        |\n| ProtBert                   |         63         |        72        |        66        |\n| ProtAlbert                 |         62         |        70        |        65        |\n| ProtXLNet                  |         62         |        69        |        63        |\n| ProtElectra-Generator      |         60         |        66        |        61        |\n| ProtElectra-Discriminator  |         62         |        69        |        65        |\n| ProtTXL                    |         59         |        64        |        59        |\n| ProtTXL-BFD                |         60         |        65        |        60        |\n\n🆕 Predict your sequence live on [predictprotein.org](https:\u002F\u002Fpredictprotein.org).\n\n\u003Ca name=\"q2\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Membrane-bound vs Water-soluble (Q2):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model             |    DeepLoc         |\n| -------------------------- | :----------------: |\n| ProtT5-XL-UniRef50         |         91         |\n| ProtT5-XL-BFD              |         91         |\n| ProtT5-XXL-UniRef50        |         89         |\n| ProtT5-XXL-BFD             |         90         |\n| ProtBert-BFD               |         89         |\n| ProtBert                   |         89         |\n| ProtAlbert                 |         88         |\n| ProtXLNet                  |         87         |\n| ProtElectra-Generator      |         85         |\n| ProtElectra-Discriminator  |         86         |\n| ProtTXL                    |         85         |\n| ProtTXL-BFD                |         86         |\n\n\n\u003Ca 
name=\"q10\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Subcellular Localization (Q10):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model             |    DeepLoc         |\n| -------------------------- | :----------------: |\n| ProtT5-XL-UniRef50         |         81         |\n| ProtT5-XL-BFD              |         77         |\n| ProtT5-XXL-UniRef50        |         79         |\n| ProtT5-XXL-BFD             |         77         |\n| ProtBert-BFD               |         74         |\n| ProtBert                   |         74         |\n| ProtAlbert                 |         74         |\n| ProtXLNet                  |         68         |\n| ProtElectra-Generator      |         59         |\n| ProtElectra-Discriminator  |         70         |\n| ProtTXL                    |         66         |\n| ProtTXL-BFD                |         65         |\n\n\n\u003Ca name=\"inaction\">\u003C\u002Fa>\n## 📊&nbsp; Use-cases \n| Level | Type  | Tool |  Task | Manuscript | Webserver |\n| ----- |  ---- | -- | -- | -- | -- |\n| Protein | Function | Light Attention | Subcellular localization | [Light attention predicts protein location from the language of life](https:\u002F\u002Fdoi.org\u002F10.1093\u002Fbioadv\u002Fvbab035) | ([Web-server](https:\u002F\u002Fembed.protein.properties\u002F)) |\n| Residue | Function | bindEmbed21 | Binding Residues | [Protein embeddings and deep learning predict binding residues for various ligand classes](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41598-021-03431-4) | (Coming soon)  |\n| Residue | Function | VESPA           | Conservation & effect of Single Amino Acid Variants (SAVs) | [Embeddings from protein language models predict conservation and variant effects](https:\u002F\u002Frdcu.be\u002FcD7q5) | (coming soon) |\n| Protein | Structure | ProtTucker      | Protein 3D structure similarity prediction                 | [Contrastive learning on protein embeddings enlightens midnight zone at lightning 
speed](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.11.14.468528v2) |  |\n| Residue | Structure | ProtT5dst       | Protein 3D structure prediction                            | [Protein language model embeddings for fast, accurate, alignment-free protein structure prediction](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.07.31.454572v1.abstract) |  |\n\n\u003Ca name=\"comparison\">\u003C\u002Fa>\n## 📊&nbsp; Comparison to other protein language models (pLMs)\nWhile developing the [use-cases](#inaction), we compared ProtTrans models to other protein language models, for instance the [ESM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm) models. To focus on the effect of changing input representations, the following comparisons use the same architectures on top of different embedding inputs.\n\n|          Task\u002FModel             |  ProtBERT-BFD      | ProtT5-XL-U50    |       ESM-1b    |       ESM-1v      | Metric | Reference |\n| -------------------------- | :--------------:   | :--------------: | :-----------:   | :-----------:  | :-----------: | :-----------: |\n| Subcell. loc. (setDeepLoc) |  80    | \u003Cb>86\u003C\u002Fb>    |   83        |    -         | Accuracy |  [Light-attention](https:\u002F\u002Facademic.oup.com\u002Fview-large\u002Ffigure\u002F321379865\u002Fvbab035f2.tif) |\n| Subcell. loc. 
(setHard)    |  58    | \u003Cb>65\u003C\u002Fb>    |   62        |    -         | Accuracy |  [Light-attention](https:\u002F\u002Facademic.oup.com\u002Fview-large\u002Ffigure\u002F321379865\u002Fvbab035f2.tif) |\n| Conservation (ConSurf-DB)  |  0.540 | \u003Cb>0.596\u003C\u002Fb> |   0.563     |    -         | MCC      | [ConsEmb](https:\u002F\u002Frdcu.be\u002FcD7q5) | \n| Variant effect (DMS-data)  |  -     | \u003Cb>0.53\u003C\u002Fb>  |   -         |    0.49      | Spearman (Mean) | [VESPA](https:\u002F\u002Frdcu.be\u002FcD7q5) |\n| Variant effect (DMS-data)  |  -     | \u003Cb>0.53\u003C\u002Fb>  |   -         | \u003Cb>0.53\u003C\u002Fb>  | Spearman (Median) | [VESPA](https:\u002F\u002Frdcu.be\u002FcD7q5) |\n| CATH superfamily (unsup.)  |  18    | \u003Cb>64\u003C\u002Fb>    |   57        |    -         | Accuracy | [ProtTucker](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.11.14.468528v1) |\n| CATH superfamily (sup.)    |  39    | \u003Cb>76\u003C\u002Fb>    |   70        |    -         | Accuracy | [ProtTucker](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.11.14.468528v1) |\n| Binding residues           |  -     | \u003Cb>39\u003C\u002Fb>    |   32        |    -        | F1 | [bindEmbed21](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41598-021-03431-4) |\n\nImportant note on ProtT5-XL-UniRef50 (dubbed ProtT5-XL-U50): all performances were measured using only embeddings extracted from the encoder-side of the underlying T5 model as described [here](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fblob\u002Fmaster\u002FEmbedding\u002FPyTorch\u002FAdvanced\u002FProtT5-XL-UniRef50.ipynb). Also, experiments were run in half-precision mode (model.half()) to speed up embedding generation. 
No performance degradation could be observed in any of the experiments when running in half-precision.\n\n\u003Ca name=\"community\">\u003C\u002Fa>\n## ❤️&nbsp; Community and Contributions\n\nThe ProtTrans project is an **open source project** supported by various partner companies and research institutions. We are committed to **sharing all our pre-trained models and knowledge**. We are more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.\n\n\u003Ca name=\"question\">\u003C\u002Fa>\n## 📫&nbsp; Have a question?\n\nWe are happy to hear your question in our issues page [ProtTrans](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues)! If you have a private question or want to cooperate with us, you can always **reach out to us directly** via our [RostLab email](mailto:assistant@rostlab.org?subject=[GitHub]ProtTrans) \n\n\u003Ca name=\"bug\">\u003C\u002Fa>\n## 🤝&nbsp; Found a bug?\n\nFeel free to **file a new issue** with a respective title and description on the [ProtTrans](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues) repository. If you already found a solution to your problem, **we would love to review your pull request**!\n\n\u003Ca name=\"requirements\">\u003C\u002Fa>\n## ✅&nbsp; Requirements\n\nFor protein feature extraction or fine-tuning our pre-trained models, [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) and the [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) library from Hugging Face are needed. 
For model visualization, you need to install [BertViz](https:\u002F\u002Fgithub.com\u002Fjessevig\u002Fbertviz) library.\n\n\u003Ca name=\"team\">\u003C\u002Fa>\n## 🤵&nbsp; Team\n\n * \u003Cb>Technical University of Munich:\u003C\u002Fb>\u003Cbr\u002F>\n \n| Ahmed Elnaggar       |      Michael Heinzinger  |  Christian Dallago | Ghalia Rehawi | Burkhard Rost |\n|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_00b25690fee9.jpg\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_064bb2a54d8f.jpg\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_f7fa6b92798c.png\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_1baa91762687.png\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_f2e8f52bf81d.jpg\"> |\n\n * \u003Cb>Med AI Technology:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Yu Wang       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_438776cf4355.jpeg\"> |\n\n* \u003Cb>Google:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Llion Jones       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_56f35d9447cc.jpg\"> |\n\n* \u003Cb>Nvidia:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Tom Gibbs       | Tamas Feher | Christoph Angerer |\n|:-------------------------:|:-------------------------:|:-------------------------:|\n| \u003Cimg width=120\u002F 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_cb2c28958b9c.png\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_213f512e509d.jpeg\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_377405859919.jpg\"> |\n\n* \u003Cb>Seoul National University:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Martin Steinegger       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_7af6ebe5a2bd.png\"> |\n\n\n* \u003Cb>ORNL:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Debsindhu Bhowmik       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_7c6056e9c8f7.jpg\"> |\n\n\u003Ca name=\"sponsors\">\u003C\u002Fa>\n## 💰&nbsp; Sponsors\n\n\u003C!--\n\u003Cdiv id=\"banner\" style=\"overflow: hidden;justify-content:space-around;display:table-cell; vertical-align:middle; text-align:center\">\n  \u003Cdiv class=\"\" style=\"max-width: 20%;max-height: 20%;display: inline-block;\">\n      \u003Cimg width=\"14%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_550652fef661.png\" alt=\"nvidia logo\">\n  \u003C\u002Fdiv>\n\n  \u003Cdiv class=\"\" style=\"max-width: 20%;max-height: 20%;display: inline-block;\">\n      \u003Cimg width=\"22%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_7480143ef5a4.jpg\" alt=\"google cloud logo\">\n  \u003C\u002Fdiv>\n\n  \u003Cdiv class=\"\" style=\"max-width: 20%;max-height: 20%;display: inline-block;\">\n      \u003Cimg width=\"20%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_e9b991abcdde.png\" alt=\"ornl logo\">\n  \u003C\u002Fdiv>\n  \n  \u003Cdiv class=\"\" style=\"max-width: 
20%;max-height: 20%;display: inline-block;\">\n      \u003Cimg width=\"12%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_97b2d9a4a555.jpg\" alt=\"software campus logo\">\n  \u003C\u002Fdiv>\n  \n\u003C\u002Fdiv>\n-->\n\nNvidia       |      Google  |      Google  | ORNL | Software Campus\n:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_550652fef661.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_49e9ea75d269.jpg) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_f86805e3d236.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_e9b991abcdde.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_97b2d9a4a555.jpg)\n\n\u003Ca name=\"license\">\u003C\u002Fa>\n## 📘&nbsp; License\nThe ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0 License](https:\u002F\u002Fchoosealicense.com\u002Flicenses\u002Fafl-3.0\u002F).\n\n\u003Ca name=\"citation\">\u003C\u002Fa>\n## ✏️&nbsp; Citation\nIf you use this code or our pretrained models for your publication, please cite the original paper:\n```\n@ARTICLE\n{9477085,\nauthor={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Yu, Wang and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},\njournal={IEEE Transactions on Pattern Analysis and Machine Intelligence},\ntitle={ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance 
Computing},\nyear={2021},\nvolume={},\nnumber={},\npages={1-1},\ndoi={10.1109\u002FTPAMI.2021.3095381}}\n```\n","\u003Cbr\u002F>\n\u003Ch1 align=\"center\">ProtTrans\u003C\u002Fh1>\n\u003Cbr\u002F>\n\n\u003Cbr\u002F>\n\n[ProtTrans](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002F) provides **state-of-the-art pre-trained models for proteins**. ProtTrans was trained on **thousands of GPUs from Summit** and **hundreds of Google TPUs** using various **Transformer models**.\n\nFor more information about our work, have a look at our paper: [ProtTrans: cracking the language of life’s code through self-supervised deep learning and high performance computing](https:\u002F\u002Fdoi.org\u002F10.1109\u002FTPAMI.2021.3095381).\n\n\u003Cbr\u002F>\n\u003Cp align=\"center\">\n    \u003Cimg width=\"70%\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_aa204f208b8a.png\" alt=\"ProtTrans Attention Visualization\">\n\u003C\u002Fp>\n\u003Cbr\u002F>\n\n\nThis repository will be updated regularly with **new pre-trained models for proteins** to support the bioinformatics community in general, and Covid-19 research specifically through our [Accelerate SARS-CoV-2 research with transfer learning using pre-trained language modeling models](https:\u002F\u002Fcovid19-hpc-consortium.org\u002Fprojects\u002F5ed56e51a21132007ebf57bf) project.\n\nTable of Contents\n=================\n* [ ⌛️&nbsp; News](#news)\n* [ 🚀&nbsp; Installation](#install)\n* [ 🚀&nbsp; Quick Start](#quick)\n* [ ⌛️&nbsp; Models Availability](#models)\n* [ ⌛️&nbsp; Dataset Availability](#datasets)\n* [ 🚀&nbsp; Usage ](#usage)\n  * [ 🧬&nbsp; Feature Extraction (FE)](#feature-extraction)\n  * [ 🚀&nbsp; Logits Extraction](#logits-extraction)\n  * [ 💥&nbsp; Fine Tuning (FT)](#fine-tuning)\n  * [ 🧠&nbsp; Prediction](#prediction)\n  * [ ⚗️&nbsp; Protein Sequence Generation ](#protein-generation)\n  * [ 🧐&nbsp; Visualization ](#visualization)\n  * [ 📈&nbsp; Benchmark ](#benchmark)\n* [ 📊&nbsp; Original Downstream Prediction Results ](#results)\n* [ 📊&nbsp; Follow-up Use-cases ](#inaction)\n* [ 📊&nbsp; Comparisons to Other Tools ](#comparison)\n* [ ❤️&nbsp; Community and Contributions ](#community)\n* [ 📫&nbsp; Have a question? ](#question)\n* [ 🤝&nbsp; Found a bug? ](#bug)\n* [ ✅&nbsp; Requirements ](#requirements)\n* [ 🤵&nbsp; Team ](#team)\n* [ 💰&nbsp; Sponsors ](#sponsors)\n* [ 📘&nbsp; License ](#license)\n* [ ✏️&nbsp; Citation 
](#citation)\n\n\n\u003Ca name=\"news\">\u003C\u002Fa>\n## ⌛️&nbsp; News\n* **2025\u002F01\u002F22: [Continue pre-training & evo-tuning]( https:\u002F\u002Fgithub.com\u002FRSchmirler\u002FProtT5-EvoTuning\u002Ftree\u002Fmain?tab=readme-ov-file ) shows how to continue pre-training ProtT5 on new protein sequences using the original ProtT5 pre-training objective. This includes continued pre-training on a set of homologous sequences (so-called evo-tuning).**\n* 2023\u002F07\u002F14: [FineTuning with LoRA]( https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning) provides example notebooks on how to fine-tune ProtT5 efficiently with Low-Rank Adaptation (LoRA), for both per-residue and per-protein tasks (thanks @0syrys!).\n* 2022\u002F11\u002F18: Availability update: [LambdaPP](https:\u002F\u002Fembed.predictprotein.org\u002F) offers a simple web service to access ProtT5-based predictions, and UniProt now offers [pre-computed ProtT5 embeddings](https:\u002F\u002Fwww.uniprot.org\u002Fhelp\u002Fembeddings) for download for a subset of selected organisms.\n\n\u003Ca name=\"install\">\u003C\u002Fa>\n## 🚀&nbsp; Installation\nAll our models are available via huggingface\u002Ftransformers:\n```console\npip install torch\npip install transformers\npip install sentencepiece\n```\nFor more details, please follow the [transformers installation guide](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Finstallation).\n\nA recent [change](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F24565) in the T5 tokenizer results in the error `UnboundLocalError: cannot access local variable 'sentencepiece_model_pb2'`. It can be fixed either by installing [this PR](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F25684) or by manually installing:\n```console\npip install protobuf\n```\nIf you are using a transformers version released after this PR, you will see [this warning](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Ft5\u002Ftokenization_t5.py#L167).\nExplicitly setting `legacy=True` restores the expected behavior and avoids the warning. You can also safely ignore the warning, as `legacy=True` is the [default](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Ft5\u002Ftokenization_t5.py#L175).\n\n\u003Ca name=\"quick\">\u003C\u002Fa>\n## 🚀&nbsp; 
Quick Start\nThe following example shows how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (ProtT5 for short); it is also available as a [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing):\n```python\nfrom transformers import T5Tokenizer, T5EncoderModel\nimport torch\nimport re\n\ndevice = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')\n\n# Load the tokenizer\ntokenizer = T5Tokenizer.from_pretrained('Rostlab\u002Fprot_t5_xl_half_uniref50-enc', do_lower_case=False)\n\n# Load the model\nmodel = T5EncoderModel.from_pretrained(\"Rostlab\u002Fprot_t5_xl_half_uniref50-enc\").to(device)\n\n# Only GPUs support half-precision currently; to run on CPU, use full precision (not recommended, much slower)\nif device == torch.device(\"cpu\"):\n    model.to(torch.float32)\n\n# Prepare your protein sequences as a list\nsequence_examples = [\"PRTEINO\", \"SEQWENCE\"]\n\n# Replace all rare\u002Fambiguous amino acids with X and insert a space between all amino acids\nsequence_examples = [\" \".join(list(re.sub(r\"[UZOB]\", \"X\", sequence))) for sequence in sequence_examples]\n\n# Tokenize the sequences and pad up to the longest sequence in the batch\nids = tokenizer(sequence_examples, add_special_tokens=True, padding=\"longest\")\n\ninput_ids = torch.tensor(ids['input_ids']).to(device)\nattention_mask = torch.tensor(ids['attention_mask']).to(device)\n\n# Generate embeddings\nwith torch.no_grad():\n    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)\n\n# Extract residue embeddings for the first sequence in the batch ([0,:]) and remove padded and special tokens ([0,:7])\nemb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)\n# Same for the second sequence ([1,:]), taking its different length into account ([1,:8])\nemb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)\n\n# If you want a single representation for the whole protein (per-protein embedding)\nemb_0_per_protein = emb_0.mean(dim=0) # shape (1024)\n```\n\n\nWe also provide a [script](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fblob\u002Fmaster\u002FEmbedding\u002Fprott5_embedder.py) that simplifies deriving per-residue or per-protein ProtT5 embeddings from a given FASTA file:\n```\npython prott5_embedder.py --input sequences\u002Fsome.fasta --output embeddings\u002Fresidue_embeddings.h5\npython prott5_embedder.py --input sequences\u002Fsome.fasta --output embeddings\u002Fprotein_embeddings.h5 --per_protein 
1\n```\n\n\u003Ca name=\"models\">\u003C\u002Fa>\n\n## ⌛️&nbsp; Models Availability\n\n|          Model                |                              Hugging Face                                  |                         Zenodo                | Colab |\n| ----------------------------- | :------------------------------------------------------------------------: |:---------------------------------------------:|---------------------------------------------:|\n| ProtT5-XL-UniRef50 (also **ProtT5-XL-U50**)            |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xl_uniref50\u002Ftree\u002Fmain)  | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4644188) | [**Colab**](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing)|\n| ProtT5-XL-BFD                 |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xl_bfd\u002Ftree\u002Fmain)       | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633924) |\n| ProtT5-XXL-UniRef50           |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xxl_uniref50\u002Ftree\u002Fmain) | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4652717) |\n| ProtT5-XXL-BFD                |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_t5_xxl_bfd\u002Ftree\u002Fmain)      | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4635302) |\n| ProtBert-BFD                  |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_bert_bfd\u002Ftree\u002Fmain)        | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633647) |\n| ProtBert                      |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_bert\u002Ftree\u002Fmain)            | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633691) |\n| ProtAlbert                    |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_albert\u002Ftree\u002Fmain)          | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633687) |\n| ProtXLNet                
     |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_xlnet\u002Ftree\u002Fmain)           | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633987) |\n| ProtElectra-Generator-BFD     |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_electra_generator_bfd\u002Ftree\u002Fmain)           | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633813) |\n| ProtElectra-Discriminator-BFD |  [Download](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_electra_discriminator_bfd\u002Ftree\u002Fmain)           | [Download](https:\u002F\u002Fzenodo.org\u002Frecord\u002F4633717) |\n\n\n\u003Ca name=\"datasets\">\u003C\u002Fa>\n## ⌛️&nbsp; Dataset Availability\n|          Dataset              |                                    Dropbox                                    |  \n| ----------------------------- | :---------------------------------------------------------------------------: |\n|\tNEW364\t\t\t|      [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fg49lb352ij4cnt7\u002FNEW364.csv?dl=1)    |\n|\tNetsurfp2       \t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F98hovta9qjmmiby\u002FTrain_HHblits.csv?dl=1)  |\n|\tCASP12\t\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fte0vn0t7ocdkra7\u002FCASP12_HHblits.csv?dl=1) |\n|\tCB513\t\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F9mat2fqqkcvdr67\u002FCB513_HHblits.csv?dl=1) |\n|\tTS115\t\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F68pknljl9la8ax3\u002FTS115_HHblits.csv?dl=1) |\n|\tDeepLoc Train\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fvgdqcl4vzqm9as0\u002Fdeeploc_per_protein_train.csv?dl=1) |\n|\tDeepLoc Test\t\t| [Download](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fjfzuokrym7nflkp\u002Fdeeploc_per_protein_test.csv?dl=1) |\n\n\u003Ca name=\"usage\">\u003C\u002Fa>\n## 🚀&nbsp; Usage  \n\nHow to use ProtTrans:\n\n\u003Ca name=\"feature-extraction\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Feature Extraction (FE):\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n 
[Embedding Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FEmbedding). [Colab example](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing) for feature extraction via ProtT5-XL-U50.\n\n\u003Ca name=\"logits-extraction\">\u003C\u002Fa>\n * \u003Cb>🚀&nbsp; Logits Extraction:\u003C\u002Fb>\u003Cbr\u002F>\n For ProtT5 logits extraction, please check:\n [VESPA logits script](https:\u002F\u002Fgithub.com\u002FRostlab\u002FVESPA#step-3-log-odds-ratio-of-masked-marginal-probabilities).\n\n\u003Ca name=\"fine-tuning\">\u003C\u002Fa>\n * \u003Cb>💥&nbsp; Fine Tuning (FT):\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Fine Tuning Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning). More information coming soon.\n\n\u003Ca name=\"prediction\">\u003C\u002Fa>\n * \u003Cb>🧠&nbsp; Prediction:\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Prediction Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FPrediction). [Colab example](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing) for secondary structure prediction via ProtT5-XL-U50, and [Colab example](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1W5fI20eKLtHpaeeGDcKuXsgeiwujeczX?usp=sharing) for subcellular localization prediction as well as differentiation between membrane-bound and water-soluble proteins via ProtT5-XL-U50.\n  \n\u003Ca name=\"protein-generation\">\u003C\u002Fa>\n * \u003Cb>⚗️&nbsp; Protein Sequence Generation:\u003C\u002Fb>\u003Cbr\u002F>\n Please check:\n [Generate Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FGenerate). More information coming soon.\n \n\u003Ca name=\"visualization\">\u003C\u002Fa>\n* \u003Cb>🧐&nbsp; Visualization:\u003C\u002Fb>\u003Cbr\u002F> \nPlease check:\n [Visualization Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FVisualization). More information coming soon.\n \n\u003Ca name=\"benchmark\">\u003C\u002Fa>\n* \u003Cb>📈&nbsp; Benchmark:\u003C\u002Fb>\u003Cbr\u002F> \nPlease check:\n 
[Benchmark Section](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FBenchmark). More information coming soon.\n\n\u003Ca name=\"results\">\u003C\u002Fa>\n\n## 📊&nbsp; Original Downstream Prediction Results\n\n\u003Ca name=\"q3\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Secondary Structure Prediction (Q3):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model              |       CASP12       |       TS115      |       CB513      |\n| -------------------------- | :----------------: | :-------------:  | :-------------:  |\n| ProtT5-XL-UniRef50         |         81         |        87        |        86        |\n| ProtT5-XL-BFD              |         77         |        85        |        84        |\n| ProtT5-XXL-UniRef50        |         79         |        86        |        85        |\n| ProtT5-XXL-BFD             |         78         |        85        |        83        |\n| ProtBert-BFD               |         76         |        84        |        83        |\n| ProtBert                   |         75         |        83        |        81        |\n| ProtAlbert                 |         74         |        82        |        79        |\n| ProtXLNet                  |         73         |        81        |        78        |\n| ProtElectra-Generator      |         73         |        78        |        76        |\n| ProtElectra-Discriminator  |         74         |        81        |        79        |\n| ProtTXL                    |         71         |        76        |        74        |\n| ProtTXL-BFD                |         72         |        75        |        77        |\n\n🆕 Predict your sequence live on [predictprotein.org](https:\u002F\u002Fpredictprotein.org).\n\n\u003Ca name=\"q8\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Secondary Structure Prediction (Q8):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model              |       CASP12       |       TS115      |       CB513      |\n| -------------------------- | :----------------: | :-------------:  | :-------------:  |\n| ProtT5-XL-UniRef50         |         70         |        77        |        
74        |\n| ProtT5-XL-BFD              |         66         |        74        |        71        |\n| ProtT5-XXL-UniRef50        |         68         |        75        |        72        |\n| ProtT5-XXL-BFD             |         66         |        73        |        70        |\n| ProtBert-BFD               |         65         |        73        |        70        |\n| ProtBert                   |         63         |        72        |        66        |\n| ProtAlbert                 |         62         |        70        |        65        |\n| ProtXLNet                  |         62         |        69        |        63        |\n| ProtElectra-Generator      |         60         |        66        |        61        |\n| ProtElectra-Discriminator  |         62         |        69        |        65        |\n| ProtTXL                    |         59         |        64        |        59        |\n| ProtTXL-BFD                |         60         |        65        |        60        |\n\n🆕 Predict your sequence live on [predictprotein.org](https:\u002F\u002Fpredictprotein.org).\n\n\u003Ca name=\"q2\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Membrane-bound vs. 
Water-soluble (Q2):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model              |    DeepLoc         |\n| -------------------------- | :----------------: |\n| ProtT5-XL-UniRef50         |         91         |\n| ProtT5-XL-BFD              |         91         |\n| ProtT5-XXL-UniRef50        |         89         |\n| ProtT5-XXL-BFD             |         90         |\n| ProtBert-BFD               |         89         |\n| ProtBert                   |         89         |\n| ProtAlbert                 |         88         |\n| ProtXLNet                  |         87         |\n| ProtElectra-Generator      |         85         |\n| ProtElectra-Discriminator  |         86         |\n| ProtTXL                    |         85         |\n| ProtTXL-BFD                |         86         |\n\n\n\u003Ca name=\"q10\">\u003C\u002Fa>\n * \u003Cb>🧬&nbsp; Subcellular Localization (Q10):\u003C\u002Fb>\u003Cbr\u002F>\n \n|          Model              |    DeepLoc         |\n| -------------------------- | :----------------: |\n| ProtT5-XL-UniRef50         |         81         |\n| ProtT5-XL-BFD              |         77         |\n| ProtT5-XXL-UniRef50        |         79         |\n| ProtT5-XXL-BFD             |         77         |\n| ProtBert-BFD               |         74         |\n| ProtBert                   |         74         |\n| ProtAlbert                 |         74         |\n| ProtXLNet                  |         68         |\n| ProtElectra-Generator      |         59         |\n| ProtElectra-Discriminator  |         70         |\n| ProtTXL                    |         66         |\n| ProtTXL-BFD                |         65         |\n\n\n\u003Ca name=\"inaction\">\u003C\u002Fa>\n## 📊&nbsp; Use-cases\n| Level    | Type   | Tool           | Task                                     | Manuscript                                                                                    | Webserver             |\n| -------- | ------ | -------------- | ---------------------------------------- | 
--------------------------------------------------------------------------------------------- | --------------------- |\n| Protein  | Function | Light Attention | Subcellular localization                 | [Light attention predicts protein location from the language of life](https:\u002F\u002Fdoi.org\u002F10.1093\u002Fbioadv\u002Fvbab035) | ([Web-server](https:\u002F\u002Fembed.protein.properties\u002F)) |\n| Residue  | Function | bindEmbed21     | Binding residues                         | [Protein embeddings and deep learning predict binding residues for various ligand classes](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41598-021-03431-4) | (Coming soon)         |\n| Residue  | Function | VESPA           | Conservation and effect of single amino acid variants (SAVs) | [Embeddings from protein language models predict conservation and variant effects](https:\u002F\u002Frdcu.be\u002FcD7q5) | (Coming soon)         |\n| Protein  | Structure | ProtTucker      | Protein 3D structure similarity prediction | [Contrastive learning on protein embeddings enlightens midnight zone at lightning speed](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.11.14.468528v2) |                       |\n| Residue  | Structure | ProtT5dst       | Protein 3D structure prediction          | [Protein language model embeddings for fast, accurate, alignment-free protein structure prediction](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.07.31.454572v1.abstract) |                       |\n\n\u003Ca name=\"comparison\">\u003C\u002Fa>\n\n## 📊&nbsp; Comparison to Other Protein Language Models (pLMs)\nWhile developing the [use-cases](#inaction), we compared ProtTrans models to other protein language models, for instance the [ESM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm) models. To focus on the effect of changing the input representation, all comparisons below use the same architecture and only swap the embedding input.\n\n|          Task\u002FModel            |  ProtBERT-BFD      | ProtT5-XL-U50    |       ESM-1b    |       ESM-1v      | Metric | Reference |\n| -------------------------- | :--------------:   | :--------------: | :-----------:   | :-----------:  | :-----------: | :-----------: |\n| Subcellular localization (setDeepLoc) |  80    | 
\u003Cb>86\u003C\u002Fb>    |   83        |    -         | Accuracy |  [Light-attention](https:\u002F\u002Facademic.oup.com\u002Fview-large\u002Ffigure\u002F321379865\u002Fvbab035f2.tif) |\n| Subcellular localization (setHard)    |  58    | \u003Cb>65\u003C\u002Fb>    |   62        |    -         | Accuracy |  [Light-attention](https:\u002F\u002Facademic.oup.com\u002Fview-large\u002Ffigure\u002F321379865\u002Fvbab035f2.tif) |\n| Conservation (ConSurf-DB)  |  0.540 | \u003Cb>0.596\u003C\u002Fb> |   0.563     |    -         | MCC      | [ConsEmb](https:\u002F\u002Frdcu.be\u002FcD7q5) | \n| Variant effect (DMS data)  |  -     | \u003Cb>0.53\u003C\u002Fb>  |   -         |    0.49      | Spearman (mean) | [VESPA](https:\u002F\u002Frdcu.be\u002FcD7q5) |\n| Variant effect (DMS data)  |  -     | \u003Cb>0.53\u003C\u002Fb>  |   -         | \u003Cb>0.53\u003C\u002Fb>  | Spearman (median) | [VESPA](https:\u002F\u002Frdcu.be\u002FcD7q5) |\n| CATH superfamily (unsupervised)  |  18    | \u003Cb>64\u003C\u002Fb>    |   57        |    -         | Accuracy | [ProtTucker](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.11.14.468528v1) |\n| CATH superfamily (supervised)    |  39    | \u003Cb>76\u003C\u002Fb>    |   70        |    -         | Accuracy | [ProtTucker](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.11.14.468528v1) |\n| Binding residues           |  -     | \u003Cb>39\u003C\u002Fb>    |   32        |    -        | F1 | [bindEmbed21](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41598-021-03431-4) |\n\nImportant note on ProtT5-XL-UniRef50 (ProtT5-XL-U50 for short): all performance figures were computed from embeddings extracted from the encoder side of the underlying T5 model, as described [here](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fblob\u002Fmaster\u002FEmbedding\u002FPyTorch\u002FAdvanced\u002FProtT5-XL-UniRef50.ipynb). In addition, the experiments were run in half-precision mode (model.half()) to speed up embedding generation. No performance degradation was observed in any of the experiments when running in half-precision.\n\n\u003Ca name=\"community\">\u003C\u002Fa>\n## ❤️&nbsp; Community and Contributions\n\nThe ProtTrans project is an **open source project** supported by various partner companies and research institutions. We are committed to **sharing all our pre-trained models and knowledge**. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.\n\n\u003Ca name=\"question\">\u003C\u002Fa>\n## 📫&nbsp; 
Have a question?\n\nWe are happy to hear your questions on our [ProtTrans](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues) issues page! If you have a private question or want to cooperate with us, you can also **reach out to us directly** via our [RostLab email](mailto:assistant@rostlab.org?subject=[GitHub]ProtTrans).\n\n\u003Ca name=\"bug\">\u003C\u002Fa>\n## 🤝&nbsp; Found a bug?\n\nFeel free to **file a new issue** with a descriptive title and description on the [ProtTrans](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues) repository. If you have already found a solution, **we would love to review your pull request**!\n\n\u003Ca name=\"requirements\">\u003C\u002Fa>\n## ✅&nbsp; Requirements\n\nFor protein feature extraction or fine-tuning our pre-trained models, the [Pytorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch) and Hugging Face [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) libraries are needed. For model visualization, you also need to install the [BertViz](https:\u002F\u002Fgithub.com\u002Fjessevig\u002Fbertviz) library.\n\n\u003Ca name=\"team\">\u003C\u002Fa>\n## 🤵&nbsp; Team\n\n * \u003Cb>Technical University of Munich:\u003C\u002Fb>\u003Cbr\u002F>\n \n| Ahmed Elnaggar       |      Michael Heinzinger  |  Christian Dallago | Ghalia Rehawi | Burkhard Rost |\n|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_00b25690fee9.jpg\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_064bb2a54d8f.jpg\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_f7fa6b92798c.png\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_1baa91762687.png\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_f2e8f52bf81d.jpg\"> |\n\n * \u003Cb>Med AI Technology:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Yu Wang       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_438776cf4355.jpeg\"> |\n\n* \u003Cb>Google:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Llion Jones       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_56f35d9447cc.jpg\"> |\n\n* \u003Cb>Nvidia:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Tom Gibbs       | Tamas Feher | Christoph Angerer |\n|:-------------------------:|:-------------------------:|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_cb2c28958b9c.png\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_213f512e509d.jpeg\"> | \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_377405859919.jpg\"> |\n\n* \u003Cb>Seoul National University:\u003C\u002Fb>\u003Cbr\u002F>\n\n| Martin Steinegger       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_7af6ebe5a2bd.png\"> |\n\n\n* \u003Cb>Oak Ridge National Laboratory (ORNL):\u003C\u002Fb>\u003Cbr\u002F>\n\n| Debsindhu Bhowmik       |\n|:-------------------------:|\n| \u003Cimg width=120\u002F src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_7c6056e9c8f7.jpg\"> |\n\n\u003Ca name=\"sponsors\">\u003C\u002Fa>\n\n## 💰&nbsp; Sponsors\n\nNvidia       |      Google  |      Google  | ORNL (Oak Ridge National Laboratory) | Software Campus\n:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_550652fef661.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_49e9ea75d269.jpg) | 
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_f86805e3d236.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_e9b991abcdde.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_readme_97b2d9a4a555.jpg)\n\n\u003Ca name=\"license\">\u003C\u002Fa>\n## 📘&nbsp; License\nThe ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0 License](https:\u002F\u002Fchoosealicense.com\u002Flicenses\u002Fafl-3.0\u002F).\n\n\u003Ca name=\"citation\">\u003C\u002Fa>\n## ✏️&nbsp; Citation\nIf you use this code or our pretrained models for your publication, please cite the original paper:\n```\n@ARTICLE\n{9477085,\nauthor={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Yu, Wang and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},\njournal={IEEE Transactions on Pattern Analysis and Machine Intelligence},\ntitle={ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing},\nyear={2021},\nvolume={},\nnumber={},\npages={1-1},\ndoi={10.1109\u002FTPAMI.2021.3095381}}\n```","# ProtTrans Quick Start Guide\n\n## Environment Setup\n\n- **Operating system**: Linux, macOS, and Windows are supported (Linux or macOS recommended for GPU acceleration)\n- **Python version**: Python 3.7+ recommended\n- **Hardware requirements**:\n  - An NVIDIA GPU (with CUDA support) is recommended for best performance\n  - Running on CPU works but is slower (full-precision mode is required)\n- **Prerequisites**:\n  - PyTorch (preferably with CUDA support)\n  - The transformers library\n  - The sentencepiece tokenizer\n  - protobuf (resolves a tokenizer error in some versions)\n\n> Note: there is currently no official mirror in China; consider using a domestic pip mirror (e.g., Tsinghua or Alibaba Cloud) to speed up installation.\n\n---\n\n## Installation\n\n```console\npip install torch\npip install transformers\npip install sentencepiece\npip install protobuf\n```\n\n> If you hit the error `UnboundLocalError: cannot access local variable 'sentencepiece_model_pb2'`, make sure `protobuf` is installed.\n\n---\n\n## Basic Usage\n\nThe following example shows how to use the **ProtT5-XL-U50** model to extract embeddings for protein sequences:\n\n```python\nfrom transformers import T5Tokenizer, T5EncoderModel\nimport torch\nimport re\n\ndevice = torch.device('cuda:0' if 
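torch.cuda.is_available() else 'cpu')\n\n# Load the tokenizer\ntokenizer = T5Tokenizer.from_pretrained('Rostlab\u002Fprot_t5_xl_half_uniref50-enc', do_lower_case=False)\n\n# Load the model\nmodel = T5EncoderModel.from_pretrained(\"Rostlab\u002Fprot_t5_xl_half_uniref50-enc\").to(device)\n\n# When running on CPU, cast to full precision (half precision is only supported on GPU)\nif device == torch.device(\"cpu\"):\n    model.to(torch.float32)\n\n# Prepare the protein sequences (examples)\nsequence_examples = [\"PRTEINO\", \"SEQWENCE\"]\n\n# Preprocess: replace rare amino acids with X and insert a space between amino acids\nsequence_examples = [\" \".join(list(re.sub(r\"[UZOB]\", \"X\", sequence))) for sequence in sequence_examples]\n\n# Tokenize and pad to the longest sequence in the batch\nids = tokenizer(sequence_examples, add_special_tokens=True, padding=\"longest\")\ninput_ids = torch.tensor(ids['input_ids']).to(device)\nattention_mask = torch.tensor(ids['attention_mask']).to(device)\n\n# Extract embeddings (gradient computation disabled)\nwith torch.no_grad():\n    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)\n\n# Embeddings of the first 7 residues of the first sequence (padding removed)\nemb_0 = embedding_repr.last_hidden_state[0,:7]  # shape: (7, 1024)\n\n# Mean embedding for the whole protein (per-protein embedding)\nemb_0_per_protein = emb_0.mean(dim=0)  # shape: (1024,)\n```\n\n> For more utility scripts (such as batch embedding generation from a FASTA file), see [`prott5_embedder.py`](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fblob\u002Fmaster\u002FEmbedding\u002Fprott5_embedder.py) in the project.","The computational biology team at a biopharma startup is developing a novel enzyme replacement therapy for a rare genetic disease and needs to quickly predict the functional impact of hundreds of mutant proteins in order to screen candidate molecules.\n\n### Without ProtTrans\n- The team can rely only on traditional sequence alignment tools (such as BLAST) and hand-crafted physicochemical features, which cannot capture deeper semantic patterns in protein sequences, so prediction accuracy is low and unstable.\n- Each analysis of a new protein requires training a model from scratch, which takes weeks and severely slows down drug screening.\n- Without a unified, high-quality vector representation, data is hard to reuse across projects, leading to duplicated work and wasted resources.\n- For mutants with unknown structure there is almost no effective way to assess functional changes, other than costly wet-lab trial and error.\n- The team has to maintain several heterogeneous toolchains, which are complex to debug and costly for newcomers to learn.\n\n### With ProtTrans\n- Calling the pre-trained ProtT5 model directly to extract protein sequence representations yields semantically rich embeddings without any training, improving prediction accuracy by more than 30%.\n- With the transfer learning that ProtTrans enables, fine-tuning on a small amount of labeled data is enough to adapt to a specific task, cutting the model development cycle from weeks to under 2 days.\n- All proteins share the same ProtTrans embedding space, which enables feature sharing and model reuse across projects and greatly improves R&D efficiency.\n- Even without 3D structure information, the attention maps and embedding differences produced by ProtTrans make it possible to infer the potential functional impact of mutations, reducing ineffective experiments by 70%.\n- Integration via the Hugging 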
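Face standard-interface integration, the code stays concise and uniform, teamwork runs more smoothly, and newcomers can start contributing within half a day.\n\nProtTrans makes protein language understanding as simple as calling an API, freeing biologists from tedious modeling so they can focus on real scientific discovery.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fagemagician_ProtTrans_fd88fbb8.png","agemagician","Ahmed Elnaggar","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fagemagician_fd0eba5d.jpg",null,"Technical University of Munich (TUM)","Germany","ahmed.elnaggar@tum.de","Elnaggar_AI","https:\u002F\u002Fgithub.com\u002Fagemagician",[86,90],{"name":87,"color":88,"percentage":89},"Jupyter Notebook","#DA5B0B",100,{"name":91,"color":92,"percentage":93},"Python","#3572A5",0,1296,167,"2026-04-02T13:21:47","MIT","","Not specified",{"notes":101,"python":99,"dependencies":102},"Running on CPU requires switching to full-precision mode (slower), so a GPU environment is recommended. Some model files are large and must be downloaded on first load. The T5 tokenizer may have compatibility issues; install protobuf or set legacy=True.",[103,104,105,106],"torch","transformers","sentencepiece","protobuf",[15,37],6,"2026-03-27T02:49:30.150509","2026-04-06T06:45:36.148496",[112,117,122,127,132,137],{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},333,"Can ProtBert be fine-tuned for masked language modeling? Is there a pretrained model that uses less GPU memory?","You can fine-tune with LoRA layers to reduce GPU memory usage; the scripts are provided in the project directory: https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning . The original large models (such as prot_t5_xl_uniref50) require 32GB of GPU memory, so the LoRA fine-tuning approach is recommended instead.","https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues\u002F112",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},334,"Why do protein sequences in the DeepLoc dataset contain two-character tokens (such as 'AA', 'AC')?","These two-character tokens are not standard amino acids and may come from data preprocessing or annotation errors. There is no direct official explanation, but a multi-task fine-tuning example notebook is provided for reference: https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fblob\u002Fmaster\u002FFine-Tuning\u002FprotBERT-BFD-lightning-multitasks.ipynb . Checking the data cleaning steps is recommended.","https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues\u002F74",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},335,"Can a fully connected layer be added on top of a ProtTrans pretrained model and trained with MSE loss for a regression task (such as predicting thermostability)?","Yes. The officially recommended approach is to adapt the LoRA fine-tuning scripts to the regression task: https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning . Note that a small dataset (e.g. 4098 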
samples) may lead to overfitting; consider adjusting the train\u002Fvalidation split or using a smaller model (e.g. esm2_t6_8M_UR50D).","https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues\u002F107",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},336,"How can I train (fine-tune) a ProtTrans model on my own sequences?","The input and output format must be designed for the task. To continue pretraining, apply span corruption to the input sequences (for example, change `S E Q W E N C E` to `S E Q \u003Cextra_id_0> W E N C E`, with target output `\u003C\\s> \u003Cextra_id_0> W`). HuggingFace's T5 MLM script can be reused: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fblob\u002Fmain\u002Fexamples\u002Fflax\u002Flanguage-modeling\u002Frun_t5_mlm_flax.py .","https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues\u002F118",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},337,"Are precomputed ProtTrans embeddings for UniRef50 or human proteins available for download?","ProtT5 embeddings for human proteins can be downloaded from Zenodo: https:\u002F\u002Fzenodo.org\u002Frecord\u002F5047020 ; precomputed embeddings for some species are also available through UniProt: https:\u002F\u002Fwww.uniprot.org\u002Fhelp\u002Fembeddings . There is currently no official precomputed version for UniRef90; generate the embeddings yourself or coordinate with the community.","https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Fissues\u002F80",{"id":138,"question_zh":139,"answer_zh":140,"source_url":121},338,"What can I do about running out of GPU memory during fine-tuning?","Three approaches are recommended: 1) use LoRA fine-tuning (official scripts at https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans\u002Ftree\u002Fmaster\u002FFine-Tuning ); 2) enable mixed-precision training; 3) use DeepSpeed ZeRO Stage 3 Offload to offload some parameters to the CPU (see https:\u002F\u002Fpytorch-lightning.readthedocs.io\u002Fen\u002Fstable\u002Fadvanced\u002Fmodel_parallel.html#deepspeed-zero-stage-3-offload ).",[142],{"id":143,"version":144,"summary_zh":145,"released_at":146},100028,"1.0","This is the first release of the ProtTrans project, which includes the first LM model (Prot-T5-XL-UniRef50) that outperforms MSA methods.","2021-03-24T14:22:20"]