[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-hussius--deeplearning-biology":3,"tool-hussius--deeplearning-biology":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 
解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 
是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":83,"owner_website":84,"owner_url":85,"languages":79,"stars":86,"forks":87,"last_commit_at":88,"license":79,"difficulty_score":89,"env_os":90,"env_gpu":91,"env_ram":91,"env_deps":92,"category_tags":95,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":22,"created_at":96,"updated_at":97,"faqs":98,"releases":99},3697,"hussius\u002Fdeeplearning-biology","deeplearning-biology","A list of deep learning implementations in biology","deeplearning-biology 是一个专注于生物学领域的深度学习实现资源清单。它系统性地整理了将深度学习方法应用于生物科学的各类开源项目与代码库，尤其侧重于基因组学方向。\n\n在生物数据日益复杂且海量的背景下，研究人员往往难以快速找到适配特定任务的成熟算法模型。deeplearning-biology 有效解决了这一痛点，它将分散的资源按应用场景分类，涵盖序列建模、多组学整合、蛋白质结构预测与功能设计、药物发现、单细胞分析以及群体遗传学等前沿领域。除了具体的代码实现，该清单还精选了高质量的综述论文，帮助读者理解卷积网络、图神经网络及生成模型等在生物场景中的理论基础与应用前景。\n\n这份资源非常适合生物信息学研究人员、计算生物学家以及从事医疗 AI 开发的工程师使用。无论是希望利用现有模型加速科研进程，还是寻找灵感以开发新的算法架构，用户都能从中获得极具价值的指引。作为连接人工智能技术与生命科学研究的桥梁，deeplearning-biology 以开放的社区协作模式持续更新，是探索\"AI+ 生物”交叉领域不可或缺的实用指南。","# deeplearning-biology\n\nThis is a list of implementations of deep learning methods to biology, originally published on [Follow the Data](https:\u002F\u002Ffollowthedata.wordpress.com\u002F). 
There is a slant towards genomics because that's the subfield that I follow most closely.\n\nPlease, contribute to this growing list, especially in categories that I haven't covered well! \n\nYou might also want to refer to the [awesome deepbio](https:\u002F\u002Fgithub.com\u002Fgokceneraslan\u002Fawesome-deepbio) list.\n\n## Table of contents\n  - [Reviews](#reviews)\n  - [Model repositories and resources](#repositories)\n  - [Sequence modelling](#seqmodels)\n  - [Multi-omics integration](#integration)\n  - [Protein biology](#protein_biology)\n    - [Structure prediction](#protein_biology_structure_prediction)\n    - [Protein design](#protein_biology_design)\n    - [Function prediction](#protein_biology_function_prediction)\n  - [Genomics](#genomics)\n    - [Variant calling](#genomics_variant-calling)\n    - [Gene expression](#genomics_expression)\n    - [Imaging and gene expression](#imaging_expression)\n    - [Predicting enhancers and regulatory regions](#genomics_enhancers)\n    - [Non-coding RNA](#genomics_non-coding)\n    - [Methylation](#genomics_methylation)\n    - [Single-cell applications](#genomics_single-cell)\n  - [Chemoinformatics and drug discovery](#chemo)\n  - [Biomarker discovery](#biomarker)\n  - [Metabolomics](#metabolomics)\n  - [Generative models](#generative)\n  - [Population genetics](#genomics_pop)\n  - [Systems biology](#sysbio)\n\n## Reviews \u003Ca name=\"reviews\">\u003C\u002Fa>\n\nThese are not implementations as such, but contain useful pointers. Because review papers in this field are more time-sensitive, I have added the month of journal publication. 
Note that the original preprint may in some cases have been available online long before the published version.\n\n**(2021-11) A Unified View of Relational Deep Learning for Polypharmacy Side Effect, Combination Synergy, and Drug-Drug Interaction Prediction** [[open access paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02916v1.pdf)]\n\nIn recent years, numerous machine learning models which attempt to solve polypharmacy side effect identification, drug-drug interaction prediction and combination therapy design tasks have been proposed. Here, we present a unified theoretical view of relational machine learning models which can address these tasks. We provide fundamental definitions, compare existing model architectures and discuss performance metrics, datasets and evaluation protocols. In addition, we emphasize possible high impact applications and important future research directions in this domain.\n\n**(2019-12) Deep learning of pharmacogenomics resources: moving towards precision oncology** [[Briefings in Bioinformatics](https:\u002F\u002Facademic.oup.com\u002Fbib\u002Fadvance-article\u002Fdoi\u002F10.1093\u002Fbib\u002Fbbz144\u002F5669856#186956080)]\n\n**(2019-04) Deep learning: new computational modelling techniques for genomics** [[Nature Reviews Genetics paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41576-019-0122-6)]\n\nThis is a very nice conceptual review of how deep learning can be used in genomics. It explains how convolutional networks, recurrent networks, graph convolutional networks, autoencoders and GANs work. 
It also explains useful concepts like multi-modal learning, transfer learning, and model explainability.\n\n**(2019-01) A guide to deep learning in healthcare** [[Nature Medicine paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41591-018-0316-z)]\n\nFrom the abstract: \"Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.\"\n\n**(2018-11) A primer on deep learning in genomics** [[Nature Genetics paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41588-018-0295-5)][[Colaboratory notebook with tutorial](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr)]\n\nThis review, which features yours truly as one of its co-authors, is billed as a 'primer' which means it tries to help genomics researchers get started with deep learning. We tried to accomplish this by highlighting many practical issues such as tooling (not only deep learning libraries but also GPU cloud platforms, model zoos and online courses), defining your deep learning problem, explainability and troubleshooting. 
We also made a tutorial on Colaboratory that shows how to set up and run a simple convolutional network model for learning binding motifs, and how to inspect the model's predictions after it has been trained.\n\n**(2018-10) Deep learning in biomedicine** [[Nature Biotechnology paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnbt.4233)]\n\nFrom the abstract: \"Deep learning is beginning to impact biological research and biomedical applications as a result of its ability to integrate vast datasets, learn arbitrarily complex relationships and incorporate existing knowledge. Already, deep learning models can predict, with varying degrees of success, how genetic variation alters cellular processes involved in pathogenesis, which small molecules will modulate the activity of therapeutically relevant proteins, and whether radiographic images are indicative of disease. However, the flexibility of deep learning creates new challenges in guaranteeing the performance of deployed systems and in establishing trust with stakeholders, clinicians and regulators, who require a rationale for decision making. We argue that these challenges will be overcome using the same flexibility that created them; for example, by training deep models so that they can output a rationale for their predictions. Significant research in this direction will be needed to realize the full potential of deep learning in biomedicine.\"\n\n**(2018-04) Opportunities And Obstacles For Deep Learning In Biology And Medicine** [[bioRxiv preprint](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F05\u002F28\u002F142760)][[J Roy Soc interface paper](https:\u002F\u002Froyalsocietypublishing.org\u002Fdoi\u002F10.1098\u002Frsif.2017.0387)]\n\nThis impressive collaborative review was written completely in the open on [Github](https:\u002F\u002Fgithub.com\u002Fgreenelab\u002Fdeep-review). 
It is focused on discussing how deep learning may be able to transform patient classification and treatment as well as fundamental biological research in the future, and what the main obstacles are that could prevent it from happening. A lot of interesting points are brought up here. Together with the review listed below, which has a more technical slant, you will get a good overview of how deep learning is used and can be used in biology and medicine.\n\n**(2017-01) Deep learning for health informatics** [[open access paper](http:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?tp=&arnumber=7801947)]\n\nAn overview of several types of deep nets and their applications in translational bioinformatics, medical imaging, \"pervasive sensing\", medical data and public health.\n\n**(2016-07) Deep learning for computational biology** [[open access paper](http:\u002F\u002Fmsb.embopress.org\u002Fcontent\u002F12\u002F7\u002F878)]\n\nThis is a very nice review of deep learning applications in biology. It primarily deals with convolutional networks and explains well why and how they are used for sequence (and image) classification.\n\n## Model repositories and resources \u003Ca name=\"repositories\">\u003C\u002Fa>\n\n**The Kipoi repository accelerates community exchange and reuse of predictive models for genomics** [[Github](https:\u002F\u002Fgithub.com\u002Fkipoi\u002Fkipoiseq\u002F)][[Website](https:\u002F\u002Fkipoi.org\u002F)][[Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41587-019-0140-0)] \n\nKipoi is a model zoo for genomics, installable by a simple pip install, which provides a consistent interface to hundreds of predictive models in genomics. 
Kipoi implements a standard set of data loaders for training and prediction of sequence models in deep learning.\n\n**DragoNN** [[Github](https:\u002F\u002Fgithub.com\u002Fkundajelab\u002Fdragonn)][[Website](https:\u002F\u002Fkundajelab.github.io\u002Fdragonn\u002F)]\n\nDragoNN provides a toolkit for learning about modelling regulatory sequence with neural networks. It has tools for interpreting sequence models and web-based tutorials using Jupyter Notebooks for teaching interactive model manipulation and visualization.\n\n\n## Sequence modelling \u003Ca name=\"seqmodels\">\u003C\u002Fa>\n\nThis is a collection of mostly NLP inspired models for modelling biological sequences, such as proteins or genes. Perhaps these models should be moved to other sections as language models in biology become more mainstream.\n\n**Continuous Distributed Representation of Biological Sequences for Deep Genomics and Deep Proteomics**[[github](https:\u002F\u002Fgithub.com\u002Fehsanasgari\u002FDeep-Proteomics)][[paper](http:\u002F\u002Fjournals.plos.org\u002Fplosone\u002Farticle?id=10.1371\u002Fjournal.pone.0141287)]\n\nThe GitHub summary reads: \"We introduce a new representation for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. Biovectors are basically n-gram character skip-gram wordvectors for biological sequences (DNA, RNA, and Protein). In this work, we have explored biophysical and biochemical meaning of this space. 
In addition, in variety of bioinformatics tasks we have shown the strength of such a sequence representation.\"\n\n**pysster: Learning Sequence and Structure Motifs in DNA and RNA Sequences using Convolutional Neural Networks**[[github](https:\u002F\u002Fgithub.com\u002Fbudach\u002Fpysster)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F12\u002F06\u002F230086)]\n\nA toolbox for learning motifs from DNA\u002FRNA sequence data using convolutional neural networks, this Tensorflow-based library supposedly runs on GPU out of the box and also does things like hyperparameter optimization and visualizations of what different network layers are learning.\n\n**Unified rational protein engineering with sequence-based deep representation learning** [[github](https:\u002F\u002Fgithub.com\u002Fchurchlab\u002FUniRep)][[paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41592-019-0598-1)]\n\nThe authors introduce UniRep, an early language model for protein sequences based on mLSTMs (multiplicative LSTMs). It's trained on 24 million protein sequences from UniRef50 and can be used to convert protein sequences into numerical vector representations that contain information about protein properties. For example, the representations can be used to train downstream predictors of protein stability and function. UniRep can also be used as a \"babbler\", or generative model, to design new proteins.\n\n\n**Natural language predicts viral escape** [[github](https:\u002F\u002Fgithub.com\u002Fbrianhie\u002Fviral-mutation)][[paper](https:\u002F\u002Fscience.sciencemag.org\u002Fcontent\u002F371\u002F6526\u002F248.17.full)]\n\nThis paper attempts to model how viruses evade being detected by the immune system (\"viral escape\") by using a language model on amino acids implemented with a BiLSTM-based network. 
They posit that a sequence that enables escape from the immune system should have high viability, which they liken to the grammaticality of a sentence, while also having different \"semantics\", i.e. looking different from an antigenic point of view. The grammaticality is learned in the final layer as a prediction task, whereas the semantics are extracted from the representation in the next to last layer.\n\n\n**Genomic-ULMFiT: ULMFiT for Genomic Sequence Data** [[github](https:\u002F\u002Fgithub.com\u002Fkheyer\u002FGenomic-ULMFiT)]\n\nThis repo is an implementation of FastAI's ULMFiT language transfer learning model for genomics. ULMFiT is based on an AWD-LSTM model and has been shown to be very effective for solving various text classification tasks. Here, the repo's author has extended FastAI's classes with specific subclasses for DNA sequence data. The concept with ULMFiT is that you (1) learn a language model from a large body of text in an unsupervised way (ie you don't need any labels) by having the model guess the next word (or token); (2) take the language model from step (1) and fine-tune it on the (probably) smaller labeled data set that you want to do classification on, but still do the training without labels in this step (and try to predict the next word), (3) finally fine-tune on the final classification task, using the labels. In genomics, the large body of text in step (1) could be, for instance, the whole human genome, or some other subset of GenBank\u002FSequence Read Archive\u002F... The author shows that this approach works quite well for a range of classification problems, like E. coli and human promoter classification, metagenomic classification, enhancer classification and mRNA\u002FlincRNA classification. 
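

The three-step ULMFiT recipe above ultimately rests on plain next-token prediction over tokenized sequences. As a minimal, illustrative sketch (not code from the Genomic-ULMFiT repo; `kmer_tokenize` and `next_token_pairs` are hypothetical helper names), this is roughly what turning a DNA string into k-mer tokens and unsupervised (context, next token) training pairs for step (1) could look like:

```python
# Illustrative only: k-mer tokenization of DNA plus (context, next-token)
# pair construction, the kind of unlabeled training data used for the
# "guess the next token" language-modeling step described above.

def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def next_token_pairs(tokens, context=2):
    """Build (context window, next token) pairs for language-model training."""
    return [(tokens[i:i + context], tokens[i + context])
            for i in range(len(tokens) - context)]

tokens = kmer_tokenize("ATGCGTAC", k=3)
# tokens == ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
pairs = next_token_pairs(tokens, context=2)
# first pair: (['ATG', 'TGC'], 'GCG')
```

Steps (2) and (3) would then reuse the same tokenization, first on the smaller target-domain corpus without labels and finally on the labeled classification data.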
\n\n**Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences** [[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F622803v1.full)]\n\nIn this work from Facebook's (now Meta's) AI group, a BERT-style language model, ESM-1, is trained on 86 billion amino acids across 250 million sequences. Like with ULMFiT (above), the idea is to use transfer learning: pre-training on a massive amount of data to teach a model something about the underlying logic of the language of DNA or proteins, in order to then be able to fine-tune the model for specific tasks. \n\n**ProGen2: Exploring the Boundaries of Protein Language Models** [[github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002Fprogen)][[preprint](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13517.pdf)]\n\nSurprisingly, Salesforce has also been involved in protein language model research for quite some time. In this preprint they introduce a suite of protein language models that can be used for things like sequence fitness prediction and sequence generation. \n\n**MSA Transformer** [[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.02.12.430858v3)]\n\nHere, the same team from Meta that introduced the ESM-1 model (above) show that a different type of transformer, which uses multiple sequence alignments (MSA) as input instead of protein sequences, can achieve even better results than a BERT-style transformer while using a smaller number of parameters. They introduce different forms of row and column attention to extract as much information from the MSAs as possible. The GitHub repo contains one trained version of the model, ESM-MSA-1b. 
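

To make the row and column attention idea concrete, here is a minimal numpy sketch of axial attention over an MSA already embedded as a (sequences, positions, dim) array. This is an illustration of the general idea with no learned weights, not the ESM-MSA-1b implementation:

```python
# Illustrative axial attention over an MSA embedding of shape (S, L, d):
# row attention mixes positions within each aligned sequence, column
# attention mixes sequences at each alignment position.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head self-attention over the second-to-last axis, using x
    # itself as queries, keys and values (no projection matrices).
    d = x.shape[-1]
    scores = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return scores @ x

def msa_axial_block(msa):
    # One row-then-column pass with residual connections.
    msa = msa + self_attention(msa)                         # rows: attend over L
    swapped = np.swapaxes(msa, 0, 1)                        # (L, S, d)
    msa = msa + np.swapaxes(self_attention(swapped), 0, 1)  # columns: attend over S
    return msa

msa = np.random.default_rng(0).normal(size=(4, 16, 8))  # 4 sequences, 16 columns
out = msa_axial_block(msa)                              # shape preserved: (4, 16, 8)
```

The paper's model adds tricks this sketch omits, such as tying row-attention maps across sequences to keep memory and parameter counts down.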
\n\n\n**ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing** [[github](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans)][[huggingface](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_bert_bfd)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2020.07.12.199554v2)]\n\nA large-scale effort to train and benchmark Transformer models on protein sequences, this project even has provided several of its models to the public on the HuggingFace model hub. The abstract starts: *\"Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112-times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024).\"*\n\n\n**Effective gene expression prediction from sequence by integrating long-range interactions** [[github](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Fenformer)][[tensorflow hub](https:\u002F\u002Ftfhub.dev\u002Fdeepmind\u002Fenformer\u002F1)][[paper](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.04.07.438649v1)]\n\n\nCan a transformer architecture help solve the hard problem of relating genomic enhancers to gene expression? 
It is experimentally laborious to connect distal enhancers to genes, and the presence of many long-range interactions has made it challenging to learn them from data via correlations (due to multiple testing), convolutional networks (too short receptive fields) or recurrent networks (hard to keep a long enough memory.) Now researchers at DeepMind, Calico and Google have introduced the \"enhancer transformer\", ie Enformer, which can leverage the self-attention mechanism to learn enhancer\u002Fgene expression interactions with a much longer range than before. Commendably, the authors have not only published the code on github but there is also a pretrained model on Tensorflow Hub.\n\n\n**DNABERT**\n[[github](https:\u002F\u002Fgithub.com\u002Fjerryji1993\u002FDNABERT)][[paper](https:\u002F\u002Facademic.oup.com\u002Fbioinformatics\u002Farticle\u002F37\u002F15\u002F2112\u002F6128680)]\n\nA BERT-like masked language model trained on human DNA sequence using k-mers as tokens.\n\n**GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics** [[github](https:\u002F\u002Fgithub.com\u002Framanathanlab\u002Fgenslm)][[paper](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.10.10.511571v1)]\n\n\nLike the earlier Enformer, this is a transformer model for nucleotides (DNA or RNA), but with different design and goals. Whereas Enformer is a pre-trained model for mammalian genomes (human and mouse), GenSLM is intended as a foundation model for less complex genomes, such as bacteria and viruses. It is pre-trained on 110 million prokaryotic (bacterial and archaeal) gene sequences using a GPT-style (\"predict the next token\") loss. The tokens are codons (nucleotide triplets), and consequently, the trained model can be \"prompted\" in GPT-3 fashion with codons. 
The foundation model can be further finetuned on a subset of genomes (\"evolutionary finetuning\"), in the case of this paper 1.5 million SARS-CoV-2 genome sequences, yielding a SARS-CoV-2 specific language model, which contains implicit knowledge of the virus' evolutionary landscape and can be used to identify variants of concern. A further interesting twist in this paper is that long-range interactions, which Enformer tried to solve with convolutions coupled with self-attention, are modelled using diffusion models (à la Stable Diffusion.)\n\n\n**The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics**\n[[github](https:\u002F\u002Fgithub.com\u002Finstadeepai\u002Fnucleotide-transformer)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.01.11.523679v1)]\nThis is work from [Instadeep](https:\u002F\u002Fwww.instadeep.com), an ML consultancy that was acquired by BioNTech, the developer of the Pfizer mRNA vaccine against SARS-CoV-2, in January 2023. The authors have attempted to build foundation models for nucleotide (i.e., DNA and RNA) sequences based on a broader variety of genomes than in previous efforts, namely >3000 human genomes and 850 non-human genomes from different organisms. These models are meant to be used for transfer learning and are claimed to work well for downstream prediction tasks even in low-data settings. Concretely, the models are BERT-like models with nucleotide 6-mers as tokens.\n\n**GENA-LM**\n[[github](https:\u002F\u002Fgithub.com\u002FAIRI-Institute\u002FGENA_LM)]\nThis is a masked language model trained on human DNA sequence (like e.g. 
the DNABERT or Nucleotide Transformer above.)\nAccording to the github page, one difference from DNABERT is that GENA-LM uses BPE tokenization instead of k-mers and can thus handle an input sequence size of about 3000 nucleotides (512 BPE tokens) compared to 510 nucleotides of DNABERT.\n\n## Multi-omics integration \u003Ca name='integration'>\u003C\u002Fa>\n\n**Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integration in Precision Medicine.** [[paper](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002Farticles\u002FPMC6207407\u002F)]\n\nA review paper about the potential of deep learning for multi-omics data integration.\n\n## Protein biology \u003Ca name=\"protein_biology\">\u003C\u002Fa>\n\nThis category is divided into sub-categories.\n\n### Structure prediction \u003Ca name='protein_biology_structure_prediction'>\u003C\u002Fa>\n\n**Highly accurate protein structure prediction with AlphaFold** [[github](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Falphafold)][[paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-021-03819-2)]\n\nThis one probably needs no introduction. DeepMind released the first version of its protein-folding method AlphaFold in 2018, when it won the prestigious CASP competition. A completely redesigned version, described in this paper (and sometimes called AlphaFold2), won the same competition in 2020 with a very wide margin. The new version used a component called the \"Evoformer\", a kind of transformer which iteratively processed a set of aligned protein sequences and a matrix of pairwise interactions between amino acids to generate a representation that can be used as input to a folding module, which uses a specific type of attention called \"invariant point attention\". 
The original AlphaFold paper has been followed by many papers that show how new tasks can be solved by modifying the model in different ways.\n\n**OpenFold** [[github](https:\u002F\u002Fgithub.com\u002Faqlaboratory\u002Fopenfold)]\n\nThis is a Pytorch-based, open-source reimplementation of AlphaFold, which reproduces practically all of the functionality. Before AlphaFold made its model weights generally available, OpenFold was a way to train your own folding model.\n\n**MiniFold: a re-implementation of DeepMind's AlphaFold** [[github](https:\u002F\u002Fgithub.com\u002FEricAlcaide\u002FMiniFold)]\n\nOne of the more spectacular successes of deep learning in biology in recent years was when DeepMind's AlphaFold model won the CASP13 protein structure prediction challenge. It was originally not listed on this page because there was no open implementation, but this has since changed. In any case, MiniFold was an attempt to re-implement AlphaFold in a somewhat more minimalistic way.\n\n**Evolutionary-scale prediction of atomic level protein structure with a language model** [[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm#esmfold)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.07.20.500902v2)]\n\nThe Meta research group that created ESM-1, ESM-MSA and other models described elsewhere in this document shows here that large language models can also be used to do protein structure prediction. Given enough parameters and training data, the trained model starts to implicitly learn information about 3D conformation. The authors claim that this LLM-only approach (i.e. it does not use multiple sequence alignments or backbone inputs) is up to 60x faster than other approaches, such as AlphaFold. 
They use the model to make structure predictions for 600 million metagenomic (environmental DNA) protein sequences.\n\n**Protein Loop Modeling Using Deep Generative Adversarial Network**[[paper](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8372069\u002F)][[website](https:\u002F\u002Fzhaoyu.li\u002Floop_modeling_gan.html)]\n\nFrom the abstract: \"Biology and medicine have a long-standing interest in computational structure prediction and modeling of proteins. There are often missing regions or regions that need to be remodeled in protein structures. The process of predicting particular missing regions in a protein structure is called loop modeling. In this paper, we propose a generative adversarial network (GAN) in deep learning for loop modeling using the idea of image inpainting. The generative network is to capture the context of the loop region and predict the missing area. The adversarial network is to make the prediction look real and provide gradients to the generative network. The proposed network was evaluated on a common benchmark for loop modeling. Experiments show that our method can successfully predict the loop region and has achieved better performance than the state-of-the-art tools. To our knowledge, this work represents the first attempt of using GAN for any bioinformatics studies.\"\n\n**Pcons2 – Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns** [[web interface](http:\u002F\u002Fc2.pcons.net\u002F)]\n\nHere, a “deep random forest” with five layers is used to improve predictions of which residues (amino acids) in a protein are physically interacting with each other. 
This is useful for predicting the overall structure of the protein (a very hard problem.)\n\n### Protein design \u003Ca name='protein_biology_design'>\u003C\u002Fa>\n\n**Low-N protein engineering with data-efficient deep learning** [[paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41592-021-01100-y)]\n\nBased on the UniRep model (described elsewhere in this document), the authors introduce a machine learning paradigm or workflow for training models predicting protein properties and designing novel sequence variants based on a very small number of labelled samples (as few as 20-30). In this paradigm, a base model is trained in an unsupervised manner on a large set of diverse protein sequences (like in the original UniRep paper), and then this model is trained further with the same loss function but on a more restricted family of proteins which is evolutionarily related to the target protein. This procedure is called \"evotuning\" or evolutionary finetuning. After this step, the authors show that supervised learning using the representation created by the evotuned model often works well given only a small number of labelled samples. With the supervised model in hand, in silico directed evolution can be used to design a new variant of the target protein with desired characteristics.\n\n\n**ProteinGAN: Expanding functional protein sequence space using generative adversarial networks** [[code](https:\u002F\u002Fgithub.com\u002Fbiomatterdesigns\u002FProteinGAN)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2019\u002F10\u002F04\u002F789719.full.pdf)]\n\nFrom the abstract: \"De novo protein design for catalysis of any desired chemical reaction is a long standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, mapping protein sequence to protein function is, however, neither computationally nor experimentally tangible. 
Here we developed ProteinGAN, a specialised variant of the generative adversarial network that is able to 'learn' natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase as a template enzyme, we show that 24% of the ProteinGAN-generated and experimentally tested sequences are soluble and display wild-type level catalytic activity in the tested conditions in vitro, even in highly mutated (>100 mutations) sequences. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.\"\n\n**Robust deep learning based protein sequence design using ProteinMPNN** [[code](https:\u002F\u002Fgithub.com\u002Fdauparas\u002FProteinMPNN)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.06.03.494563v1)]\n\nThis work presents a method for designing a protein sequence that is predicted to fold into a specified conformation, i.e. in a way the reverse of AlphaFold: going from structure to sequence. This is achieved by using a type of graph neural network, a message passing neural network (MPNN.) The diversity of the generated sequences can be tuned, and the authors test the performance of the method both using AlphaFold and experimentally.\n\n**Learning inverse folding from millions of predicted structures** [[code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm#invf)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.04.10.487779v2)]\n\nLike ProteinMPNN, this is an inverse folding method that attempts to go from a structure (protein backbone atom coordinates) to a sequence. 
While ProteinMPNN was trained on experimentally determined structures, this method, called ESM-IF, uses 12 million structures predicted by AlphaFold as its training material.\n\n### Function prediction \u003Ca name='protein_biology_function_prediction'>\u003C\u002Fa>\n\n\n**A Deep Learning Model for Predicting Tumor Suppressor Genes and Oncogenes from PDB Structure** [[github](https:\u002F\u002Fgithub.com\u002Ftavanaei\u002FCancer-Suppressor-Gene-Deep-Learning)][[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F10\u002F22\u002F177378)]\n\nThe authors use CNNs on feature maps extracted from protein 3D structures in the Protein Data Bank (PDB) to predict oncogenes and tumor-suppressor genes.   \n\n**Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning** [[code](http:\u002F\u002Fwww.rnabinding.com\u002FDeep_RBPPred\u002FDeep-RBPPred.html)][[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F10\u002F27\u002F210153)] \n\nPredicts RNA-binding proteins using CNNs.\n\n**EVOVAE: Variational autoencoding of Protein Sequences** [[code](https:\u002F\u002Fgithub.com\u002Fsamsinai\u002FVAE_protein_function)][[arXiv preprint](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.03346)]\n\nFrom the abstract: \"We present an embedding of natural protein sequences using a Variational Auto-Encoder and use it to predict how mutations affect protein function. We use this unsupervised approach to cluster natural variants and learn interactions between sets of positions within a protein. This approach generally performs better than baseline methods that consider no interactions within sequences, and in some cases better than the state-of-the-art approaches that use the inverse-Potts model. 
This generative model can be used to computationally guide exploration of protein sequence space and to better inform rational and automatic protein design.\"\n\n**Structure-Based Function Prediction using Graph Convolutional Networks** [[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F04\u002F786236.full.pdf)]\n\nFrom the abstract: \"We present a deep learning Graph Convolutional Network (GCN) trained on sequence and structural data and evaluate it on ~40k proteins with known structures and functions from the Protein Data Bank (PDB). Our GCN predicts functions more accurately than Convolutional Neural Networks trained on sequence data alone and competing methods. Feature extraction via a language model removes the need for constructing multiple sequence alignments or feature engineering. Our model learns general structure-function relationships by robustly predicting functions of proteins with ≤ 30% sequence identity to the training set. Using class activation mapping, we can automatically identify structural regions at the residue-level that lead to each function prediction for every protein confidently predicted, advancing site-specific function prediction.\"\n\n\n## Genomics \u003Ca name=\"genomics\">\u003C\u002Fa>\n\nThis category is divided into several subfields.\n\n### Variant calling \u003Ca name='genomics_variant-calling'>\u003C\u002Fa>\n\n**DeepVariant** [[github](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fdeepvariant)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F12\u002F21\u002F092890)]\n\nThis preprint from Google originally came out in late 2016 but it got the most publicity about a year later when the code was made public and press releases started appearing. 
The Google researchers approached a well-studied problem, variant calling from DNA sequencing data (where the aim is to correctly identify variations from the reference genome in an individual's DNA, e.g. mutations or polymorphisms) using a counter-intuitive but clever approach. Instead of using the nucleotides in the sequenced DNA fragments directly (in the form of the symbols A, C, G, T), they first converted the sequences into images and then applied convolutional neural networks to these images (which represent \"pile-ups\" of DNA sequences; stacks of aligned sequences.) This turned out to be a very effective way to call variants as proven by both Google's own and independent benchmarks.\n\n**Language models enable zero-shot prediction of the effects of mutations on protein function** [[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm#zs_variant)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.07.09.450648v2)]\n\nThis work builds on Meta's protein language models (ESM-1 et al.; see above) and shows that these models can be used for \"zero-shot\" prediction of variant effects on protein function; that is, no extra experimental data or model training is needed. The protein language model can be used as-is to infer variant effects.\n\n### Gene expression \u003Ca name='genomics_expression'>\u003C\u002Fa>\n\nIn modeling gene expression, the inputs are typically numerical values (integers or floats) estimating how much RNA is produced from a DNA template in a particular cell type or condition.\n\n**scPRINT: pre-training on 50 million cells allows robust gene network predictions** [[github](https:\u002F\u002Fgithub.com\u002Fcantinilab\u002FscPRINT)] [[bioRxiv](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2024.07.29.605556v1)]\nA Large Cell Model to predict gene-gene interactions and more at the level of the transcriptome. 
It uses a novel encoding and decoding of RNA-seq data to learn regulatory mechanisms from tens of millions of single-cell RNA-seq profiles.\n\n**Gene Expression Convolutions Using Gene Interaction Graphs** [[github](https:\u002F\u002Fgithub.com\u002Fmila-iqia\u002Fgene-graph-conv)] [[arxiv](https:\u002F\u002Fgithub.com\u002Fmila-iqia\u002Fgene-graph-conv)]\nThey discuss how gene-gene interaction graphs (same pathway, protein-protein, co-expression, or research paper text association) can be used to impose a bias on a deep neural network model similar to the spatial bias imposed by convolutions on an image. They find this approach provides an advantage for particular tasks in a low data regime but is very dependent on the quality of the graph used. \n\n**ADAGE – Analysis using Denoising Autoencoders of Gene Expression** [[github](https:\u002F\u002Fgithub.com\u002Fgreenelab\u002Fadage)]\n\nThis is a Theano implementation of stacked denoising autoencoders for extracting relevant patterns from large sets of gene expression data, a kind of feature construction approach if you will. I have played around with this package quite a bit myself. The authors initially published a [conference paper](http:\u002F\u002Fwww.worldscientific.com\u002Fdoi\u002Fabs\u002F10.1142\u002F9789814644730_0014) applying the model to a compendium of breast cancer (microarray) gene expression data, and more recently posted a paper on [bioRxiv](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2015\u002F11\u002F05\u002F030650) where they apply it to all available expression data (microarray and RNA-seq) on the pathogen Pseudomonas aeruginosa. 
(I understand that this manuscript will soon be published in a journal.)\n\n**Exploiting Ladder Networks for Gene Expression Classification** [[paper](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007%2F978-3-319-78723-7_23)]\n\nThis paper applies Ladder networks, a semi-supervised deep learning method, to the binary cancer classification problem. The model performance is evaluated on the TCGA dataset against other deep learning and conventional machine learning approaches.  \n\n**Learning structure in gene expression data using deep architectures** [[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2015\u002F11\u002F16\u002F031906)]\n\nThis is also about using stacked denoising autoencoders for gene expression data, but there is no available implementation (as far as I could tell). Included here for the sake of completeness (or something.)\n\n**Gene expression inference with deep learning** [[github](https:\u002F\u002Fgithub.com\u002Fuci-cbcl\u002FD-GEX)][[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2015\u002F12\u002F15\u002F034421)]\n\nThis deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2\u002FTheano.\n\n**Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model** [[paper](http:\u002F\u002Fbmcbioinformatics.biomedcentral.com\u002Farticles\u002F10.1186\u002Fs12859-015-0852-1)]\n\nThe authors use stacked autoencoders to learn biological features in yeast from thousands of microarrays. 
They analyze the hidden layer representations and show that these encode biological information in a hierarchical way, so that for instance transcription factors are represented in the first hidden layer.\n\n**Boosting Gene Expression Clustering with System-Wide Biological Information: A Robust Autoencoder Approach** [[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F05\u002F214122)]\n\nUses a robust autoencoder (an autoencoder with an outlier filter) to cluster gene expression profiles. \n\n**Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk** [[github](https:\u002F\u002Fgithub.com\u002FFunctionLab\u002FExPecto)][[paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41588-018-0160-6)]\n\nThe authors use a two-step model to predict the effect of genetic variants on gene expression. In the first step, they trained a convolutional neural network to model the 2,002 epigenetic marks collected by the ENCODE and Roadmap Epigenomics consortia. In the second step, they trained tissue-specific regularized linear models on the representation of each gene's cis-regulatory region produced by the first-step convolutional neural network. The effect of a variant on tissue-specific gene expression is then calculated as the decrease in predicted expression under *in silico* mutagenesis.\n\n### Imaging and gene expression \u003Ca name='imaging_expression'>\u003C\u002Fa>\n\n**Transcriptomic learning for digital pathology** [[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F11\u002F760173.full.pdf)]\n\nFrom the abstract: \"We propose a novel approach based on the integration of multiple data modes, and show that our deep learning model, HE2RNA, can be trained to systematically predict RNA-Seq profiles from whole-slide images alone, without the need for expert annotation. 
HE2RNA is interpretable by design, opening up new opportunities for virtual staining. In fact, it provides virtual spatialization of gene expression, as validated by double-staining on an independent dataset. Moreover, the transcriptomic representation learned by HE2RNA can be transferred to improve predictive performance for other tasks, particularly for small datasets.\"\n\n### Predicting enhancers and regulatory regions \u003Ca name='genomics_enhancers'>\u003C\u002Fa>\n\nHere the inputs are typically “raw” DNA sequence, and convolutional networks (or layers) are often used to learn regularities within the sequence. Hat tip to [Melissa Gymrek](http:\u002F\u002Fmelissagymrek.com\u002Fscience\u002F2015\u002F12\u002F01\u002Funlocking-noncoding-variation.html) for pointing out some of these.\n\n**DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences** [[github](https:\u002F\u002Fgithub.com\u002Fuci-cbcl\u002FDanQ)]\n\nMade for predicting the function of non-protein coding DNA sequence. Uses a convolution layer to capture regulatory motifs (i.e. single DNA snippets that control the expression of genes, for instance), and a recurrent layer (of the LSTM type) to try to discover a “grammar” for how these single motifs work together. Based on Keras\u002FTheano.\n\n**Basset – learning the regulatory code of the accessible genome with deep convolutional neural networks** [[github](https:\u002F\u002Fgithub.com\u002Fdavek44\u002FBasset)]\n\nBased on Torch, this package focuses on predicting the accessibility (or “openness”) of the chromatin – the physical packaging of the genetic information (DNA+associated proteins). 
This can exist in more condensed or relaxed states in different cell types, which is partly influenced by the DNA sequence (not completely, because then it would not differ from cell to cell.)\n\n**Basenji – Sequential regulatory activity prediction across chromosomes with convolutional neural networks** [[github1](https:\u002F\u002Fwww.github.com\u002Fcalico\u002Fbasenji)][[github2](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensor2tensor\u002Fblob\u002Fmaster\u002Ftensor2tensor\u002Fmodels\u002Fresearch\u002Fgene_expression.py)][[biorxiv](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F07\u002F10\u002F161851)]\n\nA follow-up project to Basset, this Tensorflow-based model uses both standard and dilated convolutions to model regulatory signals and gene expression (in the form of CAGE tag density) in many different cell types. Notably, the underlying model has been brought into Google's Tensor2Tensor repository (see \"github2\" link above), which collects many models in image and speech recognition, machine translation, text classification etc. However, at the time of writing the Tensor2Tensor model seems not quite mature for easy use, so it is probably better to use the dedicated Basenji repo (\"github1\") for now. \n\n**DeepSEA – Predicting effects of noncoding variants with deep learning–based sequence model** [[web server](http:\u002F\u002Fdeepsea.princeton.edu\u002Fjob\u002Fanalysis\u002Fcreate\u002F)][[paper](http:\u002F\u002Fwww.nature.com\u002Fnmeth\u002Fjournal\u002Fv12\u002Fn10\u002Ffull\u002Fnmeth.3547.html)]\n\nLike the packages above, this one also models chromatin accessibility as well as the binding of certain proteins (transcription factors) to DNA and the presence of so-called histone marks that are associated with changes in accessibility. This piece of software seems to focus a bit more explicitly than the others on predicting how single-nucleotide mutations affect the chromatin structure. 
Published in a high-profile journal (Nature Methods).\n\n**DeepBind – Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning** [[code](http:\u002F\u002Ftools.genes.toronto.edu\u002Fdeepbind\u002F)][[paper](http:\u002F\u002Fwww.nature.com\u002Fnbt\u002Fjournal\u002Fv33\u002Fn8\u002Ffull\u002Fnbt.3300.html)]\n\nThis is from the group of Brendan Frey in Toronto, and the authors are also involved in the company Deep Genomics. DeepBind focuses on predicting the binding specificities of DNA-binding or RNA-binding proteins, based on experiments such as ChIP-seq, ChIP-chip, RIP-seq,  protein-binding microarrays, and HT-SELEX. Published in a high-profile journal (Nature Biotechnology.)\n\n**DeeperBind - Enhancing Prediction of Sequence Specificities of DNA Binding Proteins** [[preprint](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1611.05777.pdf)]\n\nThis is an attempt to improve on DeepBind by adding a recurrent sequence learning module (LSTM) after the convolutional layer(s). In this way, the authors propose to capture a positional dimension that is lost in the pooling step in the original DeepBind design. They claim that benchmarking shows that this architecture leads to superior performance compared to previous work.\n\n**DeepMotif - Visualizing Genomic Sequence Classifications** [[paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1605.01133)]\n\nThis is also about learning and predicting binding specificities of proteins to certain DNA patterns or \"motifs\". However, this paper makes use of a combination of convolutional layers and [highway networks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1505.00387v2.pdf), with more layers than the DeepBind network. 
The authors also show how a learned classifier can generate typical DNA motifs by input optimization; applying back-propagation with all the weights held constant in order to find an input pattern that maximally activates the appropriate output node in the network.\n\n**Convolutional Neural Network Architectures for Predicting DNA-Protein Binding** [[code](http:\u002F\u002Fcnn.csail.mit.edu\u002F)][[paper](http:\u002F\u002Fbioinformatics.oxfordjournals.org\u002Fcontent\u002F32\u002F12\u002Fi121.full)]\n\nThis work describes a systematic exploration of convolutional neural network (CNN) architectures for DNA-protein binding. It concludes that the convolutional kernels are very important for the success of the networks on motif-based tasks. Interestingly, the authors have provided a Dockerized implementation of DeepBind from the Frey lab (see above) and also provide EC2-launcher scripts and code for comparing different GPU-enabled models programmed in Caffe.\n\n**PEDLA: predicting enhancers with a deep learning-based algorithmic framework** [[code](https:\u002F\u002Fgithub.com\u002Fwenjiegroup\u002FPEDLA)][[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F01\u002F07\u002F036129)]\n\nThis package is for predicting enhancers (stretches of DNA that can enhance the expression of a gene under certain conditions or in a certain kind of cell, often working at a distance from the gene itself) based on heterogeneous data from (e.g.) 
the ENCODE project, using 1,114 features altogether.\n\n**DEEP: a general computational framework for predicting enhancers** [[paper](http:\u002F\u002Fnar.oxfordjournals.org\u002Fcontent\u002Fearly\u002F2014\u002F11\u002F05\u002Fnar.gku1058.full)][[code](http:\u002F\u002Fcbrc.kaust.edu.sa\u002Fdeep\u002F)]\n\nAn ensemble prediction method for enhancers.\n\n**Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods** (and several other papers applying various kinds of deep networks to regulatory region prediction) [[code](https:\u002F\u002Fgithub.com\u002Fyifeng-li\u002FDECRES)] (one [[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F02\u002F28\u002F041616)] out of several)\n\nWyeth Wasserman’s group have made a kind of [toolkit](https:\u002F\u002Fgithub.com\u002Fyifeng-li\u002FDECRES) (based on the Theano tutorials) for applying different kinds of deep learning architectures to cis-regulatory element (DNA stretches that can modulate the expression of a nearby gene) prediction. They use a specific “feature selection layer” in their nets to restrict the number of features in the models. This is implemented as an additional sparse one-to-one linear layer between the input layer and the first hidden layer of a multi-layer perceptron.\n\n**FIDDLE: An integrative deep learning framework for functional genomic data inference** [[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F10\u002F17\u002F081380)][[code](https:\u002F\u002Fgithub.com\u002Fueser\u002FFIDDLE)][[Youtube talk](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=pcLTUsOm5pc&feature=youtu.be&list=PLlMMtlgw6qNjROoMNTBQjAcdx53kV50cS&t=2411)]\n\nThe group predicted transcription start sites and regulatory regions, but claims this solution could easily be generalized to predict other features too. FIDDLE stands for Flexible Integration of Data with Deep LEarning. 
The idea (nicely explained by the author in the YouTube video above) is to model several genomic signals jointly using convolutional networks. This could be for example DNase-seq, ATAC-seq, ChIP-seq, TSS-seq, maybe RNA-seq signals (as in .wig files with one value per base in the genome).\n\n**Deep Learning Of The Regulatory Grammar Of Yeast 5′ Untranslated Regions From 500,000 Random Sequences** [[paper](http:\u002F\u002Fgenome.cshlp.org\u002Fcontent\u002F27\u002F12\u002F2015)][[code](http:\u002F\u002Fgenome.cshlp.org\u002Fcontent\u002Fsuppl\u002F2017\u002F11\u002F02\u002Fgr.224964.117.DC1\u002FSupplemental_code.tar.gz)]\n\nThis is a CNN model that attempts to predict protein expression from the DNA sequence in a specific type of genomic region called 5' UTR (five-prime untranslated region). The model is built in Keras and a nice touch by the authors is that they optimized the parameters using hyperopt, which is also shown in one of the Jupyter notebooks that comes along with the paper. The results look promising and easily reproducible, judging from my own trial.\n\n**Modeling Enhancer-Promoter Interactions with Attention-Based Neural Networks** [[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F14\u002F219667)][[code](https:\u002F\u002Fgithub.com\u002Fwgmao\u002FEPIANN)]\n\nThe concept of attention in (recurrent) neural networks has become quite popular recently, not least because it has been used to great effect in machine translation models. 
This paper proposes an attention-based model for getting at the interactions between enhancer sequences and promoter sequences.\n\n**Predicting Transcription Factor Binding Sites with Convolutional Kernel Networks** [[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F10\u002F217257)][[code](https:\u002F\u002Fgitlab.inria.fr\u002Fdchen\u002FCKN-seq)]\n\nThis paper uses a hybrid of CNNs (to learn good representations) and kernel methods (to learn good prediction functions) to predict transcription factor binding sites.\n\n**Predicting DNA accessibility in the pan-cancer tumor genome using RNA-seq, WGS, and deep learning** [[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F12\u002F05\u002F229385)]\n\nLike Basset (above) this paper shows how to predict DNA accessibility from sequence using CNNs, but it adds the possibility to leverage RNA sequencing data from different cell types as input. In this way implicit information related to cell type can be \"transferred\" to the accessibility prediction task.\n\n**Deep learning at base-resolution reveals motif syntax of the cis-regulatory code** [[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F08\u002F21\u002F737981.full.pdf)]\n\nHere, a CNN with dilated convolutions is used to learn how different transcription factor binding motifs cooperate. This is the \"motif syntax\" mentioned in the title. 
The neural network is trained to predict the signal from a basepair-resolution ChIP assay (ChIP-nexus) and the trained network is then used to infer rules of motif cooperativity.\n\n**DNA language models are powerful zero-shot predictors of genome-wide variant effects** [[github](https:\u002F\u002Fgithub.com\u002Fsonglab-cal\u002Fgpn)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.08.22.504706v2)]\n\nA BERT-like DNA language model, complemented with dilated convolutions that process the raw one-hot encoded inputs while keeping the sequence length fixed, is trained on several different plant genomes and shown to be able to predict variant effects in Arabidopsis thaliana. The authors claim that the model, called Genomic Pretrained Network (GPN), outperforms predictors based on popular conservation scores such as phyloP and phastCons.\n\n### Non-coding RNA \u003Ca name='genomics_non-coding'>\u003C\u002Fa>\n\n**DeepLNC, a long non-coding RNA prediction tool using deep neural network** [[paper](http:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007%2Fs13721-016-0129-2)] [[web server](http:\u002F\u002Fbioserver.iiita.ac.in\u002Fdeeplnc\u002F)]\n\nIdentification of potential long non-coding RNA molecules from DNA sequence, based on k-mer profiles.\n\n**A Deep Recurrent Neural Network Discovers Complex Biological Rules to Decipher RNA Protein-Coding Potential** [[github](https:\u002F\u002Fgithub.com\u002Fhendrixlab\u002FmRNN)][[paper](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F13\u002F200758.1)] \n\nFrom the abstract: *While traditional, feature-based methods for RNA classification are limited by current scientific knowledge, deep learning methods can independently discover complex biological rules in the data de novo. We trained a gated recurrent neural network (RNN) on human messenger RNA (mRNA) and long noncoding RNA (lncRNA) sequences. 
Our model, mRNA RNN (mRNN), surpasses state-of-the-art methods at predicting protein-coding potential.*\n\n### Methylation \u003Ca name='genomics_methylation'>\u003C\u002Fa>\n\n**DeepCpG - Predicting DNA methylation in single cells**\n[[paper](http:\u002F\u002Fdx.doi.org\u002F10.1186\u002Fs13059-017-1189-z)]\n[[code](https:\u002F\u002Fgithub.com\u002Fcangermueller\u002Fdeepcpg)]\n[[docs](http:\u002F\u002Fdeepcpg.readthedocs.io\u002Fen\u002Flatest\u002F)]\n\nDeepCpG is a deep neural network for predicting DNA methylation in multiple cells. DeepCpG has a modular architecture, consisting of a recurrent CpG module to account for correlations between CpG sites within and across cells, a convolutional DNA module to extract patterns from a wide DNA sequence window, and a Joint module that integrates the evidence from the CpG and DNA module to predict the methylation state of multiple cells for a target CpG site. DeepCpG yields accurate predictions, enables discovering DNA sequence motifs that are associated with DNA methylation states and cell-to-cell variability, and can be used for analyzing the effect of single-nucleotide mutations on DNA methylation. DeepCpG is implemented in Python and publicly available.\n\n**Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks** [[paper](http:\u002F\u002Fwww.nature.com\u002Farticles\u002Fsrep19598)][[web server](http:\u002F\u002Fdna.cs.usm.edu\u002Fdeepmethyl\u002F)]\n\nThis implementation uses a stacked autoencoder with a supervised layer on top of it to predict whether a certain type of genomic region called “CpG islands” (stretches with an overrepresentation of a sequence pattern where a C nucleotide is followed by a G) is methylated (a chemical modification to DNA that can modify its function, for instance methylation in the vicinity of a gene is often but not always related to the down-regulation or silencing of that gene.) 
This paper uses a network structure where the hidden layers in the autoencoder part have a much larger number of nodes than the input layer, so it would have been nice to read the authors’ thoughts on what the hidden layers represent.\n\n### Single-cell applications \u003Ca name='genomics_single-cell'>\u003C\u002Fa>\n\n**DeepCpG - Predicting DNA methylation in single cells**\n[[paper](http:\u002F\u002Fdx.doi.org\u002F10.1186\u002Fs13059-017-1189-z)]\n[[code](https:\u002F\u002Fgithub.com\u002Fcangermueller\u002Fdeepcpg)]\n[[docs](http:\u002F\u002Fdeepcpg.readthedocs.io\u002Fen\u002Flatest\u002F)]\n\nSee above.\n\n**CellCnn – Representation Learning for detection of disease-associated cell subsets**\n[[code](https:\u002F\u002Fgithub.com\u002Feiriniar\u002FCellCnn)][[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F03\u002F31\u002F046508)]\n\nThis is a convolutional network (Lasagne\u002FTheano) based approach for “Representation Learning for detection of phenotype-associated cell subsets.” It is interesting because most neural network approaches for high-dimensional molecular measurements (such as those in the gene expression category above) have used autoencoders rather than convolutional nets.\n\n**DeepCyTOF: Automated Cell Classification of Mass Cytometry Data by Deep Learning and Domain Adaptation**[[paper](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2016\u002F05\u002F31\u002F054411.full.pdf)]\n\nDescribes autoencoder approaches (stacked AE and multi-AE) to gating (assigning cells into discrete groups) with mass cytometry (CyTOF).\n\n**Using Neural Networks To Improve Single-Cell RNA-Seq Data Analysis**[[preprint](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F04\u002F23\u002F129759)]\n\nTests a variety of neural network architectures for obtaining a reduced representation of single-cell gene expression data. 
Introduces a database of tens of thousands of single-cell profiles which can be queried to infer a cell type or state based on this reduced representation.\n\n**Removal of batch effects using distribution-matching residual networks**[[code](https:\u002F\u002Fgithub.com\u002Fushaham\u002FBatchEffectRemoval)][[paper](https:\u002F\u002Facademic.oup.com\u002Fbioinformatics\u002Farticle-abstract\u002Fdoi\u002F10.1093\u002Fbioinformatics\u002Fbtx196\u002F3611270\u002FRemoval-of-Batch-Effects-using-Distribution)]\n\nMost high-throughput assays in genomics, proteomics etc. are affected to some extent by systematic technical errors, so-called \"batch effects\". This paper uses a residual neural network to attenuate batch effects by trying to match the distributions of replicate experiments on e.g. single-cell RNA sequencing or mass cytometry. \n\n**Active deep learning reduces annotation burden in automatic cell segmentation** [[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F01\u002F211060)]\n\nActive learning, a framework addressing how to select training examples in order to train a model most efficiently, is shown to significantly reduce the time required by experts to annotate cell segmentation images in high-throughput, high-content microscopy. Training deep learning models on this type of application of course requires a lot of high-quality labeled data, but the time of the human experts that can provide the labels (perform annotation) is limited and expensive. \n\n**scVAE: Variational auto-encoders for single-cell gene expression data** [[code](https:\u002F\u002Fgithub.com\u002Fscvae\u002Fscvae)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F318295v2)]\n\nThis approach models single-cell gene expression data directly from counts without initial normalization, and performs clustering in the latent space. 
Since it is based on a variational autoencoder, it can also be used to generate synthetic single-cell data by sampling from the latent distribution.\n\n**CellBender remove-background: a deep generative model for unsupervised removal of background noise from scRNA-seq datasets** [[code](https:\u002F\u002Fgithub.com\u002Fbroadinstitute\u002FCellBender)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F03\u002F791699.full.pdf)]\n\nThe authors present a generative model for removing statistical background noise in single-cell RNA-seq datasets.\n\n**scVAE: Single-cell variational auto-encoders** [[code](https:\u002F\u002Fgithub.com\u002Fscvae\u002Fscvae)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F318295v4)]\n\nscVAE is a command-line tool for modelling single-cell transcript counts using variational auto-encoders. Using variational autoencoders it is possible both to model the data in a more compact way and to generate realistic synthetic data based on the distribution that the real data come from.\n\n**Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks** [[code](https:\u002F\u002Fgithub.com\u002Fimsb-uke\u002FscGAN)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F390153v2)]\n\nFrom the abstract: \"A fundamental problem in biomedical research is the low number of observations available, mostly due to a lack of available biosamples, prohibitive costs, or ethical reasons. Augmenting few real observations with generated in silico samples could lead to more robust analysis results and a higher reproducibility rate. Here we propose the use of conditional single cell Generative Adversarial Neural Networks (cscGANs) for the realistic generation of single cell RNA-seq data. 
cscGANs learn non-linear gene-gene dependencies from complex, multi cell type samples and use this information to generate realistic cells of defined types.\"\n\n**Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data** [[code](https:\u002F\u002Fgithub.com\u002Fepigen\u002FKPNN)][[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F07\u002F794503.full.pdf)]\n\nFrom the abstract: \"Deep learning has emerged as a powerful methodology for predicting a variety of complex biological phenomena. However, its utility for biological discovery has so far been limited, given that generic deep neural networks provide little insight into the biological mechanisms that underlie a successful prediction. Here we demonstrate deep learning on biological networks, where every node has a molecular equivalent (such as a protein or gene) and every edge has a mechanistic interpretation (e.g., a regulatory interaction along a signaling pathway). With knowledge-primed neural networks (KPNNs), we exploit the ability of deep learning algorithms to assign meaningful weights to multi-layered networks for interpretable deep learning.\"\n\n**FlashDeconv: Atlas-scale spatial transcriptomics deconvolution** [[code](https:\u002F\u002Fgithub.com\u002Fcafferychen777\u002Fflashdeconv)][[paper](https:\u002F\u002Fdoi.org\u002F10.64898\u002F2025.12.22.696108)]\n\nA high-performance method for cell type deconvolution in spatial transcriptomics using structure-preserving randomized sketching and graph neural regularization. Processes 1 million spots in ~3 minutes with linear O(N) scaling. Employs leverage-score importance sampling to preserve rare cell type signals, and uses sparse graph Laplacian regularization for spatially coherent predictions. 
Optimized for subcellular-resolution platforms (Visium HD, Stereo-seq, Xenium).\n\n## Chemoinformatics and drug discovery \u003Ca name=\"chemo\">\u003C\u002Fa>\n\n**Learning substructure invariance for out-of-distribution molecular representations** [[github](https:\u002F\u002Fgithub.com\u002Fyangnianzu0515\u002FMoleOOD)][[paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=2nWUNTnFijm)] \n\nA general molecular representation learning framework called MoleOOD that can incorporate any existing molecular representation learning method as a backbone to improve its generalization against distribution shifts. MoleOOD introduces a new learning scheme together with an equivalent practical instantiation, and develops an environment inference model that identifies each molecule’s corresponding environment without requiring manual specification of environments.\n\n**Neural graph fingerprints** [[github](https:\u002F\u002Fgithub.com\u002FHIPS\u002Fneural-fingerprint)]\n\nA convolutional net that can learn features which are useful for predicting properties of novel molecules; “molecular fingerprints”. The net works on a graph where atoms are nodes and bonds are edges. Developed by the group of Ryan Adams, who used to co-host the very good [Talking Machines](http:\u002F\u002Fwww.thetalkingmachines.com\u002F) podcast.\n\n**Automatic chemical design using a data-driven continuous representation of molecules** [[github](https:\u002F\u002Fgithub.com\u002Faspuru-guzik-group\u002Fchemical_vae)][[preprint](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.02415)]\n\nAbstract starts: \"We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. 
This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds.\"\n\n**Objective-Reinforced Generative Adversarial Networks (ORGAN)** [[github](https:\u002F\u002Fgithub.com\u002Fgablg1\u002FORGAN)][[preprint](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.10843)]\n\nORGAN combines generative models with reinforcement learning to steer the generative process towards a desired target. It is a generic method for discrete data, exemplified here by a drug discovery use case.\n\n**Extraction of organic chemistry grammar from unsupervised learning of chemical reactions** [[github](https:\u002F\u002Fgithub.com\u002Frxn4chemistry\u002Frxnmapper)][[paper](https:\u002F\u002Fadvances.sciencemag.org\u002Fcontent\u002F7\u002F15\u002Feabe4166)]\n\n\nThis package does atom mapping for chemistry using transformer networks. From the abstract: *During the last few hundred years, chemists compiled the language of chemical synthesis inferring a series of “reaction rules” from knowing how atoms rearrange during a chemical transformation, a process called atom-mapping. Atom-mapping is a laborious experimental task and, when tackled with computational methods, requires continuous annotation of chemical reactions and the extension of logically consistent directives. Here, we demonstrate that Transformer Neural Networks learn atom-mapping information between products and reactants without supervision or human labeling. Using the Transformer attention weights, we build a chemically agnostic, attention-guided reaction mapper and extract coherent chemical grammar from unannotated sets of reactions.*\n\n**Molecular De-Novo Design through Deep Reinforcement Learning** [[github](https:\u002F\u002Fgithub.com\u002FMarcusOlivecrona\u002FREINVENT)][[preprint](https:\u002F\u002Farxiv.org\u002Fabs\u002F1704.07555)]\n\nPyTorch sequence generation model that uses reinforcement learning. 
A nice widget showing training progress and the molecules generated during training is displayed on the GitHub page. Abstract starts: \"This work introduces a method to tune a sequence-based generative model for molecular de novo design that through augmented episodic likelihood can learn to generate structures with certain specified desirable properties. We demonstrate how this model can execute a range of tasks such as generating analogues to a query structure and generating compounds predicted to be active against a biological target.\"\n\n**One-shot learning models for drug discovery and DeepChem** [[github](https:\u002F\u002Fgithub.com\u002Fdeepchem\u002Fdeepchem)][[Python library](http:\u002F\u002Fdeepchem.io\u002F)][[paper](http:\u002F\u002Fpubs.acs.org\u002Fdoi\u002Fabs\u002F10.1021\u002Facscentsci.6b00367)]\n\nDeepChem is a \"... [P]ython library that aims to make the use of machine-learning in drug discovery straightforward and convenient\" which checks a lot of boxes when it comes to advanced deep learning: one-shot learning, graph convolutional networks, learning from less data, and LSTM embeddings. According to the GitHub site, \"DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, and quantum chemistry.\"\n\n**The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology** [[github](https:\u002F\u002Fgithub.com\u002Fspoilt333\u002Fonco-aae)][[paper](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002Farticles\u002FPMC5355231\u002F)]\n\nExplores the use of generative adversarial networks (GAN) in generating new molecular leads for drug candidates. 
In analogy to generating images or video that \"look like\" they come from some specified distribution, perhaps with some conditioning like \"show me a cat picture\", the authors reason that novel drug-like molecular structures can be generated with cues about what kind of drug one wants. Here they explore a specific type of generative network, an adversarial autoencoder (AAE), and adapt it into what they call an \"artificially-intelligent drug discovery engine.\"\n\n**Deep learning enables rapid identification of potent DDR1 kinase inhibitors** [[github](https:\u002F\u002Fgithub.com\u002Finsilicomedicine\u002Fgentrl)][[paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41587-019-0224-x)] In this paper from InSilico Medicine, which came out to some fanfare in 2019, an approach called GENTRL (Generative Tensorial Reinforcement Learning) was used for rapid discovery of small-molecule inhibitors against an interesting target. Using this method, the authors were able to come up with a candidate molecule in just 21 days. The model uses an initial generative step with a variational autoencoder and a reinforcement learning procedure for exploring the chemical space. They use an interesting loss function based on Kohonen self-organizing maps. Tensor decomposition was used to encode the relationship between chemical structures and properties. \n\n**Deep Genomics Nominates Industry’s First AI-Discovered Therapeutic Candidate** [[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F09\u002F17\u002F693572.full.pdf)]\n\nIn September 2019, Deep Genomics announced that its deep learning-based platform had identified a therapeutic target and a corresponding drug candidate. The details of the disease-causing mechanism targeted by the proposed candidate molecule are in the preprint link above. 
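Several of the entries above (ORGAN, REINVENT, GENTRL) share the same basic loop: sample candidate sequences from a generative model, score them with a reward that encodes the desired property, and push the model's parameters towards higher-reward samples. Below is a minimal, self-contained sketch of that loop, using a toy i.i.d. softmax policy over an invented token alphabet and an invented reward function; none of this is any of the papers' actual models or objectives, it only illustrates the REINFORCE-style update they build on.

```python
import numpy as np

# Hypothetical SMILES-like token alphabet and reward, for illustration only.
ALPHABET = list("CNOc1=()")
RNG = np.random.default_rng(0)

def sample_batch(logits, length=12, batch=64):
    """Sample token-index sequences from an i.i.d. softmax policy."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    seqs = RNG.choice(len(ALPHABET), size=(batch, length), p=p)
    return seqs, p

def reward(seqs):
    """Stand-in scoring function: fraction of aromatic-carbon 'c' tokens."""
    return (seqs == ALPHABET.index("c")).mean(axis=1)

def train(steps=300, lr=0.5, length=12):
    """REINFORCE on the policy logits, with a mean-reward baseline."""
    logits = np.zeros(len(ALPHABET))
    for _ in range(steps):
        seqs, p = sample_batch(logits, length)
        adv = reward(seqs) - reward(seqs).mean()      # baseline-subtracted reward
        counts = np.stack(
            [(seqs == k).sum(axis=1) for k in range(len(ALPHABET))], axis=1
        )
        # Gradient of a sequence's log-probability w.r.t. the logits
        # under a softmax policy: counts - length * p.
        logits += lr * (adv[:, None] * (counts - length * p)).mean(axis=0)
    return logits

logits = train()
final_reward = reward(sample_batch(logits)[0]).mean()
```

In the real methods the policy is an RNN or VAE-based generator over SMILES strings and the reward comes from a property predictor or a discriminator, but the generate-score-reinforce shape of the update is the same.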
\n\n\n## Biomarker discovery \u003Ca name=\"biomarker\">\u003C\u002Fa>\n\n**Deep biomarkers of human aging** [[online predictor](http:\u002F\u002Fwww.aging.ai\u002F)][[paper](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpubmed\u002F27191382)]\n\nFrom the abstract: \"One of the major impediments in human aging research is the absence of a comprehensive and actionable set of biomarkers that may be targeted and measured to track the effectiveness of therapeutic interventions. In this study, we designed a modular ensemble of 21 deep neural networks (DNNs) of varying depth, structure and optimization to predict human chronological age using a basic blood test. \"\n\n\n## Metabolomics \u003Ca name=\"metabolomics\">\u003C\u002Fa>\n\n**Deep Learning Accurately Predicts Estrogen Receptor Status in Breast Cancer Metabolomics Data** [[code](http:\u002F\u002Fpubs.acs.org\u002Fdoi\u002Fsuppl\u002F10.1021\u002Facs.jproteome.7b00595\u002Fsuppl_file\u002Fpr7b00595_si_001.pdf)][[paper](http:\u002F\u002Fpubs.acs.org\u002Fdoi\u002Ffull\u002F10.1021\u002Facs.jproteome.7b00595)]\n\nClassification algorithms for metabolomics data with respect to estrogen receptor status are compared, and the best performing algorithm is an autoencoder-based feedforward network with parameters tuned using H2O's R interface.\n\n## Generative models \u003Ca name='generative'>\u003C\u002Fa>\n\nIn many cases, it can be useful to generate synthetic data that resembles real data in order to boost dataset sizes or avoid violating patient privacy. 
Here, some of these approaches are listed.\n\n**Privacy-preserving generative deep neural networks support clinical data sharing** [[Github](https:\u002F\u002Fgithub.com\u002Fgreenelab\u002FSPRINT_gan)][[bioRxiv preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F15\u002F159756)]\n\nThis describes a clever idea where generative adversarial networks (GANs) are used to synthesize data that closely resembles actual data measured on study participants, but which cannot be traced back to a specific subject. The latter aspect, called differential privacy, is incorporated into the method by design and gives strong guarantees on the likelihood that a subject could be identified as a member of a trial.\n\n**Creating artificial human genomes using generative models** [[preprint](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F07\u002F769091.full.pdf)]\n\nThe authors compare Restricted Boltzmann Machines (RBM) and Generative Adversarial Networks (GAN) as tools for creating synthetic human genomes.\n\n## Population genetics \u003Ca name='genomics_pop'>\u003C\u002Fa>\n\n**Deep learning for population genetic inference** [[code](https:\u002F\u002Fsourceforge.net\u002Fprojects\u002Fevonet\u002F)][[paper](http:\u002F\u002Fjournals.plos.org\u002Fploscompbiol\u002Farticle?id=10.1371\u002Fjournal.pcbi.1004845)]\n\n**Diet networks: thin parameters for fat genomics** [[manuscript](http:\u002F\u002Fopenreview.net\u002Fpdf?id=Sk-oDY9ge)]\n\nThis weirdly-named paper addresses the frequently encountered problem in genomics where the number of features is much larger than the number of training examples. Here, it is addressed in the context of SNPs (single-nucleotide polymorphisms, genetic variations between individuals). 
The authors propose a new network parametrization that reduces the number of free parameters using a multi-task architecture which tries to learn a useful embedding of the input features.\n\n**dna-claude-analysis** [[github](https:\u002F\u002Fgithub.com\u002Fshmlkv\u002Fdna-claude-analysis)]\n\nA personal genome analysis toolkit that uses LLM-powered variant interpretation to analyze raw DNA data (e.g. from 23andMe or AncestryDNA) across 17 categories including health risks, ancestry, pharmacogenomics, nutrition, sports\u002Ffitness, longevity, and more. Python scripts parse SNP data and generate structured reports by querying a language model for interpretation of each variant, producing a single-page HTML visualization of the findings.\n\n## Systems biology\u003Ca name='sysbio'>\u003C\u002Fa>\n\n**Using deep learning to model the hierarchical structure and function of a cell** [[web server](http:\u002F\u002Fd-cell.ucsd.edu)][[paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnmeth.4627\u002F)]\n\nIn this ambitious paper, the authors attempt to construct an interpretable neural network model (VNN; visible neural network) of a eukaryotic cell based on millions of genotype-phenotype associations. The network is built in a hierarchy with 12 levels, where each level is supposed to reflect a biologically meaningful level of organization. 
The resulting model can predict, for a given genetic perturbation, what the resulting phenotype is likely to be.\n\n\n","# 深度学习-生物学\n\n这是一个将深度学习方法应用于生物学领域的实现列表，最初发表在 [Follow the Data](https:\u002F\u002Ffollowthedata.wordpress.com\u002F) 上。该列表偏向于基因组学，因为这是我最密切关注的子领域。\n\n欢迎大家为这个不断增长的列表贡献力量，尤其是在我尚未充分覆盖的类别中！\n\n你也可以参考 [awesome deepbio](https:\u002F\u002Fgithub.com\u002Fgokceneraslan\u002Fawesome-deepbio) 列表。\n\n## 目录\n  - [综述](#reviews)\n  - [模型仓库与资源](#repositories)\n  - [序列建模](#seqmodels)\n  - [多组学整合](#integration)\n  - [蛋白质生物学](#protein_biology)\n    - [结构预测](#protein_biology_structure_prediction)\n    - [蛋白质设计](#protein_biology_design)\n    - [功能预测](#protein_biology_function_prediction)\n  - [基因组学](#genomics)\n    - [变异检测](#genomics_variant-calling)\n    - [基因表达](#genomics_expression)\n    - [成像与基因表达](#imaging_expression)\n    - [增强子及调控区域预测](#genomics_enhancers)\n    - [非编码RNA](#genomics_non-coding)\n    - [甲基化](#genomics_methylation)\n    - [单细胞应用](#genomics_single-cell)\n  - [化学信息学与药物发现](#chemo)\n  - [生物标志物发现](#biomarker)\n  - [代谢组学](#metabolomics)\n  - [生成模型](#generative)\n  - [群体遗传学](#genomics_pop)\n  - [系统生物学](#sysbio)\n\n## 评论 \u003Ca name=\"reviews\">\u003C\u002Fa>\n\n这些并非具体的实现方案，但提供了很有价值的参考。由于该领域的综述论文时效性较强，我特别标注了期刊发表的月份。需要注意的是，在某些情况下，原始预印本可能早已在线发布，远早于正式出版版本。\n\n**(2021-11) 多药联合用药副作用、联合协同效应及药物相互作用预测的关联深度学习统一视角** [[开放获取论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.02916v1.pdf)]\n\n近年来，针对多药联合用药副作用识别、药物相互作用预测以及联合治疗方案设计等任务，涌现出了大量机器学习模型。本文提出了一种能够解决这些任务的关联机器学习模型的统一理论框架。我们给出了基本定义，比较了现有模型架构，并讨论了性能指标、数据集和评估协议。此外，我们还强调了该领域中可能产生重大影响的应用场景以及未来重要的研究方向。\n\n**(2019-12) 药物基因组学资源的深度学习：迈向精准肿瘤学** [[Briefings in Bioinformatics](https:\u002F\u002Facademic.oup.com\u002Fbib\u002Fadvance-article\u002Fdoi\u002F10.1093\u002Fbib\u002Fbbz144\u002F5669856#186956080)]\n\n**(2019-04) 阿尔法狗来了：基因组学的新计算建模技术** [[Nature Reviews Genetics 
论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41576-019-0122-6)]\n\n这是一篇非常优秀的概念性综述，阐述了深度学习在基因组学中的应用方式。文中详细解释了卷积网络、循环网络、图卷积网络、自编码器和生成对抗网络的工作原理，并介绍了多模态学习、迁移学习以及模型可解释性等重要概念。\n\n**(2019-01) 医疗健康领域的深度学习指南** [[Nature Medicine 论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41591-018-0316-z)]\n\n摘要指出：“在此，我们介绍用于医疗健康的深度学习技术，重点围绕计算机视觉、自然语言处理、强化学习以及通用方法展开讨论。我们描述了这些计算技术如何影响医学的几个关键领域，并探讨了端到端系统的构建方法。其中，计算机视觉部分主要聚焦于医学影像；自然语言处理则应用于电子健康记录数据等领域。同样地，强化学习被置于机器人辅助手术的背景下进行讨论，而面向基因组学的通用深度学习方法也得到了回顾。”\n\n**(2018-11) 基因组学深度学习入门** [[Nature Genetics 论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41588-018-0295-5)][[Colaboratory 教程笔记本](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr)]\n\n这篇由本人参与撰写的综述被称为“入门”，旨在帮助基因组学研究人员快速上手深度学习。为此，我们着重讨论了许多实际问题，如工具选择（不仅包括深度学习框架，还有 GPU 云平台、模型库和在线课程）、如何明确深度学习问题、模型可解释性以及故障排除等。此外，我们还在 Colaboratory 上提供了一个教程，演示如何搭建并运行一个用于学习结合基序的简单卷积网络模型，以及在模型训练完成后如何检查其预测结果。\n\n**(2018-10) 生物医学中的深度学习** [[Nature Biotechnology 论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnbt.4233)]\n\n摘要写道：“凭借整合海量数据、学习任意复杂关系以及融合已有知识的能力，深度学习正逐渐对生物学研究和生物医学应用产生深远影响。目前，深度学习模型已能在不同程度上成功预测遗传变异如何改变参与疾病发生发展的细胞过程、哪些小分子能够调节治疗相关蛋白的活性，以及放射影像是否提示疾病存在。然而，深度学习的高度灵活性也为部署系统的性能保障以及与利益相关方、临床医生和监管机构建立信任带来了新的挑战——后者需要决策背后的合理依据。我们认为，正是这种灵活性本身将有助于克服这些挑战；例如，通过训练深度模型使其能够输出预测结果的理由。要充分发挥深度学习在生物医学领域的潜力，仍需开展大量相关研究。”\n\n**(2018-04) 生物学与医学中深度学习的机遇与挑战** [[bioRxiv 预印本](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F05\u002F28\u002F142760)][[J Roy Soc interface 论文](https:\u002F\u002Froyalsocietypublishing.org\u002Fdoi\u002F10.1098\u002Frsif.2017.0387)]\n\n这篇令人印象深刻的协作式综述完全在 [Github](https:\u002F\u002Fgithub.com\u002Fgreenelab\u002Fdeep-review) 上公开撰写完成。它重点探讨了深度学习在未来如何变革患者分类与治疗，以及基础生物学研究，并分析了可能阻碍这一进程的主要障碍。文中提出了许多引人深思的观点。结合下文列出的更具技术性的综述，读者可以全面了解深度学习在生物学和医学领域的应用现状及潜在前景。\n\n**(2017-01) 健康信息学中的深度学习** 
[[开放获取论文](http:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?tp=&arnumber=7801947)]\n\n概述了几种类型的深度神经网络及其在转化生物信息学、医学影像、“普适感知”、医疗数据和公共卫生等领域的应用。\n\n**(2016-07) 计算生物学中的深度学习** [[开放获取论文](http:\u002F\u002Fmsb.embopress.org\u002Fcontent\u002F12\u002F7\u002F878)]\n\n这是一篇关于深度学习在生物学中应用的优秀综述。文章主要讨论卷积网络，并深入阐释了为何以及如何将其用于序列（和图像）分类。\n\n## 模型仓库与资源 \u003Ca name=\"repositories\">\u003C\u002Fa>\n\n**Kipoi 仓库加速了基因组学预测模型的社区交流与复用** [[Github](https:\u002F\u002Fgithub.com\u002Fkipoi\u002Fkipoiseq\u002F)][[网站](https:\u002F\u002Fkipoi.org\u002F)][[论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41587-019-0140-0)] \n\nKipoi 是一个面向基因组学的模型库，可通过简单的 `pip install` 安装，并为数百个基因组学预测模型提供一致的接口。Kipoi 实现了一套标准的数据加载器，用于深度学习中序列模型的训练和预测。\n\n**DragoNN** [[Github](https:\u002F\u002Fgithub.com\u002Fkundajelab\u002Fdragonn)][[网站](https:\u002F\u002Fkundajelab.github.io\u002Fdragonn\u002F)]\n\nDragoNN 提供了一个工具包，用于利用神经网络对调控序列进行建模研究。它包含解释序列模型的工具以及基于 Jupyter Notebook 的在线教程，可用于教学交互式模型操作与可视化。\n\n\n## 序列建模 \u003Ca name=\"seqmodels\">\u003C\u002Fa>\n\n这是一系列主要受自然语言处理启发的模型，用于建模生物序列，如蛋白质或基因。随着生物学中的语言模型逐渐主流化，这些模型或许应被移至其他章节。\n\n**用于深度基因组学和深度蛋白质组学的生物序列连续分布式表示**[[github](https:\u002F\u002Fgithub.com\u002Fehsanasgari\u002FDeep-Proteomics)][[论文](http:\u002F\u002Fjournals.plos.org\u002Fplosone\u002Farticle?id=10.1371\u002Fjournal.pone.0141287)]\n\nGitHub 项目简介写道：“我们提出了一种新的生物序列表示方法。统称为 bio-vectors（BioVec），其中 protein-vectors（ProtVec）用于蛋白质（氨基酸序列），gene-vectors（GeneVec）则用于基因序列。这种表示方法可广泛应用于蛋白质组学和基因组学领域的深度学习任务。Biovectors 基本上是针对生物序列（DNA、RNA 和蛋白质）的 n-gram 字符跳字词向量。在本工作中，我们探讨了这一表示空间的生物物理和生化意义，并在多种生物信息学任务中展示了这种序列表示的强大之处。”\n\n**pysster：使用卷积神经网络从 DNA 和 RNA 序列中学习序列和结构基序**[[github](https:\u002F\u002Fgithub.com\u002Fbudach\u002Fpysster)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F12\u002F06\u002F230086)]\n\n这是一个基于 TensorFlow 的库，用于通过卷积神经网络从 DNA\u002FRNA 序列数据中学习基序。该工具箱据称开箱即用即可在 GPU 上运行，并且还支持超参数优化以及可视化不同网络层所学习的内容。\n\n**基于序列的深度表示学习实现统一理性蛋白质工程** 
[[github](https:\u002F\u002Fgithub.com\u002Fchurchlab\u002FUniRep)][[论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41592-019-0598-1)]\n\n作者提出了 UniRep，一种早期的基于 mLSTM（乘法 LSTM）的蛋白质序列语言模型。该模型在 UniRef50 数据集上的 2400 万条蛋白质序列上进行训练，可以将蛋白质序列转换为包含蛋白质属性信息的数值向量表示。例如，这些表示可用于训练下游的蛋白质稳定性与功能预测器。此外，UniRep 还可用作“闲聊者”或生成模型来设计新蛋白质。\n\n\n**自然语言预测病毒逃逸机制** [[github](https:\u002F\u002Fgithub.com\u002Fbrianhie\u002Fviral-mutation)][[论文](https:\u002F\u002Fscience.sciencemag.org\u002Fcontent\u002F371\u002F6526\u002F248.17.full)]\n\n这篇论文尝试使用基于 BiLSTM 的网络构建氨基酸语言模型，以模拟病毒如何逃避免疫系统的检测（即“病毒逃逸”）。作者认为，能够逃避免疫系统识别的序列应当具有较高的“存活能力”，这类似于句子的语法正确性；同时，它们还应具备不同的“语义”，即从抗原性的角度来看要有所不同。语法正确性作为一项预测任务在最后一层学习得到，而语义则从倒数第二层的表示中提取出来。\n\n\n**Genomic-ULMFiT：适用于基因组序列数据的 ULMFiT 模型** [[github](https:\u002F\u002Fgithub.com\u002Fkheyer\u002FGenomic-ULMFiT)]\n\n该仓库是 FastAI 的 ULMFiT 语言迁移学习模型在基因组学领域的实现。ULMFiT 基于 AWD-LSTM 模型，已被证明在解决各类文本分类任务时非常有效。在此，仓库作者扩展了 FastAI 的类，增加了专门用于 DNA 序列数据的子类。ULMFiT 的核心思想是：(1) 通过无监督方式从大量文本中学习语言模型——即无需标签，让模型预测下一个词（或标记）；(2) 将步骤 (1) 中学到的语言模型进一步微调到较小的标注数据集上，用于具体的分类任务，但在此步骤中仍采用无标签的方式进行训练，并尝试预测下一个词；(3) 最后在最终的分类任务上使用标签进行微调。在基因组学领域，步骤 (1) 中的大量文本可以是整个人类基因组，或其他来自 GenBank 或 Sequence Read Archive 等数据库的部分数据。作者表明，这种方法在一系列分类问题上表现良好，例如大肠杆菌与人类启动子分类、宏基因组分类、增强子分类以及 mRNA\u002FlincRNA 分类。\n\n**通过将无监督学习扩展到 2.5 亿条蛋白质序列，揭示生物结构与功能** [[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F622803v1.full)]\n\n在 Facebook（现 Meta）人工智能团队的这项工作中，他们使用 BERT 语言模型，在 2.5 亿条序列共 860 亿个氨基酸的基础上训练了一个名为 ESM-1 的语言模型。与上述 ULMFiT 类似，其核心思想也是利用迁移学习：先在海量数据上进行预训练，使模型掌握 DNA 或蛋白质语言的基本逻辑，从而为进一步针对特定任务的微调奠定基础。\n\n**ProGen2：探索蛋白质语言模型的边界** [[github](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002Fprogen)][[预印本](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13517.pdf)]\n\n令人惊讶的是，Salesforce 也长期参与蛋白质语言模型的研究。在这篇预印本中，他们介绍了一系列可用于序列适应度预测和序列生成等任务的蛋白质语言模型。\n\n**MSA Transformer** 
[[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.02.12.430858v3)]\n\n在这里，来自Meta的同一团队——他们之前推出了ESM-1模型（见上文）——展示了另一种类型的Transformer，它以多序列比对（MSA）作为输入，而非蛋白质序列，在参数量更少的情况下，仍能取得比BERT风格的Transformer更好的效果。他们引入了不同形式的行注意力和列注意力机制，以尽可能从MSA中提取更多信息。该模型的GitHub仓库中包含一个已训练好的版本，即ESM-MSA-1b。\n\n\n**ProtTrans：通过自监督深度学习与高性能计算破解生命密码的语言** [[github](https:\u002F\u002Fgithub.com\u002Fagemagician\u002FProtTrans)][[huggingface](https:\u002F\u002Fhuggingface.co\u002FRostlab\u002Fprot_bert_bfd)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2020.07.12.199554v2)]\n\n这是一项大规模的工作，旨在对蛋白质序列训练并评估Transformer模型，该项目甚至将其多个模型公开发布在HuggingFace模型库上。摘要开头写道：“计算生物学和生物信息学从蛋白质序列中挖掘出海量数据宝藏，这些数据非常适合借鉴自然语言处理（NLP）中的语言模型（LM）。借助这些语言模型，我们能够在较低的推理成本下突破新的预测边界。在此，我们基于UniRef和BFD的数据训练了两个自回归语言模型（Transformer-XL、XLNet）以及两个自编码器模型（Bert、Albert），所用数据包含多达3930亿个氨基酸（相当于“词”），来自21亿条蛋白质序列（分别是整个英文维基百科的22倍和112倍）。这些语言模型是在橡树岭国家实验室（ORNL）的Summit超级计算机上训练的，使用了936个节点（共计5616块GPU）以及一台TPU Pod（V3-512或V3-1024）。\"\n\n\n**通过整合长程相互作用实现高效的基因表达序列预测** [[github](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Fenformer)][[tensorflow hub](https:\u002F\u002Ftfhub.dev\u002Fdeepmind\u002Fenformer\u002F1)][[论文](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.04.07.438649v1)]\n\n\nTransformer架构能否帮助解决将基因组增强子与基因表达联系起来这一难题？实验上将远端增强子与基因关联起来非常耗时，而大量长程相互作用的存在使得仅靠相关性分析（多重检验问题）、卷积神经网络（感受野过短）或循环神经网络（难以维持足够长的记忆）从数据中学习这些关系变得十分困难。如今，DeepMind、Calico和Google的研究人员提出了“增强子Transformer”，即Enformer，它能够利用自注意力机制学习比以往更长距离的增强子—基因表达相互作用。值得称赞的是，作者不仅在GitHub上公开了代码，还在TensorFlow Hub上提供了一个预训练模型。\n\n\n**DNABERT**\n[[github](https:\u002F\u002Fgithub.com\u002Fjerryji1993\u002FDNABERT)][[论文](https:\u002F\u002Facademic.oup.com\u002Fbioinformatics\u002Farticle\u002F37\u002F15\u002F2112\u002F6128680)]\n\n一种类似BERT的掩码语言模型，以k-mer作为token，在人类DNA序列上进行训练。\n\n**GenSLMs：基因组规模的语言模型揭示SARS-CoV-2的进化动态** 
[[github](https:\u002F\u002Fgithub.com\u002Framanathanlab\u002Fgenslm)][[论文](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.10.10.511571v1)]\n\n\n与早期的Enformer类似，这也是一个针对核苷酸（DNA或RNA）的Transformer模型，但其设计和目标有所不同。Enformer是为哺乳动物基因组（人类和小鼠）预训练的模型，而GenSLM则旨在作为更简单基因组的基础模型，例如细菌和病毒。它基于1.1亿个原核生物（细菌和古菌）基因组，采用GPT风格的“预测下一个token”损失函数进行预训练。这里的token是密码子（三个核苷酸组成的三联体），因此训练后的模型可以像GPT-3那样以密码子作为提示进行生成。该基础模型还可以进一步针对特定基因组子集进行微调（“进化微调”），例如本文中针对150万个SARS-CoV-2基因组进行微调，从而得到一个专门用于SARS-CoV-2的语言模型，该模型蕴含着关于病毒进化轨迹的隐含知识，并可用于识别值得关注的变异株。此外，本文还有一个有趣的创新点：Enformer曾尝试用卷积结合自注意力来解决长程相互作用问题，而这里则采用了扩散模型（类似于Stable Diffusion）来建模这些相互作用。\n\n\n**核苷酸Transformer：构建并评估稳健的人类基因组学基础模型**\n[[github](https:\u002F\u002Fgithub.com\u002Finstadeepai\u002Fnucleotide-transformer)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.01.11.523679v1)]\n这是由Instadeep公司完成的工作，该公司是一家机器学习咨询公司，于2023年1月被开发辉瑞mRNA新冠疫苗的BioNTech收购。作者试图基于比以往更广泛的基因组数据构建核苷酸（即DNA和RNA）序列的基础模型，具体包括超过3000个人类基因组以及来自不同生物体的850个非人类基因组。这些模型旨在用于迁移学习，并声称即使在数据稀缺的情况下也能很好地应用于下游预测任务。具体而言，这些模型是类似BERT的结构，以核苷酸6-mers作为token。\n\n**GENA-LM**\n[[github](https:\u002F\u002Fgithub.com\u002FAIRI-Institute\u002FGENA_LM)]\n这是一个在人类DNA序列上训练的掩码语言模型（类似于上述DNABERT或核苷酸Transformer）。\n根据GitHub页面介绍，与DNABERT的一个区别在于，GENA-LM采用BPE分词法而非k-mer，因此能够处理约3000个核苷酸的输入序列（512个BPE token），而DNABERT的上限仅为510个核苷酸。\n\n\n\n## 多组学整合 \u003Ca name='integration'>\u003C\u002Fa>\n\n**深度学习在精准医学中用于基因组、蛋白质组和代谢组数据整合的兴起。** [[论文](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002Farticles\u002FPMC6207407\u002F)]\n\n一篇关于深度学习在多组学数据整合中潜力的综述论文。\n\n## 蛋白质生物学 \u003Ca name=\"protein_biology\">\u003C\u002Fa>\n\n本类别细分为若干子类别。\n\n### 结构预测 \u003Ca name='protein_biology_structure_prediction'>\u003C\u002Fa>\n\n**利用 AlphaFold 实现高精度蛋白质结构预测** [[github](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Falphafold)][[论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-021-03819-2)]\n\n这款模型想必无需多言。DeepMind 于 2018 年发布了其蛋白质折叠方法 AlphaFold 的首个版本，并在当时赢得了享有盛誉的 CASP 竞赛冠军。而经过全新设计的版本（有时被称为 AlphaFold2）则在 2020 
年以压倒性优势再次夺得该项赛事桂冠。新版本引入了一种名为“Evoformer”的组件，它是一种基于 Transformer 架构的模块，能够迭代处理一组对齐的蛋白质序列以及氨基酸间两两相互作用矩阵，从而生成可用于折叠模块的表征；该折叠模块采用了一种称为“不变逐点注意力”的特殊注意力机制。自最初的 AlphaFold 论文发表以来，已有大量研究通过不同方式对模型进行改进，以解决新的任务。\n\n**OpenFold** [[github](https:\u002F\u002Fgithub.com\u002Faqlaboratory\u002Fopenfold)]\n\n这是一款基于 PyTorch 的开源 AlphaFold 重实现，几乎复现了原版的所有功能。在 AlphaFold 公开其模型权重之前，OpenFold 曾是训练自有折叠模型的一种途径。\n\n**MiniFold：DeepMind AlphaFold 的重新实现** [[github](https:\u002F\u002Fgithub.com\u002FEricAlcaide\u002FMiniFold)]\n\n近年来，深度学习在生物学领域最引人注目的成就之一便是 DeepMind 的 AlphaFold 模型在 CASP13 蛋白质结构预测挑战赛中夺冠。最初该模型并未被列入本页，因为尚无公开的实现版本，但如今这一情况已得到改变。无论如何，MiniFold 是一次尝试以更为精简的方式重新实现 AlphaFold 的工作。\n\n**基于语言模型的进化尺度原子级蛋白质结构预测** [[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm#esmfold)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.07.20.500902v2)]\n\n本文介绍了 Meta 研究团队开发的 ESM-1、ESM-MSA 等模型，并指出大型语言模型同样可以用于蛋白质结构预测。当模型具备足够多的参数和充足的训练数据时，它会开始隐式地学习关于三维构象的信息。作者声称，这种仅依赖语言模型的方法（即不使用多序列比对或主链输入）相较于其他方法（如 AlphaFold）速度可提升多达 60 倍。他们利用该模型为 6 亿个宏基因组样本（环境 DNA）进行了结构预测。\n\n**基于深度生成对抗网络的蛋白质环区建模**[[论文](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8372069\u002F)][[网站](https:\u002F\u002Fzhaoyu.li\u002Floop_modeling_gan.html)]\n\n摘录自摘要：“生物学与医学长期以来一直关注蛋白质的计算结构预测与建模。蛋白质结构中常常存在缺失区域或需要重新建模的区域。预测蛋白质结构中特定缺失区域的过程被称为环区建模。在本文中，我们提出了一种基于深度学习的生成对抗网络（GAN），并借鉴图像修复的思想来实现环区建模。生成网络负责捕捉环区的上下文信息并预测缺失部分；而判别网络则使预测结果更加逼真，并为生成网络提供梯度信号。所提出的网络在环区建模的常用基准测试集上进行了评估。实验结果表明，我们的方法能够成功预测环区，并且性能优于当前最先进的工具。据我们所知，这项工作首次将 GAN 应用于生物信息学研究。”\n\n**Pcons2 – 利用蛋白质类似接触模式识别改进接触预测** [[网页界面](http:\u002F\u002Fc2.pcons.net\u002F)]\n\n在此，研究人员采用一种包含五层的“深度随机森林”模型，以提高对蛋白质中哪些残基（氨基酸）之间存在物理相互作用的预测准确性。这对于预测蛋白质的整体结构（一个极其困难的问题）非常有帮助。\n\n### 蛋白质设计 \u003Ca name='protein_biology_design'>\u003C\u002Fa>\n\n**基于数据高效深度学习的低N蛋白工程** 
[[预印本](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41592-021-01100-y)]\n\n作者基于UniRep模型（本文其他部分有介绍），提出了一种机器学习范式或工作流程，用于训练能够预测蛋白质性质并根据极少量标注样本（少至20–30个）设计新型序列变体的模型。在该范式中，首先在一个包含大量多样化蛋白质序列的大数据集上以无监督方式训练基础模型（如原始UniRep论文所述），随后使用相同的损失函数，在与目标蛋白质进化相关的更狭窄的蛋白质家族上进一步微调该模型。这一过程被称为“evotuning”或进化微调。经过这一步骤后，作者表明，仅需少量标注样本，利用由evotuned模型生成的表示进行有监督学习通常就能取得良好效果。有了有监督模型之后，便可通过计算机辅助定向进化来设计出具有所需特性的目标蛋白质新变体。\n\n\n**ProteinGAN：利用生成对抗网络扩展功能性蛋白质序列空间** [[代码](https:\u002F\u002Fgithub.com\u002Fbiomatterdesigns\u002FProteinGAN)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2019\u002F10\u002F04\u002F789719.full.pdf)]\n\n摘要中写道：“为催化任何期望的化学反应而从头设计蛋白质，一直是蛋白质工程领域的一个长期目标，因为它具有广泛的技术、科学和医学应用前景。然而，目前无论是从计算还是实验的角度来看，将蛋白质序列映射到其功能仍然难以实现。在此，我们开发了ProteinGAN，这是一种专门的生成对抗网络变体，能够‘学习’天然蛋白质序列的多样性，并生成具有功能的蛋白质序列。ProteinGAN直接从复杂的多维氨基酸序列空间中学习蛋白质序列之间的进化关系，创造出具有天然类似物理性质的高多样性新序列变体。以苹果酸脱氢酶为模板酶，我们证明，在经过实验测试的ProteinGAN生成序列中，24%具有可溶性，并且在所测试的体外条件下表现出野生型水平的催化活性，即使是在高度突变（超过100处突变）的序列中亦是如此。因此，ProteinGAN展示了人工智能在允许的序列空间生物约束范围内快速生成高度多样化的新型功能性蛋白质的潜力。”\n\n\n**基于ProteinMPNN的鲁棒深度学习蛋白质序列设计** [[代码](https:\u002F\u002Fgithub.com\u002Fdauparas\u002FProteinMPNN)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.06.03.494563v1)]\n\n这项工作提出了一种设计蛋白质序列的方法，该序列被预测能够折叠成指定构象，即与AlphaFold的方向相反：从结构到序列。其实现方式是使用一种图神经网络——消息传递神经网络（MPNN）。生成序列的多样性可以调节，作者同时采用AlphaFold预测结果和实验验证两种方式对该方法的性能进行了测试。\n\n\n**从数百万个预测结构中学习逆向折叠** [[代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm#invf)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.04.10.487779v2)]\n\n与ProteinMPNN类似，这是一种逆向折叠方法，旨在从蛋白质骨架原子坐标（即结构）推导出序列。尽管ProteinMPNN是基于实验测定的结构进行训练的，但这种方法称为ESM-IF，其训练材料则是由AlphaFold预测的1200万个结构。\n\n\n### 功能预测 \u003Ca name='protein_biology_function_prediction'>\u003C\u002Fa>\n\n\n**基于PDB结构的肿瘤抑制基因和癌基因深度学习预测模型** 
[[GitHub](https:\u002F\u002Fgithub.com\u002Ftavanaei\u002FCancer-Suppressor-Gene-Deep-Learning)][[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F10\u002F22\u002F177378)]\n\n作者利用卷积神经网络处理从蛋白质数据库（PDB）中的蛋白质三维结构中提取的特征图，以预测癌基因和肿瘤抑制基因。\n\n\n**Deep-RBPPred：基于深度学习的蛋白质组规模RNA结合蛋白预测** [[代码](http:\u002F\u002Fwww.rnabinding.com\u002FDeep_RBPPred\u002FDeep-RBPPred.html)][[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F10\u002F27\u002F210153)]\n\n利用卷积神经网络预测RNA结合蛋白。\n\n\n**EVOVAE：蛋白质序列的变分自编码**[[代码](https:\u002F\u002Fgithub.com\u002Fsamsinai\u002FVAE_protein_function)][[arXiv预印本](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.03346)]\n\n摘要中写道：“我们提出了一种基于变分自编码器的天然蛋白质序列嵌入方法，并利用它来预测突变对蛋白质功能的影响。通过这种无监督方法，我们可以对天然变异体进行聚类，并学习蛋白质内部各位置之间的相互作用。该方法的整体表现优于不考虑序列内相互作用的基线方法，在某些情况下甚至优于采用反Potts模型的最先进方法。这种生成模型可用于计算引导蛋白质序列空间的探索，并更好地支持理性及自动化的蛋白质设计。”\n\n\n**基于图卷积网络的结构功能预测** [[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F04\u002F786236.full.pdf)]\n\n摘要中写道：“我们提出了一种基于序列和结构数据训练的深度学习图卷积网络（GCN），并在蛋白质数据库（PDB）中约4万个具有已知结构和功能的蛋白质上对其进行了评估。我们的GCN在功能预测方面的准确性高于仅基于序列数据训练的卷积神经网络及其他竞争方法。通过语言模型进行特征提取，无需构建多重序列比对或进行特征工程。我们的模型能够稳健地预测与训练集序列相似度不超过30%的蛋白质的功能，从而学习到普遍适用的结构—功能关系。借助类别激活映射技术，我们可以自动识别出每个被自信预测的功能所对应的残基水平结构区域，从而推进位点特异性功能预测。”\n\n\n## 基因组学 \u003Ca name=\"genomics\">\u003C\u002Fa>\n\n该类别可分为若干子领域。\n\n### 变异检测 \u003Ca name='genomics_variant-calling'>\u003C\u002Fa>\n\n**DeepVariant** [[github](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fdeepvariant)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F12\u002F21\u002F092890)]\n\n这篇来自谷歌的预印本最初发表于2016年末，但在大约一年后代码公开并陆续有新闻稿发布时才引起广泛关注。谷歌的研究人员采用了一种反直觉但巧妙的方法来解决一个已被广泛研究的问题——从DNA测序数据中进行变异检测（即在个体的DNA中准确识别与参考基因组相比的差异，例如突变或多态性）。他们没有直接使用测序DNA片段中的核苷酸序列（以A、C、G、T四种符号表示），而是先将这些序列转换为图像，再对这些图像（代表“pile-up”或DNA序列堆叠）应用卷积神经网络。事实证明，这种方法在变异检测方面非常有效，这一点不仅得到了谷歌内部基准测试的支持，也得到了独立机构的验证。\n\n**语言模型实现零样本预测突变对蛋白质功能的影响** 
[[github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fesm#zs_variant)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2021.07.09.450648v2)]\n\n这项工作基于Meta公司的蛋白质语言模型（ESM-1等；见上文），表明这些模型可用于“零样本”预测变异对蛋白质功能的影响——也就是说，无需额外的实验数据或模型训练。只需直接使用蛋白质语言模型即可推断变异效应。\n\n### 基因表达 \u003Ca name='genomics_expression'>\u003C\u002Fa>\n\n在建模基因表达时，输入通常是数值型数据（整数或浮点数），用于估计在特定细胞类型或条件下，由DNA模板转录产生的RNA量。\n\n**scPRINT：基于5000万细胞的预训练实现稳健的基因调控网络预测** [[github](https:\u002F\u002Fgithub.com\u002Fcantinilab\u002FscPRINT)] [[bioRxiv](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2024.07.29.605556v1)]\n一种大型细胞模型，可在转录组水平上预测基因间相互作用等信息。它利用新颖的RNA-seq数据编码与解码方法，从数千万个单细胞RNA-seq谱中学习调控机制。\n\n**基于基因互作图的基因表达卷积模型** [[github](https:\u002F\u002Fgithub.com\u002Fmila-iqia\u002Fgene-graph-conv)] [[arXiv](https:\u002F\u002Fgithub.com\u002Fmila-iqia\u002Fgene-graph-conv)]\n文中讨论了如何利用基因间互作图（如同一条通路、蛋白质-蛋白质相互作用、共表达关系或科研文献中的关联）为深度神经网络模型引入类似卷积操作对图像施加的空间偏置。研究发现，在数据稀缺的情况下，这种方法对特定任务具有优势，但其效果高度依赖所用图的质量。\n\n**ADAGE——基于去噪自编码器的基因表达数据分析** [[github](https:\u002F\u002Fgithub.com\u002Fgreenelab\u002Fadage)]\n\n这是一个使用Theano实现的堆叠式去噪自编码器工具，用于从大规模基因表达数据集中提取相关模式，可视为一种特征构建方法。我个人曾多次尝试使用该工具包。作者最初发表了一篇会议论文（http:\u002F\u002Fwww.worldscientific.com\u002Fdoi\u002Fabs\u002F10.1142\u002F9789814644730_0014），将该模型应用于乳腺癌（微阵列）基因表达数据集；近期又在bioRxiv上发表了一篇论文（http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2015\u002F11\u002F05\u002F030650），将其应用于病原菌铜绿假单胞菌的所有可用表达数据（微阵列和RNA-seq）。据悉，该文即将被期刊正式发表。\n\n**利用Ladder网络进行基因表达分类** [[论文](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007%2F978-3-319-78723-7_23)]\n\n本文将半监督深度学习方法Ladder网络应用于二分类癌症问题。模型性能在TCGA数据集上与其他深度学习及传统机器学习方法进行了对比评估。\n\n**利用深度架构学习基因表达数据中的结构** [[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2015\u002F11\u002F16\u002F031906)]\n\n同样探讨了使用堆叠式去噪自编码器处理基因表达数据，但据我所知目前尚无可用的实现。此处列出仅供参考。\n\n**基于深度学习的基因表达推断** 
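这项任务的基本设定——由一小组预先选定的“标志基因”的表达推断其余基因的表达——可以用如下缩小规模的草图示意。数据为随机生成，且以线性回归代替论文中的深度网络，仅演示任务形式本身：

```python
import numpy as np

rng = np.random.default_rng(0)

# 缩小规模的假设数据：200 个样本、10 个“标志基因”、3 个“目标基因”
n_samples, n_landmark, n_target = 200, 10, 3
L = rng.random((n_samples, n_landmark))                  # 标志基因表达
A_true = rng.random((n_landmark, n_target))
T = L @ A_true + 0.01 * rng.normal(size=(n_samples, n_target))  # 目标基因表达

# 在前 150 个样本上拟合线性映射，在后 50 个样本上评估
A_hat, *_ = np.linalg.lstsq(L[:150], T[:150], rcond=None)
pred = L[150:] @ A_hat
r = np.corrcoef(pred.ravel(), T[150:].ravel())[0, 1]
print(round(float(r), 3))  # 在这份线性生成的数据上，线性基线即可达到很高的相关性
```
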
[[github](https:\u002F\u002Fgithub.com\u002Fuci-cbcl\u002FD-GEX)][[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2015\u002F12\u002F15\u002F034421)]\n\n该研究聚焦于一项特定的预测任务，即根据约1000个预先选定的“标志基因”面板，预测指定目标基因的表达水平。作者指出，基因表达水平往往高度相关，在某些情况下，使用此类基因面板并通过计算推断其他基因的表达是一种经济高效的做法。基于Pylearn2\u002FTheano框架。\n\n**利用自编码器模型学习酵母转录组系统的层次化表征** [[论文](http:\u002F\u002Fbmcbioinformatics.biomedcentral.com\u002Farticles\u002F10.1186\u002Fs12859-015-0852-1)]\n\n作者使用堆叠式自编码器从数千份酵母微阵列数据中学习生物学特征，并分析隐藏层的表征，结果表明这些表征以层次化方式编码了生物学信息，例如转录因子会在第一层隐藏层中得到体现。\n\n**结合系统级生物学信息提升基因表达聚类：鲁棒自编码器方法** [[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F05\u002F214122)]\n\n采用鲁棒自编码器（带有异常值过滤功能的自编码器）对基因表达谱进行聚类分析。\n\n**基于深度学习的从头预测变异对表达及疾病风险的影响** [[github](https:\u002F\u002Fgithub.com\u002FFunctionLab\u002FExPecto)][[论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41588-018-0160-6)]\n\n作者采用两步模型预测遗传变异对基因表达的影响。第一步，他们训练了一个卷积神经网络，用于建模ENCODE和ROADMAP联盟收集的2002种表观遗传标记；第二步，基于第一步卷积神经网络模型所编码的基因顺式调控区域，进一步训练了一个组织特异性正则化线性模型。随后，通过计算机模拟诱变计算变异对组织特异性基因表达的影响及其变化幅度。\n\n### 影像与基因表达 \u003Ca name='imaging_expression'>\u003C\u002Fa>\n\n**数字病理学中的转录组学习** [[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F11\u002F760173.full.pdf)]\n\n摘自摘要：“我们提出了一种基于多模态数据融合的创新方法，并证明我们的深度学习模型HE2RNA仅通过全切片图像即可系统性地预测RNA测序谱，而无需专家标注。HE2RNA在设计上具有可解释性，从而为虚拟染色开辟了新的机遇。事实上，它能够实现基因表达的虚拟空间化，这一点已在独立数据集上的双重染色实验中得到验证。此外，HE2RNA所学习到的转录组表示可以被迁移，以提升其他任务的预测性能，尤其是在小规模数据集上。”\n\n### 预测增强子和调控区域 \u003Ca name='genomics_enhancers'>\u003C\u002Fa>\n\n在此类研究中，输入通常是“原始”DNA序列，卷积网络（或卷积层）常被用来学习序列中的规律性特征。感谢[Melissa Gymrek](http:\u002F\u002Fmelissagymrek.com\u002Fscience\u002F2015\u002F12\u002F01\u002Funlocking-noncoding-variation.html)指出其中一些工作。\n\n**DanQ：一种混合卷积与循环神经网络的深度神经网络，用于量化DNA序列的功能** [[github](https:\u002F\u002Fgithub.com\u002Fuci-cbcl\u002FDanQ)]\n\n该模型用于预测非编码DNA序列的功能。它使用卷积层来捕捉调控基序（即控制基因表达的单个DNA片段等），并利用LSTM类型的循环层尝试发现这些单个基序如何协同工作的“语法”。基于Keras\u002FTheano框架。\n\n**Basset – 
利用深度卷积神经网络学习可及基因组的调控代码** [[github](https:\u002F\u002Fgithub.com\u002Fdavek44\u002FBasset)]\n\n该软件基于Torch框架，专注于预测染色质的可及性（或“开放性”）——即遗传信息（DNA及其相关蛋白质）的物理包装状态。不同细胞类型中，染色质可能处于更为紧密或松弛的状态，而这在一定程度上受DNA序列的影响（但并非完全由其决定，否则所有细胞的染色质状态将无差异）。\n\n**Basenji – 利用卷积神经网络跨染色体进行序列调控活性预测** [[github1](https:\u002F\u002Fwww.github.com\u002Fcalico\u002Fbasenji)][[github2](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensor2tensor\u002Fblob\u002Fmaster\u002Ftensor2tensor\u002Fmodels\u002Fresearch\u002Fgene_expression.py)][[biorxiv](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F07\u002F10\u002F161851)]\n\n作为Basset的后续项目，这款基于TensorFlow的模型同时使用标准卷积和扩张卷积来建模多种细胞类型的调控信号及基因表达（以CAGE标签密度的形式）。值得注意的是，其底层模型已被纳入Google的Tensor2Tensor库（见上述“github2”链接），该库汇集了大量用于图像和语音识别、机器翻译、文本分类等领域的模型。然而，在撰写本文时，Tensor2Tensor中的模型似乎尚未成熟到易于使用，因此目前仍建议直接使用专门的Basenji仓库（“github1”）。\n\n**DeepSEA – 基于深度学习序列模型预测非编码变异的影响** [[网页服务器](http:\u002F\u002Fdeepsea.princeton.edu\u002Fjob\u002Fanalysis\u002Fcreate\u002F)][[论文](http:\u002F\u002Fwww.nature.com\u002Fnmeth\u002Fjournal\u002Fv12\u002Fn10\u002Ffull\u002Fnmeth.3547.html)]\n\n与上述工具类似，该软件也对染色质的可及性、特定蛋白质（转录因子）与DNA的结合情况，以及与可及性变化相关的所谓组蛋白标记进行了建模。与其他工具相比，DeepSEA更明确地聚焦于预测单核苷酸突变如何影响染色质结构。发表于高影响力期刊《Nature Methods》。\n\n**DeepBind – 利用深度学习预测DNA和RNA结合蛋白的序列特异性** [[代码](http:\u002F\u002Ftools.genes.toronto.edu\u002Fdeepbind\u002F)][[论文](http:\u002F\u002Fwww.nature.com\u002Fnbt\u002Fjournal\u002Fv33\u002Fn8\u002Ffull\u002Fnbt.3300.html)]\n\n该模型来自多伦多Brendan Frey团队，其作者还参与创立了Deep Genomics公司。DeepBind专注于根据ChIP-seq、ChIP-chip、RIP-seq、蛋白质结合微阵列以及HT-SELEX等实验数据，预测DNA结合蛋白或RNA结合蛋白的结合特异性。发表于高影响力期刊《Nature Biotechnology》。\n\n**DeeperBind - 提升DNA结合蛋白序列特异性预测能力** [[预印本](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1611.05777.pdf)]\n\n这是对DeepBind的改进尝试，通过在卷积层之后添加一个循环序列学习模块（LSTM），以捕捉原始DeepBind设计中池化步骤所丢失的位置维度。作者声称，基准测试表明，这种架构相较于先前的工作表现更优。\n\n**DeepMotif - 可视化基因组序列分类** 
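上述这类模型的第一层卷积之所以能充当“基序检测器”，可以用一个极简草图说明：把 DNA 序列独热编码后，用一个手工构造的“TATA”卷积核在序列上滑动打分。卷积核与示例序列均为演示用的假设，并非任何具体模型学到的权重：

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """独热编码：返回 shape 为 (序列长度, 4) 的矩阵。"""
    return np.array([[1.0 if b == base else 0.0 for base in BASES]
                     for b in seq])

def conv_scan(x, kernel):
    """无填充的一维“卷积”：卷积核沿序列滑动，逐窗口做内积打分。"""
    w = kernel.shape[0]
    return np.array([float((x[i:i + w] * kernel).sum())
                     for i in range(len(x) - w + 1)])

# 手工构造的“TATA”基序卷积核：匹配记 1 分，失配记 0 分
tata_kernel = one_hot("TATA")

seq = "GGCTATAAGC"
scores = conv_scan(one_hot(seq), tata_kernel)
print(int(scores.argmax()))  # 输出 3：最高分出现在 TATA 出现的位置
```

训练时，此类卷积核的权重不是手工指定而是从数据中学到的；DeepMotif 所做的输入优化正是反过来寻找能最大激活这类卷积核的序列模式。
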
[[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F1605.01133)]\n\n该研究同样致力于学习和预测蛋白质与特定DNA模式或“基序”的结合特异性。不过，这篇论文采用了卷积层与[高速公路网络](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1505.00387v2.pdf)相结合的方式，层数比DeepBind网络更多。作者还展示了如何通过输入优化生成典型的DNA基序：在保持所有权重不变的情况下应用反向传播，以找到能够最大程度激活网络中相应输出节点的输入模式。\n\n**用于预测DNA-蛋白质结合的卷积神经网络架构** [[代码](http:\u002F\u002Fcnn.csail.mit.edu\u002F)][[论文](http:\u002F\u002Fbioinformatics.oxfordjournals.org\u002Fcontent\u002F32\u002F12\u002Fi121.full)]\n\n这项工作系统性地探讨了用于DNA-蛋白质结合任务的卷积神经网络（CNN）架构，并得出结论：卷积核对于这类基于基序的任务的成功至关重要。有趣的是，作者不仅提供了Frey实验室DeepBind的Docker化实现（见上文），还提供了EC2启动脚本以及用于比较不同基于Caffe框架、支持GPU加速的模型的代码。\n\n**PEDLA：基于深度学习算法框架预测增强子** [[代码](https:\u002F\u002Fgithub.com\u002Fwenjiegroup\u002FPEDLA)][[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F01\u002F07\u002F036129)]\n\n该软件用于基于来自ENCODE计划等项目的异质数据，结合共1,114个特征预测增强子（即在特定条件下或特定细胞类型中能够增强基因表达的DNA片段，通常远离目标基因本身）。\n\n**DEEP：用于预测增强子的通用计算框架** [[论文](http:\u002F\u002Fnar.oxfordjournals.org\u002Fcontent\u002Fearly\u002F2014\u002F11\u002F05\u002Fnar.gku1058.full)][[代码](http:\u002F\u002Fcbrc.kaust.edu.sa\u002Fdeep\u002F)]\n\n一种用于增强子的集成预测方法。\n\n**利用监督式深度学习方法进行顺式调控区域的全基因组预测**（以及多篇将各类深度网络应用于调控区域预测的其他论文）[[代码](https:\u002F\u002Fgithub.com\u002Fyifeng-li\u002FDECRES)]（其中一篇[[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F02\u002F28\u002F041616)]）\n\nWyeth Wasserman课题组开发了一套基于Theano教程的工具包[[代码](https:\u002F\u002Fgithub.com\u002Fyifeng-li\u002FDECRES)]，用于将不同类型的深度学习架构应用于顺式调控元件（能够调节附近基因表达的DNA片段）的预测。他们在网络中使用了一个特殊的“特征选择层”，以限制模型中的特征数量。该层被实现为一个多层感知机输入层与第一隐藏层之间的一个额外的稀疏一对一线性层。\n\n**FIDDLE：用于功能基因组数据推断的整合式深度学习框架** 
[[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F10\u002F17\u002F081380)][[代码](https:\u002F\u002Fgithub.com\u002Fueser\u002FFIDDLE)][[YouTube演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=pcLTUsOm5pc&feature=youtu.be&list=PLlMMtlgw6qNjROoMNTBQjAcdx53kV50cS&t=2411)]\n\n该团队预测了转录起始位点和调控区域，但声称这一解决方案可以轻松推广到其他特征的预测。FIDDLE代表“深度学习驱动的数据灵活整合”。其核心思想（作者在上述YouTube视频中进行了清晰阐述）是利用卷积神经网络联合建模多种基因组信号，例如DNase-seq、ATAC-seq、ChIP-seq、TSS-seq，甚至可能是RNA-seq信号（如以.wig文件形式存储、每碱基对应一个值的基因组数据）。\n\n**从50万个随机序列中学习酵母5′非翻译区的调控语法的深度学习** [[论文](http:\u002F\u002Fgenome.cshlp.org\u002Fcontent\u002F27\u002F12\u002F2015)][[代码](http:\u002F\u002Fgenome.cshlp.org\u002Fcontent\u002Fsuppl\u002F2017\u002F11\u002F02\u002Fgr.224964.117.DC1\u002FSupplemental_code.tar.gz)]\n\n这是一个CNN模型，旨在根据特定类型基因组区域——5' UTR（5′非翻译区）——的DNA序列预测蛋白质表达。该模型基于Keras构建，作者的一个亮点是使用hyperopt优化了模型参数，相关过程也在随论文附带的Jupyter笔记本中有所展示。根据我的实验结果，该模型的表现令人鼓舞且易于复现。\n\n**基于注意力机制的神经网络对增强子-启动子相互作用的建模** [[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F14\u002F219667)][[代码](https:\u002F\u002Fgithub.com\u002Fwgmao\u002FEPIANN)]\n\n近年来，（循环）神经网络中的注意力机制变得非常流行，尤其是在机器翻译模型中取得了显著成效。本文提出了一种基于注意力机制的模型，用于研究增强子序列与启动子序列之间的相互作用。\n\n**利用卷积核网络预测转录因子结合位点** [[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F10\u002F217257)][[代码](https:\u002F\u002Fgitlab.inria.fr\u002Fdchen\u002FCKN-seq)]\n\n本文采用CNN（用于学习良好的表示）与核方法（用于学习有效的预测函数）相结合的方式，来预测转录因子结合位点。\n\n**利用RNA-seq、WGS和深度学习预测泛癌肿瘤基因组中的DNA可及性** [[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F12\u002F05\u002F229385)]\n\n与上文提到的Basset类似，本文展示了如何使用CNN从序列中预测DNA可及性，但进一步引入了来自不同细胞类型的RNA测序数据作为输入。通过这种方式，与细胞类型相关的隐含信息可以“传递”到DNA可及性的预测任务中。\n\n**基于碱基分辨率的深度学习揭示顺式调控密码的基元语法** 
[[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F08\u002F21\u002F737981.full.pdf)]\n\n这里使用了一个带有扩张卷积的CNN来学习不同转录因子结合基元之间的协同作用。这正是标题中提到的“基元语法”。该神经网络被训练用来预测来自碱基分辨率ChIP实验（ChIP-nexus）的信号，随后利用训练好的网络推导出基元协同作用的规则。\n\n**类BERT的DNA语言模型是强大的零样本全基因组变异效应预测器** [[GitHub](https:\u002F\u002Fgithub.com\u002Fsonglab-cal\u002Fgpn)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.08.22.504706v2)]\n\n一种类似于BERT的DNA语言模型，辅以扩张卷积处理原始的独热编码输入，同时保持序列长度固定，被训练于多个不同的植物基因组，并被证明能够预测拟南芥中的变异效应。作者声称，这种名为基因组预训练网络（GPN）的模型，在性能上超越了基于phyloP和phastCons等常用保守性评分的预测方法。\n\n\n\n### 非编码RNA \u003Ca name='genomics_non-coding'>\u003C\u002Fa>\n\n**DeepLNC：基于深度神经网络的长链非编码RNA预测工具** [[论文](http:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007%2Fs13721-016-0129-2)] [[网页服务器](http:\u002F\u002Fbioserver.iiita.ac.in\u002Fdeeplnc\u002F)]\n\n基于k-mer谱图，从DNA序列中识别潜在的长链非编码RNA分子。\n\n**深度循环神经网络发现复杂生物规则以解析RNA的蛋白质编码潜力** [[GitHub](https:\u002F\u002Fgithub.com\u002Fhendrixlab\u002FmRNN)][[论文](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F13\u002F200758.1)]\n\n摘自摘要：*尽管传统的基于特征的RNA分类方法受限于当前的科学认知，但深度学习方法能够从数据中自主地、从头开始发现复杂的生物规则。我们用人类信使RNA（mRNA）和长链非编码RNA（lncRNA）序列训练了一个门控循环神经网络（RNN）。我们的模型mRNA RNN（mRNN）在预测蛋白质编码潜力方面超越了现有最先进的方法。*\n\n### 甲基化 \u003Ca name='genomics_methylation'>\u003C\u002Fa>\n\n**DeepCpG - 预测单细胞中的DNA甲基化**\n[[论文](http:\u002F\u002Fdx.doi.org\u002F10.1186\u002Fs13059-017-1189-z)]\n[[代码](https:\u002F\u002Fgithub.com\u002Fcangermueller\u002Fdeepcpg)]\n[[文档](http:\u002F\u002Fdeepcpg.readthedocs.io\u002Fen\u002Flatest\u002F)]\n\nDeepCpG是一种用于预测多个细胞中DNA甲基化的深度神经网络。该模型采用模块化架构，包括一个循环CpG模块，用于捕捉细胞内及细胞间CpG位点之间的相关性；一个卷积DNA模块，用于从较宽的DNA序列窗口中提取模式特征；以及一个联合模块，整合来自CpG模块和DNA模块的信息，从而预测目标CpG位点在多个细胞中的甲基化状态。DeepCpG能够提供准确的预测结果，帮助发现与DNA甲基化状态及细胞间差异相关的DNA序列基序，并可用于分析单核苷酸突变对DNA甲基化的影响。DeepCpG基于Python实现，且已公开发布。\n\n**利用基因组拓扑特征和深度网络预测CpG二核苷酸的DNA甲基化状态** 
[[论文](http:\u002F\u002Fwww.nature.com\u002Farticles\u002Fsrep19598)][[网页服务器](http:\u002F\u002Fdna.cs.usm.edu\u002Fdeepmethyl\u002F)]\n\n该方法使用了一个堆叠自编码器，并在其顶部添加了一个监督层，以预测被称为“CpG岛”的特定基因组区域是否发生甲基化（DNA的一种化学修饰，可改变其功能；例如，基因附近的甲基化通常但不总是与该基因的下调或沉默相关）。本文采用的网络结构中，自编码器部分的隐藏层节点数远多于输入层，因此如果能进一步了解作者对这些隐藏层所代表意义的思考，将更有助于理解模型的工作原理。\n\n### 单细胞应用 \u003Ca name='genomics_single-cell'>\u003C\u002Fa>\n\n**DeepCpG - 预测单细胞DNA甲基化**\n[[论文](http:\u002F\u002Fdx.doi.org\u002F10.1186\u002Fs13059-017-1189-z)]\n[[代码](https:\u002F\u002Fgithub.com\u002Fcangermueller\u002Fdeepcpg)]\n[[文档](http:\u002F\u002Fdeepcpg.readthedocs.io\u002Fen\u002Flatest\u002F)]\n\n参见上文。\n\n**CellCnn – 用于检测疾病相关细胞亚群的表征学习**\n[[代码](https:\u002F\u002Fgithub.com\u002Feiriniar\u002FCellCnn)][[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2016\u002F03\u002F31\u002F046508)]\n\n这是一种基于卷积网络（Lasagne\u002FTheano）的方法，用于“表征学习以检测与表型相关的细胞亚群”。它之所以有趣，是因为大多数针对高维分子测量数据的神经网络方法（如上述基因表达类别中的方法）都使用自编码器而非卷积网络。\n\n**DeepCyTOF：通过深度学习和领域自适应实现质谱流式细胞术数据的自动化细胞分类**[[论文](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2016\u002F05\u002F31\u002F054411.full.pdf)]\n\n描述了使用自编码器方法（堆叠自编码器和多自编码器）对质谱流式细胞术（CyTOF）数据进行门控分析（将细胞分配到离散群体中）。\n\n**利用神经网络改进单细胞RNA测序数据分析**[[预印本](http:\u002F\u002Fbiorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F04\u002F23\u002F129759)]\n\n测试了多种神经网络架构，以获得单细胞基因表达数据的降维表示。同时引入了一个包含数万个单细胞图谱的数据库，用户可以通过查询该数据库，基于降维表示推断细胞类型或状态。\n\n**利用分布匹配残差网络去除批次效应**[[代码](https:\u002F\u002Fgithub.com\u002Fushaham\u002FBatchEffectRemoval)][[论文](https:\u002F\u002Facademic.oup.com\u002Fbioinformatics\u002Farticle-abstract\u002Fdoi\u002F10.1093\u002Fbioinformatics\u002Fbtx196\u002F3611270\u002FRemoval-of-Batch-Effects-using-Distribution)]\n\n基因组学、蛋白质组学等领域的大多数高通量实验都会在一定程度上受到系统性技术误差的影响，即所谓的“批次效应”。本文采用残差神经网络，通过尝试匹配例如单细胞RNA测序或质谱流式细胞术等重复实验的数据分布，来减弱批次效应。\n\n**主动深度学习降低自动细胞分割中的标注负担** 
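主动学习中最常见的策略之一是“不确定性采样”：优先把模型最没把握（预测概率最接近 0.5）的样本交给专家标注。下面的纯 Python 草图只演示这一选样逻辑，概率值为假设的模型输出，与该论文的具体方法无关：

```python
# 模型对 6 个未标注样本给出的（假设的）预测概率
probs = [0.98, 0.52, 0.10, 0.47, 0.85, 0.65]
budget = 2   # 本轮可请专家标注的样本数

# 按“离 0.5 的距离”从小到大排序，距离越小说明模型越不确定
ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
to_label = ranked[:budget]
print(sorted(to_label))  # 输出 [1, 3]：选中概率最接近 0.5 的两个样本
```
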
[[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F01\u002F211060)]\n\n主动学习是一种解决如何选择训练样本以最高效地训练模型的框架。研究表明，主动学习可以显著减少专家在高通量高内涵显微镜图像中进行细胞分割标注所需的时间。当然，此类应用的深度学习模型训练需要大量高质量的标注数据，但能够提供这些标注的人力资源既有限又昂贵。\n\n**scVAE：用于单细胞基因表达数据的变分自编码器** [[代码](https:\u002F\u002Fgithub.com\u002Fscvae\u002Fscvae)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F318295v2)]\n\n该方法直接从计数数据建模单细胞基因表达，无需初始归一化，并在潜在空间中进行聚类。由于基于变分自编码器，它还可以通过从潜在分布中采样来生成合成的单细胞数据。\n\n**CellBender remove-background：一种用于无监督去除scRNA-seq数据背景噪声的深度生成模型** [[代码](https:\u002F\u002Fgithub.com\u002Fbroadinstitute\u002FCellBender)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F03\u002F791699.full.pdf)]\n\n作者提出了一种生成模型，用于去除单细胞RNA测序数据中的统计背景噪声。\n\n**scVAE：单细胞变分自编码器** [[代码](https:\u002F\u002Fgithub.com\u002Fscvae\u002Fscvae)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F318295v4)]\n\nscVAE是一个命令行工具，用于利用变分自编码器对单细胞转录本计数进行建模。通过变分自编码器，不仅可以更紧凑地建模数据，还能根据真实数据的分布生成逼真的合成数据。\n\n**利用生成对抗神经网络实现单细胞RNA测序数据的逼真计算机模拟生成与增强** [[代码](https:\u002F\u002Fgithub.com\u002Fimsb-uke\u002FscGAN)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F390153v2)]\n\n摘要内容如下：“生物医学研究中的一个根本问题在于可用观测数量较少，这主要是由于生物样本不足、成本过高或伦理原因所致。通过用计算机模拟生成的样本对少量真实观测进行扩充，有望获得更稳健的分析结果并提高实验的可重复性。在此，我们提出使用条件单细胞生成对抗神经网络（cscGANs），以实现单细胞RNA测序数据的逼真生成。cscGANs可以从复杂的多细胞类型样本中学习非线性的基因间依赖关系，并利用这些信息生成具有明确类型的逼真细胞。”\n\n**知识驱动的神经网络实现单细胞测序数据的生物学可解释性深度学习** [[代码](https:\u002F\u002Fgithub.com\u002Fepigen\u002FKPNN)][[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F07\u002F794503.full.pdf)]\n\n摘要内容如下：“深度学习已成为预测各种复杂生物现象的强大方法。然而，由于通用深度神经网络难以揭示成功预测背后的生物学机制，其在生物发现方面的实用性迄今仍较为有限。在此，我们展示了\n在生物网络上的深度学习应用，其中每个节点都对应一个分子实体（如蛋白质或基因），每条边都有明确的机制解释（如信号通路中的调控相互作用）。借助知识驱动的神经网络（KPNNs），我们充分利用深度学习算法为多层网络赋予有意义权重的能力，从而实现可解释的深度学习。”\n\n**FlashDeconv：图谱规模的空间转录组学去卷积** 
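去卷积任务的基本设定可以用如下草图示意：把一个空间位点（spot）的表达谱近似为各细胞类型“签名”表达谱的加权和 y ≈ S·w，再反推权重 w。此处数据随机生成、用普通最小二乘求解；FlashDeconv 实际采用随机化草图、图拉普拉斯正则化等远更复杂的技术，本例仅展示问题形式：

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_types = 50, 3
S = rng.random((n_genes, n_types))    # 细胞类型签名矩阵（假设数据）
w_true = np.array([0.6, 0.3, 0.1])    # 该 spot 的真实细胞类型比例
y = S @ w_true                         # 观测到的 spot 表达谱（无噪声）

# 最小二乘反推各细胞类型的权重
w_hat, *_ = np.linalg.lstsq(S, y, rcond=None)
print(np.round(w_hat, 3))  # 近似恢复 [0.6, 0.3, 0.1]
```
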
[[代码](https:\u002F\u002Fgithub.com\u002Fcafferychen777\u002Fflashdeconv)][[论文](https:\u002F\u002Fdoi.org\u002F10.64898\u002F2025.12.22.696108)]\n\n一种高性能的细胞类型去卷积方法，采用保持结构的随机化草图技术和图神经网络正则化。可在约3分钟内处理100万个点，且计算复杂度呈线性O(N)增长。该方法利用杠杆得分重要性采样来保留稀有细胞类型的信号，并通过稀疏图拉普拉斯正则化实现空间一致性的预测结果。特别针对亚细胞分辨率平台（Visium HD、Stereo-seq、Xenium）进行了优化。\n\n## 化学信息学与药物发现 \u003Ca name=\"chemo\">\u003C\u002Fa>\n\n**面向分布外分子表征的子结构不变性学习** [[github](https:\u002F\u002Fgithub.com\u002Fyangnianzu0515\u002FMoleOOD)][[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=2nWUNTnFijm)] \n\nMoleOOD 是一种通用的分子表征学习框架，可将任何现有的分子表征学习方法作为骨干模型，以提升其在分布漂移情况下的泛化能力。具体而言，MoleOOD 设计了一种新的学习方案及其等效的实用实现。此外，MoleOOD 还开发了一个环境推理模型，无需手动指定环境即可识别每种分子对应的环境。\n\n**神经图指纹** [[github](https:\u002F\u002Fgithub.com\u002FHIPS\u002Fneural-fingerprint)]\n\n一种卷积神经网络，能够学习对预测新分子性质有用的特征——即“分子指纹”。该网络基于图结构运行，其中原子为节点，化学键为边。由瑞安·亚当斯团队开发，他曾共同主持非常优秀的播客节目《Talking Machines》。\n\n**基于数据驱动的分子连续表示的自动化学设计** [[github](https:\u002F\u002Fgithub.com\u002Faspuru-guzik-group\u002Fchemical_vae)][[预印本](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.02415)]\n\n摘要开头：“我们报告了一种将分子的离散表示与多维连续表示相互转换的方法。该模型允许我们在开放的化学化合物空间中高效地探索和优化，从而生成新分子。”\n\n**目标强化生成对抗网络（ORGAN）** [[github](https:\u002F\u002Fgithub.com\u002Fgablg1\u002FORGAN)][[预印本](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.10843)]\n\n这是一种将生成模型与强化学习相结合的方法，用于引导生成过程朝着特定目标进行。ORGAN 是一种适用于离散数据的通用方法，但在本例中以药物发现为应用场景进行了演示。\n\n**从化学反应的无监督学习中提取有机化学语法** [[github](https:\u002F\u002Fgithub.com\u002Frxn4chemistry\u002Frxnmapper)][[论文](https:\u002F\u002Fadvances.sciencemag.org\u002Fcontent\u002F7\u002F15\u002Feabe4166)]\n\n\n该工具包利用 Transformer 网络进行化学中的原子映射。摘要中写道：*在过去的几百年里，化学家通过观察原子在化学转化过程中如何重新排列，推导出一系列“反应规则”，这一过程称为原子映射。原子映射是一项繁琐的实验工作，而采用计算方法时，则需要持续标注化学反应并制定逻辑一致的指令。在此，我们证明了 Transformer 神经网络能够在无需监督或人工标注的情况下学习产物与反应物之间的原子映射信息。借助 Transformer 的注意力权重，我们构建了一个化学无关、基于注意力的反应映射器，并从未标注的反应数据集中提取出连贯的化学语法。*\n\n**基于深度强化学习的分子从头设计** 
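这类“用强化学习把序列生成策略推向期望性质”的思路，可以用一个极简的 REINFORCE 草图示意。字母表、奖励函数（序列中 'C' 的个数）均为玩具假设，策略也简化为逐位置独立的分类分布，与 REINVENT 的实际实现无关，仅演示策略梯度的机制：

```python
import numpy as np

rng = np.random.default_rng(0)
alphabet = ["C", "N", "O"]              # 玩具“化学字母表”
seq_len, lr, n_iters = 8, 0.3, 500

logits = np.zeros((seq_len, len(alphabet)))   # 策略：每个位置一个分类分布
baseline = seq_len / len(alphabet)            # 基线取均匀策略下的期望奖励

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for _ in range(n_iters):
    p = softmax(logits)
    # 按当前策略采样一条序列
    idx = [int(rng.choice(len(alphabet), p=p[i])) for i in range(seq_len)]
    reward = float(sum(1 for a in idx if alphabet[a] == "C"))  # 玩具“期望性质”
    baseline += 0.1 * (reward - baseline)     # 运行均值基线，降低梯度方差
    advantage = reward - baseline
    for pos, a in enumerate(idx):             # REINFORCE：advantage * d log p / d logits
        grad = -p[pos]
        grad[a] += 1.0
        logits[pos] += lr * advantage * grad

p_final = softmax(logits)
print(round(float(p_final[:, 0].mean()), 2))  # 'C' 的平均概率应明显高于初始的 0.33
```
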
[[github](https:\u002F\u002Fgithub.com\u002FMarcusOlivecrona\u002FREINVENT)][[预印本](https:\u002F\u002Farxiv.org\u002Fabs\u002F1704.07555)]\n\n这是一个使用强化学习的 PyTorch 序列生成模型。GitHub 页面上展示了一个精美的小部件，用于显示训练进度以及训练过程中生成的分子。摘要开头：“本文介绍了一种针对分子从头设计的序列生成模型的调优方法，该模型通过增强的片段似然函数，能够学习生成具有特定期望性质的结构。我们展示了该模型如何执行多种任务，例如生成查询结构的类似物，以及生成预计对特定生物靶点具有活性的化合物。”\n\n**药物发现中的单样本学习模型与 DeepChem** [[github](https:\u002F\u002Fgithub.com\u002Fdeepchem\u002Fdeepchem)][[Python 库](http:\u002F\u002Fdeepchem.io\u002F)][[论文](http:\u002F\u002Fpubs.acs.org\u002Fdoi\u002Fabs\u002F10.1021\u002Facscentsci.6b00367)]\n\nDeepChem 是一个“……旨在使机器学习在药物发现中的应用变得简单便捷的 Python 库”，它涵盖了深度学习领域的诸多前沿技术：单样本学习、图卷积网络、少样本学习以及 LSTM 嵌入等。根据 GitHub 网站介绍，“DeepChem 旨在提供一套高质量的开源工具链，推动深度学习在药物发现、材料科学和量子化学中的普及应用。”\n\n**有意义先导化合物的宝库：将深度对抗自编码器应用于肿瘤学中新分子的开发** [[github](https:\u002F\u002Fgithub.com\u002Fspoilt333\u002Fonco-aae)][[论文](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002Farticles\u002FPMC5355231\u002F)]\n\n探讨了生成对抗网络（GAN）在生成新型药物候选分子方面的应用。类似于生成“看起来”像来自某个特定分布的图像或视频（甚至可以按条件生成“一张猫的照片”），作者认为也可以通过提示所需药物类型来生成新颖的类药分子结构。在此研究中，他们探索了一种特定类型的生成网络——对抗自编码器（AAE），并将其改造为所谓的“人工智能驱动的药物发现引擎”。\n\n**深度学习助力快速鉴定强效 DDR1 激酶抑制剂** [[github](https:\u002F\u002Fgithub.com\u002Finsilicomedicine\u002Fgentrl)][[论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41587-019-0224-x)]\n\n在 Insilico Medicine 于 2019 年发表的一篇备受关注的论文中，研究人员采用了一种名为 GENTRL（生成张量强化学习）的方法，快速发现了针对某一重要靶点的小分子抑制剂。借助该方法，作者仅用 21 天就筛选出了一种候选分子。该模型首先通过变分自编码器进行生成，随后利用强化学习策略探索化学空间。他们还使用了一种基于 Kohonen 自组织映射的独特损失函数，并采用张量分解来编码化学结构与性质之间的关系。\n\n**Deep Genomics 提名业界首个由 AI 发现的治疗候选药物** [[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F09\u002F17\u002F693572.full.pdf)]\n\n2019 年 9 月，Deep Genomics 宣布其基于深度学习的平台已确定了一个治疗靶点及相应的药物候选。关于该候选分子所针对的致病机制的详细信息，请参阅上述预印本链接。\n\n## 生物标志物发现 \u003Ca name=\"biomarker\">\u003C\u002Fa>\n\n**人类衰老的深度生物标志物** 
[[在线预测工具](http:\u002F\u002Fwww.aging.ai\u002F)][[论文](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpubmed\u002F27191382)]\n\n摘要中指出：“人类衰老研究的主要障碍之一，是缺乏一套全面且可操作的生物标志物，这些标志物可用于监测和评估治疗干预的效果。在本研究中，我们设计了一个由21个深度神经网络组成的模块化集成模型，这些网络具有不同的深度、结构和优化策略，能够仅通过常规血液检测来预测人的实际年龄。”\n\n\n## 代谢组学 \u003Ca name=\"metabolomics\">\u003C\u002Fa>\n\n**深度学习准确预测乳腺癌代谢组学数据中的雌激素受体状态** [[代码](http:\u002F\u002Fpubs.acs.org\u002Fdoi\u002Fsuppl\u002F10.1021\u002Facs.jproteome.7b00595\u002Fsuppl_file\u002Fpr7b00595_si_001.pdf)][[论文](http:\u002F\u002Fpubs.acs.org\u002Fdoi\u002Ffull\u002F10.1021\u002Facs.jproteome.7b00595)]\n\n该研究比较了用于代谢组学数据中雌激素受体状态分类的算法，并发现表现最佳的是基于自编码器的前馈神经网络，其参数通过H2O的R语言接口进行调优。\n\n## 生成模型 \u003Ca name='generative'>\u003C\u002Fa>\n\n在许多情况下，生成与真实数据相似的合成数据有助于扩充数据集规模或避免侵犯患者隐私。以下列出了一些相关方法。\n\n**保护隐私的生成式深度神经网络支持临床数据共享** [[GitHub](https:\u002F\u002Fgithub.com\u002Fgreenelab\u002FSPRINT_gan)][[bioRxiv预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fearly\u002F2017\u002F11\u002F15\u002F159756)]\n\n本文提出了一种巧妙的方法：利用生成对抗网络（GAN）合成与研究参与者实际测量数据高度相似但无法追溯到特定个体的模拟数据。这种方法从设计上就融入了差分隐私机制，从而为确保研究对象不会被识别为试验参与者提供了强有力的保障。\n\n**利用生成模型创建人工人类基因组** [[预印本](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002Fbiorxiv\u002Fearly\u002F2019\u002F10\u002F07\u002F769091.full.pdf)]\n\n作者比较了受限玻尔兹曼机（RBM）和生成对抗网络（GAN）这两种工具在生成合成人类基因组方面的应用。\n\n## 群体遗传学 \u003Ca name='genomics_pop'>\u003C\u002Fa>\n\n**用于群体遗传推断的深度学习** [[代码](https:\u002F\u002Fsourceforge.net\u002Fprojects\u002Fevonet\u002F)][[论文](http:\u002F\u002Fjournals.plos.org\u002Fploscompbiol\u002Farticle?id=10.1371\u002Fjournal.pcbi.1004845)]\n\n**饮食网络：瘦型参数应对胖型基因组学** [[手稿](http:\u002F\u002Fopenreview.net\u002Fpdf?id=Sk-oDY9ge)]\n\n这篇标题颇为奇特的论文探讨了基因组学领域中常见的问题：特征数量远大于训练样本数量。在此背景下，作者聚焦于单核苷酸多态性（SNP，即个体间的遗传变异），提出了一种新的网络参数化方法，通过多任务架构学习输入特征的有效嵌入表示，从而大幅减少自由参数的数量。\n\n**dna-claude-analysis** 
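该工具链的第一步是用 Python 脚本解析消费级基因检测平台导出的原始基因型数据。下面是解析 23andMe 风格制表符分隔文件的一个极简草图，文件内容为虚构示例，字段含义以实际平台导出格式为准：

```python
# 23andMe 风格原始数据：以 # 开头的注释行 + 制表符分隔的 SNP 记录（虚构示例）
raw = """# rsid\tchromosome\tposition\tgenotype
rs4477212\t1\t82154\tAA
rs3094315\t1\t752566\tAG
rs3131972\t1\t752721\tGG
"""

def parse_23andme(text):
    """把原始文本解析为 {rsid: {染色体, 位置, 基因型}} 字典，跳过注释行。"""
    snps = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        snps[rsid] = {"chrom": chrom, "pos": int(pos), "genotype": genotype}
    return snps

snps = parse_23andme(raw)
print(len(snps), snps["rs3094315"]["genotype"])  # 输出：3 AG
```

解析出的字典即可逐条送入语言模型做变异解读，再汇总渲染成报告。
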
[[GitHub](https:\u002F\u002Fgithub.com\u002Fshmlkv\u002Fdna-claude-analysis)]\n\n这是一款个人基因组分析工具包，采用大语言模型驱动的变异解读技术，对来自23andMe或AncestryDNA等平台的原始DNA数据进行分析，涵盖健康风险、祖先溯源、药物基因组学、营养、运动健身、长寿等多个类别。该工具通过Python脚本解析SNP数据，并向语言模型查询每个变异的解释，最终生成结构化的报告，以单页HTML形式呈现分析结果。\n\n## 系统生物学\u003Ca name='sysbio'>\u003C\u002Fa>\n\n**利用深度学习建模细胞的层级结构与功能** [[网络服务器](http:\u002F\u002Fd-cell.ucsd.edu)][[论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnmeth.4627\u002F)]\n\n在这篇极具雄心的研究中，作者尝试基于数百万条基因型-表型关联数据，构建一个可解释的神经网络模型（VNN；可见神经网络），用于描述真核细胞。该网络采用12层的层级结构，每一层都对应着具有生物学意义的组织层次。最终构建的模型能够根据给定的遗传扰动预测可能产生的表型变化。","# deeplearning-biology 快速上手指南\n\n`deeplearning-biology` 并非一个单一的 Python 包，而是一个精选的深度学习在生物学（特别是基因组学）领域应用的开源项目清单与资源库。本指南将帮助你快速搭建环境并访问核心资源。\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS。Windows 用户建议使用 WSL2。\n*   **Python 版本**: 3.8 或更高版本。\n*   **硬件加速**: 强烈建议配备 NVIDIA GPU 以运行深度学习模型，并安装对应的 CUDA Toolkit。\n*   **前置依赖**:\n    *   `git`: 用于克隆仓库。\n    *   `pip`: Python 包管理工具。\n    *   `conda` (可选但推荐): 用于管理复杂的生物信息学依赖环境。\n\n> **国内加速建议**：\n> 推荐使用清华或阿里镜像源加速 Python 包安装：\n> ```bash\n> pip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 安装步骤\n\n由于该项目是资源列表，主要“安装”步骤是获取代码库以及安装其中提到的核心通用工具（如 Kipoi）。\n\n### 1. 克隆项目仓库\n获取完整的资源列表和文档：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyour-target-repo\u002Fdeeplearning-biology.git\ncd deeplearning-biology\n```\n*(注：请替换为实际的 GitHub 仓库地址，原 README 主要指向列表页面)*\n\n### 2. 安装核心模型库 (Kipoi)\n列表中推荐的 **Kipoi** 是基因组学预测模型的\"Model Zoo\"，提供了统一的接口，是入门的首选工具：\n```bash\npip install kipoi\n```\n\n### 3. 安装深度学习框架\n根据具体模型需求，安装 PyTorch 或 TensorFlow。以 PyTorch 为例（推荐国内镜像）：\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n```\n\n## 基本使用\n\n以下是利用该资源库中推荐的工具进行简单序列建模的示例。\n\n### 示例：使用 Kipoi 加载预训练模型进行预测\nKipoi 允许你通过几行代码调用社区贡献的数百种基因组学模型。\n\n```python\nimport kipoi\nfrom kipoi.data import default_collate\n\n# 1. 
获取模型描述符 (例如：一个用于预测转录因子结合位点的模型)\n# 注意：模型名称仅为示意，实际可用的模型列表可通过 kipoi.list_models() 查询\nmodel_name = \"DeepBind\u002FHLF-2015_Homo_sapiens\"\n\n# 2. 加载模型\nmodel = kipoi.get_model(model_name)\n\n# 3. 准备输入数据（占位示意）\n# 注意：predict_on_batch 期望的是符合模型 schema 的数值数组（通常为独热编码），\n# 实际使用中通常通过 Kipoi 自带的数据加载器从 FASTA 文件生成批次输入\ninput_sequences = [\"ACGTACGTACGT\", \"TGCATGCATGCA\"]\n\n# 4. 运行预测（此处假设输入已按模型要求完成编码）\npredictions = model.predict_on_batch(input_sequences)\n\nprint(f\"Prediction results: {predictions}\")\n```\n\n### 示例：探索特定子领域工具\n你可以直接前往仓库中列出的具体子项目（如 `UniRep` 或 `DragoNN`）的 GitHub 页面进行独立安装和使用。例如，使用 **UniRep** 生成蛋白质向量表示：\n\n```bash\n# 克隆 UniRep 仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fchurchlab\u002FUniRep.git\ncd UniRep\n\n# 安装依赖\npip install -r requirements.txt\n\n# 在 Python 中导入预训练模型类（示意命令；具体入口与权重加载方式请以\n# UniRep 仓库的 README 为准，其模型类为 unirep.py 中的 babbler1900）\npython -c \"from unirep import babbler1900; print('UniRep module imported')\"\n```\n\n通过浏览该项目的 `README` 目录结构，你可以快速定位到 **Sequence modelling**（序列建模）、**Protein biology**（蛋白质生物学）或 **Genomics**（基因组学）等具体分类，找到适合你研究任务的特定工具链接。","某生物信息学团队正致力于利用深度学习预测新型抗癌药物的靶点蛋白功能，急需寻找合适的算法模型进行验证。\n\n### 没有 deeplearning-biology 时\n- **检索效率低下**：研究人员需在 GitHub、arXiv 和各类期刊网站间反复切换搜索，耗费数周时间才能拼凑出零散的代码资源。\n- **领域匹配困难**：通用的深度学习模型库缺乏生物学特异性，难以区分哪些算法适用于基因组变异检测，哪些适合蛋白质结构预测。\n- **技术选型盲目**：缺乏权威的综述指引，团队无法快速了解卷积网络、图神经网络等在特定生物任务中的最新进展与优劣对比。\n- **重复造轮子**：因找不到现成的多组学整合或药物相互作用预测实现，不得不从头编写基础代码，严重拖慢研发进度。\n\n### 使用 deeplearning-biology 后\n- **资源一站获取**：直接通过分类目录（如“蛋白质功能预测”或“药物发现”）定位到经过筛选的高质量开源实现，将调研周期从数周缩短至几天。\n- **场景精准对接**：借助清晰的子领域划分，迅速锁定针对非编码 RNA 或单细胞应用的具体模型，确保技术方案与生物问题高度契合。\n- **前沿视野开阔**：通过内置的综述文献链接，团队快速掌握了可解释性模型和迁移学习在医疗领域的最新应用范式，优化了架构设计。\n- **开发加速落地**：直接复用列表中成熟的变体调用或增强子预测代码库，让团队能专注于核心业务逻辑调试而非底层实现。\n\ndeeplearning-biology 通过构建垂直领域的算法地图，消除了生物学家与深度学习技术之间的鸿沟，显著提升了科研转化的效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhussius_deeplearning-biology_d91a9c44.png","hussius","Mikael Huss","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhussius_4db7043b.png",null,"Codon 
Consulting","Stockholm","mikael.huss@gmail.com","mikaelhuss","http:\u002F\u002Ffollowthedata.wordpress.com","https:\u002F\u002Fgithub.com\u002Fhussius",2128,491,"2026-04-03T02:40:28",4,"","未说明",{"notes":93,"python":91,"dependencies":94},"该仓库并非单一的可执行软件工具，而是一个深度学习在生物学领域应用的项目、论文和代码库的汇总列表（Awesome List）。README 中列出的各个子项目（如 Kipoi, DragoNN, UniRep 等）拥有各自独立的运行环境和依赖要求。部分提及的工具（如 pysster）基于 TensorFlow 且支持 GPU，但整个列表没有统一的操作系统、Python 版本或硬件需求说明。用户需根据具体感兴趣的子项目查阅其对应的 GitHub 仓库以获取详细安装指南。",[],[18],"2026-03-27T02:49:30.150509","2026-04-06T07:13:01.303168",[],[]]