[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-HazyResearch--hyena-dna":3,"tool-HazyResearch--hyena-dna":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 
提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All 是一款让普通电脑也能轻松运行大型语言模型（LLM）的开源工具。它的核心目标是打破算力壁垒，让用户无需依赖昂贵的显卡（GPU）或云端 API，即可在普通的笔记本电脑和台式机上私密、离线地部署和使用大模型。\n\n对于担心数据隐私、希望完全掌控本地数据的企业用户、研究人员以及技术爱好者来说，GPT4All 提供了理想的解决方案。它解决了传统大模型必须联网调用或需要高端硬件才能运行的痛点，让日常设备也能成为强大的 AI 助手。无论是希望构建本地知识库的开发者，还是单纯想体验私有化 AI 聊天的普通用户，都能从中受益。\n\n技术上，GPT4All 基于高效的 `llama.cpp` 后端，支持多种主流模型架构（包括最新的 DeepSeek R1 蒸馏模型），并采用 GGUF 格式优化推理速度。它不仅提供界面友好的桌面客户端，支持 Windows、macOS 和 Linux 等多平台一键安装，还为开发者提供了便捷的 Python 库，可轻松集成到 LangChain 等生态中。通过简单的下载和配置，用户即可立即开始探索本地大模型的无限可能。",77307,"2026-04-11T06:52:37",[15,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":122,"forks":123,"last_commit_at":124,"license":125,"difficulty_score":126,"env_os":127,"env_gpu":128,"env_ram":129,"env_deps":130,"category_tags":140,"github_topics":141,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":145,"updated_at":146,"faqs":147,"releases":173},7210,"HazyResearch\u002Fhyena-dna","hyena-dna","Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena","HyenaDNA 是一个专为基因组学设计的长序列基础模型，基于高效的 Hyena 架构构建。它能够以单核苷酸分辨率处理长达 100 万个碱基对的 DNA 序列，突破了传统模型在上下文长度上的限制。\n\n在基因研究中，DNA 序列往往极长且包含复杂的远程依赖关系，普通模型难以一次性完整读取和分析。HyenaDNA 通过在人类参考基因组上进行大规模预训练，有效解决了这一难题，能够精准捕捉长距离的基因调控模式，适用于基因功能预测、变异影响分析等下游任务。\n\n这款工具主要面向生物信息学研究人员、AI 科学家以及希望探索基因组大模型的开发者。项目提供了从轻量级到超大规模的多种预训练权重，并集成了 Hugging Face 与 Google Colab，用户无需深厚工程背景即可快速加载模型进行微调或推理。\n\n其核心技术亮点在于结合了 Hyena 算子与闪式注意力机制（Flash Attention），在保持线性计算复杂度的同时实现了对百万级序列的高效建模。无论是想在免费算力上尝试 45 万长度序列的初学者，还是需要处理全基因组数据的资深专家，HyenaDNA 都提供了灵活且强大的解决方案。","# HyenaDNA\n\n![HyenaDNA_pipeline](assets\u002Fpipeline.png \"HyenaDNA\")\n\n## Important links:  \n- [arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.15794)  \n- [blog](https:\u002F\u002Fhazyresearch.stanford.edu\u002Fblog\u002F2023-06-29-hyena-dna)\n- [colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)  \n- [huggingface](https:\u002F\u002Fhuggingface.co\u002FLongSafari)\n- [discord](https:\u002F\u002Fdiscord.gg\u002FRJxUq4mzmW)\n- [youtube (talk)](https:\u002F\u002Fyoutu.be\u002FhaSkAC1fPX0?si=IUMmo_iGZ6SK1DBX)\n\n## Intro\n\nWelcome to the HyenaDNA repo! HyenaDNA is a long-range genomic foundation model pretrained on context lengths of up to \\***1 million tokens**\\* at \\***single nucleotide resolution**\\*. 
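\n\nIf you just want to see what running a pretrained HyenaDNA checkpoint looks like in code before diving into the repo, here is a minimal, hedged sketch (not the repo's own entry point) that loads one of the [LongSafari](https:\u002F\u002Fhuggingface.co\u002FLongSafari) checkpoints through the standard `transformers` `AutoTokenizer` \u002F `AutoModel` APIs with `trust_remote_code=True`. The `hyenadna-tiny-1k-seqlen-hf` model id and the structure of the returned hidden states are assumptions here, so check the model card of the checkpoint you actually use; the Colab linked below covers this flow end to end.\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\n\n# Assumed transformers-compatible export of a HyenaDNA checkpoint; see the\n# LongSafari model cards for the exact ids and usage they support.\nmodel_id = \"LongSafari\u002Fhyenadna-tiny-1k-seqlen-hf\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\nmodel = AutoModel.from_pretrained(model_id, trust_remote_code=True)\nmodel.eval()\n\n# HyenaDNA tokenizes DNA at single-nucleotide (character) resolution, so one\n# base maps to one token.\nsequence = \"ACGTACGTACGTACGT\"\ninput_ids = tokenizer(sequence, return_tensors=\"pt\")[\"input_ids\"]\n\nwith torch.no_grad():\n    outputs = model(input_ids)\n\n# Assumption: the checkpoint's remote code returns per-token hidden states,\n# either as a raw tensor or as the first field of a ModelOutput.\nhidden = outputs if torch.is_tensor(outputs) else outputs[0]\nembedding = hidden.mean(dim=1)  # mean-pool to one embedding per sequence\nprint(embedding.shape)\n```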
\n\nThe repo is a work in progress, but we're very excited to get this in the hands of researchers, so bear with us :)\n\nThis repo is best suited for those who want to pretrain a HyenaDNA model, or try one of the downstream tasks from the paper.\n\nFor the easiest entry point though, check out the HyenaDNA **[colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)**, a self-contained notebook that is Huggingface integrated. You'll be able to load pretrained weights and fine-tune on the GenomicBenchmarks dataset. Also, you'll be able to do inference and get embeddings on DNA sequences up to 450k nucleotides on the free tier. For 1 million long DNA sequences, you can get an A100 on Colab (paid tier), or run the notebook on your own machine.\n\n  \nCredit: much of the code is forked and extended from [S4](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fstate-spaces) and [Safari](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fsafari).\n\n## Discord\n\nTrying [Discord](https:\u002F\u002Fdiscord.gg\u002FRJxUq4mzmW) out! Maybe it'll be conducive to sharing ideas \u002F tips on how HyenaDNA could be applied in different ways. Feel free to post questions there.\n\n## Hugging Face pretrained weights\n\u003Ca name=\"huggingface\">\u003C\u002Fa>\n\nCheck these out :)  There are different model sizes, and different training sequence lengths that they can handle up to. All pretrained on a single human reference genome (hg38).\n\n- [tiny-1k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen\u002Ftree\u002Fmain)\n- [tiny-1k-d256](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen-d256\u002Ftree\u002Fmain)\n- [tiny-16k-d128](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-16k-seqlen-d128\u002Ftree\u002Fmain)\n- [small-32k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-small-32k-seqlen\u002Ftree\u002Fmain)\n- [medium-160k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-medium-160k-seqlen\u002Ftree\u002Fmain)\n- [medium-450k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-medium-450k-seqlen\u002Ftree\u002Fmain)\n- [large-1m](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-large-1m-seqlen\u002Ftree\u002Fmain)\n\nSee the suggested [GPU requirements](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen#hardware) for each model.\n\nThere are a few ways to use these HuggingFace weights, all with different flavors:\n\n1. [colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)  \n2. [Pytorch Lightning in this repo](#loadweights)\n3. [standalone](#standalone)\n\n## Dependencies\n\u003Ca name=\"dependencies\">\u003C\u002Fa>\n\nFor this repo, let's start with the dependencies that are needed. (If you're familiar with Docker, you can skip this section and jump to the [docker](#docker) setup below). The repo is built using Pytorch Lightning (a training library) and Hydra, a config-oriented ML library. 
(It'll be super helpful to get familiar with those tools.)\n\n- clone repo, cd into it\n\n```\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna.git && cd hyena-dna\n```\n\n- create a conda environment, with Python 3.8+\n\n```\nconda create -n hyena-dna python=3.8\n```\n\n- The repo is developed with Pytorch 1.13, using cuda 11.7\n\n```\nconda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia\n```\n\n- install requirements:\n```\npip install -r requirements.txt\n```\n- install Flash Attention, these [notes](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fsafari#getting-started) will be helpful.\n```\ncd hyena-dna\ngit submodule update --init\ncd flash-attention\ngit submodule update --init\npip install -e . --no-build-isolation\n```\n- optional fused layers for speed (takes a bit of time)\n```\n# from inside flash-attn\u002F\ncd csrc\u002Flayer_norm && pip install . --no-build-isolation\n```\n\n## Dockerfile\n\u003Ca name=\"docker\">\u003C\u002Fa>\n\nEven better, if you're familiar with Docker, we have an image you can pull with all the dependencies installed. It's the simplest and surest route, but it does require some familiarity with using Docker containers.\n\nSlight complication - you also need to clone the `flash-attn` repo that's used as a submodule in the main `hyena-dna` repo. That means you need the `--recurse-submodules` flag, in case you cloned without it.\n\n```\n# clones main and submodule repos\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna.git && cd hyena-dna\n\n```\n\nPrepare docker container\n```\n# build the image within the hyena-dna repo (it will grab the Dockerfile here).  You need to replace $USER_NAME with your own Dockerhub username.\ndocker build . -t $USER_NAME\u002Fhyena-dna\n\nOr,\n\n# pull already built image (our $USER_NAME is hyenadna)\ndocker pull hyenadna\u002Fhyena-dna:latest\n\n# run the container: this will give you an interactive shell with the dependencies\ndocker run --gpus all -it -p80:3000 hyenadna\u002Fhyena-dna \u002Fbin\u002Fbash\n```\n\nUpdate:\n\u003Ca name=\"docker_nt\">\u003C\u002Fa>\n\nWe actually have a second Docker image, which has all the Nucleotide Transformer datasets, checkpoint, and exact commands and hyperparameter settings used to reproduce the best results in the HyenaDNA paper.\n\n```\ndocker pull hyenadna\u002Fhyena-dna-nt6:latest \ndocker run --gpus all -it -p80:3000 hyenadna\u002Fhyena-dna-nt6 \u002Fbin\u002Fbash\n\n```\n\nThis will land you inside the `\u002Fwdr` directory, which has a file named `launch_commands_nucleotide_transformer` with all the launch commands (and associated hyperparameters) for the 18 Nucleotide Transformer datasets.\n\nWhat's the difference with the first Docker image you ask?  Not much, just some different dependency versions.\n\n\n## Quick Entry point \n\nA quick way to start with this repo is to train from scratch on a small genomics dataset. Let's try this just to see if things are set up ok.\n\nThe command below should auto-download a small dataset into `data\u002F`. It uses a small 2 layer HyenaDNA model with a linear decoder (head) on a binary classification task. 
It already beats the SotA by 7 pts (one task from GenomicBenchmarks), but we can do even better with a pretrained model.\n\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark_scratch\n```\n\nLet's describe this.\n\n\u003Ca name=\"quick\">\u003C\u002Fa>\n- `-m` lets you run the script as a module (no .py used in name).  \n- `train` is calling the main `train.py` script that launches all training \u002F finetuning experiments.\n- `wandb=null`, this connects to wandb too, but for quick testing I set to null.  Otherwise you can use something like `wandb.group=custom_name_here`.\n- `experiment` is passing the config for `experiment`, using the `genomic_benchmark_scratch.yaml` file, located in `configs\u002Fexperiments\u002Fhg38\u002F`.\n- You can pass other configs in the command line the same way, eg, `dataset=your_custom_datset_name`.  But more on that later.\n\n## Loading pretrained weights\n\u003Ca name=\"loadweights\">\u003C\u002Fa>\n\nThere are 2 ways to use the pretrained weights from HuggingFace:\n1. HuggingFace integration (best example), via [colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)\n2. Pytorch Lightning in this repo:\n- You can clone the HuggingFace repo, and pass the ckpt path to Pytorch Lighting (the .ckpt is from Lightning actually)\n- the flag is `train.pretrained_model_path=\u002Fpath\u002Fto\u002Fckpt`\n- you'll need to make sure the model config settings are the same when launching. The config is also in the HuggingFace repo.\n\n\n## Standalone code (HuggingFace too)\n\u003Ca name=\"standalone\">\u003C\u002Fa>\n\nWe actually have a 3rd way, but it's really just a copy of the colab but put into this repo as a `.py` file (in case that's more your thing). It's HuggingFace integrated, not Pytorch Lightning, so you don't get all the bells and whistles, but it is standalone, meaning it's easier to port to your own codebase.  It assumes you have all the [dependencies](#dependencies) installed already.\n\n- see the `huggingface.py` script for example of inference, loading pretrained weights from HF\n- and the `standalone_hyenadna.py`, which has all the classes you need to create a HyenaDNA model\n\n\n## Experiments\n\nWe share our training and dataloading code for pretraining on the human reference genome (HG38), fine-tuning on a number of downstreams, and examples of our in-context learning variants using soft prompt tokens and instruction fine-tuning. You'll need to download and preprocess on your own for now, we'll share our steps for those later.\n\nIn general, get comfortable with the configs in `configs\u002Fexperiments\u002Fhg38`, all our (sample) experiment settings are there.\n\n### Pretraining on Human Reference Genome\n\u003Ca name=\"pretraining\">\u003C\u002Fa>\n\nFirst step is download the Human Reference Genome data. It's comprised of 2 files, 1 with all the sequences (the .fasta file), and with the intervals we use (.bed file).\n\nThe file structure should look like\n```\ndata\n|-- hg38\u002F\n    |-- hg38.ml.fa\n    |-- human-sequences.bed\n\n```\n\n- Download fasta (.fa format) file (of the entire human genome) into hyena-dna\u002Fdata\u002Fhg38.  ~24 chromosomes in the whole genome (merged into 1 file), each chromosome is a continuous sequence, basically. 
Then download the .bed file with sequence intervals (contains chromosome name, start, end, split, which then allow you to retrieve from the fasta file)  \n\n```\nmkdir -p data\u002Fhg38\u002F\ncurl https:\u002F\u002Fstorage.googleapis.com\u002Fbasenji_barnyard2\u002Fhg38.ml.fa.gz > data\u002Fhg38\u002Fhg38.ml.fa.gz\ncurl https:\u002F\u002Fstorage.googleapis.com\u002Fbasenji_barnyard2\u002Fsequences_human.bed > data\u002Fhg38\u002Fhuman-sequences.bed\n```\n\nlaunch pretraining run  \n\n```\npython -m train wandb=null experiment=hg38\u002Fhg38_hyena model.d_model=128 model.n_layer=2 dataset.batch_size=256 train.global_batch_size=256 dataset.max_length=1024 optimizer.lr=6e-4 trainer.devices=1\n```\n\nLet's describe a little about this command.  \n- `experiment=hg38\u002Fhg38_hyena` passes the config for this experiment using a Hyena(DNA) model\n- `model.d_model=128`, and `model.n_layer=2` select the model width and depth, key hyparams\n- `dataset.max_length=1024` sets the max sequence length sampled from the dataset, the model layer max length is set from this too, or...\n- `model.layer.l_max`  # you can set the max model length manually\n- `model.d_inner`  # likewise, the reverse bottleneck with can be set manually too (default is 4x d_model)\n\nLots of other commands you can pass and customize, feel free to check out the `experiment=hg38\u002Fhg38_hyena` for details.\n\n### Pretraining on your own data\n\u003Ca name=\"pretraining_custom\">\u003C\u002Fa>\n\nTo pretrain on your own data, all you need is (ideally) a `.fasta` file.  You don't need a .bed file like we used for HG38, we just used that for convenience.  You can follow our species classification dataloader for how to setup a general pretraining dataloader that would randomly sample a chromosome and then a sequence of a given length.  \n\nSample pretraining dataloader  \n```\nsrc\u002Fdataloaders\u002Fdatasets\u002Fspecies_dataset.py\n```\n\nThe species dataloader can be used for pretraining as well by swapping out the .fasta file (for your own) and doing some wrangling with the configs.  There is also some code change to map the actual chromosomes you have in your .fasta file, so you'll have to dive into the code and what the dataloader is doing.  Most of the work in general for using this repo is just setting up dataloaders and configs (which takes time, but it's worth it!).\n\nNote: if you plan on pretraining on your own data, make sure to preprocess your data correctly, and your samples are what you expect in the dataloader. Things like, uppercase\u002Flowercase, unknown characters, etc. Also, if your sequences are variable length (in our setting we used fixed lengths mostly, since next token prediction should theoretically be introduced to variable length sequences) then the padding may become significant or an issue. ie, if your length range is 100-32k, then the 100 sequence will have a lot of padding, so you'll need to ignore those tokens in the loss to avoid instability in training. The padding token should be `4` by default, so you can pass this in the command line, `+task.loss.ignore_index=4`, or modify the config too (under `task.loss`).\n\n### GenomicBenchmarks\n\u003Ca name=\"genomicbenchmarks\">\u003C\u002Fa>\n\nThe [GenomicBenchmarks](https:\u002F\u002Fgithub.com\u002FML-Bioinfo-CEITEC\u002Fgenomic_benchmarks) is an easy to use set of datasets for sequence level classification. 
We use it as a good entry point to try new things out.\n\nSample run:\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark dataset_name=human_enhancers_cohn train.pretrained_model_path=\u002Fpath\u002Fto\u002Fckpt dataset.max_length=500 model.layer.l_max=1024\n```\n\nThis runs a HyenaDNA model on one of the datasets, auto-downloaded into `data\u002F`. Here are the other datasets and their stats, which you can pass into this config too. The config in `configs\u002Fdataset\u002Fgenomic_benchmark` is set up to pull in the correct dataset metadata (num_samples, classes, etc). \n\nJust like the [quick entry point](#quick) explained above, you'll need to set the flags for `dataset.max_length` you want to use, as well as the `model.layer.l_max`, which tells the model the max length you want to use. The inputs will be padded up to `model.layer.l_max`.  eg, if a data sample has length 500 and `l_max` = 1024, then positions 501 through 1024 will be padding.\n\nThe new flag here for this fine-tune experiment is to pass a pretrained ckpt via `train.pretrained_model_path=\u002Fpath\u002Fto\u002Fckpt`.\n\n\nThere are 8 datasets in this suite, choose 1 at a time (passing the `dataset.dataset_name` sets the num_classes and num_seqs automatically).\n```\n# name                                num_seqs        num_classes     median len    std\n# dummy_mouse_enhancers_ensembl       1210            2               2381          984.4  \n# demo_coding_vs_intergenomic_seqs    100_000         2               200           0\n# demo_human_or_worm                  100_000         2               200           0\n# human_enhancers_cohn                27791           2               500           0\n# human_enhancers_ensembl             154842          2               269           122.6\n# human_ensembl_regulatory            289061          3               401           184.3\n# human_nontata_promoters             36131           2               251           0\n# human_ocr_ensembl                   174756          2               315           108.1\n```\n\n### Nucleotide Transformer datasets\n\nYou can check out the [Nucleotide Transformer](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.01.11.523679v1) paper appendix for how to download and process the datasets. 
\n\nIf you'd like to use the pretrained weights we used to finetune on, you'll need the [tiny-1k-d256](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen-d256\u002Ftree\u002Fmain) weights on Huggingface.\n\nUpdate:  Or, you can invest a bit of time and learn how to use Docker, and just use our [pre-built Docker image](#docker_nt) that has the exact Nucleotide Transformer datasets\u002Fsplits, pretrained weights, and hyperparameters used to obtain the results in the HyenaDNA paper (by far the most convenient way to reproduce results).\n\nSample run:  \n```\n# trains from scratch\npython -m train wandb=null experiment=hg38\u002Fnucleotide_transformer dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026\n```\n\nAs with GenomicBenchmarks, we need to select which dataset to use from the 17 Nucleotide Transformer datasets.\n\nSee the dataset config in `configs\u002Fdataset\u002Fnucleotide_transformer` for more dataset metadata, but here's some:\n\n```\nFields\nname max_len n_classes n_samples metric\n\n# enhancer 200   2  14968 MCC\n# enhancer_types 200   3  14968 MCC\n# H3 500   2  13468 MCC\n# H3K4me1  500   2  28509 MCC\n# H3K4me2  500   2  27614 MCC\n# H3K4me3  500   2  33119 MCC\n# H3K9ac   500   2  25003 MCC\n# H3K14ac  500   2  29743 MCC\n# H3K36me3 500   2  31392 MCC\n# H3K79me3 500   2  25953 MCC\n# H4 500   2  13140 MCC\n# H4ac  500   2  30685 MCC\n# promoter_all   300   2  53276 F1\n# promoter_non_tata 300   2  47759 F1\n# promoter_tata  300   2  5517  F1\n# splice_sites_acceptor   600   2  19961 F1\n# splice_sites_donor   600   2  19775 F1\n```\n\nThe file structure for the data should look like:\n\n```\ndata\n|-- nucleotide_transformer\u002F\n    |-- enhancer\u002F\n        |-- all_test_enhancer.fasta\n        |-- all_train_enhancer.fasta\n    |-- H3\u002F\n        |-- H3_test.fasta\n        |-- H3_train.fasta\n    |-- promoter_tata\u002F\n        |-- promoter_tata_test.fasta\n        |-- promoter_tata_train.fasta\n    |-- ...\n\n```\n\n\n### In-context Learning\n\nWe use the [GenomicBenchmarks](#genomicbenchmarks) for exploring in-context learning (ICL). It should autodownload the data into `data\u002F`.\n\nSoft prompting example run:  \n```\npython -m evals\u002Fsoft_prompting_genomics\n```\n\nInstruction fine-tune example:  \n```\npython -m evals\u002Finstruction_tuned_genomics\n```\n\n### Chromatin Profile\n\nYou'll need to see the [DeepSea](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnmeth.3547) paper and [repo](https:\u002F\u002Fgithub.com\u002FFunctionLab\u002Fsei-framework) for info on how to download and preprocess.\n\n\nExample chromatin profile run:   \n```\npython -m train wandb=null experiment=hg38\u002Fchromatin_profile dataset.ref_genome_path=\u002Fpath\u002Fto\u002Ffasta\u002Fhg38.ml.fa dataset.data_path=\u002Fpath\u002Fto\u002Fchromatin_profile dataset.ref_genome_version=hg38\n```\n\n- `dataset.ref_genome_path`  # path to a human ref genome file (the input sequences)\n- `dataset.ref_genome_version`  # the version of the ref genome (hg38 or hg19, we use hg38)\n- `dataset.data_path`  # path to the labels of the dataset\n\n\n### Species Classification\n\nYou'll need to download fasta files for each species that you want to use (just the .zips, the dataloader will unzip automatically). 
You can download them using the following commands:\n\n```\n# Human\nwget -P human\u002F -r -nH --cut-dirs=12 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fall\u002FGCA\u002F009\u002F914\u002F755\u002FGCA_009914755.4_T2T-CHM13v2.0\u002FGCA_009914755.4_T2T-CHM13v2.0_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# Lemur\nwget -P lemur\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FLemur_catta\u002Flatest_assembly_versions\u002FGCA_020740605.1_mLemCat1.pri\u002FGCA_020740605.1_mLemCat1.pri_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# House mouse\nwget -P mouse\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FMus_musculus\u002Flatest_assembly_versions\u002FGCA_921998355.2_A_J_v3\u002FGCA_921998355.2_A_J_v3_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# Pig\nwget -P pig\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FSus_scrofa\u002Flatest_assembly_versions\u002FGCA_002844635.1_USMARCv1.0\u002FGCA_002844635.1_USMARCv1.0_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# Hippo\nwget -P hippo\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FHippopotamus_amphibius\u002Flatest_assembly_versions\u002FGCA_023065835.1_ASM2306583v1\u002FGCA_023065835.1_ASM2306583v1_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n```\n\nYour folder struture should look like this:  \n\n```\ndata\n|-- species\u002F\n    |-- chimpanzee\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- hippo\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- human\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- mouse\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- orangutan\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- other species ...\n|-- ...\n```\n\n\nSample species run: \n```\npython -m train wandb=null experiment=hg38\u002Fspecies dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=1 dataset.batch_size=1 dataset.max_length=1024 dataset.species_dir=\u002Fpath\u002Fto\u002Fdata\u002Fspecies\u002F model.layer.l_max=1026 model.d_model=128 model.n_layer=2 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null\n```\n\nLet's break some of these args down:\n- `experiment=hg38\u002Fspecies`  # main config for this experiment\n- `dataset.species`  # list of species you want (and already downloaded their .fasta files)\n- `decoder.mode=last`  # using the last token to classify (instead of default pooling)\n- `train.pretrained_model_path`  # if using a pretrained model, point to it, if not, set to null\n- `train.pretrained_model_state_hook=null`  # if using a pretrained model, this will load the weights properly (and not head). if not, set to null\n\n\n# More advanced stuff below\n\n\n## Setting up downstream experiments (fine tuning)\n\nLet's see what's needed to set up a downstream task.\n\nThe main ingredients are:\n\n1. 
Model weights and model config (which are provided via [HuggingFace](#huggingface) at the top)\n2. Custom dataset class and dataloader class\n3. Configs for `experiment`, `dataset`, `pipeline`, `model`. Don't worry, we have examples for each of these.\n\n\nAgain, example run, breakdown in launch command:  \n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark\n```\n\n### Model config:\n\nWe talked about some of the model config setting [above](#pretraining). We placed the model config within the experiment config for convenience (which can override, basically), but you can place in the `configs\u002Fmodel` dir if you want. There is a separate layer config at `configs\u002Fmodel\u002Flayer`. This is where it's useful to understand the Hydra config stuff.\n\n#### Flags for using ultralong context (gradient checkpointing)\n\nWe have a checkpoint flag that allows ~3x less memory on a GPU (to enable longer sequences).  However, this means that you may have trouble loading checkpoints\nif you don't set the flags correctly (they need to be True if it was pretrained with these, and False if not).\n\n- `model.checkpoint_mixer: True`  # set true for memory reduction\n- `model.checkpoint_mlp: True`  # set true for memory reduction\n\nNote, if it's not in the config and you want to pass it in the commandline, you would add a `+` in front, like this:  `+model.checkpoint_mixer=True`\n\nIf you get an error (like below) with the state_dict keys not matching, it's likely due to these flags, so toggle these on\u002Foff\n\n```\nMissing key in pretrained model! backbone.layers.0.mixer.layer.filter_fn.bias\n```\n\n### Setting up a Dataset class\n\nHere's a sample dataset class for a DNA downstream task.  \n\n`src\u002Fdataloaders\u002Fdatasets\u002Fgenomic_bench_dataset.py`\n\nIt's basically a standard Pytorch dataset.  Place data in the `data\u002F`, with something like `\u002Fdata\u002Fyour_custom_dataset_name`, so the repo can find it.\n\nHere's a sample dataloader for a DNA downstream task.  There's some more actually connecting with the HyenaDNA repo required here.\n\n`src\u002Fdataloaders\u002Fgenomic_bench_dataloader.py`\n\nNotice the name is placed with `_name_ = \"genomic_benchmark\"` as a class attribute. This name is how we find it. Also, we need to add the dataloader file\nto the `__init__`, see the top of this script, `src\u002Fdataloaders\u002F__init__.py`.\n\nI would emulate this dataloader file.  It's basically a way for Pytorch lightning to handle a lot of the dataloading stuff in the background.  Pass params to the init that you need to create it.  Notice the `def setup()`, this is where the dataset class is instantiated.  `setup()` gets called in the training script (more on that later).\n\nThere are 3 dataloader functions that create the train\u002Fval\u002Ftest dataloaders.  (In this example, the dataset only uses train and test dataloader.)\n\n### Creating Configs\n\nAs mentioned above, the main config is the experiment config, and for our example, located here `configs\u002Fexperiment\u002Fhg38\u002Fgenomic_benchmark.yaml`.\n\nYou can think of each of these sections as their own configs too. eg, `model`, `task`, `optimizer` etc. You can write them in here, or have it referenced\nat the top (as default or overide, subtle differences).\n\nFor a new dataset, we need a new `dataset` config and a `pipeline` config. These configs get passed when they're instantiated.  \n\nThe pipeline config hasn't been mentioned yet, but it's where we define a few different things. 
Take a look inside:\n\n`configs\u002Fpipeline\u002Fgenomic_benchmark.yaml` \n\nTry to emulate this config too, which will get referenced at the top of the `experiment` config.  We select the `optimizer`, `scheduler`, name of the `dataset`, the `task` (typically classification for these downstream tasks, but we have other options for the decoder). Don't worry about the `encoder`.  We do use a `decoder`, which is just a single MLP that maps the backbone to the number of classes we're trying to predict.  When you create the dataset class, it will require a `d_output` for the number of classes, and the `decoder` will automatically pull this attribute in the background, as well as the dimension of the backbone from `d_model`.  The `decoder` can also have options, like `pool`, where we average the token embeddings, or `last` or `first`, meaning which token we use for the MLP to learn from.\n\nIf you want to train at different sequence lengths, there are a few places we would need to change too.  Namely, the dataset config and the model configs.  You could\nchange these in the `experiment` config, or individually set up defaults in the standalone `dataset` \u002F `dataloader` configs too, up to you.\n\nThe `dataset` config expects a `max_length` to be set.\n\n`model.layer.l_max` expects a length too.  Usually set to the dataset max_length + 2.\n\n## Launch a finetuning experiment\n\n```\n# example downstream task\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark train.pretrained_model_path=\u003Cpath_to_ckpt>\n```\n\nThe dataset will automatically download to the `data\u002F` dir (probably), and it's not that large, ~5-10 min setup.  All you need to do is download the weights from [HuggingFace](#huggingface) above, and change the configs to match the model settings, and the dataset seq_len you want to use.  Might take some fumbling around to get right, but it'll be worth it! \n\nTo describe this `experiment` config a little more, let's dive in.  It finetunes a HyenaDNA (GPT-like). 
Let's focus on the `train` arguments.\n\n- `remove_test_loader_in_eval: true`  # no test set in this benchmark  \nWe have the option to remove an extra test_loader, eg, if val and test are the same.\n\n- `pretrained_model_strict_load: False`  # false allows encoder\u002Fdecoder to be used if new model uses it  \nSet false to play nicely when loading pretrained weights\n\nFor loading the backbone and not the head, both of the flags below are required:  \n- `pretrained_model_path: \u002Fhome\u002Fworkspace\u002Feric\u002Fsafari-internal\u002Foutputs\u002F2023-03-23\u002F07-10-41-239444\u002Fcheckpoints\u002Fval\u002Floss.ckpt`\nThis is where we pass the pretrained model to use as a backbone\n\n- `pretrained_model_state_hook`\n- `_name_: load_backbone`\nThis is a custom hook function that will load the backbone properly with a new MLP decoder head for the downstream task.\n\n- `freeze_backbone: false`  # seems to work much better if false (ie finetune entire model)  \nWe have the option to freeze here.\n\n#### Loading a finetuned model\n\nNext we'll show an example of loading weights (that were finetuned) on a downstream task (it will continue to train though).\n\n- see weights from [HuggingFace](#huggingface) above.\n- They are for a 2 layer, d_model=128 (width), with a max_length=1024 (sequence len)\n- Place these somewhere in the repo, typically we place them in the `outputs\u002F` dir.\n\nThe main things we need to do now are to update appropriate args in a config.\n\n```\n# path to the finetuned model config\nsafari-internal\u002Fconfigs\u002Fexperiment\u002Fhg38\u002Fgenomic_benchmark_load_finetuned_model.yaml\n```\n\nFor this config, select the dataset you want to train on with `dataset.dataset_name`; we'll use `human_nontata_promoters`, since this is what the weights above are fine-tuned on.\n\nNext, you need to update `train.pretrained_model_path` to point to wherever you placed the checkpoint in the repo.\n\nNow we can launch a run with this:\n\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark_load_finetuned_model\n```\n\nThis will run the main `src\u002Ftrain.py` script.\n\nLet's point out a few key locations in the train.py script, since it's a little confusing where all the stuff gets called.\n\n- loading weights occurs in `train.py`, in the `def load_state_dict()` function.  It actually calls a custom state hook to load gracefully (in `src\u002Fmodels\u002Fsequence\u002Flong_conv_lm.py`, in the `load_backbone()` function).\n\n- forward prop is done in the `def forward()` function, inside the `SequenceLightning` module of `train.py`, but really, it calls `self.task.forward()`, which actually makes the call to the model. That is to say, you need to go to `src\u002Ftasks\u002Ftasks.py`, find `class LMTask`, and its `def forward()` function.  Here you'll see the actual call to the model.  Note, the decoder head (a single MLP for classification) is separate from the main model backbone (feature extractor).\n\n### Sequence Length Warmup Callback\n\nWe have a sequence length warmup scheduler, implemented using a callback, which will increase sequence length in stages during training.  Basically the script will check what epoch and \"stage\" the training is at, and update the dataset\u002Fdataloaders to the parameters for that stage.  
Currently, you need to specify the stages manually in a config, the example config is at, and the relevant portion is at the bottom of the config, and here below too:\n\n```\nconfigs\u002Fexperiment\u002Fhg38\u002Fhg38_hyena_seqlen_warmup_reload.yaml\n```\n\nGuidance:\nYou have to be careful to know ahead of time that the batch size and seq len will fit into memory for EACH stage.  \n\nTo make your dataloader compatible with the seqlen warmup, you need to implement an interface, which is `init_datasets()`. Here's what it looks like:\n\n\nThe sharp edges:\n\nTo use this callback, we'll use the sample config above, `configs\u002Fexperiment\u002Fhg38\u002Fhg38_hyena_seqlen_warmup_reload.yaml`.  \n\nYou'll need to design the stages manually, ie, what epoch and seq len you want to gradually increase the seq len (and lower batch size).  Note, the `epochs` at each stage means how long we run that stage for (it's not cummulative).\n\n```\ncallbacks:\n  seqlen_warmup_reload:\n    # epochs refers to how long to run at that stage (not cummulative!)\n    # this is just a sample\n    stage_params:\n      - epochs: 2  # means run this stage for 2 epochs (0, and 1)\n        seq_len: 1024\n        batch_size: 256  # in the background, grad accum = 1, since train.global_batch_size=256\n      - epochs: 2  # run for 2 epochs (2 and 3)\n        seq_len: 2048\n        batch_size: 128\n      - epochs: 2  # run for epochs 4, 5\n        seq_len: 4096  #\n        batch_size: 64\n      - epochs: 2  # epoch 6, 7\n        seq_len: 8192  \n        batch_size: 32\n      - epochs: 4  #  epoch 8, 9, 10, 11\n        seq_len: 16_384  # \n        batch_size: 16\n      - epochs: 4  # epoch 12, 13, 14, 15\n        seq_len: 32_768\n        batch_size: 8\n```\n\nAs for the other parameters you run in the command line that are important:\n\nIn the sample config, see the \n\n- `train.global_batch_size` don't forget to set this!  It will control the `accumulate_grad_batches` to keep the lr consistent each stage. eg, 256 or 128 typically (maybe 64 for very long seqs)\n- `dataset.batch_size` now refers to the test (or final seq len and batch).  the test set will always be the same\n- `dataset.max_length` now refers to the test (or final seq len and max_length).  the test set will always be the same\n- `model.layer.l_max` needs to be set to the highest seq len +2 (the test set size)\n\nThings to note:\n\nTrain dataset will change during training, but the test set will always be fixed.  The test len\u002Fbatch size is set the normal way in your command launch, ie, `dataset.batch_size`, `dataset`.\n\n### Getting logits from pretrained model\n\u003Ca name=\"logits\">\u003C\u002Fa>\n\nHere's a simple [script](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fblob\u002Fmain\u002Fevals\u002Fhg38_inference.py) to get the logits from a pretrained model.\n\nThis isn't automated, so you'll need to download the weights manually from HF, and place them locally somewhere. You need the model head to get the logits.\n\nDifference from the [Huggingface](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fblob\u002Fmain\u002Fhuggingface.py): this script is meant for getting embeddings easily, which doesn't use the model head. We don't have a current use case for the logits yet, so there's some extra steps if you want those.\n\n### Experimental\n\n1. We have an experimental bidirectional implementation of HyenaDNA. 
We used this in a recent ablation on the GenomicBenchmarks dataset where we trained from scratch, ie, did not pretrain using masked language modeling via BERT. We compared this to the standard causal HyenaDNA, and the causal version performed better. But some people very much want a bidirectional HyenaDNA, so we provide one instantiation of this (there are many ways to do bidirectionality).\n\nAs for how we implemented it, we simply manipulate the padding of the FFT convolution. Check out the `src\u002Fmodels\u002Fsequence\u002Fhyena.py` script for more details (eg just search for `bidirectional`).\n\nTo use the bidirectional version, pass in the flag (at launch) `model.bidirectional=True`, that's it!  \n\nNote, the codebase only supports bidirectional training from scratch on a downstream task, ie, no masked language model pretraining. It doesn't make sense to do causal pretraining using bidirectionality, so use at your own risk!\n\n2. For downstream tasks, we added an option to only use \u002F average over masked tokens. We updated the GenomicBenchmarks and Nuc Trans datasets with this ability; see their dataset classes for how it's implemented. To use it:\n\n- you also need to set the right config settings. See their experiment configs, eg, `\u002Fsrc\u002Fconfigs\u002Fexperiment\u002Fhg38\u002Fgenomic_benchmark.yaml`, and in particular, use `dataset.return_mask=True`, and `dataset.padding_side=right`\n- you need to set the new task, called `masked_multiclass`, also in the experiment config.  All this does (differently than before) is handle the passing of the masks correctly to the model.\n\nIn practice, for short-range tasks without a lot of padding, we noticed it didn't make too much of a difference.  But if your sequences have a ton of padding, then this will definitely help. In the paper, we didn't use this, and had left-side padding by default.\n\n\n### Change Log \u002F updates:\n\n- Added more weights to [huggingface](#huggingface).\n- Added [docker image](#docker_nt) with Nucleotide Transformer datasets, weights, and exact hyperparameters to reproduce results.\n- There's an experimental bidirectional option added.  See [Experimental](#Experimental).\n- We added an option to pass a mask and ignore padded tokens for downstream tasks.  See [Experimental](#Experimental).\n- Added some tips on [pretraining](#pretraining_custom) on your own data.\n- Example to get [logits](#logits) from a pretrained model.\n\n\n## Citation\n\nFeel free to cite us if you find our work useful :)  \n```\n@article{nguyen2023hyenadna,\n      title={HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution}, \n      author={Eric Nguyen and Michael Poli and Marjan Faizi and Armin Thomas and Callum Birch-Sykes and Michael Wornow and Aman Patel and Clayton Rabideau and Stefano Massaroli and Yoshua Bengio and Stefano Ermon and Stephen A. 
Baccus and Chris Ré},\n      year={2023},\n      eprint={2306.15794},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n","# HyenaDNA\n\n![HyenaDNA_pipeline](assets\u002Fpipeline.png \"HyenaDNA\")\n\n## 重要链接：\n- [arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.15794)  \n- [博客](https:\u002F\u002Fhazyresearch.stanford.edu\u002Fblog\u002F2023-06-29-hyena-dna)\n- [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)  \n- [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FLongSafari)\n- [Discord](https:\u002F\u002Fdiscord.gg\u002FRJxUq4mzmW)\n- [YouTube（演讲）](https:\u002F\u002Fyoutu.be\u002FhaSkAC1fPX0?si=IUMmo_iGZ6SK1DBX)\n\n## 简介\n\n欢迎来到 HyenaDNA 仓库！HyenaDNA 是一个长序列基因组基础模型，在高达 ***100 万个 token*** 的上下文长度上进行了预训练，并且具有 ***单核苷酸分辨率***。\n\n该仓库目前仍在开发中，但我们非常期待将其交付给研究人员使用，请大家多多包涵 :)\n\n这个仓库最适合希望预训练 HyenaDNA 模型或尝试论文中下游任务的研究人员。不过，最简单的入门方式是使用 HyenaDNA 的 **[Colab 笔记本](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)**，这是一个集成了 Hugging Face 的自包含笔记本。你可以加载预训练权重并在 GenomicBenchmarks 数据集上进行微调。此外，在免费层级下，你还可以对长达 45 万核苷酸的 DNA 序列进行推理并获取嵌入。对于 100 万长度的 DNA 序列，你可以在 Colab 的付费层级中申请 A100 卡，或者在自己的机器上运行该笔记本。\n\n**致谢：** 代码大部分源自并扩展自 [S4](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fstate-spaces) 和 [Safari](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fsafari)。\n\n## Discord\n\n正在试用 [Discord](https:\u002F\u002Fdiscord.gg\u002FRJxUq4mzmW)！也许它能促进大家分享关于如何以不同方式应用 HyenaDNA 的想法和技巧。欢迎大家在那里提问。\n\n## Hugging Face 预训练权重\n\u003Ca name=\"huggingface\">\u003C\u002Fa>\n\n请查看这些权重 :) 它们有不同的模型尺寸和可处理的最大训练序列长度。所有模型均基于单一人类参考基因组（hg38）进行预训练。\n\n- [tiny-1k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen\u002Ftree\u002Fmain)\n- [tiny-1k-d256](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen-d256\u002Ftree\u002Fmain)\n- [tiny-16k-d128](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-16k-seqlen-d128\u002Ftree\u002Fmain)\n- [small-32k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-small-32k-seqlen\u002Ftree\u002Fmain)\n- [medium-160k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-medium-160k-seqlen\u002Ftree\u002Fmain)\n- [medium-450k](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-medium-450k-seqlen\u002Ftree\u002Fmain)\n- [large-1m](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-large-1m-seqlen\u002Ftree\u002Fmain)\n\n请参阅每个模型的建议 [GPU 要求](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen#hardware)。\n\n使用这些 Hugging Face 权重有几种方式，各有不同的特点：\n\n1. [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)  \n2. [本仓库中的 PyTorch Lightning](#loadweights)\n3. 
[独立运行](#standalone)\n\n## 依赖项\n\u003Ca name=\"dependencies\">\u003C\u002Fa>\n\n首先，我们来安装本仓库所需的依赖项。（如果你熟悉 Docker，可以跳过本节直接前往下方的 [Docker](#docker) 设置。）该仓库基于 PyTorch Lightning（一个训练框架）和 Hydra（一个面向配置的机器学习库）构建。熟悉这些工具将非常有帮助。\n\n- 克隆仓库并进入目录：\n\n```\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna.git && cd hyena-dna\n```\n\n- 创建一个 Conda 环境，Python 版本需为 3.8 或更高：\n\n```\nconda create -n hyena-dna python=3.8\n```\n\n- 本仓库使用 PyTorch 1.13 和 CUDA 11.7 开发：\n\n```\nconda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia\n```\n\n- 安装项目所需依赖：\n\n```\npip install -r requirements.txt\n```\n\n- 安装 Flash Attention，以下 [说明](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fsafari#getting-started) 将有所帮助：\n\n```\ncd hyena-dna\ngit submodule update --init\ncd flash-attention\ngit submodule update --init\npip install -e . --no-build-isolation\n```\n\n- 可选：安装加速用的融合层（耗时较长）：\n\n```\n# 在 flash-attn 目录内\ncd csrc\u002Flayer_norm && pip install . --no-build-isolation\n```\n\n## Dockerfile\n\u003Ca name=\"docker\">\u003C\u002Fa>\n\n如果你熟悉 Docker，我们提供了一个预装所有依赖的镜像，这是最简单、最可靠的方式，但需要一定的 Docker 使用经验。\n\n一个小插曲：你还需要克隆作为主 `hyena-dna` 仓库子模块的 `flash-attn` 仓库。这意味着你需要使用 `--recurse-submodules` 标志，以防你之前克隆时未启用该选项。\n\n```\n# 克隆主仓库及子模块\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna.git && cd hyena-dna\n```\n\n准备 Docker 容器：\n\n```\n# 在 hyena-dna 仓库中构建镜像（会自动读取 Dockerfile）。请将 $USER_NAME 替换为你自己的 Docker Hub 用户名。\ndocker build . -t $USER_NAME\u002Fhyena-dna\n\n或者，\n\n# 拉取已构建好的镜像（我们的 $USER_NAME 是 hyenadna）\ndocker pull hyenadna\u002Fhyena-dna:latest\n\n# 运行容器：这将为你提供一个带有所有依赖的交互式 shell\ndocker run --gpus all -it -p80:3000 hyenadna\u002Fhyena-dna \u002Fbin\u002Fbash\n```\n\n更新：\n\u003Ca name=\"docker_nt\">\u003C\u002Fa>\n\n我们实际上还有第二个 Docker 镜像，其中包含了所有的 Nucleotide Transformer 数据集、检查点以及用于复现 HyenaDNA 论文中最佳结果的精确命令和超参数设置。\n\n```\ndocker pull hyenadna\u002Fhyena-dna-nt6:latest \ndocker run --gpus all -it -p80:3000 hyenadna\u002Fhyena-dna-nt6 \u002Fbin\u002Fbash\n\n```\n\n这将带你进入 `\u002Fwdr` 目录，其中有一个名为 `launch_commands_nucleotide_transformer` 的文件，包含了针对 18 个 Nucleotide Transformer 数据集的所有启动命令及其相关超参数。\n\n你可能会问，与第一个 Docker 镜像有什么区别呢？其实差别不大，只是部分依赖版本有所不同而已。\n\n## 快速入门\n\n快速开始使用这个仓库的方法是在一个小规模的基因组数据集上从头训练。让我们先试试看，确认一切设置是否正常。\n\n下面的命令会自动将一个小数据集下载到 `data\u002F` 目录中。它使用一个小型的两层 HyenaDNA 模型，并配有一个线性解码器（头部），用于二分类任务。目前，该模型已经比 SotA 高出 7 个百分点（来自 GenomicBenchmarks 的一项任务），但如果我们使用预训练模型，还可以进一步提升性能。\n\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark_scratch\n```\n\n下面我们来详细说明一下：\n\n\u003Ca name=\"quick\">\u003C\u002Fa>\n- `-m` 允许你以模块方式运行脚本（无需在名称中使用 `.py`）。\n- `train` 调用主 `train.py` 脚本，启动所有训练和微调实验。\n- `wandb=null` 表示连接到 WandB，但在快速测试时我将其设置为 null。如果你希望记录实验，可以使用类似 `wandb.group=custom_name_here` 的配置。\n- `experiment` 传递实验配置，使用位于 `configs\u002Fexperiments\u002Fhg38\u002F` 中的 `genomic_benchmark_scratch.yaml` 文件。\n- 你也可以通过命令行传递其他配置，例如 `dataset=your_custom_datset_name`。不过这方面的内容我们稍后再详细介绍。\n\n## 加载预训练权重\n\u003Ca name=\"loadweights\">\u003C\u002Fa>\n\n有两种方法可以使用 Hugging Face 上的预训练权重：\n1. Hugging Face 集成（最佳示例），可通过 [colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing) 实现。\n2. 
在本仓库中使用 PyTorch Lightning：\n   - 你可以克隆 Hugging Face 的仓库，并将检查点路径传递给 PyTorch Lightning（实际的 `.ckpt` 文件来自 Lightning）。\n   - 使用的标志是 `train.pretrained_model_path=\u002Fpath\u002Fto\u002Fckpt`。\n   - 启动时需要确保模型配置与 Hugging Face 仓库中的配置一致。这些配置同样可以在 Hugging Face 仓库中找到。\n\n## 独立代码（也基于 Hugging Face）\n\u003Ca name=\"standalone\">\u003C\u002Fa>\n\n实际上我们还有第三种方式，但它只是将 Colab 笔记本的内容复制到本仓库中，保存为 `.py` 文件（如果你更喜欢这种方式的话）。这种方式集成的是 Hugging Face，而不是 PyTorch Lightning，因此不会提供所有高级功能，但它是一个独立的脚本，意味着更容易移植到你自己的代码库中。此方法假设你已经安装了所有[依赖项](#dependencies)。\n\n- 参阅 `huggingface.py` 脚本，了解如何进行推理以及从 Hugging Face 加载预训练权重。\n- 还有 `standalone_hyenadna.py`，其中包含了创建 HyenaDNA 模型所需的所有类。\n\n## 实验\n我们分享了用于在人类参考基因组（HG38）上进行预训练、在多个下游任务上进行微调，以及使用软提示标记和指令微调实现上下文学习变体的训练和数据加载代码。目前，你需要自行下载并预处理数据；我们稍后会分享这些步骤。\n\n总的来说，建议你熟悉 `configs\u002Fexperiments\u002Fhg38` 中的配置文件，因为我们所有的（示例）实验设置都保存在那里。\n\n### 在人类参考基因组上进行预训练\n\u003Ca name=\"pretraining\">\u003C\u002Fa>\n\n第一步是下载人类参考基因组的数据。它由两个文件组成：一个是包含所有序列的 `.fasta` 文件，另一个是我们使用的区间文件 `.bed`。\n\n文件结构应如下所示：\n```\ndata\n|-- hg38\u002F\n    |-- hg38.ml.fa\n    |-- human-sequences.bed\n```\n\n- 将整个人类基因组的 `.fa` 格式文件下载到 `hyena-dna\u002Fdata\u002Fhg38` 目录下。整个基因组大约有 24 条染色体，合并为一个文件，每条染色体基本上都是连续的序列。然后下载包含序列区间的 `.bed` 文件（包含染色体名称、起始位置、结束位置等信息，可用于从 `.fasta` 文件中提取相应序列）。\n\n```\nmkdir -p data\u002Fhg38\u002F\ncurl https:\u002F\u002Fstorage.googleapis.com\u002Fbasenji_barnyard2\u002Fhg38.ml.fa.gz > data\u002Fhg38\u002Fhg38.ml.fa.gz\ncurl https:\u002F\u002Fstorage.googleapis.com\u002Fbasenji_barnyard2\u002Fsequences_human.bed > data\u002Fhg38\u002Fhuman-sequences.bed\n```\n\n启动预训练运行：\n\n```\npython -m train wandb=null experiment=hg38\u002Fhg38_hyena model.d_model=128 model.n_layer=2 dataset.batch_size=256 train.global_batch_size=256 dataset.max_length=1024 optimizer.lr=6e-4 trainer.devices=1\n```\n\n下面我们对这条命令做一些解释：\n- `experiment=hg38\u002Fhg38_hyena` 传递了使用 Hyena(DNA) 模型的实验配置。\n- `model.d_model=128` 和 `model.n_layer=2` 分别设置了模型的宽度和深度，这是关键的超参数。\n- `dataset.max_length=1024` 设置了从数据集中采样的最大序列长度，模型的最大序列长度也会根据此值进行设定；或者你也可以手动设置 `model.layer.l_max`。\n- `model.d_inner` 同样可以手动设置反向瓶颈维度（默认为 `4 * d_model`）。\n\n还有很多其他可传递和自定义的命令，你可以查看 `experiment=hg38\u002Fhg38_hyena` 配置文件以获取更多细节。\n\n### 在你自己的数据上进行预训练\n\u003Ca name=\"pretraining_custom\">\u003C\u002Fa>\n\n要在你自己的数据上进行预训练，你只需要一个 `.fasta` 文件即可。不需要像我们用于 HG38 那样的 `.bed` 文件，我们使用 `.bed` 文件只是为了方便。你可以参考我们的物种分类数据加载器，学习如何构建一个通用的预训练数据加载器，它可以随机选择一条染色体，并从中抽取一段指定长度的序列。\n\n示例预训练数据加载器：\n```\nsrc\u002Fdataloaders\u002Fdatasets\u002Fspecies_dataset.py\n```\n\n通过替换 `.fasta` 文件为你的数据，并调整一些配置，物种分类数据加载器也可以用于预训练。此外，还需要修改代码以映射你 `.fasta` 文件中实际存在的染色体，因此你需要深入研究代码及其工作原理。总体而言，使用这个仓库的主要工作就是设置数据加载器和配置文件（虽然耗时，但非常值得）。\n\n注意：如果你计划在自己的数据上进行预训练，请务必正确预处理数据，并确保数据样本符合数据加载器的预期。例如，大小写、未知字符等问题。另外，如果你的序列长度不固定（我们在实验中主要使用固定长度的序列，因为理论上下一个词预测应该能够处理变长序列），那么填充可能会变得显著或成为问题。比如，如果你的序列长度范围是 100 到 32,000，那么长度为 100 的序列将会有很多填充，这时你需要在损失函数中忽略这些填充标记，以避免训练不稳定。填充标记的默认值是 `4`，你可以在命令行中通过 `+task.loss.ignore_index=4` 来指定，或者直接修改配置文件中的 `task.loss` 部分。\n\n### 基因组基准数据集\n\u003Ca name=\"genomicbenchmarks\">\u003C\u002Fa>\n\n[GenomicBenchmarks](https:\u002F\u002Fgithub.com\u002FML-Bioinfo-CEITEC\u002Fgenomic_benchmarks) 是一套易于使用的序列级别分类数据集。我们将其作为尝试新方法的良好起点。\n\n示例运行：\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark dataset_name=human_enhancers_cohn train.pretrained_model_path=\u002Fpath\u002Fto\u002Fckpt dataset.max_length=500 model.layer.l_max=1024\n```\n\n此命令在其中一个数据集上运行 HyenaDNA 模型，该数据集会自动下载到 `data\u002F` 目录中。以下是其他数据集及其统计信息，您也可以将这些信息传递到配置中。`configs\u002Fdataset\u002Fgenomic_benchmark` 中的配置已设置为正确加载数据集元数据（样本数、类别数等）。\n\n与上述 [快速入门](#quick) 一样，您需要设置要使用的 
`dataset.max_length` 标志，以及 `model.layer.l_max`，它指示模型您希望使用的最大序列长度。输入序列将被填充至 `model.layer.l_max` 长度。例如，如果数据样本长度为 500，而 `l_max` 设置为 1024，则会将 501 补齐至 `l_max`。\n\n在此微调实验中新增的标志是通过 `train.pretrained_model_path=\u002Fpath\u002Fto\u002Fckpt` 传入预训练检查点。\n\n该套件包含 8 个数据集，每次选择一个（通过传递 `dataset.dataset_name` 可自动设置类别数和序列数）。\n```\n# 名称                                序列数        类别数     中位数长度    标准差\n# dummy_mouse_enhancers_ensembl       1210            2               2381          984.4  \n# demo_coding_vs_intergenomic_seqs    100_000         2               200           0\n# demo_human_or_worm                  100_000         2               200           0\n# human_enhancers_cohn                27791           2               500           0\n# human_enhancers_ensembl             154842          2               269           122.6\n# human_ensembl_regulatory            289061          3               401           184.3\n# human_nontata_promoters             36131           2               251           0\n# human_ocr_ensembl                   174756          2               315           108.1\n```\n\n### 核苷酸 Transformer 数据集\n\n您可以查阅 [Nucleotide Transformer](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.01.11.523679v1) 论文附录，了解如何下载和处理这些数据集。\n\n如果您想使用我们用于微调的预训练权重，您需要 Hugging Face 上的 [tiny-1k-d256](https:\u002F\u002Fhuggingface.co\u002FLongSafari\u002Fhyenadna-tiny-1k-seqlen-d256\u002Ftree\u002Fmain) 权重。\n\n更新：或者，您可以花一点时间学习如何使用 Docker，并直接使用我们的 [预构建 Docker 镜像](#docker_nt)，其中包含了 HyenaDNA 论文中结果所用的确切核苷酸 Transformer 数据集\u002F划分、预训练权重和超参数（这是迄今为止最便捷的复现结果的方式）。\n\n示例运行：\n```\n# 从头开始训练\npython -m train wandb=null experiment=hg38\u002Fnucleotide_transformer dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026\n```\n\n与 GenomicBenchmarks 类似，我们需要从 17 个核苷酸 Transformer 数据集中选择要使用的数据集。\n\n请参阅 `configs\u002Fdataset\u002Fnucleotide_transformer` 中的数据集配置以获取更多元数据，以下是一些示例：\n```\n字段\n名称 最大长度 类别数 样本数 指标\n\n# enhancer 200   2  14968 MCC\n# enhancer_types 200   3  14968 MCC\n# H3 500   2  13468 MCC\n# H3K4me1  500   2  28509 MCC\n# H3K4me2  500   2  27614 MCC\n# H3K4me3  500   2  33119 MCC\n# H3K9ac   500   2  25003 MCC\n# H3K14ac  500   2  29743 MCC\n# H3K36me3 500   2  31392 MCC\n# H3K79me3 500   2  25953 MCC\n# H4 500   2  13140 MCC\n# H4ac  500   2  30685 MCC\n# promoter_all   300   2  53276 F1\n# promoter_non_tata 300   2  47759 F1\n# promoter_tata  300   2  5517  F1\n# splice_sites_acceptor   600   2  19961 F1\n# splice_sites_donor   600   2  19775 F1\n```\n\n数据文件结构应如下所示：\n```\ndata\n|-- nucleotide_transformer\u002F\n    |-- enhancer\u002F\n        |-- all_test_enhancer.fasta\n        |-- all_train_enhancer.fasta\n    |-- H3\u002F\n        |-- H3_test.fasta\n        |-- H3_train.fasta\n    |-- promoter_tata\u002F\n        |-- promoter_tata_test.fasta\n        |-- promoter_tata_train.fasta\n    |-- ...\n\n```\n\n\n### 上下文学习\n\n我们使用 [GenomicBenchmarks](#genomicbenchmarks) 来探索上下文学习 (ICL)。它应该会自动将数据下载到 `data\u002F` 目录中。\n\n软提示示例运行：\n```\npython -m evals\u002Fsoft_prompting_genomics\n```\n\n指令微调示例：\n```\npython -m evals\u002Finstruction_tuned_genomics\n```\n\n### 染色质谱\n\n您需要参考 [DeepSea](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnmeth.3547) 和 [repo](https:\u002F\u002Fgithub.com\u002FFunctionLab\u002Fsei-framework) 以获取有关如何下载和预处理的信息。\n\n染色质谱示例运行：\n```\npython -m train wandb=null experiment=hg38\u002Fchromatin_profile dataset.ref_genome_path=\u002Fpath\u002Fto\u002Ffasta\u002Fhg38.ml.fa dataset.data_path=\u002Fpath\u002Fto\u002Fchromatin_profile dataset.ref_genome_version=hg38\n```\n\n- 
`dataset.ref_genome_path`  # 人类参考基因组文件路径（输入序列）\n- `dataset.ref_genome_version`  # 参考基因组版本（hg38 或 hg19，我们使用 hg38）\n- `dataset.data_path`  # 数据集标签文件路径\n\n\n### 物种分类\n\n您需要为每个要使用的物种下载 FASTA 文件（只需下载 .zip 文件，数据加载器会自动解压）。可以使用以下命令下载：\n\n```\n# 人类\nwget -P human\u002F -r -nH --cut-dirs=12 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fall\u002FGCA\u002F009\u002F914\u002F755\u002FGCA_009914755.4_T2T-CHM13v2.0\u002FGCA_009914755.4_T2T-CHM13v2.0_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# 狐猴\nwget -P lemur\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FLemur_catta\u002Flatest_assembly_versions\u002FGCA_020740605.1_mLemCat1.pri\u002FGCA_020740605.1_mLemCat1.pri_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# 家鼠\nwget -P mouse\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FMus_musculus\u002Flatest_assembly_versions\u002FGCA_921998355.2_A_J_v3\u002FGCA_921998355.2_A_J_v3_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n# 猪\nwget -P pig\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FSus_scrofa\u002Flatest_assembly_versions\u002FGCA_002844635.1_USMARCv1.0\u002FGCA_002844635.1_USMARCv1.0_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n\n# 河马\nwget -P hippo\u002F -r -nH --cut-dirs=11 --no-parent ftp:\u002F\u002Fftp.ncbi.nlm.nih.gov\u002Fgenomes\u002Fgenbank\u002Fvertebrate_mammalian\u002FHippopotamus_amphibius\u002Flatest_assembly_versions\u002FGCA_023065835.1_ASM2306583v1\u002FGCA_023065835.1_ASM2306583v1_assembly_structure\u002FPrimary_Assembly\u002Fassembled_chromosomes\u002FFASTA\u002F\n```\n\n你的文件夹结构应该如下所示：\n\n```\ndata\n|-- species\u002F\n    |-- chimpanzee\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- hippo\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- human\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- mouse\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- orangutan\u002F\n        |-- chr1.fna\n        |-- chr2.fna\n        |-- ...\n    |-- 其他物种 ...\n|-- ...\n```\n\n\n示例物种运行：\n```\npython -m train wandb=null experiment=hg38\u002Fspecies dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=1 dataset.batch_size=1 dataset.max_length=1024 dataset.species_dir=\u002Fpath\u002Fto\u002Fdata\u002Fspecies\u002F model.layer.l_max=1026 model.d_model=128 model.n_layer=2 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null\n```\n\n让我们来分解其中的一些参数：\n- `experiment=hg38\u002Fspecies`  # 本次实验的主要配置\n- `dataset.species`  # 你想要的物种列表（并且已经下载了它们的 .fasta 文件）\n- `decoder.mode=last`  # 使用最后一个 token 进行分类（而不是默认的池化操作）\n- `train.pretrained_model_path`  # 如果使用预训练模型，指向该模型；否则设置为 null\n- `train.pretrained_model_state_hook=null`  # 如果使用预训练模型，这将正确加载权重（而非头部）。如果不使用，则设置为 null\n\n\n# 更高级的内容如下\n\n\n## 设置下游任务（微调）\n\n让我们看看设置下游任务需要什么。\n\n主要组成部分是：\n\n1. 模型权重和模型配置（这些通过顶部的 [HuggingFace](#huggingface) 提供）\n2. 自定义数据集类和数据加载器类\n3. 
`experiment`、`dataset`、`pipeline`、`model` 的配置。不用担心，我们为每个部分都提供了示例。\n\n\n再次提供示例运行及启动命令分解：\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark\n```\n\n### 模型配置：\n\n我们在[上面](#pretraining)讨论了一些模型配置设置。为了方便起见，我们将模型配置放在了实验配置中（可以覆盖默认值），但如果你愿意，也可以将其放在 `configs\u002Fmodel` 目录下。还有一个单独的层配置位于 `configs\u002Fmodel\u002Flayer`。在这里理解 Hydra 配置系统会很有帮助。\n\n#### 使用超长上下文的标志（梯度检查点）\n\n我们有一个检查点标志，可以在 GPU 上减少约 3 倍的内存占用（以支持更长的序列）。然而，这意味着如果你没有正确设置这些标志，可能会在加载检查点时遇到问题（如果模型是在启用这些标志的情况下预训练的，则应设置为 True；否则设置为 False）。\n\n- `model.checkpoint_mixer: True`  # 设置为 True 以减少内存占用\n- `model.checkpoint_mlp: True`  # 设置为 True 以减少内存占用\n\n请注意，如果配置中没有这些选项，而你想在命令行中传递它们，需要在前面加上一个加号，例如：`+model.checkpoint_mixer=True`\n\n如果你遇到类似以下的错误，即 state_dict 键不匹配，很可能是由于这些标志导致的，因此请尝试切换这些标志的开关：\n\n```\n预训练模型中缺少键！backbone.layers.0.mixer.layer.filter_fn.bias\n```\n\n### 设置数据集类\n\n这里是一个用于 DNA 下游任务的示例数据集类。\n\n`src\u002Fdataloaders\u002Fdatasets\u002Fgenomic_bench_dataset.py`\n\n它基本上是一个标准的 PyTorch 数据集。将数据放在 `data\u002F` 目录下，例如 `\u002Fdata\u002Fyour_custom_dataset_name`，以便仓库能够找到它。\n\n这里是一个用于 DNA 下游任务的示例数据加载器。实际上，还需要与 HyenaDNA 仓库进行一些连接。\n\n`src\u002Fdataloaders\u002Fgenomic_bench_dataloader.py`\n\n请注意，类属性中设置了 `_name_ = \"genomic_benchmark\"`。这个名称是我们用来查找它的标识。此外，我们需要将数据加载器文件添加到 `__init__.py` 中，具体位置在该脚本的顶部，即 `src\u002Fdataloaders\u002F__init__.py`。\n\n我会模仿这个数据加载器文件。它基本上是让 Pytorch Lightning 在后台处理大量数据加载工作的机制。在初始化时传递所需的参数即可创建数据加载器。注意其中的 `def setup()` 方法，这是实例化数据集类的地方。`setup()` 会在训练脚本中被调用（稍后会详细介绍）。\n\n这里有 3 个数据加载器函数用于创建训练、验证和测试的数据加载器。（在这个例子中，数据集只使用训练和测试数据加载器。）\n\n### 创建配置文件\n\n如上所述，主要配置是实验配置，对于我们的示例，位于 `configs\u002Fexperiment\u002Fhg38\u002Fgenomic_benchmark.yaml`。\n\n你可以将这些部分中的每一项也视为独立的配置文件，例如 `model`、`task`、`optimizer` 等。你可以在主配置文件中直接编写这些内容，或者在顶部引用它们（作为默认值或覆盖值，两者之间存在细微差异）。\n\n对于新的数据集，我们需要一个新的 `dataset` 配置文件和一个 `pipeline` 配置文件。这些配置文件会在实例化时被传递。\n\n目前还没有提到流水线配置，但它用于定义一些不同的内容。请查看内部：\n\n`configs\u002Fpipeline\u002Fgenomic_benchmark.yaml`\n\n你也应该尝试模仿这个配置文件，它会被引用到实验配置的顶部。我们在这里选择优化器、调度器、数据集名称以及任务类型（通常对于这些下游任务是分类任务，但我们也有其他解码器选项）。不用担心编码器。我们确实使用解码器，它只是一个简单的 MLP，用于将骨干网络映射到我们要预测的类别数量。当你创建数据集类时，它会要求指定输出维度 `d_output`，表示类别的数量，而解码器会在后台自动获取这一属性，同时也会从 `d_model` 中获取骨干网络的维度。解码器还可以有其他选项，比如 `pool`，即对 token 嵌入进行平均；或者 `last` 或 `first`，表示我们使用哪个 token 来让 MLP 学习。\n\n如果你想在不同序列长度下进行训练，还需要在几个地方进行更改。主要是数据集配置和模型配置。你可以在实验配置中修改这些内容，也可以分别在独立的 `dataset` 和 `dataloader` 配置文件中设置默认值，具体取决于你的需求。\n\n`dataset` 配置需要设置一个 `max_length`。\n\n`model.layer.l_max` 也需要设置一个长度。通常设置为数据集的最大长度加 2。\n\n## 启动微调实验\n\n```\n\n# 下游任务示例\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark train.pretrained_model_path=\u003C预训练模型权重路径>\n```\n\n数据集会自动下载到 `data\u002F` 目录（大概率如此），而且数据集不大，设置时间大约5-10分钟。你只需要从上面提到的 [HuggingFace](#huggingface) 下载权重，并调整配置以匹配模型设置和你想要使用的数据集序列长度即可。可能需要一些尝试和调整才能达到理想效果，但最终会值得的！\n\n为了更详细地介绍这个 `experiment` 配置，我们来深入了解一下。该配置对 HyenaDNA 模型（类似 GPT）进行微调。我们重点关注 `train` 参数。\n\n- `remove_test_loader_in_eval: true`  # 在这个基准测试中没有测试集  \n如果验证集和测试集相同，我们可以选择移除额外的测试加载器。\n\n- `pretrained_model_strict_load: False`  # 设置为 False 允许在新模型使用编码器\u002F解码器时继续使用  \n将此参数设置为 False 可以更好地兼容加载预训练权重。\n\n要加载骨干网络而不加载头部，需要同时设置以下两个标志：  \n- `pretrained_model_path: \u002Fhome\u002Fworkspace\u002Feric\u002Fsafari-internal\u002Foutputs\u002F2023-03-23\u002F07-10-41-239444\u002Fcheckpoints\u002Fval\u002Floss.ckpt`  \n这是用于指定作为骨干网络的预训练模型路径的地方。\n\n- `pretrained_model_state_hook`\n- `_name_: load_backbone`  \n这是一个自定义钩子函数，它会正确加载骨干网络，并为下游任务添加一个新的 MLP 解码器头部。\n\n- `freeze_backbone: false`  # 如果设置为 False（即对整个模型进行微调），效果似乎更好  \n这里可以选择是否冻结骨干网络。\n\n#### 加载微调后的模型\n\n接下来我们将展示如何加载在下游任务上进行过微调的权重（尽管如此，模型仍将继续训练）。\n\n- 权重来自上面提到的 
[HuggingFace](#huggingface)。\n- 这些权重适用于一个两层、d_model=128（宽度）、max_length=1024（序列长度）的模型。\n- 将这些权重放置在仓库中的某个位置，通常我们会将其放在 `outputs\u002F` 目录下。\n\n现在我们需要做的主要事情是更新配置文件中的相应参数。\n\n```\n# 微调后模型配置文件路径\nsafari-internal\u002Fconfigs\u002Fexperiment\u002Fhg38\u002Fgenomic_benchmark_load_finetuned_model.yaml\n```\n\n对于这个配置文件，选择你要用来训练的数据集 `dataset.dataset_name`，这里我们使用 `human_nontata_promoters`，因为上述权重就是在这个数据集上微调得到的。\n\n接下来，你需要更新 `train.pretrained_model_path` 参数，将其指向你存放权重的具体路径。\n\n现在我们可以用以下命令启动训练：\n\n```\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark_load_finetuned_model\n```\n\n这将运行主脚本 `src\u002Ftrain.py`。\n\n让我们指出 `train.py` 脚本中几个关键的位置，因为代码中各部分的调用关系有些复杂。\n\n- 权重加载是在 `train.py` 的 `def load_state_dict()` 函数中完成的。实际上，它会调用一个自定义的状态钩子来优雅地加载（位于 `src\u002Fmodels\u002Fsequence\u002Flong_conv_lm.py` 中的 `load_backbone()` 函数）。\n\n- 前向传播是在 `train.py` 中的 `SequenceLightning` 模块内的 `def forward()` 函数中完成的，但实际上，它会调用 `self.task.forward()`，而后者才是真正调用模型的部分。也就是说，你需要进入 `src\u002Ftasks\u002Ftasks.py`，找到 `class LMTask` 及其 `def forward()` 函数。在这里你会看到对模型的实际调用。需要注意的是，解码器头部（一个用于分类的单层 MLP）与主模型的骨干网络（特征提取器）是分开的。\n\n### 序列长度预热回调\n\n我们实现了一个基于回调的序列长度预热调度器，它会在训练过程中分阶段逐步增加序列长度。基本上，脚本会检查当前处于哪个 epoch 和“阶段”，并根据该阶段的参数更新数据集和数据加载器。目前，你需要手动在配置文件中指定各个阶段，示例配置文件如下，相关部分位于配置文件底部，也在此处列出：\n\n```\nconfigs\u002Fexperiment\u002Fhg38\u002Fhg38_hyena_seqlen_warmup_reload.yaml\n```\n\n注意事项：\n你需要提前确保每个阶段的批次大小和序列长度都能适配显存。\n\n为了让数据加载器与序列长度预热机制兼容，你需要实现一个接口，即 `init_datasets()`。它的具体实现如下所示：\n\n\n关键点：\n\n要使用这个回调，我们可以使用上面提供的示例配置文件：`configs\u002Fexperiment\u002Fhg38\u002Fhg38_hyena_seqlen_warmup_reload.yaml`。\n\n你需要手动设计各个阶段，即希望在哪些 epoch 逐步增加序列长度（同时降低批次大小）。请注意，每个阶段的 `epochs` 表示该阶段持续的时间（不是累计的）。\n\n```\ncallbacks:\n  seqlen_warmup_reload:\n    # epochs 表示在该阶段运行多长时间（非累计！）\n    # 这只是一个示例\n    stage_params:\n      - epochs: 2  # 表示在第 0 和第 1 个 epoch 运行此阶段\n        seq_len: 1024\n        batch_size: 256  # 后台中 grad accum = 1，因为 train.global_batch_size=256\n      - epochs: 2  # 在第 2 和第 3 个 epoch 运行\n        seq_len: 2048\n        batch_size: 128\n      - epochs: 2  # 在第 4 和第 5 个 epoch 运行\n        seq_len: 4096\n        batch_size: 64\n      - epochs: 2  # 在第 6 和第 7 个 epoch 运行\n        seq_len: 8192\n        batch_size: 32\n      - epochs: 4  # 在第 8、9、10 和第 11 个 epoch 运行\n        seq_len: 16_384\n        batch_size: 16\n      - epochs: 4  # 在第 12、13、14 和第 15 个 epoch 运行\n        seq_len: 32_768\n        batch_size: 8\n```\n\n此外，在命令行中还有一些重要的参数需要注意：\n\n在示例配置文件中，请注意以下内容：\n\n- `train.global_batch_size` 切勿忘记设置！它会控制 `accumulate_grad_batches`，以保持每个阶段的学习率一致。例如，通常设置为 256 或 128（对于非常长的序列，可能为 64）。\n- `dataset.batch_size` 现在指的是测试集的批次大小（或最终的序列长度和批次大小）。测试集始终保持不变。\n- `dataset.max_length` 现在指的是测试集的序列长度和最大长度。测试集始终保持不变。\n- `model.layer.l_max` 需要设置为最高的序列长度加 2（即测试集的大小）。\n\n注意事项：\n\n训练数据集会在训练过程中变化，但测试集始终固定不变。测试集的序列长度和批次大小通过常规方式在启动命令中设置，即 `dataset.batch_size` 和 `dataset.`。\n\n### 从预训练模型获取 logits\n\u003Ca name=\"logits\">\u003C\u002Fa>\n\n这里有一个简单的 [脚本](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fblob\u002Fmain\u002Fevals\u002Fhg38_inference.py)，可以用来从预训练模型获取 logits。\n\n这个过程并非自动化，因此你需要手动从 HF 下载权重，并将其保存在本地某个位置。要获取 logits，还需要模型的头部。\n\n与 [Huggingface](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fblob\u002Fmain\u002Fhuggingface.py) 的区别在于：这个脚本旨在方便地获取嵌入表示，而不涉及模型头部。目前我们还没有使用 logits 的实际场景，因此如果你需要它们，还需要额外的步骤。\n\n### 实验性功能\n\n1. 
我们实现了一个实验性的双向HyenaDNA模型。我们最近在GenomicBenchmarks数据集上进行了一项消融实验，其中我们从头开始训练，即未使用BERT的掩码语言建模进行预训练。我们将该模型与标准的因果HyenaDNA进行了对比，结果表明因果版本表现更好。然而，仍有一些用户非常希望使用双向HyenaDNA，因此我们提供了一种实现方式——当然，实现双向性的方法还有很多。\n\n关于具体实现，我们只是简单地调整了FFT卷积中的填充方式。更多细节请参阅`src\u002Fmodels\u002Fsequence\u002Fhyena.py`脚本（例如搜索“bidirectional”）。\n\n要使用双向模式，只需在启动时传入标志`model.bidirectional=True`即可，非常简单！\n\n需要注意的是，当前代码库仅支持在下游任务上从头开始进行双向训练，即不支持掩码语言模型的预训练。使用双向性进行因果预训练并无意义，请谨慎使用！\n\n2. 对于下游任务，我们新增了一个选项，允许仅对掩码标记进行处理或取平均。我们已将此功能更新到GenomicBenchmarks和Nuc Trans数据集的相关类中，具体实现方式请参考这些数据集类。使用方法如下：\n\n- 您还需要设置正确的配置参数。请参考相关实验配置文件，例如`\u002Fsrc\u002Fconfigs\u002Fexperiment\u002Fhg38\u002Fgenomic_benchmark.yaml`，特别注意将`dataset.return_mask=True`和`dataset.padding_side=right`这两个参数启用。\n- 您还需在实验配置中定义一个新的任务类型，名为`masked_multiclass`。以上设置的作用在于正确地将掩码传递给模型，这与之前的处理方式有所不同。\n\n实践中，对于序列较短且填充较少的任务，我们发现这一改进效果并不显著。但如果您的序列包含大量填充，则该功能将带来明显帮助。不过，在论文中，我们并未采用此功能，默认使用左侧填充。\n\n### 变更日志 \u002F 更新内容：\n\n- 在[Hugging Face](#huggingface)上增加了更多权重文件。\n- 发布了包含核苷酸Transformer数据集、权重及精确超参数的[Docker镜像](#docker_nt)，以便复现实验结果。\n- 新增了一个实验性的双向选项。详情请参阅[实验性功能](#Experimental)。\n- 我们还为下游任务添加了传递掩码并忽略填充标记的选项。详情请参阅[实验性功能](#Experimental)。\n- 增加了一些关于如何使用您自己的数据进行[预训练](#pretraining_custom)的提示。\n- 提供了从预训练模型获取[logits](#logits)的示例。\n\n## 引用\n\n如果您觉得我们的工作有所帮助，欢迎引用！  \n```\n@article{nguyen2023hyenadna,\n      title={HyenaDNA: 高分辨率单核苷酸级长距离基因组序列建模}, \n      author={Eric Nguyen and Michael Poli and Marjan Faizi and Armin Thomas and Callum Birch-Sykes and Michael Wornow and Aman Patel and Clayton Rabideau and Stefano Massaroli and Yoshua Bengio and Stefano Ermon and Stephen A. Baccus and Chris Ré},\n      year={2023},\n      eprint={2306.15794},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```","# HyenaDNA 快速上手指南\n\nHyenaDNA 是一个长序列基因组基础模型，支持高达 **100 万 token** 的上下文长度和**单核苷酸分辨率**。本指南将帮助您快速搭建环境并运行模型。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐)\n- **Python**: 3.8+\n- **GPU**: 支持 CUDA 的 NVIDIA 显卡（预训练大模型建议 A100，小模型推理可用免费 tier Colab 或普通显卡）\n- **CUDA**: 11.7 (官方开发环境)\n\n### 前置依赖\n- Git (需支持 submodule)\n- Conda (推荐用于环境管理)\n- Docker (可选，最简便的安装方式)\n\n## 2. 安装步骤\n\n您可以选择 **Conda 手动安装** 或 **Docker 镜像** 两种方式。\n\n### 方式一：Conda 手动安装（推荐开发者）\n\n1. **克隆仓库**\n   ```bash\n   git clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna.git && cd hyena-dna\n   ```\n\n2. **创建 Conda 环境**\n   ```bash\n   conda create -n hyena-dna python=3.8\n   conda activate hyena-dna\n   ```\n\n3. **安装 PyTorch (CUDA 11.7)**\n   *注：国内用户可使用清华源加速*\n   ```bash\n   conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia\n   # 或使用清华源\n   # conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fcloud\u002Fpytorch -c https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fanaconda\u002Fcloud\u002Fnvidia\n   ```\n\n4. **安装基础依赖**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n5. **安装 Flash Attention (核心组件)**\n   ```bash\n   cd hyena-dna\n   git submodule update --init\n   cd flash-attention\n   git submodule update --init\n   pip install -e . --no-build-isolation\n   ```\n\n6. **(可选) 安装融合层以加速**\n   ```bash\n   cd csrc\u002Flayer_norm && pip install . --no-build-isolation\n   ```\n\n### 方式二：Docker 安装（最简便）\n\n如果您熟悉 Docker，可直接拉取预构建镜像，包含所有依赖。\n\n1. 
**拉取并运行镜像**\n   ```bash\n   docker pull hyenadna\u002Fhyena-dna:latest\n   docker run --gpus all -it -p80:3000 hyenadna\u002Fhyena-dna \u002Fbin\u002Fbash\n   ```\n   *注：如需复现论文中 Nucleotide Transformer 数据集的结果，可使用 `hyenadna\u002Fhyena-dna-nt6:latest` 镜像。*\n\n## 3. 基本使用\n\n### 快速入门：从头训练小模型\n以下命令将自动下载一个小规模基因组数据集，并使用一个小型 HyenaDNA 模型（2 层）进行二分类任务训练。这是验证环境是否配置正确的最佳方式。\n\n```bash\npython -m train wandb=null experiment=hg38\u002Fgenomic_benchmark_scratch\n```\n\n**参数说明：**\n- `wandb=null`: 禁用权重与偏差跟踪，适合本地快速测试。\n- `experiment=hg38\u002Fgenomic_benchmark_scratch`: 加载位于 `configs\u002Fexperiments\u002Fhg38\u002F` 下的配置文件。\n\n### 使用预训练模型 (Hugging Face)\n\nHyenaDNA 提供了多种尺寸的预训练权重（从 1k 到 1M 序列长度）。\n\n#### 方法 A：使用 Colab (最简单)\n直接访问官方提供的集成 Notebook，无需本地配置即可加载权重并在 GenomicBenchmarks 数据集上进行微调或推理。\n- [HyenaDNA Colab Notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL?usp=sharing)\n\n#### 方法 B：本地加载权重 (PyTorch Lightning)\n1. 从 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FLongSafari) 下载对应的 `.ckpt` 文件。\n2. 确保本地配置文件中的模型参数与预训练模型一致。\n3. 运行训练\u002F微调命令并指定路径：\n   ```bash\n   python -m train train.pretrained_model_path=\u002Fpath\u002Fto\u002Fyour\u002Fmodel.ckpt experiment=your_config\n   ```\n\n#### 方法 C：独立脚本推理\n如果您希望将模型集成到自己的代码库中，可参考仓库内的独立脚本：\n- `huggingface.py`: 演示如何加载 HF 权重并进行推理。\n- `standalone_hyenadna.py`: 包含构建 HyenaDNA 模型所需的所有类定义。\n\n### 预训练自定义数据\n若需在自定义数据上预训练，只需准备 `.fasta` 文件。参考 `src\u002Fdataloaders\u002Fdatasets\u002Fspecies_dataset.py` 修改数据加载器，并调整 `configs\u002Fexperiments\u002F` 下的配置文件即可启动训练。","某生物科技公司研发团队正致力于从人类全基因组数据中挖掘与罕见遗传病相关的长距离调控元件，以加速新药靶点的发现。\n\n### 没有 hyena-dna 时\n- **上下文长度受限**：传统 Transformer 模型受限于显存和计算复杂度，只能处理几千个碱基的片段，被迫将长达百万碱基的基因序列强行切割，导致关键的远端调控关系（如增强子与启动子的相互作用）被切断而无法识别。\n- **分辨率粗糙**：为了适应模型输入，往往需要对基因序列进行降采样或合并处理，丢失了单核苷酸级别的突变细节，难以精准定位致病变异。\n- **特征工程繁琐**：研究人员需依赖人工设计的生物学规则提取特征，不仅耗时耗力，且容易遗漏数据中隐含的非线性模式，模型泛化能力差。\n- **训练成本高昂**：从头预训练一个能理解长序列的基因组模型需要巨大的算力集群和数周时间，中小团队难以承担。\n\n### 使用 hyena-dna 后\n- **百万级序列直通**：hyena-dna 支持高达 100 万 token 的上下文长度，可直接输入完整的基因座甚至染色体片段，完整保留长距离依赖关系，精准捕捉远端调控机制。\n- **单碱基高精度**：模型在单核苷酸分辨率上进行预训练，无需降采样，能敏锐识别单个碱基的插入、缺失或替换对整体基因功能的影响。\n- **端到端自动学习**：利用在人类参考基因组（hg38）上预训练的权重，hyena-dna 能自动提取深层语义特征，只需少量下游任务数据微调即可达到高准确率，大幅减少人工干预。\n- **推理效率飞跃**：借助 Hyena 算子的高效性，即使在消费级显卡或 Colab 免费层级上，也能对长达 45 万碱基的序列进行快速推理和嵌入生成，显著降低研发门槛。\n\nhyena-dna 通过突破长序列建模瓶颈，让研究人员能以单碱基精度“读懂”百万级基因组全景，将遗传病机理研究从碎片化拼图升级为全局性洞察。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHazyResearch_hyena-dna_0e679fd5.png","HazyResearch","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FHazyResearch_5f558f19.png","We are a CS research group led by Prof. 
Chris Ré.",null,"contact.hazy@gmail.com","https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fchrismre\u002F","https:\u002F\u002Fgithub.com\u002FHazyResearch",[84,88,92,96,100,104,107,111,115,119],{"name":85,"color":86,"percentage":87},"Assembly","#6E4C13",87,{"name":89,"color":90,"percentage":91},"Pawn","#dbb284",4.9,{"name":93,"color":94,"percentage":95},"HTML","#e34c26",2.3,{"name":97,"color":98,"percentage":99},"C++","#f34b7d",1.8,{"name":101,"color":102,"percentage":103},"Python","#3572A5",1.7,{"name":105,"color":106,"percentage":54},"POV-Ray SDL","#6bac65",{"name":108,"color":109,"percentage":110},"Cuda","#3A4E3A",0.7,{"name":112,"color":113,"percentage":114},"PHP","#4F5D95",0.3,{"name":116,"color":117,"percentage":118},"JavaScript","#f1e05a",0.1,{"name":120,"color":121,"percentage":118},"CMake","#DA3434",777,105,"2026-04-12T13:20:36","Apache-2.0",4,"Linux","必需 NVIDIA GPU。官方开发环境基于 CUDA 11.7。运行大型模型（如处理 100 万 token 序列）推荐使用 A100；Colab 免费层可处理约 45 万核苷酸序列。需安装 Flash Attention 及其融合层。","未说明",{"notes":131,"python":132,"dependencies":133},"1. 项目主要基于 PyTorch Lightning 和 Hydra 构建，建议熟悉这两个工具。\n2. 必须正确安装 Flash Attention 子模块（需执行 git submodule 更新及特定 pip 安装命令），否则无法运行。\n3. 提供 Docker 镜像以简化环境配置，包含所有依赖及复现论文结果所需的数据集和参数。\n4. 预训练人类参考基因组 (hg38) 需手动下载约 2GB 的 .fasta 和 .bed 文件。\n5. 不同模型尺寸对显存要求差异巨大，从 tiny-1k 到 large-1m 需根据任务长度选择合适模型。","3.8+",[134,135,136,137,138,139],"torch==1.13.0","torchvision==0.14.0","torchaudio==0.13.0","pytorch-lightning","hydra-core","flash-attn",[15],[142,143,144],"foundation-models","genomics","language-models","2026-03-27T02:49:30.150509","2026-04-14T03:13:25.511205",[148,153,158,163,168],{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},32372,"如何在 Singularity 容器中运行 Hyena-DNA，或者遇到 'git-lfs' 缺失错误怎么办？","官方不支持 Singularity，建议直接使用 Docker。如果必须使用 Singularity，可以通过以下命令拉取并执行镜像：\n\n```\napptainer pull docker:\u002F\u002Fhyenadna\u002Fhyena-dna-nt7:latest\napptainer exec --nv docker:\u002F\u002Fhyenadna\u002Fhyena-dna-nt7:latest \u002Fbin\u002Fbash\n```\n\n如果遇到 'git: lfs is not a git command' 错误，通常是因为克隆的代码文件夹位于挂载驱动器（如 \u002Fmnt\u002F）上，容器无法访问。请将代码移动到用户主目录（home folder）下再运行。此外，请确保将 Docker 命令转换为等效的 Singularity 命令。","https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fissues\u002F14",{"id":154,"question_zh":155,"answer_zh":156,"source_url":157},32373,"如何使用 Hyena-DNA 进行语言模型推理（Language Model Inference）或变体效应预测？","虽然 Huggingface 示例主要展示嵌入（embeddings）功能，但项目已支持使用语言模型头进行推理。您可以利用 log-likelihood 进行变体效应预测。相关功能已在代码库中修复并实现，可以直接调用相应的推理接口。如果需要类似 `AutoModelForCausalLM` 的 API，请参考最新的 Huggingface 集成代码。","https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fissues\u002F20",{"id":159,"question_zh":160,"answer_zh":161,"source_url":162},32374,"在 Nucleotide Transformer 下游任务训练时，为什么没有返回 MCC（马修斯相关系数）结果？","这是一个已知问题，通常由配置或代码版本引起。维护者已确认并修复了该 Bug。请确保您使用的是最新版本的代码库。如果问题仍然存在，请检查训练脚本中的评估指标设置，确认是否正确启用了 MCC 计算回调。","https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fissues\u002F31",{"id":164,"question_zh":165,"answer_zh":166,"source_url":167},32375,"在人类基因组预训练时遇到 'RuntimeError: Trying to resize storage that is not resizable' 错误如何解决？","该错误通常与 GPU 架构特定的安装有关。如果您在一台机器（如 P100）上安装并在另一台不同架构的机器（如 A100）上运行，可能会出现此问题。\n\n解决方案：\n1. 不要混用不同架构机器上的环境。\n2. 建议在目标机器（如 A100）上进行全新的干净安装（clean install）。\n3. 
或者直接使用 README 中提供的 Docker 镜像，以避免环境依赖问题。","https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fissues\u002F34",{"id":169,"question_zh":170,"answer_zh":171,"source_url":172},32376,"在 A100-80GB GPU 上使用 Huggingface 预训练模型处理超长序列时出现 'CUDA out of memory' 或 'KeyError' 怎么办？","这是因为从 Huggingface 下载的预训练模型配置与本地修改的配置不匹配。预训练模型通常是在 `checkpoint_mixer` 和 `checkpoint_mlp` 为 False 的情况下训练的。\n\n解决方案：\n1. 不要随意修改下载模型文件夹中的 `config.json` 文件来强行开启检查点（checkpoint），这会导致权重键名不匹配（KeyError）。\n2. 保持配置文件与预训练权重一致（即相关 checkpoint 选项设为 False）。\n3. 如果显存不足，请尝试减小 `max_length` 或 `batch_size`，而不是通过修改配置强制开启不兼容的检查点机制。","https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fhyena-dna\u002Fissues\u002F30",[]]
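作为对上文《在你自己的数据上进行预训练》一节的补充，下面给出一个极简的示意性数据集草图：随机选择一条染色体并截取定长片段，用于自回归式预训练。它并非仓库中 `src/dataloaders/datasets/species_dataset.py` 的真实实现；其中 `pyfaidx` 的读取方式、简化的字符词表以及 `FastaPretrainDataset`、`fasta_path` 等名称都只是演示用的假设，实际请以仓库自带的 CharacterTokenizer 和数据加载器为准。

```python
# 示意草图：从 .fasta 中随机采样定长片段，供自回归预训练使用（假设已安装 pyfaidx）。
import random
import torch
from torch.utils.data import Dataset
from pyfaidx import Fasta

class FastaPretrainDataset(Dataset):
    """每次 __getitem__ 随机选一条染色体，再随机截取 max_length 长度的片段。"""

    # 简化词表：未知字符一律映射为 N 的 id；真实仓库使用自己的 CharacterTokenizer
    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

    def __init__(self, fasta_path: str, max_length: int = 1024, samples_per_epoch: int = 10_000):
        self.fasta = Fasta(fasta_path, sequence_always_upper=True)
        # 只保留足够长的染色体，保证能截取 max_length + 1 个碱基
        self.chroms = [c for c in self.fasta.keys() if len(self.fasta[c]) > max_length]
        self.max_length = max_length
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        chrom = random.choice(self.chroms)
        start = random.randint(0, len(self.fasta[chrom]) - self.max_length - 1)
        # 多取 1 个碱基，用于构造右移一位的“下一碱基”标签
        seq = str(self.fasta[chrom][start : start + self.max_length + 1])
        ids = torch.tensor([self.VOCAB.get(ch, self.VOCAB["N"]) for ch in seq], dtype=torch.long)
        return ids[:-1], ids[1:]  # (输入序列, 下一碱基标签)
```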
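上文在自定义数据预训练的注意事项中提到，序列长度差异较大时应在损失函数中忽略填充标记（填充 id 默认为 4，可通过 `+task.loss.ignore_index=4` 指定）。下面用一个最小的 PyTorch 片段示意该标志在损失计算层面的含义；张量形状与数值都是随手构造的演示假设，并非仓库的真实训练代码。

```python
# 示意：变长序列右侧填充后，用 ignore_index 让交叉熵损失跳过填充位置。
import torch
import torch.nn as nn

PAD_ID = 4                                   # 与文中 +task.loss.ignore_index=4 对应
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)

logits = torch.randn(2, 8, 16)               # (batch=2, seq_len=8, vocab=16)，随机数仅作演示
labels = torch.randint(0, 16, (2, 8))
labels[0, 5:] = PAD_ID                       # 第一条序列较短，尾部为填充
loss = loss_fn(logits.reshape(-1, 16), labels.reshape(-1))
print(loss.item())                           # 填充位置不参与损失平均，避免训练不稳定
```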
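《序列长度预热回调》一节强调 `train.global_batch_size` 会通过梯度累积保证各阶段的有效批次一致。下面是一个示意性的小脚本，按上文示例 YAML 中的阶段参数推算每个阶段的累积步数；其中 `compute_accum` 是为说明而虚构的辅助函数，并非该回调的真实接口，单卡设定也只是假设。

```python
# 示意：由 global_batch_size 与各阶段单卡批次推算梯度累积步数（阶段参数取自上文示例 YAML）。
GLOBAL_BATCH_SIZE = 256

stages = [
    {"epochs": 2, "seq_len": 1024,   "batch_size": 256},
    {"epochs": 2, "seq_len": 2048,   "batch_size": 128},
    {"epochs": 2, "seq_len": 4096,   "batch_size": 64},
    {"epochs": 2, "seq_len": 8192,   "batch_size": 32},
    {"epochs": 4, "seq_len": 16_384, "batch_size": 16},
    {"epochs": 4, "seq_len": 32_768, "batch_size": 8},
]

def compute_accum(stage, global_bs=GLOBAL_BATCH_SIZE, devices=1):
    # accumulate_grad_batches = 全局批次 / (单卡批次 × 卡数)，使各阶段有效批次保持一致
    return global_bs // (stage["batch_size"] * devices)

for s in stages:
    print(f"seq_len={s['seq_len']:>6}  batch={s['batch_size']:>3}  accum={compute_accum(s)}")
```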
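快速上手指南中的“方法 C：独立脚本推理”以及《从预训练模型获取 logits》一节都涉及加载预训练权重做推理。下面给出一个基于 Hugging Face `transformers` 的示意性加载草图：检查点名 `LongSafari/hyenadna-tiny-1k-seqlen-hf`、`trust_remote_code` 加载路径以及输出中是否带 `logits` 字段均为假设，请以 LongSafari 的模型卡为准；仓库自带的 `huggingface.py` 与 `evals/hg38_inference.py` 走的是手动下载权重、调用自定义类的另一条路径。

```python
# 示意：假设存在支持 trust_remote_code 的 -hf 检查点，用 transformers 加载并取 logits。
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "LongSafari/hyenadna-tiny-1k-seqlen-hf"   # 假设的检查点名，请在 Hugging Face 页面核实
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model.eval()

seq = "ACTGACTGACTGACTG"                         # 单核苷酸分辨率：一个碱基对应一个 token
input_ids = tok(seq, return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(input_ids)
print(out.logits.shape)                          # 形如 (1, token 数, 词表大小)，可据此计算下一碱基的对数似然
```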