[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-alxndrTL--mamba.py":3,"tool-alxndrTL--mamba.py":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":23,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":104,"github_topics":79,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":105,"updated_at":106,"faqs":107,"releases":136},2029,"alxndrTL\u002Fmamba.py","mamba.py","A simple and efficient Mamba implementation in pure PyTorch and MLX.","mamba.py 是一个用纯 PyTorch 和 MLX 实现的 Mamba 模型，代码简洁易读，适合学习和快速实验。它通过高效的并行扫描算法，显著提升了 Mamba 在训练时的速度，相比传统序列化实现可提速 10%–20%，尤其在自然语言处理任务中表现突出。项目还支持 Jamba（Mamba 与注意力机制混合）、Vision Mamba 和 Mamba-2 等扩展架构，满足不同模态的研究需求。特别值得一提的是，它内置了 muP 超参数调优方法，让小模型的调参结果能直接迁移到大模型，大幅降低调参成本。MLX 版本让 macOS 用户也能本地训练和推理，无需依赖 CUDA。无论是想深入理解 Mamba 原理的开发者，还是希望快速验证新想法的研究人员，都能从中受益。项目已集成进 Hugging Face Transformers，并可通过 pip 直接安装，开箱即用。","# mamba.py 🐍 : a simple and efficient Mamba implementation\nA straightfoward implementation of [Mamba](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00752) in PyTorch with a simple parallel scan implementation, offering an major speedup over a sequential implementation, as the parallel scan allows the parallelization over the time dimension.\nIt combines the ease of read with good performances when training. Few other functionalities are implemented, like [Jamba](https:\u002F\u002Fwww.ai21.com\u002Fblog\u002Fannouncing-jamba), [Vision Mamba](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09417) as well as [muP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.03466).\n\n## Updates\n- \u003Cb>03\u002F08\u002F2024\u003C\u002Fb> : Added a muP implementation for Mamba and Mamba2. This allows to sweep for optimal hyperparameters on a small model and directly transfer them to a large model. See [this PR](https:\u002F\u002Fgithub.com\u002FalxndrTL\u002Fmamba.py\u002Fpull\u002F50)\n- \u003Cb>23\u002F07\u002F2024\u003C\u002Fb> : `mamba.py` is now part of the transformers 🤗 library. 
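For intuition about that RNN formulation: once the input-dependent matrices are discretized, each token costs a constant-time state update instead of a full re-scan. Below is a minimal, hypothetical sketch of one such step; the names (`dA`, `dB`, `C`, `D`) and shapes are illustrative assumptions, not the repo's actual `step` API.

```python
import torch

def ssm_step(h, x, dA, dB, C, D):
    """One recurrent step of a discretized state-space model (illustrative only).

    h  : (B, ED, N)  previous hidden state
    x  : (B, ED)     current input channel values
    dA : (B, ED, N)  discretized state matrix (input-dependent in Mamba)
    dB : (B, ED, N)  discretized input matrix
    C  : (B, N)      output projection for this step
    D  : (ED,)       skip connection
    """
    h = dA * h + dB * x.unsqueeze(-1)       # h_t = A_bar * h_{t-1} + B_bar * x_t
    y = (h @ C.unsqueeze(-1)).squeeze(-1)   # y_t = C h_t
    return h, y + D * x                     # plus the skip term D * x_t

# O(1) work per generated token: only h is carried between steps.
B, ED, N = 1, 4, 16
h = torch.zeros(B, ED, N)
params = (torch.rand(B, ED, N) * 0.9, torch.randn(B, ED, N), torch.randn(B, N), torch.randn(ED))
for x_t in torch.randn(8, B, ED):           # 8 tokens
    h, y_t = ssm_step(h, x_t, *params)
```

Training uses the parallel scan over the whole sequence instead; the step form is what makes auto-regressive generation cheap.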
___
## Overview

![speed comparison](https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_da7db51e5291.png)

This graph shows the training time (forward and backward pass) of a single Mamba layer (`d_model=16, d_state=16`) using 3 different methods: `CUDA`, which is the official [Mamba implementation](https://github.com/state-spaces/mamba); `mamba.py`, which is this repo; and `sequential`, which is a sequential (RNN-like) implementation of the selective scan.

This repo contains simple and readable code implementing the [Mamba](https://arxiv.org/abs/2312.00752) architecture in pure PyTorch as well as MLX. You can also play around with the Jamba model, which combines Mamba and attention layers. The primary goal of this repo is educational.

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_a671d6764e0a.png" alt="a python and a mamba" width="300" height="300"/>
</p>

<u>The repo is organized as follows:</u>
- `📁 mambapy` : the PyTorch implementation of Mamba
    - `pscan.py` : a PyTorch implementation of Blelloch's parallel scan (a toy sketch of the idea appears after this list)
    - `mamba.py` : the Mamba model, as described in the [paper](https://arxiv.org/abs/2312.00752). It is numerically equivalent to the official implementation (initialization, forward and backward pass).
    - `mamba2.py` (beta) : the Mamba-2 model, as described in the [paper](https://arxiv.org/abs/2405.21060). It requires CUDA as it is only adapted from the original version (for now).
    - `lm.py` : encapsulates a Mamba(-2) model in order to use it as a language model
    - `jamba.py` : an implementation of the [Jamba](https://www.ai21.com/blog/announcing-jamba) model in PyTorch
    - `vim.py` : an implementation of [Vision Mamba](https://arxiv.org/abs/2401.09417)
    - `📁 onnx` : export a trained Mamba model to ONNX for inference
- `📁 mlx` : basically the same code as above, but in MLX
- `📁 docs` : a folder containing annotated explanations about the code, focusing on the parallel scan for now
- `📁 examples` : two examples of how to use the Mamba model in PyTorch, as well as a training file

[muP](https://arxiv.org/abs/2203.03466) is implemented and compatible with both Mamba models (see below for more details).
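To see why a scan over the time dimension parallelizes at all, note that the recurrence `h_t = a_t * h_{t-1} + b_t` is associative over `(a, b)` pairs: `(a1, b1) ∘ (a2, b2) = (a1*a2, a2*b1 + b2)`. Here is a minimal sketch of a log-step scan built on that operator. It uses the simpler Hillis-Steele variant, not the work-efficient Blelloch up-sweep/down-sweep that `pscan.py` actually implements, and it is not taken from the repo's code.

```python
import torch

def parallel_linear_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_0 = b_0) for all t in O(log L) steps.

    a, b : (B, L) coefficient tensors. Each pass combines every position with
    the prefix ending `offset` steps earlier, doubling `offset` each time.
    """
    L = a.size(1)
    A, Bc = a.clone(), b.clone()
    offset = 1
    while offset < L:
        # the prefix ending `offset` positions earlier (identity where none exists)
        A_shift = torch.ones_like(A)
        B_shift = torch.zeros_like(Bc)
        A_shift[:, offset:] = A[:, :-offset]
        B_shift[:, offset:] = Bc[:, :-offset]
        # (A_shift, B_shift) ∘ (A, Bc): old A multiplies the earlier prefix
        A, Bc = A_shift * A, A * B_shift + Bc
        offset *= 2
    return Bc  # Bc[:, t] now holds h_t

# sanity check against the sequential recurrence
a, b = torch.rand(1, 8) * 0.9, torch.randn(1, 8)
h, ref = b[:, 0], [b[:, 0]]
for t in range(1, 8):
    h = a[:, t] * h + b[:, t]
    ref.append(h)
assert torch.allclose(parallel_linear_scan(a, b), torch.stack(ref, dim=1), atol=1e-5)
```

In Mamba, the same trick runs over the discretized `(A_bar, B_bar * x)` pairs per channel, which is what removes the sequential bottleneck during training.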
## Usage

You can either download this repo or install it with `pip install mambapy`.

The most basic usage is the `Mamba` object ([mamba.py](mamba.py)), which implements a simple Mamba model given a configuration.
No embedding, no head: input is `(B, L, D)` and output is `(B, L, D)` as well.

```python
import torch
from mambapy.mamba import Mamba, MambaConfig

config = MambaConfig(d_model=16, n_layers=2)
model = Mamba(config)

B, L, D = 2, 64, 16
x = torch.randn(B, L, D)
y = model(x)

assert y.shape == x.shape
```

You can also use Mamba-2 by importing the `Mamba2Config` and `Mamba2` objects from `mamba2.py` (see the sketch at the end of this section).

The `LM` class ([lm.py](lm.py)) builds on the `Mamba` or `Mamba2` objects and offers a classic API for language models. It can be used as follows:

```python
import torch
from mambapy.lm import LM, MambaConfig

config = MambaConfig(d_model=16, n_layers=4) # core model
model = LM(config, vocab_size=32000) # encapsulate it in a LM

x = torch.randint(high=32000, size=(16, 64))
logits = model(x) # (B, L, vocab_size)
```

It simply wraps a `Mamba(-2)` object with an embedding layer, a final normalization, and a language modeling head.

You can use it off the shelf with a pretrained Mamba model:

```python
from mambapy.lm import from_pretrained
from transformers import AutoTokenizer

model = from_pretrained('state-spaces/mamba-130m').to("cuda")
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

output = model.generate(tokenizer, "Mamba is a type of")
```

This is the structure of the `mamba.py` modules:

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_755e1c3dcc00.jpg" width="737" height="429" alt="mamba structure"/>
</p>
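As a rough illustration of the Mamba-2 path mentioned above: the sketch assumes `Mamba2Config` mirrors the `(d_model, n_layers)` arguments of `MambaConfig`, which this README does not guarantee; check `mamba2.py` before relying on it.

```python
# Hypothetical sketch: using the beta Mamba-2 objects. Constructor arguments
# are ASSUMED to mirror MambaConfig. Note that mamba2.py requires CUDA.
import torch
from mambapy.mamba2 import Mamba2, Mamba2Config

config = Mamba2Config(d_model=16, n_layers=2)
model = Mamba2(config).to("cuda")

x = torch.randn(2, 64, 16, device="cuda")
y = model(x)  # same (B, L, D) contract as the Mamba block above
```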
## Jamba
You can also train and run inference on Jamba models. Take a look at the `jamba.py` file, which constructs a `Jamba` object that interleaves Mamba layers (from `mamba.py`) with attention layers.

This is the structure of the modules found in `jamba.py`:

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_51a6f70525ed.jpg" width="737" height="429" alt="jamba structure"/>
</p>

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_226b3611098f.jpg" width="602" height="343" alt="jamba structure"/>
</p>

The API is the same as for the `Mamba` and `LM` models.
You can load a pretrained Jamba model like so:

```python
from mambapy.jamba_lm import from_pretrained
from transformers import AutoTokenizer

model = from_pretrained('TechxGenus/Mini-Jamba').to("cuda")
tokenizer = AutoTokenizer.from_pretrained('TechxGenus/Mini-Jamba')

output = model.generate(tokenizer, "def min(arr):")
```
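The interleaving idea itself is simple. Here is a toy, self-contained sketch of alternating Mamba and attention blocks; it is NOT the repo's `jamba.py` (which follows the published Jamba layout), just an illustration of the pattern, with causal masking omitted for brevity.

```python
# Toy sketch of Mamba/attention interleaving, in the spirit of jamba.py.
import torch
import torch.nn as nn
from mambapy.mamba import Mamba, MambaConfig

class ToyHybrid(nn.Module):
    def __init__(self, d_model=64, n_pairs=2, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(n_pairs):
            # one single-layer Mamba block, then one attention block
            self.layers.append(Mamba(MambaConfig(d_model=d_model, n_layers=1)))
            self.layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (B, L, D)
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                a, _ = layer(x, x, x, need_weights=False)  # no causal mask, for brevity
                x = x + a                                  # residual around attention
            else:
                x = x + layer(x)                           # residual around the Mamba block
        return self.norm(x)

y = ToyHybrid()(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```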
## `📁 examples`
There are two basic examples available (some may be outdated):
- `example_llm.ipynb` : load a Mamba model with pretrained weights (from 130M to 2.8B, from Hugging Face)
- `example_e2e_training.ipynb` : an end-to-end training example where a Mamba model is employed as a world model for a simple 3x3 grid game (training is not completed; the model should be larger)

If you want a full training example (like in llama2.c), you can check the [othello_mamba repo](https://github.com/alxndrTL/othello_mamba) I've done. With that repo, you can train a Mamba or a Jamba from scratch, use `bfloat16`, easily swap it for a Transformer, bring your own data, and so on.

## muP
[muP](https://arxiv.org/abs/2203.03466) is a technique that makes it possible to transfer hyperparameters (like the learning rate) from small to very large models. For example, it is [possible](https://arxiv.org/abs/2404.05728) to transfer (i.e., reuse) the learning rate from a 2M-parameter model to a 10B-parameter model. This is extremely useful in practice when doing hyperparameter search: you do sweeps to find the best HPs on your small model, which is fast and inexpensive, and you automatically have the best-performing HPs for your large model.

muP makes this possible by initializing the weights of the model and scaling their learning rates in a specific way. This is the result of these modifications:

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_eb9b24c77a00.png" alt="activation scales with muP" width="1200" height="200"/>
</p>

Without muP, what we get is:

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_91f583351dca.png" alt="activation scales without muP" width="1200" height="200"/>
</p>

What we see here is the scale of the activations for various widths (`d_model`), from t=1 (initialization) to t=5 (5 steps of training). With SP (standard parametrization), the activations of the network vary greatly with width, whereas under muP they stay constant with width.
Intuitively, if the activations (the "signals") of the network behave the same no matter the width, one can easily imagine that the optimal HPs are thus independent of the width.

And this is what we observe in practice when we sweep for the optimal LR:

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_0bff28ef5472.png" alt="LR sweeps under SP and muP" width="900" height="340"/>
</p>

The optimal LR shifts with bigger models under SP, whereas with muP it stays roughly constant. The smaller model has only 172k params, while the bigger one has over 100M!

For more information about muP in general, you can take a look at the [paper](https://arxiv.org/abs/2203.03466); to see my derivation of the muP implementation for Mamba, and what it changes concretely in the code, please see the [associated PR](https://github.com/alxndrTL/mamba.py/pull/50).
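For a feel of the "scaling their learning rates in a specific way" part, here is a minimal sketch of the standard optimizer-side muP rule under Adam (hidden, matrix-like weights get their LR divided by the width multiplier). The actual per-parameter rules for Mamba are derived in the PR above; this generic two-group rule is only an illustration, not taken from that PR.

```python
# Hedged sketch of the optimizer half of muP (Adam-style): matrix-like
# parameters get lr / width_mult, vector-like parameters keep the base lr.
import torch

def mup_param_groups(model, base_lr, base_width, width):
    width_mult = width / base_width
    matrix_like, vector_like = [], []
    for p in model.parameters():
        (matrix_like if p.ndim >= 2 else vector_like).append(p)
    return [
        {"params": matrix_like, "lr": base_lr / width_mult},
        {"params": vector_like, "lr": base_lr},
    ]

# usage: sweep base_lr on a width-128 proxy model, then reuse it at width 2048
# optimizer = torch.optim.AdamW(mup_param_groups(model, 3e-3, 128, 2048))
```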
___
## Performances
This section provides a more comprehensive performance comparison between `mamba.py` and the official Mamba implementation.
Overall, as the first graph of this file shows, both have approximately the same asymptotic performance with respect to the sequence length. You can think of `mamba.py` as a regular Transformer implementation, while the official Mamba implementation is more like FlashAttention v1. Both have their own advantages.

That being said, do the two implementations have the same asymptotic performance with respect to the other parameters?

##### `d_model` asymptotic performances
<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_154df033d2e5.png" alt="training time vs d_model" width="800" height="413"/>
</p>

We can see that both implementations behave the same as we increase `d_model`. The gap between the two stays roughly the same (`mamba.py` is overall ~2x slower).

##### `d_state` asymptotic performances
<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_0e6e1555f23c.png" alt="training time vs d_state" width="800" height="413"/>
</p>

This graph is important. We see that here, the asymptotic performance is not the same as we increase `d_state`. As a reminder, `d_state`, or $N$ in the paper, is the state expansion factor: each channel of the input is expanded into $N$ channels of the hidden state.

<i>Note: the CUDA version doesn't seem to be impacted by the increase of `d_state`. This is because the benchmark was done with a batch size of 1: the GPU was not at full capacity, so the impact of an increased `d_state` isn't visible. The same happens if you have a small model or a small input length. See [this issue](https://github.com/alxndrTL/mamba.py/issues/8).</i>

Does it matter in practice? As of now, all the pretrained Mamba models (up to 2.8B parameters) use `d_state=16`, so this change of performance over `d_state` isn't important in that case. As `d_state` is not something that is supposed to grow (contrary to the sequence length or `d_model`), this isn't a catastrophic result, but it is something to keep in mind.

However, it is interesting to relate this observation to the claim made by Albert Gu and Tri Dao in the [Mamba paper](https://arxiv.org/abs/2312.00752): <i>The main idea is to leverage properties of modern accelerators (GPUs) to <b>materialize the state ℎ only in more efficient levels of the memory hierarchy.</b></i>
They also describe (Appendix D) the main data movements of their selective scan: working mainly in SRAM, they can reduce the memory reads/writes by a factor of $O(N)$. This explains the different asymptotic behaviors that we see here.

With `d_state=16` (as in `state-spaces/mamba-2.8b-slimpj`), the gap between the two is relatively small, but with `d_state=64` (currently not used in any model), the gap widens. (Note the OOM on the second graph.)

<p align="center">
    <img src="https://oss.gittoolsai.com/images/alxndrTL_mamba.py_readme_17318b92cd25.png" alt="training time and memory, d_state=16 vs d_state=64" width="1152" height="240"/>
</p>

All the previous graphs were computed with a batch size of 1, on an A100 80GB.
They measure both the forward and backward pass of a single Mamba block.

The previous analysis showed the importance of kernel fusion, which reduces the memory accesses by $O(N)$ and makes the whole process faster.

But the memory requirement should also be considered: the official Mamba implementation uses <b>recomputation</b> in the backward pass: rather than keeping in memory the activations computed during the forward pass, it simply recomputes them in the backward pass, when needed. This greatly reduces the memory requirement of the Mamba model during training. This is not implemented in this repo.

Hence, this repo implements one of the three techniques mentioned in the Mamba paper that form the so-called "hardware-aware selective scan": the parallel scan. We saw how kernel fusion impacts the speed, while recomputation impacts the memory requirements.
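Recomputation is not specific to Mamba: PyTorch exposes the same compute-for-memory trade generically through `torch.utils.checkpoint`. A minimal sketch of applying it to a stack of blocks follows; this is illustrative only, not something this repo does.

```python
# Generic activation recomputation ("gradient checkpointing") in PyTorch:
# forward activations inside each checkpointed block are discarded and
# recomputed during the backward pass, trading compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False is the recommended modern code path
            x = checkpoint(block, x, use_reentrant=False)
        return x

blocks = [nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)) for _ in range(4)]
model = CheckpointedStack(blocks)
out = model(torch.randn(8, 64, requires_grad=True))
out.sum().backward()  # activations inside each block are recomputed here
```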
___
## Sources and where to learn more
- the [Mamba paper](https://arxiv.org/abs/2312.00752) : describes the Mamba architecture as implemented in this repo, which makes it possible to model sequences in linear time.
- the [official Mamba implementation](https://github.com/state-spaces/mamba), which is written in PyTorch but uses a parallel scan written in CUDA. This is the fastest version.
- [a minimal PyTorch implementation of Mamba](https://github.com/johnma2006/mamba-minimal), which implements the scan operation as a sequential loop (its performance is a bit worse than the 'sequential' line in the first graph). That code closely follows [this file](https://github.com/state-spaces/mamba/blob/da2626b5a5f347a8e844ac5e96a2cbcde3c34abb/mamba_ssm/modules/mamba_simple.py) from the official Mamba implementation, but replaces the CUDA convolution with `torch.nn.Conv1d` and the selective scan written in CUDA with a sequential loop. The code of this repo follows the structure of these two files.
- [Prefix Sums and Their Applications](https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf), by Guy E. Blelloch (1993).
- [Parallelizing Linear Recurrent Neural Nets Over Sequence Length](https://arxiv.org/abs/1709.04057) : applies a parallel scan over the sequence in order to get rid of the sequential for-loop.
- x.com/francoisfleuret : original pscan implementation.

## TODOs
- finish docs
- Mamba 2
- clean `vim.py`
- following the performance update, update the perf graphs
- plot the training memory consumption of the three different Mamba implementations (official, naive, mamba.py)
- pscan implementation using [ThunderKittens](https://hazyresearch.stanford.edu/blog/2024-05-12-quick-tk)?
- ~~Jamba? inference and/or fine-tuning?~~
- ~~more tests with an increased `d_model` (add a Performances section)~~
- ~~a step function, used for (auto-regressive) inference~~
- ~~a training function, similar to [llama2.c](https://github.com/karpathy/llama2.c)~~

Perf-related:
- ~~unfold the for-loops in `pscan.py` to achieve better performance (see [François Fleuret's pscan](https://fleuret.org/cgi-bin/gitweb/gitweb.cgi?p=mygptrnn.git;a=blob;f=pscan.py;h=0bb0d145bf9c6c82115956c8ce1e6a063e56e747;hb=HEAD)) (although this sacrifices a bit of readability)~~
- ~~write a reverse parallel scan specifically for the backward pass (for now, we have to flip the array before and after the scan)~~
- reduce the memory usage somehow (at the cost of speed if needed)
- use `torch.compile()`. As far as I tested, it doesn't work for now. It seems it isn't happy with the custom PScan autograd function. Needs investigation. <b>(see [PR#1](https://github.com/alxndrTL/mamba.py/pull/1))</b>

## Citation

If you find this project useful in your research and wish to cite it, please use the following BibTeX entry:

```
@software{mambapy,
  author = {Alexandre Torres--Leguet},
  title = {mamba.py: A simple, hackable and efficient Mamba implementation in pure PyTorch and MLX.},
  url = {https://github.com/alxndrTL/mamba.py},
  version = {1.0},
  year = {2024},
}
```
---

# mamba.py Quickstart Guide

## Requirements

- **OS**: Linux / macOS / Windows (Linux recommended)
- **Python**: ≥ 3.8
- **PyTorch**: ≥ 2.0 (CUDA 11.8 or 12.1 recommended)
- **Recommended hardware**: an NVIDIA GPU (CUDA), or Apple Silicon (for the MLX version)

> Users in mainland China can speed up installation with the Tsinghua mirror:
> `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple mambapy`

## Installation

```bash
# Install from PyPI (recommended)
pip install mambapy

# Or install from source (if you want to modify the code)
git clone https://github.com/alxndrTL/mamba.py
cd mamba.py
pip install -e .
```

## Basic usage

### 1. The plain Mamba model

```python
import torch
from mambapy.mamba import Mamba, MambaConfig

# configure the model
config = MambaConfig(d_model=16, n_layers=2)
model = Mamba(config)

# input and output are both (B, L, D)
B, L, D = 2, 64, 16
x = torch.randn(B, L, D)
y = model(x)

assert y.shape == x.shape  # output shape matches the input
```

### 2. The language-model wrapper (recommended for real tasks)

```python
import torch
from mambapy.lm import LM, MambaConfig

config = MambaConfig(d_model=16, n_layers=4)
model = LM(config, vocab_size=32000)  # wrap the core model as a language model

x = torch.randint(high=32000, size=(16, 64))  # input token IDs
logits = model(x)  # output: (B, L, vocab_size)
```

### 3. Loading a pretrained model (recommended)

```python
from mambapy.lm import from_pretrained
from transformers import AutoTokenizer

# load the pretrained Mamba-130M
model = from_pretrained('state-spaces/mamba-130m').to("cuda")
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

# generate text
output = model.generate(tokenizer, "Mamba is a type of")
print(output)
```

> ✅ `Mamba2` and `Jamba` models are supported in the same way; just import the corresponding classes (such as `Mamba2` or `JambaLM`).
## Use case

A small AI startup is building a lightweight dialogue-generation model for customer service. The team has three engineers, develops and debugs locally on MacBook Pros, and has no budget for a cloud GPU cluster.

### Without mamba.py
- The team tried the official CUDA Mamba implementation, but it doesn't run on Macs; development and debugging depended on a remote server, and every code change meant uploading, queueing, and waiting.
- A hand-rolled sequential-scan Mamba in pure PyTorch took over 45 minutes per training iteration on a 2.8B-parameter model; hyperparameter tuning took days and the project fell badly behind schedule.
- New ideas, such as different sampling temperatures or small-batch inference, could not be validated locally, so latency optimization stalled.
- The official code was hard for the team to read; onboarding a new member to Mamba internals took more than a week.
- When the team tried to adopt the Jamba architecture, no open-source PyTorch hybrid implementation existed, and the plan was abandoned.

### With mamba.py
- The MLX version runs directly on M1/M2 chips: training, inference, and debugging happen locally, and the iteration cycle shrinks from hours to minutes.
- The parallel-scan optimization cut training time by more than 15%: a 2.8B-model iteration dropped from 45 to 38 minutes, and the hyperparameter-search cycle from 7 days to 3.
- The built-in step function supports custom sampling temperature and batched inference, so diversity control for dialogue generation shipped quickly, lifting user satisfaction by 22%.
- The code is clear and well commented; new members could read and contribute within 3 days.
- Using `jamba.py`, the team combined attention and Mamba blocks into its first hybrid dialogue model, with markedly better long-conversation coherence after launch.

mamba.py let a resource-constrained team implement and iterate on a state-of-the-art Mamba architecture without expensive compute, turning research capability into product advantage.

## FAQ

**Q: Why doesn't inference speed change noticeably when I increase `d_state`?**
A: Because the time spent creating tensors on the CPU (e.g., `torch.randn`) masks the GPU compute time. Create tensors directly on the GPU to avoid the transfer: replace `x = torch.randn(...).to('cuda')` with `x = torch.randn(..., device='cuda')`. You will then clearly see inference time grow linearly with `d_state`. (Source: https://github.com/alxndrTL/mamba.py/issues/8)

**Q: How do I benchmark Mamba's real inference speed without CPU-GPU asynchrony skewing the numbers?**
A: Avoid creating the input on the CPU and copying it over. Generate the data directly on the CUDA device with `torch.randn(batch, length, dim, device='cuda')`, and call `torch.cuda.synchronize()` so the GPU work is finished before you stop the timer. A complete timing sketch follows below. (Source: https://github.com/alxndrTL/mamba.py/issues/8)
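Putting the two answers above together, here is a minimal sketch of a correct GPU timing loop. The `d_state` knob is assumed from the README's discussion; check `MambaConfig` for the exact field name before setting it.

```python
# Hedged sketch of a correct GPU benchmark, following the two answers above:
# allocate inputs directly on the device and synchronize around the timed region.
import time
import torch
from mambapy.mamba import Mamba, MambaConfig

config = MambaConfig(d_model=64, n_layers=2)  # add d_state=... if the config exposes it
model = Mamba(config).to("cuda")

x = torch.randn(8, 512, 64, device="cuda")  # created on-device: no CPU->GPU copy in the loop

for _ in range(3):                # warmup so one-time setup costs aren't timed
    model(x)
torch.cuda.synchronize()          # wait for all queued GPU work before starting the clock

t0 = time.perf_counter()
for _ in range(10):
    y = model(x)
torch.cuda.synchronize()          # make sure the GPU actually finished before stopping the clock
print(f"{(time.perf_counter() - t0) / 10 * 1e3:.2f} ms / forward")
```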
**Q: What about the 'Got unsupported ScalarType BFloat16' error when running Mamba on MLX?**
A: MLX does not support PyTorch's BFloat16 type. Convert the weights to float32: in the `map_mambapy_torch_to_mlx` function, call `value = value.float()` before `.numpy()`, or cast the PyTorch model's parameters to float32 before loading. (Source: https://github.com/alxndrTL/mamba.py/issues/6)

**Q: MLX inference produces garbled or incomplete text. How do I fix it?**
A: The problem comes from byte-level decoding conflicts in `tokenizer.decode` when tokens are emitted one by one through a Python generator (`yield`). Remove the `yield` from the generate function and decode all generated token IDs in one go: collect the `token_ids`, then call `tokenizer.decode(token_ids, skip_special_tokens=True)`. (Source: https://github.com/alxndrTL/mamba.py/issues/6)

**Q: `deltaA` blows up during training and the loss becomes NaN. How do I fix it?**
A: This happens when `delta` has not gone through the softplus activation, so its values grow large and `exp(delta * A)` overflows. Apply softplus to `delta` before computing `deltaA`: `delta = torch.nn.functional.softplus(delta)`, which keeps `delta` positive and numerically stable. (Source: https://github.com/alxndrTL/mamba.py/issues/53)

**Q: How do I enable the CUDA-accelerated selective scan?**
A: Install the official `mamba_ssm` library first: `pip install mamba-ssm`. Once installed, the code automatically imports `selective_scan_fn` from `mamba_ssm.ops.selective_scan_interface`. If it still fails, check that your CUDA and PyTorch versions are compatible and that the NVIDIA driver and cuDNN are installed. (Source: https://github.com/alxndrTL/mamba.py/issues/37)

## Releases

- **v1.2.0** (2024-07-31): muP implementation.
- **v1.0.0** (2024-06-27): first package version of `mamba.py`; you can also install it with `pip install mambapy`.