[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-FasterDecoding--Medusa":3,"tool-FasterDecoding--Medusa":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",157379,2,"2026-04-15T23:32:42",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":75,"owner_url":76,"languages":77,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":10,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":106,"github_topics":107,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":110,"updated_at":111,"faqs":112,"releases":147},7894,"FasterDecoding\u002FMedusa","Medusa","Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads","Medusa 是一个旨在加速大语言模型（LLM）文本生成的轻量级框架。它通过在大模型基础上附加多个“解码头”，实现了对未来多个令牌（token）的并行预测，从而显著提升了推理速度。\n\n传统加速方法（如投机采样）通常依赖额外的草稿模型，不仅系统复杂，且在非贪婪采样场景下效率受限。Medusa 巧妙地解决了这些痛点：它无需引入新模型，仅在原模型上训练额外的解码头，保持了原有架构不变；训练过程参数高效，即使显存资源有限的用户也能轻松上手；同时，它放宽了对分布匹配的要求，使得在多样化生成任务中也能获得比传统贪婪解码更快的速度。实测显示，Medusa 能在多种主流模型上带来 2.2 至 3.6 倍的提速效果。\n\n该工具特别适合希望优化本地部署性能的开发者、追求高效推理的研究人员，以及需要快速迭代大模型应用的工程师。其独特亮点包括支持全模型训练的 Medusa-2 
方案，以及无需原始训练数据即可适配任意微调模型的自蒸馏技术。无论是构建高性能聊天机器人还是处理长文本生成任务，Medusa 都能以极简的集成方式，为用户提供流畅且高效的生成体验。","\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_d228bd697dca.png\" alt=\"Medusa\" width=\"100\" align=\"left\">\u003Cdiv align=\"center\">\u003Ch1>&nbsp;Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads\u003C\u002Fh1>\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n| \u003Ca href=\"https:\u002F\u002Fsites.google.com\u002Fview\u002F\nmedusa-llm\">\u003Cb>Blog\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10774\">\u003Cb>Report\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"ROADMAP.md\">\u003Cb>Roadmap\u003C\u002Fb>\u003C\u002Fa> |\n\u003C\u002Fp>\n\n---\n*News* 🔥\n- [2024\u002F1] Medusa technical report is now available on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10774). We've added multiple new features, including Medusa-2 recipe for full-model training, self-distillation for adding Medusa to any fine-tuned LLM, etc. 
The new results show a 2.2-3.6x speedup over the original model on a range of LLMs.\n\n---\n## Introduction\n\nMedusa is a simple framework that democratizes the acceleration techniques for LLM generation with multiple decoding heads.\n\n\u003Cdiv align=\"center\">\n  \u003Cpicture>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_7cadb94c2e53.gif\" width=\"80%\">\n  \u003C\u002Fpicture>\n  \u003Cbr>\n  \u003Cdiv align=\"center\" width=\"80%\">\n  \u003Cem>Medusa-1 on Vicuna-7b.\u003C\u002Fem>\n  \u003C\u002Fdiv>\n  \u003Cbr>\n\u003C\u002Fdiv>\n\n\nWe aim to tackle the three pain points of popular acceleration techniques like speculative decoding:\n\n- Requirement of a good draft model.\n- System complexity.\n- Inefficiency when using sampling-based generation.\n\n\n\u003Cdiv align=\"center\">\n  \u003Cpicture>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_6fcd73c79ebf.jpg\" width=\"60%\">\n  \u003C\u002Fpicture>\n  \u003Cbr>\n  \u003Cdiv align=\"left\" width=\"80%\">\n  \u003Cem>Medusa adds extra \"heads\" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training. During generation, these heads each produce multiple likely words for the corresponding position. These options are then combined and processed using a tree-based attention mechanism. Finally, a typical acceptance scheme is employed to pick the longest plausible prefix from the candidates for further decoding.\u003C\u002Fem>\n  \u003C\u002Fdiv>\n  \u003Cbr>\n\u003C\u002Fdiv>\n\nWe aim to solve the challenges associated with speculative decoding by implementing the following ideas:\n\n- Instead of introducing a new model, we train multiple decoding heads on the *same* model.\n- The training is parameter-efficient so that even the \"GPU-Poor\" can do it. 
And since there is no additional model, there is no need to adjust the distributed computing setup.\n- Relaxing the requirement of matching the distribution of the original model makes the non-greedy generation even faster than greedy decoding.\n\nIn the initial release, our primary focus is on optimizing Medusa for a batch size of 1—a setting commonly utilized for local model hosting. In this configuration, Medusa delivers approximately a 2x speed increase across a range of Vicuna models. We are actively working to extend Medusa's capabilities by integrating it into additional inference frameworks, with the aim of achieving even greater performance gains and extending Medusa to broader settings.\n\n\u003Cp align=\"center\">\n  \u003Cpicture>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_e776a21034ef.jpg\" width=\"45%\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\nIn the updated version, we add support for full-model training, called Medusa-2 (compared to Medusa-1, which only trains the new heads), which requires a special recipe that adds the speculative prediction ability while keeping the original model's performance.\n\nWe also add support for self-distillation, which allows us to add Medusa to any fine-tuned LLM without requiring the availability of the original training data.\n\n## Contents\n- [Introduction](#introduction)\n- [Contents](#contents)\n- [Installation](#installation)\n  - [Method 1: With pip (may not be the latest version)](#method-1-with-pip-may-not-be-the-latest-version)\n  - [Method 2: From the source (recommended)](#method-2-from-the-source-recommended)\n  - [Model Weights](#model-weights)\n  - [Inference](#inference)\n  - [Training](#training)\n  - [Training (legacy)](#training-legacy)\n  - [Push to Hugging Face Hub](#push-to-hugging-face-hub)\n- [Citation](#citation)\n- [Codebase Guide](#codebase-guide)\n- [Community Adoption](#community-adoption)\n- [Contributing](#contributing)\n- 
[Acknowledgements](#acknowledgements)\n\n## Installation\n### Method 1: With pip (may not be the latest version)\n```bash\npip install medusa-llm\n```\n### Method 2: From the source (recommended)\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa.git\ncd Medusa\npip install -e .\n```\n\n### Model Weights\n#### Medusa-1\n| Size | Chat Command                                  | Hugging Face Repo                                                     |\n| ---- | --------------------------------------------- | --------------------------------------------------------------------- |\n| 7B   | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-7b-v1.3` | [FasterDecoding\u002Fmedusa-vicuna-7b-v1.3](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-vicuna-7b-v1.3)   |\n| 13B  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-13b-v1.3` | [FasterDecoding\u002Fmedusa-vicuna-13b-v1.3](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-vicuna-13b-v1.3) |\n| 33B  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-33b-v1.3` | [FasterDecoding\u002Fmedusa-vicuna-33b-v1.3](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-vicuna-33b-v1.3) |\n\n#### Medusa-2\n| Size | Chat Command                                  | Hugging Face Repo                                                     |\n| ---- | --------------------------------------------- | --------------------------------------------------------------------- |\n| Zephyr-7B-Beta   | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta` | [FasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta)   |\n| Vicuna-7B-v1.5 | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5` | 
[FasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5) |\n| Vicuna-13B-v1.5  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5` | [FasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5) |\n| Vicuna-33B-v1.5  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5` | [FasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5) |\n\n\n### Inference\nWe currently support single-GPU inference with a batch size of 1, which is the most common setup for local model hosting. We are actively working to extend Medusa's capabilities by integrating it into other inference frameworks; please don't hesitate to reach out if you are interested in contributing to this effort.\n\nYou can use the following command to launch a CLI interface:\n```bash\nCUDA_VISIBLE_DEVICES=0 python -m medusa.inference.cli --model [path of medusa model]\n```\nYou can also pass `--load-in-8bit` or `--load-in-4bit` to load the base model in quantized format. If you download the base model elsewhere, you may override base model name or path with `--base-model  [path of base model]`.\n\n### Training\nIn the updated version, we use the amazing [axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl) library to manage the training process. Please refer to our [fork](https:\u002F\u002Fgithub.com\u002Fctlllll\u002Faxolotl) for the training code. The major code modifications are in [`src\u002Faxolotl\u002Futils\u002Fmodels.py`](https:\u002F\u002Fgithub.com\u002Fctlllll\u002Faxolotl\u002Fblob\u002Fmain\u002Fsrc\u002Faxolotl\u002Futils\u002Fmodels.py). 
The training configs can be found in [`examples\u002Fmedusa`](https:\u002F\u002Fgithub.com\u002Fctlllll\u002Faxolotl\u002Ftree\u002Fmain\u002Fexamples\u002Fmedusa). A typical training command is as follows:\n```bash\naccelerate launch -m axolotl.cli.train examples\u002Fmedusa\u002Fyour_config.yml\n```\n\nThe data preparation code for self-distillation can be found in the [`data_generation` folder](data_generation) of the current repo. For other datasets, you can directly download the data from the corresponding Hugging Face dataset repo.\n\n### Training on various architectures\n*The following instructions are for the initial release of Medusa and provide a minimal example of how to train a Medusa-1 model. For the updated version, please refer to the previous section.*\n\nFor training, please install:\n```bash\npip install -e \".[train]\"\n```\n#### Prepare the data\nWe take a public version of the ShareGPT dataset, which is a subset of the Vicuna training data. For other models, you can use the corresponding training dataset.\n```bash\ngit clone https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAeala\u002FShareGPT_Vicuna_unfiltered\n```\nRemark: If you haven't installed `git-lfs`, please install it before cloning:\n```bash\ngit lfs install\n```\n\n#### Adapt the data to the model you want to enable Medusa on\n\nStart by launching an inference server of your choice to run the model you want to train on.\nLet's use [mistralai\u002FMistral-7B-Instruct-v0.2](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FMistral-7B-Instruct-v0.2) as an example.\n\nFor instance, you can use [text-generation-inference](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference), which you\ncan also use after you've trained the Medusa heads.\n\n```\nmodel=mistralai\u002FMistral-7B-Instruct-v0.2\nvolume=$PWD\u002Fdata # share a volume with the Docker container to avoid downloading weights every run\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:\u002Fdata 
ghcr.io\u002Fhuggingface\u002Ftext-generation-inference:latest --model-id $model --input-length 4000 --max-total-tokens 4096 --max-batch-prefill-tokens 4000\n```\nSome sequences in ShareGPT are relatively long, so make sure your server can run inference on them. If you do not have enough room, the script will simply skip those long conversations.\nThis shouldn't hurt downstream performance much, but more data is always better.\nYou can use various tradeoffs to [speed up inference](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Findex), but the defaults should be good enough in most cases.\n\n```\npython create_data.py --input-filename ShareGPT_Vicuna_unfiltered\u002FShareGPT_V4.3_unfiltered_cleaned_split.json --output-filename mistral.json\n```\n\n#### Train the model\nWe follow the training setup from [FastChat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat#fine-tuning), but with a much larger learning rate because we freeze the original model and only train the new heads. Here is the training command on 4 GPUs, shown with the Mistral-7B-Instruct-v0.2 example from above. Since we are only training the new heads, the training does not require a lot of memory, and only data parallelism is needed. You can modify the script to fit your own setup. For larger models, we use the same setup. 
You can also use `--load_in_8bit` or `--load_in_4bit` to load the base model in quantized format.\n```bash\ntorchrun --nproc_per_node=4 medusa\u002Ftrain\u002Ftrain_legacy.py --model_name_or_path mistralai\u002FMistral-7B-Instruct-v0.2 \\\n    --data_path mistral.json \\\n    --bf16 True \\\n    --output_dir test \\\n    --num_train_epochs 2 \\\n    --per_device_train_batch_size 8 \\\n    --per_device_eval_batch_size 8 \\\n    --gradient_accumulation_steps 4 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"no\" \\\n    --learning_rate 1e-3 \\\n    --weight_decay 0.0 \\\n    --warmup_ratio 0.1 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --tf32 True \\\n    --model_max_length 2048 \\\n    --lazy_preprocess True \\\n    --medusa_num_heads 3 \\\n    --medusa_num_layers 1 \\\n    --deepspeed deepspeed.json\n```\n### Push to Hugging Face Hub\nYou can use the following command to push your model to the Hugging Face Hub:\n```bash\npython -m medusa.hf_utils --folder [path of the model folder] --repo [name of the repo]\n```\n\n## Citation\n```bibtex\n@article{cai2024medusa,\n  title   = {Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads},\n  author  = {Tianle Cai and Yuhong Li and Zhengyang Geng and Hongwu Peng and Jason D. Lee and Deming Chen and Tri Dao},\n  year    = {2024},\n  journal = {arXiv preprint arXiv: 2401.10774}\n}\n```\n\n## Codebase Guide\n`medusa\u002Fmodel\u002Fmedusa_model.py` is the key file for Medusa. It contains the `MedusaModel` class, which is a wrapper of the original model and the new heads. This class also has an implementation of a streaming generation method. If you want to dive into the details of Medusa, this is the place to start.\n\nWe also provide some illustrative notebooks in `notebooks\u002F` to help you understand the codebase.\n\n## Community Adoption\nWe are super excited to see that Medusa has been adopted by many open-source projects. 
Here is an (incomplete) list:\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM\u002Ftree\u002Fmain\u002Fexamples\u002Fmedusa)\n- [TGI](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\u002Fblob\u002Fmain\u002Fserver\u002Ftext_generation_server\u002Futils\u002Fmedusa.py)\n- [RTP-LLM](https:\u002F\u002Fgithub.com\u002Falibaba\u002Frtp-llm\u002Fblob\u002Fmain\u002Fdocs\u002FSpeculativeDecoding-Tutroial.md#medusa-decoding)\n\nWe are grateful to the authors for their contributions to the community and sincerely hope that Medusa can help accelerate the development of LLMs. If you are using Medusa in your project, please let us know, and we will add your project to the list.\n\n## Contributing\nWe welcome community contributions to Medusa. If you have an idea for how to improve it, please open an issue to discuss it with us. When submitting a pull request, please ensure that your changes are well-tested. Please split each major change into a separate pull request. We also have a [Roadmap](ROADMAP.md) summarizing our future plans for Medusa. 
Don't hesitate to reach out if you are interested in contributing to any of the items on the roadmap.\n\n## Acknowledgements\nThis codebase is influenced by remarkable projects from the LLM community, including [FastChat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat), [TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002F), [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl).\n\nThis project is supported by [Together AI](https:\u002F\u002Ftogether.ai\u002F), [MyShell AI](https:\u002F\u002Fmyshell.ai\u002F), [Chai AI](https:\u002F\u002Fwww.chai-research.com\u002F).\n","\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_d228bd697dca.png\" alt=\"Medusa\" width=\"100\" align=\"left\">\u003Cdiv align=\"center\">\u003Ch1>&nbsp;Medusa：用于多解码头加速LLM生成的简单框架\u003C\u002Fh1>\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n| \u003Ca href=\"https:\u002F\u002Fsites.google.com\u002Fview\u002F\nmedusa-llm\">\u003Cb>博客\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10774\">\u003Cb>报告\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"ROADMAP.md\">\u003Cb>路线图\u003C\u002Fb>\u003C\u002Fa> |\n\u003C\u002Fp>\n\n---\n*新闻* 🔥\n- [2024\u002F1] Medusa技术报告现已在[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10774)上发布。我们新增了多项功能，包括用于全模型训练的Medusa-2配方、用于将Medusa添加到任何微调LLM中的自蒸馏等。新结果表明，在一系列LLM上，相比原始模型，速度提升了2.2至3.6倍。\n\n---\n## 简介\n\nMedusa是一个简单的框架，旨在普及使用多解码头加速LLM生成的技术。\n\n\u003Cdiv align=\"center\">\n  \u003Cpicture>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_7cadb94c2e53.gif\" width=\"80%\">\n  \u003C\u002Fpicture>\n  \u003Cbr>\n  \u003Cdiv align=\"center\" width=\"80%\">\n  \u003Cem>Medusa-1在Vicuna-7b上的演示。\u003C\u002Fem>\n  \u003C\u002Fdiv>\n  \u003Cbr>\n\u003C\u002Fdiv>\n\n\n我们致力于解决诸如推测性解码等流行加速技术的三大痛点：\n\n- 
需要一个优秀的草稿模型。\n- 系统复杂度高。\n- 在基于采样的生成过程中效率低下。\n\n\n\u003Cdiv align=\"center\">\n  \u003Cpicture>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_6fcd73c79ebf.jpg\" width=\"60%\">\n  \u003C\u002Fpicture>\n  \u003Cbr>\n  \u003Cdiv align=\"left\" width=\"80%\">\n  \u003Cem>Medusa为LLM添加额外的“头”，以同时预测多个未来标记。在用Medusa增强模型时，原始模型保持不变，仅在训练过程中对新增的头部进行微调。生成时，这些头部各自为相应位置生成多个可能的词。随后，这些候选选项会通过基于树状注意力机制进行组合和处理。最后，采用典型的接受方案从候选序列中挑选出最长的合理前缀，继续进行解码。\u003C\u002Fem>\n  \u003C\u002Fdiv>\n  \u003Cbr>\n\u003C\u002Fdiv>\n\n我们希望通过以下思路来解决推测性解码所面临的挑战：\n\n- 不引入新模型，而是在*同一*模型上训练多个解码头。\n- 训练过程参数高效，即使是资源有限的用户也能完成。由于无需额外模型，因此也无需调整分布式计算设置。\n- 放宽对与原始模型分布匹配的要求，使得非贪婪生成的速度甚至超过贪婪解码。\n\n在初始版本中，我们的主要关注点是优化适用于批大小为1的Medusa——这一设置常用于本地模型部署。在此配置下，Medusa在一系列Vicuna模型上可实现约2倍的速度提升。我们正积极努力将Medusa集成到更多推理框架中，以进一步提升性能，并将其扩展到更广泛的场景。\n\n\u003Cp align=\"center\">\n  \u003Cpicture>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_readme_e776a21034ef.jpg\" width=\"45%\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n在更新版本中，我们增加了对全模型训练的支持，称为Medusa-2（相对于仅训练新增头部的Medusa-1），这需要一种特殊的配方，能够在保留原始模型性能的同时，赋予其推测性预测能力。\n\n我们还新增了自蒸馏支持，使我们能够将Medusa添加到任何微调过的LLM中，而无需原始训练数据。\n\n## 目录\n- [简介](#introduction)\n- [目录](#contents)\n- [安装](#installation)\n  - [方法1：通过pip（可能不是最新版本）](#method-1-with-pip-may-not-be-the-latest-version)\n  - [方法2：从源代码安装（推荐）](#method-2-from-the-source-recommended)\n  - [模型权重](#model-weights)\n  - [推理](#inference)\n  - [训练](#training)\n  - [传统训练方式](#training-legacy)\n  - [推送到Hugging Face Hub](#push-to-hugging-face-hub)\n- [引用](#citation)\n- [代码库指南](#codebase-guide)\n- [社区采纳](#community-adoption)\n- [贡献](#contributing)\n- [致谢](#acknowledgements)\n\n## 安装\n### 方法1：通过pip（可能不是最新版本）\n```bash\npip install medusa-llm\n```\n### 方法2：从源代码安装（推荐）\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa.git\ncd Medusa\npip install -e .\n```\n\n### 模型权重\n#### Medusa-1\n| 规模 | 对话命令                                  | Hugging 
Face仓库                                                     |\n| ---- | --------------------------------------------- | --------------------------------------------------------------------- |\n| 7B   | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-7b-v1.3` | [FasterDecoding\u002Fmedusa-vicuna-7b-v1.3](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-vicuna-7b-v1.3)   |\n| 13B  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-13b-v1.3` | [FasterDecoding\u002Fmedusa-vicuna-13b-v1.3](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-vicuna-13b-v1.3) |\n| 33B  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-33b-v1.3` | [FasterDecoding\u002Fmedusa-vicuna-33b-v1.3](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-vicuna-33b-v1.3) |\n\n#### Medusa-2\n| 规模 | 对话命令                                  | Hugging Face仓库                                                     |\n| ---- | --------------------------------------------- | --------------------------------------------------------------------- |\n| Zephyr-7B-Beta   | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta` | [FasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta)   |\n| Vicuna-7B-v1.5 | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5` | [FasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5) |\n| Vicuna-13B-v1.5  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5` | [FasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5) |\n| Vicuna-33B-v1.5  | `python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5` | 
[FasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5](https:\u002F\u002Fhuggingface.co\u002FFasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5) |\n\n### 推理\n目前我们支持单 GPU 推理，批次大小为 1，这是本地模型托管中最常见的配置。我们正在积极努力将 Medusa 集成到其他推理框架中，以扩展其功能；如果您有意参与这一工作，请随时与我们联系。\n\n您可以使用以下命令启动 CLI 界面：\n```bash\nCUDA_VISIBLE_DEVICES=0 python -m medusa.inference.cli --model [Medusa 模型路径]\n```\n您还可以传递 `--load-in-8bit` 或 `--load-in-4bit` 参数，以量化格式加载基础模型。如果您在其他地方下载了基础模型，可以使用 `--base-model [基础模型路径]` 来覆盖基础模型的名称或路径。\n\n### 训练\n在更新版本中，我们使用强大的 [axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl) 库来管理训练过程。请参阅我们的 [fork](https:\u002F\u002Fgithub.com\u002Fctlllll\u002Faxolotl) 获取训练代码。主要的代码修改位于 [`src\u002Faxolotl\u002Futils\u002Fmodels.py`](https:\u002F\u002Fgithub.com\u002Fctlllll\u002Faxolotl\u002Fblob\u002Fmain\u002Fsrc\u002Faxolotl\u002Futils\u002Fmodels.py)。训练配置可在 [`examples\u002Fmedusa`](https:\u002F\u002Fgithub.com\u002Fctlllll\u002Faxolotl\u002Ftree\u002Fmain\u002Fexamples\u002Fmedusa) 中找到。典型的训练命令如下：\n```bash\naccelerate launch -m axolotl.cli.train examples\u002Fmedusa\u002Fyour_config.yml\n```\n\n用于自蒸馏的数据准备代码可在当前仓库的 [`data_generation` 文件夹](data_generation) 中找到。对于其他数据集，您可以直接从相应的 Hugging Face 数据集仓库下载数据。\n\n### 在不同架构上训练\n*以下说明适用于 Medusa 的初始版本，提供了一个训练 Medusa-1 模型的最小示例。有关更新版本的说明，请参阅前一节。*\n\n进行训练时，请安装：\n```bash\npip install -e \".[train]\"\n```\n\n#### 准备数据\n我们采用 ShareGPT 数据集的一个公开版本，它是 Vicuna 训练数据的一个子集。对于其他模型，您可以使用相应的训练数据集。\n```bash\ngit clone https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAeala\u002FShareGPT_Vicuna_unfiltered\n```\n\n注意：如果您尚未安装 `git-lfs`，请在克隆之前先安装：\n```bash\ngit lfs install\n```\n\n#### 将数据适配到您希望启用 Medusa 的模型上。\n\n首先启动一个您喜欢的推理服务器，运行您想要训练的模型。我们以 [mistralai\u002FMistral-7B-Instruct-v0.2](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FMistral-7B-Instruct-v0.2) 为例。\n\n例如，您可以使用 [text-generation-inference](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference)，该工具在您训练完 Medusa 
头部后也可以继续使用。\n```bash\nmodel=mistralai\u002FMistral-7B-Instruct-v0.2\nvolume=$PWD\u002Fdata # 与 Docker 容器共享卷，避免每次运行都重新下载权重\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:\u002Fdata ghcr.io\u002Fhuggingface\u002Ftext-generation-inference:latest --model-id $model --input-length 4000 --max-total-tokens 4096 --max-batch-prefill-tokens 4000\n```\n\nShareGPT 中的部分序列较长，因此请确保能够成功推理这些序列。如果内存不足，脚本会自动跳过这些长对话。这不会对下游性能产生太大影响，但更多的数据总是更好的。\n\n您可以根据需要调整各种参数以 [加速推理](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Findex)，不过默认设置在大多数情况下已经足够好。\n\n```bash\npython create_data.py --input-filename ShareGPT_Vicuna_unfiltered\u002FShareGPT_V4.3_unfiltered_cleaned_split.json --output-filename mistral.json\n```\n\n#### 训练模型\n我们沿用了 [FastChat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat#fine-tuning) 的训练设置，但由于我们冻结了原始模型，只训练新增的头部，因此学习率设置得更高。以下是针对 Vicuna-7b 模型在 4 张 GPU 上的训练命令。由于我们仅训练新增的头部，训练所需的显存较少，只需使用数据并行即可。您可以根据自己的硬件配置调整脚本。对于更大的模型，我们也采用相同的设置。此外，您还可以使用 `--load_in_8bit` 或 `--load_in_4bit` 参数，以量化格式加载基础模型。\n```bash\ntorchrun --nproc_per_node=4 medusa\u002Ftrain\u002Ftrain_legacy.py --model_name_or_path mistralai\u002FMistral-7B-Instruct-v0.2 \\\n    --data_path mistral.json \\\n    --bf16 True \\\n    --output_dir test \\\n    --num_train_epochs 2 \\\n    --per_device_train_batch_size 8 \\\n    --per_device_eval_batch_size 8 \\\n    --gradient_accumulation_steps 4 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"no\" \\\n    --learning_rate 1e-3 \\\n    --weight_decay 0.0 \\\n    --warmup_ratio 0.1 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --tf32 True \\\n    --model_max_length 2048 \\\n    --lazy_preprocess True \\\n    --medusa_num_heads 3 \\\n    --medusa_num_layers 1 \\\n    --deepspeed deepspeed.json\n```\n\n### 推送到 Hugging Face Hub\n您可以使用以下命令将您的模型推送到 Hugging Face Hub：\n```bash\npython -m medusa.hf_utils --folder [模型文件夹路径] --repo [仓库名称]\n```\n\n## 引用\n```bibtex\n@article{cai2024medusa,\n  title   = 
{Medusa: 基于多解码头的简单 LLM 推理加速框架},\n  author  = {Tianle Cai、Yuhong Li、Zhengyang Geng、Hongwu Peng、Jason D. Lee、Deming Chen、Tri Dao},\n  year    = {2024},\n  journal = {arXiv 预印本 arXiv: 2401.10774}\n}\n```\n\n## 代码库指南\n`medusa\u002Fmodel\u002Fmedusa_model.py` 是 Medusa 的核心文件。它包含了 `MedusaModel` 类，该类是对原始模型和新增头部的封装。此外，该类还实现了流式生成方法。如果您想深入了解 Medusa 的细节，这里就是起点。\n\n我们还在 `notebooks\u002F` 目录中提供了一些示例笔记本，帮助您更好地理解代码库。\n\n## 社区采纳\n我们非常高兴地看到 Medusa 已被许多开源项目所采用。以下是一个不完全列表：\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM\u002Ftree\u002Fmain\u002Fexamples\u002Fmedusa)\n- [TGI](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference\u002Fblob\u002Fmain\u002Fserver\u002Ftext_generation_server\u002Futils\u002Fmedusa.py)\n- [RTP-LLM](https:\u002F\u002Fgithub.com\u002Falibaba\u002Frtp-llm\u002Fblob\u002Fmain\u002Fdocs\u002FSpeculativeDecoding-Tutroial.md#medusa-decoding)\n\n我们感谢这些作者对社区的贡献，并真诚希望 Medusa 能够助力 LLM 的发展。如果您在项目中使用了 Medusa，请告知我们，我们将把您的项目加入到列表中。\n\n## 贡献\n我们欢迎社区为 Medusa 做出贡献。如果您有任何改进建议，请提交一个议题与我们讨论。在提交拉取请求时，请确保您的更改经过充分测试。请将每一项重大更改拆分为单独的拉取请求。我们还有一个 [路线图](ROADMAP.md)，概述了 Medusa 未来的规划。如果您对参与路线图中的任何事项感兴趣，请随时联系我们。\n\n## 致谢\n本代码库受到大语言模型社区中一些杰出项目的启发，包括 [FastChat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat)、[TinyChat](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fllm-awq\u002Ftree\u002Fmain\u002F)、[vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 和 [axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl)。\n\n本项目得到了 [Together AI](https:\u002F\u002Ftogether.ai\u002F)、[MyShell AI](https:\u002F\u002Fmyshell.ai\u002F) 和 [Chai AI](https:\u002F\u002Fwww.chai-research.com\u002F) 的支持。","# Medusa 快速上手指南\n\nMedusa 是一个简单的框架，通过在大语言模型（LLM）上添加多个解码头（Decoding Heads），实现生成速度的显著加速（2.2-3.6 倍）。它无需额外的草稿模型，仅微调新增的头部即可工作，支持自蒸馏技术适配任意微调后的 LLM。\n\n## 环境准备\n\n*   **操作系统**: Linux (推荐 Ubuntu)\n*   **Python**: 3.8 或更高版本\n*   **GPU**: 支持 CUDA 的 NVIDIA 显卡（单卡即可运行推理）\n*   **依赖库**: PyTorch, Transformers, Accelerate 
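and others (handled automatically by the installation)\n\n> **Tip**: Users in mainland China may want to configure the Tsinghua or Alibaba mirror when installing Python dependencies to speed up downloads.\n> ```bash\n> export PIP_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## Installation\n\nInstalling from source is recommended so that you get the latest features (including Medusa-2 and self-distillation support).\n\n### 1. Clone the repository and install\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa.git\ncd Medusa\npip install -e .\n```\n\n### 2. Training environment (optional)\nIf you plan to train Medusa heads yourself (especially with the Medusa-2 recipe), install the extra training dependencies:\n```bash\npip install -e \".[train]\"\n```\n*Note: Medusa-2 training is based on the [axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl) library; see its fork for the specific configuration.*\n\n## Basic Usage\n\nMedusa is currently optimized mainly for single-GPU inference with a batch size of 1, which makes it well suited to local deployment.\n\n### 1. Launch the command-line chat interface\nYou can load a pretrained Medusa model from Hugging Face directly and chat with it. The following example uses the Medusa version of `Vicuna-7B`:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python -m medusa.inference.cli --model FasterDecoding\u002Fmedusa-vicuna-7b-v1.3\n```\n\n**Common options:**\n*   `--model`: path or Hugging Face repo name of the Medusa model.\n*   `--base-model`: manually specify the base model path if the base model is not at the default location.\n*   `--load-in-8bit` or `--load-in-4bit`: load in quantized format to save GPU memory (for example, `--load-in-4bit`).\n\n### 2. Available models\nYou can replace the `--model` argument in the command above with any of the following officially provided models:\n\n**Medusa-1 series:**\n*   7B: `FasterDecoding\u002Fmedusa-vicuna-7b-v1.3`\n*   13B: `FasterDecoding\u002Fmedusa-vicuna-13b-v1.3`\n*   33B: `FasterDecoding\u002Fmedusa-vicuna-33b-v1.3`\n\n**Medusa-2 series (supports a wider range of architectures):**\n*   Zephyr-7B: `FasterDecoding\u002Fmedusa-1.0-zephyr-7b-beta`\n*   Vicuna-7B-v1.5: `FasterDecoding\u002Fmedusa-1.0-vicuna-7b-v1.5`\n*   Vicuna-13B-v1.5: `FasterDecoding\u002Fmedusa-1.0-vicuna-13b-v1.5`\n*   Vicuna-33B-v1.5: `FasterDecoding\u002Fmedusa-1.0-vicuna-33b-v1.5`\n\n### 3. 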
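Upload your model to the Hugging Face Hub (optional)\nIf you have trained your own model and want to share it, use the following command:\n\n```bash\npython -m medusa.hf_utils --folder [path to local model folder] --repo [target repo name]\n```","A startup team is deploying a domain-specific customer-service bot based on Vicuna-7b on local hardware and needs to answer complex user queries in real time.\n\n### Without Medusa\n- **High inference latency**: With the compute of a single GPU, every token is generated sequentially, so users often wait several seconds to see a complete reply and the experience feels sluggish.\n- **Dependence on an extra model**: Speeding things up with conventional speculative decoding requires finding and maintaining a draft model whose distribution closely matches the large model, which adds architectural complexity.\n- **Inefficient sampling**: In non-greedy sampling modes, which are needed for diverse answers, conventional acceleration methods degrade noticeably and can even be slower than basic greedy decoding.\n- **Costly fine-tuning**: Adapting to in-house data usually means full retraining or a complicated distributed setup, which is very unfriendly to teams with limited GPU memory.\n\n### With Medusa\n- **Roughly doubled generation speed**: By attaching multiple decoding heads to the large model, Medusa predicts several tokens in parallel and speeds up inference by about 2.2x on a single GPU, so replies appear almost immediately.\n- **Minimal architecture**: No external draft model is introduced; only lightweight heads are trained on top of the original model, which keeps the system simple and stable.\n- **Strong sampling performance**: Medusa keeps its speedup in non-greedy sampling, so the diversity and naturalness of the answers are preserved.\n- **Low-cost adaptation**: With self-distillation, the team can transfer Medusa to a fine-tuned model without the original training data, and only a small number of parameters are updated, so an ordinary GPU is enough.\n\nAt a very low integration cost, Medusa removes the sequential-generation bottleneck of large models and gives resource-constrained local deployments a smooth real-time experience.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FFasterDecoding_Medusa_e776a210.jpg","FasterDecoding","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FFasterDecoding_6a5dc5e0.png","Think deeper, decode faster",null,"https:\u002F\u002Fgithub.com\u002FFasterDecoding",[78,82,86],{"name":79,"color":80,"percentage":81},"Jupyter Notebook","#DA5B0B",65.5,{"name":83,"color":84,"percentage":85},"Python","#3572A5",34.4,{"name":87,"color":88,"percentage":89},"Shell","#89e051",0.1,2722,197,"2026-04-14T21:36:11","Apache-2.0","Linux","An NVIDIA GPU is required. Inference runs on a single GPU; training supports multiple GPUs (4 in the example). The base model can be loaded in quantized format with --load-in-8bit or --load-in-4bit to reduce GPU memory use. The memory needed depends on the chosen model (7B\u002F13B\u002F33B); no minimum is stated, but the project describes the framework as friendly to 'GPU-Poor' users.","Not specified",{"notes":98,"python":96,"dependencies":99},"1. Inference is optimized mainly for the single-GPU case with a batch size of 1. 2. For Medusa-2, managing training with the axolotl library is recommended. 3. Legacy training (Medusa-1) requires launching an inference server (such as text-generation-inference) to generate training data. 4. Install git-lfs before cloning the dataset. 5. 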
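Supports loading the base model in 4-bit or 8-bit quantized format to save resources.",[100,101,102,103,104,105],"torch","transformers","accelerate","axolotl (for Medusa-2 training)","deepspeed","bitsandbytes (for quantization)",[14,35],[108,109],"llm","llm-inference","2026-03-27T02:49:30.150509","2026-04-16T08:19:19.424468",[113,118,123,128,133,138,143],{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},35374,"How can I fine-tune a long-context (e.g. 16k) Vicuna model with limited GPU memory?","Fine-tuning a 16k-context model is extremely memory-hungry: even on an RTX A6000 with 48GB of memory, batch_size=1 can run out of memory. The current community suggestion is to first validate fine-tuning on a regular 4k-context model. If you must handle long inputs, techniques such as gradient checkpointing, DeepSpeed ZeRO, or lower precision (for example bf16 instead of fp32) may be needed to reduce memory use. The training command already includes options such as `--bf16 True` and `--lazy_preprocess True`, so make sure they are enabled.","https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa\u002Fissues\u002F18",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},35370,"Does using Medusa make the output differ from the original model?","Medusa applies no extra penalty to the logits relative to the base model, so in theory the outputs should match. If you do see differences in practice, try the v1.0 branch to verify; that branch is compatible with the older version (only the new weights have not been released yet). Feel free to report any problems you run into.","https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa\u002Fissues\u002F56",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},35371,"What should I do about the error \"TypeError: __init__() got an unexpected keyword argument 'medusa_num_heads'\" when loading a pretrained model?","This is a common problem with how initialization arguments are passed. The fix is to change the `__init__` function of the `MedusaModelABC` class in `medusa_model.py` so that it accepts arbitrary keyword arguments:\n```python\ndef __init__(\n        self,\n        config,\n        *args,\n        **kwargs,\n    ):\n```\nExtra arguments are then absorbed by `**kwargs`, which avoids the error.","https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa\u002Fissues\u002F55",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},35372,"How do I fix CUBLAS_STATUS_EXECUTION_FAILED or NaN loss errors when training Medusa heads with Llama2-7b as the base model?","This is usually caused by differences in tokenizer normalization between Llama2 and Vicuna. A simple fix is to use the tokenizer files from Vicuna-1.3 (or another Llama-1 family model), including tokenizer.json, tokenizer.model, and special_tokens_map.json, in place of the Llama2 tokenizer. Also make sure the code changes `get_conversation_template(\"vicuna\")` to `get_conversation_template(\"llama-2\")`, and refer to the related FastChat PR 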
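when adapting.","https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa\u002Fissues\u002F45",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},35373,"Is the sparse candidate generation logic in the Medusa code correct? Why is the same TOPK applied to all heads?","A user pointed out that applying the same TOPK value to every Medusa head can misalign indices, especially when the tree configuration needs a different number of candidates per head (for example [10, 10, 8, 2]). The suggested fix is to iterate over the heads and pick the top-k for each one according to its own beam_size. The modified logic is:\n```python\ncandidates_medusa_logits = []\nfor medusa_head, beam_size in enumerate(beam_sizes):\n    candidates_medusa_logits.append(torch.topk(medusa_logits[medusa_head, 0, -1], beam_size, dim = -1).indices)\ncandidates_medusa_logits = torch.cat(candidates_medusa_logits)\n```\nThis ensures each head contributes only the specified number of candidates and avoids index mix-ups.","https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa\u002Fissues\u002F64",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},35375,"Can you provide mt-bench results for different Medusa configurations?","The maintainers say the eval folder is self-contained and should be compatible with the original version. Specific ablation results can be obtained by running the bundled evaluation scripts. The team is currently working on full fine-tuning for other recent models, and the relevant branch is still being tested, so watch the main branch or specific releases for the latest numbers.","https:\u002F\u002Fgithub.com\u002FFasterDecoding\u002FMedusa\u002Fissues\u002F62",{"id":144,"question_zh":145,"answer_zh":146,"source_url":142},35376,"Does the code on the main branch have breaking changes or known bugs?","If you need a stable environment, switch to the v1.0 branch, which is largely compatible with earlier versions. The main branch may contain features under development or code that is not fully tested (for example support for the new weights). If you hit a specific problem on the main branch, you are welcome to open an issue or PR, and the maintainers will help review the code.",[148],{"id":149,"version":150,"summary_zh":151,"released_at":152},280440,"v0.1","Medusa is an easy-to-use framework that aims to make acceleration techniques for large language model generation more accessible. Medusa-v0.1 adds multiple lightweight decoding heads, which removes the need for a draft model.","2023-09-11T20:35:12"]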