[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-microsoft--Tutel":3,"tool-microsoft--Tutel":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",148568,2,"2026-04-09T23:34:24",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 
 making it an ideal bridge connecting local files to AI analysis pipelines. It also provides an MCP (Model Context Protocol) server that integrates seamlessly into LLM applications such as Claude Desktop.\n\nThe tool especially suits developers, data scientists, and AI researchers, above all anyone building retrieval-augmented generation (RAG) systems, running batch text analysis, or wanting an AI assistant to 'read' local files directly. The generated content is reasonably human-readable too, but its core advantage lies in serving machines",93400,"2026-04-06T19:52:38",[52,14],"Plugin",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch is a PyTorch-based open-source educational project that guides you through building a ChatGPT-style large language model (LLM) from scratch, step by step. It is the official code repository for the book of the same name, and it offers a complete hands-on path covering model development, pretraining, and finetuning.\n\nThe project tackles the 'black box' problem in learning about large models: many developers can call ready-made models but find it hard to understand the internal architecture and training machinery. By writing every line of core code yourself, you gain a thorough grasp of the Transformer architecture, attention mechanisms, and the other key principles, and a real understanding of how a large model 'thinks'. The project also includes code for loading large pretrained weights for finetuning, carrying the theory into practical use.\n\nLLMs-from-scratch is ideal for AI developers, researchers, and computer science students who want to dig into the underlying principles. For engineers not content with merely calling an API who want to see how a model is actually built, it is an excellent learning resource. Its signature strength is the step-by-step teaching design: complex systems engineering is broken into clear stages, with detailed diagrams and examples, making it genuinely attainable to build a small but fully functional large model. Whether you want to solidify your theoretical foundations or prepare to develop larger models in the future",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":102,"env_os":103,"env_gpu":104,"env_ram":105,"env_deps":106,"category_tags":112,"github_topics":113,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":119,"updated_at":120,"faqs":121,"releases":150},6042,"microsoft\u002FTutel","Tutel","Tutel MoE: Optimized Mixture-of-Experts Library, Supporting GptOss\u002FDeepSeek\u002FKimi-K2\u002FQwen3 using FP8\u002FNVFP4\u002FMXFP4","Tutel is a high-performance optimization library built for Mixture-of-Experts (MoE) models, designed to make large-model training and inference faster and smoother. At its core, it removes the performance bottlenecks that models with dynamic behavior hit in parallel computation, sparsity handling, and capacity switching, pioneering 'no-penalty' switching so that no compute is wasted even in complex scenarios.\n\nTutel is tuned for frontier open-source models such as DeepSeek, Kimi, Qwen3, and GptOSS, and it is an early adopter of low-precision formats such as FP8, NVFP4, and MXFP4, markedly cutting memory use and speeding up inference on mainstream accelerators including the A100, H100, and MI300. From million-token long-context workloads to multimodal speech models, Tutel scales exceptionally well.\n\nThe library targets AI developers, algorithm researchers, and infrastructure engineers. If you need to deploy very large MoE models, or want to cut inference cost substantially without sacrificing accuracy, Tutel is a highly capable technical assistant. Built on PyTorch and compatible with both the CUDA and ROCm ecosystems, it makes deploying high-performance large models simpler and more controllable.","# Tutel\n\nTutel MoE: An Optimized Mixture-of-Experts Implementation, also the first parallel solution proposing [\"No-penalty Parallelism\u002FSparsity\u002FCapacity\u002F.. 
Switching\"](https:\u002F\u002Fmlsys.org\u002Fmedia\u002Fmlsys-2023\u002FSlides\u002F2477.pdf) for modern training and inference that have dynamic behaviors.\n\n- Supported Framework: Pytorch (recommend: >= 2.0)\n- Supported GPUs: CUDA(fp64\u002Ffp32\u002Ffp16\u002Fbf16), ROCm(fp64\u002Ffp32\u002Ffp16\u002Fbf16)\n- Supported CPU: fp64\u002Ffp32\n- Support direct NVFP4\u002FMXFP4\u002FBlockwiseFP8 Inference for MoE-based DeepSeek \u002F Kimi \u002F Qwen3 \u002F GptOSS using A100\u002FA800\u002FH100\u002FMI300\u002F..\n\n\u003Cdiv align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Tutel_readme_6e7638f0c64d.png\" width=\"600px\" alt=\"DeepSeek-V3.2\">\u003C\u002Fdiv>\n\u003Cdiv align=\"center\">Scaling DeepSeek V3.1\u002F3.2 TPOS with Context Size\u003C\u002Fdiv>\n\n> [!TIP]\n> #### Steps for DeepSeek V3.2 (Long-Context Mode):\n> \n> ```sh\n> [Model Downloads]\n>   pip3 install -U \"huggingface_hub[cli]\" --upgrade\n>   hf download nvidia\u002FKimi-K2.5-NVFP4 --local-dir nvidia\u002FKimi-K2.5-NVFP4\n>   hf download nvidia\u002FKimi-K2-Thinking-NVFP4 --local-dir nvidia\u002FKimi-K2-Thinking-NVFP4\n>   hf download nvidia\u002FDeepSeek-V3.2-NVFP4 --local-dir nvidia\u002FDeepSeek-V3.2-NVFP4\n> \n> [DeepSeek V3.2 Long-Context (A100\u002FH100\u002FB200 only)]\n>   docker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       -v \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1:\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1 --privileged \\\n>       tutelgroup\u002Fdeepseek-671b:a100x8-chat-20260327 --serve=webui --listen_port 8000 \\\n>         --try_path nvidia\u002FKimi-K2.5-NVFP4 \\\n>         --try_path nvidia\u002FKimi-K2-Thinking-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-V3.2-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-R1-NVFP4 \\\n>         --max_seq_len 32768\n> \n> [DeepSeek V3.2 Long-Context (MI300 only)]\n>   docker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 --device=\u002Fdev\u002Fkfd --device=\u002Fdev\u002Fdri --group-add=video \\\n>       --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20260327 --serve=webui --listen_port 8000 \\\n>         --try_path nvidia\u002FKimi-K2.5-NVFP4 \\\n>         --try_path nvidia\u002FKimi-K2-Thinking-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-V3.2-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-R1-NVFP4 \\\n>         --max_seq_len 1000000\n> \n> [OpenAI\u002FOllama\u002FDirect Request]\n>   curl -N -X POST http:\u002F\u002F0.0.0.0:8000\u002Fchat -d '{\"text\": \"Write a Python code of the Quicksort algorithm.\"}'\n>   python3 -m tutel.examples.oai_request_stream --url '0.0.0.0:8000' --prompt 'Write a Python code of the Quicksort algorithm.'\n> \n> [Open-WebUI URL for Web browsers]\n>   xdg-open http:\u002F\u002F0.0.0.0:8000\n> ```\n\n> [!TIP]\n> #### Steps for Microsoft VibeVoice (Multimodality Mode):\n> \n> ```sh\n> [Model Downloads]\n>   pip3 install -U \"huggingface_hub[cli]\" --upgrade\n>   hf download microsoft\u002FVibeVoice-1.5B --local-dir microsoft\u002FVibeVoice-1.5B\n>   hf download Qwen\u002FQwen2.5-1.5B --local-dir Qwen\u002FQwen2.5-1.5B\n>\n>   hf download 
microsoft\u002FVibeVoice-Large --local-dir aoi-ot\u002FVibeVoice-Large\n>   hf download Qwen\u002FQwen2.5-7B --local-dir Qwen\u002FQwen2.5-7B\n> \n> [Microsoft VibeVoice (A100\u002FH100\u002FB200 only)]\n> docker run -e LOCAL_SIZE=1 -it --rm -p 8001:8000 --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       -v \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1:\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1 --privileged \\\n>       -e VOICES=\"https:\u002F\u002Fhomepages.inf.ed.ac.uk\u002Fhtang2\u002Fnotes\u002Fspeech-samples\u002F103-1240-0000.wav\" \\\n>       tutelgroup\u002Fdeepseek-671b:a100x8-chat-20251222 --serve=core \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-1.5B \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-Large\n> \n> [Microsoft VibeVoice (MI300 only)]\n> docker run -e LOCAL_SIZE=1 -it --rm -p 8001:8000 --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 --device=\u002Fdev\u002Fkfd --device=\u002Fdev\u002Fdri --group-add=video \\\n>       --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       -e VOICES=\"https:\u002F\u002Fhomepages.inf.ed.ac.uk\u002Fhtang2\u002Fnotes\u002Fspeech-samples\u002F103-1240-0000.wav\" \\\n>       tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20251222 --serve=core \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-1.5B \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-Large\n> \n> [Audio Generation Request]\n>   curl -X POST http:\u002F\u002F0.0.0.0:8001\u002Fchat -d '{\"text\": \"VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text.\"}' > sound_output.mp3\n> ```\n\n\n#### Inference TPOS for DeepSeek-MoE\u002FQwen3-MoE\u002FKimiK2-MoE\u002FGptOSS-MoE\u002F..:\n> |  ***Model \\& Machine Type*** | ***Precision*** | ***SGL***  | ***Tutel***  |\n> |  ----  | ----  | ----  | ----  |\n> | $deepseek-ai\u002FDeepSeek-V3.2\\ (671B,\\ A100 \\times 8)$ | nvfp4 | - | 102 |\n> | $deepseek-ai\u002FDeepSeek-V3.2\\ (671B,\\ MI300 \\times 8)$ | nvfp4 | - | 151 |\n> | $moonshotai\u002FKimi-K2-Instruct\\ (1T,\\ A100 \\times 8)$ | nvfp4 | - | 104 |\n> | $moonshotai\u002FKimi-K2-Instruct\\ (1T,\\ MI300 \\times 8)$ | fp8b128 | 49 | 153 |\n> | $NVFP4\u002FQwen3-235B-A22B-Instruct-2507-FP4(A100\\times8)$ | nvfp4 | - | 114 |\n> | $NVFP4\u002FQwen3-235B-A22B-Instruct-2507-FP4(MI300\\times8)$ | nvfp4 | - | 122 |\n> | $openai\u002Fgpt-oss-120b\\ (120B,\\ A100 \\times 1)$ | mxfp4 | 127 | 212 |\n> | $openai\u002Fgpt-oss-120b\\ (120B,\\ MI300 \\times 1)$ | mxfp4 | 191 | 311 |\n> | $microsoft\u002FVibeVoice-1.5B (A100 \\times 1)$ | bf16 | - | rtf=0.07 |\n> | $microsoft\u002FVibeVoice-1.5B (MI300 \\times 1)$ | bf16 | - | rtf=0.06 |\n> \n\n## What's New:\n\n> Image-*20260327*: Add support for Kimi-K2.5.\n>\n> Image-*20260306*: Support DeepSeek V3.2 Long-context mode for A100\u002FH100\u002FMI300\u002FB200.\n>\n> Image-*20251222*: Fine-tune A100 performance for most models.\n>\n> Image-*20251111*: Integrate Tutel LLM module into VibeVoice for accelerated inference (rtf = 0.07 for single A100).\n>\n> Image-*20251006*: Resolve compatibility with DeepSeek-V3.2-Exp\n>\n> Image-*20250827*: Add distributed support for OpenAI GPT-OSS 20B\u002F120B with MXFP4 inference\n>\n> Image-*20250801*: Support Qwen3 MoE series, integrate 
[OpenWebUI](https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui)\n>\n> Image-*20250712*: Support Kimi K2 1TB MoE inference with NVFP4 for NVIDIA\u002FAMD GPUs\n>\n> Image-*20250601*: Improved decoding performance for DeepSeek 671B on MI300x to 140-150 TPS\n>\n> More image versions can be found [here](https:\u002F\u002Fhub.docker.com\u002Fr\u002Ftutelgroup\u002Fdeepseek-671b\u002Ftags)\n\n> Tutel v0.4.2: Add R1-FP4\u002FQwen3MoE-FP8 Support for NVIDIA and AMD GPUs & Fast Gating APIs:\n```py\n  >> Example:\n\nimport torch\nfrom tutel import ops\n\n# Qwen3 Fast MoE Gating for 128 Experts, with Routed Weights normalized to 1.0\nlogits_fp32 = torch.softmax(torch.randn([32, 128]), -1, dtype=torch.float32).cuda()\ntopk_weights, topk_ids = ops.qwen3_moe_scaled_topk(logits_fp32)\nprint(topk_weights, topk_ids, topk_weights.sum(-1))\n\n# DeepSeek V3\u002FR1 Fast MoE Gating for 256 Experts, with Routed Weights normalized to 2.5\nlogits_bf16 = torch.randn([32, 256], dtype=torch.bfloat16).cuda()\ncorrection_bias_bf16 = torch.randn([logits_bf16.size(-1)], dtype=torch.bfloat16).cuda()\ntopk_weights, topk_ids = ops.deepseek_moe_sigmoid_scaled_topk(logits_bf16, correction_bias_bf16, None, None)\nprint(topk_weights, topk_ids, topk_weights.sum(-1))\n```\n\n> Tutel v0.4.1: Support Deepseek R1 FP8 with NVIDIA GPUs (A100 \u002F A800)\n\n> Tutel v0.4.0: Accelerating Deepseek R1 Full-precision-Chat for AMD MI300x8:\n```sh\n  >> Example:\n\n    # Step-1: Download Deepseek R1 671B Model\n    huggingface-cli download deepseek-ai\u002FDeepSeek-R1 --local-dir .\u002Fdeepseek-ai\u002FDeepSeek-R1\n\n    # Step-2: Using 8 MI300 GPUs to Serve Deepseek R1 Chat on Local Port :8000\n    docker run -it --rm --ipc=host --privileged -p 8000:8000 \\\n        -v \u002F:\u002Fhost -w \u002Fhost$(pwd) tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20250224 \\\n        --model_path .\u002Fdeepseek-ai\u002FDeepSeek-R1\n\n    # Step-3: Issue a Prompt Request with curl\n    curl -X POST http:\u002F\u002F0.0.0.0:8000\u002Fchat -d '{\"text\": \"Calculate the result of: 1 \u002F (sqrt(5) - sqrt(3))\"}'\n```\n\n> Tutel v0.3.3: Add all-to-all benchmark:\n```sh\n  >> Example:\n\n    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.bandwidth_test --size_mb=256\n```\n\n> Tutel v0.3.2: Add tensorcore option for extra benchmarks \u002F Extend the example for custom experts \u002F Allow NCCL timeout settings:\n```sh\n  >> Example of using tensorcore:\n\n    python3 -m tutel.examples.helloworld --dtype=float32\n    python3 -m tutel.examples.helloworld --dtype=float32 --use_tensorcore\n\n    python3 -m tutel.examples.helloworld --dtype=float16\n    python3 -m tutel.examples.helloworld --dtype=float16 --use_tensorcore\n\n  >> Example of custom gates\u002Fexperts:\n    python3 -m tutel.examples.helloworld_custom_gate_expert --batch_size=16\n\n  >> Example of NCCL timeout settings:\n    TUTEL_GLOBAL_TIMEOUT_SEC=60 python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --use_tensorcore\n\n```\n\n> Tutel v0.3.1: Add NCCL all_to_all_v and all_gather_v for arbitrary-length message transfers:\n```sh\n  >> Example:\n    # All_to_All_v:\n    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_to_all_v\n    # All_Gather_v:\n    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_gather_v\n\n  >> How to:\n    net.batch_all_to_all_v([t_x_cuda, t_y_cuda, ..], common_send_counts)\n    net.batch_all_gather_v([t_x_cuda, 
t_y_cuda, ..])\n```\n\n> Tutel v0.3: Add a Megablocks solution to improve decoder inference on a single GPU with num_local_experts >= 2:\n```sh\n  >> Example (capacity_factor=0 required by dropless-MoE):\n    # Using BatchMatmul:\n    python3 -m tutel.examples.helloworld --megablocks_size=0 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0\n    # Using Megablocks with block_size = 1:\n    python3 -m tutel.examples.helloworld --megablocks_size=1 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0\n    # Using Megablocks with block_size = 2:\n    python3 -m tutel.examples.helloworld --megablocks_size=2 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0\n\n  >> How to:\n    self._moe_layer.forward(x, .., megablocks_size=1)         # Control the switch of megablocks_size (0 for disabled)\n```\n\n> Tutel v0.2: Allow most configurations to be dynamically switchable at zero cost:\n```sh\n  >> Example:\n    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld_switch --batch_size=16\n\n  >> How to:\n    self._moe_layer.forward(x, .., a2a_ffn_overlap_degree=2)  # Control the switch of overlap granularity (1 for no overlapping)\n    self._moe_layer.forward(x, .., adaptive_r=1)              # Control the switch of parallelism (0 for DP, 1 for DP + EP, W \u002F E for MP + EP, else for DP + MP + EP)\n    self._moe_layer.forward(x, .., capacity_factor=1)         # Control the switch of capacity_volume (positive for padding, negative for no-padding, 0 for dropless)\n    self._moe_layer.forward(x, .., top_k=1)                   # Control the switch of top_k sparsity\n```
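\n\nFor a concrete feel of these per-call switches, here is a minimal single-process sketch (an editor's illustration rather than upstream code; it assumes the forward-time overrides listed above are accepted by a plain moe_layer built from the documented arguments):\n```py\nimport torch\nfrom tutel import moe as tutel_moe\n\ndevice = 'cuda:0' if torch.cuda.is_available() else 'cpu'\n# One MoE layer built with documented arguments only.\nmoe = tutel_moe.moe_layer(\n    gate_type={'type': 'top', 'k': 2},\n    model_dim=256,\n    experts={'type': 'ffn', 'num_experts_per_device': 4, 'hidden_size_per_expert': 512,\n             'activation_fn': lambda t: torch.nn.functional.relu(t)},\n).to(device)\n\nx = torch.randn([16, 256], device=device)\ny_top2 = moe(x)                       # routing as configured (top-2)\ny_top1 = moe(x, top_k=1)              # switch sparsity for this call only\ny_nodrop = moe(x, capacity_factor=0)  # dropless capacity for this call only\nprint(y_top2.shape, y_top1.shape, y_nodrop.shape)\n```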
\n\n> Tutel v0.1: Optimize the Einsum complexity of data dispatch encoding and decoding, add a 2DH option to deal with All-to-All at scale:\n```sh\n  >> Example (2DH is suggested only at scale; note that the value of --nproc_per_node MUST equal the total physical GPU count per node, e.g. 8 for A100x8):\n    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16 --use_2dh\n```\n\n-----------\n## Getting Started\n\n### 1. Prepare PyTorch (if applicable):\n```\n* Prepare recommended PyTorch >= 2.0.0:\n        #  Windows\u002FLinux PyTorch for NVIDIA CUDA >= 11.7:\n        python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n        #  Linux PyTorch for AMD ROCm >= 6.2.2:\n        python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Frocm6.2.2\n        #  Windows\u002FLinux PyTorch for CPU:\n        python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\n```\n\n### 2. Tutel Installation:\n```\n* Option-1: Install Tutel Online:\n\n        $ python3 -m pip uninstall tutel -y\n        $ python3 -m pip install -v -U --no-build-isolation git+https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel@main\n\n* Option-2: Build Tutel from Source:\n\n        $ git clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel --branch main\n        $ python3 -m pip uninstall tutel -y\n        $ python3 .\u002Ftutel\u002Fsetup.py install --user\n```\n\n### 3. Quick Test for Single Device \u002F CPU:\n```\n* Quick Test on Single-GPU:\n\n        $ python3 -m tutel.examples.helloworld --batch_size=16               # Test Tutel-optimized MoE + manual distribution\n        $ python3 -m tutel.examples.helloworld_ddp --batch_size=16           # Test Tutel-optimized MoE + PyTorch DDP distribution (requires: PyTorch >= 1.8.0)\n        $ python3 -m tutel.examples.helloworld_ddp_tutel --batch_size=16     # Test Tutel-optimized MoE + Tutel DDP distribution (ZeRO on optimizers)\n        $ python3 -m tutel.examples.helloworld_amp --batch_size=16           # Test Tutel-optimized MoE with AMP data type + manual distribution\n        $ python3 -m tutel.examples.helloworld_custom_gate_expert --batch_size=16 # Test Tutel-optimized MoE + custom-defined gate\u002Fexpert layer\n        $ python3 -m tutel.examples.helloworld_from_scratch                  # Test a custom MoE implementation from scratch\n        $ python3 -m tutel.examples.moe_mnist                                # Test the MoE layer end-to-end on the MNIST dataset\n        $ python3 -m tutel.examples.moe_cifar10                              # Test the MoE layer end-to-end on the CIFAR10 dataset\n\n        (If building from source, the following method also works:)\n        $ python3 .\u002Ftutel\u002Fexamples\u002Fhelloworld.py --batch_size=16\n        ..\n```\n\n### 4. Quick Test for 8 GPUs within 1 Machine:\n```\n        $ python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16\n```\n\n### 5. Quick Test for Multiple GPUs across Machines:\n```\n* Run Tutel MoE in Distributed Mode:\n\n        (Option A - Torch launcher for `Multi-Node x Multi-GPU`:)\n        $ ssh \u003Cnode-ip-0> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=\u003Cnode-ip-0> -m tutel.examples.helloworld --batch_size=16\n        $ ssh \u003Cnode-ip-1> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=\u003Cnode-ip-0> -m tutel.examples.helloworld --batch_size=16\n\n        (Option B - Tutel launcher for `Multi-Node x Multi-GPU`, requiring the `openmpi-bin` package:)\n        # \u003C\u003C Single Node >>\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld_ddp_tutel --batch_size=16\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_mnist\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_cifar10\n        ...\n\n        # \u003C\u003C MPI-based launch for GPU backend >>\n        $ mpiexec -bind-to none -host \u003Cnode-ip-0>,\u003Cnode-ip-1>,.. -x MASTER_ADDR=\u003Cnode-ip-0> -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16\n\n        # \u003C\u003C MPI-based launch for CPU backend >>\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=1 -x OMP_NUM_THREADS=1024 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16 --device cpu\n```\n\n-----------\n\n### Advanced: Convert Checkpoint Files for Different World Sizes:\nDocumentation for checkpoint conversion has been moved [here](doc\u002FCHECKPOINT.md).\n\n### Examples: How to import Tutel-optimized MoE in PyTorch:\n```\n# Input Example:\nimport torch\nx = torch.ones([6, 1024], device='cuda:0')\n\n# Create MoE:\nfrom tutel import moe as tutel_moe\nmoe_layer = tutel_moe.moe_layer(\n    gate_type={'type': 'top', 'k': 2},\n    model_dim=x.shape[-1],\n    experts={\n        'num_experts_per_device': 2,\n        'type': 'ffn', 'hidden_size_per_expert': 2048, 'activation_fn': lambda x: torch.nn.functional.relu(x)\n    },\n    scan_expert_func = lambda name, param: setattr(param, 'skip_allreduce', True),\n)\n\n# Cast to GPU\nmoe_layer = moe_layer.to('cuda:0')\n\n# In a distributed model, you additionally need to skip the allreduce on global parameters that carry the `skip_allreduce` mask,\n# e.g.:\n#    for p in moe_layer.parameters():\n#        if hasattr(p, 'skip_allreduce'):\n#            continue\n#        dist.all_reduce(p.grad)\n\n\n# Forward MoE:\ny = moe_layer(x)\n\nprint(y)\n```\n\n### Reference\nYou can consult the [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03382.pdf) below for more technical details about Tutel:\n```\n@article {tutel,\nauthor = {Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},\ntitle = {Tutel: Adaptive Mixture-of-Experts at Scale},\nyear = {2022},\nmonth = jun,\njournal = {CoRR},\nvolume= {abs\u002F2206.03382},\nurl = {https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03382.pdf},\n}\n```\n\n### Usage of MOELayer:\n```\n* Usage of MOELayer Args:\n\n        gate_type        : dict-type gate description, e.g. {'type': 'top', 'k': 2, 'capacity_factor': -1.5, ..},\n                              or a list of dict-type gate descriptions, e.g. [{'type': 'top', 'k': 2}, {'type': 'top', 'k': 2}];\n                              the value of k in top-gating can also be negative, e.g. -2, indicating that each GPU holds 1\u002F(-k) of one expert's parameters;\n                              capacity_factor X can be positive (factor = X), zero (factor = max(needed_volumes)) or negative (factor = min(-X, max(needed_volumes))).\n        model_dim        : the number of channels of the MoE's input tensor\n        experts          : a dict-type config for the builtin expert network\n        scan_expert_func : lets users specify a lambda function applied to each expert parameter, e.g. `scan_expert_func = lambda name, param: setattr(param, 'expert', True)`\n        result_func      : lets users specify a lambda function to format the MoE output and aux_loss, e.g. `result_func = lambda output: (output, output.l_aux)`\n        group            : the explicit communication group used for all_to_all\n        seeds            : a triple of ints specifying the manual seeds for (shared params, local params, other params after the MoE layer)\n        a2a_ffn_overlap_degree : controls the a2a overlap depth; 1 by default for no overlap, 2 to overlap a2a with half of the gemm, ..\n        parallel_type    : the parallel method used to compute the MoE; valid types: 'auto', 'data', 'model'\n        pad_samples      : whether to auto-pad newly arriving input data to the maximum data size seen so far\n\n* Usage of dict-type Experts Config:\n\n        num_experts_per_device : the number of local experts per device (defaults to 1 if not specified)\n        hidden_size_per_expert : the hidden size between the two linear layers of each expert (used for type == 'ffn' only)\n        type             : the builtin expert implementation to use, e.g. ffn\n        activation_fn    : a custom activation function between the two linear layers (used for type == 'ffn' only)\n        has_fc1_bias     : if set to False, the expert bias parameter `batched_fc1_bias` is disabled. Default: True\n        has_fc2_bias     : if set to False, the expert bias parameter `batched_fc2_bias` is disabled. Default: True\n```
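\n\nAs a quick illustration of these arguments, the following single-process sketch wires several of them together (an editor's example based only on the argument list above, not upstream code; defaults and exact behavior may differ in practice):\n```py\nimport torch\nfrom tutel import moe as tutel_moe\n\n# Negative capacity_factor: factor = min(1.5, max(needed_volumes)), per the notes above.\nmoe = tutel_moe.moe_layer(\n    gate_type={'type': 'top', 'k': 2, 'capacity_factor': -1.5},\n    model_dim=512,\n    experts={'type': 'ffn', 'num_experts_per_device': 4,\n             'hidden_size_per_expert': 1024,\n             'activation_fn': lambda t: torch.nn.functional.gelu(t),\n             'has_fc2_bias': False},\n    seeds=(1, 2, 3),\n    result_func=lambda output: (output, output.l_aux),\n)\ny, l_aux = moe(torch.randn([8, 512]))\nprint(y.shape, float(l_aux))\n```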
\n\n### Contributing\n\nThis project welcomes contributions and suggestions. Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https:\u002F\u002Fcla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002F).\nFor more information see the [Code of Conduct FAQ](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002Ffaq\u002F) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n### Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark & Brand Guidelines](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks\u002Fusage\u002Fgeneral).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.\n","# Tutel\n\nTutel MoE: an optimized Mixture-of-Experts implementation, and the first parallel solution to propose \"no-penalty parallelism\u002Fsparsity\u002Fcapacity\u002F... switching\" for modern training and inference with dynamic behaviors (see: https:\u002F\u002Fmlsys.org\u002Fmedia\u002Fmlsys-2023\u002FSlides\u002F2477.pdf).\n\n- Supported framework: PyTorch (>= 2.0 recommended)\n- Supported GPUs: CUDA (fp64\u002Ffp32\u002Ffp16\u002Fbf16), ROCm (fp64\u002Ffp32\u002Ffp16\u002Fbf16)\n- Supported CPU: fp64\u002Ffp32\n- Supports direct NVFP4\u002FMXFP4\u002Fblockwise-FP8 inference for MoE-based DeepSeek \u002F Kimi \u002F Qwen3 \u002F GptOSS on A100\u002FA800\u002FH100\u002FMI300 and similar devices.\n\n\u003Cdiv align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Tutel_readme_6e7638f0c64d.png\" width=\"600px\" alt=\"DeepSeek-V3.2\">\u003C\u002Fdiv>\n\u003Cdiv align=\"center\">DeepSeek V3.1\u002F3.2 TPOS scaling with context size\u003C\u002Fdiv>\n\n> [!TIP]\n> #### Steps for DeepSeek V3.2 (long-context mode):\n> \n> ```sh\n> [Model downloads]\n>   pip3 install -U \"huggingface_hub[cli]\" --upgrade\n>   hf download nvidia\u002FKimi-K2.5-NVFP4 --local-dir nvidia\u002FKimi-K2.5-NVFP4\n>   hf download nvidia\u002FKimi-K2-Thinking-NVFP4 --local-dir nvidia\u002FKimi-K2-Thinking-NVFP4\n>   hf download nvidia\u002FDeepSeek-V3.2-NVFP4 --local-dir nvidia\u002FDeepSeek-V3.2-NVFP4\n> \n> [DeepSeek V3.2 long context (A100\u002FH100\u002FB200 only)]\n>   docker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       -v \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1:\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1 --privileged \\\n>       tutelgroup\u002Fdeepseek-671b:a100x8-chat-20260327 --serve=webui --listen_port 8000 \\\n>         --try_path nvidia\u002FKimi-K2.5-NVFP4 \\\n>         --try_path nvidia\u002FKimi-K2-Thinking-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-V3.2-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-R1-NVFP4 \\\n>         --max_seq_len 32768\n> \n> [DeepSeek V3.2 long context (MI300 only)]\n>   docker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 --device=\u002Fdev\u002Fkfd --device=\u002Fdev\u002Fdri --group-add=video \\\n>       --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20260327 --serve=webui --listen_port 8000 \\\n>         --try_path nvidia\u002FKimi-K2.5-NVFP4 \\\n>         --try_path nvidia\u002FKimi-K2-Thinking-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-V3.2-NVFP4 \\\n>         --try_path nvidia\u002FDeepSeek-R1-NVFP4 \\\n>         --max_seq_len 1000000\n> \n> [OpenAI\u002FOllama\u002Fdirect request]\n>   curl -N -X POST http:\u002F\u002F0.0.0.0:8000\u002Fchat -d '{\"text\": \"Write a Python code of the Quicksort algorithm.\"}'\n>   python3 -m tutel.examples.oai_request_stream --url '0.0.0.0:8000' --prompt 'Write a Python code of the Quicksort algorithm.'\n> \n> [Open-WebUI URL for web browsers]\n>   xdg-open http:\u002F\u002F0.0.0.0:8000\n> ```\n\n> [!TIP]\n> #### Steps for Microsoft VibeVoice (multimodality mode):\n> \n> ```sh\n> [Model downloads]\n>   pip3 install -U \"huggingface_hub[cli]\" --upgrade\n>   hf download microsoft\u002FVibeVoice-1.5B --local-dir microsoft\u002FVibeVoice-1.5B\n>   hf download Qwen\u002FQwen2.5-1.5B --local-dir Qwen\u002FQwen2.5-1.5B\n>\n>   hf download microsoft\u002FVibeVoice-Large --local-dir aoi-ot\u002FVibeVoice-Large\n>   hf download Qwen\u002FQwen2.5-7B --local-dir Qwen\u002FQwen2.5-7B\n> \n> [Microsoft VibeVoice (A100\u002FH100\u002FB200 only)]\n> docker run -e LOCAL_SIZE=1 -it --rm -p 8001:8000 --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       -v \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1:\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1 --privileged \\\n>       -e VOICES=\"https:\u002F\u002Fhomepages.inf.ed.ac.uk\u002Fhtang2\u002Fnotes\u002Fspeech-samples\u002F103-1240-0000.wav\" \\\n>       tutelgroup\u002Fdeepseek-671b:a100x8-chat-20251222 --serve=core \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-1.5B \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-Large\n> \n> [Microsoft VibeVoice (MI300 only)]\n> docker run -e LOCAL_SIZE=1 -it --rm -p 8001:8000 --shm-size=8g \\\n>       --ulimit memlock=-1 --ulimit stack=67108864 --device=\u002Fdev\u002Fkfd --device=\u002Fdev\u002Fdri --group-add=video \\\n>       --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n>       -e VOICES=\"https:\u002F\u002Fhomepages.inf.ed.ac.uk\u002Fhtang2\u002Fnotes\u002Fspeech-samples\u002F103-1240-0000.wav\" \\\n>       tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20251222 --serve=core \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-1.5B \\\n>         --try_path .\u002Fmicrosoft\u002FVibeVoice-Large\n> \n> [Audio generation request]\n>   curl -X POST http:\u002F\u002F0.0.0.0:8001\u002Fchat -d '{\"text\": \"VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text.\"}' > sound_output.mp3\n> ```\n\n\n#### Inference TPOS for DeepSeek-MoE\u002FQwen3-MoE\u002FKimiK2-MoE\u002FGptOSS-MoE\u002F..:\n> |  ***Model & Machine Type*** | ***Precision*** | ***SGL***  | ***Tutel***  |\n> |  ----  | ----  | ----  | ----  |\n> | $deepseek-ai\u002FDeepSeek-V3.2\\ (671B,\\ A100 \\times 8)$ | nvfp4 | - | 102 |\n> | $deepseek-ai\u002FDeepSeek-V3.2\\ (671B,\\ MI300 \\times 8)$ | nvfp4 | - | 151 |\n> | $moonshotai\u002FKimi-K2-Instruct\\ (1T,\\ A100 \\times 8)$ | nvfp4 | - | 104 |\n> | $moonshotai\u002FKimi-K2-Instruct\\ (1T,\\ MI300 \\times 8)$ | fp8b128 | 49 | 153 |\n> | $NVFP4\u002FQwen3-235B-A22B-Instruct-2507-FP4(A100\\times8)$ | nvfp4 | - | 114 |\n> | $NVFP4\u002FQwen3-235B-A22B-Instruct-2507-FP4(MI300\\times8)$ | nvfp4 | - | 122 |\n> | $openai\u002Fgpt-oss-120b\\ (120B,\\ A100 \\times 1)$ | mxfp4 | 127 | 212 |\n> | $openai\u002Fgpt-oss-120b\\ (120B,\\ MI300 \\times 1)$ | mxfp4 | 191 | 311 |\n> | $microsoft\u002FVibeVoice-1.5B (A100 \\times 1)$ | bf16 | - | rtf=0.07 |\n> | $microsoft\u002FVibeVoice-1.5B (MI300 \\times 1)$ | bf16 | - | rtf=0.06 |\n\n## What's New:\n\n> Image-*20260327*: Add support for Kimi-K2.5.\n>\n> Image-*20260306*: Support DeepSeek V3.2 long-context mode for A100\u002FH100\u002FMI300\u002FB200.\n>\n> Image-*20251222*: Fine-tune A100 performance for most models.\n>\n> Image-*20251111*: Integrate the Tutel LLM module into VibeVoice for accelerated inference (rtf = 0.07 on a single A100).\n>\n> Image-*20251006*: Resolve compatibility with DeepSeek-V3.2-Exp.\n>\n> Image-*20250827*: Add distributed support for OpenAI GPT-OSS 20B\u002F120B with MXFP4 inference.\n>\n> Image-*20250801*: Support the Qwen3 MoE series and integrate [OpenWebUI](https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui).\n>\n> Image-*20250712*: Support Kimi K2 1TB MoE inference with NVFP4 on NVIDIA\u002FAMD GPUs.\n>\n> Image-*20250601*: Improve DeepSeek 671B decoding performance on MI300x to 140-150 TPS.\n>\n> More image versions can be found [here](https:\u002F\u002Fhub.docker.com\u002Fr\u002Ftutelgroup\u002Fdeepseek-671b\u002Ftags)\n\n> Tutel v0.4.2: Add R1-FP4\u002FQwen3MoE-FP8 support for NVIDIA and AMD GPUs, plus fast gating APIs:\n```py\n  >> Example:\n\nimport torch\nfrom tutel import ops\n\n# Qwen3 fast MoE gating for 128 experts, with routed weights normalized to 1.0\nlogits_fp32 = torch.softmax(torch.randn([32, 128]), -1, dtype=torch.float32).cuda()\ntopk_weights, topk_ids = ops.qwen3_moe_scaled_topk(logits_fp32)\nprint(topk_weights, topk_ids, topk_weights.sum(-1))\n\n# DeepSeek V3\u002FR1 fast MoE gating for 256 experts, with routed weights normalized to 2.5\nlogits_bf16 = torch.randn([32, 256], dtype=torch.bfloat16).cuda()\ncorrection_bias_bf16 = torch.randn([logits_bf16.size(-1)], dtype=torch.bfloat16).cuda()\ntopk_weights, topk_ids = ops.deepseek_moe_sigmoid_scaled_topk(logits_bf16, correction_bias_bf16, None, None)\nprint(topk_weights, topk_ids, topk_weights.sum(-1))\n```\n\n> Tutel v0.4.1: Support DeepSeek R1 FP8 on NVIDIA GPUs (A100 \u002F A800)\n\n> Tutel v0.4.0: Accelerate DeepSeek R1 full-precision chat on AMD MI300x8:\n```sh\n  >> Example:\n\n    # Step 1: Download the DeepSeek R1 671B model\n    huggingface-cli download deepseek-ai\u002FDeepSeek-R1 --local-dir .\u002Fdeepseek-ai\u002FDeepSeek-R1\n\n    # Step 2: Serve DeepSeek R1 chat on local port :8000 with 8 MI300 GPUs\n    docker run -it --rm --ipc=host --privileged -p 8000:8000 \\\n        -v \u002F:\u002Fhost -w \u002Fhost$(pwd) tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20250224 \\\n        --model_path .\u002Fdeepseek-ai\u002FDeepSeek-R1\n\n    # Step 3: Issue a prompt request with curl\n    curl -X POST http:\u002F\u002F0.0.0.0:8000\u002Fchat -d '{\"text\": \"Calculate the result of: 1 \u002F (sqrt(5) - sqrt(3))\"}'\n```\n\n> Tutel v0.3.3: Add an all-to-all benchmark:\n```sh\n  >> Example:\n\n    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.bandwidth_test --size_mb=256\n```\n\n> Tutel v0.3.2: Add a tensorcore option for extra benchmarks \u002F extend the example for custom experts \u002F allow NCCL timeout settings:\n```sh\n  >> Example of using tensorcore:\n\n    python3 -m tutel.examples.helloworld --dtype=float32\n    python3 -m tutel.examples.helloworld --dtype=float32 --use_tensorcore\n\n    python3 -m tutel.examples.helloworld --dtype=float16\n    python3 -m tutel.examples.helloworld --dtype=float16 --use_tensorcore\n\n  >> Example of custom gates\u002Fexperts:\n    python3 -m tutel.examples.helloworld_custom_gate_expert --batch_size=16\n\n  >> Example of NCCL timeout settings:\n    TUTEL_GLOBAL_TIMEOUT_SEC=60 python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --use_tensorcore\n\n```\n\n> Tutel v0.3.1: Add NCCL all_to_all_v and all_gather_v for arbitrary-length message transfers:\n```sh\n  >> Example:\n    # All_to_All_v:\n    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_to_all_v\n    # All_Gather_v:\n    python3 -m torch.distributed.run --nproc_per_node=2 --master_port=7340 -m tutel.examples.nccl_all_gather_v\n\n  >> How to:\n    net.batch_all_to_all_v([t_x_cuda, t_y_cuda, ..], common_send_counts)\n    net.batch_all_gather_v([t_x_cuda, t_y_cuda, ..])\n```\n\n> Tutel v0.3: Add a Megablocks solution to improve decoder inference on a single GPU with num_local_experts >= 2:\n```sh\n  >> Example (capacity_factor=0 required by dropless-MoE):\n    # Using BatchMatmul:\n    python3 -m tutel.examples.helloworld --megablocks_size=0 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0\n    # Using Megablocks with block_size = 1:\n    python3 -m tutel.examples.helloworld --megablocks_size=1 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0\n    # Using Megablocks with block_size = 2:\n    python3 -m tutel.examples.helloworld --megablocks_size=2 --batch_size=1 --num_tokens=32 --top=1 --eval --num_local_experts=128 --capacity_factor=0\n\n  >> How to:\n    self._moe_layer.forward(x, .., megablocks_size=1)         # Control the switch of megablocks_size (0 for disabled)\n```\n\n> Tutel v0.2: Allow most configurations to be dynamically switchable at zero cost:\n```sh\n  >> Example:\n    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld_switch --batch_size=16\n\n  >> How to:\n    self._moe_layer.forward(x, .., a2a_ffn_overlap_degree=2)  # Control the switch of overlap granularity (1 for no overlapping)\n    self._moe_layer.forward(x, .., adaptive_r=1)              # Control the switch of parallelism (0 for DP, 1 for DP + EP, W \u002F E for MP + EP, else for DP + MP + EP)\n    self._moe_layer.forward(x, .., capacity_factor=1)         # Control the switch of capacity_volume (positive for padding, negative for no-padding, 0 for dropless)\n    self._moe_layer.forward(x, .., top_k=1)                   # Control the switch of top_k sparsity\n```\n\n> Tutel v0.1: Optimize the Einsum complexity of data dispatch encoding and decoding, add a 2DH option to deal with All-to-All at scale:\n```sh\n  >> Example (2DH is suggested only at scale; note that the value of --nproc_per_node MUST equal the total physical GPU count per node, e.g. 8 for A100x8):\n    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16 --use_2dh\n```\n\n-----------\n## Getting Started\n\n### 1. Prepare PyTorch (if applicable):\n```\n* Prepare recommended PyTorch >= 2.0.0:\n        # Windows\u002FLinux PyTorch for NVIDIA CUDA >= 11.7:\n        python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n        # Linux PyTorch for AMD ROCm >= 6.2.2:\n        python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Frocm6.2.2\n        # Windows\u002FLinux PyTorch for CPU:\n        python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\n```\n\n### 2. Tutel Installation:\n```\n* Option-1: Install Tutel Online:\n\n        $ python3 -m pip uninstall tutel -y\n        $ python3 -m pip install -v -U --no-build-isolation git+https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel@main\n\n* Option-2: Build Tutel from Source:\n\n        $ git clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel --branch main\n        $ python3 -m pip uninstall tutel -y\n        $ python3 .\u002Ftutel\u002Fsetup.py install --user\n```\n\n### 3. Quick Test for Single Device \u002F CPU:\n```\n* Quick Test on Single-GPU:\n\n        $ python3 -m tutel.examples.helloworld --batch_size=16               # Test Tutel-optimized MoE + manual distribution\n        $ python3 -m tutel.examples.helloworld_ddp --batch_size=16           # Test Tutel-optimized MoE + PyTorch DDP distribution (requires: PyTorch >= 1.8.0)\n        $ python3 -m tutel.examples.helloworld_ddp_tutel --batch_size=16     # Test Tutel-optimized MoE + Tutel DDP distribution (ZeRO on optimizers)\n        $ python3 -m tutel.examples.helloworld_amp --batch_size=16           # Test Tutel-optimized MoE with AMP data type + manual distribution\n        $ python3 -m tutel.examples.helloworld_custom_gate_expert --batch_size=16 # Test Tutel-optimized MoE + custom-defined gate\u002Fexpert layer\n        $ python3 -m tutel.examples.helloworld_from_scratch                  # Test a custom MoE implementation from scratch\n        $ python3 -m tutel.examples.moe_mnist                                # Test the MoE layer end-to-end on the MNIST dataset\n        $ python3 -m tutel.examples.moe_cifar10                              # Test the MoE layer end-to-end on the CIFAR10 dataset\n\n        (If building from source, the following method also works:)\n        $ python3 .\u002Ftutel\u002Fexamples\u002Fhelloworld.py --batch_size=16\n        ..\n```\n\n### 4. Quick Test for 8 GPUs within 1 Machine:\n```\n        $ python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16\n```\n\n### 5. Quick Test for Multiple GPUs across Machines:\n```\n* Run Tutel MoE in Distributed Mode:\n\n        (Option A - Torch launcher for `Multi-Node x Multi-GPU`:)\n        $ ssh \u003Cnode-ip-0> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=\u003Cnode-ip-0> -m tutel.examples.helloworld --batch_size=16\n        $ ssh \u003Cnode-ip-1> python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=\u003Cnode-ip-0> -m tutel.examples.helloworld --batch_size=16\n\n        (Option B - Tutel launcher for `Multi-Node x Multi-GPU`, requiring the `openmpi-bin` package:)\n        # \u003C\u003C Single Node >>\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld_ddp_tutel --batch_size=16\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_mnist\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.moe_cifar10\n        ...\n\n        # \u003C\u003C MPI-based launch for GPU backend >>\n        $ mpiexec -bind-to none -host \u003Cnode-ip-0>,\u003Cnode-ip-1>,.. -x MASTER_ADDR=\u003Cnode-ip-0> -x LOCAL_SIZE=8 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16\n\n        # \u003C\u003C MPI-based launch for CPU backend >>\n        $ mpiexec -bind-to none -host localhost -x LOCAL_SIZE=1 -x OMP_NUM_THREADS=1024 python3 -m tutel.launcher.run -m tutel.examples.helloworld --batch_size=16 --device cpu\n```\n\n-----------\n\n### Advanced: Convert Checkpoint Files for Different World Sizes:\nDocumentation for checkpoint conversion has been moved [here](doc\u002FCHECKPOINT.md).\n\n### Examples: How to import Tutel-optimized MoE in PyTorch:\n```\n# Input Example:\nimport torch\nx = torch.ones([6, 1024], device='cuda:0')\n\n# Create MoE:\nfrom tutel import moe as tutel_moe\nmoe_layer = tutel_moe.moe_layer(\n    gate_type={'type': 'top', 'k': 2},\n    model_dim=x.shape[-1],\n    experts={\n        'num_experts_per_device': 2,\n        'type': 'ffn', 'hidden_size_per_expert': 2048, 'activation_fn': lambda x: torch.nn.functional.relu(x)\n    },\n    scan_expert_func = lambda name, param: setattr(param, 'skip_allreduce', True),\n)\n\n# Cast to GPU\nmoe_layer = moe_layer.to('cuda:0')\n\n# In a distributed model, you additionally need to skip the allreduce on global parameters that carry the `skip_allreduce` mask,\n# e.g.:\n#    for p in moe_layer.parameters():\n#        if hasattr(p, 'skip_allreduce'):\n#            continue\n#        dist.all_reduce(p.grad)\n\n\n# Forward MoE:\ny = moe_layer(x)\n\nprint(y)\n```\n\n### Reference\nYou can consult the paper below for more technical details about Tutel:\n```\n@article {tutel,\nauthor = {Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},\ntitle = {Tutel: Adaptive Mixture-of-Experts at Scale},\nyear = {2022},\nmonth = jun,\njournal = {CoRR},\nvolume= {abs\u002F2206.03382},\nurl = {https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03382.pdf},\n}\n```\n\n### Usage of MOELayer:\n```\n* Usage of MOELayer Args:\n\n        gate_type        : dict-type gate description, e.g. {'type': 'top', 'k': 2, 'capacity_factor': -1.5, ..},\n                              or a list of dict-type gate descriptions, e.g. [{'type': 'top', 'k': 2}, {'type': 'top', 'k': 2}];\n                              the value of k in top-gating can also be negative, e.g. -2, indicating that each GPU holds 1\u002F(-k) of one expert's parameters;\n                              capacity_factor X can be positive (factor = X), zero (factor = max(needed_volumes)) or negative (factor = min(-X, max(needed_volumes))).\n        model_dim        : the number of channels of the MoE's input tensor\n        experts          : a dict-type config for the builtin expert network\n        scan_expert_func : lets users specify a lambda function applied to each expert parameter, e.g. `scan_expert_func = lambda name, param: setattr(param, 'expert', True)`\n        result_func      : lets users specify a lambda function to format the MoE output and aux_loss, e.g. `result_func = lambda output: (output, output.l_aux)`\n        group            : the explicit communication group used for all_to_all\n        seeds            : a triple of ints specifying the manual seeds for (shared params, local params, other params after the MoE layer)\n        a2a_ffn_overlap_degree : controls the a2a overlap depth; 1 by default for no overlap, 2 to overlap a2a with half of the gemm, ..\n        parallel_type    : the parallel method used to compute the MoE; valid types: 'auto', 'data', 'model'\n        pad_samples      : whether to auto-pad newly arriving input data to the maximum data size seen so far\n\n* Usage of dict-type Experts Config:\n\n        num_experts_per_device : the number of local experts per device (defaults to 1 if not specified)\n        hidden_size_per_expert : the hidden size between the two linear layers of each expert (used for type == 'ffn' only)\n        type             : the builtin expert implementation to use, e.g. ffn\n        activation_fn    : a custom activation function between the two linear layers (used for type == 'ffn' only)\n        has_fc1_bias     : if set to False, the expert bias parameter `batched_fc1_bias` is disabled. Default: True\n        has_fc2_bias     : if set to False, the expert bias parameter `batched_fc2_bias` is disabled. Default: True\n```\n\n### Contributing\n\nThis project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https:\u002F\u002Fcla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002F). For more information, see the [Code of Conduct FAQ](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002Ffaq\u002F) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n### Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark & Brand Guidelines](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks\u002Fusage\u002Fgeneral).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.","# Tutel Quick Start Guide\n\nTutel is an optimized Mixture-of-Experts (MoE) library built for modern dynamic training and inference. It supports 'no-penalty' switching of parallelism, sparsity, and capacity, and provides efficient low-precision (NVFP4\u002FMXFP4\u002FBlockwiseFP8) inference for mainstream MoE models such as DeepSeek, Kimi, Qwen3, and GPT-OSS.\n\n## 1. Environment Setup\n\n### System requirements\n- **Operating system**: Linux (recommended) or Windows\n- **GPU support**:\n  - **NVIDIA CUDA**: fp64\u002Ffp32\u002Ffp16\u002Fbf16; recommended cards include the A100\u002FA800\u002FH100\u002FB200.\n  - **AMD ROCm**: fp64\u002Ffp32\u002Ffp16\u002Fbf16; recommended cards are the MI300 series.\n- **CPU**: fp64\u002Ffp32 supported.\n\n### Prerequisites\n- **PyTorch**: version >= 2.0 recommended.\n  - **NVIDIA CUDA (>= 11.7)**:\n    ```bash\n    python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n    ```\n  - **AMD ROCm (>= 6.2.2)**:\n    ```bash\n    python3 -m pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Frocm6.2.2\n    ```\n- **Docker**: used to run the prebuilt, highly optimized inference images (the recommended route).\n- **Hugging Face CLI**: used to download model weights.\n  ```bash\n  pip3 install -U \"huggingface_hub[cli]\" --upgrade\n  ```\n\n> **Tip**: if access to Hugging Face is restricted on your network, configure a mirror or proxy to speed up downloads.\n\n## 2. Installation\n\nTutel can be installed as a Python package, or used directly through the official Docker images for inference (recommended for large-model deployment).\n\n### Option 1: Install the Python package (for development\u002Ftraining)\n```bash\npython3 -m pip uninstall tutel -y\npython3 -m pip install -v -U --no-build-isolation git+https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel@main\n```\n*Note: this follows the README's recommended install from the GitHub main branch, which carries the latest features.*\n\n### Option 2: Use the Docker images (for inference deployment)\nNo manual dependency setup is needed; pull the image optimized for your hardware.\n- **NVIDIA GPU image**: `tutelgroup\u002Fdeepseek-671b:a100x8-chat-\u003Cversion>`\n- **AMD GPU image**: `tutelgroup\u002Fdeepseek-671b:mi300x8-chat-\u003Cversion>`\n\n## 3. Basic Usage\n\n### Scenario A: Fast large-model inference deployment (recommended)\n\nThe examples below show how to quickly launch a **DeepSeek V3.2** or **Microsoft VibeVoice** service with Docker.\n\n#### 1. Download the model weights\nUsing DeepSeek V3.2 as an example:\n```sh\nhf download nvidia\u002FDeepSeek-V3.2-NVFP4 --local-dir nvidia\u002FDeepSeek-V3.2-NVFP4\n```\n\n#### 2. Start the service (NVIDIA A100\u002FH100\u002FB200)\n```sh\ndocker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \\\n    --ulimit memlock=-1 --ulimit stack=67108864 -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n    -v \u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1:\u002Fusr\u002Flib\u002Fx86_64-linux-gnu\u002Flibcuda.so.1 --privileged \\\n    tutelgroup\u002Fdeepseek-671b:a100x8-chat-20260327 --serve=webui --listen_port 8000 \\\n      --try_path nvidia\u002FDeepSeek-V3.2-NVFP4 \\\n      --max_seq_len 32768\n```\n*Once started, open `http:\u002F\u002F0.0.0.0:8000` in a browser to use the WebUI.*\n\n#### 3. Start the service (AMD MI300)\n```sh\ndocker run -e LOCAL_SIZE=8 -e WORKER=1 -it --rm --ipc=host --net=host --shm-size=8g \\\n    --ulimit memlock=-1 --ulimit stack=67108864 --device=\u002Fdev\u002Fkfd --device=\u002Fdev\u002Fdri --group-add=video \\\n    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v \u002F:\u002Fhost -w \u002Fhost$(pwd) -v \u002Ftmp:\u002Ftmp \\\n    tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20260327 --serve=webui --listen_port 8000 \\\n      --try_path nvidia\u002FDeepSeek-V3.2-NVFP4 \\\n      --max_seq_len 1000000\n```\n\n#### 4. Send a test request\n```sh\ncurl -N -X POST http:\u002F\u002F0.0.0.0:8000\u002Fchat -d '{\"text\": \"Write a Python code of the Quicksort algorithm.\"}'\n```
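\n\nAs a Python-side alternative to the curl probe (an editor's sketch using only the standard library; it assumes the \u002Fchat endpoint simply returns text for a JSON body of the form shown above):\n```python\nimport json\nimport urllib.request\n\n# Same request contract as the curl example: POST {'text': ...} to \u002Fchat.\nreq = urllib.request.Request(\n    'http:\u002F\u002F0.0.0.0:8000\u002Fchat',\n    data=json.dumps({'text': 'Write a Python code of the Quicksort algorithm.'}).encode('utf-8'),\n)\nwith urllib.request.urlopen(req) as resp:\n    print(resp.read().decode('utf-8', errors='replace'))\n```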
\n\n---\n\n### Scenario B: Calling the MoE gating from Python\n\nIf you want Tutel's high-performance MoE gating (Qwen3-style or DeepSeek V3\u002FR1-style) inside your own model:\n\n```python\nimport torch\nfrom tutel import ops\n\n# Example 1: Qwen3-style fast MoE gating (128 experts, weights normalized to 1.0)\nlogits_fp32 = torch.softmax(torch.randn([32, 128]), -1, dtype=torch.float32).cuda()\ntopk_weights, topk_ids = ops.qwen3_moe_scaled_topk(logits_fp32)\nprint(\"Qwen3 Weights:\", topk_weights)\nprint(\"Qwen3 IDs:\", topk_ids)\n\n# Example 2: DeepSeek V3\u002FR1-style fast MoE gating (256 experts, weights normalized to 2.5)\nlogits_bf16 = torch.randn([32, 256], dtype=torch.bfloat16).cuda()\ncorrection_bias_bf16 = torch.randn([logits_bf16.size(-1)], dtype=torch.bfloat16).cuda()\ntopk_weights, topk_ids = ops.deepseek_moe_sigmoid_scaled_topk(logits_bf16, correction_bias_bf16, None, None)\nprint(\"DeepSeek Weights:\", topk_weights)\nprint(\"DeepSeek IDs:\", topk_ids)\n```\n\n### Performance Reference\nOn typical setups (such as 8x A100 or 8x MI300), Tutel delivers excellent inference throughput (tokens per second) for mainstream MoE models:\n- **DeepSeek-V3.2 (671B)**: up to 102 TPOS on A100x8 and 151 TPOS on MI300x8 (NVFP4).\n- **Kimi-K2-Instruct (1T)**: up to 153 TPOS on MI300x8 (FP8).\n- **GPT-OSS-120B**: up to 311 TPOS on a single MI300 (MXFP4).","A large fintech company is building a long-document analysis system on DeepSeek-V3.2 that must process million-character financial reports and legal contracts at maximum inference speed.\n\n### Without Tutel\n- **Memory blow-ups block deployment**: with a 671B-parameter model, conventional frameworks overflow GPU memory when loading long contexts (such as 1M tokens), forcing the team to truncate context drastically or split the model, badly hurting analysis accuracy.\n- **Unacceptable inference latency**: lacking dynamic-sparsity optimization for Mixture-of-Experts architectures, every request activates masses of redundant compute units, so generating a single answer takes seconds, far from real-time interaction.\n- **Costly quantization work**: adopting new low-precision formats such as NVFP4\u002FMXFP4 means hand-rewriting kernels the native framework lacks, weeks of error-prone development.\n- **Poor hardware utilization**: in multi-GPU training or inference, expert-routing switches carry heavy communication overhead, leaving expensive H100\u002FA100 cards idling on data synchronization most of the time.\n\n### With Tutel\n- **Long contexts carried with ease**: thanks to Tutel's no-penalty parallelism, the system deploys DeepSeek-V3.2 with a 1M-token context on a single node, preserving full document detail with no compromise truncation.\n- **Markedly faster inference**: native dynamic sparse routing activates only the necessary expert networks, and combined with FP8\u002FNVFP4 low-precision inference it cuts first-token latency to milliseconds and multiplies throughput.\n- **Advanced quantization, switched on**: the built-in NVFP4\u002FMXFP4 inference engine loads quantized weights without any model-code changes, more than halving memory use and shrinking deployment preparation from weeks to hours.\n- **Hardware running at full tilt**: optimized parallel strategies remove the expert-switching communication bottleneck, so multi-GPU clusters keep near-linear scaling under dynamic load, maximizing the return on hardware investment.\n\nWith its breakthrough sparse parallelism and low-precision optimization, Tutel turns real-time deployment of very large MoE models on long-context workloads from 'impossible' into a high-performance routine.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Tutel_6e7638f0.png","microsoft","Microsoft","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmicrosoft_4900709c.png","Open source projects and samples from Microsoft",null,"opensource@microsoft.com","OpenAtMicrosoft","https:\u002F\u002Fopensource.microsoft.com","https:\u002F\u002Fgithub.com\u002Fmicrosoft",[82,86,90,94],{"name":83,"color":84,"percentage":85},"C","#555555",71.5,{"name":87,"color":88,"percentage":89},"Python","#3572A5",20.9,{"name":91,"color":92,"percentage":93},"C++","#f34b7d",7.5,{"name":95,"color":96,"percentage":97},"Shell","#89e051",0,980,107,"2026-04-07T17:03:51","MIT",4,"Linux","Required. Supports NVIDIA CUDA (A100, A800, H100, B200) and AMD ROCm (MI300). Supported precisions include fp64\u002Ffp32\u002Ffp16\u002Fbf16 plus NVFP4\u002FMXFP4\u002FBlockwiseFP8. The examples need 8 GPUs for multi-GPU runs and 1 GPU for single-GPU runs.","Not specified (the Docker container configuration suggests --shm-size=8g)",{"notes":107,"python":108,"dependencies":109},"Docker-based deployment is the primary recommendation (via the tutelgroup\u002Fdeepseek-671b images). Supports MoE inference for DeepSeek V3.2\u002FR1, Kimi K2, Qwen3, GPT-OSS, and more. Long-context mode and large models require multi-GPU parallelism (e.g., 8x A100 or 8x MI300). AMD GPUs require mapping the \u002Fdev\u002Fkfd and \u002Fdev\u002Fdri devices and adding the video group.","3.x (inferred from the pip3 and python3 commands; the exact minor version is unspecified)",[110,111],"torch>=2.0","huggingface_hub",[35,14],[114,115,116,117,118],"pytorch","moe","mixture-of-experts","deepseek","llm","2026-03-27T02:49:30.150509","2026-04-10T07:43:46.622294",[122,127,132,137,142,146],{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},27373,"How do I set up data parallelism in Tutel so that multiple GPUs share identical expert parameters?","When creating the moe_layer, apply the following special configuration to enable classic data parallelism:\n1. Set `scan_expert_func = None` so the layer uses the traditional data-parallel method rather than the Zero-2\u002F3 approach.\n2. Set `seeds = (None, None, None)` so an outer framework (such as Deepspeed) initializes and maintains the expert parameters.\n3. Set `group = dist.new_group(ranks=[dist.get_rank()])` so the Tutel MoE layer internally falls back to the legacy data-parallel layout.\n\nExample code:\n```python\nself._moe_layer = tutel_moe.moe_layer(\n    gate_type = {'type': 'top', 'k': top_value},\n    experts = {'type': 'ffn', 'count_per_node': 4, 'hidden_size_per_expert': hidden_features},\n    model_dim = in_features,\n    scan_expert_func = None,\n    seeds = (None, None, None),\n    group = dist.new_group(ranks=[dist.get_rank()]),\n)\n```\nThis way every GPU holds its own copy of the 4 experts' parameters.","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTutel\u002Fissues\u002F52",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},27374,"Does Tutel support a hybrid strategy with expert parallelism (EP) inside each node and data parallelism (DP) across nodes?","Yes. You can create a global world group and run `global_group.all_reduce` (for all shared parameters) and `global_group.all_to_all` (for all non-shared parameters).\n\nTo reduce cross-node communication overhead, consider these optimizations:\n1. Keep EP within each node and run DP across nodes.\n2. Lower the frequency of cross-node DP synchronization (for example, one cross-node DP sync per 8 intra-node update steps).\n\nNote: the DP + EP hybrid only performs well when `world_size` is much larger than the total number of experts. If every expert is sharded across all devices, a direct all-to-all over the global group can, in theory, cost less in communication.
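\n\nBelow is a minimal sketch of that group layout in plain torch.distributed terms (an editor's illustration, not Tutel-specific code; the 2-nodes-x-8-GPUs shape and the sync-every-8-steps cadence are assumptions):\n```python\nimport torch.distributed as dist\n\ndist.init_process_group('nccl')\nrank, world = dist.get_rank(), dist.get_world_size()\nper_node = 8                                   # assumed GPUs per node\nnodes = world \u002F\u002F per_node\n# One EP group per node: ranks inside a node exchange tokens via all_to_all.\nep_groups = [dist.new_group(list(range(n * per_node, (n + 1) * per_node))) for n in range(nodes)]\nep_group = ep_groups[rank \u002F\u002F per_node]\n# One DP group per local slot: rank r syncs experts with r + 8, r + 16, ... across nodes.\ndp_groups = [dist.new_group(list(range(s, world, per_node))) for s in range(per_node)]\ndp_group = dp_groups[rank % per_node]\n\ndef maybe_sync_experts(step, expert_params):\n    # Cross-node DP sync only once every 8 intra-node steps, as suggested above.\n    if step % 8 == 0:\n        for p in expert_params:\n            dist.all_reduce(p.grad, group=dp_group)\n```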
","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTutel\u002Fissues\u002F292",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},27375,"What should I do when nccl_all_to_all_scatter_async produces incomplete output or errors out with num_local_experts > 1?","This is usually a compatibility problem caused by an old NCCL library.\nThe fix is to upgrade to, or explicitly select, a newer NCCL version by pointing the `LD_LIBRARY_PATH` environment variable at the new library path.\n\nExample command:\n```bash\nexport LD_LIBRARY_PATH=\u002Fxxx\u002Fnccl_2.10.3-1+cuda10.2_x86_64\u002Flib:$LD_LIBRARY_PATH\n```\nMake sure the NCCL version in use (e.g., 2.10.3+) matches your CUDA version; restart training and it should behave normally again.","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTutel\u002Fissues\u002F172",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},27376,"How do I fix \"Cannot import JIT optimized kernels\" or \"undefined symbol\" errors when importing Tutel?","This error usually appears after the environment is updated or Tutel is reinstalled: the precompiled custom kernel extension no longer matches the current compiler or CUDA environment.\n\nThe fix is to rebuild and reinstall the custom kernel extension:\n1. Make sure the build tools matching your current PyTorch\u002FCUDA versions are installed.\n2. Run the install command in the Tutel source directory to force a kernel rebuild:\n```bash\npython setup.py install --force\n```\nAlternatively, if you installed via pip, uninstall and reinstall with network access so the binary matching your environment is downloaded, or install from source to trigger a local build.","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTutel\u002Fissues\u002F171",{"id":143,"question_zh":144,"answer_zh":145,"source_url":126},27377,"How do I combine Pipeline Parallelism, Deepspeed Zero-1, and Tutel MoE for training when the model is too large for a single GPU?","Because Deepspeed's `PipelineModule` design is incompatible with the Zero-2\u002FZero-3 optimizations, while Tutel by default uses Zero-2\u002F3 to optimize expert-parameter memory, you have to choose:\n\nOption 1 (recommended, throughput-friendly):\nSkip `Deepspeed.initialize()` and `PipelineModule`, and instead use Tutel's built-in group manager to construct the parallel environment you need directly (for example a 2x3 data-model layout).\n\nOption 2 (Deepspeed-compatible but lower throughput):\nIf you must use Deepspeed's PipelineModule, disable Tutel's Zero-2\u002F3 optimization (see the `scan_expert_func=None` configuration from the data-parallel setup), at the cost of memory efficiency and some performance. In general, prefer Option 1 unless GPU memory is severely constrained.
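\n\nA minimal sketch of Option 2's fallback configuration (an editor's illustration assembled from the data-parallel settings referenced above; the sizes are placeholders):\n```python\nimport torch.distributed as dist\nfrom tutel import moe as tutel_moe\n\ntop_value, hidden_features, in_features = 2, 2048, 1024   # placeholder sizes\n\n# Disabling Tutel's Zero-2\u002F3 expert sharding keeps PipelineModule compatible:\n# each rank holds a full local copy of its experts and syncs them like ordinary\n# data-parallel parameters.\nmoe_layer = tutel_moe.moe_layer(\n    gate_type = {'type': 'top', 'k': top_value},\n    experts = {'type': 'ffn', 'count_per_node': 4, 'hidden_size_per_expert': hidden_features},\n    model_dim = in_features,\n    scan_expert_func = None,                          # no Zero-2\u002F3 expert optimization\n    seeds = (None, None, None),                       # let the outer framework seed parameters\n    group = dist.new_group(ranks=[dist.get_rank()]),  # per-rank group: legacy DP layout\n)\n```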
",{"id":147,"question_zh":148,"answer_zh":149,"source_url":131},27378,"Where can I find a complete end-to-end MoE training example for Tutel?","The official Tutel repository ships an end-to-end MoE training example based on modded-nanogpt.\n\nCode and documentation are available at:\nhttps:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTutel\u002Ftree\u002Fmain\u002Ftutel\u002Fexamples\u002Fmodded-nanogpt-moe\n\nThe example shows how to configure data parallelism, expert parallelism, and the related hyperparameters, making it a good way to get started and validate your environment setup.",[151,156,161,166,171,176,181,186,191,196,201,206,211],{"id":152,"version":153,"summary_zh":154,"released_at":155},180508,"v0.4.1","What's new in v0.4.1:\n\n- Updated R1 container for MI300x: `tutelgroup\u002Fdeepseek-671b:mi300x8-chat-20250319`\n\n```sh\nHow to setup:\npython3 -m pip install -v -U --no-build-isolation https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.4.1.tar.gz\n```","2025-03-20T04:14:23",{"id":157,"version":158,"summary_zh":159,"released_at":160},180509,"v0.4.0","What's new in v0.4.0:\n\n- Many enhancements to the Tutel interfaces.\n- Added support for the DeepSeek R1 671B model on a single MI300x8 machine.\n\n```sh\nHow to setup:\npython3 -m pip install -v -U --no-build-isolation https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.4.0.tar.gz\n```","2025-02-20T08:33:41",{"id":162,"version":163,"summary_zh":164,"released_at":165},180510,"v0.3.2","What's new in v0.3.2:\n\n1. Added a `--use_tensorcore` option to tutel.examples.helloworld for benchmarking.\n2. Read `TUTEL_GLOBAL_TIMEOUT_SEC` from the environment to configure the NCCL timeout.\n3. Extended `tutel.examples.helloworld_custom_expert` to show how to override the MoE with a custom expert layer.\n\n```sh\nHow to setup:\npython3 -m pip install -v -U --no-build-isolation https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.3.2.tar.gz\n```","2024-05-08T06:47:22",{"id":167,"version":168,"summary_zh":169,"released_at":170},180511,"v0.3.1","What's new in v0.3.1:\n\n1. Added 2 collective communication primitives: net.batch_all_to_all_v(), net.batch_all_gather_v().\n\n```sh\nHow to setup:\npython3 -m pip install -v -U --no-build-isolation https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.3.1.tar.gz\n```","2024-01-06T10:27:29",{"id":172,"version":173,"summary_zh":174,"released_at":175},180512,"v0.3.0","What's new in v0.3.0:\n\n1. Support for Megablocks-style dMoE inference (see README.md for more information)\n\n```sh\nHow to setup:\npython3 -m pip install -v -U --no-build-isolation https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.3.0.tar.gz\n```","2023-08-05T16:41:58",{"id":177,"version":178,"summary_zh":179,"released_at":180},180513,"v0.2.1","What's new in v0.2.1:\n\n1. Support switchable parallelism; see the example `tutel.examples.helloworld_switch`.\n\n```sh\nHow to setup:\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.2.1.tar.gz\n```","2023-03-30T22:00:15",{"id":182,"version":183,"summary_zh":184,"released_at":185},180514,"v0.2.0","What's new in v0.2.0:\n\n1. Support installation on Windows with Python 3 + Torch;\n2. Added an example showing how to enable Tutel MoE in Fairseq;\n3. Refactored the MoE layer implementation so that all features (e.g., top-X, overlap, parallel types, capacity) can be adjusted dynamically across forward iterations;\n4. New features: load_importance_loss, cosine routing, inequivalent_tokens;\n5. Extended the capacity_factor value range to zero and negative values for smarter capacity estimation;\n6. Added the tutel.checkpoint conversion tools to reformat checkpoint files, allowing existing checkpoints to be used for training or inference at a different world size.\n\n```sh\nHow to setup:\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.2.0.tar.gz\n```","2022-08-11T04:19:34",{"id":187,"version":188,"summary_zh":189,"released_at":190},180515,"v0.1.5","What's new in v0.1.5:\n\n1. Added a 2D hierarchical a2a algorithm for very-large-scale runs;\n2. Support different parallel types for MoE computation: data parallel, model parallel, auto parallel;\n3. Unified different expert granularities (e.g., normal experts, sharded experts, Megatron dense FFN) under the same programming interface and style;\n4. New feature: is_postscore, to choose whether gating scores weight the encoding or the decoding stage;\n5. Strengthened existing features: the JIT compiler, and a2a overlap under 2D.\n\n```sh\nHow to setup:\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.1.5.tar.gz\n``` \n\nContributors: @abuccts, @yzygitzh, @ghostplant, @EricWangCN","2022-02-26T07:19:06",{"id":192,"version":193,"summary_zh":194,"released_at":195},180516,"v0.1.4","What's new in v0.1.4:\n\n1. Communication enhancements: a2a overlapped with computation, process-group creation at different granularities, and more.\n2. Added a single-threaded CPU implementation for correctness checks and reference.\n3. Polished the JIT compiler interface for more flexible use: new jit::inject_source and jit::jit_execute.\n4. Richer examples: double-precision support, CUDA AMP, checkpointing, and more.\n5. Support execution inside torch.distributed.pipeline.\n\n```sh\nHow to setup:\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.1.4.tar.gz\n``` \n\nContributors: @yzygitzh, @ghostplant, @EricWangCN","2022-02-09T04:34:55",{"id":197,"version":198,"summary_zh":199,"released_at":200},180517,"v0.1.3","What's new in v0.1.3:\n\n1. Added Tutel launcher support based on Open MPI;\n2. Support building data-parallel models at initialization;\n3. Support splitting a single expert evenly across multiple GPUs;\n4. Support lists of gate layers, and forwarding the MoE layer with a chosen gate index;\n5. Fixed NVRTC compatibility when `USE_NVRTC=1` is enabled;\n6. Other implementation refinements and correctness checks.\n\n```sh\nHow to setup:\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.1.3.tar.gz\n``` \n\nContributors: @ghostplant, @EricWangCN, @guoshzhao.","2021-12-29T03:29:12",{"id":202,"version":203,"summary_zh":204,"released_at":205},180518,"v0.1.2","What's New in v0.1.2:\r\n\r\n1. General-purpose top-k gating with `{'type': 'top', 'k': 2}`;\r\n2. Add Megatron-ML Tensor Parallel as gating type;\r\n3. Add [deepspeed-based & megatron-based helloworld example](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Ftree\u002Fv0.1.x\u002Ftutel\u002Fexamples) for fair comparison;\r\n4. Add torch.bfloat16 datatype support for single-GPU;\r\n\r\n```sh\r\nHow to Setup:\r\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.1.2.tar.gz\r\n```\r\n\r\nContributors: @ghostplant, @EricWangCN, @foreveronehundred.","2021-11-16T08:26:25",{"id":207,"version":208,"summary_zh":209,"released_at":210},180519,"v0.1.1","What's New in v0.1.1:\r\n1. Enable fp16 support for AMDGPU.\r\n2. Using NVRTC for JIT compilation if available.\r\n3. Add new system_init interface for initializing NUMA settings in distributed GPUs.\r\n4. Extend more gating types: Top3Gate & Top4Gate.\r\n5. Allow high level to change capacity value in Tutel fast dispatcher.\r\n6. 
Add custom AllToAll extension for old Pytorch version without builtin AllToAll operator support.\r\n\r\n```sh\r\nHow to Setup:\r\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.1.1.tar.gz\r\n```\r\n\r\nContributors: @jspark1105 , @ngoyal2707 , @guoshzhao, @ghostplant .","2021-10-10T14:00:34",{"id":212,"version":213,"summary_zh":214,"released_at":215},180520,"v0.1.0","The first version of Tutel for efficient MoE implementation.\r\n\r\n```sh\r\nHow to setup:\r\npython3 -m pip install --user https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftutel\u002Farchive\u002Frefs\u002Ftags\u002Fv0.1.0.tar.gz\r\n```","2021-09-14T02:58:12"]