[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-sgl-project--sglang":3,"tool-sgl-project--sglang":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":78,"owner_url":79,"languages":80,"stars":119,"forks":120,"last_commit_at":121,"license":122,"difficulty_score":10,"env_os":123,"env_gpu":124,"env_ram":125,"env_deps":126,"category_tags":136,"github_topics":137,"view_count":156,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":157,"updated_at":158,"faqs":159,"releases":192},2427,"sgl-project\u002Fsglang","sglang","SGLang is a high-performance serving framework for large language models and multimodal models.","SGLang 是一款专为大型语言模型（LLM）和多模态模型打造的高性能服务框架。简单来说，它就像是一个高效能的“引擎”，旨在让 AI 模型在实际应用中的运行速度更快、成本更低且更稳定。\n\n在 AI 应用落地过程中，开发者常面临推理速度慢、并发处理能力不足以及硬件资源利用率低等挑战。SGLang 通过深度的系统优化，有效解决了这些痛点。它能够显著提升模型的吞吐量，降低响应延迟，从而让企业和个人能够以更少的计算资源支撑更大规模的用户访问。无论是处理复杂的文本对话，还是生成图像与视频，SGLang 都能提供流畅的体验。\n\n这款工具主要面向 AI 工程师、后端开发者以及研究人员。如果你正在构建基于 LLM 的应用程序，或者需要部署开源大模型（如 DeepSeek、Llama 系列等），SGLang 将是理想的基础设施选择。它不仅支持 NVIDIA GPU，还原生支持 TPU 和 AMD GPU，展现了极佳的硬件兼容性。\n\nSGLang 的技术亮点在于其卓越的执行效率和对最新模型的快速适配能力。例如，它在 NVIDIA GB200\u002FGB300 等先进硬件上能实现数倍乃至数十倍的性能提升，并率先支持了稀疏注意力机制等前沿技术。此外，SGLang 社区活跃，获得 a16z 等机构认可，能够“首日”支持各类最新开源模型，确保用户始终能用上最前沿的技术栈。对于追求极致性能和灵活性的开发团队而言，SGLang 是一个值得深入探索的开源项目。","SGLang 是一款专为大型语言模型（LLM）和多模态模型打造的高性能服务框架。简单来说，它就像是一个高效能的“引擎”，旨在让 AI 模型在实际应用中的运行速度更快、成本更低且更稳定。\n\n在 AI 
应用落地过程中，开发者常面临推理速度慢、并发处理能力不足以及硬件资源利用率低等挑战。SGLang 通过深度的系统优化，有效解决了这些痛点。它能够显著提升模型的吞吐量，降低响应延迟，从而让企业和个人能够以更少的计算资源支撑更大规模的用户访问。无论是处理复杂的文本对话，还是生成图像与视频，SGLang 都能提供流畅的体验。\n\n这款工具主要面向 AI 工程师、后端开发者以及研究人员。如果你正在构建基于 LLM 的应用程序，或者需要部署开源大模型（如 DeepSeek、Llama 系列等），SGLang 将是理想的基础设施选择。它不仅支持 NVIDIA GPU，还原生支持 TPU 和 AMD GPU，展现了极佳的硬件兼容性。\n\nSGLang 的技术亮点在于其卓越的执行效率和对最新模型的快速适配能力。例如，它在 NVIDIA GB200\u002FGB300 等先进硬件上能实现数倍乃至数十倍的性能提升，并率先支持了稀疏注意力机制等前沿技术。此外，SGLang 社区活跃，获得 a16z 等机构认可，能够“首日”支持各类最新开源模型，确保用户始终能用上最前沿的技术栈。对于追求极致性能和灵活性的开发团队而言，SGLang 是一个值得深入探索的开源项目。","\u003Cdiv align=\"center\" id=\"sglangtop\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_readme_44020c2d71e7.png\" alt=\"logo\" width=\"400\" margin=\"10px\">\u003C\u002Fimg>\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fsglang)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsglang)\n![PyPI - Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_readme_114f467f7024.png)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fsgl-project\u002Fsglang.svg)](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Ftree\u002Fmain\u002FLICENSE)\n[![issue resolution](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed-raw\u002Fsgl-project\u002Fsglang)](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues)\n[![open issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002Fsgl-project\u002Fsglang)](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002Fsgl-project\u002Fsglang)\n\n\u003C\u002Fdiv>\n\n--------------------------------------------------------------------------------\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Flmsys.org\u002Fblog\u002F\">\u003Cb>Blog\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca 
href=\"https:\u002F\u002Fdocs.sglang.io\u002F\">\u003Cb>Documentation\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Froadmap.sglang.io\u002F\">\u003Cb>Roadmap\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fslack.sglang.io\u002F\">\u003Cb>Join Slack\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fmeet.sglang.io\u002F\">\u003Cb>Weekly Dev Meeting\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials?tab=readme-ov-file#slides\">\u003Cb>Slides\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n## News\n- [2026\u002F02] 🔥 Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72 ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-02-20-gb300-inferencex\u002F)).\n- [2026\u002F01] 🔥 SGLang Diffusion accelerates video and image generation ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-01-16-sglang-diffusion\u002F)).\n- [2025\u002F12] SGLang provides day-0 support for latest open models ([MiMo-V2-Flash](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-16-mimo-v2-flash\u002F), [Nemotron 3 Nano](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-15-run-nvidia-nemotron-3-nano\u002F), [Mistral Large 3](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14213), [LLaDA 2.0 Diffusion LLM](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-19-diffusion-llm\u002F), [MiniMax M2](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-11-04-miminmax-m2\u002F)).\n- [2025\u002F10] 🔥 SGLang now runs natively on TPU with the SGLang-Jax backend ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-10-29-sglang-jax\u002F)).\n- [2025\u002F09] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-09-25-gb200-part-2\u002F)).\n- [2025\u002F09] SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention 
([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-09-29-deepseek-V32\u002F)).\n- [2025\u002F08] SGLang x AMD SF Meetup on 8\u002F22: Hands-on GPU workshop, tech talks by AMD\u002FxAI\u002FSGLang, and networking ([Roadmap](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_sglang_roadmap.pdf), [Large-scale EP](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_sglang_ep.pdf), [Highlights](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_highlights.pdf), [AITER\u002FMoRI](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_aiter_mori.pdf), [Wave](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_wave.pdf)).\n\n\u003Cdetails>\n\u003Csummary>More\u003C\u002Fsummary>\n\n- [2025\u002F11] SGLang Diffusion accelerates video and image generation ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-11-07-sglang-diffusion\u002F)).\n- [2025\u002F10] PyTorch Conference 2025 SGLang Talk ([slide](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Fsglang_pytorch_2025.pdf)).\n- [2025\u002F10] SGLang x Nvidia SF Meetup on 10\u002F2 ([recap](https:\u002F\u002Fx.com\u002Flmsysorg\u002Fstatus\u002F1975339501934510231)).\n- [2025\u002F08] SGLang provides day-0 support for OpenAI gpt-oss model ([instructions](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues\u002F8833))\n- [2025\u002F06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z ([a16z 
blog](https:\u002F\u002Fa16z.com\u002Fadvancing-open-source-ai-through-benchmarks-and-bold-experimentation\u002F)).\n- [2025\u002F05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-05-05-large-scale-ep\u002F)).\n- [2025\u002F06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-06-16-gb200-part-1\u002F)).\n- [2025\u002F03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002FDeepSeekR1-Part2\u002FREADME.html))\n- [2025\u002F03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fsglang-joins-pytorch\u002F))\n- [2025\u002F02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002FDeepSeekR1_Perf\u002FREADME.html))\n- [2025\u002F01] SGLang provides day one support for DeepSeek V3\u002FR1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. 
([instructions](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Ftree\u002Fmain\u002Fbenchmark\u002Fdeepseek_v3), [AMD blog](https:\u002F\u002Fwww.amd.com\u002Fen\u002Fdeveloper\u002Fresources\u002Ftechnical-articles\u002Famd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https:\u002F\u002Fx.com\u002Flmsysorg\u002Fstatus\u002F1887262321636221412))\n- [2024\u002F12] v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-12-04-sglang-v0-4\u002F)).\n- [2024\u002F10] The First SGLang Online Meetup ([slides](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).\n- [2024\u002F09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image\u002FVideo LLaVA-OneVision ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-09-04-sglang-v0-3\u002F)).\n- [2024\u002F07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. 
TensorRT-LLM, vLLM) ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-07-25-sglang-llama3\u002F)).\n- [2024\u002F02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-02-05-compressed-fsm\u002F)).\n- [2024\u002F01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-01-17-sglang\u002F)).\n- [2024\u002F01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA?tab=readme-ov-file#demo)).\n\n\u003C\u002Fdetails>\n\n## About\nSGLang is a high-performance serving framework for large language models and multimodal models.\nIt is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.\nIts core features include:\n\n- **Fast Runtime**: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor\u002Fpipeline\u002Fexpert\u002Fdata parallelism, structured outputs, chunked prefill, quantization (FP4\u002FFP8\u002FINT4\u002FAWQ\u002FGPTQ), and multi-LoRA batching.\n- **Broad Model Support**: Supports a wide range of language models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), reward models (Skywork), and diffusion models (WAN, Qwen-Image), with easy extensibility for adding new models. 
Compatible with most Hugging Face models and OpenAI APIs.\n- **Extensive Hardware Support**: Runs on NVIDIA GPUs (GB200\u002FB300\u002FH100\u002FA100\u002FSpark\u002F5090), AMD GPUs (MI355\u002FMI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.\n- **Active Community**: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 400,000 GPUs worldwide.\n- **RL & Post-Training Backbone**: SGLang is a proven rollout backend used for training many frontier models, with native RL integrations and adoption by well-known post-training frameworks such as [**AReaL**](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL), [**Miles**](https:\u002F\u002Fgithub.com\u002Fradixark\u002Fmiles), [**slime**](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime), [**Tunix**](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Ftunix), [**verl**](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) and more.\n\n## Getting Started\n- [Install SGLang](https:\u002F\u002Fdocs.sglang.io\u002Fget_started\u002Finstall.html)\n- [Quick Start](https:\u002F\u002Fdocs.sglang.io\u002Fbasic_usage\u002Fsend_request.html)\n- [Backend Tutorial](https:\u002F\u002Fdocs.sglang.io\u002Fbasic_usage\u002Fopenai_api_completions.html)\n- [Frontend Tutorial](https:\u002F\u002Fdocs.sglang.io\u002Freferences\u002Ffrontend\u002Ffrontend_tutorial.html)\n- [Contribution Guide](https:\u002F\u002Fdocs.sglang.io\u002Fdeveloper_guide\u002Fcontribution_guide.html)\n\n## Benchmark and Performance\nLearn more in the release blogs: [v0.2 blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-07-25-sglang-llama3\u002F), [v0.3 blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-09-04-sglang-v0-3\u002F), [v0.4 blog](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-12-04-sglang-v0-4\u002F), [Large-scale expert parallelism](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-05-05-large-scale-ep\u002F), [GB200 rack-scale 
parallelism](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-09-25-gb200-part-2\u002F), [GB300 long context](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-02-19-gb300-longctx\u002F).\n\n## Adoption and Sponsorship\nSGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations.\nAs an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 400,000 GPUs worldwide.\nSGLang is currently hosted under the non-profit open-source organization [LMSYS](https:\u002F\u002Flmsys.org\u002Fabout\u002F).\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_readme_8af1ae423d21.png\" alt=\"logo\" width=\"800\" margin=\"10px\">\u003C\u002Fimg>\n\n## Contact Us\nFor enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at [sglang@lmsys.org](mailto:sglang@lmsys.org).\n\nLong-term active SGLang contributors are eligible for coding agent sponsorship, such as Cursor, Claude Code, or OpenAI Codex. 
Email [sglang@lmsys.org](mailto:sglang@lmsys.org) with your most important commits or pull requests.\n\n## Acknowledgment\nWe learned the design and reused code from the following projects: [Guidance](https:\u002F\u002Fgithub.com\u002Fguidance-ai\u002Fguidance), [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [LightLLM](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm), [FlashInfer](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer), [Outlines](https:\u002F\u002Fgithub.com\u002Foutlines-dev\u002Foutlines), and [LMQL](https:\u002F\u002Fgithub.com\u002Feth-sri\u002Flmql).\n","\u003Cdiv align=\"center\" id=\"sglangtop\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_readme_44020c2d71e7.png\" alt=\"logo\" width=\"400\" margin=\"10px\">\u003C\u002Fimg>\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fsglang)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsglang)\n![PyPI - Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_readme_114f467f7024.png)\n[![license](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fsgl-project\u002Fsglang.svg)](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Ftree\u002Fmain\u002FLICENSE)\n[![issue resolution](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed-raw\u002Fsgl-project\u002Fsglang)](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues)\n[![open issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002Fsgl-project\u002Fsglang)](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002Fsgl-project\u002Fsglang)\n\n\u003C\u002Fdiv>\n\n--------------------------------------------------------------------------------\n\n\u003Cp align=\"center\">\n\u003Ca 
href=\"https:\u002F\u002Flmsys.org\u002Fblog\u002F\">\u003Cb>博客\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fdocs.sglang.io\u002F\">\u003Cb>文档\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Froadmap.sglang.io\u002F\">\u003Cb>路线图\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fslack.sglang.io\u002F\">\u003Cb>加入Slack\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fmeet.sglang.io\u002F\">\u003Cb>每周开发会议\u003C\u002Fb>\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials?tab=readme-ov-file#slides\">\u003Cb>幻灯片\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 新闻\n- [2026\u002F02] 🔥 使用SGLang在NVIDIA GB300 NVL72上解锁25倍推理性能（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-02-20-gb300-inferencex\u002F)）。\n- [2026\u002F01] 🔥 SGLang Diffusion加速视频和图像生成（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-01-16-sglang-diffusion\u002F)）。\n- [2025\u002F12] SGLang为最新开源模型提供开箱即用的支持（[MiMo-V2-Flash](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-16-mimo-v2-flash\u002F)、[Nemotron 3 Nano](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-15-run-nvidia-nemotron-3-nano\u002F)、[Mistral Large 3](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14213)、[LLaDA 2.0 Diffusion LLM](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-19-diffusion-llm\u002F)、[MiniMax M2](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-11-04-miminmax-m2\u002F))。\n- [2025\u002F10] 🔥 SGLang现在可通过SGLang-Jax后端原生运行在TPU上（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-10-29-sglang-jax\u002F)）。\n- [2025\u002F09] 使用PD和大规模专家并行化在GB200 NVL72上部署DeepSeek（第二部分）：预填充速度提升3.8倍，解码吞吐量提升4.8倍（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-09-25-gb200-part-2\u002F)）。\n- [2025\u002F09] SGLang为具有稀疏注意力的DeepSeek-V3.2提供开箱即用的支持（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-09-29-deepseek-V32\u002F)）。\n- [2025\u002F08] SGLang x AMD 
SF见面会于8月22日举行：GPU动手实践工作坊、AMD\u002FxAI\u002FSGLang的技术演讲以及交流活动（[路线图](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_sglang_roadmap.pdf)、[大规模EP](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_sglang_ep.pdf)、[亮点](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_highlights.pdf)、[AITER\u002FMoRI](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_aiter_mori.pdf)、[Wave](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Famd_meetup_wave.pdf))。\n\n\u003Cdetails>\n\u003Csummary>更多\u003C\u002Fsummary>\n\n- [2025\u002F11] SGLang Diffusion加速视频和图像生成（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-11-07-sglang-diffusion\u002F)）。\n- [2025\u002F10] PyTorch大会2025 SGLang演讲（[幻灯片](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials\u002Fblob\u002Fmain\u002Fslides\u002Fsglang_pytorch_2025.pdf)）。\n- [2025\u002F10] SGLang x Nvidia SF见面会在10月2日举行（[回顾](https:\u002F\u002Fx.com\u002Flmsysorg\u002Fstatus\u002F1975339501934510231)）。\n- [2025\u002F08] SGLang为OpenAI gpt-oss模型提供开箱即用的支持（[说明](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues\u002F8833)）。\n- [2025\u002F06] SGLang作为高性能推理基础设施，每日支撑数万亿次token处理，已被a16z授予第三批开源AI资助项目（[a16z博客](https:\u002F\u002Fa16z.com\u002Fadvancing-open-source-ai-through-benchmarks-and-bold-experimentation\u002F)）。\n- [2025\u002F05] 在96张H100 GPU上使用PD解耦和大规模专家并行化部署DeepSeek（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-05-05-large-scale-ep\u002F)）。\n- [2025\u002F06] 在GB200 NVL72上使用PD和大规模EP部署DeepSeek（第一部分）：解码吞吐量提升2.7倍（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-06-16-gb200-part-1\u002F)）。\n- [2025\u002F03] 在AMD Instinct 
MI300X上大幅提升DeepSeek-R1的推理性能（[AMD博客](https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002FDeepSeekR1-Part2\u002FREADME.html)）。\n- [2025\u002F03] SGLang加入PyTorch生态系统：高效的LLM推理引擎（[PyTorch博客](https:\u002F\u002Fpytorch.org\u002Fblog\u002Fsglang-joins-pytorch\u002F)）。\n- [2025\u002F02] 解锁AMD Instinct™ MI300X GPU上的DeepSeek-R1推理性能（[AMD博客](https:\u002F\u002Frocm.blogs.amd.com\u002Fartificial-intelligence\u002FDeepSeekR1_Perf\u002FREADME.html)）。\n- [2025\u002F01] SGLang为DeepSeek V3\u002FR1模型在NVIDIA和AMD GPU上提供开箱即用的支持，并进行了DeepSeek专属优化。（[说明](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Ftree\u002Fmain\u002Fbenchmark\u002Fdeepseek_v3)、[AMD博客](https:\u002F\u002Fwww.amd.com\u002Fen\u002Fdeveloper\u002Fresources\u002Ftechnical-articles\u002Famd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html)、[10多家其他公司](https:\u002F\u002Fx.com\u002Flmsysorg\u002Fstatus\u002F1887262321636221412)）。\n- [2024\u002F12] v0.4版本发布：零开销批调度器、缓存感知负载均衡器、更快的结构化输出（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-12-04-sglang-v0-4\u002F)）。\n- [2024\u002F10] 首次SGLang线上见面会（[幻灯片](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)）。\n- [2024\u002F09] v0.3版本发布：DeepSeek MLA速度提升7倍，torch.compile速度提升1.5倍，多图像\u002F视频LLaVA-OneVision（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-09-04-sglang-v0-3\u002F)）。\n- [2024\u002F07] v0.2版本发布：使用SGLang运行时加速Llama3服务（相比TensorRT-LLM、vLLM）（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-07-25-sglang-llama3\u002F)）。\n- [2024\u002F02] SGLang通过压缩有限状态机实现**JSON解码速度提升3倍**（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-02-05-compressed-fsm\u002F)）。\n- [2024\u002F01] SGLang借助RadixAttention提供高达**5倍的推理速度提升**（[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-01-17-sglang\u002F)）。\n- [2024\u002F01] SGLang支持官方**LLaVA 
v1.6**发布演示的推理服务（[使用方法](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA?tab=readme-ov-file#demo)）。\n\n\u003C\u002Fdetails>\n\n## 关于\nSGLang 是一个用于大型语言模型和多模态模型的高性能推理框架。它旨在在从单个 GPU 到大型分布式集群的各种部署环境中，提供低延迟、高吞吐量的推理服务。其核心特性包括：\n\n- **快速运行时**：通过 RadixAttention 前缀缓存、零开销 CPU 调度器、预填充与解码分离、推测性解码、连续批处理、分页注意力、张量\u002F流水线\u002F专家并行以及数据并行、结构化输出、分块预填充、量化（FP4\u002FFP8\u002FINT4\u002FAWQ\u002FGPTQ）和多 LoRA 批处理等技术，实现高效的推理服务。\n- **广泛的模型支持**：支持多种语言模型（Llama、Qwen、DeepSeek、Kimi、GLM、GPT、Gemma、Mistral 等）、嵌入模型（e5-mistral、gte、mcdse）、奖励模型（Skywork）和扩散模型（WAN、Qwen-Image），并且易于扩展以添加新模型。兼容大多数 Hugging Face 模型和 OpenAI API。\n- **广泛的硬件支持**：可在 NVIDIA GPU（GB200\u002FB300\u002FH100\u002FA100\u002FSpark\u002F5090）、AMD GPU（MI355\u002FMI300）、Intel Xeon CPU、Google TPU、Ascend NPU 等多种硬件上运行。\n- **活跃的社区**：SGLang 是开源项目，拥有充满活力的社区支持，并被广泛应用于业界，全球范围内已驱动超过 40 万张 GPU 卡。\n- **强化学习与后训练基础架构**：SGLang 是经过验证的前沿模型训练用回放后端，原生支持强化学习集成，并已被多个知名后训练框架采用，如 [**AReaL**](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL)、[**Miles**](https:\u002F\u002Fgithub.com\u002Fradixark\u002Fmiles)、[**slime**](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime)、[**Tunix**](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Ftunix)、[**verl**](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) 等。\n\n## 快速入门\n- [安装 SGLang](https:\u002F\u002Fdocs.sglang.io\u002Fget_started\u002Finstall.html)\n- [快速开始](https:\u002F\u002Fdocs.sglang.io\u002Fbasic_usage\u002Fsend_request.html)\n- [后端教程](https:\u002F\u002Fdocs.sglang.io\u002Fbasic_usage\u002Fopenai_api_completions.html)\n- [前端教程](https:\u002F\u002Fdocs.sglang.io\u002Freferences\u002Ffrontend\u002Ffrontend_tutorial.html)\n- [贡献指南](https:\u002F\u002Fdocs.sglang.io\u002Fdeveloper_guide\u002Fcontribution_guide.html)\n\n## 基准测试与性能\n更多信息请参阅发布博客：[v0.2 博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-07-25-sglang-llama3\u002F)、[v0.3 博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-09-04-sglang-v0-3\u002F)、[v0.4 
博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2024-12-04-sglang-v0-4\u002F)、[大规模专家并行](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-05-05-large-scale-ep\u002F)、[GB200 机架级并行](https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-09-25-gb200-part-2\u002F)、[GB300 长上下文](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-02-19-gb300-longctx\u002F)。\n\n## 采用与赞助\nSGLang 已被大规模部署，每天在生产环境中生成数万亿 tokens。它受到众多领先企业和机构的信任与采用，包括 xAI、AMD、NVIDIA、Intel、LinkedIn、Cursor、Oracle Cloud、Google Cloud、Microsoft Azure、AWS、Atlas Cloud、Voltage Park、Nebius、DataCrunch、Novita、InnoMatrix、MIT、UCLA、华盛顿大学、斯坦福大学、加州大学伯克利分校、清华大学、Jam & Tea Studios、Baseten 等主要科技组织。\n作为一款开源的 LLM 推理引擎，SGLang 已成为事实上的行业标准，全球范围内已有超过 40 万张 GPU 卡在运行中使用该框架。\nSGLang 目前由非营利性开源组织 [LMSYS](https:\u002F\u002Flmsys.org\u002Fabout\u002F) 托管。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_readme_8af1ae423d21.png\" alt=\"logo\" width=\"800\" margin=\"10px\">\u003C\u002Fimg>\n\n## 联系我们\n对于有意大规模采用或部署 SGLang 的企业，包括技术咨询、赞助机会或合作洽谈，请通过 [sglang@lmsys.org](mailto:sglang@lmsys.org) 与我们联系。\n\n长期积极参与 SGLang 开发的贡献者有资格获得代码助手赞助，例如 Cursor、Claude Code 或 OpenAI Codex。请将您最重要的提交记录或拉取请求发送至 [sglang@lmsys.org](mailto:sglang@lmsys.org)。\n\n## 致谢\n我们在设计过程中参考并复用了以下项目的代码：[Guidance](https:\u002F\u002Fgithub.com\u002Fguidance-ai\u002Fguidance)、[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)、[LightLLM](https:\u002F\u002Fgithub.com\u002FModelTC\u002Flightllm)、[FlashInfer](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer)、[Outlines](https:\u002F\u002Fgithub.com\u002Foutlines-dev\u002Foutlines) 和 [LMQL](https:\u002F\u002Fgithub.com\u002Feth-sri\u002Flmql)。","# SGLang 快速上手指南\n\nSGLang 是一个高性能的大语言模型（LLM）和多模态模型服务框架，旨在提供低延迟、高吞吐的推理体验。它支持广泛的硬件（NVIDIA\u002FAMD GPU, TPU, CPU 等）和模型（Llama, Qwen, DeepSeek 等），并兼容 OpenAI API。\n\n## 1. 
环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**：Linux (推荐) 或 macOS。\n*   **Python 版本**：Python 3.8 及以上版本。\n*   **硬件驱动**：\n    *   **NVIDIA GPU**：已安装正确的 NVIDIA 驱动程序和 CUDA Toolkit。\n    *   **AMD GPU**：已安装 ROCm 栈。\n    *   **其他**：若使用 TPU 或 CPU，请确保相应后端依赖已就绪。\n*   **包管理器**：建议使用 `pip` 进行安装。\n\n> **提示**：为了获得最佳性能，建议在使用 NVIDIA GPU 时安装与您的 CUDA 版本匹配的 PyTorch。\n\n## 2. 安装步骤\n\n您可以通过 pip 直接安装 SGLang。根据您的需求，可以选择安装基础版本或包含特定后端支持的版本。\n\n### 基础安装\n\n```bash\npip install sglang\n```\n\n### 安装特定后端支持（可选）\n\n如果您需要针对特定硬件或功能进行优化，可以安装额外的依赖项：\n\n*   **NVIDIA GPU (CUDA)**:\n    ```bash\n    pip install \"sglang[cuda]\"\n    ```\n*   **AMD GPU (ROCm)**:\n    ```bash\n    pip install \"sglang[rocm]\"\n    ```\n*   **TPU (JAX)**:\n    ```bash\n    pip install \"sglang[tpu]\"\n    ```\n\n> **国内加速建议**：如果下载速度较慢，可以使用国内镜像源，例如清华大学镜像：\n> ```bash\n> pip install sglang -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 3. 基本使用\n\nSGLang 提供了两种主要的使用方式：**后端服务启动**（兼容 OpenAI API）和 **前端编程接口**（Python SDK）。\n\n### 方式一：启动后端服务并调用（推荐）\n\n这是最通用的使用方式，启动一个兼容 OpenAI API 的服务端，然后使用任何兼容的客户端进行请求。\n\n**步骤 1：启动服务器**\n\n以下命令将启动一个服务，加载 `meta-llama\u002FMeta-Llama-3-8B-Instruct` 模型（首次运行会自动下载模型）：\n\n```bash\npython -m sglang.launch_server --model-path meta-llama\u002FMeta-Llama-3-8B-Instruct --port 30000\n```\n\n*   `--model-path`: 指定模型路径，可以是 Hugging Face 模型 ID 或本地路径。\n*   `--port`: 指定服务端口，默认为 30000。\n\n**步骤 2：发送请求**\n\n服务启动后，您可以使用 `curl` 或 Python 代码通过 OpenAI API 格式与之交互。\n\n**使用 curl 测试：**\n\n```bash\ncurl http:\u002F\u002Flocalhost:30000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"meta-llama\u002FMeta-Llama-3-8B-Instruct\",\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"Hello, who are you?\"}\n    ]\n  }'\n```\n\n**使用 Python 请求：**\n\n```python\nimport openai\n\nclient = openai.Client(\n    base_url=\"http:\u002F\u002F127.0.0.1:30000\u002Fv1\", api_key=\"EMPTY\"\n)\n\nresponse = client.chat.completions.create(\n    
model=\"meta-llama\u002FMeta-Llama-3-8B-Instruct\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n    ],\n    temperature=0,\n    max_tokens=64,\n)\n\nprint(response.choices[0].message.content)\n```\n\n### 方式二：使用 SGLang 前端 DSL（结构化生成）\n\nSGLang 提供了一套领域特定语言（DSL），用于更灵活地控制生成过程，特别适合复杂的多步推理或结构化输出。\n\n```python\nimport sglang as sgl\n\n# 连接到已通过方式一启动的后端服务（端口与 launch_server 保持一致）\nsgl.set_default_backend(sgl.RuntimeEndpoint(\"http:\u002F\u002Flocalhost:30000\"))\n\n@sgl.function\ndef few_shot_qa(s, question):\n    s += \"The following are questions with answers.\\n\\n\"\n    s += \"Q: What is the capital of France?\\nA: Paris\\n\"\n    s += \"Q: \" + question + \"\\nA:\"\n    s += sgl.gen(\"answer\", stop=\"\\n\")\n\n# 执行函数\nstate = few_shot_qa.run(question=\"What is the capital of Germany?\")\nprint(state[\"answer\"])\n```\n\n> **注意**：使用前端 DSL 时，需要先启动后端服务，并通过 `sgl.set_default_backend` 指向对应端点。对于生产环境，推荐使用**方式一**进行部署。","一家中型 AI 初创公司正在构建基于 DeepSeek-V3 的智能代码助手，需要同时处理大量开发者的实时代码补全请求和复杂的长上下文代码审查任务。\n\n### 没有 sglang 时\n- **推理延迟高且不稳定**：使用原生 Hugging Face Transformers 或基础 vLLM 部署时，面对高并发请求，首字生成时间（TTFT）波动剧烈，用户在输入代码后常需等待数秒才能看到响应，严重影响编码流畅度。\n- **显存利用率低**：在处理长上下文代码库时，KV Cache 管理效率低下，导致显存碎片化严重。为了容纳更多并发用户，不得不增加 GPU 节点数量，硬件成本大幅上升。\n- **新模型适配慢**：每当 DeepSeek 或其他开源模型发布新版本（如支持稀疏注意力机制的更新），团队需要花费数天时间修改底层推理代码进行适配，导致业务无法第一时间利用最新模型的性能红利。\n- **多模态扩展困难**：当尝试引入“截图识代码”功能时，原有架构难以高效协同处理图像编码与文本生成，导致多模态请求吞吐量极低，几乎不可用。\n\n### 使用 sglang 后\n- **极致推理性能**：借助 sglang 的 RadixAttention 和连续批处理技术，系统自动复用共享的代码前缀 KV Cache，首字延迟降低 50% 以上，即使在高负载下也能保持毫秒级响应，体验丝滑。\n- **显存效率显著提升**：通过高效的内存管理，单张 GPU 能支持的并发会话数翻倍。在相同硬件投入下，服务容量大幅提升，显著降低了单位 Token 的计算成本。\n- **无缝对接最新模型**：sglang 提供对 DeepSeek-V3.2 等前沿模型的“Day-0”支持，团队只需简单配置即可上线新模型，无需重写底层算子，确保业务始终处于技术前沿。\n- **原生多模态支持**：利用 sglang 统一的后端架构，轻松实现图像与文本的混合推理，视频和图像生成任务也能通过 SGLang Diffusion 加速，快速落地多模态功能。\n\nsglang
通过底层推理优化和灵活的架构设计，帮助企业在降低硬件成本的同时，实现了高性能、低延迟且易于扩展的大模型服务部署。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsgl-project_sglang_44020c2d.png","sgl-project","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsgl-project_dce16cc1.png",null,"https:\u002F\u002Fgithub.com\u002Fsgl-project",[81,85,89,93,97,101,105,108,112,116],{"name":82,"color":83,"percentage":84},"Python","#3572A5",81.5,{"name":86,"color":87,"percentage":88},"Rust","#dea584",8.2,{"name":90,"color":91,"percentage":92},"Cuda","#3A4E3A",4.7,{"name":94,"color":95,"percentage":96},"C++","#f34b7d",4,{"name":98,"color":99,"percentage":100},"Shell","#89e051",0.5,{"name":102,"color":103,"percentage":104},"C","#555555",0.3,{"name":106,"color":107,"percentage":104},"Go","#00ADD8",{"name":109,"color":110,"percentage":111},"Dockerfile","#384d54",0.2,{"name":113,"color":114,"percentage":115},"CMake","#DA3434",0.1,{"name":117,"color":118,"percentage":115},"Makefile","#427819",25348,5136,"2026-04-02T23:47:24","Apache-2.0","Linux","支持 NVIDIA GPUs (GB200, B300, H100, A100, Spark, 5090等), AMD GPUs (MI355, MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs。具体显存和CUDA版本取决于所选模型和硬件后端，README未提供统一的最低显存和CUDA版本要求。","未说明",{"notes":127,"python":125,"dependencies":128},"SGLang 是一个高性能的大语言模型和多模态模型服务框架。它支持广泛的硬件后端，包括 NVIDIA、AMD、Intel CPU、Google TPU 和华为 Ascend NPU。对于 NVIDIA GPU，通常依赖 CUDA 生态；对于 AMD GPU，依赖 ROCm 生态；对于 TPU，使用 SGLang-Jax 后端。具体安装步骤和依赖版本需参考官方文档的安装指南，因为不同硬件后端的依赖差异较大。",[129,130,131,132,133,134,135],"torch","transformers","flashinfer","vllm","outlines","guidance","lmql",[14,13,26],[138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155],"cuda","inference","llama","llm","moe","transformer","vlm","deepseek","blackwell","gpt-oss","diffusion","attention","glm","minimax","qwen","qwen-image","reinforcement-learning","wan",7,"2026-03-27T02:49:30.150509","2026-04-06T05:32:18.642509",[160,165,169,173,178,182,187],{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},11163,"如何为 DeepSeek-V3.2-Exp 模型安装 SGLang 
及依赖（包括 Docker 和源码构建）？","支持 H200, MI350 和 NPU。安装方法如下：\n\n1. **Docker 方式**：\n   - H200\u002FB200: `docker pull lmsysorg\u002Fsglang:latest`\n   - MI350\u002FMI355: `docker pull lmsysorg\u002Fsglang:dsv32-rocm`\n   - NPUs: `docker pull lmsysorg\u002Fsglang:dsv32-a2` 或 `dsv32-a3`\n\n2. **源码构建方式**：\n   ```bash\n   # 安装 SGLang\n   git clone https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\n   cd sglang\n   pip3 install pip --upgrade\n   pip3 install -e \"python[all]\"\n\n   # 安装 flash_mla\n   git clone https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA.git flash-mla\n   cd flash-mla\n   git submodule update --init --recursive\n   pip install -v .\n   ```","https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues\u002F11060",{"id":166,"question_zh":167,"answer_zh":168,"source_url":164},11164,"DeepSeek-V3.2-Exp 模型的推荐启动命令有哪些（支持 TP+DP, EP+DP 及投机采样）？","以下是几种常见的启动配置：\n\n1. **TP + DP (张量并行 + 数据并行)**:\n   ```bash\n   python -m sglang.launch_server --model deepseek-ai\u002FDeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention\n   ```\n\n2. **EP + DP (专家并行 + 数据并行)**:\n   ```bash\n   python -m sglang.launch_server --model deepseek-ai\u002FDeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention\n   ```\n\n3. 
**启用 MTP (投机采样)**:\n   ```bash\n   python -m sglang.launch_server --model deepseek-ai\u002FDeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4\n   ```",{"id":170,"question_zh":171,"answer_zh":172,"source_url":164},11165,"为什么 DeepSeek-V3.2 的吞吐量比 V3.1 低？这是预期行为吗？","是的，这是预期行为。根据维护者在评论中的回复，在 SGLang CI 的夜间性能测试中，DeepSeek-V3.2 的速度目前确实慢于 V3.1。有用户反馈在 8*H200 上使用相同镜像测试时，V3.2 的吞吐量约为 V3.1 的 60%。",{"id":174,"question_zh":175,"answer_zh":176,"source_url":177},11166,"在大规模 PD 分离（Prefill\u002FDecode Disaggregation）部署 DeepSeek 时，Prefill 和 Decode 节点的典型配置参数是什么？","以 4 个 Prefill 节点和 9 个 Decode 节点为例，关键配置如下：\n\n**Prefill 节点示例命令**:\n```bash\npython3 -m sglang.launch_server --model-path \u002Fdev\u002Fshm\u002FDeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode prefill --dist-init-addr 10.5.55.3:5757 --nnodes 4 --node-rank 0 --tp-size 32 --dp-size 32 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 524288 --max-running-requests 8192 --context-length 8192 --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek ...\n```\n\n**Decode 节点示例命令**:\n```bash\npython3 -m sglang.launch_server --model-path \u002Fdev\u002Fshm\u002FDeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.5.55.7:5757 --nnodes 9 --node-rank 0 --tp-size 72 --dp-size 72 --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --mem-fraction-static 0.835 --max-running-requests 18432 --context-length 4500 ...\n```\n注意：需确保升级 Mooncake 并使用主分支的 SGLang 和 DeepEP。","https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues\u002F6017",{"id":179,"question_zh":180,"answer_zh":181,"source_url":177},11167,"TBO (Two-Batch Overlap) 技术在少量机器上是否会导致性能下降？需要多少机器才能看到收益？","是的，TBO 在机器数量较少时可能会因为通信与调度开销导致性能下降。根据社区用户的分析和实验反馈：\n1. 至少需要 3 台机器才能使 TBO 的计算增益抵消其开销。\n2.
可能需要 5 台或更多机器才能观察到明显的性能提升。\n3. 公开的性能改进结果通常基于较大规模的集群（如 4P9D 或 4P16D）。如果在少量机器（如 1P1D）上使用 TBO 且未正确调优，性能可能会降低。",{"id":183,"question_zh":184,"answer_zh":185,"source_url":186},11168,"如何解决 DeepSeek-R1 (671B) 在空闲一段时间后出现 Prefill 卡死并抛出 \"Watchdog Timeout\" 错误的问题？","该问题通常与多节点部署时的网络配置或超时设置有关。虽然具体修复可能依赖版本更新，但以下配置有助于缓解此类问题：\n1. **增加超时时间**：在启动命令中添加 `--watchdog-timeout 1000000` 以延长看门狗超时时间。\n2. **检查网络接口**：确保正确设置了 `NCCL_SOCKET_IFNAME` 和 `GLOO_SOCKET_IFNAME` 环境变量，指向正确的网卡（如 `eth0` 或 `ens24f0`）。\n3. **禁用 Radix Cache**：在某些不稳定场景下，尝试添加 `--disable-radix-cache` 参数。\n4. **环境检查**：确保使用最新的 SGLang 镜像或源码，因为此类 Bug 可能在后续版本中已修复。如果问题持续，重启服务是临时的恢复手段。","https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues\u002F3836",{"id":188,"question_zh":189,"answer_zh":190,"source_url":191},11169,"如何在 Blackwell (GB200) 架构上复现 SGLang 的性能测试或进行初始部署？","复现 Blackwell 优化博客中的性能测试，建议使用以下特定版本的组件和环境变量：\n\n**推荐版本**:\n- SGLang: commit `2a2d3478afe8cdb336888f2e6faa3775ac40254e`\n- DeepGEMM: commit `98707282f30aad49bb2fc924332a7b40a7e7a6dd`\n- DeepEP: commit `1b14ad661c7640137fcfe93cccb2694ede1220b0` 或最新 main\n- Mooncake: `mooncake-transfer-engine==0.3.4.post2`\n- Torch: `2.8.0.dev20250613+cu128`\n\n**关键环境变量示例**:\n```bash\nexport SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048\nexport MC_TE_METRIC=true\nexport NCCL_MNNVL_ENABLE=1\nexport NCCL_CUMEM_ENABLE=1\nexport SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True\n# 设置正确的网卡名称\nexport SGLANG_LOCAL_IP_NIC=eth0\nexport GLOO_SOCKET_IFNAME=eth0\nexport NCCL_SOCKET_IFNAME=eth0\n```\n随后使用 `python3 -m sglang.launch_server` 启动，并指定相应的 disaggregation 模式。","https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fissues\u002F7227",[193,198,203,208,213,218,223,228,233,238,243,248,253,258,263,268,273,278,283,288],{"id":194,"version":195,"summary_zh":196,"released_at":197},61607,"v0.5.10rc0","# 亮点\n\n- **默认启用分段式 CUDA 图**：分段式 CUDA 图捕获现已成为默认执行模式，可降低内存开销并提升具有复杂控制流模式的模型吞吐量：#16331\n\n- **弹性 EP 实现部分故障容错**：将 Elastic NIXL-EP 集成到 SGLang 中，为 DeepSeek MoE 部署提供部分故障容错能力——当某块 GPU 
故障时，系统会重新分配专家权重，并在无需完全重启的情况下继续服务：#19248、#17374、#12068 [博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-03-25-eep-partial-failure-tolerance\u002F)\n\n- **HiSparse 用于稀疏注意力机制**：集成 HiSparse 稀疏注意力后端，通过稀疏感知的注意力机制实现高效的长上下文推理，并降低计算量：#20343\n\n- **SGLang-Diffusion 更新**：\n  * 模型支持：LTX-2、Hunyuan3D-2、Helios\n  * Qwen-image 和 Z-image 的性能提升 1.5 倍\n  * 新平台：macOS\n  * 新特性：通过整合 Cache-DiT 的所有优化，提升 diffusers 后端的性能\n  * SKILLs：欢迎探索专为开发和优化 sglang-diffusion 而精心策划的技能集！\n\n- **FlashInfer MXFP8 内核支持**：集成 FlashInfer mxfp8 内核用于 GEMM 和 MoE 操作，通过微缩放技术实现混合精度 FP8 推理，在强化学习和通用工作负载中获得更高的精度：#19537\n\n- **Transformers 5.3.0 升级**：从 transformers 4.57.1 大幅升级至 5.3.0，解锁对 HuggingFace 最新模型架构和功能的支持。在此镜像中已支持 GLM-5 模型，取代了此前自定义构建的镜像：#17784\n\n- **DeepSeek V3.2 \u002F GLM-5 优化**：**GLM-5 现可在主分支上运行（已升级 transformers）。** 针对预填充 KV 缓存获取的融合 Triton 内核、用于 K 缓存的 NSA 融合存储索引器，以及预填充阶段稀疏 MLA 注意力的可配置 KV 长度阈值——这些优化显著提升了长上下文场景下 DeepSeek V3.2 和 GLM-5 的服务吞吐量：#19319、#19148、#20062\n\n- **Qwen3.5 GDN\u002FKDA 优化**：将线性注意力状态布局从 [N, HV, K, V] 转换为 [N, HV, V, K]，并使用 Triton 内核融合 GDN 投影中的拆分\u002F重塑\u002F拼接操作；同时新增 CuTeDSL KDA 解码内核支持，进一步提升 Qwen3.5 性能：#20283、#21019、#21203\n\n- **MoE 层 LoRA 支持**：为混合专家层添加 LoRA 微调支持，配备 JIT 对齐内核、融合 Triton 内核、张量并行支持及 LoRA 目标模块的自动检测功能——从而实现对 DeepSeek 等 MoE 模型的高效适配器调优：#19710、#19711、#14105、#21439\n\n- **Qwen3 MHA 的预填充上下文并行**：为 Qwen3 MoE 等多头注意力模型启用预填充阶段的上下文并行处理，将长序列分布到多块 GPU 上，以降低每块 GPU 的显存占用并加速预填充过程：#18233\n\n- **Flash Attention 4 官方库支持**：升级至官方 FlashAttention 4 包，带来最新的注意力优化和 Blackwell GPU 支持：#2030","2026-03-28T05:58:32",{"id":199,"version":200,"summary_zh":201,"released_at":202},61608,"v0.5.9","# 亮点\n\n- **LoRA权重加载与计算重叠**：在推理过程中将LoRA权重加载与计算重叠执行，使大型适配器的TTFT降低约78%，TPOT降低约34.88%：#15512\n\n- **TRT-LLM NSA内核集成用于DeepSeek V3.2**：为原生稀疏注意力集成TRT-LLM DSA内核，在Blackwell平台上使用trtllm同时配置--nsa-prefill-backend和--nsa-decode-backend的情况下，可将DeepSeek V3.2的性能提升3至5倍（伴随轻微精度下降）：#16758、#17662、#18389\n\n- **Flashinfer全对全MoE调度器**：新增Flashinfer全对全MoE调度器，实现高效的专家并行通信，从而优化MoE模型中的路由机制：#14668\n\n- 
**FA4（FP4注意力）支持多模态编码器**：为多模态编码器引入FP4注意力后端及变长注意力函数，支持视觉-语言模型的低精度推理：#13539\n\n- **Anthropic兼容API端点**：为SGLang添加原生Anthropic API兼容性，使其能够直接与基于Anthropic API格式构建的工具和客户端集成：#18630\n\n- **SGLang-Diffusion高级优化**：已具备生产级可用性的多项改进，包括令牌级序列分片、并行VAE解码、融合内核、Nunchaku及FP8支持，以及ComfyUI插件中新增的多个模型：[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F)\n\n- **Spec V2关键漏洞修复**：修复由PyTorch垃圾回收引起的推测解码v2中的越界访问漏洞，提升推测验证的可靠性：#18958\n\n- **在GB300 NVL72上部署DeepSeek**：利用prefill-decode分离及其他SGLang特性，在NVIDIA最新GB300平台上进行长上下文推理优化工作：[博客](https:\u002F\u002Flmsys.org\u002Fblog\u002F)\n\n- **升级AITER版本至0.1.10.post3**：支持FP8预填充\u002F解码\u002FKV缓存\n\n- **docs.sglang.io中的提交到版本查询功能**：用户和开发者可轻松查找包含特定PR或提交的最早官方版本，从而简化发布跟踪流程：#18450，[链接](https:\u002F\u002Fdocs.sglang.io\u002Freferences\u002Frelease_lookup.html)\n\n## 新模型支持\n* Kimi-K2.5：#17789，[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FMoonshotai\u002FKimi-K2.5)\n* GLM-5：[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FGLM\u002FGLM-5)（仍需自定义Docker以升级transformers库，后续将推出rc版本，因为transformers升级存在一定风险）\n* Qwen 3.5：#18489、#18926、#18937，[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FQwen\u002FQwen3.5)\n* MiniMax 2.5：[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FMiniMax\u002FMiniMax-M2.5)\n* Ernie4.5-VL：#15679\n* Step3-VL：#17513\n* Step-3.5-Flash：#18084，[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FStepFun\u002FStep3.5)\n* LLaDA 2.1：[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FLLaDA\u002FLLaDA-2.1)\n* Ring 2.5 1T \u002F Ling 2.5 1T：#18598，[食谱](https:\u002F\u002Fcookbook.sglang.io\u002Fautoregressive\u002FInclusionAI\u002FRing-2.5-1T)，[食谱](https:\u002F\u002Fcookbook.sglang.i","2026-02-24T01:14:21",{"id":204,"version":205,"summary_zh":206,"released_at":207},61609,"v0.5.8","# 亮点                                                                                                                                                                                                  
\r\n\r\n- 所有主要扩散模型的整体速度最高提升1.5倍 https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-01-16-sglang-diffusion\u002F\r\n- 使用分块流水线并行技术，在超长百万标记上下文场景下接近线性扩展 https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-01-15-chunked-pipeline\u002F\r\n- 针对生产环境优化GLM4-MoE：TTFT提升65% https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-01-21-novita-glm4\u002F\r\n- EPD解聚：面向视觉-语言模型的弹性编码器扩展 https:\u002F\u002Flmsys.org\u002Fblog\u002F2026-01-12-epd\u002F\r\n\r\n## 新模型支持\r\n* GLM 4.7 Flash的开箱即用支持：#17247\r\n* LFM2模型支持：#16890\r\n* Qwen3-VL-Embedding与Qwen3-VL-Reranker模型支持：#16635、#16403\r\n* DeepSeek V3.2 NVFP4：https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002FDeepSeek-V3.2-NVFP4\r\n* [扩散模型] black-forest-labs\u002FFLUX.2-klein-9B\r\n\r\n## DeepSeek V3.2优化\r\n* 支持融合MoE、多批次及FP8 KV缓存的上下文并行优化：#13959\r\n\r\n## Flash Attention 4\r\n* 支持Flash Attention 4解码核：#16034
\r\n\r\n## SGLang-Diffusion\r\n* 可使用diffusers后端运行sglang-diffusion\r\n* 功能：多LoRA推理、SLA注意力后端、CLI中的预热切换、ComfyUI插件\r\n* 所有模型的性能均有所提升","2026-01-23T22:09:28",{"id":209,"version":210,"summary_zh":211,"released_at":212},61610,"gateway-v0.3.1","## 🚀 SMG v0.3.1 正式发布！\r\n我们很高兴地宣布 SMG v0.3.1 的重磅发布——这一版本带来了颠覆性的性能飞跃：在缓存感知路由方面实现了 10-12 倍的性能提升和 99% 的内存占用降低，同时新增企业级安全特性！\n\n## 🌲 Radix 树 \u002F 缓存感知路由：速度提升 10-12 倍，内存占用减少 99% ⚡ \n我们对缓存感知路由引擎进行了全面优化，性能与内存使用均取得显著突破：\n\n### 性能提升\n- 现在每秒可处理超过 216,000 次缓存插入操作（此前仅为 18,900 次），每次操作的延迟从 52.9 微秒降至仅 4.6 微秒。\n- 在包含 10,000 个节点的树结构中进行前缀匹配时，吞吐量由每秒 41,000 次跃升至 124,000 次。\n- 在 64 线程并发负载下，系统每秒可处理 474,000 次操作，相比之前的 59,000 次\u002F秒提升了 7.9 倍。\n\n### 数据处理能力\n- INSERT 操作的处理速率从每秒 38 MB 提升至 440 MB；\n- MATCH 操作的处理速率则从每秒 83 MB 提升至 253 MB。\n\n### 内存优化\n- 每个树节点的内存占用大幅缩减约 99%：\n  - 优化前：每个节点约 180 KB（基于 170 核机器上的 DashMap 默认配置）；\n  - 优化后：每个节点仅约 1.4 KB。\n  \n结果：在相同的内存开销下，可部署多达 100 倍的缓存条目！以典型的 10,000 个缓存前缀部署为例，内存占用将从约 1.8 GB 降至仅 14 MB，从而释放更多资源用于实际推理工作负载。\n\n影响：缓存感知路由如今速度提升了 10-12 倍，而内存占用却减少了 99%。这对于大规模多租户部署至关重要。\n\n## 🔐 JWT\u002FOIDC 身份认证\n为控制平面 API 提供生产级安全防护，原生支持业界标准的 OIDC 提供商，包括 Google、Azure、Oracle、GitHub 等。借助您已有的企业级身份认证基础设施，保护分词器管理、工作节点注册及管理员端点的安全。这对企业级部署尤为重要——可无缝将 SMG 集成到现有的身份与访问管理系统中。\n\n## 📊 分类 API 支持\n原生支持分类任务！您现在可以将分类模型与现有推理集群并行部署并提供服务，配备专门的流水线阶段和协议类型。\n\n## ✨ 其他新增特性\n- 前缀哈希负载均衡：引入基于前缀哈希的新 KV 缓存感知负载均衡策略，可在多租户环境中显著提升缓存命中率。\n- Nemotron Nano V3 解析器\n- 在途请求年龄指标：跟踪请求在处理过程中的“年龄”，以增强可观测性并更好地监控 SLA。\n\n## 🛠️ 功能增强\n### 开发者体验\n- 将 CLI 参数按逻辑分组，使命令行接口更加清晰易用；\n- 缩短日志记录目标名称（sgl_model_gateway → smg）；\n- 针对
HuggingFace 进行全面的嵌入正确性测试；\n- 在构建 wheel 包时自动生成功能。\n\n### 可靠性改进\n- 修复 IGW 对外部 OpenAI 工作节点的路由问题；\n- 规避孤儿进程引发的问题；\n- 防止子进程处理中的潜在卡顿；\n- 对上游超时采用 504 Gateway Timeout 错误响应（符合正确的 HTTP 语义）。","2026-01-09T06:18:26",{"id":214,"version":215,"summary_zh":216,"released_at":217},61611,"v0.5.7","## 亮点\n- 新模型支持：\n    - Mimo-V2-Flash 的 Day 0 支持：#15207，https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-16-mimo-v2-flash\u002F\n    - Nemotron-Nano-v3 的 Day 0 支持：https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-15-run-nvidia-nemotron-3-nano\u002F\n    - LLaDA 2.0 的 Day 0 支持：https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-19-diffusion-llm\u002F\n    - [SGLang-Diffusion] Qwen-Image-Edit-2509、Qwen-Image-Edit-2511、Qwen-Image-2512 和 Qwen-Image-Layered 的 Day 0 支持\n    - 针对热门模型的 EAGLE 3 推测解码草案模型：https:\u002F\u002Flmsys.org\u002Fblog\u002F2025-12-23-spec-bundle-phase-1\u002F\n- Model Gateway v0.3.0 发布：\nhttps:\u002F\u002Fdocs.sglang.io\u002Fadvanced_features\u002Fsgl_model_gateway.html\n- 具有动态分块支持的可扩展流水线并行，适用于超长上下文（PP 重构路线图 #11857）\n- 多模态模型的编码器解耦（路线图 #15118）\n- SGLang-Diffusion：\n  - 设置 `--dit-layerwise-offload true` 可将峰值显存占用降低多达 30GB，并使所有模型的性能提升高达 58%\n  - 显著降低 `Qwen-Image-Edit` 的延迟，在所有开源方案中位居最快之列。更多改进正在推进中\n  - 增加对 AMD\u002F4090\u002F5090 的支持，并提供额外的注意力机制选择（sage-attn、sage-attn3）、更多的并行化选项（TP）以及 HTTP API 的增强（支持 Google Vertex）\n  - 集成 Cache-dit，可将性能提升高达 165%\n\n## 变更内容\n* @iforgetmyname 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13710 中重构了自定义 allreduce 逻辑\n* @Fridge003 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14321 中更新了 DeepSeek-V3.2 的文档\n* @baonudesifeizhai 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14195 中实现了蒸馏 VAE 的通用功能和支持\n* @Johnsonms 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13812 中通过融合 Triton 内核优化了 NSA 索引器 K\u002FS 缓冲区访问\n* @mickqian 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14329 中更新了多模态相关的 CODEOWNERS 文件\n* @jinke446 在 
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14266 中修复了容器环境中使用 NPU 物理 ID 的问题\n* @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13350 中实现了多模态初始化\n* @Fridge003 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14336 中修复了 DeepSeek V32 的文档\n* @b8zhong 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14335 中同步了注意力机制和 DeepSeek 文档\n* @ShangmingCai 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14265 中为 PD 解耦支持了解码流水线\n* @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14344 中添加了图像处理器和 Transformer 结构\n* @Valentine233 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12441 中为 Qwen3-Next 支持了 chunk_gated_delta_rule 内核\n* @yuhyao 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14333 中修复了当 --deepep-mode=auto 时预填充 TBO 被禁用的问题\n* @ch-wan 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14330 中更新了部分单元测试的预计耗时","2026-01-01T10:01:57",{"id":219,"version":220,"summary_zh":221,"released_at":222},61612,"gateway-v0.3.0","## 🚀 SGLang 模型网关 v0.3.0 发布！\n我们非常高兴地宣布 SGLang 模型网关 v0.3.0 正式发布——这是一次重大版本更新，带来了强大的新功能、架构改进以及重要的破坏性变更！\n\n## ⚠️ 破坏性变更\n### 📊 指标架构全新设计\n全面重构为全新的六层指标架构，涵盖协议（HTTP\u002FgRPC）、路由器、工作节点、流式传输（TTFT\u002FTPOT）、熔断器以及策略指标，并统一了错误码。\n**需采取的行动**：请更新您的 Prometheus 仪表盘和告警规则。指标名称和结构已发生变化。\n### 🔧 基于 UUID 的工作节点资源管理\n工作节点现通过 UUID 而非端点进行标识，以实现更清晰的资源管理。\n**需采取的行动**：请更新所有与工作节点 API 交互的工具或脚本。\n\n## ✨ 新特性\n### 🌐 统一推理网关模式 (IGW)\n一个网关，管理整个集群。IGW 现在支持所有类型的路由器，并可通过 Kubernetes 服务发现机制在单个部署中实现：\n- gRPC 路由器（PD 模式及常规模式）\n- HTTP 路由器（PD 模式及常规模式）\n- OpenAI 路由器\n启用服务发现后即可自动生效。只需部署一次，即可路由所有流量——从单一网关实例处理整个推理集群中的各类流量模式。\n\n### 🔤 HTTP 接口支持分词与逆分词\n- 提供用于分词操作的直接 HTTP 端点\n- 动态分词器控制平面：可实时添加、列出、获取及移除分词器\n- 分词器注册表（TokenizerRegistry），用于高效动态加载\n\n### 🧠 解析器端点\n- `\u002Fparse\u002Freasoning` —— 解析推理输出\n- `\u002Fparse\u002Ffunction_call` —— 解析函数调用响应\n- GLM-4 函数调用解析器——由 GLM 
团队直接贡献，专为最新 GLM 模型打造\n\n### 📊 嵌入向量支持\ngRPC 路由器原生支持嵌入向量端点，使您能够将应用范围从文本生成扩展至嵌入向量任务。\n### 🔐 服务器端 TLS 支持\n通过原生 TLS 支持，为您的网关部署提供安全保障。\n### 🌐 Go 语言实现——由科大讯飞 MaaS 团队贡献\n完整的 Go 语言 SGLang 模型网关，配备兼容 OpenAI 的 API 服务器——将 SGLang 引入 Go 生态系统！\n\n## ⚡ 重大增强\n### 控制平面——工作流引擎\n智能化生命周期编排，具备以下特性：\n- 基于 DAG 的并行执行，预计算依赖图\n- 并发事件处理，实现最大吞吐量\n- 模块化的添加\u002F删除\u002F更新工作流\n\n### 性能优化\n- 无锁数据结构：使用 DashMap 进行策略查找，以及无锁路由器快照\n- 降低 CPU 开销：优化了工作节点注册表、gRPC 客户端拉取及工作节点选择流程\n- 优化路由器管理：改进了选择算法和状态管理\n\n### 弹性和可靠性：\n- 为 OpenAI 和 gRPC 路由器提供重试与熔断支持\n- 加强熔断器功能，提升状态管理能力\n- 对 TLS 和非 TLS 服务器实现优雅关闭\n- 统一错误响应，包含错误码及 X-SMG-Error-Code 头信息\n\n### 基础设施：\n- 多架构 Docker 镜像构建（Linux、macOS、Windows）","2025-12-24T22:00:56",{"id":224,"version":225,"summary_zh":226,"released_at":227},61613,"gateway-v0.2.4","## 🚀 SGLang 模型网关 v0.2.4 发布！\n我们很高兴地宣布 SGLang 模型网关 v0.2.4 正式发布——这是一次重磅更新，聚焦性能、安全性和生产级可观测性！\n\n## ✨ 亮点功能\n### ⚡ 重大性能优化\n我们在整个技术栈中投入了大量精力进行性能提升：\n- 针对缓存友好的负载均衡优化径向树实现——路由决策更智能，开销更低\n- 分词器优化——显著降低分词过程中的 CPU 和内存占用\n- 核心模块优化——HTTP 和 gRPC 路由器运行更轻量、更高效\n- 高效的 OpenTelemetry 实现——以极小的性能损耗提供生产级可观测性\n\n### 🔌 行业首创的 WASM 中间件支持\n通过 WebAssembly 实现可编程中间件！无需修改核心代码，即可使用安全、隔离的插件扩展网关功能。构建自定义路由逻辑、转换请求\u002F响应，或集成专有系统——一切尽在掌握。\n\n### 📊 生产级可观测性\n全面集成 OpenTelemetry，支持 HTTP 和 gRPC 的分布式链路追踪。借助原生 Trace 上下文传播，轻松追踪整个推理链路上的请求。现在，您终于可以真正看清自己的 LLM 基础设施运行状况。\n\n⚡ 为速度而生，为安全加固，已准备好投入生产环境。\n\n### 网关变更（98 次提交）\n\n- [model-gateway] 发布网关 0.2.4 (#14763) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14763 中\n- [Perf] 优化径向树以实现缓存友好的负载均衡 (#14758) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14758 中\n- [SMG] 性能：优化分词器，降低 CPU 和内存开销 (#14752) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14752 中\n- [model-gateway] 优化核心模块 (#14751) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14751 中\n- 微小改动：提取 select_worker_min_load (#14648) by @fzyzcjy 在 
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14648 中\n- [ci][smg] 修复 Docker 发布 CI，并将其加入 PR 测试 (#14683) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14683 中\n- 微小改动：支持 sgl-router 的 HTTP 响应状态码指标 (#14689) by @fzyzcjy 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14689 中\n- [SMG] 功能：实现 TokenGuardBody 用于管理令牌返回 (#14653) by @jimmy-evo 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14653 中\n- [model-gateway] 为 gRPC 路由器添加 OpenTelemetry 集成 (#14671) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14671 中\n- 修复：缓存友好的路由器应选择最低负载而非最小租户规模 (#14650) by @fzyzcjy 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14650 中\n- [model-gateway] 优化 HTTP 路由器的内存使用 (#14667) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14667 中\n- [model-gateway] 修复 WASM 中任意文件读取的安全漏洞 (#14664) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F14664 中\n- [model-gateway] 降低 gRPC 路由器的 CPU 开销 (#14663) by @slin1237 in https","2025-12-10T01:09:08",{"id":229,"version":230,"summary_zh":231,"released_at":232},61614,"v0.5.6","## 亮点\n- 支持 DeepSeek V3.2\u002FV3.2 Special  #14249\n- 块状扩散语言模型支持 #12588\n- 新的扩散模型支持（Flux2 #14000、Z-image #14067）\n- 引入 JIT 内核 #13453\n- 升级到 PyTorch 2.9 #12969\n- Kimi-K2-Thinking 模型增强 #12882\n- 内存管理\u002FOverlap 规范兼容性 #12224 #12839\n- 更多性能优化：DeepSeek-v3-fp4\u002FGLM-4.6\u002FKimi-K2\u002FDeepSeek-V3.2…\n- CI\u002FCD 增强\n\n\n## 变更内容\n* [router][grpc] 由 @CatherineSue 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12749 中为 responses API 添加更多 mcp 测试用例\n* [Intel] 由 @gaopengff 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F11051 中为 llama4 添加 'intel_xpu' 注意力后端\n* [Intel XPU] 由 @gaopengff 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12363 中将 pytorch xpu 
更新至 2.9\n* [Docs] 由 @mattheliu 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12764 中修复多个文档页面中的死链接\n* [mem pool] 由 @stmatengss 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12684 中修复 Mamba 中 self.device 的错误位置\n* [Fix] 由 @jimmy-evo 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F11904 中修复 HTTP 流引发异常的问题\n* [CPU] 由 @jianan-gu 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8243 中修复权重块大小下的 TP 填充情况\n* [docs] 由 @rchalamala 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12717 中移除重复的 --disable-radix-cache 选项\n* 由 @yeahdongcn 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12279 中将 uvloop 锁定至 0.21.0\n* [fix] 由 @leejnau 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12724 中默认仅对单节点服务器启用 flashinfer 全归约融合\n* chore: 由 @zhyncs 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12795 中更新 CODEOWNERS\n* 由 @nvcastet 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12715 中修复启用对称内存时 deepgemm 编译卡死的问题\n* 由 @alisonshao 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12794 中添加 bot-bump-kernel-version-to-sglang 工作流\n* 由 @rainj-me 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12782 中忽略当模型权重使用 nvfp4 和 moe ba… 时的 deepgemm 检查\n* [AMD] 由 @xintin 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12576 中将 wave-lang 更新至 3.8.2\n* [DeepSeek-V3.2][NSA] 由 @YAMY1234 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12788 中为 B200 (SM100) 上的短序列预填充启用 MHA 路径\n* [hotfix]: 由 @hzh0425 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12772 中解决 PD 部署中 is_in_ci() 的 ModuleNotFoundError\n* [HotFix]: 由 @hzh0425 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12776 中添加缺失的 SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL 环境变量\n* 由 
@gty111 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12763 中为 dots_vlm 添加 PP 支持\n* 由 @kalyank007 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F12761 中将单元测试中硬编码的 “cuda” 设备引用改为动态设备选择\n* 由 @yhyang201 在 https:\u002F\u002Fgithub.com\u002Fsgl- 中修复多模态生成问题","2025-12-03T05:11:51",{"id":234,"version":235,"summary_zh":236,"released_at":237},61615,"gateway-v0.2.3","## 🚀 SGLang 模型网关 - 新版本发布！\n\n我们很高兴地宣布 **SGLang 模型网关** 又迎来了一次强大的更新，带来了性能提升和更广泛的数据库支持！\n\n### ✨ **核心特性**\n\n**⚡ 分桶模式路由 - 性能提升 20-30%**\n我们推出了全新的 **基于分桶的路由算法**，在 PD 模式下显著提升了性能。TTFT（首个 Token 到达时间）和整体吞吐量最高可提升 **20-30%**。\n\n**💾 支持 PostgreSQL 进行聊天历史管理**\n数据存储更加灵活！我们现在除了支持 **OracleDB** 和 **内存存储** 外，还新增了对 **PostgreSQL** 的支持，用于聊天历史的管理。\n\n**🛠️ 增强的模型工具与结构化输出支持**\n- 支持 **MinMax M2** 模型！\n- 为 OpenAI 和 gRPC 路由器提供 **结构化模型输出**。\n- 在聊天完成 API 中实现 **带工具选择的流式解析**。\n- 为 Responses API 提供 **tool_choice 支持**。\n- 引入 **OutputItemDone 事件**，并支持输出项数组存储，以提升可观测性。\n\n### 🐛 **稳定性与质量改进**\n针对模型验证、流式逻辑、推理内容索引以及 CI 稳定性等方面进行了多项 bug 修复。\n\n### 🔧 **代码质量提升**\n重构了聊天和响应相关的构建器，重新组织模块以提高可维护性，并整合了错误处理机制。\n\n立即体验最新版本：`pip install sglang-router --upgrade`\n\n## 网关更新内容\n\n### 网关变更（45 次提交）\n\n- [model-gateway] smg 发布 0.2.3 (#13312) by @slin1237 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13312\n- [router] 将 e2e_response_api 中的 requests 库替换为 openai (#13293) by @XinyueZhang369 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13293\n- 修复过时的路由器文档 (#13255) by @fzyzcjy 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13255\n- [router][grpc] 优化 Minimax_M2 的文档，使其与其他解析器保持一致 (#13218) by @CatherineSue 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13218\n- 修复：在 \u002Fv1\u002Fmodels 中显示 served_model_name (#13155) by @Sunhaihua1 在 https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F13155\n- [router] Minmax-M2 XML 工具解析器 (#13148) by @slin1237 在 
https://github.com/sgl-project/sglang/pull/13148
- [router] Remove the worker URL requirement (#13172) by @slin1237 in https://github.com/sgl-project/sglang/pull/13172
- [router] Fix flaky test test_circuit_breaker_opens_and_recovers (#13164) by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/13164
- [router] Add comprehensive validation for the Responses API (#13127) by @key4ng in https://github.com/sgl-project/sglang/pull/13127
- Fix: multi-model routing for the /generate API (#12979) by @SYChen123 in https://github.com/sgl-project/sglang/pull/12979
- [router][grpc] Support the vllm backend for the gRPC router (#13120) by @CatherineSue in https://github.com/sgl-project/sglang/pull/13120
- [router] Add Minmax M2 reasoning parser (#13137) by @slin1237 in https://github.com/sgl-project/sglang/pull/13137
- [router] Support complex assi

Released: 2025-11-17

---

# v0.5.5

## Highlights
- Day-0 support for Kimi-K2-Thinking: https://huggingface.co/moonshotai/Kimi-K2-Thinking
- Day-0 support for Minimax-M2: https://huggingface.co/MiniMaxAI/MiniMax-M2
- Video and image generation support: https://lmsys.org/blog/2025-11-07-sglang-diffusion/
- Q4 roadmap: https://github.com/sgl-project/sglang/issues/12780
- Blackwell kernel optimizations and MoE runner backend refactor
- Overlap spec and prefill CUDA graph support for more models

## What's Changed
* [8/n] Decouple quantization implementations from the vLLM dependency - gguf srt by @FlamingoPg in https://github.com/sgl-project/sglang/pull/11964
* lang: support direct video inference by @mickqian in https://github.com/sgl-project/sglang/pull/9936
* Enable Llama 4 + TRTLLM MHA by @b8zhong in https://github.com/sgl-project/sglang/pull/12003
* Refactor MoE runner integration for Triton kernels by @Jonahcb in https://github.com/sgl-project/sglang/pull/11795
* MoE runner backend using flashinfer_trtllm, roughly 10% faster on b200 fp8 dpsk, by @b8zhong in https://github.com/sgl-project/sglang/pull/11816
* Fix (security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in https://github.com/sgl-project/sglang/pull/11909
* Revert "lang: support direct video inference" by @merrymercy in https://github.com/sgl-project/sglang/pull/12038
* Support piecewise CUDA graph for more models by @narutolhy in https://github.com/sgl-project/sglang/pull/11745
* [Fix] Fix lint issues to pass CI by @Fridge003 in https://github.com/sgl-project/sglang/pull/12037
* Revert "[Fix] Fix lint issues to pass CI" by @Fridge003 in https://github.com/sgl-project/sglang/pull/12042
* Fix: resolve the MMMU loading issue by @ZailiWang in https://github.com/sgl-project/sglang/pull/11759
* Optimize MHA chunked prefix: merge prefixes and extend the KV cache so MHA runs only once, by @xu-yfei in https://github.com/sgl-project/sglang/pull/10953
* Add the gguf dependency for CPU/XPU by @ZailiWang in https://github.com/sgl-project/sglang/pull/12041
* Fix: hard-coded hf repo name comparison for deepseek-ocr by @rainj-me in https://github.com/sgl-project/sglang/pull/12031
* Install numactl in the Dockerfile for GH200/GB200/GB300 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/11853
* [router] Add mTLS support for router-to-worker communication by @slin1237 in https://github.com/sgl-project/sglang/pull/12019
* Minor cleanup of send_single by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12056
* Refactor the GLM-4.5 and GLM-4.5V implementations by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/11800
* [Fix] Fix missing `ipc_name` in `__getitem__` of some IO structs by @whybeyoung in https://github.com/sgl-project/sglang/pull/12053
* Fix: bench_serving ITL computation when spec-decoding is used by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/12064
* Fix the dpsk-r1-fp4 startup crash by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/12063
* Revise POINTSV15Chat m

Released: 2025-11-06

---

# gateway-v0.2.2

## 🚀 SGLang Model Gateway v0.2.2 Released!

### ✨ Features

**🎯 Industry-First Responses API for All Models**
We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for **Llama, DeepSeek, Qwen**, and more – with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models.

**☸️ Production-Ready Kubernetes Operations**
Taking large-scale deployments seriously! We now support **native gRPC health check endpoints**, making it effortless to deploy and operate SGLang at scale on Kubernetes with proper health monitoring and orchestration.

**🔐 Your Network, Your Control**
- **mTLS Support**: Secure gateway-to-SGLang communication whether you're running on edge, remote cloud, multi-cloud, or hybrid environments – we've got you covered
- **MCP Proxy Enhancements**: Configure proxies globally or per-individual MCP server – complete network control in your hands

**🤖 Harmony Pipeline**
Introducing our unified OpenAI-native architecture with GPT OSS model support for both Responses API and Chat Completion – fully integrated with MCP and intelligent storage management.

**🌍 Universal Platform Support**
A major leap in accessibility! SGLang Model Gateway now runs on **nearly every operating system and architecture**: Linux, Windows, Mac, x86, and ARM.
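As a rough illustration of the chat-history management behind a Responses-style API, the sketch below shows how a follow-up request can pass only a `previous_response_id` and inherit the full conversation. All names (`ResponseStore`, `create`, `history`) are invented for illustration; this is not the gateway's actual implementation, which lives in Rust with pluggable storage backends.

```python
# Illustrative sketch of Responses-style history chaining: each stored response
# remembers its parent, so a follow-up needs only `previous_response_id`.
# All identifiers here are hypothetical, not the gateway's real API.
import itertools


class ResponseStore:
    def __init__(self):
        self._responses = {}          # response_id -> (previous_id, messages)
        self._ids = itertools.count(1)

    def create(self, input_text, previous_response_id=None):
        # Reconstruct history by following the chain from the parent response.
        history = self.history(previous_response_id) if previous_response_id else []
        messages = history + [{"role": "user", "content": input_text}]
        # Stand-in for an actual model call.
        reply = {"role": "assistant", "content": f"echo: {input_text}"}
        response_id = f"resp_{next(self._ids)}"
        self._responses[response_id] = (previous_response_id, messages + [reply])
        return response_id

    def history(self, response_id):
        _previous_id, messages = self._responses[response_id]
        return list(messages)


store = ResponseStore()
first = store.create("hello")
second = store.create("and again", previous_response_id=first)
assert len(store.history(second)) == 4  # two user turns, two replies
```

The point of the pattern is that the client never re-sends the transcript; the server resolves it from the chained IDs, which is what lets the gateway keep conversation data local while proxying upstream.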
Even better – we support **all Python versions from 3.8 to 3.14 in a single wheel file**, while reducing wheel size by **more than 40%**. Deploy anywhere, on any Python version, with unprecedented efficiency!

**⚡ Additional Enhancements**
- Multi-worker URL support for better load distribution
- Connection pooling and tool inventory for MCP
- Native OpenAI web search tool support and function calling for OpenAI router

### 🐛 Stability Improvements
We've squashed numerous bugs including background task handling, tool call IDs, conversation management, and installation dependencies.

Try it now: `pip install sglang-router==0.2.2`

---

## What's Changed in Gateway

### Gateway Changes (48 commits)

- [router] 0.2.2 release (#12399) by @slin1237 in https://github.com/sgl-project/sglang/pull/12399
- [router] web_search_preview tool basic implementation (#12290) by @key4ng in https://github.com/sgl-project/sglang/pull/12290
- [router] Function call support for openai router Responses API (#12386) by @key4ng in https://github.com/sgl-project/sglang/pull/12386
- [router] Fix safety_identifier missing (#12404) by @key4ng in https://github.com/sgl-project/sglang/pull/12404
- [router] use safety_identifier replace user on chat history storage (#12185) by @lengrongfu in https://github.com/sgl-project/sglang/pull/12185
- [router] harmony responses api streaming support (#12395) by @slin1237 in https://github.com/sgl-project/sglang/pull/12395
- [router] Harmony Pipeline: Chat Completion & Responses API with MCP Support (#12153) by @slin1237 in https://github.com/sgl-project/sglang/pull/12153
- [bug] fix router installation to include additional dependency (#12348) by @slin1237 in https://github.com/sgl-project/sglang/pull/12348
- [router] refactor mcp to use LRU and fix pooling bug (#12346) by @CatherineSue in https://github.com/sgl-project/sglang/pull/12346
- [bug] fix router pypi license file (#12345) by @slin1237 in https://github.com/sgl-project/sglang/pull/12345
- [router] fix router release workflow and add build test in PR (#12315) by @CatherineSue in https://github.com/sgl-project/sglang/pull/12315
- [Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang (#12338) by @sufeng-buaa in https://github.com/sgl-project/sglang/pull/12338
- [router][grpc] Fix inconsistent behavior of conversation_id not found (#12299) by @CatherineSue in https://github.com/sgl-project/sglang/pull/12299
- [router] support arm, windows, mac, linux, reduce wheel size and number (#12285) by @slin1237 in https://github.com/sgl-project/sglang/pull/12285
- [rust][ci] Add end-to-end tests for Oracle history backend (#12233) by @key4ng in https://github.com/sgl-project/sglang/pull/12233
- [router] upgrade grpc dependency and py 3.13 3.14 support (#12284) by @slin1237 in https://github.com/sgl-project/sglang/pull/12284
- [router] Fix type unmatch during validation (#12257) by @key4ng in https://github.com/sgl-project/sglang/pull/12257
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 (#10804) by @sufeng-buaa in https://github.com/sgl-project/sglang/pull/10804
- [router] configure workflow retries and timeout based on routerConfig (#12252) by @slin1237 in https://github.com/sgl-project/sglang/pull/12252
- [router] use mcp struct from sdk and clean up code across codebase (#12249) by @slin1237 in https://github.com/sgl-project/sglang/pull/12249
- [router] remove code duplication (#12245) by @slin1237 in https://github.

Released: 2025-11-17

---

# v0.5.4

## Highlights
- AMD AI Dev Day 2025 SGLang ([slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_amd_ai_devday_2025.pdf)), PyTorch Conference 2025 SGLang ([slides](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/sglang_pytorch_2025.pdf))
- Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
- [beta] Overlap scheduler for speculative decoding: https://github.com/sgl-project/sglang/issues/11762
- [beta] Piecewise CUDA graph for prefill: https://github.com/sgl-project/sglang/issues/11490
- Prefix cache for qwen3 next and GDN/mamba models: https://github.com/sgl-project/sglang/pull/11214
- Full set of optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling): https://docs.sglang.ai/basic_usage/deepseek_v32.html, https://github.com/sgl-project/sglang/issues/11989
- Various Blackwell kernel optimizations
- DGX Spark support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
- KTransformers integration: https://lmsys.org/blog/2025-10-22-KTransformers/
- New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
- Native ModelOpt quantization support

## What's Changed
* [router] add ipv6 support across all components by @slin1237 in
https://github.com/sgl-project/sglang/pull/11219
* Remove env var warnings for release by @merrymercy in https://github.com/sgl-project/sglang/pull/11262
* Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/7149
* [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in https://github.com/sgl-project/sglang/pull/11270
* disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in https://github.com/sgl-project/sglang/pull/11274
* docker: add manifest to versioned docker releases by @ishandhanani in https://github.com/sgl-project/sglang/pull/11268
* [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in https://github.com/sgl-project/sglang/pull/11182
* [router][grpc] Refine streaming processes by @CatherineSue in https://github.com/sgl-project/sglang/pull/11277
* Fix code sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/11276
* [Auto Sync] Update test_utils.py (20251006) by @merrymercy in https://github.com/sgl-project/sglang/pull/11280
* Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in https://github.com/sgl-project/sglang/pull/11279
* Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in https://github.com/sgl-project/sglang/pull/11238
* Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/11261
* fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/11282
* docs: update sgl-kernel README by @zhyncs in https://github.com/sgl-project/sglang/pull/11286
* chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in https://github.com/sgl-project/sglang/pull/11281
* [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in https://github.com/sgl-project/sglang/pull/11283
* convert test_deterministic into unit tests by @skyzh in https://github.com/sgl-project/sglang/pull/11095
* Feature/longbench v2 evaluation utils by @alhridoy in https://github.com/sgl-project/sglang/pull/10949
* [ci] fix pp test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11294
* EAGLE cache fix for SWARadixCache by @ispobock in https://github.com/sgl-project/sglang/pull/11231
* Remove overlap thread by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11210
* [router] add reasoning and tool parser argument in router by @slin1237 in https://github.com/sgl-project/sglang/pull/11290
* Remove sampling info events and overlap thread file by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11300
* Introduce future indices by @hnyls2002 in https://github.com/sgl-project/sglang/pull/11301
* [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/11068
* [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in https://github.com/sgl-project/sglang/pull/11302
* [router] add get server info and get model info in grpc server by @slin1237 in https://github.com/sgl-project/sglang/pull/11303
* [router][grpc] Refactor chat template content format detection by @CatherineSue in https://github.com/sgl-project/sglang/pull/11288
* [Doc] HiCache Design Documents by @ykwd in https://github.com/sgl-project/sglang/pull/11027
* [Doc]: Best Practice for HICache by @hzh0425 in https://github.com/sgl-project/sglang/pull/11001
* [router] fix grpc connection conversion and add optimization by @slin1237 in https://github.com/sgl

Released: 2025-10-26

---

# gateway-v0.2.1

## 🚀 SGLang Model Gateway v0.2.1 Released!

This release focuses on stability, cleanup, and two big new performance features.

### 🧾 Docs & CI
- Updated router documentation to reflect recent feature additions

### 🧹 Code Cleanup
- Refactored StopSequenceDecoder for cleaner incremental decoding
- Added spec.rs test harness under spec/ for structured unit tests

### 🐞 Bug Fixes
- Fixed UTF-8 boundary in stop-sequence decoding
- Fixed gRPC timeout configuration
- Fixed worker filtering, tool-choice normalization, and bootstrap-port handling
- Additional gRPC server warm-up and concurrency fixes

### 🌟 New Features
- Two-Level Tokenizer Caching (L0 + L1)
  - L0: exact-match cache for repeated
prompts
  - L1: prefix-aware cache at special-token boundaries
- OpenAI-Style Classification API → new /v1/classifications endpoint; shout out to yanbo for the contribution
- Worker Management Workflow Engine → improved async registration, worker self-discovery, and health orchestration

## What's Changed in Gateway

### Gateway Changes (26 commits)

- [router] release router 0.2.1 (#11885) by @slin1237 in https://github.com/sgl-project/sglang/pull/11885
- [router][grpc] Fix wram-up random token ids for small models (#11887) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11887
- [router] clean up workflow logs to debug for implementation details logs (#11886) by @slin1237 in https://github.com/sgl-project/sglang/pull/11886
- fix(sql-router): fix conflict port in test (#11826) by @htiennv in https://github.com/sgl-project/sglang/pull/11826
- [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` (#11882) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11882
- [router] remove encoding header for oai router (#11881) by @slin1237 in https://github.com/sgl-project/sglang/pull/11881
- [router] Worker Management Workflow Engine (#11868) by @slin1237 in https://github.com/sgl-project/sglang/pull/11868
- [2/2] [feature] support openai like classification api in router (#11670) by @whybeyoung in https://github.com/sgl-project/sglang/pull/11670
- [router] Add Configurable L0 and L1 Tokenizer Caching (#11688) by @slin1237 in https://github.com/sgl-project/sglang/pull/11688
- [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11798
- [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files (#11685) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11685
- [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log (#11775) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11775
- [doc] update router document (#11767) by @key4ng in https://github.com/sgl-project/sglang/pull/11767
- [router] fix grpc client time out to 1h (#11768) by @slin1237 in https://github.com/sgl-project/sglang/pull/11768
- [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder (#11766) by @slin1237 in https://github.com/sgl-project/sglang/pull/11766
- Revert "[router] fix get_models endpoint for openai router (#11687)" (#11740) by @key4ng in https://github.com/sgl-project/sglang/pull/11687
- [router] Add rustfmt and set group imports by default (#11732) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11732
- [router] add spec.rs to enables tests under spec folder (#11734) by @key4ng in https://github.com/sgl-project/sglang/pull/11734
- [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut (#11731) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11731
- [router][grpc] add dissag info to warm up in grpc server (#11727) by @slin1237 in https://github.com/sgl-project/sglang/pull/11727
- [router] fix p and d worker filtering and bootstrap port handling (#11729) by @slin1237 in https://github.com/sgl-project/sglang/pull/11729
- [Router] Refactor protocol definitions: split spec.rs into modular files (#11677) by @key4ng in https://github.com/sgl-project/sglang/pull/11677
- [router] fix get_models endpoint for openai router (#11687) by @key4ng in https://github.com/sgl-project/sglang/pull/11687
- [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676) by @slin1237 in https://github.com/sgl-project/sglang/pull/11676
- [router][grpc] Simplify model_id determination (#11684) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11684
- [router] Fix response api related spec (#11621) by @key4ng in https://github.com/sgl-project/sglang/pull/11621

### Paths Included

- `sgl-router`
- `python/sglang/srt/grpc`
- `python/sglang/srt/entrypoints/grpc_server.py`

**Full Changelog**: https://github.com/sgl-project/sglang/compare/gateway-v0.2.0...gateway-v0.2.1

Released: 2025-11-17

---

# gateway-v0.2.0

## 🚀 Release: SGLang Model Gateway v0.2.0 (formerly "SGLang Router")

## 🔥 What's new

### 🧠 Multi-Model Inference Gateway (IGW) Mode
IGW turns one router into many — letting you manage multiple models at once, each with its own routing policy, priorities, and metadata.
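The per-model routing idea behind IGW mode can be pictured with a small sketch: one routing-table entry per registered model, each carrying its own worker list and policy. All names and URLs below are invented for illustration; the real gateway implements this in Rust, with health checks, circuit breakers, and richer policies on top.

```python
# Hypothetical sketch of multi-model routing: each registered model has its
# own workers and selection policy. Not the gateway's real API or code.
import itertools
import random


class MiniGateway:
    def __init__(self):
        self.models = {}  # model id -> {"workers": [...], "policy": ..., "rr": iterator}

    def register(self, model, workers, policy="round_robin"):
        # Mirrors the idea of dynamically registering a model's worker fleet.
        self.models[model] = {
            "workers": list(workers),
            "policy": policy,
            "rr": itertools.cycle(workers),  # state for round-robin selection
        }

    def route(self, model):
        entry = self.models[model]
        if entry["policy"] == "round_robin":
            return next(entry["rr"])
        return random.choice(entry["workers"])


gw = MiniGateway()
gw.register("llama", ["http://w1:8000", "http://w2:8000"])
gw.register("deepseek", ["http://w3:8000"], policy="random")
assert gw.route("llama") == "http://w1:8000"   # round-robin: first worker
assert gw.route("llama") == "http://w2:8000"   # round-robin: second worker
assert gw.route("deepseek") == "http://w3:8000"
```

The key property is that routing state is scoped per model, so two models can use different policies over disjoint worker fleets while sharing one gateway process.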
Think of it as running several routers under one roof, with shared reliability, observability, and API surface.
You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway handle routing, health checks, and load balancing.
Whether you're mixing Llama, Mistral, and DeepSeek, or orchestrating per-tenant routing in enterprise setups, IGW gives you total control.
Your fleet, your rules. ⚡

### ⚡ gRPC Mode: Rust-Powered, Built for Throughput
This is the heart of 0.2.0. The new gRPC data plane runs entirely in Rust — tokenizer, reasoning parser, and tool parser included — giving you native-speed performance and lower latency.
You can connect to gRPC-based SGLang workers, stream tokens in real time, and even handle OpenAI-compatible APIs.

### 🌐 OpenAI-Compatible Gateway
Seamlessly proxy requests to OpenAI, while keeping data control local.
Conversation history, responses, and background jobs all flow through the gateway — same API, enterprise privacy.

### 💾 Pluggable History Storage
Choose between memory, none, or oracle for conversation and /v1/responses data.
- memory: Fastest for ephemeral runs.
- none: Zero persistence, zero latency overhead.
- oracle: Full persistence via Oracle ATP with connection pooling and credentials support.

### 🧩 Pluggable MCP Integration
The gateway now natively speaks MCP across all transports (STDIO, HTTP, SSE, Streamable), so your tools can plug directly into reasoning and response loops — perfect for agentic workflows and cross-model orchestration.

### 🛡️ Reliability & Observability Upgrades
Built-in:
- Retries with exponential backoff + jitter
- Per-worker circuit breakers
- Token-bucket rate limiting & FIFO queuing
- Prometheus metrics for latency, load, queue depth, PD pipelines, tokenizer speed, and MCP activity
- Structured tracing & request-ID propagation

✨ SGLang Model Gateway v0.2.0 — built in Rust, designed for scale, ready for reasoning.

## What's Changed in Gateway

### Gateway Changes (238 commits)

- [router] upgrade to 0.2.0 (#11642) by @slin1237 in https://github.com/sgl-project/sglang/pull/11642
- [router] add worker self discovery for metadata (#11638) by @slin1237 in https://github.com/sgl-project/sglang/pull/11638
- [router][grpc] add warm up to grpc server (#11627) by @slin1237 in https://github.com/sgl-project/sglang/pull/11627
- [router] update router readme to latest features (#11619) by @slin1237 in https://github.com/sgl-project/sglang/pull/11619
- [router] add chang and keyang to sgl router author (#11620) by @slin1237 in https://github.com/sgl-project/sglang/pull/11620
- [router] cleanup app context and move to startup (#11617) by @slin1237 in https://github.com/sgl-project/sglang/pull/11617
- [router] add py binding and readme for openai router and history backend (#11453) by @key4ng in https://github.com/sgl-project/sglang/pull/11453
- [router] when given both local tokenizer and chat template, log all (#11601) by @slin1237 in https://github.com/sgl-project/sglang/pull/11601
- [router] allow router launch server to use grpc mode (#11600) by @slin1237 in https://github.com/sgl-project/sglang/pull/11600
- [router] delete useless table content comment in spec (#11597) by @slin1237 in https://github.com/sgl-project/sglang/pull/11597
- [router] change worker api to async instead of sync (#11566) by @slin1237 in https://github.com/sgl-project/sglang/pull/11566
- [router] update generate spec to align with sgl io struct (#11591) by @slin1237 in https://github.com/sgl-project/sglang/pull/11591
- [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11588
- [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11564
- [router][grpc] Add error handling to `generate_tool_constraints` (#11562) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11562
- [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483) by @Jonahcb in https://github.com/sgl-project/sglang/pull/11483
- [router] allow user to specify chat template path (#11549) by @slin1237 in https://github.com/sgl-project/sglang/pull/11549
- [router][grpc] Further delegate non-stream processing to `processing.rs` (#11553) by @CatherineSue in https://github.com/sgl-project/sglang/pull/11553
- [router][Fix] Include grpc reflection runtime dependency (#11419) by @ai-jz in https://github.com/sgl-project/sglang/pull/11419
- [router] allow tokenizer path to be dir (#11530) by @slin1237 in https://sg

Released: 2025-11-17

---

# v0.5.3

## Highlights
- Day 0 support for DeepSeek-V3.2 with Sparse Attention: https://lmsys.org/blog/2025-09-29-deepseek-V32/
- Deterministic inference on multiple attention backends: https://lmsys.org/blog/2025-09-22-sglang-deterministic/
- Integration of FlashAttention 4 prefill kernels
- Enhanced support for Qwen3-Next with MTP, DP, optimized
kernels and multiple hardware platforms
- Support models including Qwen3-VL series, dots.vlm1, Ling-V2, Apertus, SOLAR

## What's Changed
* [Auto Sync] Update server_args.py (20250912) by @merrymercy in https://github.com/sgl-project/sglang/pull/10347
* [CPU][doc] add torch.compile param in example commands by @ZailiWang in https://github.com/sgl-project/sglang/pull/10349
* [router][ci] Add gpu utilization analyze with nvml by @key4ng in https://github.com/sgl-project/sglang/pull/10345
* [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked by @wenscarl in https://github.com/sgl-project/sglang/pull/9199
* fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale by @trevor-m in https://github.com/sgl-project/sglang/pull/10296
* model: support Apertus by @EduardDurech in https://github.com/sgl-project/sglang/pull/9774
* fix dual stream bug by @yizhang2077 in https://github.com/sgl-project/sglang/pull/10352
* [router] Basic OAI Response api by @key4ng in https://github.com/sgl-project/sglang/pull/10346
* Implement Standalone gRPC Server for SGLang Python Scheduler by @CatherineSue in https://github.com/sgl-project/sglang/pull/10283
* support memory_pool_host page first direct layout by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/10031
* fix the break in FlashInferFusedMoE by @chenqianfzh in https://github.com/sgl-project/sglang/pull/10356
* fix: resolve transfer_kv_all_layer_direct_lf_pf import error by @zhyncs in https://github.com/sgl-project/sglang/pull/10360
* Support LingV2 model by @strgrb in https://github.com/sgl-project/sglang/pull/10359
* Fix Bailing MoE model bugs by @yuan-luo in https://github.com/sgl-project/sglang/pull/10362
* Revert add mainprocess's proctitle by @whybeyoung in https://github.com/sgl-project/sglang/pull/10351
* model: support dots.vlm1 model by @yonghenglh6 in https://github.com/sgl-project/sglang/pull/8778
* Support loading weights from remote instance by @amysaq2023 in https://github.com/sgl-project/sglang/pull/8215
* add qwen3-next ut by @yizhang2077 in https://github.com/sgl-project/sglang/pull/10355
* Fix chunked prefix cache for nvfp4 by @wenscarl in https://github.com/sgl-project/sglang/pull/10180
* Fix FA4 import cause moe_fused_gate output be illegal memory by @fzyzcjy in https://github.com/sgl-project/sglang/pull/10368
* Fix global input scale incompatible with CuTe DSL moe by @fzyzcjy in https://github.com/sgl-project/sglang/pull/10370
* [router] Add Rerank Routing Logic in Regular Router by @fangjian601 in https://github.com/sgl-project/sglang/pull/10219
* [router] enable sccache in ci and local build by @slin1237 in https://github.com/sgl-project/sglang/pull/10099
* fix: add fast path for function call by @yizhang2077 in https://github.com/sgl-project/sglang/pull/9023
* [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) by @merrymercy in https://github.com/sgl-project/sglang/pull/10333
* fix: resolve gb200 image link by @zhyncs in https://github.com/sgl-project/sglang/pull/10343
* fix: exclude protobuf generated code by @zhyncs in https://github.com/sgl-project/sglang/pull/10388
* [bug] fix ci syntax by @slin1237 in https://github.com/sgl-project/sglang/pull/10390
* Fix GPU fault issue when run dsv3 with dp mode and enable torch-compile by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/10361
* feat: add deepseek v3 fp4 ut by @zhyncs in https://github.com/sgl-project/sglang/pull/10391
* Add sentencepiece to project dependencies by @mmangkad in https://github.com/sgl-project/sglang/pull/10386
* [router] allow one router to support different model families and serving mode by @slin1237 in https://github.com/sgl-project/sglang/pull/10244
* [router] Add get and cancel method for response api by @key4ng in https://github.com/sgl-project/sglang/pull/10387
* Benchmark: Support API_KEY without 'bearer' by @Muqi1029 in https://github.com/sgl-project/sglang/pull/10380
* Support Qwen3-Next on Ascend NPU by @iforgetmyname in https://github.com/sgl-project/sglang/pull/10379
* [HiCache] fix mooncake config in different tp size by @stmatengss in https://github.com/sgl-project/sglang/pull/10377
* [HiCache] doc: update deployment in readme by @stmatengss in https://github.com/sgl-project/sglang/pull/10332
* [router] add not implemented functions for multi model trait by @slin1237 in https://github.com/sgl-project/sglang/pull/10394
* [Auto Sync] Update xgrammar_backend.py (20250913) by @merrymercy in https://github.com/sgl-project/sglang/pull/10395
* fix probs name which without temp scaling n

Released: 2025-10-06

---

# v0.5.2

## Highlights

- SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends: https://lmsys.org/blog/2025-09-10-sglang-hicache/

## What's Changed
* feat: allow use local branch to build image by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9546
* [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/9547
* [doc] deepseekv31 support by @XiaotongJiang in https://github.com/sgl-project/sglang/pull/9544
* fix(grok): remove duplicate replicate_lm_head configuration by @vincentzed in https://github.com/sgl-project/sglang/pull/9549
* chore: update configurer by @zhyncs in https://github.com/sgl-project/sglang/pull/9557
* chore: bump v0.5.1.post1 by @zhyncs in https://github.com/sgl-project/sglang/pull/9558
* [router] add right rustls dependency in sgl-router cargo.toml by @Bruce-x-1997 in https://github.com/sgl-project/sglang/pull/9498
* fix: use sgl-kernel 0.3.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/9565
* Add target module validation for init adapters by @Beichen-Ma in https://github.com/sgl-project/sglang/pull/9429
* fix: Update OpenAI client base URL in documentation by @JustinTong0323 in
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9576\r\n* [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats by @SCDESPERTATE in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F7317\r\n* remove redundant rank0_log function. by @miter6 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9560\r\n* Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM by @HydraQYH in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9559\r\n* Reintroduce memory usage fix by @fzyzcjy in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9535\r\n* Offload tensors by sharding on GPU by @fzyzcjy in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9536\r\n* bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool by @CiaranZhou in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9229\r\n* chore: upgrade flashinfer 0.2.14.post1 by @zhyncs in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9578\r\n* fix: revert #8593 by @zhyncs in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9581\r\n* fix: resolve tuning fused moe issue by @zhyncs in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9587\r\n* Tiny fix wrong comments by @fzyzcjy in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9589\r\n* chore: update config by @zhyncs in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9591\r\n* chore: bump v0.5.1.post2 by @zhyncs in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9592\r\n* [Doc] add LWS(LeaderWorkerSet) use case in sgl-router README by @Bruce-x-1997 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9568\r\n* [Performance] Batch Send from Tokenizer Manager. 
by @sundar24295s in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9436\r\n* Fix GLM45 tool call multi-turn bug by @byjiang1996 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9500\r\n* Fix GLM45v launch server cuda torch compile bug by @byjiang1996 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9554\r\n* Fix Harmony reasoning parser for and auto-separation for gpt-oss models by @jonaslsaa in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9190\r\n* [docs] Refactor, remove compiled results and add gpt-oss by @zhaochenyang20 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9613\r\n* [Fix] HiCache Bugfix & Mooncake Error Handling Enhance by @ykwd in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8901\r\n* Improve bench_one_batch_server script by @hnyls2002 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9608\r\n* [router] add mistral tool parser by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9622\r\n* [router] add qwen tool parser by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9623\r\n* [router] add pythonic parser by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9628\r\n* [router] add llama tool parser by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9629\r\n* [router] add ut for mistral, llama, pythonic, and streaming tool parser by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9632\r\n* [new feat] ascend backend support fia fusion kernel by @ZhengdQin in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8328\r\n* model: Support nvidia\u002FLlama-3_1-Nemotron-Ultra-253B-v1 by @netanel-haber in 
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9301\r\n* Fix lint for router by @hebiao064 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9636\r\n* [docs] Update README with additional highlights and resources for SGLang x AMD SF Meetup by @wisclmy0611 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9640\r\n* Add reasoning_effort param in TiktokenTokenizer.apply_chat_template by @lshmouse in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9630\r\n* fix: allow user to specify function as role by @GavinZhu-GMI in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9635\r\n* Fix kimi k2 function calling format by @XiaotongJiang in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9606\r\n* [router] address worker loa","2025-09-12T03:50:52",{"id":274,"version":275,"summary_zh":276,"released_at":277},61623,"v0.5.1","## What's Changed\r\n* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8595\r\n* [bugifx] QWen-1M context support[2\u002F3] using current cuda stream in the DCA's kernel for bugfix. 
by @sighingnow in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8611\r\n* Fix hf3fs_fuse import error by @ispobock in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8623\r\n* Update step3v default config by @ispobock in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8626\r\n* [ci] fix genai-bench execution cmd by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8629\r\n* [router] update router pypi version by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8628\r\n* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x by @b8zhong in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8577\r\n* Fix typos in py_test\u002Ftest_launch_server.py by @windsonsea in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F6227\r\n* misc: Remove debug print to logger.info by @CatherineSue in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8633\r\n* SGLang HiCache NIXL Connector by @vvenkates27 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8488\r\n* [bug] remove pdlb from minilb since its no longer available by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8634\r\n* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor by @trevor-m in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8630\r\n* Conditionally import HiCacheHF3FS by @pansicheng in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8598\r\n* TRTLLM Gen MLA Decode Kernel Integration (same as #7938) by @farazkh80 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8632\r\n* Fix nan value generated after custom all reduce by @kkHuang-amd in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8532\r\n* Revert 
\"Fix nan value generated after custom all reduce (#8532)\" by @zhyncs in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8642\r\n* Feature\u002Fmodelscope model download by @yrk111222 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8083\r\n* chore: speedup NPU CI by cache by @pkking in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8270\r\n* [Bugfix] fix w8a8_int8 load issue by @iforgetmyname in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8308\r\n* [bugfix] fix router python parser for pd urls by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8644\r\n* [router] add basic usage doc by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8640\r\n* [router] upgrade router version to 0.1.8 by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8645\r\n* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE by @kaixih in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8450\r\n* HiCache, fixing hash value indexing by @xiezhq-hermann in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8636\r\n* Interface change for kvcache io to support page first layout by @xiezhq-hermann in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8318\r\n* Update batch size limitation of dsv3_router_gemm kernel to 16 by @Fridge003 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8051\r\n* chore: bump v0.4.10.post1 by @ispobock in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8652\r\n* Add hf3fs_utils.cpp to package-data by @pansicheng in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8653\r\n* Fix chat template handling for OpenAI serving by @JustinTong0323 in 
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8635\r\n* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… by @byjiang1996 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8511\r\n* [5\u002FN] MoE Refactor: Update MoE parallelism arguments by @ch-wan in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8658\r\n* Increase tolerance to address CI failures by @lifuhuang in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8643\r\n* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 by @panpan0000 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8013\r\n* [Doc] fix: Update README for cu126 sgl-kernel compile problem by @Hongbosherlock in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8665\r\n* fix per token cuda kernel hidden dim cannot divide by 16 by @hebiao064 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8543\r\n* fix arg typo for --disaggregation-transfer-backend by @ZacWang in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8664\r\n* [fix] fix pd disagg error of vlms by @ccw1996 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8094\r\n* Disable tp for shared experts under expert parallelism for GLM4.5 model (#8647) by @zminglei in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8647\r\n* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla by @trevor-m in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8685\r\n* [bug] limit bootstrap room to to [0, 2^63 - 1] by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8684\r\n* Update CODEOWNERS by @merrymercy in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8686\r\n* Fix deepgemm masked grouped gemm jit compile by @ispobock in 
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8679\r\n* Fix FP8 block quantization when N or K is not multiple","2025-08-23T19:57:46",{"id":279,"version":280,"summary_zh":281,"released_at":282},61624,"gateway-v0.1.9","## What's Changed in Gateway\r\n\r\n### Gateway Changes (10 commits)\r\n\r\n- [router] upgrade router version to 0.1.9 (#8844) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8844\r\n- refactor(sgl-router): Replace `once_cell` with `LazyLock` in worker.rs and remove once_cell dependency from Cargo.toml (#8698) by @htiennv in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8698\r\n- [router] fix req handling order, improve serialization, remove retry (#8888) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8888\r\n- [router] PD Router Simplification and Reorganization (#8838) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8838\r\n- [router] complete router oai spec (#8828) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8828\r\n- [pd-router] Add Configurable Retry Logic for reduce backend pressure (#8744) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8744\r\n- [router] introduce dp worker abstraction (#8639) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8639\r\n- [router] Implement HTTP Dependency Injection Pattern for Router System (#8714) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8714\r\n- [router] minor code clean up and and refactoring (#8711) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8711\r\n- [bug] limit bootstrap room to to [0, 2^63 - 1] (#8684) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8684\r\n\r\n### New 
Contributors\r\n\r\n* @htiennv made their first contribution in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcommit\u002Ffd05b5675\r\n\r\n### Paths Included\r\n\r\n- `sgl-router`\r\n- `python\u002Fsglang\u002Fsrt\u002Fgrpc`\r\n- `python\u002Fsglang\u002Fsrt\u002Fentrypoints\u002Fgrpc_server.py`\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcompare\u002Fgateway-v0.1.8...gateway-v0.1.9\r\n","2025-11-17T10:58:59",{"id":284,"version":285,"summary_zh":286,"released_at":287},61625,"gateway-v0.1.8","## What's Changed in Gateway\r\n\r\n### Gateway Changes (4 commits)\r\n\r\n- [router] upgrade router version to 0.1.8 (#8645) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8645\r\n- [router] add basic usage doc (#8640) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8640\r\n- [bugfix] fix router python parser for pd urls (#8644) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8644\r\n- Fix typos in py_test\u002Ftest_launch_server.py (#6227) by @windsonsea in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F6227\r\n\r\n### New Contributors\r\n\r\n* @windsonsea made their first contribution in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcommit\u002F061c8959f\r\n\r\n### Paths Included\r\n\r\n- `sgl-router`\r\n- `python\u002Fsglang\u002Fsrt\u002Fgrpc`\r\n- `python\u002Fsglang\u002Fsrt\u002Fentrypoints\u002Fgrpc_server.py`\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcompare\u002Fgateway-v0.1.7...gateway-v0.1.8\r\n","2025-11-17T10:57:08",{"id":289,"version":290,"summary_zh":291,"released_at":292},61626,"gateway-v0.1.7","## What's Changed in Gateway\r\n\r\n### Gateway\u002FRouter Changes (11 commits)\r\n\r\n- [router] update router pypi version (#8628) by @slin1237 in 
https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8628\r\n- [router] migrate router from actix to axum (#8479) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8479\r\n- [feature] [sgl-router] Add a dp-aware routing strategy (#6869) by @oldsharp in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F6869\r\n- [router] improve router logs and request id header (#8415) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8415\r\n- [router] add different policies for p node and d node (#8395) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8395\r\n- [router] add request format unit test (#8300) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8300\r\n- [router] add streaming unit test (#8299) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8299\r\n- [router] add endpoint unit test (#8298) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8298\r\n- [router] fix pd model completion request (#8303) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8303\r\n- [router] add common ut infra to mock worker and app (#8295) by @slin1237 in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8295\r\n- fix: sgl-router remove dead code (#8257) by @oldsharp in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F8257\r\n\r\n### New Contributors\r\n\r\n* @oldsharp made their first contribution in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcommit\u002Fa730ce816\r\n* @oldsharp made their first contribution in https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcommit\u002Fc33499a67\r\n\r\n### Paths Included\r\n\r\n- `sgl-router`\r\n- `python\u002Fsglang\u002Fsrt\u002Fgrpc`\r\n- 
`python\u002Fsglang\u002Fsrt\u002Fentrypoints\u002Fgrpc_server.py`\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fcompare\u002Fgateway-v0.1.6...gateway-v0.1.7\r\n\r\n","2025-11-17T10:51:42"]