[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-QwenLM--Qwen3":3,"tool-QwenLM--Qwen3":65},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 
直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":10,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85267,"2026-04-18T11:00:28",[26,51,52,53,14,54,15,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 
民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,51,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":80,"difficulty_score":10,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":104,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":105,"updated_at":106,"faqs":107,"releases":138},9122,"QwenLM\u002FQwen3","Qwen3","Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.","Qwen3 是阿里云通义千问团队最新推出的大型语言模型系列，旨在为用户提供更强大、更灵活的智能交互体验。它不仅能流畅完成日常对话、文案创作和代码编写，还能深入处理复杂的逻辑推理、科学计算及长文档分析任务，有效解决了传统模型在专业深度和超长上下文理解上的局限。\n\n无论是希望快速集成 AI 能力的开发者、需要高性能基座模型的研究人员，还是寻求高效办公助手的普通用户，都能从中获益。Qwen3 提供了多种尺寸版本（从 4B 到 235B），并创新性地分为“指令版”和“思考版”：前者在多语言支持和主观任务对齐上表现卓越；后者则具备深度推理能力，在数学与学术基准测试中达到开源模型领先水平。其独特的技术亮点包括原生支持 256K 上下文窗口（可扩展至 100 万 token），以及出色的工具调用能力。配合完善的本地部署、量化压缩及微调文档，Qwen3 让高性能大模型的应用门槛大幅降低，真正实现了从云端到本地的灵活落地。","# Qwen3\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-accelerate-overseas.aliyuncs.com\u002Flogo_qwen3.png\" width=\"400\"\u002F>\n\u003Cp>\n\n\u003Cp align=\"center\">\n          💜 \u003Ca href=\"https:\u002F\u002Fchat.qwen.ai\u002F\">\u003Cb>Qwen 
Chat\u003C\u002Fb>\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\">Hugging Face\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤖 \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Forganization\u002Fqwen\">ModelScope\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.09388\">Paper\u003C\u002Fa> &nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3\u002F\">Blog\u003C\u002Fa> &nbsp&nbsp ｜ &nbsp&nbsp📖 \u003Ca href=\"https:\u002F\u002Fqwen.readthedocs.io\u002F\">Documentation\u003C\u002Fa>\n\u003Cbr>\n🖥️ \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-Demo\">Demo\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp💬 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen\u002Fblob\u002Fmain\u002Fassets\u002Fwechat.png\">WeChat (微信)\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🫨 \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FCV4E9rpNSD\">Discord\u003C\u002Fa>&nbsp&nbsp\n\u003C\u002Fp>\n\n\nVisit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with `Qwen3-` or visit the [Qwen3 collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FQwen\u002Fqwen3-67dd247413f0e2e4f653967f), and you will find all you need! Enjoy!\n\nTo learn more about Qwen3, feel free to read our documentation \\[[EN](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002F)|[ZH](https:\u002F\u002Fqwen.readthedocs.io\u002Fzh-cn\u002Flatest\u002F)\\]. 
Our documentation consists of the following sections:\n\n- Quickstart: the basic usages and demonstrations;\n- Inference: the guidance for the inference with Transformers, including batch inference, streaming, etc.;\n- Run Locally: the instructions for running LLM locally on CPU and GPU, with frameworks like llama.cpp, Ollama, and LM Studio;\n- Deployment: the demonstration of how to deploy Qwen for large-scale inference with frameworks like SGLang, vLLM, TGI, etc.;\n- Quantization: the practice of quantizing LLMs with GPTQ, AWQ, as well as the guidance for how to make high-quality quantized GGUF files;\n- Training: the instructions for post-training, including SFT and RLHF (TODO) with frameworks like Axolotl, LLaMA-Factory, etc.\n- Framework: the usage of Qwen with frameworks for application, e.g., RAG, Agent, etc.\n\n## Introduction\n\n### Qwen3-2507\n\nOver the past three months, we continued to explore the potential of the Qwen3 families and we are excited to introduce the updated **Qwen3-2507** in two variants, Qwen3-Instruct-2507 and Qwen3-Thinking-2507, and three sizes, 235B-A22B, 30B-A3B, and 4B.\n\n**Qwen3-Instruct-2507** is the updated version of the previous Qwen3 non-thinking mode, featuring the following key enhancements:  \n\n- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage**.  \n- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.  \n- **Markedly better alignment** with user preferences in **subjective and open-ended tasks**, enabling more helpful responses and higher-quality text generation.  
\n- **Enhanced capabilities** in **256K-token long-context understanding**, extendable up to **1 million tokens**.\n\n**Qwen3-Thinking-2507** is the continuation of Qwen3 thinking model, with improved quality and depth of reasoning, featuring the following key enhancements:\n- **Significantly improved performance** on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise — achieving **state-of-the-art results among open-weight thinking models**.\n- **Markedly better general capabilities**, such as instruction following, tool usage, text generation, and alignment with human preferences.\n- **Enhanced 256K long-context understanding** capabilities, extendable up to **1 million tokens**.\n\n\n\u003Cdetails>\n    \u003Csummary>\u003Cb>Previous Qwen3 Release\u003C\u002Fb>\u003C\u002Fsummary>\n    \u003Ch3>Qwen3 (aka Qwen3-2504)\u003C\u002Fh3>\n    \u003Cp>\n    We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. \n    These models represent our most advanced and intelligent systems to date, improving from our experience in building QwQ and Qwen2.5.\n    We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Expert (MoE) models. 
\n    \u003Cbr>\u003Cbr>\n    The highlights from Qwen3 include:\n        \u003Cul>\n            \u003Cli>\u003Cb>Dense and Mixture-of-Experts (MoE) models of various sizes\u003C\u002Fb>, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.\u003C\u002Fli>\n            \u003Cli>\u003Cb>Seamless switching between thinking mode\u003C\u002Fb> (for complex logical reasoning, math, and coding) and \u003Cb>non-thinking mode\u003C\u002Fb> (for efficient, general-purpose chat), ensuring optimal performance across various scenarios.\u003C\u002Fli>\n            \u003Cli>\u003Cb>Significant enhancement in reasoning capabilities\u003C\u002Fb>, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.\u003C\u002Fli>\n            \u003Cli>\u003Cb>Superior human preference alignment\u003C\u002Fb>, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.\u003C\u002Fli>\n            \u003Cli>\u003Cb>Expertise in agent capabilities\u003C\u002Fb>, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.\u003C\u002Fli>\n            \u003Cli>\u003Cb>Support of 100+ languages and dialects\u003C\u002Fb> with strong capabilities for \u003Cb>multilingual instruction following\u003C\u002Fb> and \u003Cb>translation\u003C\u002Fb>.\u003C\u002Fli>\n        \u003C\u002Ful>\n    \u003C\u002Fp>\n\u003C\u002Fdetails>\n\n\n## News\n- 2025.08.08: You can now use Qwen3-2507 to handle ultra-long inputs of **1 million tokens**! 
See the updated modelcards ([235B-A22B-Instruct-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Instruct-2507), [235B-A22B-Thinking-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Thinking-2507), [30B-A3B-Instruct-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Instruct-2507), [30B-A3B-Thinking-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Thinking-2507)) for how to enable this feature.\n- 2025.08.06: The final open release of Qwen3-2507, [Qwen3-4B-Instruct-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-4B-Instruct-2507) and [Qwen3-4B-Thinking-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-4B-Thinking-2507), is out!\n- 2025.07.31: Qwen3-30B-A3B-Thinking-2507 is released. Check out the [modelcard](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Thinking-2507) for more details!\n- 2025.07.30: Qwen3-30B-A3B-Instruct-2507 is released. Check out the [modelcard](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Instruct-2507) for more details!\n- 2025.07.25: We released the updated version of Qwen3-235B-A22B thinking mode, named Qwen3-235B-A22B-Thinking-2507. Check out the [modelcard](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Thinking-2507) for more details!\n- 2025.07.21: We released the updated version of Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring significant enhancements over the previous version and supporting 256K-token long-context understanding. Check our [modelcard](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Instruct-2507) for more details!\n- 2025.04.29: We released the Qwen3 series. Check our [blog](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3) for more details!\n- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B for more possibilities. 
Check our [blog](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2.5) for more!\n- 2024.06.06: We released the Qwen2 series. Check our [blog](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2\u002F)!\n- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! Temporarily, only HF transformers and vLLM support the model. We will soon add the support of llama.cpp, mlx-lm, etc. Check our [blog](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen-moe\u002F) for more information!\n- 2024.02.05: We released the Qwen1.5 series.\n\n## Performance\n\nDetailed evaluation results are reported in this [📑 blog (Qwen3-2504)](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3\u002F) and this [📑 blog (Qwen3-2507) \\[coming soon\\]]().\n\nFor requirements on GPU memory and the respective throughput, see results [here](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Fgetting_started\u002Fspeed_benchmark.html).\n\n## Run Qwen3\n\n### 🤗 Transformers\n\nTransformers is a library of pretrained natural language processing for inference and training. \nThe latest version of `transformers` is recommended and `transformers>=4.51.0` is required.\n\n#### Qwen3-Instruct-2507\n\nThe following contains a code snippet illustrating how to use Qwen3-30B-A3B-Instruct-2507 to generate content based on given inputs. 
\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = \"Qwen\u002FQwen3-30B-A3B-Instruct-2507\"\n\n# load the tokenizer and the model\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n\n# prepare the model input\nprompt = \"Give me a short introduction to large language model.\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# conduct text completion\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=16384\n)\noutput_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() \n\ncontent = tokenizer.decode(output_ids, skip_special_tokens=True)\n\nprint(\"content:\", content)\n```\n\n> [!Note]\n> Qwen3-Instruct-2507 supports only non-thinking mode and does not generate ``\u003Cthink>\u003C\u002Fthink>`` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.\n\n\n#### Qwen3-Thinking-2507\n\nThe following contains a code snippet illustrating how to use Qwen3-30B-A3B-Thinking-2507 to generate content based on given inputs. 
\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = \"Qwen\u002FQwen3-30B-A3B-Thinking-2507\"\n\n# load the tokenizer and the model\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n\n# prepare the model input\nprompt = \"Give me a short introduction to large language model.\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# conduct text completion\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=32768\n)\noutput_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() \n\n# parsing thinking content\ntry:\n    # rindex finding 151668 (\u003C\u002Fthink>)\n    index = len(output_ids) - output_ids[::-1].index(151668)\nexcept ValueError:\n    index = 0\n\nthinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip(\"\\n\")\ncontent = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip(\"\\n\")\n\nprint(\"thinking content:\", thinking_content)  # no opening \u003Cthink> tag\nprint(\"content:\", content)\n\n```\n\n> [!Note]\n> Qwen3-Thinking-2507 supports only thinking mode.\n> Additionally, to enforce model thinking, the default chat template automatically includes `\u003Cthink>`. Therefore, it is normal for the model's output to contain only `\u003C\u002Fthink>` without an explicit opening `\u003Cthink>` tag.\n> \n> Qwen3-Thinking-2507 also features an increased thinking length. 
We strongly recommend its use in highly complex reasoning tasks with adequate maximum generation length.\n\n\n\n\u003Cdetails>\n    \u003Csummary>\u003Cb>Switching Thinking\u002FNon-thinking Modes for Previous Qwen3  Models\u003C\u002Fb>\u003C\u002Fsummary>\n    \u003Cp>\n    By default, Qwen3 models think before responding.\n    This can be controlled by\n        \u003Cul>\n            \u003Cli>\u003Ccode>enable_thinking=False\u003C\u002Fcode>: Passing \u003Ccode>enable_thinking=False\u003C\u002Fcode> to `tokenizer.apply_chat_template` will strictly prevent the model from generating thinking content.\u003C\u002Fli>\n            \u003Cli>\u003Ccode>\u002Fthink\u003C\u002Fcode> and \u003Ccode>\u002Fno_think\u003C\u002Fcode> instructions: Use those words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.\u003C\u002Fli>\n        \u003C\u002Ful>\n    \u003C\u002Fp>\n\u003C\u002Fdetails>\n\n\n### ModelScope\n\nWe strongly advise users, especially those in mainland China, to use ModelScope. 
\nModelScope adopts a Python API similar to Transformers.\nThe CLI tool `modelscope download` can help you solve issues concerning downloading checkpoints.\nFor vLLM and SGLang, the environment variables `VLLM_USE_MODELSCOPE=true` and `SGLANG_USE_MODELSCOPE=true` can be used respectively.\n\n\n### llama.cpp\n\n[`llama.cpp`](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.\n`llama.cpp>=b5401` is recommended for the full support of Qwen3.\n\nTo use the CLI, run the following in a terminal:\n```shell\n.\u002Fllama-cli -hf Qwen\u002FQwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift\n# CTRL+C to exit\n```\n\nTo use the API server, run the following in a terminal:\n```shell\n.\u002Fllama-server -hf Qwen\u002FQwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080\n```\nA simple web front end will be at `http:\u002F\u002Flocalhost:8080` and an OpenAI-compatible API will be at `http:\u002F\u002Flocalhost:8080\u002Fv1`.\n\nFor additional guides, please refer to [our documentation](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Frun_locally\u002Fllama.cpp.html).\n\n> [!Note]\n> llama.cpp adopts \"rotating context management\" and infinite generation is made possible by evicting earlier tokens.\n> It can be configured by parameters, and the commands above effectively disable it.\n> For more details, please refer to [our documentation](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Frun_locally\u002Fllama.cpp.html#llama-cli).\n\n### Ollama\n\nAfter [installing Ollama](https:\u002F\u002Follama.com\u002F), you can initiate the Ollama service with the following command (Ollama v0.9.0 or higher is recommended):\n```shell\nollama serve\n# You need 
to keep this service running whenever you are using ollama\n```\n\nTo pull a model checkpoint and run the model, use the `ollama run` command. You can specify a model size by adding a suffix to `qwen3`, such as `:8b` or `:30b-a3b`:\n```shell\nollama run qwen3:8b\n# To set parameters, type \"\u002Fset parameter num_ctx 40960\" and \"\u002Fset parameter num_predict 32768\"\n# To exit, type \"\u002Fbye\" and press ENTER\n# For Qwen3-2504 models,\n# - To enable thinking, which is the default, type \"\u002Fset think\"\n# - To disable thinking, type \"\u002Fset nothink\"\n```\n\nYou can also access the Ollama service via its OpenAI-compatible API. \nPlease note that you need to (1) keep `ollama serve` running while using the API, and (2) execute `ollama run qwen3:8b` before utilizing this API to ensure that the model checkpoint is prepared.\nThe API is at `http:\u002F\u002Flocalhost:11434\u002Fv1\u002F` by default.\n\nFor additional details, please visit [ollama.ai](https:\u002F\u002Follama.com\u002F).\n\n> [!Note]\n> Ollama's naming may not be consistent with Qwen's original naming.\n> For example, `qwen3:30b-a3b` in Ollama points to `qwen3:30b-a3b-thinking-2507-q4_K_M` as of August 2025.\n> Please check \u003Chttps:\u002F\u002Follama.com\u002Flibrary\u002Fqwen3\u002Ftags> before use.\n\n\n> [!Note]\n> Ollama adopts the same \"rotating context management\" as llama.cpp.\n> However, its default settings (`num_ctx` 2048 and `num_predict` -1), suggesting infinite generation with a 2048-token context,\n> could lead to trouble for Qwen3 models.\n> We recommend setting `num_ctx` and `num_predict` properly.\n\n### LMStudio\n\nQwen3 is already supported by [lmstudio.ai](https:\u002F\u002Flmstudio.ai\u002F). 
You can directly use LMStudio with our GGUF files.\n\n### ExecuTorch\n\nTo export and run on ExecuTorch (iOS, Android, Mac, Linux, and more), please follow this [example](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fexecutorch\u002Fblob\u002Fmain\u002Fexamples\u002Fmodels\u002Fqwen3\u002FREADME.md).\n\n### MNN\n\nTo export and run on MNN, which supports Qwen3 on mobile devices, please visit [Alibaba MNN](https:\u002F\u002Fgithub.com\u002Falibaba\u002FMNN).\n\n### MLX LM\n\nIf you are running on Apple Silicon, [`mlx-lm`](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm) also supports Qwen3 (`mlx-lm>=0.24.0`). \nLook for models ending with MLX on Hugging Face Hub.\n\n\n### OpenVINO\n\nIf you are running on Intel CPU or GPU, [OpenVINO toolkit](https:\u002F\u002Fgithub.com\u002Fopenvinotoolkit) supports Qwen3.\nYou can follow this [chatbot example](https:\u002F\u002Fgithub.com\u002Fopenvinotoolkit\u002Fopenvino_notebooks\u002Fblob\u002Flatest\u002Fnotebooks\u002Fllm-chatbot\u002Fllm-chatbot.ipynb).\n\n\n## Deploy Qwen3\n\nQwen3 is supported by multiple inference frameworks. \nHere we demonstrate the usage of `SGLang`, `vLLM` and `TensorRT-LLM`.\nYou can also find Qwen3 models from various inference providers, e.g., [Alibaba Cloud Model Studio](https:\u002F\u002Fwww.alibabacloud.com\u002Fen\u002Fproduct\u002Fmodelstudio).\n\n\n### SGLang\n\n[SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) is a fast serving framework for large language models and vision language models.\nSGLang could be used to launch a server with OpenAI-compatible API service. 
\n`sglang>=0.4.6.post1` is required.\n\nFor Qwen3-Instruct-2507, \n```shell\npython -m sglang.launch_server --model-path Qwen\u002FQwen3-30B-A3B-Instruct-2507 --port 30000 --context-length 262144\n```\n\nFor Qwen3-Thinking-2507,\n```shell\npython -m sglang.launch_server --model-path Qwen\u002FQwen3-30B-A3B-Thinking-2507 --port 30000 --context-length 262144 --reasoning-parser deepseek-r1\n```\n\nFor Qwen3, it is\n```shell\npython -m sglang.launch_server --model-path Qwen\u002FQwen3-8B --port 30000 --context-length 131072 --reasoning-parser qwen3\n```\nAn OpenAI-compatible API will be available at `http:\u002F\u002Flocalhost:30000\u002Fv1`.\n\n> [!Note]\n> Due to the preprocessing of API requests in SGLang, which drops all `reasoning_content` fields, the quality of **multi-step tool use with Qwen3 thinking models** may be suboptimal, because it requires the related thinking content to be present. While the fixes are being worked on, as a workaround, we recommend passing the content as it is, without extracting thinking content, and the chat template will correctly handle the processing.\n\n\n### vLLM\n\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.\n`vllm>=0.9.0` is recommended.\n\nFor Qwen3-Instruct-2507, \n```shell\nvllm serve Qwen\u002FQwen3-30B-A3B-Instruct-2507 --port 8000 --max-model-len 262144\n```\n\nFor Qwen3-Thinking-2507,\n```shell\nvllm serve Qwen\u002FQwen3-30B-A3B-Thinking-2507 --port 8000 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1\n```\n\nFor Qwen3, it is\n```shell\nvllm serve Qwen\u002FQwen3-8B --port 8000 --max-model-len 131072 --enable-reasoning --reasoning-parser qwen3\n```\nAn OpenAI-compatible API will be available at `http:\u002F\u002Flocalhost:8000\u002Fv1`.\n\n> [!Note]\n> Due to the preprocessing of API requests in vLLM, which drops all `reasoning_content` fields, the quality of **multi-step tool use with Qwen3 
thinking models** may be suboptimal, because it requires the related thinking content to be present. While the fixes are being worked on, as a workaround, we recommend passing the content as it is, without extracting thinking content, and the chat template will correctly handle the processing.\n\n### TensorRT-LLM\n\n[TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) is an open-source LLM inference engine from NVIDIA, which provides optimizations including custom attention kernels, quantization and more on NVIDIA GPUs. Qwen3 is supported in its re-architected [PyTorch backend](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Ftorch.html). `tensorrt_llm>=0.20.0rc3` is recommended. Please refer to the [README](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM\u002Fblob\u002Fmain\u002Fexamples\u002Fmodels\u002Fcore\u002Fqwen\u002FREADME.md#qwen3) page for more details.\n\n```shell\ntrtllm-serve Qwen\u002FQwen3-8B --host localhost --port 8000 --backend pytorch\n```\nAn OpenAI-compatible API will be available at `http:\u002F\u002Flocalhost:8000\u002Fv1`.\n\n### MindIE\n\nFor deployment on Ascend NPUs, please visit [Modelers](https:\u002F\u002Fmodelers.cn\u002F) and search for Qwen3.\n\n\u003C!-- \n### OpenLLM\n\n[OpenLLM](https:\u002F\u002Fgithub.com\u002Fbentoml\u002FOpenLLM) allows you to easily run Qwen2.5 as OpenAI-compatible APIs. You can start a model server using `openllm serve`. For example:\n\n```bash\nopenllm serve qwen2.5:7b\n```\n\nThe server is active at `http:\u002F\u002Flocalhost:3000\u002F`, providing OpenAI-compatible APIs. You can create an OpenAI client to call its chat API. For more information, refer to [our documentation](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Fdeployment\u002Fopenllm.html). 
-->\n\n\n## Build with Qwen3\n\n### Tool Use\n\nFor tool use capabilities, we recommend taking a look at [Qwen-Agent](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-Agent), which provides a wrapper around these APIs to support tool use or function calling with MCP support.\nTool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc.\nFollow guides in our documentation to see how to enable the support.\n\n\n### Finetuning\n\nWe advise you to use training frameworks, including [Axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl), [UnSloth](https:\u002F\u002Fgithub.com\u002Funslothai\u002Funsloth), [Swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fswift), [Llama-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory), etc., to finetune your models with SFT, DPO, GRPO, etc.\n\n\n## License Agreement\n\nAll our open-weight models are licensed under Apache 2.0. \nYou can find the license files in the respective Hugging Face repositories.\n\n## Citation\n\nIf you find our work helpful, feel free to give us a cite.\n\n```bibtex\n@article{qwen3,\n    title={Qwen3 Technical Report}, \n    author={An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Feng Hu and Hao Ge and Haoran Wei and Huan Lin and Jialong Tang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jing Zhou and Jingren Zhou and Junyang Lin and Kai Dang and Keqin Bao and Kexin Yang and Le Yu and Lianghao Deng and Mei Li and Mingfeng Xue and Mingze Li and Pei Zhang and Peng Wang and Qin Zhu and Rui Men and Ruize Gao and Shixuan Liu and Shuang Luo and Tianhao Li and Tianyi Tang and Wenbiao Yin and Xingzhang Ren and Xinyu Wang and Xinyu Zhang and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yinger Zhang and Yu Wan and 
Yuqiong Liu and Zekun Wang and Zeyu Cui and Zhenru Zhang and Zhipeng Zhou and Zihan Qiu},\n    journal = {arXiv preprint arXiv:2505.09388},\n    year={2025}\n}\n\n@article{qwen2.5,\n    title   = {Qwen2.5 Technical Report}, \n    author  = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},\n    journal = {arXiv preprint arXiv:2412.15115},\n    year    = {2024}\n}\n\n@article{qwen2,\n    title   = {Qwen2 Technical Report}, \n    author  = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},\n    journal = {arXiv preprint arXiv:2407.10671},\n    year    = {2024}\n}\n```\n\n## Contact Us\nIf you are interested to 
leave a message to either our research team or product team, join our [Discord](https:\u002F\u002Fdiscord.gg\u002Fz3GAxXZ9Ce) or [WeChat groups](assets\u002Fwechat.png)!\n","# Qwen3\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-accelerate-overseas.aliyuncs.com\u002Flogo_qwen3.png\" width=\"400\"\u002F>\n\u003Cp>\n\n\u003Cp align=\"center\">\n          💜 \u003Ca href=\"https:\u002F\u002Fchat.qwen.ai\u002F\">\u003Cb>通义千问聊天\u003C\u002Fb>\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\">Hugging Face\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤖 \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Forganization\u002Fqwen\">ModelScope\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.09388\">论文\u003C\u002Fa> &nbsp&nbsp | &nbsp&nbsp 📑 \u003Ca href=\"https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3\u002F\">博客\u003C\u002Fa> &nbsp&nbsp ｜ &nbsp&nbsp📖 \u003Ca href=\"https:\u002F\u002Fqwen.readthedocs.io\u002F\">文档\u003C\u002Fa>\n\u003Cbr>\n🖥️ \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-Demo\">演示\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp💬 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen\u002Fblob\u002Fmain\u002Fassets\u002Fwechat.png\">微信\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🫨 \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FCV4E9rpNSD\">Discord\u003C\u002Fa>&nbsp&nbsp\n\u003C\u002Fp>\n\n\n请访问我们的 Hugging Face 或 ModelScope 组织（点击上方链接），搜索以 `Qwen3-` 开头的检查点，或前往 [Qwen3 系列](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FQwen\u002Fqwen3-67dd247413f0e2e4f653967f)，您将找到所需的一切！尽情体验吧！\n\n如需了解更多关于 Qwen3 的信息，欢迎阅读我们的文档 \\[[EN](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002F)|[ZH](https:\u002F\u002Fqwen.readthedocs.io\u002Fzh-cn\u002Flatest\u002F)\\]。我们的文档包含以下章节：\n\n- 快速入门：基本用法与示例；\n- 推理：使用 Transformers 进行推理的指南，包括批量推理、流式推理等；\n- 本地运行：在 CPU 和 GPU 上使用 llama.cpp、Ollama、LM Studio 等框架本地运行大模型的说明；\n- 
部署：展示如何使用 SGLang、vLLM、TGI 等框架部署 Qwen 以进行大规模推理；\n- 量化：使用 GPTQ、AWQ 对大模型进行量化实践，以及如何制作高质量量化 GGUF 文件的指导；\n- 训练：后训练的说明，包括 SFT 和 RLHF（待完成）等，使用 Axolotl、LLaMA-Factory 等框架；\n- 框架：Qwen 在应用框架中的使用，例如 RAG、Agent 等。\n\n## 简介\n\n### Qwen3-2507\n\n过去三个月里，我们持续探索 Qwen3 系列的潜力，并很高兴推出更新后的 **Qwen3-2507**，分为 Qwen3-Instruct-2507 和 Qwen3-Thinking-2507 两种模式，以及 235B-A22B、30B-A3B、4B 三种规模。\n\n**Qwen3-Instruct-2507** 是此前 Qwen3 非思维模式的升级版，具有以下关键改进：\n\n- 在通用能力方面取得 **显著提升**，包括 **指令遵循、逻辑推理、文本理解、数学、科学、编码和工具使用**。\n- 在多语言领域的长尾知识覆盖上获得 **大幅增长**。\n- 在 **主观性和开放式任务** 中与用户偏好更加契合，能够提供更有帮助的回答和更高质量的文本生成。\n- 在 **256K 令牌长上下文理解** 方面的能力得到增强，可扩展至 **100 万令牌**。\n\n**Qwen3-Thinking-2507** 是 Qwen3 思维模型的延续，其推理质量和深度均有所提升，主要改进如下：\n- 在推理任务上的表现 **显著提高**，包括逻辑推理、数学、科学、编码以及通常需要人类专业知识的学术基准——在开源权重的思维模型中达到 **最先进水平**。\n- 通用能力显著增强，例如指令遵循、工具使用、文本生成以及与人类偏好的一致性。\n- 256K 长上下文理解能力进一步提升，可扩展至 **100 万令牌**。\n\n\n\u003Cdetails>\n    \u003Csummary>\u003Cb>之前的 Qwen3 发布\u003C\u002Fb>\u003C\u002Fsummary>\n    \u003Ch3>Qwen3（又称 Qwen3-2504）\u003C\u002Fh3>\n    \u003Cp>\n    我们很高兴地宣布推出 Qwen3，这是通义千问系列大型语言模型的最新成员。这些模型代表了我们迄今为止最为先进和智能的系统，是在构建 QwQ 和 Qwen2.5 的经验基础上进一步优化而成的。我们现向公众开放 Qwen3 的权重，涵盖密集型和混合专家（MoE）模型。\n    \u003Cbr>\u003Cbr>\n    Qwen3 的亮点包括：\n        \u003Cul>\n            \u003Cli>\u003Cb>多种规模的密集型和混合专家（MoE）模型\u003C\u002Fb>, 分别为 0.6B、1.7B、4B、8B、14B、32B，以及 30B-A3B 和 235B-A22B。\u003C\u002Fli>\n            \u003Cli>\u003Cb>思维模式与非思维模式之间的无缝切换\u003C\u002Fb>（思维模式适用于复杂的逻辑推理、数学和编码任务，非思维模式则用于高效、通用的对话），确保在各种场景下都能发挥最佳性能。\u003C\u002Fli>\n            \u003Cli>\u003Cb>推理能力显著增强\u003C\u002Fb>, 在数学、代码生成和常识性逻辑推理方面超越了之前的 QwQ（思维模式）和 Qwen2.5 指令模型（非思维模式）。\u003C\u002Fli>\n            \u003Cli>\u003Cb>与人类偏好的契合度更高\u003C\u002Fb>, 在创意写作、角色扮演、多轮对话和指令遵循等方面表现出色，带来更加自然、引人入胜且沉浸式的对话体验。\u003C\u002Fli>\n            \u003Cli>\u003Cb>强大的代理能力\u003C\u002Fb>, 无论在思维模式还是非思维模式下，都能精准集成外部工具，在复杂的基于代理的任务中表现领先于其他开源模型。\u003C\u002Fli>\n            \u003Cli>\u003Cb>支持 100 多种语言和方言\u003C\u002Fb>, 具备强大的 **多语言指令遵循** 和 **翻译** 能力。\u003C\u002Fli>\n        \u003C\u002Ful>\n    
\u003C\u002Fp>\n\u003C\u002Fdetails>\n\n## 新闻\n- 2025年8月8日：您现在可以使用Qwen3-2507处理长达**100万标记**的超长输入！请参阅更新后的模型卡片（[235B-A22B-Instruct-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Instruct-2507)、[235B-A22B-Thinking-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Thinking-2507)、[30B-A3B-Instruct-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Instruct-2507)、[30B-A3B-Thinking-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Thinking-2507)），了解如何启用此功能。\n- 2025年8月6日：Qwen3-2507的最终公开版本，即[Qwen3-4B-Instruct-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-4B-Instruct-2507)和[Qwen3-4B-Thinking-2507](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-4B-Thinking-2507)，现已发布！\n- 2025年7月31日：Qwen3-30B-A3B-Thinking-2507已发布。更多详情请查看[模型卡片](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Thinking-2507)！\n- 2025年7月30日：Qwen3-30B-A3B-Instruct-2507已发布。更多详情请查看[模型卡片](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B-Instruct-2507)！\n- 2025年7月25日：我们发布了Qwen3-235B-A22B思考模式的更新版本，名为Qwen3-235B-A22B-Thinking-2507。更多详情请查看[模型卡片](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Thinking-2507)！\n- 2025年7月21日：我们发布了Qwen3-235B-A22B非思考模式的更新版本，名为Qwen3-235B-A22B-Instruct-2507，相比上一版本有显著提升，并支持256K标记的长上下文理解。更多详情请查看我们的[模型卡片](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-235B-A22B-Instruct-2507)！\n- 2025年4月29日：我们发布了Qwen3系列。更多详情请查看我们的[博客](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3)！\n- 2024年9月19日：我们发布了Qwen2.5系列。此次新增了3个模型尺寸：3B、14B和32B，以提供更多可能性。更多信息请查看我们的[博客](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2.5)！\n- 2024年6月6日：我们发布了Qwen2系列。请查看我们的[博客](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2\u002F)！\n- 2024年3月28日：我们发布了Qwen的第一个MoE模型：Qwen1.5-MoE-A2.7B！目前，只有HF transformers和vLLM支持该模型。我们很快将增加对llama.cpp、mlx-lm等的支持。更多信息请查看我们的[博客](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen-moe\u002F)！\n- 2024年2月5日：我们发布了Qwen1.5系列。\n\n## 
性能\n\n详细评估结果已在本篇[📑博客（Qwen3-2504）](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3\u002F)以及这篇[📑博客（Qwen3-2507）\\[即将发布\\]]()中报告。\n\n关于显存需求及相应吞吐量，请参阅此处的结果：[链接](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Fgetting_started\u002Fspeed_benchmark.html)。\n\n## 运行Qwen3\n\n### 🤗 Transformers\n\nTransformers是一个提供预训练自然语言处理模型、可用于推理和训练的库。\n建议使用最新版本的`transformers`，且需满足`transformers>=4.51.0`的要求。\n\n#### Qwen3-Instruct-2507\n\n以下代码片段展示了如何使用Qwen3-30B-A3B-Instruct-2507根据给定输入生成内容。\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = \"Qwen\u002FQwen3-30B-A3B-Instruct-2507\"\n\n# 加载分词器和模型\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n\n# 准备模型输入\nprompt = \"给我一个关于大型语言模型的简短介绍。\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# 进行文本补全\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=16384\n)\noutput_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() \n\ncontent = tokenizer.decode(output_ids, skip_special_tokens=True)\n\nprint(\"content:\", content)\n```\n\n> [!NOTE]\n> Qwen3-Instruct-2507仅支持非思考模式，其输出不会生成``\u003Cthink>\u003C\u002Fthink>``块。同时，不再需要指定`enable_thinking=False`。\n\n\n#### Qwen3-Thinking-2507\n\n以下代码片段展示了如何使用Qwen3-30B-A3B-Thinking-2507根据给定输入生成内容。\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = \"Qwen\u002FQwen3-30B-A3B-Thinking-2507\"\n\n# 加载分词器和模型\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n\n# 准备模型输入\nprompt = \"给我一个关于大型语言模型的简短介绍。\"\nmessages = [\n    {\"role\": \"user\", \"content\": 
prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# 进行文本补全\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=32768\n)\noutput_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() \n\n# 解析思考内容\ntry:\n    # 从后向前查找151668（\u003C\u002Fthink>）\n    index = len(output_ids) - output_ids[::-1].index(151668)\nexcept ValueError:\n    index = 0\n\nthinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip(\"\\n\")\ncontent = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip(\"\\n\")\n\nprint(\"thinking content:\", thinking_content)  # 输出不以\u003Cthink>标签开头\nprint(\"content:\", content)\n```\n\n> [!NOTE]\n> Qwen3-Thinking-2507仅支持思考模式。\n> 此外，为强制模型进行思考，默认聊天模板会自动包含`\u003Cthink>`。因此，模型输出中仅出现`\u003C\u002Fthink>`而没有明确的`\u003Cthink>`标签开头是正常现象。\n>\n> Qwen3-Thinking-2507的思考过程也更长。我们强烈建议在复杂的推理任务中使用它，并设置足够大的最大生成长度。\n\n\n\n\u003Cdetails>\n    \u003Csummary>\u003Cb>切换先前Qwen3模型的思考\u002F非思考模式\u003C\u002Fb>\u003C\u002Fsummary>\n    \u003Cp>\n    默认情况下，Qwen3模型会在响应前进行思考。\n    可通过以下方式控制：\n        \u003Cul>\n            \u003Cli>\u003Ccode>enable_thinking=False\u003C\u002Fcode>：向\u003Ccode>tokenizer.apply_chat_template\u003C\u002Fcode>传递\u003Ccode>enable_thinking=False\u003C\u002Fcode>可严格阻止模型生成思考内容。\u003C\u002Fli>\n            \u003Cli>\u003Ccode>\u002Fthink\u003C\u002Fcode>和\u003Ccode>\u002Fno_think\u003C\u002Fcode>指令：在系统或用户消息中使用这些指令来指示Qwen3是否应思考。在多轮对话中，以最新的指令为准。\u003C\u002Fli>\n        \u003C\u002Ful>\n    \u003C\u002Fp>\n\u003C\u002Fdetails>\n\n### ModelScope\n\n我们强烈建议用户，尤其是中国大陆的用户，使用 ModelScope。  \nModelScope 采用与 Transformers 类似的 Python API。  \n命令行工具 `modelscope download` 可以帮助您解决检查点下载相关的问题。  \n对于 vLLM 和 SGLang，可以分别使用环境变量 `VLLM_USE_MODELSCOPE=true` 和 `SGLANG_USE_MODELSCOPE=true`。\n\n\n### llama.cpp\n\n[`llama.cpp`](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp) 能够在极简的设置下实现 LLM 
推理，并在多种硬件上提供最先进的性能。  \n为了完整支持 Qwen3，建议使用 `llama.cpp>=b5401`。\n\n要使用命令行界面，请在终端中运行以下命令：\n```shell\n.\u002Fllama-cli -hf Qwen\u002FQwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift\n# 按 CTRL+C 退出\n```\n\n要使用 API 服务器，请在终端中运行以下命令：\n```shell\n.\u002Fllama-server -hf Qwen\u002FQwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080\n```\n简单的 Web 前端将位于 `http:\u002F\u002Flocalhost:8080`，而兼容 OpenAI 的 API 将位于 `http:\u002F\u002Flocalhost:8080\u002Fv1`。\n\n有关更多指南，请参阅 [我们的文档](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Frun_locally\u002Fllama.cpp.html)。\n\n> [!NOTE]\n> llama.cpp 采用“循环上下文管理”机制，通过逐出较早的标记来实现无限生成。  \n> 这可以通过参数进行配置，上述命令已有效禁用该功能。  \n> 更多详情请参阅 [我们的文档](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Frun_locally\u002Fllama.cpp.html#llama-cli)。\n\n### Ollama\n\n在 [安装 Ollama](https:\u002F\u002Follama.com\u002F) 后，您可以使用以下命令启动 Ollama 服务（建议使用 Ollama v0.9.0 或更高版本）：\n```shell\nollama serve\n# 使用 Ollama 时，需保持此服务运行\n```\n\n要拉取模型检查点并运行模型，请使用 `ollama run` 命令。您可以通过在 `qwen3` 后添加后缀来指定模型大小，例如 `:8b` 或 `:30b-a3b`：\n```shell\nollama run qwen3:8b\n# 设置参数：输入 `\u002Fset parameter num_ctx 40960` 和 `\u002Fset parameter num_predict 32768`  \n# 输入 `\u002Fbye` 并按 ENTER 键退出  \n# 对于 Qwen3-2504 模型，\n# - 若要启用思考模式（默认），输入 `\u002Fset think`  \n# - 若要禁用思考模式，输入 `\u002Fset nothink`  \n```\n\n您还可以通过 Ollama 的兼容 OpenAI 的 API 访问该服务。  \n请注意：(1) 使用 API 时需保持 `ollama serve` 运行；(2) 在使用此 API 之前需先执行 `ollama run qwen3:8b`，以确保模型检查点已准备就绪。  \nAPI 默认地址为 `http:\u002F\u002Flocalhost:11434\u002Fv1\u002F`。\n\n有关更多详细信息，请访问 [ollama.com](https:\u002F\u002Follama.com\u002F)。\n\n> [!NOTE]\n> Ollama 的命名可能与 Qwen 的原始命名不一致。  \n> 例如，截至 2025 年 8 月，Ollama 中的 `qwen3:30b-a3b` 实际指向的是 `qwen3:30b-a3b-thinking-2507-q4_K_M`。  \n> 请在使用前查看 \u003Chttps:\u002F\u002Follama.com\u002Flibrary\u002Fqwen3\u002Ftags>。\n\n\n> [!NOTE]\n> 
Ollama 采用与 llama.cpp 相同的“循环上下文管理”机制。  \n> 然而，其默认设置（`num_ctx` 为 2048，`num_predict` 为 -1）意味着使用 2048 个标记的上下文进行无限生成，  \n> 这可能会给 Qwen3 模型带来问题。  \n> 我们建议正确设置 `num_ctx` 和 `num_predict`。\n\n### LMStudio\n\nQwen3 已被 [lmstudio.ai](https:\u002F\u002Flmstudio.ai\u002F) 支持。您可以直接使用我们的 GGUF 文件在 LMStudio 中运行。\n\n\n### ExecuTorch\n\n要导出并在 ExecuTorch 上运行（适用于 iOS、Android、Mac、Linux 等平台），请参考此 [示例](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fexecutorch\u002Fblob\u002Fmain\u002Fexamples\u002Fmodels\u002Fqwen3\u002FREADME.md)。\n\n### MNN\n\n要导出并在 MNN 上运行（支持 Qwen3 在移动设备上的部署），请访问 [Alibaba MNN](https:\u002F\u002Fgithub.com\u002Falibaba\u002FMNN)。\n\n### MLX LM\n\n如果您使用的是 Apple Silicon 处理器，[`mlx-lm`](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm) 也支持 Qwen3（`mlx-lm>=0.24.0`）。  \n请在 Hugging Face Hub 上查找以 MLX 结尾的模型。\n\n\n### OpenVINO\n\n如果您使用的是 Intel CPU 或 GPU，[OpenVINO 工具包](https:\u002F\u002Fgithub.com\u002Fopenvinotoolkit)支持 Qwen3。  \n您可以参考此 [聊天机器人示例](https:\u002F\u002Fgithub.com\u002Fopenvinotoolkit\u002Fopenvino_notebooks\u002Fblob\u002Flatest\u002Fnotebooks\u002Fllm-chatbot\u002Fllm-chatbot.ipynb)。\n\n\n## 部署 Qwen3\n\nQwen3 很好地支持多种推理框架。  \n在此我们将演示如何使用 `SGLang`、`vLLM` 和 `TensorRT-LLM`。  \n此外，您也可以从各种推理提供商处找到 Qwen3 模型，例如 [阿里云 Model Studio](https:\u002F\u002Fwww.alibabacloud.com\u002Fen\u002Fproduct\u002Fmodelstudio)。\n\n\n### SGLang\n\n[SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) 是一个用于大型语言模型和视觉语言模型的快速推理框架。  \nSGLang 可用于启动具有兼容 OpenAI API 服务的服务器。  \n需要 `sglang>=0.4.6.post1`。\n\n对于 Qwen3-Instruct-2507：\n```shell\npython -m sglang.launch_server --model-path Qwen\u002FQwen3-30B-A3B-Instruct-2507 --port 30000 --context-length 262144\n```\n\n对于 Qwen3-Thinking-2507：\n```shell\npython -m sglang.launch_server --model-path Qwen\u002FQwen3-30B-A3B-Thinking-2507 --port 30000 --context-length 262144 --reasoning-parser deepseek-r1\n```\n\n对于 Qwen3：\n```shell\npython -m sglang.launch_server --model-path Qwen\u002FQwen3-8B --port 30000 --context-length 131072 
--reasoning-parser qwen3\n```\n兼容 OpenAI 的 API 将可在 `http:\u002F\u002Flocalhost:30000\u002Fv1` 使用。\n\n> [!NOTE]\n> 多步工具调用需要保留相关的思考内容，而 SGLang 对 API 请求的预处理会丢弃所有 `reasoning_content` 字段，因此 **使用 Qwen3 思考模型进行多步工具调用** 的质量可能不够理想。该问题正在修复中；作为临时解决方案，我们建议直接传递原始内容，无需提取思考内容，聊天模板将正确处理这些内容。\n\n### vLLM\n\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) 是一个高吞吐量且内存高效的大型语言模型推理和部署引擎。\n建议使用 `vllm>=0.9.0`。\n\n对于 Qwen3-Instruct-2507：\n```shell\nvllm serve Qwen\u002FQwen3-30B-A3B-Instruct-2507 --port 8000 --max-model-len 262144\n```\n\n对于 Qwen3-Thinking-2507：\n```shell\nvllm serve Qwen\u002FQwen3-30B-A3B-Thinking-2507 --port 8000 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1\n```\n\n对于 Qwen3：\n```shell\nvllm serve Qwen\u002FQwen3-8B --port 8000 --max-model-len 131072 --enable-reasoning --reasoning-parser qwen3\n```\n兼容 OpenAI 的 API 将可在 `http:\u002F\u002Flocalhost:8000\u002Fv1` 使用。\n\n> [!NOTE]\n> 多步工具调用需要保留相关的思考内容，而 vLLM 对 API 请求的预处理会丢弃所有 `reasoning_content` 字段，因此 **使用 Qwen3 思考模型进行多步工具调用** 的质量可能不够理想。该问题正在修复中；作为临时解决方案，我们建议直接传递原始内容，无需提取思考内容，聊天模板将正确处理这些内容。\n\n### TensorRT-LLM\n\n[TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) 是 NVIDIA 开源的 LLM 推理引擎，它在 NVIDIA GPU 上提供了自定义注意力核、量化等多种优化。Qwen3 在其重新设计的 [PyTorch 后端](https:\u002F\u002Fnvidia.github.io\u002FTensorRT-LLM\u002Ftorch.html) 中得到支持。建议使用 `tensorrt_llm>=0.20.0rc3`。更多详细信息请参阅 [README](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM\u002Fblob\u002Fmain\u002Fexamples\u002Fmodels\u002Fcore\u002Fqwen\u002FREADME.md#qwen3) 页面。\n\n```shell\ntrtllm-serve Qwen\u002FQwen3-8B --host localhost --port 8000 --backend pytorch\n```\n兼容 OpenAI 的 API 将可在 `http:\u002F\u002Flocalhost:8000\u002Fv1` 使用。\n\n### MindIE\n\n如需在 Ascend NPU 上部署，请访问 [Modelers](https:\u002F\u002Fmodelers.cn\u002F) 并搜索 Qwen3。\n\n\u003C!--\n### OpenLLM\n\n[OpenLLM](https:\u002F\u002Fgithub.com\u002Fbentoml\u002FOpenLLM) 允许您轻松地以兼容 OpenAI 的 API 运行 Qwen2.5。您可以使用 `openllm serve` 启动模型服务器。例如：\n\n```bash\nopenllm 
serve qwen2.5:7b\n```\n\n服务器运行在 `http:\u002F\u002Flocalhost:3000\u002F`，提供兼容 OpenAI 的 API。您可以创建一个 OpenAI 客户端来调用其聊天 API。更多信息请参阅我们的文档 [这里](https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Fdeployment\u002Fopenllm.html)。\n-->\n\n\n## 使用 Qwen3 构建\n\n### 工具使用\n\n对于工具使用功能，我们建议查看 [Qwen-Agent](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-Agent)，它为这些 API 提供了封装，支持工具使用或函数调用，并具备 MCP 支持。\n使用 Qwen3 进行工具调用也可以通过 SGLang、vLLM、Transformers、llama.cpp、Ollama 等工具实现。\n请参考我们的文档中的指南，了解如何启用该支持。\n\n\n### 微调\n\n我们建议您使用训练框架，包括 [Axolotl](https:\u002F\u002Fgithub.com\u002FOpenAccess-AI-Collective\u002Faxolotl)、[Unsloth](https:\u002F\u002Fgithub.com\u002Funslothai\u002Funsloth)、[Swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fswift)、[Llama-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) 等，以 SFT、DPO、GRPO 等方法对您的模型进行微调。\n\n\n## 许可协议\n\n我们所有的开源权重模型均采用 Apache 2.0 许可证授权。\n您可以在相应的 Hugging Face 仓库中找到许可证文件。\n\n## 引用\n\n如果我们的工作对您有所帮助，欢迎引用。\n\n```bibtex\n@article{qwen3,\n    title={Qwen3 Technical Report}, \n    author={An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Feng Hu and Hao Ge and Haoran Wei and Huan Lin and Jialong Tang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jing Zhou and Jingren Zhou and Junyang Lin and Kai Dang and Keqin Bao and Kexin Yang and Le Yu and Lianghao Deng and Mei Li and Mingfeng Xue and Mingze Li and Pei Zhang and Peng Wang and Qin Zhu and Rui Men and Ruize Gao and Shixuan Liu and Shuang Luo and Tianhao Li and Tianyi Tang and Wenbiao Yin and Xingzhang Ren and Xinyu Wang and Xinyu Zhang and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yinger Zhang and Yu Wan and Yuqiong Liu and Zekun Wang and Zeyu Cui and Zhenru Zhang and Zhipeng Zhou and Zihan Qiu},\n    journal = {arXiv preprint arXiv:2505.09388},\n    year={2025}\n}\n\n@article{qwen2.5,\n    title   = {Qwen2.5 Technical Report}, \n    author  = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo 
Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},\n    journal = {arXiv preprint arXiv:2412.15115},\n    year    = {2024}\n}\n\n@article{qwen2,\n    title   = {Qwen2 Technical Report}, \n    author  = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},\n    journal = {arXiv preprint arXiv:2407.10671},\n    year    = {2024}\n}\n```\n\n## 联系我们\n如果您希望向我们的研究团队或产品团队留言，请加入我们的 [Discord](https:\u002F\u002Fdiscord.gg\u002Fz3GAxXZ9Ce) 或 [微信交流群](assets\u002Fwechat.png)！","# Qwen3 快速上手指南\n\n## 1. 
环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS, Windows\n*   **Python 版本**：推荐 Python 3.9 及以上\n*   **核心依赖**：\n    *   `transformers` >= 4.51.0 (必须)\n    *   `torch` (PyTorch)\n    *   `accelerate` (推荐用于自动设备映射)\n*   **硬件建议**：\n    *   **推理**：根据模型大小（4B ~ 235B），需配备相应显存的 GPU。小参数模型（如 4B）可在消费级显卡运行，大参数模型（如 30B, 235B）建议使用多卡或高显存专业卡。\n    *   **长上下文**：若需开启 256K 或 1M token 上下文支持，请确保显存充足并参考官方 ModelCard 配置。\n\n> **国内开发者提示**：推荐使用 [ModelScope (魔搭)](https:\u002F\u002Fmodelscope.cn\u002Forganization\u002Fqwen) 获取模型权重，下载速度更快且无需特殊网络环境。Hugging Face 用户若遇网络问题，可配置镜像源或使用代理。\n\n## 2. 安装步骤\n\n使用 pip 安装必要的 Python 库：\n\n```bash\npip install -U transformers torch accelerate\n```\n\n若您需要从 ModelScope 下载模型，建议安装 `modelscope` 库以获得更好的体验：\n\n```bash\npip install modelscope\n```\n\n## 3. 基本使用\n\nQwen3 系列主要包含两种模式：**指令模式 (Instruct)** 和 **思考模式 (Thinking)**。两者在使用代码上略有不同。\n\n### 场景一：使用 Qwen3-Instruct-2507 (通用对话)\n适用于日常问答、代码生成、文本创作等任务。该模式不输出思考过程，响应速度快。\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# 模型名称 (可使用 ModelScope 路径，例如 \"qwen\u002FQwen3-30B-A3B-Instruct-2507\")\nmodel_name = \"Qwen\u002FQwen3-30B-A3B-Instruct-2507\"\n\n# 加载分词器和模型\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n\n# 准备输入\nprompt = \"Give me a short introduction to large language model.\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# 生成回复\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=16384\n)\noutput_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() \n\ncontent = tokenizer.decode(output_ids, skip_special_tokens=True)\n\nprint(\"content:\", content)\n```\n\n### 场景二：使用 Qwen3-Thinking-2507 
(深度推理)\n适用于数学计算、复杂逻辑推理、科学问题及高难度编程任务。该模型会先输出思考过程（`\u003Cthink>` 标签内），再输出最终答案。\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# 模型名称\nmodel_name = \"Qwen\u002FQwen3-30B-A3B-Thinking-2507\"\n\n# 加载分词器和模型\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n\n# 准备输入\nprompt = \"Give me a short introduction to large language model.\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\ntext = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# 生成回复 (建议设置较大的 max_new_tokens 以容纳思考过程)\ngenerated_ids = model.generate(\n    **model_inputs,\n    max_new_tokens=32768\n)\noutput_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() \n\n# 解析思考内容与最终回答\ntry:\n    # 查找 \u003C\u002Fthink> 结束标记 (token id: 151668)\n    index = len(output_ids) - output_ids[::-1].index(151668)\nexcept ValueError:\n    index = 0\n\nthinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip(\"\\n\")\ncontent = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip(\"\\n\")\n\nprint(\"thinking content:\", thinking_content)\nprint(\"content:\", content)\n```\n\n> **注意**：\n> 1. **Instruct 模型**：仅支持非思考模式，无需设置 `enable_thinking` 参数，输出中不会包含 `\u003Cthink>` 标签。\n> 2. **Thinking 模型**：默认强制开启思考模式。模板会自动添加起始标记，因此输出通常只包含结束的 `\u003C\u002Fthink>` 标签，这是正常现象。\n> 3. 
**长上下文**：部分 Qwen3-2507 模型支持扩展至 1M token，具体启用方法请参阅对应模型的 ModelCard。","某跨国科技公司的数据团队需要快速从数百页的多语言技术文档和遗留代码库中提炼关键逻辑，并生成可执行的修复方案。\n\n### 没有 Qwen3 时\n- 面对超过 10 万字的混合语言文档，传统模型因上下文窗口限制只能分段处理，导致前后逻辑割裂，无法理解全局架构。\n- 在处理复杂的数学推导或科学原理时，模型经常产生幻觉或给出表面正确但深层逻辑错误的回答，需专家花费大量时间复核。\n- 生成的代码片段往往缺乏对特定工具链的适配，且难以遵循复杂的指令约束，开发人员需反复修改才能运行。\n- 对于非英语的小语种技术资料，模型理解能力薄弱，关键信息遗漏严重，阻碍了全球化知识的整合。\n\n### 使用 Qwen3 后\n- 利用 Qwen3 支持的 256K 甚至百万级 token 长上下文能力，团队可一次性输入整本技术手册，精准定位跨章节的逻辑关联。\n- 借助 Qwen3-Thinking-2507 的深度推理增强，模型在解决高难度数学与科学问题时展现出专家级水平，大幅降低了人工校验成本。\n- Qwen3-Instruct-2507 显著提升了指令遵循与代码生成质量，能直接输出适配现有框架的可运行代码，并准确调用外部工具。\n- 凭借多语言长尾知识的覆盖突破，Qwen3 能流畅解析小语种文档，确保全球技术资产被完整挖掘和利用。\n\nQwen3 通过超长上下文理解与深度推理能力的双重突破，将原本需要数天的人工研判工作压缩至小时级，极大提升了研发效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FQwenLM_Qwen3_c553ca42.png","QwenLM","Qwen","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FQwenLM_4756c6c9.png","Alibaba Cloud's general-purpose AI models",null,"qianwen_opensource@alibabacloud.com","Alibaba_Qwen","https:\u002F\u002Fqwen.ai\u002F","https:\u002F\u002Fgithub.com\u002FQwenLM",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",81,{"name":91,"color":92,"percentage":93},"Shell","#89e051",19,27125,1975,"2026-04-18T09:02:14","","未说明具体型号，但支持 CPU 和 GPU 运行；显存需求取决于模型尺寸（如 235B、30B、4B 等），需参考官方速度基准测试文档；CUDA 版本未说明","未说明",{"notes":101,"python":99,"dependencies":102},"README 未直接列出操作系统、Python 版本及内存的具体数值，仅指出推荐使用最新版的 transformers 库（需>=4.51.0）。支持在 CPU 和 GPU 上本地运行，框架包括 llama.cpp、Ollama、LM Studio 等。不同模型尺寸（如 235B、30B、4B）对硬件资源差异巨大，具体显存和吞吐量需求需查阅官方提供的速度基准测试链接。Qwen3-2507 系列分为 Instruct（非思考模式）和 Thinking（思考模式）两个变体，后者输出包含思考过程且建议设置较大的最大生成长度。",[103],"transformers>=4.51.0",[15],"2026-03-27T02:49:30.150509","2026-04-18T22:33:49.263608",[108,113,118,123,128,133],{"id":109,"question_zh":110,"answer_zh":111,"source_url":112},40941,"Qwen2 系列是否有 14B 和 32B 版本的计划？","是的，官方已发布 Qwen2.5 系列，新增了 3B、14B 和 32B 
三种模型尺寸以提供更多选择。您可以查看官方博客获取更多详情：https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2.5","https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fissues\u002F482",{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},40942,"配置了 128K 上下文后，输入长文本仍报错提示超出最大长度限制怎么办？","即使配置了 `rope_scaling`（如 factor: 4.0, type: yarn），实际有效外推长度可能仍远低于标称的 128K。测试表明，超过 4-5 万 token 时容易出错。建议先使用纯英文字符（如重复输入 \"a\"）测试模型是否能处理 10 万长度，若英文正常而中文报错，可能是分词或具体实现问题。目前完全稳定的 128K 支持可能需要等待后续更新或自行调整外推参数。","https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fissues\u002F717",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},40943,"使用 vLLM 启动 OpenAI 兼容接口时，输出结果结尾包含大量换行符（\\n\\n\\n）如何解决？","该问题通常与停止令牌（stop token）配置有关。解决方法是在调用接口时增加参数 `\"add_generation_prompt\": true`。此外，确保 `tokenizer_config.json` 中的 `eos_token` 已正确更新为 `\u003C|im_end|>`。如果问题依旧，请检查是否使用了最新的配置文件。","https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fissues\u002F46",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},40944,"使用 Ollama 运行 Qwen2-7B 模型时，无论输入什么都只输出大写字母\"G\"或乱码，如何解决？","这通常是由量化精度不足导致的。尝试将模型的量化等级从 4bit（如 q4_0）提升至 8bit，通常可以解决乱码或重复输出单一字符的问题。如果显存允许，也可以使用 q3_k_m 等中等量化版本进行测试，避免使用过低的量化等级（如 q2_k）。","https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fissues\u002F485",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},40945,"使用 vLLM 加载 Qwen2-72B-Instruct-gptq-int4 模型时，生成内容出现严重重复怎么办？","GPTQ 量化模型在某些情况下可能会出现重复生成的问题。官方建议在 Qwen2.5 系列中该问题有所改善。对于现有模型，请参考官方文档中的已知问题列表（https:\u002F\u002Fqwen.readthedocs.io\u002Fen\u002Flatest\u002Fquantization\u002Fgptq.html），必要时尝试自行重新量化模型，或提交具体的坏案例（badcase）报告以便进一步排查。","https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fissues\u002F675",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},40946,"为什么 vLLM 推理结果与 Hugging Face Transformers 的结果差异很大？","vLLM 和 Hugging Face 结果不一致通常由采样参数或模板处理差异引起。请确保两者使用完全相同的 `temperature`、`top_p` 以及 `stop`  tokens 设置。特别注意在 vLLM 中构建 prompt 时，需手动调用 `tokenizer.apply_chat_template(..., add_generation_prompt=True)` 来确保输入格式与 HF 
一致。如果参数完全一致仍有差异，可能是算子精度或内核实现不同导致的细微偏差。","https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3\u002Fissues\u002F76",[]]