[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-HKUDS--LightReasoner":3,"tool-HKUDS--LightReasoner":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159267,2,"2026-04-17T11:29:14",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":10,"env_os":88,"env_gpu":89,"env_ram":88,"env_deps":90,"category_tags":95,"github_topics":96,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":101,"updated_at":102,"faqs":103,"releases":134},8606,"HKUDS\u002FLightReasoner","LightReasoner","[ACL 2026] \"LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?\"","LightReasoner 是一项旨在突破大语言模型推理能力瓶颈的创新技术，其核心理念令人耳目一新：让小模型来“教”大模型如何更好地推理。传统方法往往依赖海量数据进行 exhaustive（穷尽式）训练，不仅计算成本高昂，且效率低下。LightReasoner 另辟蹊径，通过策略性的令牌选择机制，精准挖掘大模型潜在的推理能力，在显著降低计算开销的同时实现性能跃升。\n\n实验数据显示，相比传统的监督微调（SFT），LightReasoner 能将总耗时减少 90%，采样问题量减少 80%，而用于微调的令牌数量更是惊人地减少了 99%。这意味着开发者无需昂贵的算力集群，也能高效地提升模型在数学解题、逻辑推导等复杂任务上的表现。该项目已收录于 ACL 2026 主会，并开源了基于 Qwen2.5-Math 和 DeepSeek-R1 
的核心实现及预收集的训练样本。\n\nLightReasoner 特别适合 AI 研究人员、大模型开发者以及希望以低成本优化模型推理能力的技术团队使用。它证明了在人工智能进阶之路上，“更聪明”的策略远比“更费力”的蛮练有效，为资源受限环境下的大模型优化提供了极具价值的","LightReasoner 是一项旨在突破大语言模型推理能力瓶颈的创新技术，其核心理念令人耳目一新：让小模型来“教”大模型如何更好地推理。传统方法往往依赖海量数据进行 exhaustive（穷尽式）训练，不仅计算成本高昂，且效率低下。LightReasoner 另辟蹊径，通过策略性的令牌选择机制，精准挖掘大模型潜在的推理能力，在显著降低计算开销的同时实现性能跃升。\n\n实验数据显示，相比传统的监督微调（SFT），LightReasoner 能将总耗时减少 90%，采样问题量减少 80%，而用于微调的令牌数量更是惊人地减少了 99%。这意味着开发者无需昂贵的算力集群，也能高效地提升模型在数学解题、逻辑推导等复杂任务上的表现。该项目已收录于 ACL 2026 主会，并开源了基于 Qwen2.5-Math 和 DeepSeek-R1 的核心实现及预收集的训练样本。\n\nLightReasoner 特别适合 AI 研究人员、大模型开发者以及希望以低成本优化模型推理能力的技术团队使用。它证明了在人工智能进阶之路上，“更聪明”的策略远比“更费力”的蛮练有效，为资源受限环境下的大模型优化提供了极具价值的新范式。","\u003C!-- Icon and title -->\n\u003Ch1 align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_ced65dcb4787.png\" width=\"100\" alt=\"lightreasoner-logo\" \u002F>\n\u003Cbr>\n💡 LightReasoner:  \nCan \u003Cstrong>\u003Cem>SMALL\u003C\u002Fem>\u003C\u002Fstrong> Language Models Teach \u003Cstrong>\u003Cem>LARGE\u003C\u002Fem>\u003C\u002Fstrong> Language Models Reasoning?\n\u003C\u002Fh1>\n\n\n\u003C!-- Authors -->\n\u003Ch3 align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=BGT3Gb8AAAAJ&hl=en\" target=\"_blank\"> Jingyuan Wang\u003C\u002Fa> ·\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=k6yAt6IAAAAJ&hl=en&oi=sra\" target=\"_blank\"> Yankai Chen\u003C\u002Fa> ·\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=__9uvQkAAAAJ&hl=en\" target=\"_blank\"> Zhonghang Li\u003C\u002Fa> ·\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Zkv9FqwAAAAJ&hl=en\" target=\"_blank\"> Chao Huang\u003C\u002Fa>\n\u003C\u002Fh3>\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_55375c9eec8f.png\" width=\"500\" alt=\"Welcome banner\"\u002F>\n\u003C\u002Fp>\n\n\n\u003C!-- Quick links -->\n\u003Cdiv 
align=\"center\">\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2510.07962-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07962)\n[![🤗 Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Paper-LightReasoner-ffcc4d.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.07962)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n[![Baselines](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBaselines-Qwen2.5--Math-blue.svg)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math)\n![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-yellow.svg)\n[![🤗 Models](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Models-LightReasoner_Models-ffcc4d.svg)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fbearthecoder\u002Flightreasoner-models-68edbf175755ca5a8c699f9c)\n\u003Cbr>\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fstargazers\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHKUDS\u002FLightReasoner?color=00d9ff&style=for-the-badge&logo=github&logoColor=white&labelColor=1a1a2e&label=Stars\" alt=\"GitHub stars\">\n\u003C\u002Fa>\n\n\n\u003Cp>\n\t\u003Ca href=\"README.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇺🇸English-1a1a2e?style=for-the-badge\">\u003C\u002Fa>\n  \u003Ca href=\"README-zh.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇨🇳中文版-1a1a2e?style=for-the-badge\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\u003Ca href=\".\u002FCommunication.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F💬Feishu-Group-07c160?style=for-the-badge&logoColor=white&labelColor=1a1a2e\">\u003C\u002Fa>\n\u003Ca href=\".\u002FCommunication.md\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Group-07c160?style=for-the-badge&logo=wechat&logoColor=white&labelColor=1a1a2e\">\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\n---\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_45a5e913fa67.png\" width=\"800\" \u002F>\n  \u003Cbr>\n  \u003Cem>\u003Cstrong>Figure 1: LightReasoner delivers superior performance with remarkable token efficiency\u003C\u002Fstrong> - achieving consistent improvements in zero-shot pass@1 accuracy while dramatically reducing computational overhead by 90% in total time, 80% in sampled problems, and 99% in tuned tokens compared to traditional SFT.\u003C\u002Fem>\n\u003C\u002Fp>\n\n\n\n**💡 Key Insight:**  \n\nThis efficiency breakthrough shows that **strategic token selection**, rather than exhaustive training, most effectively unlocks the latent potential of LLM reasoning — proving that *smarter, not blindly harder* is the path to scalable AI improvement.\n\n\n---\n\n\n## 🎉 News\n- [x] [2026\u002F04\u002F06] 🚀 LightReasoner has been accepted to the ACL 2026 Main Conference! Many thanks to all co-authors and collaborators for their support.\n- [x] [2025\u002F10\u002F14] 🚀 New Release: [`LRsamples`](.\u002FLRsamples) — **Pre-collected LightReasoner training samples** ready for immediate fine-tuning. This dataset enables direct model training without requiring the full sampling pipeline, streamlining reproduction efforts and accelerating downstream research workflows.\n- [x] [2025\u002F10\u002F14] 🚀 New Release: **LightReasoner Enhanced Models** now available on 🤗 [Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fbearthecoder\u002Flightreasoner-models-68edbf175755ca5a8c699f9c). 
Ready-to-use models fine-tuned with our efficient reasoning enhancement approach for immediate deployment and experimentation.\n- [x] [2025\u002F10\u002F12] 🚀 New Release: Core implementation with Qwen2.5-Math and DeepSeek-R1 models.\n\n\n---\n\n\n## ⚡ TL;DR\n\n**✨ LightReasoner ✨** flips the script on AI training — small language models (SLMs) don’t just *learn* from large ones (LLMs); they can actually *teach* LLMs better and faster!\n\n\n**🔥 The Challenge:** \n\nSupervised Fine-Tuning (SFT) struggles with three core bottlenecks:\n\n- **📊 Data-Intensive:** Relies on human-labeled or rejection-sampled datasets.\n\n- **⚖️ Uniform Learning:** Trains all tokens equally, even though only a small portion truly matter.  \n\n- **🔗 Ground-Truth Dependency:** Hinders adaptability to new domains and reasoning formats.  \n\n\n**🔍 Key Insight:**  \n\nWe allocate 90% of compute to what models already know, while *under-investing* in the critical 10% that truly drives breakthroughs.\n\n\n## 📈 LightReasoner: *Better and Faster*\n\n**Tested across 7 benchmarks × 5 models**\n\n🚀 **Performance Gains**  \n\nLightReasoner consistently boosts reasoning accuracy across multiple datasets:\n\n- 📈 **Qwen2.5-Math-1.5B:** +28.1% on GSM8K, +25.1% on MATH, +7.2% on SVAMP, +11.7% on ASDIV \n\n- 📈 **DeepSeek-R1-Distill-Qwen-1.5B:** +4.3% on GSM8K, +6.0% on MATH, +17.4% on OlympiadBench  \n\n- 📈 **Qwen2.5-Math-7B:** +10.4% on GSM8K, +6.0% on MATH, +9.3% on SVAMP, +7.9% on ASDIV  \n\n- 📈 **Qwen2.5-Math-1.5B-Instruct:** +1.9% on GSM8K, +2.6% on Minerva Math\n\n- 🌍 **Strong generalization:** Trained *only* on GSM8K, yet improves across **7 benchmarks**\n\n\n⚡ **Efficiency Breakthrough**  \n\nTaking `Qwen2.5-Math-1.5B` as an example, LightReasoner achieves dramatic efficiency gains compared with SFT:\n\n- ⏱️ **90% less total time:** 4 hours → 0.5 hours \n\n- 🧾 **80% fewer sampled problems:** 3,952 → 1,000 problems  \n\n- 🔢 **99% fewer tuned tokens:** 1.77M → 20K tokens  \n\n\n🌟 **Key Features**\n\n- 
🎯 **SLM–LLM Teaching:** \n  \n  Counterintuitively uses smaller *“amateur”* models to identify **critical reasoning moments** where stronger *“expert”* models should focus their learning.  \n\n- ⚡ **Extreme Token Efficiency:** \n  \n  Achieves **99% fewer tuned tokens** than SFT by selectively optimizing **high-impact reasoning steps** instead of training uniformly on full trajectories.  \n\n- 🔄 **Three-Stage Lightweight Framework:**  \n\n  (1) **Critical step selection** via Expert-Amateur KLD detection\n\n  (2) **Contrastive supervision** capturing expert-amateur behavioral differentials\n\n  (3) **Self-distillation** for internalizing expert strengths  \n\n- 📈 **KL-Guided Learning:** \n  \n  Leverages **behavioral divergence** between expert and amateur predictions to **pinpoint reasoning bottlenecks** — *all without requiring ground-truth labels.*  \n\n- 🧠 **Expertise Over Scale:** \n  \n  Demonstrates that **domain expertise gaps**, rather than model size, drive effective contrast — even same-sized models with different knowledge can generate **powerful teaching signals.**\n\n\n---\n\n\n## 🧩 LightReasoner Framework\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_b0134efc0515.png\" width=\"800\" \u002F>\n  \u003Cbr>\n  \u003Cem>\n    \u003Cstrong>Figure 2: Overview of the LightReasoner framework.\u003C\u002Fstrong> (1) Sampling Stage: Expert and Amateur models generate distributions π\u003Csub>E\u003C\u002Fsub> and π\u003Csub>A\u003C\u002Fsub>. Informative step selection retains steps with D\u003Csub>KL\u003C\u002Fsub>(π\u003Csub>E\u003C\u002Fsub> ∥ π\u003Csub>A\u003C\u002Fsub>) > β, and contrastive supervision constructs soft labels v\u003Csub>C\u003C\u002Fsub> capturing the Expert's advantage through Expert–Amateur contrast. 
(2) Fine-tuning Stage: The Expert model is enhanced by minimizing the KL divergence between its output and v\u003Csub>C\u003C\u002Fsub>.\n  \u003C\u002Fem>\n\u003C\u002Fp>\n\n\n---\n\n\n## 🚀 Quick Start\n\n*LightReasoner* is incredibly *easy* to use. We’ve designed it to be accessible — so anyone can try it out and experience its *“counterintuitive effectiveness”* firsthand.\nNo sweat — you’ll have it set up and running with your model of choice in just a few 🪄 simple steps below!\n\n\n### 📦 Get Ready\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner.git\ncd LightReasoner\n```\n\n1️⃣ Install all dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n2️⃣ Download the Expert and Amateur models of your choice. For example:\n\n🦉 Expert Model\n```bash\nhuggingface-cli download Qwen\u002FQwen2.5-Math-1.5B --local-dir .\u002FQwen2.5-Math-1.5B\n```\n\n🐣 Amateur Model\n```bash\nhuggingface-cli download Qwen\u002FQwen2.5-0.5B --local-dir .\u002FQwen2.5-0.5B\n```\n\n\n3️⃣ Prepare the training data:\n\n```bash\npython data_prep.py\n```\n\n\n#### ⚠️ Caveat\n\nLightReasoner relies on **Expert–Amateur model pairing** to generate supervision signals. Thus, the choice of this pair is crucial to the method’s success.  \n\n⚖️ **Rule of Thumb**: \n\nThe Expert should **significantly outperform** the Amateur, while the Amateur must remain **competent enough** to produce coherent reasoning. In practice, performance peaks at a balanced *“sweet spot”* rather than simply widening the capability gap.   \n\nIn our experiments, the Experts include *Qwen2.5-Math-1.5B*, *7B*, their *Instruct* counterparts, and *DeepSeek-R1-Distill* variants. The Amateur is fixed as *Qwen2.5-0.5B*, which offers strong contrast while maintaining sufficient reasoning ability to yield meaningful signals.  
\n\nYou’re *encouraged* to explore other model families (e.g., *Llama*), but keep this **balance principle** in mind when setting up your Expert–Amateur collaboration.\n\n\n#### 📋 Note\n\n- We use GSM8K *by default* for its emphasis on step-by-step, broadly applicable logical reasoning rather than domain-specific notation. This ensures that the Amateur, despite lacking math-specific training, can still produce interpretable outputs suitable for contrastive supervision.\n\n- You’re *absolutely* free to try other datasets — LightReasoner is fully adaptable. However, depending on your dataset, you may need to adjust hyperparameters and the choice of Amateur model to ensure stable training and meaningful contrasts.\n\n  - For instance, if you experiment with the **MATH** dataset — a collection of high-school competition problems that are significantly harder than GSM8K — it’s recommended to upgrade the Amateur model from a generic **Qwen2.5** base model to the specialized **Qwen2.5-Math** variant. The base models were not math-pretrained and may struggle to produce coherent outputs on MATH, potentially destabilizing the expert–amateur contrast.\n\n  - The *balance principle* still applies here — the Amateur should be *adequately weaker* than the Expert to produce a clear contrast, yet capable enough to maintain coherent reasoning.\n\n\n---\n\n\n### 🎯 Sampling\n\nThis step builds the **LightReasoner supervision dataset** for downstream fine-tuning. Steps with high Expert-Amateur KLD are retained. These selected steps are transformed into supervision examples that encode the Expert’s strengths through *distributional contrast*. For full details, please see [our paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07962).\n\n\n```bash\npython LightR_sampling.py --max_questions 1000\n```\n\n\n#### 📋 Note\n\nBefore running the script, you should:\n\n- Update the **config section** with your own relative paths. 
\n\n- Adjust the maximum number of problems to control the size of your supervision dataset, tweak the sampling parameters to explore more optimal combinations, and tune the batch size based on your available compute resources.\n\n  - To give you a rough picture, in practice, we find that sampling **1,000 problems** from the GSM8K training set (with the filtering threshold **β = 0.4**) yields approximately **20,000 LightReasoner contrastive samples**, which is already sufficient for **LoRA fine-tuning to converge** on the baseline models we tested.\n\n\n#### ⚡ **Shortcut**\n\nTo save you the trouble of running the sampling pipeline — which, even though much *lighter and easier* with LightReasoner, can still be daunting for those without ample compute power — we now provide *ready-to-go* LightReasoner samples that let you **jump straight to the fine-tuning stage**! 🚀  \n\nYou can find the following pre-collected **LightReasoner sampling datasets** in the zip file under [`LRsamples`](.\u002FLRsamples):\n\n- **`LR_Qwen7_gsm8k`** — for **Qwen2.5-Math-7B**\n\n- **`LR_ds1.5_gsm8k`** — for **DeepSeek-R1-Distill-Qwen-1.5B**\n\n- **`LR_Qwen1.5_gsm8k`** — for **Qwen2.5-Math-1.5B** \n\n  - We provide **two versions**, one sampled with **Torch 3.1** and another with **Torch 3.8**, as we found that the sampling results (i.e., the model’s generated outputs) can slightly vary across Torch versions.  \n\n  - The performance fluctuation is minimal — typically within **2–3%**, with later Torch versions usually performing slightly better.\n\nThese datasets make it **much easier to reproduce** our results directly — no additional sampling required! 
✨\n\n\n---\n\n\n### ⚙️ Fine-tuning\n\nThis step launches the full LightReasoner fine-tuning pipeline — combining *dataset loading*, *LoRA configuration*, and *contrastive KLD training* into a unified workflow.\n\n\n#### 💻 Run Options\n\n**Foreground (simple run):**\n```bash\npython LightR_finetuning.py\n```\n\n**Background (recommended for long training):**\n```bash\nnohup python LightR_finetuning.py > finetune.log 2>&1 &\n```\n\n**Monitor progress:**\n```bash\ntail -f finetune.log\n```\n\n\n#### ⚠️ Caveat\n\n*The expert model used for fine-tuning must be identical to the one used during sampling — this alignment is essential for correct behavior.*\n\n\n#### 📋 Note\n\nBefore running the script, edit the **config section** to match your setup:\n\n- 🔹 Replace `\u003Cpath_to_expert_model>` with your base model path *(e.g., `\".\u002FQwen2.5-Math-7B\"` or a local folder).*  \n\n- 🔹 Replace `\u003Cpath_to_training_dataset>` with your dataset JSONL file.  \n\n- 🔹 Replace `\u003Coutput_directory>` with the directory where checkpoints and the final model will be saved.  \n\n- 🔹 Set `torch_dtype` according to your hardware *(e.g., `torch.bfloat16` for **H100**, `torch.float16` for **A100**).*\n\n\n---\n\n\n### 🔗 Model Merging\n\nUse this step to **merge the full model** (base + LoRA) locally, so it behaves as a **standalone model** without any LoRA dependency.\n\n```bash\npython merge.py\n```\n\n#### 📋 Note\nBefore running the merge script, update the **config section** with your own paths: \n\n- 🔹 `base_model_path` to your base model directory *(e.g., `.\u002FQwen2.5-Math-7B`)* \n\n- 🔹 `lora_ckpt_path` to your LoRA checkpoint directory *(e.g., `.\u002Fft_qw7_gsm8k\u002Fcheckpoint-1000`)*  \n\n- 🔹 `merged_model_path` to where you want the merged model to be saved *(e.g., `.\u002Fft-7B-merged`)*\n\n\n---\n\n\n### 📈 Evaluation\n\nAll evaluations are performed using the **official Qwen2.5-Math toolkit**.  
\n\nPlease refer to the [`evaluation`](.\u002Fevaluation) folder for detailed usage and setup instructions.\n\n\n---\n\n\n## 📊 Main Results\n\n| Model                                         | GSM8K | MATH | SVAMP | ASDiv | Minerva Math | Olympiad Bench | MMLU STEM | AVG. |\n|-----------------------------------------------|-------|------|-------|-------|-------------------|---------------|----------------|------|\n| **\u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>**            |       |      |       |       |                   |               |                |      |\n| Baseline                                      | 42.5  | 34.2 | 68.8  | 68.1  | 9.9               | 23.7          | 49.8           | 42.4 |\n| + SFT                                         | 69.2  | 57.1 | 64.1  | 70.2  | **15.1**          | **27.6**      | 47.7           | 50.1 |\n| + LightR                                      | **70.6** | **59.3** | **76.0** | **79.8** | 11.4 | 27.1 | **54.9** | **54.2** |\n| **\u003Cnobr>Qwen2.5-Math-1.5B-Instruct\u003C\u002Fnobr>**   |       |      |       |       |                   |               |                |      |\n| Baseline                                      | 84.8  | 75.8 | 94.2  | 94.7  | 29.4              | 37.5          | 57.4           | 67.7 |\n| + SFT                                         | 85.4  | 75.8 | 93.5  | 94.7  | 31.6              | 37.5          | 56.2           | 67.8 |\n| + LightR                                      | **86.7** | 75.5 | 93.0 | 94.1 | **32.0** | **37.8** | 55.2 | **67.8** |\n| **\u003Cnobr>DeepSeek-R1-Distill-Qwen-1.5B\u003C\u002Fnobr>**|       |      |       |       |                   |               |                |      |\n| Baseline                                      | 75.2  | 54.2 | 79.9  | 84.9  | 16.2              | 19.1          | 22.3           | 50.3 |\n| + SFT                                         | 78.2  | **60.3** | 81.5 | 87.4 | **18.4** | 21.2 | 26.2 | 53.3 |\n| + LightR                     
                 | **79.5** | 60.2 | **83.5** | **87.5** | 18.0 | **36.5** | **26.2** | **55.9** |\n| **\u003Cnobr>Qwen2.5-Math-7B\u003C\u002Fnobr>**              |       |      |       |       |                   |               |                |      |\n| Baseline                                      | 57.5  | 51.8 | 67.9  | 72.7  | 14.0              | 16.0          | 69.8           | 50.0 |\n| + SFT                                         | 64.4  | **63.3** | 76.2 | 76.6 | 12.1 | **20.5** | 68.5 | 54.5 |\n| + LightR                                      | **67.9** | 57.8 | **77.2** | **80.6** | 12.1 | 16.9 | **70.5** | **54.7** |\n| **\u003Cnobr>Qwen2.5-Math-7B-Instruct\u003C\u002Fnobr>**     |       |      |       |       |                   |               |                |      |\n| Baseline                                      | 95.2  | 83.2 | 93.9  | 95.3  | 33.8              | 41.5          | 69.3           | 73.2 |\n| + SFT                                         | 95.4  | 83.1 | **94.1** | 95.2 | **38.2** | 40.7 | 68.2 | **73.6** |\n| + LightR                                      | **95.8** | **83.6** | 93.1 | 95.2 | 34.2 | 39.0 | 67.8 | 72.7 |\n\n\n- Trained *solely* on GSM8K, LightReasoner generalizes effectively for 5 baseline models, achieving consistent gains across 7 benchmarks.\n\n- **+28.1%** on GSM8K, **+25.1%** on MATH, **+7.2%** on SVAMP, **+11.7%** on ASDIV for Qwen2.5-Math-1.5B.  \n\n- **+4.3%** on GSM8K, **+6.0%** on MATH, **+17.4%** on OlympiadBench for DeepSeek-R1-Distill-Qwen-1.5B. \n\n- **+10.4%** on GSM8K, **+6.0%** on MATH, **+9.3%** on SVAMP, **+7.9%** on ASDIV for Qwen2.5-Math-7B.  \n\n- Efficiency vs. SFT: **90% less total time**, **80% fewer sampled problems**, **99% fewer tuned tokens**.  
\n\n\n---\n\n\n## ⏱️ Efficiency Study\n\n| **Method** | **Total Time** | **Sampled Problems** | **Tuned Tokens** | **Average Gain** |\n|------------|----------|------------|------------|----------|\n| **Qwen2.5-Math-1.5B** |||||\n| + SFT      | 4.0h     | 3952       | 1.77M      | +7.7%   |\n| **+ LightReasoner** | **0.5h** | **1000**  | **0.02M**  | **+11.8%** |\n| **Qwen2.5-Math-7B** |||||\n| + SFT      | 9.5h     | 6029       | 2.20M      | +4.5%   |\n| **+ LightReasoner** | **0.75h** | **1000** | **0.02M**  | **+4.7%** |\n| **DeepSeek-R1-Distill-Qwen-1.5B** |||||\n| + SFT     | 3.6h     | 6023       | 5.95M      | +3.0%   |\n| **+ LightReasoner** | **0.5h** | **1000**  | **0.02M**  | **+5.6%** |\n| **Qwen2.5-Math-1.5B-Instruct** |||||\n| + SFT     | 3.4h     | 7153       | 2.08M      | +0.1%   |\n| **+ LightReasoner** | **0.4h** | **1000**  | **0.02M**  | +0.1%   |\n\n\n- 🧑‍🏫 **Supervised Fine-Tuning (SFT):** \n\n  - Implemented with rejection sampling, where models are fine-tuned on demonstrations of correct reasoning trajectories.  \n  \n  - For a fair comparison, SFT adopts the *same* experimental configuration as LightReasoner, performing LoRA-based fine-tuning *exclusively* on the GSM8K training set.\n\n  - 🎯 **Key Difference:**  \n  \n    - *LightReasoner* trains on selective next-token predictions, whereas *SFT* optimizes over full reasoning trajectories — an *inherent* difference dictated by their respective training paradigms.  \n\n    - Thus, each *LightReasoner* training instance corresponds to a **single next-token prediction**, whereas each *SFT* example corresponds to a **full reasoning trajectory** comprising a consecutive series of next-token predictions.\n\n\n- 📈 **Efficiency Evaluation:** \n \n  - ⏱️ **Time Budget** — Sampling time plus fine-tuning time, measured on a *single NVIDIA H200 GPU* without inference accelerators (e.g., vLLM).  
\n  \n  - 📘 **Sampled Problems** — Number of distinct GSM8K training set problems used to generate the supervision dataset.  \n  \n  - 🔢 **Tuned Tokens** — Number of next-token predictions used for fine-tuning, i.e., computational overhead measured at the token level.\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_c0a17e948bba.png\" width=\"200\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_fdacfbb79ac9.png\" width=\"200\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_883a8470013e.png\" width=\"200\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_aa570fdb2b2f.png\" width=\"196\" \u002F>\n  \u003Cbr>\n  \u003Cem>\u003Cstrong>Figure 3: LightReasoner matches or surpasses SFT performance with remarkable resource efficiency\u003C\u002Fstrong> — achieving competitive accuracy while cutting training time by 90%, reducing sampled problems by 80%, and requiring 99% fewer tuned tokens.\u003C\u002Fem>\n\n\u003C\u002Fp>\n\n\n💡 **Key Insight:** \n\n*This marks a fundamental shift in how models are trained — **targeting critical reasoning steps** outperforms brute-force learning, making high-quality AI training achievable even with limited computational resources.*\n\n\n---\n\n\n## 🧠 Expertise-Driven Contrast\n\n| **Amateur Model** | **Perf. 
Gap** | **GSM8K** | **MATH** | **SVAMP** | **ASDiv** | **MMLU STEM** | **AVG.** |\n|-------------------|-------------|-----------|----------|-----------|-----------|---------------|----------|\n| **Expert: \u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>** |||||||||\n| **\u003Cnobr>Qwen2.5-0.5B\u003C\u002Fnobr>**             | **38.2**  | **70.6** | **59.3** | **76.0** | **79.8** | **54.9** | **68.1** |\n| \u003Cnobr>Qwen2.5-1.5B\u003C\u002Fnobr>                 | 35.1  | 63.4 | 57.1 | 69.7 | 75.7 | 54.8 | 64.1 |\n| \u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>            | \u002F  | \u002F | \u002F | \u002F | \u002F | \u002F | \u002F |\n| \u003Cnobr>Qwen2.5-Math-1.5B-Ins\u003C\u002Fnobr>        | -42.3 | 41.4 | 35.5 | 67.5 | 66.4 | 55.0 | 53.2 |\n| *Expert Only (Baseline)*                  | \u002F     | 42.5 | 34.2 | 68.8 | 68.1 | 49.8 | 52.7 |\n| **Expert: \u003Cnobr>Qwen2.5-Math-7B\u003C\u002Fnobr>** |||||||||\n| **\u003Cnobr>Qwen2.5-0.5B\u003C\u002Fnobr>**             | **53.2**  | **67.9** | **57.8** | **77.2** | **80.6** | **70.5** | **70.8** |\n| \u003Cnobr>Qwen2.5-1.5B\u003C\u002Fnobr>                 | 50.1  | 69.0 | 56.0 | 77.6 | 78.9 | 69.5 | 70.2 |\n| \u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>            | 15.0  | 56.9 | 50.2 | 63.5 | 63.4 | 70.7 | 60.9 |\n| \u003Cnobr>Qwen2.5-Math-1.5B-Ins\u003C\u002Fnobr>        | -27.3 | 59.4 | 49.0 | 68.3 | 69.6 | 70.3 | 63.3 |\n| *Expert Only (Baseline)*                  | \u002F     | 57.5 | 51.8 | 67.9 | 72.7 | 69.8 | 63.9 |\n\n\n- **Domain Expertise over Scale:** *The success of Expert–Amateur collaboration is driven most effectively by domain-specific knowledge rather than model size (e.g., Qwen2.5-Math-1.5B vs. 
Qwen2.5-1.5B), freeing LightReasoner from rigid scaling constraints.*\n\n- **Dependence on Expertise Gap:** *Performance gains are closely correlated with the size of the expertise gap — as the Amateur approaches the Expert’s capability, contrastive signals weaken and improvements diminish.*\n\n\n---\n\n## 🔍 More Insights\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_b10ba1d2cefe.png\" alt=\"Sampling Stage\" width=\"55.5%\"\u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_090b9288f289.png\" alt=\"Fine-tuning Stage\" width=\"34.5%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \n  \u003Cem>👈 Figure 4(a): Expert–Amateur Pairing Effects — Each point represents a fixed Expert model paired with an Amateur model. The performance gains achieved by LightReasoner diminish as the expertise gap narrows.\u003C\u002Fem>\u003Cbr>\n\n  \u003Cem>👉 Figure 4(b): Impact of Ablation — Removing key components from LightReasoner consistently reduces performance, revealing their critical contributions.\u003C\u002Fem>\n\n\u003C\u002Fp>\n\n\n---\n\n\n## 🏆 Comparison with Competing Methods\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd>\n\n\u003C!-- Left Table -->\n  \n| **Attribute**        | **Time** | **SFT** | **LightR** |\n|-----------------------|----------------|---------|------------|\n| Full trajectories     | ⬆️          | ✅      | ❌         |\n| All-token tuning      | ⬆️          | ✅      | ❌         |\n| Prefix termination    | ⬇️          | ❌      | ✅         |\n| Selective tokens      | ⬇️          | ❌      | ✅         |\n| Verification-free     | ⬇️          | ❌      | ✅         |\n\n\u003C\u002Ftd>\n\u003Ctd>\n\n\u003C!-- Right Table -->\n\n| **Attribute**         | **Utility** | **CD**      | **LightR** |\n|------------------------|------------------|-------------|------------|\n| Contrast usage         | \u002F                | 
Inference   | Training   |\n| Size-based contrast    | ⬇️            | ✅          | ❌         |\n| Expertise contrast     | ⬆️            | ❌          | ✅         |\n| Persistent benefits    | ⬆️            | ❌          | ✅         |\n| Standalone inference  | ⬆️            | ❌          | ✅         |\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n- 👈 *Left:* Efficiency contrasts at a glance. ⬆️ and ⬇️ indicate whether each aspect helps or hurts the overall efficiency of the method. \n  \n- 👉 *Right:* Key differences between traditional Contrastive Decoding (CD) methods and LightReasoner. ⬆️ and ⬇️ indicate whether each aspect helps or hurts the practicality of the method.\n\n\n---\n\n\n## ☕️ Citation\n\nIf you find this work useful, please consider citing our paper:\n\n```python\n@article{wang2025lightreasoner,\n  title={LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?},\n  author={Wang, Jingyuan and Chen, Yankai and Li, Zhonghang and Huang, Chao},\n  journal={arXiv preprint arXiv:2510.07962},\n  year={2025}\n}\n```\n\nThank you for your interest in our work!\n\n\n---\n\n\n## 📜 License\n\nThis project is released under the [MIT License](.\u002FLICENSE).\n\n\n\u003Cbr>\n\n\n\u003Cp align=\"center\">\n  \u003Cem> ❤️ Thanks for visiting ✨ LightReasoner ✨\u003C\u002Fem>\u003Cbr>\u003Cbr>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_0f11e0d65e86.png\" alt=\"Views\">\n\u003C\u002Fp>\n","\u003C!-- 图标和标题 -->\n\u003Ch1 align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_ced65dcb4787.png\" width=\"100\" alt=\"lightreasoner-logo\" \u002F>\n\u003Cbr>\n💡 LightReasoner:  \n能否让\u003Cstrong>\u003Cem>小型\u003C\u002Fem>\u003C\u002Fstrong>语言模型教会\u003Cstrong>\u003Cem>大型\u003C\u002Fem>\u003C\u002Fstrong>语言模型推理能力？\n\u003C\u002Fh1>\n\n\n\u003C!-- 作者 -->\n\u003Ch3 align=\"center\">\n\u003Ca 
href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=BGT3Gb8AAAAJ&hl=en\" target=\"_blank\"> 王景远\u003C\u002Fa> ·\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=k6yAt6IAAAAJ&hl=en&oi=sra\" target=\"_blank\"> 陈彦凯\u003C\u002Fa> ·\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=__9uvQkAAAAJ&hl=en\" target=\"_blank\"> 李中航\u003C\u002Fa> ·\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Zkv9FqwAAAAJ&hl=en\" target=\"_blank\"> 黄超\u003C\u002Fa>\n\u003C\u002Fh3>\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_55375c9eec8f.png\" width=\"500\" alt=\"欢迎横幅\"\u002F>\n\u003C\u002Fp>\n\n\n\u003C!-- 快速链接 -->\n\u003Cdiv align=\"center\">\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2510.07962-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07962)\n[![🤗 Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Paper-LightReasoner-ffcc4d.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.07962)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n[![Baselines](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBaselines-Qwen2.5--Math-blue.svg)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math)\n![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-yellow.svg)\n[![🤗 Models](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Models-LightReasoner_Models-ffcc4d.svg)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fbearthecoder\u002Flightreasoner-models-68edbf175755ca5a8c699f9c)\n\u003Cbr>\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fstargazers\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHKUDS\u002FLightReasoner?color=00d9ff&style=for-the-badge&logo=github&logoColor=white&labelColor=1a1a2e&label=Stars\" alt=\"GitHub星标\">\n\u003C\u002Fa>\n\n\n\u003Cp>\n\t\u003Ca href=\"README.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇺🇸English-1a1a2e?style=for-the-badge\">\u003C\u002Fa>\n  \u003Ca href=\"README-zh.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇨🇳中文版-1a1a2e?style=for-the-badge\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\u003Ca href=\".\u002FCommunication.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F💬Feishu-Group-07c160?style=for-the-badge&logoColor=white&labelColor=1a1a2e\">\u003C\u002Fa>\n\u003Ca href=\".\u002FCommunication.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Group-07c160?style=for-the-badge&logo=wechat&logoColor=white&labelColor=1a1a2e\">\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\n---\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_45a5e913fa67.png\" width=\"800\" \u002F>\n  \u003Cbr>\n  \u003Cem>\u003Cstrong>图1：LightReasoner以卓越的令牌效率实现更优性能\u003C\u002Fstrong>——在零样本pass@1准确率上持续提升，同时与传统SFT相比，总耗时减少90%，采样问题数量减少80%，调优令牌数减少99%。\u003C\u002Fem>\n\u003C\u002Fp>\n\n\n\n**💡 核心洞见：**\n\n这一效率突破表明，**策略性地选择令牌**而非进行穷举式训练，才能最有效地释放LLM推理的潜在能力——这证明了“更智能而非盲目加大投入”才是实现AI规模化改进的道路。\n\n\n---\n\n\n## 🎉 新闻\n- [x] [2026\u002F04\u002F06] 🚀 LightReasoner已被ACL 2026主会接受！衷心感谢所有合著者及合作者的支持。\n- [x] [2025\u002F10\u002F14] 🚀 新发布：[`LRsamples`](.\u002FLRsamples) — **预先收集的LightReasoner训练样本**，可直接用于微调。该数据集无需完整的采样流程即可进行模型训练，从而简化复现工作并加速下游研究流程。\n- [x] [2025\u002F10\u002F14] 🚀 新发布：**LightReasoner增强模型**现已在🤗 [Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fbearthecoder\u002Flightreasoner-models-68edbf175755ca5a8c699f9c) 上线。这些即用型模型采用我们高效的推理增强方法进行微调，可立即部署和实验。\n- [x] [2025\u002F10\u002F12] 🚀 
新发布：核心实现已支持Qwen2.5-Math和DeepSeek-R1模型。\n\n\n---\n\n\n## ⚡ TL;DR\n\n**✨ LightReasoner ✨**颠覆了传统的AI训练模式——小型语言模型（SLMs）不仅能从大型语言模型（LLMs）那里“学习”，还能反过来“教导”LLMs，使其学得更好、更快！\n\n\n**🔥 挑战：**\n\n监督微调（SFT）面临三大瓶颈：\n\n- **📊 数据密集型：** 依赖人工标注或拒绝采样的数据集。\n\n- **⚖️ 均匀学习：** 对所有令牌一视同仁地进行训练，尽管真正重要的只是其中一小部分。\n\n- **🔗 对真实标签的依赖：** 这种方式限制了模型对新领域和新推理形式的适应能力。  \n\n\n**🔍 核心洞见：**\n\n我们将90%的计算资源投入到模型已经掌握的内容上，而对那关键的10%——真正能带来突破的部分——则投入不足。\n\n\n## 📈 LightReasoner：*更好且更快*\n\n**在7个基准测试和5种模型上进行了验证**\n\n🚀 **性能提升**\n\nLightReasoner在多个数据集上持续提升了推理准确率：\n\n- 📈 **Qwen2.5-Math-1.5B：** GSM8K提升28.1%，MATH提升25.1%，SVAMP提升7.2%，ASDIV提升11.7%\n\n- 📈 **DeepSeek-R1-Distill-Qwen-1.5B：** GSM8K提升4.3%，MATH提升6.0%，OlympiadBench提升17.4%\n\n- 📈 **Qwen2.5-Math-7B：** GSM8K提升10.4%，MATH提升6.0%，SVAMP提升9.3%，ASDIV提升7.9%\n\n- 📈 **Qwen2.5-Math-1.5B-Instruct：** GSM8K提升1.9%，Minerva Math提升2.6%\n\n- 🌍 **强大的泛化能力：** 仅在GSM8K上训练，却能在**7个基准测试**上取得改进。\n\n\n⚡ **效率突破**\n\n以`Qwen2.5-Math-1.5B`为例，与SFT相比，LightReasoner实现了显著的效率提升：\n\n- ⏱️ **总耗时减少90%：** 4小时→0.5小时\n\n- 🧾 **采样问题数量减少80%：** 3,952→1,000个问题\n\n- 🔢 **调优令牌数减少99%：** 177万→2万个令牌\n\n\n🌟 **主要特性**\n\n- 🎯 **SLM–LLM教学：**\n\n出人意料地利用较小的“业余”模型来识别那些更强的“专家”模型应当重点学习的**关键推理时刻**。\n\n- ⚡ **极致的令牌效率：**\n\n通过有选择地优化**高影响力推理步骤**，而不是对整个推理过程进行均匀训练，LightReasoner实现了比SFT少99%的调优令牌数。\n\n- 🔄 **三阶段轻量级框架：**\n\n(1) 通过专家-业余KLD检测进行**关键步骤选择**\n\n(2) 通过捕捉专家与业余行为差异的**对比监督**\n\n(3) 通过**自我蒸馏**内化专家的优势\n\n- 📈 **KL引导的学习：**\n\n利用专家与业余预测之间的**行为分歧**来**精准定位推理瓶颈**——这一切都不需要真实标签。\n\n- 🧠 **专业知识胜过模型规模：**\n\n这表明，推动有效对比学习的是**领域知识差距**，而非模型大小——即使是具有不同知识的同规模模型也能产生**强大的教学信号**。\n\n\n---\n\n## 🧩 LightReasoner 框架\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_b0134efc0515.png\" width=\"800\" \u002F>\n  \u003Cbr>\n  \u003Cem>\n    \u003Cstrong>图2：LightReasoner 框架概览。\u003C\u002Fstrong> (1) 采样阶段：专家模型和业余模型分别生成分布 π\u003Csub>E\u003C\u002Fsub> 和 π\u003Csub>A\u003C\u002Fsub>。通过保留 D\u003Csub>KL\u003C\u002Fsub>(π\u003Csub>E\u003C\u002Fsub> ∥ π\u003Csub>A\u003C\u002Fsub>) > β 
的步骤，进行信息性步长选择；同时，对比监督机制会构建软标签 v\u003Csub>C\u003C\u002Fsub>,以捕捉专家模型的优势。 (2) 微调阶段：通过最小化专家模型输出与 v\u003Csub>C\u003C\u002Fsub> 之间的 KL 散度，进一步提升专家模型性能。\n  \u003C\u002Fem>\n\u003C\u002Fp>\n\n\n---\n\n\n## 🚀 快速入门\n\n*LightReasoner* 使用起来极其 *简单*。我们精心设计了它，使其易于上手——任何人都可以亲自尝试，并亲身体验其 *“反直觉的有效性”*。\n\n毫不费力——只需按照下面几个 🪄 简单步骤，你就能将其设置好，并用自己选择的模型运行起来！\n\n\n### 📦 准备工作\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner.git\ncd LightReasoner\n```\n\n1️⃣ 安装所有依赖项：\n\n```bash\npip install -r requirements.txt\n```\n\n2️⃣ 下载你选择的专家模型和业余模型。例如：\n\n🦉 专家模型\n```bash\nhuggingface-cli download Qwen\u002FQwen2.5-Math-1.5B --local-dir .\u002FQwen2.5-Math-1.5B\n```\n\n🐣 业余模型\n```bash\nhuggingface-cli download Qwen\u002FQwen2.5-0.5B --local-dir .\u002FQwen2.5-0.5B\n```\n\n\n3️⃣ 准备训练数据：\n\n```bash\npython data_prep.py\n```\n\n\n#### ⚠️ 注意事项\n\nLightReasoner 依赖于 **专家-业余模型配对** 来生成监督信号。因此，这对模型的选择对方法的成功至关重要。  \n\n⚖️ **经验法则**：\n\n专家模型应 **显著优于** 业余模型，而业余模型则必须具备 **足够的能力**，能够产生连贯的推理过程。实际上，性能的最佳点往往在于一个平衡的 *“最佳区间”*，而非单纯扩大能力差距。   \n\n在我们的实验中，专家模型包括 *Qwen2.5-Math-1.5B*、*7B*，以及它们的 *Instruct* 对应版本，还有 *DeepSeek-R1-Distill* 系列变体。业余模型则固定为 *Qwen2.5-0.5B*，它既能提供强烈的对比，又具备足够的推理能力，从而产生有意义的监督信号。  \n\n我们 *鼓励* 你尝试其他模型家族（如 *Llama*），但在搭建专家-业余合作时，请务必牢记这一 **平衡原则**。\n\n\n#### 📋 注意\n\n- 我们默认使用 GSM8K 数据集，因为它强调逐步推进、广泛适用的逻辑推理，而非特定领域的符号表示。这确保了即使业余模型没有接受过数学专项训练，仍能产生可解释的输出，适合用于对比监督。\n\n- 你完全可以尝试其他数据集——LightReasoner 全面兼容。然而，根据所选数据集的不同，你可能需要调整超参数以及业余模型的选择，以保证训练的稳定性和对比效果的显著性。\n\n  - 例如，如果你尝试使用 **MATH** 数据集——一组难度远高于 GSM8K 的高中竞赛题目——建议将业余模型从通用的 **Qwen2.5** 基础模型升级为专门针对数学的 **Qwen2.5-Math** 变体。基础模型并未经过数学预训练，可能难以在 MATH 数据上生成连贯的输出，从而削弱专家-业余模型之间的对比效果。\n\n  - 此处同样适用 *平衡原则*——业余模型应当 **足够弱于** 专家模型，以形成清晰的对比，但同时也需具备维持连贯推理的能力。\n\n\n---\n\n\n### 🎯 采样\n\n此步骤用于构建 **LightReasoner 监督数据集**，以便后续微调使用。我们会保留那些专家与业余模型之间 KL 散度较高的步骤。这些选定的步骤将被转化为监督样本，通过 *分布对比* 来编码专家模型的优势。更多详细信息请参阅我们的论文 [arXiv:2510.07962](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07962)。\n\n\n```bash\npython LightR_sampling.py --max_questions 1000\n```\n\n\n#### 📋 
注意\n\n在运行脚本之前，你需要：\n\n- 更新 **配置部分**，填入你自己的相对路径。 \n\n- 调整最大问题数量，以控制监督数据集的大小；调整采样参数，探索更优的组合；并根据可用的计算资源，优化批处理大小。\n\n  - 作为参考，在实践中，我们发现从 GSM8K 训练集中采样 **1,000 个问题**（过滤阈值设为 **β = 0.4**），大约能得到 **20,000 个 LightReasoner 对比样本**，而这已经足以让我们在测试过的基准模型上实现 **LoRA 微调的收敛**。\n\n\n#### ⚡ **快捷方式**\n\n为了省去运行采样流程的麻烦——尽管 LightReasoner 的采样流程已经轻量且简便，但对于计算资源有限的人来说，仍然可能有些吃力——我们现在提供了 *即用型* LightReasoner 样本，让你可以直接跳到 **微调阶段**！🚀  \n\n你可以在 [`LRsamples`](.\u002FLRsamples) 文件夹下的压缩包中找到以下预先收集的 **LightReasoner 采样数据集**：\n\n- **`LR_Qwen7_gsm8k`** — 适用于 **Qwen2.5-Math-7B**\n\n- **`LR_ds1.5_gsm8k`** — 适用于 **DeepSeek-R1-Distill-Qwen-1.5B**\n\n- **`LR_Qwen1.5_gsm8k`** — 适用于 **Qwen2.5-Math-1.5B** \n\n  - 我们提供了 **两个版本**，分别基于 **Torch 3.1** 和 **Torch 3.8** 采样，因为我们发现不同版本的 Torch 会导致采样结果（即模型生成的输出）略有差异。  \n\n  - 性能波动非常小——通常在 **2–3%** 以内，较新版本的 Torch 一般表现略好一些。\n\n这些数据集使得你能够 **更轻松地复现** 我们的实验结果——无需额外采样！✨\n\n\n---\n\n\n### ⚙️ 微调\n\n此步骤启动完整的 LightReasoner 微调流程——将 *数据加载*、*LoRA 配置* 和 *对比 KL 散度训练* 整合为一个统一的工作流。\n\n\n#### 💻 运行选项\n\n**前台运行（简单模式）：**\n```bash\npython LightR_finetuning.py\n```\n\n**后台运行（推荐用于长时间训练）：**\n```bash\nnohup python LightR_finetuning.py > finetune.log 2>&1 &\n```\n\n**监控进度：**\n```bash\ntail -f finetune.log\n```\n\n\n#### ⚠️ 注意事项\n\n*用于微调的专家模型必须与采样时使用的模型完全一致——这种一致性对于正确执行至关重要。*\n\n\n#### 📋 注意\n\n在运行脚本之前，请编辑 **配置部分** 以匹配你的设置：\n\n- 🔹 将 `\u003Cpath_to_expert_model>` 替换为你使用的基模型路径（例如 `\".\u002FQwen2.5-Math-7B\"` 或本地文件夹）。  \n\n- 🔹 将 `\u003Cpath_to_training_dataset>` 替换为你准备好的数据集 JSONL 文件。  \n\n- 🔹 将 `\u003Coutput_directory>` 替换为你希望保存检查点和最终模型的目录。  \n\n- 🔹 根据你的硬件设备设置 `torch_dtype`（例如，对于 **H100** 使用 `torch.bfloat16`，对于 **A100** 则使用 `torch.float16`）。\n\n\n---\n\n### 🔗 模型合并\n\n使用此步骤在本地**合并完整模型**（基础模型 + LoRA），使其作为**独立模型**运行，无需任何 LoRA 依赖。\n\n```bash\npython merge.py\n```\n\n#### 📋 注意\n在运行合并脚本之前，请根据您的实际路径更新**配置部分**：\n\n- 🔹 `base_model_path` 设置为您的基础模型目录 *(例如：`.\u002FQwen2.5-Math-7B`)* \n\n- 🔹 `lora_ckpt_path` 设置为您的 LoRA 检查点目录 *(例如：`.\u002Fft_qw7_gsm8k\u002Fcheckpoint-1000`)*  \n\n- 🔹 `merged_model_path` 设置为您希望保存合并后模型的路径 
*(例如：`.\u002Fft-7B-merged`)*\n\n\n---\n\n\n### 📈 评估\n\n所有评估均使用**官方 Qwen2.5-Math 工具包**进行。  \n\n请参阅 [`evaluation`](.\u002Fevaluation) 文件夹，获取详细的使用和设置说明。\n\n\n---\n\n\n## 📊 主要结果\n\n| 模型                                         | GSM8K | MATH | SVAMP | ASDiv | Minerva Math | Olympiad Bench | MMLU STEM | AVG. |\n|-----------------------------------------------|-------|------|-------|-------|-------------------|---------------|----------------|------|\n| **\u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>**            |       |      |       |       |                   |               |                |      |\n| 基线                                      | 42.5  | 34.2 | 68.8  | 68.1  | 9.9               | 23.7          | 49.8           | 42.4 |\n| + SFT                                         | 69.2  | 57.1 | 64.1  | 70.2  | **15.1**          | **27.6**      | 47.7           | 50.1 |\n| + LightR                                      | **70.6** | **59.3** | **76.0** | **79.8** | 11.4 | 27.1 | **54.9** | **54.2** |\n| **\u003Cnobr>Qwen2.5-Math-1.5B-Instruct\u003C\u002Fnobr>**   |       |      |       |       |                   |               |                |      |\n| 基线                                      | 84.8  | 75.8 | 94.2  | 94.7  | 29.4              | 37.5          | 57.4           | 67.7 |\n| + SFT                                         | 85.4  | 75.8 | 93.5  | 94.7  | 31.6              | 37.5          | 56.2           | 67.8 |\n| + LightR                                      | **86.7** | 75.5 | 93.0 | 94.1 | **32.0** | **37.8** | 55.2 | **67.8** |\n| **\u003Cnobr>DeepSeek-R1-Distill-Qwen-1.5B\u003C\u002Fnobr>**|       |      |       |       |                   |               |                |      |\n| 基线                                      | 75.2  | 54.2 | 79.9  | 84.9  | 16.2              | 19.1          | 22.3           | 50.3 |\n| + SFT                                         | 78.2  | **60.3** | 81.5 | 87.4 | **18.4** | 21.2 | 26.2 | 53.3 |\n| + LightR        
                              | **79.5** | 60.2 | **83.5** | **87.5** | 18.0 | **36.5** | **26.2** | **55.9** |\n| **\u003Cnobr>Qwen2.5-Math-7B\u003C\u002Fnobr>**              |       |      |       |       |                   |               |                |      |\n| 基线                                      | 57.5  | 51.8 | 67.9  | 72.7  | 14.0              | 16.0          | 69.8           | 50.0 |\n| + SFT                                         | 64.4  | **63.3** | 76.2 | 76.6 | 12.1 | **20.5** | 68.5 | 54.5 |\n| + LightR                                      | **67.9** | 57.8 | **77.2** | **80.6** | 12.1 | 16.9 | **70.5** | **54.7** |\n| **\u003Cnobr>Qwen2.5-Math-7B-Instruct\u003C\u002Fnobr>**     |       |      |       |       |                   |               |                |      |\n| 基线                                      | 95.2  | 83.2 | 93.9  | 95.3  | 33.8              | 41.5          | 69.3           | 73.2 |\n| + SFT                                         | 95.4  | 83.1 | **94.1** | 95.2 | **38.2** | 40.7 | 68.2 | **73.6** |\n| + LightR                                      | **95.8** | **83.6** | 93.1 | 95.2 | 34.2 | 39.0 | 67.8 | 72.7 |\n\n\n- LightReasoner 仅基于 GSM8K 数据训练，却能有效泛化到 5 种基础模型，在 7 个基准测试中均取得一致提升。\n\n- 对于 Qwen2.5-Math-1.5B，GSM8K 提升 **28.1%**，MATH 提升 **25.1%**，SVAMP 提升 **7.2%**，ASDIV 提升 **11.7%**。\n\n- 对于 DeepSeek-R1-Distill-Qwen-1.5B，GSM8K 提升 **4.3%**，MATH 提升 **6.0%**，OlympiadBench 提升 **17.4%**。\n\n- 对于 Qwen2.5-Math-7B，GSM8K 提升 **10.4%**，MATH 提升 **6.0%**，SVAMP 提升 **9.3%**，ASDIV 提升 **7.9%**。\n\n- 效率对比 SFT：总耗时减少 **90%**，采样问题数量减少 **80%**，调优的 token 数量减少 **99%**。  \n\n\n---\n\n## ⏱️ 效率研究\n\n| **方法** | **总时间** | **采样问题** | **调整的标记数** | **平均增益** |\n|------------|----------|------------|------------|----------|\n| **Qwen2.5-Math-1.5B** |||||\n| + SFT      | 4.0小时     | 3952       | 177万      | +7.7%   |\n| **+ LightReasoner** | **0.5小时** | **1000**  | **0.02百万**  | **+11.8%** |\n| **Qwen2.5-Math-7B** |||||\n| + SFT      | 9.5小时     | 6029      
 | 220万      | +4.5%   |\n| **+ LightReasoner** | **0.75小时** | **1000** | **0.02百万**  | **+4.7%** |\n| **DeepSeek-R1-Distill-Qwen-1.5B** |||||\n| + SFT     | 3.6小时     | 6023       | 595万      | +3.0%   |\n| **+ LightReasoner** | **0.5小时** | **1000**  | **0.02百万**  | **+5.6%** |\n| **Qwen2.5-Math-1.5B-Instruct** |||||\n| + SFT     | 3.4小时     | 7153       | 208万      | +0.1%   |\n| **+ LightReasoner** | **0.4小时** | **1000**  | **0.02百万**  | +0.1%   |\n\n\n- 🧑‍🏫 **监督微调 (SFT):** \n\n  - 采用拒绝采样实现，模型在正确推理轨迹的示范上进行微调。  \n  \n  - 为公平比较，SFT 采用了与 LightReasoner 相同的实验配置，仅在 GSM8K 训练集上进行基于 LoRA 的微调。\n\n  - 🎯 **关键区别:**  \n  \n    - *LightReasoner* 基于选择性的下一个标记预测进行训练，而 *SFT* 则优化整个推理轨迹——这是由各自训练范式决定的*固有*差异。  \n\n    - 因此，每个 *LightReasoner* 训练实例对应于**单个下一个标记预测**，而每个 *SFT* 示例则对应于包含一系列连续下一个标记预测的**完整推理轨迹**。\n\n\n- 📈 **效率评估:** \n \n  - ⏱️ **时间预算** — 采样时间加上微调时间，在*单个 NVIDIA H200 GPU* 上测量，未使用推理加速器（如 vLLM）。  \n  \n  - 📘 **训练实例** — 用于生成监督数据集的不同 GSM8K 训练集问题数量。  \n  \n  - 🔢 **调整的标记数** — 以标记为单位计算的计算开销。\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_c0a17e948bba.png\" width=\"200\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_fdacfbb79ac9.png\" width=\"200\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_883a8470013e.png\" width=\"200\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_aa570fdb2b2f.png\" width=\"196\" \u002F>\n  \u003Cbr>\n  \u003Cem>\u003Cstrong>图3: LightReasoner 以显著的资源效率匹配或超越 SFT 性能\u003C\u002Fstrong> — 在达到竞争性准确率的同时，将训练时间缩短90%，采样问题减少80%，所需调整的标记数减少99%。\u003C\u002Fem>\n\n\u003C\u002Fp>\n\n\n💡 **关键洞察:** \n\n*这标志着模型训练方式的根本转变——**针对关键推理步骤**的效果优于蛮力学习，使得即使在有限的计算资源下，高质量的人工智能训练也成为可能。*\n\n\n---\n\n\n## 🧠 专家驱动的对比\n\n| **业余模型** | **性能差距** | **GSM8K** | **MATH** | **SVAMP** | **ASDiv** | **MMLU STEM** | **平均** 
|\n|-------------------|-------------|-----------|----------|-----------|-----------|---------------|----------|\n| **专家: \u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>** |||||||||\n| **\u003Cnobr>Qwen2.5-0.5B\u003C\u002Fnobr>**             | **38.2**  | **70.6** | **59.3** | **76.0** | **79.8** | **54.9** | **68.1** |\n| \u003Cnobr>Qwen2.5-1.5B\u003C\u002Fnobr>                 | 35.1  | 63.4 | 57.1 | 69.7 | 75.7 | 54.8 | 64.1 |\n| \u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>            | \u002F  | \u002F | \u002F | \u002F | \u002F | \u002F | \u002F |\n| \u003Cnobr>Qwen2.5-Math-1.5B-Ins\u003C\u002Fnobr>        | -42.3 | 41.4 | 35.5 | 67.5 | 66.4 | 55.0 | 53.2 |\n| *仅专家（基准）*                  | \u002F     | 42.5 | 34.2 | 68.8 | 68.1 | 49.8 | 52.7 |\n| **专家: \u003Cnobr>Qwen2.5-Math-7B\u003C\u002Fnobr>** |||||||||\n| **\u003Cnobr>Qwen2.5-0.5B\u003C\u002Fnobr>**             | **53.2**  | **67.9** | **57.8** | **77.2** | **80.6** | **70.5** | **70.8** |\n| \u003Cnobr>Qwen2.5-1.5B\u003C\u002Fnobr>                 | 50.1  | 69.0 | 56.0 | 77.6 | 78.9 | 69.5 | 70.2 |\n| \u003Cnobr>Qwen2.5-Math-1.5B\u003C\u002Fnobr>            | 15.0  | 56.9 | 50.2 | 63.5 | 63.4 | 70.7 | 60.9 |\n| \u003Cnobr>Qwen2.5-Math-1.5B-Ins\u003C\u002Fnobr>        | -27.3 | 59.4 | 49.0 | 68.3 | 69.6 | 70.3 | 63.3 |\n| *仅专家（基准）*                  | \u002F     | 57.5 | 51.8 | 67.9 | 72.7 | 69.8 | 63.9 |\n\n\n- **领域专业知识胜过规模:** *专家–业余合作的成功最有效地由领域特定知识驱动，而非模型大小（例如 Qwen2.5-Math-1.5B 与 Qwen2.5-1.5B），从而使 LightReasoner 摆脱了严格的规模限制。*\n\n- **对专业知识差距的依赖:** *性能提升与专业知识差距的大小密切相关——随着业余模型接近专家的能力，对比信号减弱，改进也随之减少。*\n\n\n---\n\n## 🔍 更多见解\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_b10ba1d2cefe.png\" alt=\"采样阶段\" width=\"55.5%\"\u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_090b9288f289.png\" alt=\"微调阶段\" width=\"34.5%\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \n  
\u003Cem>👈 图4(a): 专家–业余配对效果 — 每个点代表一个固定的专家模型与一个业余模型配对。随着专业知识差距的缩小，LightReasoner 所取得的性能增益逐渐减弱。\u003C\u002Fem>\u003Cbr>\n\n  \u003Cem>👉 图4(b): 消融实验的影响 — 从 LightReasoner 中移除关键组件会持续降低性能，揭示了它们的关键贡献。\u003C\u002Fem>\n\n\u003C\u002Fp>\n\n\n---\n\n\n## 🏆 与竞争方法的比较\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd>\n\n\u003C!-- 左表 -->\n  \n| **属性**        | **时间** | **SFT** | **LightR** |\n|-----------------------|----------------|---------|------------|\n| 完整轨迹     | ⬆️          | ✅      | ❌         |\n| 全标记调优      | ⬆️          | ✅      | ❌         |\n| 前缀终止    | ⬇️          | ❌      | ✅         |\n| 选择性标记      | ⬇️          | ❌      | ✅         |\n| 无需验证     | ⬇️          | ❌      | ✅         |\n\n\u003C\u002Ftd>\n\u003Ctd>\n\n\u003C!-- 右表 -->\n\n| **属性**         | **效用** | **CD**      | **LightR** |\n|------------------------|------------------|-------------|------------|\n| 对比使用         | \u002F                | 推理   | 训练   |\n| 基于规模的对比    | ⬇️            | ✅          | ❌         |\n| 专业知识对比     | ⬆️            | ❌          | ✅         |\n| 持续性收益    | ⬆️            | ❌          | ✅         |\n| 独立推理  | ⬆️            | ❌          | ✅         |\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n- 👈 *左:* 一目了然的效率对比。⬆️ 和 ⬇️ 表示每个方面是有助于还是不利于该方法的整体效率。 \n  \n- 👉 *右:* 传统对比解码 (CD) 方法与 LightReasoner 的关键区别。⬆️ 和 ⬇️ 表示每个方面是有助于还是不利于该方法的实用性。\n\n\n---\n\n## ☕️ 引用\n\n如果您觉得这项工作有用，请考虑引用我们的论文：\n\n```python\n@article{wang2025lightreasoner,\n  title={LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?},\n  author={Wang, Jingyuan and Chen, Yankai and Li, Zhonghang and Huang, Chao},\n  journal={arXiv preprint arXiv:2510.07962},\n  year={2025}\n}\n```\n\n感谢您对我们工作的关注！\n\n\n---\n\n\n## 📜 许可证\n\n本项目采用 [MIT 许可证](.\u002FLICENSE) 发布。\n\n\n\u003Cbr>\n\n\n\u003Cp align=\"center\">\n  \u003Cem> ❤️ 感谢您的访问 ✨ LightReasoner ✨\u003C\u002Fem>\u003Cbr>\u003Cbr>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_readme_0f11e0d65e86.png\" alt=\"浏览量\">\n\u003C\u002Fp>","# LightReasoner 
快速上手指南\n\nLightReasoner 是一个创新的推理增强框架，核心理念是**利用小型语言模型（SLM）来教导大型语言模型（LLM）**。它通过识别专家模型与业余模型之间的关键推理差异，仅对高影响力的推理步骤进行微调，从而实现比传统监督微调（SFT）更高的准确率和极致的令牌效率（减少 99% 的微调令牌量）。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐) 或 macOS\n*   **Python 版本**: 3.10 或更高\n*   **硬件要求**: 支持 CUDA 的 NVIDIA GPU（显存需求取决于所选的 Expert\u002FAmateur 模型大小，建议至少 16GB 以运行 1.5B-7B 模型组合）\n*   **前置依赖**:\n    *   `git`\n    *   `pip`\n    *   `huggingface-cli` (用于下载模型)\n\n> **💡 国内加速建议**\n> 鉴于网络环境，强烈建议使用国内镜像源加速依赖安装和模型下载：\n> *   **pip 源**: 使用清华源或阿里源 (`-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`)\n> *   **Hugging Face**: 设置环境变量 `HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com` 或使用 `huggingface-cli` 的镜像配置。\n\n## 安装步骤\n\n### 1. 克隆项目代码\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner.git\ncd LightReasoner\n```\n\n### 2. 安装依赖\n推荐使用国内 pip 源以加快安装速度：\n```bash\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 下载模型\nLightReasoner 需要成对的 **Expert (专家)** 和 **Amateur (业余)** 模型。\n*   **Expert**: 性能较强，负责提供高质量推理分布。\n*   **Amateur**: 性能较弱但具备基本推理能力，用于对比发现关键步骤。\n\n以下以官方推荐的 `Qwen2.5-Math-1.5B` (Expert) 和 `Qwen2.5-0.5B` (Amateur) 为例：\n\n```bash\n# 设置 HF 镜像加速 (可选，国内用户推荐)\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n\n# 下载 Expert 模型\nhuggingface-cli download Qwen\u002FQwen2.5-Math-1.5B --local-dir .\u002FQwen2.5-Math-1.5B\n\n# 下载 Amateur 模型\nhuggingface-cli download Qwen\u002FQwen2.5-0.5B --local-dir .\u002FQwen2.5-0.5B\n```\n\n### 4. 准备数据\n运行数据预处理脚本（默认使用 GSM8K 数据集）：\n```bash\npython data_prep.py\n```\n\n> **⚠️ 模型搭配原则**\n> *   Expert 必须显著强于 Amateur。\n> *   Amateur 不能太弱，必须能生成连贯的推理过程（否则无法形成有效的对比信号）。\n> *   若使用高难度数据集（如 MATH），建议将 Amateur 从通用的 `Qwen2.5` 基础模型升级为数学专用的 `Qwen2.5-Math` 变体，同时仍需保证其弱于 Expert。\n\n## 基本使用\n\nLightReasoner 的核心流程分为两步：**采样构建监督数据** 和 **微调**。\n\n### 方案 A：完整流程（自行采样）\n\n如果您希望从头构建训练数据，请执行以下步骤：\n\n**1. 
采样阶段 (Sampling)**\n此步骤通过计算 Expert 和 Amateur 的 KL 散度，筛选出关键推理步骤，构建轻量级监督数据集。\n```bash\npython LightR_sampling.py --max_questions 1000\n```\n*   `--max_questions`: 控制采样的问题数量。官方实验表明，从 GSM8K 中采样 **1,000** 个问题即可生成约 20,000 个对比样本，足以让 LoRA 收敛。\n*   *注意*: 运行前请在脚本中配置好模型本地路径。\n\n**2. 微调阶段 (Fine-tuning)**\n使用生成的对比样本对 Expert 模型进行微调。运行前请在 `LightR_finetuning.py` 的配置部分填入专家模型路径、训练数据集（JSONL）路径和输出目录，然后执行 `python LightR_finetuning.py`。微调所用的 Expert 模型必须与采样时使用的模型完全一致。\n\n---\n\n### 方案 B：快捷流程（使用预采集数据）⭐ 推荐\n\n为了节省算力和时间，项目提供了**预采集好的训练样本**，您可以直接跳过采样阶段，立即开始微调。\n\n**1. 获取预采集数据**\n访问项目目录下的 [`LRsamples`](.\u002FLRsamples) 文件夹（或在 HuggingFace Collection 中下载），选择对应您模型的压缩包：\n*   `LR_Qwen7_gsm8k.zip`: 适用于 Qwen2.5-Math-7B\n*   `LR_ds1.5_gsm8k.zip`: 适用于 DeepSeek-R1-Distill-Qwen-1.5B\n*   `LR_Qwen1.5_gsm8k.zip`: 适用于 Qwen2.5-Math-1.5B\n\n解压后即可获得 ready-to-use 的训练数据。\n\n**2. 开始微调**\n直接使用解压后的数据启动微调任务。这种方式将总训练时间从数小时缩短至几十分钟，并减少了 99% 的 Token 处理量。\n\n```bash\n# 示例：使用预采集数据启动训练（数据集路径与专家模型路径需先在 LightR_finetuning.py 的配置部分中设置）\npython LightR_finetuning.py\n```\n\n通过以上步骤，您即可体验 LightReasoner“以小教大”的高效推理增强效果。","某教育科技公司的算法团队正致力于将大型语言模型（LLM）集成到其自适应数学辅导系统中，以生成高质量的解题步骤供学生参考。\n\n### 没有 LightReasoner 时\n- **训练成本高昂**：为了提升大模型的逻辑推理能力，团队不得不进行全量监督微调（SFT），消耗了巨大的 GPU 算力和时间资源。\n- **数据冗余严重**：传统的训练方式需要处理海量的解题样本，其中包含大量对推理能力提升无效的冗余 token，导致数据处理效率极低。\n- **响应延迟过高**：由于模型未经过针对性的“精简”优化，生成的推理过程往往冗长啰嗦，增加了用户端的等待时间和计算开销。\n- **小模型价值被忽视**：团队仅依赖大模型自身迭代，未能利用轻量级小模型（SLM）中蕴含的高效推理模式来指导大模型。\n\n### 使用 LightReasoner 后\n- **算力开销骤降**：LightReasoner 通过策略性的 Token 选择机制，在总耗时上减少了 90%，调优 Token 数量更是降低了 99%，大幅节省了训练预算。\n- **数据精准高效**：不再盲目堆砌数据，而是让小模型“教”大模型筛选出最具价值的推理路径，仅需 20% 的采样问题即可达到更优效果。\n- **推理敏捷精准**：经过增强的大模型在零样本（zero-shot）测试中准确率显著提升，且生成的解题步骤更加简练，显著降低了端到端延迟。\n- **大小模型协同**：成功验证了小模型在大模型推理进化中的导师角色，以“更聪明”的策略替代了“更暴力”的训练方式。\n\nLightReasoner 证明了通过小模型引导的策略性训练，能以极低的成本解锁大模型的深层推理潜能，实现了效率与性能的双重飞跃。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FHKUDS_LightReasoner_ced65dcb.png","HKUDS","✨Data Intelligence 
Lab@HKU✨","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FHKUDS_fc32cc87.jpg",null,"https:\u002F\u002Fsites.google.com\u002Fview\u002Fchaoh","https:\u002F\u002Fgithub.com\u002FHKUDS",[80],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,598,33,"2026-04-14T04:04:38","MIT","未说明","必需（用于运行 Expert\u002FAmateur 模型采样及微调），具体型号与显存大小取决于所选模型（如 Qwen2.5-Math-7B 或 1.5B），需支持 CUDA",{"notes":91,"python":92,"dependencies":93},"1. 核心机制依赖'专家 - 业余'模型配对（Expert-Amateur Pairing），需自行下载两个不同能力的模型（例如专家用 Qwen2.5-Math-1.5B，业余用 Qwen2.5-0.5B）。2. 若使用高难度数据集（如 MATH），建议将业余模型升级为同系列的数学专用版本以保证推理连贯性。3. 项目提供了预采集的训练样本（LRsamples），可跳过耗时的采样步骤直接进行微调。4. 默认使用 GSM8K 数据集进行逻辑推理训练。","3.10+",[94],"requirements.txt 中定义的依赖（具体列表未在 README 中展示，通常包含 torch, transformers, accelerate, huggingface_hub 等）",[35,14],[97,98,99,100],"large-language-models","post-training","reasoning-models","token-efficiency","2026-03-27T02:49:30.150509","2026-04-18T02:20:42.657636",[104,109,114,119,124,129],{"id":105,"question_zh":106,"answer_zh":107,"source_url":108},38557,"项目的中文 README 文档是否完整？","根据社区反馈，目前的中文 README 文档尚不完全。用户已提出请求，希望维护者能够补全中文版本的说明文档，以便中文用户更好地理解和使用该项目。","https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fissues\u002F7",{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},38552,"LightReasoner 方法中计算 KL 散度时，Expert 和 Amateur 模型的词表（Tokenizer\u002FVocabulary）必须完全一致吗？如果不一致该如何处理？","是的，两个模型的 Tokenizer 和 Vocabulary 必须完全一致。这是该方法的核心先决条件：\n1. 模型来源：Expert（大模型）和 Amateur（小模型）必须来自同一个模型家族（Model Family），例如 Qwen2.5-Math-1.5B 与 Qwen2.5-1.5B。同一家族的模型自然拥有相同的词表。\n2. 技术原因：方法依赖于对相同前缀（prefix）下，两个模型生成的下一个 token 预测分布进行 KL 散度计算。如果词表不一致，维度不同，KL 散度将无法计算或失去意义。\n3. 验证方法：仓库中提供了脚本用于快速验证词表对齐情况：\n   - analysis\u002Ftestspace\u002Fearly_test.py\n   - analysis\u002Ftestspace\u002Fcheck_vocab_alignment.py\n4. 
方法论原因：利用同一家族模型可以排除许多混淆因素，专注于凸显 Expert 因额外领域预训练（如数学）而比 Amateur 强的部分。","https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fissues\u002F9",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},38553,"为什么 LightReasoner 的 token 使用量与传统 SFT（监督微调）相比有如此大的差异？","Token 使用量的差异源于 LightReasoner 独特的训练机制，主要体现在以下三点：\n1. 前缀终止（Prefix Termination）：LightReasoner 每条轨迹仅采样 128 个 token，而传统 SFT 通常处理完整的长轨迹（如 512 tokens）。\n2. Beta 过滤（Beta-filtering）：方法只保留那些 Expert 与 Amateur 之间 KL 散度较高的 token，这意味着大约只有 20% 的 token 被保留用于训练。\n3. 收敛速度：监控发现，LightReasoner 配合 LoRA 通常在 1000 步内即可收敛（每步对应 16 个样本），而 SFT 需要更多样本才能持续缓慢提升。综合计算下来，LightReasoner 的数据效率远高于全量 SFT。","https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fissues\u002F3",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},38554,"如何在 Hugging Face Hub 上找到或链接 LightReasoner 的模型？","LightReasoner 增强后的模型检查点已上传至 Hugging Face Hub。用户可以通过以下方式访问和链接：\n1. 查看仓库顶部：项目 GitHub 仓库的顶部已附带了模型链接。\n2. 模型卡片（Model Cards）：在 HF 上的模型页面添加了详细的模型卡片，描述了模型功能并链接回了论文页面。\n3. 元数据标签：模型卡片中包含了 `pipeline_tag` 和 `library_name` 等元数据标签，以便于搜索和发现。\n4. 直接下载：用户可以使用 `hf_hub_download` 工具直接下载模型文件，或通过代码使用 `from_pretrained` 加载（如果使用了 PyTorchModelHubMixin）。","https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fissues\u002F1",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},38555,"使用 $\\log\\frac{\\pi_{expert}}{\\pi_{amateur}}$ 作为监督信号是否会导致 Reward Hacking（奖励黑客），即 Expert 输出奇怪抽象的文字以压低 Amateur 的概率？","这是一个非常有洞察力的问题，触及了方法的本质。维护者认为：\n1. 该问题与最近的“在线策略自蒸馏”（on-policy self-distillation）工作方向一致，表明 LightReasoner 可能在无意中契合了新的研究风向。\n2. 关于具体的 Reward Hacking 风险，由于依赖同一家族模型（Shared Tokenizer & Architecture），这种差异主要反映的是领域知识（如数学推理能力）的差距，而非单纯的生成怪异文本。\n3. 建议对此感兴趣的研究者开启新的讨论帖深入探讨，因为这涉及到如何更稳健地定义和利用这种分布差异作为强化信号。","https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fissues\u002F11",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},38556,"目前的代码实现中，固定不变的 Expert\u002FAmateur 对比是否存在局限性？是否有动态调整的建议？","社区用户提出了以下审视与改进意见，指出静态对比可能不够灵活：\n1. 
问题分析：目前代码中 Expert 和 Amateur 是固定的。但随着训练进行，学生模型（Amateur）能力会增长。若一直用很弱的 Amateur 做基准，后期可能导致模型在低级错误上停滞不前。\n2. 改进思路 - 动态 Amateur（Dynamic Amateur）：建议引入“课程学习”（Curriculum Learning），随着训练动态调整 Amateur 的水平，使其逐渐逼近 Expert；或者直接使用模型自身的旧版本作为 Amateur（Self-Correction）。\n3. 改进思路 - 真正的 Logits Align：针对词表对齐，建议实现基于投影矩阵或语义软对齐的 Logits Align 层，替代代码中简单的 min_vocab_size 截断处理，以支持不同 Tokenizer 的模型对比。\n4. 改进思路 - Packed Training：为解决上下文长度浪费问题，建议将多个 (Prefix, Target) 样本拼接，或在长序列中一次性计算多个关键位置的 Loss，以提升训练吞吐量。","https:\u002F\u002Fgithub.com\u002FHKUDS\u002FLightReasoner\u002Fissues\u002F10",[]]