[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-RUCAIBox--R1-Searcher":3,"tool-RUCAIBox--R1-Searcher":62},[4,18,26,36,46,54],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":42,"last_commit_at":43,"category_tags":44,"status":17},8272,"opencode","anomalyco\u002Fopencode","OpenCode 是一款开源的 AI 编程助手（Coding Agent），旨在像一位智能搭档一样融入您的开发流程。它不仅仅是一个代码补全插件，而是一个能够理解项目上下文、自主规划任务并执行复杂编码操作的智能体。无论是生成全新功能、重构现有代码，还是排查难以定位的 Bug，OpenCode 都能通过自然语言交互高效完成，显著减少开发者在重复性劳动和上下文切换上的时间消耗。\n\n这款工具专为软件开发者、工程师及技术研究人员设计，特别适合希望利用大模型能力来提升编码效率、加速原型开发或处理遗留代码维护的专业人群。其核心亮点在于完全开源的架构，这意味着用户可以审查代码逻辑、自定义行为策略，甚至私有化部署以保障数据安全，彻底打破了传统闭源 AI 助手的“黑盒”限制。\n\n在技术体验上，OpenCode 提供了灵活的终端界面（Terminal UI）和正在测试中的桌面应用程序，支持 macOS、Windows 及 Linux 全平台。它兼容多种包管理工具，安装便捷，并能无缝集成到现有的开发环境中。无论您是追求极致控制权的资深极客，还是渴望提升产出的独立开发者，OpenCode 都提供了一个透明、可信",144296,1,"2026-04-16T14:50:03",[13,45],"插件",{"id":47,"name":48,"github_repo":49,"description_zh":50,"stars":51,"difficulty_score":32,"last_commit_at":52,"category_tags":53,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 
都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":55,"name":56,"github_repo":57,"description_zh":58,"stars":59,"difficulty_score":32,"last_commit_at":60,"category_tags":61,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[45,13,15,14],{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":96,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":111,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":147},8890,"RUCAIBox\u002FR1-Searcher","R1-Searcher","R1-searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning","R1-Searcher 是一个旨在提升大语言模型（LLM）搜索能力的开源项目。它核心解决了传统模型在应对复杂推理任务时，因缺乏实时外部信息支持而导致知识滞后或产生“幻觉”的痛点。通过引入强化学习技术，R1-Searcher 
能够激励模型在推理过程中主动、智能地调用网络搜索工具，从而获取最新、最准确的信息来辅助决策。\n\n该项目特别适合人工智能研究人员、大模型开发者以及对检索增强生成（RAG）技术感兴趣的技术团队使用。其独特的技术亮点在于采用了“两阶段结果监督强化学习”策略：第一阶段引导模型学会“何时”以及“如何”发起搜索请求，第二阶段则专注于优化模型如何利用搜索结果进行高效推理。这种分步训练机制无需冷启动指令微调，降低了训练门槛，并显著提升了模型在 HotpotQA、2WikiMultiHopQA、Musique、Bamboogle 等多跳问答任务中的表现。作为由中国人民大学团队推出的成果，R1-Searcher 为构建具备自主探索能力的下一代智能体提供了简洁而高效的解决方案，相关代码与模型权重均已开源，便于社区复现与二次开发。","\n\u003Ch1 align=\"center\"> R1-searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning\u003C\u002Fa>\u003C\u002Fh1>\n\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRLRAG\u002Fedit\u002Fmain\u002F\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode_License-MIT-blue\" alt=\"license\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRLRAG\u002Fedit\u002Fmain\u002F\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel_License-MIT-blue\" alt=\"license\">\u003C\u002Fa>\n\u003Ca href=\"[https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fyulan-team\u002Fyulan-mini-676d214b24376739b00d95f3](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRLRAG)\">\u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-blue?color=8A2BE2\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05592\" target=\"_blank\">\u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b5212f.svg?logo=arxiv>\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\n\u003C!-- \u003Cdiv align=\"center\">\n    \u003Cspan style=\"display:inline-block; margin-right: 10px;\">\n        \u003Ca href=\"https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fmathematical-reasoning-on-aime24?p=search-o1-agentic-search-enhanced-large\">\n            \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fsearch-o1-agentic-search-enhanced-large\u002Fmathematical-reasoning-on-aime24\"
alt=\"AIME24 Badge\">\n        \u003C\u002Fa>\n    \u003C\u002Fspan>\n    \u003Cspan style=\"display:inline-block; margin-right: 10px;\">\n        \u003Ca href=\"https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fmathematical-reasoning-on-amc23?p=search-o1-agentic-search-enhanced-large\">\n            \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fsearch-o1-agentic-search-enhanced-large\u002Fmathematical-reasoning-on-amc23\" alt=\"AMC23 Badge\">\n        \u003C\u002Fa>\n    \u003C\u002Fspan>\n  \u003Cspan style=\"display:inline-block; margin-right: 10px;\">\n        \u003Ca href=\"https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fon-gpqa?p=search-o1-agentic-search-enhanced-large\">\n            \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fsearch-o1-agentic-search-enhanced-large\u002Fon-gpqa\" alt=\"GPQA Badge\">\n        \u003C\u002Fa>\n    \u003C\u002Fspan>\n\u003C\u002Fdiv> -->\n\n\n\n\u003Ch5 align=\"center\"> If you like our project, please give us a star ⭐ on GitHub for the latest updates.\u003C\u002Fh5>\n\n\n# ✨ News\n+ [22 May 2025] ⚡️⚡️ [**R1-Searcher++**](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher-plus): We propose **R1-Searcher++**, a framework for training LLMs to adaptively use internal and external knowledge. It uses a two-stage strategy: an initial SFT cold-start phase for learning the basic format, followed by an RL phase for dynamic knowledge acquisition.
In the RL phase, we introduce a reward mechanism that encourages the use of internal knowledge and integrate a memorization mechanism that continuously assimilates the retrieved information, thereby enriching the model's internal knowledge.\nThe paper can be found here: [**arxiv.org\u002Fabs\u002F2505.17005**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17005)\n+ [22 May 2025] ⚡️⚡️ [**SimpleDeepSearcher-paper**](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSimpleDeepSearcher): We release the SimpleDeepSearcher paper, which also explores the impact of using a distilled model as the backbone for continued reinforcement learning training, as well as the effect of incorporating long-CoT math reasoning data during training. The paper also includes comprehensive experiments and can be found here: [**arxiv.org\u002Fabs\u002F2505.16834**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16834)\n+ [16 Apr 2025] ⚡️⚡️ [**SimpleDeepSearcher**](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSimpleDeepSearcher): We propose **SimpleDeepSearcher**, a framework designed to stimulate autonomous retrieval during complex reasoning via knowledge distillation and self-distillation. The goal is to achieve efficient and effective training using only a small amount of data.\n+ [8 Mar 2025] ⚡️⚡️ [**R1-Searcher**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592): We propose **R1-searcher**, which uses a *two-stage outcome-supervision reinforcement learning* approach to teach the model to invoke web search during reasoning: the model first learns how to invoke web search, and then learns how to use that search engine effectively.
This method does not require any instruction fine-tuning as a cold start, and it is compatible with both Base LLMs and Chat LLMs.\n\n# 💡 Overview\n\nLarge reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated that reinforcement learning can substantially strengthen long-step reasoning and thereby greatly improve reasoning performance. Despite these advantages, these models may lack the necessary knowledge when faced with knowledge-intensive problems, especially multi-hop questions and time-sensitive questions. It is therefore important to enable LLMs to invoke web search and obtain external information during the reasoning process.\n\nWe propose **R1-searcher**, which uses a *two-stage outcome-supervision reinforcement learning* approach to teach the model to invoke web search during reasoning: the model first learns how to invoke web search, and then learns how to use that search engine effectively. This method does not require any instruction fine-tuning as a cold start, and it is compatible with both Base LLMs and Chat LLMs.
We open-source the training code, inference code, model checkpoints, and the detailed technical report.\n\n- Arxiv: [arxiv.org\u002Fabs\u002F2503.05592](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592)\n- Model:\n    - Qwen-2.5-7B-Base-RAG-RL: https:\u002F\u002Fhuggingface.co\u002FXXsongLALA\u002FQwen-2.5-7B-base-RAG-RL\n    - Llama-3.1-8B-Instruct-RAG-RL: https:\u002F\u002Fhuggingface.co\u002FXXsongLALA\u002FLlama-3.1-8B-instruct-RAG-RL\n- Train data: https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXXsongLALA\u002FRAG-RL-Hotpotqa-with-2wiki\n\n![benchmark_picture](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_readme_0ba6aa3df70f.jpg)\n\n# ✨ Key Insights\n- Relying solely on outcome-supervised reinforcement learning, we can activate the model's intrinsic search capabilities using only query-answer pairs, for both Base LLMs and Chat LLMs.\n- Recent reinforcement learning algorithms, such as GRPO and Reinforce++, can both effectively activate the internal search capabilities of LLMs.\n- Training requires neither complex prompt engineering nor process supervision.\n- The capability of the Base LLM largely determines whether the model can start RL training directly from zero.\n- LongCoT reasoning after RL is a more effective and efficient test-time scaling method than existing tree-search-based methods such as Monte Carlo Tree Search.\n- Although RL training uses only local retrieval, the model generalizes well to other datasets and to online search scenarios.\n- The final 7B-parameter LLM achieves significant performance improvements over existing complex methods and even over closed-source LLMs (e.g., GPT-4o-mini).\n\n# ✨ Method\n## Overall\n\nWe employ a two-stage, reward-guided RL training approach:\n\nStage 1: Learn to invoke search using only the format reward.\n\nStage 2: Learn to solve questions by invoking search, using both the format reward and the answer reward.\n\n\n## 
Algorithm\nWe train with outcome-supervised reinforcement learning only, so two aspects need to be considered: (1) the reinforcement learning algorithm and (2) the reward design.\n\n- RL Algorithm: We use Reinforce++ as our RL algorithm. For each question, we average the rewards of *n* samples, which stabilizes training. For the solution format, we use `\u003Cthink>...\u003C\u002Fthink>` for thinking, `\u003Cbegin_of_search>...\u003Cend_of_search>` to invoke the search tool, `\u003Cbegin_of_documents>...\u003Cend_of_documents>` for the returned retrieval documents, and `\u003Canswer>...\u003C\u002Fanswer>` for the final answer.\n- Reward Design: In Stage 1, we use a retrieve reward: if the model performs retrieval and the solution meets the format requirements, 0.5 points are added to the answer reward. In Stage 2, the retrieval requirement is removed and we use an F1-based answer reward. A penalty of 2 points is subtracted from the answer reward if the solution does not meet the format requirements. The detailed implementation, including hyperparameters, can be found in our code.\n\n## Data\n\nWe select a subset of the HotpotQA and 2WikiMultiHopQA training sets as our training data, and use Qwen-2.5-7B-Instruct to perform rollouts on it.\n\nBased on the number of rollouts required to answer a question correctly, we classify the data into three categories: easy (\u003C10 rollouts), medium (10 to 20 rollouts), and difficult (>20 rollouts). These categories are then mixed in a specific ratio to form our training data.
All of our training data can be found here: https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXXsongLALA\u002FRAG-RL-Hotpotqa-with-2wiki.\n\n# 📄 Evaluation\nFollowing ReARTeR (https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.07861), we select four representative benchmarks: HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle.\n\nHotpotQA and 2WikiMultiHopQA are in-domain because we train on their training sets, while Musique and Bamboogle are out-of-domain, which lets us assess the generalization capabilities of our model. We randomly sample 500 examples from each of the development sets of HotpotQA, 2WikiMultiHopQA, and Musique as test sets. For Bamboogle, we use the full test set (125 samples).\n\nWikipedia passages serve as the retrieval corpus for all datasets; specifically, we use the August 2019 [Wikipedia corpus released by KILT](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FKILT). Because Bamboogle involves more recent knowledge, we additionally evaluate with online web search to examine how well the model works with a live search engine.\n\nFor evaluation metrics, we use ACC_R (Cover Exact Match) and ACC_L (LLM-as-Judge).\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_readme_533815d5bc7a.jpg)\nWith the same LLaMA-3.1-8B-Instruct base model, our method achieves significant improvements over existing methods and even surpasses closed-source models such as GPT-4o-mini. With the more powerful Qwen-2.5-7B-Base model, we conduct reinforcement learning directly from scratch and achieve the best performance on all in-domain and out-of-domain datasets, demonstrating the strong generalization capabilities of our model.\n\nFor Bamboogle, we additionally use Google for online search.
As we can see, incorporating online search yields better results than relying solely on a local knowledge base, indicating that online search can be seamlessly integrated into our model.\n![bamboogle](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_readme_80c5bd6aea07.jpg)\n\n# 🏃 Quick Start\n## Environment Setup\n> Note: the environment is the same as that of [STILL-3](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs\u002Ftree\u002Fmain\u002FSTILL-3-TOOL) (great work!).\n\n```bash\nconda create --name r1-searcher python=3.10.16\nconda activate r1-searcher\npip install vllm==0.6.5\npip install packaging\npip install ninja\npip install flash-attn --no-build-isolation\npip install deepspeed\npip install accelerate\npip install datasets\n```\n\n## Data Preparation\n\n```bash\ncd R1-Searcher\n\n## Process Wikipedia (titles and abstracts only)\nwget -nv --no-check-certificate https:\u002F\u002Frocketqa.bj.bcebos.com\u002Fcorpus\u002Fnq.tar.gz\ntar -zxf nq.tar.gz\nrm -rf nq.tar.gz # We only use the title and abstract.\n\n## Process Wikipedia full texts\nwget http:\u002F\u002Fdl.fbaipublicfiles.com\u002FKILT\u002Fkilt_knowledgesource.json\npython wiki_corpus_index_bulid\u002Fsplit_kilt_to_100.py\n\n## Index the tsv file.
We recommend splitting the original TSV file into n parts for embedding; otherwise, the process will be very slow.\npython wiki_corpus_index_bulid\u002Fbuild_corpus_embedding.py --file_path the_tsv_file_path --save_path the_pickle_path --gpu_id 0\npython wiki_corpus_index_bulid\u002Fbuild_corpus_idnex.py\n```\n\n## Training\n```bash\ncd R1-Searcher\n\n## Start Ray\nbash scripts\u002Fray_start.sh\n\n## Mount Wikipedia\npython train\u002Fwiki_corpus_load.py hotpotqa 5004 &\n\n## Convert jsonl to HF dataset\npython train\u002Fjsonl2hf_dataset.py --input data\u002Ftraining_set\u002Fstage_2.jsonl --output data\u002Ftraining_set\u002Fstage_2\n\n## Start reward server\npython train\u002Freward_server_qwen_zero.py --data_path data\u002Ftraining_set\u002Fstage_2 --reward_pretrain the_model_path --log_file results\u002Fsamples\u002Fqwen.jsonl --port 1278\n\n## Training\nbash scripts\u002Fqwen_reinforce_plus_train.sh | tee results\u002Flogs\u002Fqwen_reinforce_plus_train.txt\n```\n\n## Evaluation\n\n```bash\ncd R1-Searcher\n\n## Local Search\n## HotpotQA\npython train\u002Fwiki_corpus_load.py hotpotqa 5004 &\npython evaluation\u002Feval_search_loacl.py --gpu_id 0 --temp 0.0 --port 5004 --prompt_type v0 --src_file data\u002Feval_set\u002Fhotpotqa_500.jsonl --model_path the_path_to_model\n## 2Wiki, Musique, Bamboogle\npython train\u002Fwiki_corpus_load.py kilt 5005 &\npython evaluation\u002Feval_search_loacl.py --gpu_id 0 --temp 0.0 --port 5005 --prompt_type v0 --src_file data\u002Feval_set\u002Fbamboogle_500.jsonl --model_path the_path_to_model\n\n## Online Search\n## Bamboogle\npython evaluation\u002Feval_search_online.py --gpu_id 0 --temp 0.0 --port 5004 --prompt_type v0 --src_file data\u002Feval_set\u002Fbamboogle_500.jsonl --model_path the_path_to_model\n\n## Calculate Metrics\n## Exact Match, Cover Exact Match, F1 Score\npython evaluation\u002Fmetric_calc_rule.py the_path_to_results\n\n## LLM-as-Judge.
Remember to replace the input file with your own results.\npython evaluation\u002Fmetric_calc_gpt_as_judge.py\n```\n\n# 📄 Citation\nPlease cite our report if it is helpful for your research.\n\n```\n@article{R1-searcher,\n  title={R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning},\n  author={Huatong Song and Jinhao Jiang and Yingqian Min and Jie Chen and Zhipeng Chen and Wayne Xin Zhao and Ji-Rong Wen and Yang Lu and Xu Miu},\n  url={https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-searcher},\n  year={2025}\n}\n```\n\n# 📄 License\n\nThis project is released under the [MIT License](LICENSE).\n\n# 📞 Contact\n\nFor any questions or feedback, please reach out to us at [songhuatong123@ruc.edu.cn](mailto:songhuatong123@ruc.edu.cn).\n","\u003Ch1 align=\"center\"> R1-searcher：通过强化学习激励大语言模型的搜索能力\u003C\u002Fa>\u003C\u002Fh1>\n\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRLRAG\u002Fedit\u002Fmain\u002F\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode_License-MIT-blue\" alt=\"license\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRLRAG\u002Fedit\u002Fmain\u002F\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel_License-MIT-blue\" alt=\"license\">\u003C\u002Fa>\n\u003Ca href=\"[https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fyulan-team\u002Fyulan-mini-676d214b24376739b00d95f3](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FRLRAG)\">\u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-blue?color=8A2BE2\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.05592\" target=\"_blank\">\u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-b5212f.svg?logo=arxiv>\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\n\u003C!-- \u003Cdiv align=\"center\">\n    \u003Cspan style=\"display:inline-block; margin-right: 10px;\">\n     
   \u003Ca href=\"https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fmathematical-reasoning-on-aime24?p=search-o1-agentic-search-enhanced-large\">\n            \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fsearch-o1-agentic-search-enhanced-large\u002Fmathematical-reasoning-on-aime24\" alt=\"AIME24 Badge\">\n        \u003C\u002Fa>\n    \u003C\u002Fspan>\n    \u003Cspan style=\"display:inline-block; margin-right: 10px;\">\n        \u003Ca href=\"https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fmathematical-reasoning-on-amc23?p=search-o1-agentic-search-enhanced-large\">\n            \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fsearch-o1-agentic-search-enhanced-large\u002Fmathematical-reasoning-on-amc23\" alt=\"AMC23 Badge\">\n        \u003C\u002Fa>\n    \u003C\u002Fspan>\n  \u003Cspan style=\"display:inline-block; margin-right: 10px;\">\n        \u003Ca href=\"https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fon-gpqa?p=search-o1-agentic-search-enhanced-large\">\n            \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fsearch-o1-agentic-search-enhanced-large\u002Fon-gpqa\" alt=\"GPQA Badge\">\n        \u003C\u002Fa>\n    \u003C\u002Fspan>\n\u003C\u002Fdiv> -->\n\n\n\n\u003Ch5 align=\"center\"> 如果您喜欢我们的项目，请在 GitHub 上为我们点亮一颗星 ⭐，以获取最新更新。\u003C\u002Fh5>\n\n\n# ✨ 新闻\n+ [2025年5月22日] ⚡️⚡️ [**R1-Searcher++**](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher-plus)：我们提出了 **R1-Searcher++**，这是一个用于训练大语言模型自适应地使用内部和外部知识的框架。它采用两阶段策略：初始的 SFT 冷启动阶段用于基础格式的学习，随后是用于动态知识获取的强化学习阶段。在强化学习阶段，我们引入了一种奖励机制来鼓励内部知识的利用，并整合了一个记忆机制，以持续吸收检索到的信息，从而丰富模型的内部知识。\n论文链接：[**arxiv.org\u002Fabs\u002F2505.17005**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17005)\n+ [2025年5月22日] ⚡️⚡️ 
[**SimpleDeepSearcher-paper**](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSimpleDeepSearcher)：我们发布了 SimpleDeepSearcher 的论文，该论文还探讨了使用蒸馏模型作为骨干网络进行持续强化学习训练的影响，以及在训练过程中加入长链数学推理数据的效果。此外，论文还包括全面的实验。论文链接：[**arxiv.org\u002Fabs\u002F2505.16834**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16834)\n+ [2025年4月16日] ⚡️⚡️ [**SimpleDeepSearcher**](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSimpleDeepSearcher)：我们提出了 **SimpleDeepSearcher**，这是一个通过知识蒸馏和自蒸馏来激发复杂推理过程中自主检索能力的框架。其目标是在仅使用少量数据的情况下实现高效且有效的训练。\n+ [2025年3月8日] ⚡️⚡️ [**R1-Searcher**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592)：我们提出了 **R1-searcher**，采用一种*两阶段结果监督强化学习*方法，使模型能够在推理过程中学会调用网络搜索功能：首先让模型学习如何调用网络搜索，然后再教它如何有效地使用搜索引擎。这种方法无需任何指令微调来进行冷启动，同时兼容现有的基础型或对话型大语言模型。\n\n# 💡 概述\n\n大型推理模型（LRMs），如 OpenAI-o1 和 DeepSeek-R1，已经证明了强化学习在提升模型长步骤推理能力方面的显著效果，从而极大地提高了它们的推理性能。尽管具有这些优势，但在面对知识密集型问题时，尤其是多跳问题和时间敏感性问题，这些模型可能会缺乏必要的知识。因此，让大语言模型在推理过程中能够调用网络搜索并获取外部信息显得尤为重要。\n\n我们提出了 **R1-searcher**，采用一种*两阶段结果监督强化学习*方法，使模型能够在推理过程中学会调用网络搜索功能：首先让模型学习如何调用网络搜索，然后再教它如何有效地使用搜索引擎。这种方法无需任何指令微调来进行冷启动，同时兼容现有的基础型或对话型大语言模型。我们开源了训练代码、推理代码、模型检查点以及详细的技术报告。\n\n- Arxiv：[arxiv.org\u002Fabs\u002F2503.05592](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05592)\n- 模型：\n    - Qwen-2.5-7B-Base-RAG-RL：https:\u002F\u002Fhuggingface.co\u002FXXsongLALA\u002FQwen-2.5-7B-base-RAG-RL\n    - Llama-3.1-8B-Instruct-RAG-RL：https:\u002F\u002Fhuggingface.co\u002FXXsongLALA\u002FLlama-3.1-8B-instruct-RAG-RL\n- 训练数据：https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXXsongLALA\u002FRAG-RL-Hotpotqa-with-2wiki\n\n![benchmark_picture](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_readme_0ba6aa3df70f.jpg)\n\n# ✨ 关键见解\n- 仅依靠结果监督的强化学习，我们就可以仅使用查询-答案对来激活模型的内在搜索能力，无论面对的是基础型还是对话型大语言模型。\n- 近期的强化学习算法，如 GRPO 和 Reinforce++，都能够有效地激活大语言模型的内部搜索能力。\n- 
在训练过程中不需要复杂的提示工程或过程监督。\n- 基础大语言模型的能力在很大程度上决定了模型是否可以直接从零开始进行训练。\n- 强化学习后的 LongCoT 推理是一种比现有基于树搜索的方法（如蒙特卡洛树搜索）更有效、更高效的测试时扩展方法。\n- 通过使用本地检索数据进行强化学习训练，模型可以很好地泛化到其他数据集和在线搜索场景。\n- 最终的 7B 参数大语言模型相比现有的复杂方法甚至闭源大语言模型（如 
GPT-4o-mini）都取得了显著的性能提升。\n\n# ✨ 方法\n\n## 总体\n\n我们采用两阶段奖励引导的强化学习训练方法：\n\n第一阶段：仅使用格式奖励，学习如何调用搜索功能。\n\n第二阶段：结合格式奖励和答案奖励，学习如何通过调用搜索功能来解答问题。\n\n\n## 算法\n我们仅使用基于结果监督的强化学习进行训练，因此需要考虑两个主要方面：(1) 强化学习算法，以及 (2) 奖励的设计。\n\n- RL算法：我们使用 Reinforce++ 作为我们的强化学习算法。对于每一道题目，我们会对 *n* 个样本的奖励取平均值，以稳定训练过程。在解题格式上，我们使用 `\u003Cthink>...\u003C\u002Fthink>` 标签表示思考，`\u003Canswer>...\u003C\u002Fanswer>` 表示回答，`\u003Cbegin_of_search>...\u003Cend_of_search>` 用于调用搜索工具，而 `\u003Cbegin_of_documents>...\u003Cend_of_documents>` 则用于展示检索到的文档。\n- 奖励设计：在第一阶段，我们使用检索奖励：如果模型执行了检索且解答满足格式要求，则在答案奖励的基础上加 0.5 分。在第二阶段，去除检索要求，改用基于 F1 的答案奖励。如果解答不满足格式要求，则从答案奖励中扣除 2 分。包括超参数在内的详细实现请参见我们的代码。\n\n## 数据\n\n我们从 HotpotQA 和 2WikiMultiHopQA 的训练集中选取一部分作为训练数据，并使用 Qwen-2.5-7B-Instruct 在训练数据集上进行 rollout。\n\n根据正确回答一个问题所需的 rollout 次数，我们将数据分为三类：简单（少于 10 次）、中等（10 到 20 次）和困难（多于 20 次），随后按特定比例混合这三类数据，构成我们的训练数据。所有训练数据可在此获取：https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXXsongLALA\u002FRAG-RL-Hotpotqa-with-2wiki\n\n# 📄 评估\n参照 ReARTeR（https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.07861），我们选择了四个具有代表性的基准数据集：HotpotQA、2WikiMultiHopQA、Musique 和 Bamboogle。\n\n其中，HotpotQA 和 2WikiMultiHopQA 属于域内数据集，因为我们使用了它们的训练集；而 Musique 和 Bamboogle 则属于域外数据集，这有助于评估模型的泛化能力。我们从 HotpotQA、2WikiMultiHopQA 和 Musique 的开发集中各随机抽取 500 个样例作为测试集。对于 Bamboogle，则直接使用其全部测试集（125 个样例）作为测试数据。\n\n所有数据集的检索语料均来自维基百科，具体来说是 Facebook Research 于 2019 年 8 月发布的 [KILT 维基百科语料库](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FKILT)。此外，由于 Bamboogle 数据中包含较新的知识，我们还加入了在线网页搜索测试，以进一步评估模型与在线搜索能力的匹配程度。\n\n在评估指标方面，我们使用 ACC_R（覆盖精确匹配）和 ACC_L（LLM 作为裁判）。\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_readme_533815d5bc7a.jpg)\n正如我们所见，在使用相同的 LLaMA-3.1-8B-Instruct 基础模型时，我们的方法相比现有方法取得了显著提升，甚至超越了 GPT-4o-mini 等闭源模型。进一步地，当我们切换到更强大的基础模型 Qwen-2.5-7B-Base 时，我们直接从零开始进行强化学习训练，最终在所有域内和域外数据集上都取得了更好的成绩，充分展示了我们模型卓越的泛化能力。\n\n对于 
Bamboogle 数据集，我们额外使用了 Google 进行在线搜索。可以看出，相比于仅依赖本地知识库，结合在线搜索能够带来更优的结果，这表明将在线搜索能力无缝集成到我们的模型中是可行的。\n![bamboogle](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_readme_80c5bd6aea07.jpg)\n\n\n\n# 🏃 快速入门\n## 环境搭建\n> 注意：环境配置与 [STILL-3](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs\u002Ftree\u002Fmain\u002FSTILL-3-TOOL) 
相同（非常棒的工作！）。\n\n```bash\nconda create --name r1-searcher python=3.10.16\nconda activate r1-searcher\npip install vllm==0.6.5\npip install packaging\npip install ninja\npip install flash-attn --no-build-isolation\npip install deepspeed\npip install accelerate\npip install datasets\n```\n\n## 数据准备\n\n```bash\ncd R1-Searcher\n\n## 处理维基百科摘要\nwget -nv --no-check-certificate https:\u002F\u002Frocketqa.bj.bcebos.com\u002Fcorpus\u002Fnq.tar.gz\ntar -zxf nq.tar.gz\nrm -rf nq.tar.gz # 我们只使用标题和摘要。\n\n## 处理维基百科全文\nwget http:\u002F\u002Fdl.fbaipublicfiles.com\u002FKILT\u002Fkilt_knowledgesource.json\npython wiki_corpus_index_bulid\u002Fsplit_kilt_to_100.py\n\n## 对 TSV 文件建立索引。建议将原始 TSV 文件拆分为 n 份进行嵌入，否则处理过程会非常缓慢。\npython wiki_corpus_index_bulid\u002Fbuild_corpus_embedding.py --file_path the_tsv_file_path --save_path the_pickle_path --gpu_id 0\npython wiki_corpus_index_bulid\u002Fbuild_corpus_idnex.py\n```\n\n## 训练\n```bash\ncd R1-Searcher\n\n## 启动 Ray\nbash scripts\u002Fray_start.sh\n\n## 挂载维基百科\npython train\u002Fwiki_corpus_load.py hotpotqa 5004 &\n\n## 将 jsonl 转换为 hf 数据集\npython train\u002Fjsonl2hf_dataset.py --input data\u002Ftraining_set\u002Fstage_2.jsonl --output data\u002Ftraining_set\u002Fstage_2\n\n## 启动奖励服务器\npython train\u002Freward_server_qwen_zero.py --data_path data\u002Ftraining_set\u002Fstage_2 --reward_pretrain the_model_path --log_file results\u002Fsamples\u002Fqwen.jsonl --port 1278\n\n## 开始训练\nbash scripts\u002Fqwen_reinforce_plus_train.sh | tee results\u002Flogs\u002Fqwen_reinforce_plus_train.txt\n```\n\n## 评估\n\n```bash\ncd R1-Searcher\n\n## 本地搜索\n## HotpotQA\npython train\u002Fwiki_corpus_load.py hotpotqa 5004 &\npython evaluation\u002Feval_search_loacl.py --gpu_id 0 --temp 0.0 --port 5004 --prompt_type v0 --src_file data\u002Feval_set\u002Fhotpotqa_500.jsonl --model_path the_path_to_model\n## 2Wiki, Musique, Bamboogle\npython train\u002Fwiki_corpus_load.py kilt 5005 &\npython 
evaluation\u002Feval_search_loacl.py --gpu_id 0 --temp 0.0 --port 
5005 --prompt_type v0 --src_file data\u002Feval_set\u002Fbamboogle_500.jsonl --model_path the_path_to_model\n\n## 在线搜索\n## Bamboogle\npython evaluation\u002Feval_search_online.py --gpu_id 0 --temp 0.0 --port 5004 --prompt_type v0 --src_file data\u002Feval_set\u002Fbamboogle_500.jsonl --model_path the_path_to_model\n\n## 计算指标\n## 精确匹配、覆盖精确匹配、F1 分数\npython evaluation\u002Fmetric_calc_rule.py the_path_to_results\n\n## LLM 作为裁判。请记得将输入文件替换为你自己的结果。\npython evaluation\u002Fmetric_calc_gpt_as_judge.py\n```\n\n# 📄 引用\n如果本报告对您的研究有帮助，请引用。\n\n```\n@article{R1-searcher,\n  title={R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning},\n  author={Huatong Song and Jinhao Jiang and Yingqian Min and Jie Chen and Zhipeng Chen and Wayne Xin Zhao and Ji-Rong Wen and Yang Lu and Xu Miu},\n  url={https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-searcher},\n  year={2025}\n}\n```\n\n# 📄 许可证\n\n本项目采用 [MIT 许可证](LICENSE) 开放。\n\n# 📞 联系方式\n\n如有任何问题或反馈，请发送邮件至 [songhuatong123@ruc.edu.cn](mailto:songhuatong123@ruc.edu.cn)。","# R1-Searcher 快速上手指南\n\nR1-Searcher 是一个通过强化学习（RL）激励大语言模型（LLM）在推理过程中自主调用网络搜索能力的开源项目。它采用两阶段结果监督强化学习方法，无需冷启动指令微调，即可让 Base 或 Chat 模型学会“何时搜索”及“如何利用搜索结果”。\n\n## 环境准备\n\n### 系统要求\n- **Python**: 3.10.16 (推荐严格匹配版本)\n- **GPU**: 支持 CUDA 的 NVIDIA 显卡（建议显存 24GB+ 以运行 7B\u002F8B 模型训练与推理）\n- **操作系统**: Linux (Ubuntu 20.04\u002F22.04 推荐)\n\n### 前置依赖\n本项目环境配置参考了 [STILL-3](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FSlow_Thinking_with_LLMs\u002Ftree\u002Fmain\u002FSTILL-3-TOOL)，主要依赖 `vllm`, `deepspeed`, `flash-attn` 等高性能推理与训练库。\n\n## 安装步骤\n\n请依次执行以下命令创建虚拟环境并安装依赖。国内用户建议使用清华或阿里镜像源加速安装。\n\n```bash\n# 1. 创建并激活虚拟环境\nconda create --name r1-searcher python=3.10.16\nconda activate r1-searcher\n\n# 2. 安装基础依赖 (推荐使用国内镜像源)\npip install vllm==0.6.5 -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\npip install packaging ninja deepspeed accelerate datasets -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 3. 
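Install flash-attn (ninja must be installed first; --no-build-isolation lets the build use the PyTorch already installed in the environment)\npip install flash-attn --no-build-isolation -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n> **Note**: Compiling `flash-attn` can take a long time. Make sure the CUDA toolkit is installed and that its version is compatible with your PyTorch build.\n\n## Data Preparation\n\nBefore training, download the Wikipedia corpus and build the index.\n\n```bash\ncd R1-Searcher\n\n# 1. Download and process the Wikipedia abstracts (only titles and abstracts are used)\nwget -nv --no-check-certificate https:\u002F\u002Frocketqa.bj.bcebos.com\u002Fcorpus\u002Fnq.tar.gz\ntar -zxf nq.tar.gz\nrm -rf nq.tar.gz\n\n# 2. Download the full-text KILT knowledge source\nwget http:\u002F\u002Fdl.fbaipublicfiles.com\u002FKILT\u002Fkilt_knowledgesource.json\n\n# 3. Split the KILT data (to allow parallel processing later)\npython wiki_corpus_index_bulid\u002Fsplit_kilt_to_100.py\n\n# 4. Build the corpus embeddings\n# Tip: split the original TSV file into several parts and process them in parallel; otherwise this step is slow\n# Replace the_tsv_file_path and the_pickle_path with your actual paths\npython wiki_corpus_index_bulid\u002Fbuild_corpus_embedding.py --file_path the_tsv_file_path --save_path the_pickle_path --gpu_id 0\n\n# 5. Build the corpus index\npython wiki_corpus_index_bulid\u002Fbuild_corpus_idnex.py\n```\n\n## Basic Usage\n\nThe core of R1-Searcher is using reinforcement learning to teach the model a specific tag format for thinking, searching, and answering.\n\n### Model Interaction Format\nDuring inference the model follows this tag structure:\n- `\u003Cthink>...\u003C\u002Fthink>`: the model's internal reasoning.\n- `\u003Cbegin_of_search>...\u003C\u002Fend_of_search>`: the marker that triggers the search tool.\n- `\u003Cbegin_of_documents>...\u003C\u002Fend_of_documents>`: the documents returned by retrieval.\n- `\u003Canswer>...\u003C\u002Fanswer>`: the final answer.\n\n### Example Inference Logic\nThe source does not include the full inference script, but based on the project description the basic flow is roughly as follows (pseudocode sketch):\n\n1. **Load the model**: load a fine-tuned checkpoint (e.g., `Qwen-2.5-7B-Base-RAG-RL`).\n2. **Input the prompt**: pass the complex question directly; no special prompt engineering is needed.\n3. 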
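**Automatic execution**:\n   - The model emits a `\u003Cbegin_of_search>` tag.\n   - The system intercepts the tag and queries the local index or a search engine.\n   - The retrieved results are wrapped in `\u003Cbegin_of_documents>` and returned to the model.\n   - The model continues generating its reasoning and the final `\u003Canswer>`.\n\n### Pretrained Model Links\nYou can download the trained model weights directly from Hugging Face:\n- **Qwen-2.5-7B-Base-RAG-RL**: [XXsongLALA\u002FQwen-2.5-7B-base-RAG-RL](https:\u002F\u002Fhuggingface.co\u002FXXsongLALA\u002FQwen-2.5-7B-base-RAG-RL)\n- **Llama-3.1-8B-Instruct-RAG-RL**: [XXsongLALA\u002FLlama-3.1-8B-instruct-RAG-RL](https:\u002F\u002Fhuggingface.co\u002FXXsongLALA\u002FLlama-3.1-8B-instruct-RAG-RL)\n\n### Training Data\nTo reproduce training, use the open-sourced dataset:\n- **Dataset**: [XXsongLALA\u002FRAG-RL-Hotpotqa-with-2wiki](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FXXsongLALA\u002FRAG-RL-Hotpotqa-with-2wiki)\n- **Source**: subsets of the HotpotQA and 2WikiMultiHopQA training sets, mixed and sampled by difficulty (easy\u002Fmedium\u002Fhard).","A senior analyst at a fintech company is writing an in-depth report on \"volatility in the global emerging semiconductor supply chain in 2025\" and needs to combine the latest geopolitical developments, real-time capacity data, and expert opinions.\n\n### Without R1-Searcher\n- **Stale information**: the model relies only on the static knowledge available before its training cutoff and cannot see the export-control policies that several countries announced last week, so the analysis rests on outdated facts.\n- **High hallucination risk**: faced with missing real-time data, the model tends to confidently make things up, inventing capacity figures or expert quotes that do not exist.\n- **Broken reasoning chains**: the model has no habit of active retrieval, so it cannot pause and call a search engine when it hits a knowledge gap; its reasoning stalls midway or jumps to a forced conclusion.\n- **High manual verification cost**: the analyst must check every data point the model produces by hand, spending hours on follow-up searches and fact-checking.\n\n### With R1-Searcher\n- **Up-to-date awareness**: through reinforcement learning, R1-Searcher has learned to trigger web searches on its own during reasoning, pulling in the latest policy documents and industry news as it goes.\n- **Evidence-driven generation**: the model answers after obtaining external search results and cites its sources, substantially reducing hallucinations about capacity data and policy details.\n- **Adaptive retrieval**: on complex questions, R1-Searcher decides when external knowledge is needed and plans multiple rounds of search, linking fragmented information into a complete chain of reasoning.\n- **End-to-end automation**: question decomposition, retrieval, and report generation run automatically, so the analyst can focus on the final insights; verification time drops from hours to minutes.\n\nThe core value of R1-Searcher is that reinforcement learning gives the model the instinct to actively seek knowledge, turning it from a static knowledge parrot into a research assistant that can search online in real time and verify its own work.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRUCAIBox_R1-Searcher_711bc3b3.png","RUCAIBox","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FRUCAIBox_ca88ccaf.png","The official account of RUC AI Box, which does not engage in any commercial activities. 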
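Claims of business associations are fraudulent.",null,"http:\u002F\u002Faibox.ruc.edu.cn","https:\u002F\u002Fgithub.com\u002FRUCAIBox",[80,84,88],{"name":81,"color":82,"percentage":83},"Python","#3572A5",92.8,{"name":85,"color":86,"percentage":87},"Shell","#89e051",7.1,{"name":89,"color":90,"percentage":91},"Dockerfile","#384d54",0.1,709,46,"2026-04-16T03:31:29","MIT",4,"Linux","An NVIDIA GPU is required (because of the flash-attn and vLLM dependencies); the exact GPU model and VRAM are not stated; size the hardware to the model (7B\u002F8B)","Not stated",{"notes":101,"python":102,"dependencies":103},"The environment setup follows the STILL-3 project; installing flash-attn requires the --no-build-isolation flag; data preparation requires downloading the KILT Wikipedia corpus and building an index; training is based on Qwen-2.5-7B or Llama-3.1-8B.","3.10.16",[104,105,106,107,108,109,110],"vllm==0.6.5","flash-attn","deepspeed","accelerate","datasets","packaging","ninja",[35,14,13],"2026-03-27T02:49:30.150509","2026-04-18T14:31:51.084839",[115,120,125,130,135,139,143],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},39859,"During training, retrieve_num keeps decreasing and even drops to 0, and the model stops retrieving documents on its own. What should I do?","This is usually caused by a poorly designed data curriculum. If the model is trained on overly difficult questions early on, it may fail to answer them correctly and be penalized for format errors, so it stops retrieving.\nSuggested fixes:\n1. Try randomly shuffling the order of the training data.\n2. Check whether introducing hard samples too early makes the model fail at the retrieval stage.\n3. 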
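Check whether the problem lies in the retrieval corpus itself.","https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher\u002Fissues\u002F21",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},39860,"How do I reproduce the experimental results in the paper? What environment and data preparation are required?","Follow the Quick Start guide. The key step is to point `--data_path` in the command at your own dataset path.\nExample command:\n```bash\npython train\u002Freward_server_qwen_zero.py --data_path data\u002Ftraining_set\u002Fstage_2.jsonl --reward_pretrain the_model_path --log_file results\u002Fsamples\u002Fqwen.jsonl --port 1278\n```\nAs for the data file name mismatch (e.g., `split_kilt_to_100.py` does not read `kilt_knowledgesource.json`): the maintainers filtered the original JSON file, removing unneeded keys to reduce memory usage. The data samples they used have been released and can be found in the `wiki_corpus_index_bulid\u002Fsamples` directory.","https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher\u002Fissues\u002F4",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},39861,"What GPU configuration is needed to reproduce the results, and roughly how long does training take?","Hardware requirements depend on the model size:\n- Training Qwen-2.5-7B: at least 5 H800 GPUs with 80GB of memory each.\n- Training Llama-3.1-8B: at least 6 H800 GPUs with 80GB of memory each.\n- Evaluation: a single H800 GPU with 80GB of memory.\nTraining takes about 8 hours. For other models (such as QwQ-32B), resource needs depend on model size and generation length; test them yourself.","https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher\u002Fissues\u002F9",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},39862,"When running the training script, the get_reward function raises a KeyError saying the 'question' field cannot be found. How do I fix it?","This error is usually caused by one of two things:\n1. The logic that extracts the question from the input is wrong (for example, parsing the prompt fails).\n2. 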
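The dataset mounted by the reward server does not match the dataset used for training.\nTroubleshooting steps:\n- Check the extraction logic of the `get_qa` function and make sure it parses the current query format correctly.\n- Confirm that the field structure in `stage_2.jsonl` matches what the code expects (input_key should be 'question').\n- If extraction fails because the query contains a lot of unstructured text, check the data preprocessing pipeline.","https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FR1-Searcher\u002Fissues\u002F31",{"id":136,"question_zh":137,"answer_zh":138,"source_url":129},39863,"Why does the project provide only one training script, for Qwen? Is it RL-zero?","Yes. The provided script mainly corresponds to the RL-zero method in the paper. Although the paper may not discuss the term \"RL-zero\" at length, the script implements the RL training pipeline for search augmentation, applied directly to the model without a cold-start fine-tuning stage. The project currently focuses on the Qwen implementation; to adapt other models, modify this script accordingly.",{"id":140,"question_zh":141,"answer_zh":142,"source_url":129},39864,"When training Qwen-2.5-3B, the Ray status shows no progress and GPU\u002FCPU usage looks abnormal. How can I tell whether training is working?","When the Ray status is stuck or resource usage is abnormal, the symptoms alone are not enough to locate the problem.\nWhat to do:\n- Provide detailed training logs; without them the location of the error cannot be determined.\n- Check the `ray status` output and the specific error stack traces.\n- Confirm that environment variables, inter-node communication, and GPU memory allocation meet the script's requirements.",{"id":144,"question_zh":145,"answer_zh":146,"source_url":124},39865,"During Stage 2 training, a KeyError still occurs even with apply_chat_template enabled. What should I do?","If a KeyError still occurs after enabling `apply_chat_template`, the problem is not template application but missing fields in the data itself or an error in the extraction logic.\nSuggestions:\n1. Double-check that the input file (stage_2.jsonl) actually contains the key the code expects (usually 'question').\n2. Check that the arguments passed when `get_reward` calls `get_qa` are correct.\n3. Compare against the official sample data format and make sure your local data format matches exactly.",[]]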