[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-PRIME-RL--PRIME":3,"tool-PRIME-RL--PRIME":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",145895,2,"2026-04-08T11:32:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 
格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":75,"owner_url":76,"languages":77,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":94,"env_os":95,"env_gpu":96,"env_ram":97,"env_deps":98,"category_tags":105,"github_topics":106,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":110,"updated_at":111,"faqs":112,"releases":143},5650,"PRIME-RL\u002FPRIME","PRIME","Scalable RL solution for advanced reasoning of language models","PRIME 是一个专为提升大语言模型高级推理能力而设计的开源强化学习解决方案。面对传统模仿学习在扩展性上的瓶颈，PRIME 致力于将数据驱动的方法转化为基于探索的强化学习路径，从而让模型具备更强的逻辑推导与问题解决能力。\n\n该工具主要解决了两大核心难题：一是如何高效、可扩展地获取精确且密集的奖励信号，二是如何构建高效的强化学习算法以充分释放这些信号的潜力。PRIME 的独特技术亮点在于其“隐式过程奖励建模”（Implicit Process Reward Modeling）机制，它无需依赖昂贵的人工标注或外部裁判模型，即可在训练过程中自动推断出细粒度的过程奖励，显著降低了高质量推理数据的获取成本。\n\nPRIME 非常适合 AI 研究人员、大模型开发者以及对提升模型数学推理和复杂逻辑能力感兴趣的技术团队使用。项目已完整开放训练代码、数据预处理脚本及评估工具，并支持集成到 veRL 框架中，方便用户快速复现结果或进行二次开发。通过 PRIME，开发者可以更轻松地探索超越单纯蒸馏或模仿的模型进化路径，推动语言模型在复杂推理任务上的表现迈向新台阶。","\u003Cdiv align=\"center\">\n\n# Process Reinforcement Through Implicit Rewards\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456)  [![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRIME-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME)  [![Notion](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNotion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white)](https:\u002F\u002Fcurvy-check-498.notion.site\u002FProcess-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)  [![Hugging Face Collection](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRIME_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002FPRIME-RL)\n\n\u003Cdiv align=\"center\" style=\"font-family: Arial, sans-serif;\">\n  \u003Cp>\n    \u003Ca href=\"#🎉news\" style=\"text-decoration: none; font-weight: bold;\">🎉 News\u003C\u002Fa> •\n    \u003Ca href=\"#🔗links\" style=\"text-decoration: none; font-weight: bold;\">🔗 Links\u003C\u002Fa> •\n    \u003Ca href=\"#✨getting-started\" style=\"text-decoration: none; font-weight: 
bold;\">✨ Getting Started\u003C\u002Fa> •\n    \u003Ca href=\"#📖introduction\" style=\"text-decoration: none; font-weight: bold;\">📖 Introduction\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\n    \u003Ca href=\"#🔧usage\" style=\"text-decoration: none; font-weight: bold;\">🔧 Usage\u003C\u002Fa> •\n    \u003Ca href=\"#📃evaluation\" style=\"text-decoration: none; font-weight: bold;\">📃 Evaluation\u003C\u002Fa> •\n    \u003Ca href=\"#🎈citation\" style=\"text-decoration: none; font-weight: bold;\">🎈 Citation\u003C\u002Fa> •\n    \u003Ca href=\"#🌻acknowledgement\" style=\"text-decoration: none; font-weight: bold;\">🌻 Acknowledgement\u003C\u002Fa> •\n    \u003Ca href=\"#📈star-history\" style=\"text-decoration: none; font-weight: bold;\">📈 Star History\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003C\u002Fdiv>\n\n\n# 🎉News\n\n- **[2025\u002F03\u002F12]** PRIME has been integrated into veRL main branch! Try [here](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl\u002Ftree\u002Fmain\u002Frecipe\u002Fprime).\n- **[2025\u002F02\u002F04]** Released our Paper on arXiv. See [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456).\n- **[2025\u002F01\u002F06]** We release the training\u002Feval\u002Fdata_preprocessing code. Enjoy! We are working on the paper and will release it very soon.\n- **[2025\u002F01\u002F02]** We present **PRIME** (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards, to advance reasoning abilities of language models beyond imitation or distillation. All models and data released through [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FPRIME-RL).\n\n# 🔗Links\n\n- 📜 [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456)\n- 📜 [Blog](https:\u002F\u002Fcurvy-check-498.notion.site\u002FProcess-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)\n- 🤗 [PRIME Collection](https:\u002F\u002Fhuggingface.co\u002FPRIME-RL)\n\n# ✨Getting Started\n\nCurrently, we provide the following code of PRIME, you can find more details in each directory.\n- ``training``: Implementation and training scripts for PRIME.\n- ``data_preprocessing``: Data preparation, especially math data for PRIME.\n- ``eval``: Evaluation scripts to reproduce PRIME results.\n- For Implicit PRM training and eval, please refer to [this repo](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FImplicitPRM).\n\n# 📖Introduction\n\n![image-20241230162026156](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPRIME-RL_PRIME_readme_5888c5ff3f72.png)\n\nAdvanced reasoning of large language models (LLMs), while improvable through data-driven imitation, is still clouded by serious scalability challenges. We believe the key to overcoming such challenges lies in transforming data-driven approaches into *exploration-based* methods, as exemplified by reinforcement learning (RL). To this end, two critical bottlenecks need to be alleviated to bridge this transformation: (1) how to obtain precise *reward signals* efficiently and scalably, especially for *dense* ones? (2) how can we build effective RL algorithms to fully *unleash* the potential of these signals? \n\n\nWe seek the **scalable** path towards advanced reasoning capabilities with **efficient reward modeling and reinforcement learning**. Our work stems from the implicit process reward modeling (PRM) objective. Without the need for any process label, implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. 
# 📖Introduction

![image-20241230162026156](https://oss.gittoolsai.com/images/PRIME-RL_PRIME_readme_5888c5ff3f72.png)

Advanced reasoning of large language models (LLMs), while improvable through data-driven imitation, is still clouded by serious scalability challenges. We believe the key to overcoming such challenges lies in transforming data-driven approaches into *exploration-based* methods, as exemplified by reinforcement learning (RL). To this end, two critical bottlenecks need to be alleviated to bridge this transformation: (1) how to obtain precise *reward signals* efficiently and scalably, especially *dense* ones? (2) how can we build effective RL algorithms to fully *unleash* the potential of these signals?

We seek a **scalable** path towards advanced reasoning capabilities through **efficient reward modeling and reinforcement learning**. Our work stems from the implicit process reward modeling (PRM) objective: without the need for any process label, an implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. Besides improving model performance through inference scaling, the true power of the implicit PRM is unveiled in online RL training. Specifically, it brings three benefits to RL:

- **Dense reward:** The implicit PRM directly learns a Q-function that provides rewards for *each token*, which alleviates the reward-sparsity issue without the need for an extra value model.
- **Scalability:** The implicit PRM can be updated online with only outcome labels. We can therefore update the PRM directly on on-policy rollouts given an outcome verifier, which mitigates distribution shift as well as scalability issues for PRMs.
- **Simplicity:** The implicit PRM is inherently a language model. In practice, we show that it is unnecessary to train a PRM beforehand, since the SFT model itself already serves as a strong starting point.

We then dive into RL to figure out its key algorithm designs and implementation techniques. To this end, we present Process Reinforcement through IMplicit rEwards (PRIME), which effectively incorporates and updates PRMs in RL.

![prm](https://oss.gittoolsai.com/images/PRIME-RL_PRIME_readme_f00f6a4ac194.gif)

As shown in the animation above, in PRIME the policy model and the PRM are both initialized from the SFT model. In each RL iteration, the policy model first generates rollouts. Then the [implicit PRM](https://arxiv.org/abs/2412.01981) and an outcome verifier score the rollouts, and the implicit PRM is updated on the rollouts with the outcome reward. Finally, the outcome reward $r_o$ and the process reward $r_p$ are combined and used to update the policy model.

The PRIME implementation pseudocode is as follows:

![prime-algo](https://oss.gittoolsai.com/images/PRIME-RL_PRIME_readme_5cb9e65b7720.png)

The algorithm flow includes:

1. **Prompt filtering** based on policy model performance, preserving only prompts on which the policy model $\pi_\theta$ achieves an accuracy between 0.2 and 0.8.
2. **Calculate the implicit process reward** $r^t$.
3. **Update the implicit PRM** $\pi_\psi$ based on the predicted implicit process reward $r^t$ and the ground-truth outcome label $r$.
4. **Advantage estimation with RLOO.** Specifically, we first calculate the returns of outcome rewards and implicit process rewards separately:
   - For ground-truth outcome rewards, we directly adopt RLOO without any modification.
   - For implicit process rewards, we perform a three-step process to calculate the return: (1) use the averaged implicit process rewards to calculate the leave-one-out baseline; (2) normalize the process reward at step $t$ by subtracting the baseline; (3) calculate the discounted return for each response. Finally, the advantage is set to the combination of both returns (sketched below).
5. **Update the policy** $\pi_\theta$ using the PPO loss for legit importance sampling.
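The step-4 combination can be made concrete with a small sketch. Per the [ImplicitPRM paper](https://arxiv.org/abs/2412.01981), the per-token implicit reward is roughly the scaled log-ratio $\beta \log \frac{\pi_\psi(y_t \mid y_{<t})}{\pi_\text{ref}(y_t \mid y_{<t})}$; the code below assumes those per-token rewards are already computed. It is an illustration under stated assumptions, not the repository's implementation: the tensor shapes, the equal-weight combination, and `gamma` are assumptions.

```python
import torch

def prime_advantages(proc_rewards, outcome_rewards, gamma=1.0):
    """Illustrative RLOO-style advantage combination (assumed shapes).

    proc_rewards:    [K, T] implicit per-token process rewards for K >= 2 rollouts
    outcome_rewards: [K]    verifier outcome rewards (e.g., 0/1 correctness)
    """
    K, T = proc_rewards.shape

    # RLOO on outcome rewards: baseline each response by the mean of the others.
    loo_outcome = (outcome_rewards.sum() - outcome_rewards) / (K - 1)
    adv_outcome = outcome_rewards - loo_outcome                     # [K]

    # (1) Leave-one-out baseline from the averaged process reward per response.
    mean_proc = proc_rewards.mean(dim=1)                            # [K]
    loo_proc = (mean_proc.sum() - mean_proc) / (K - 1)              # [K]

    # (2) Normalize the step-t process reward by subtracting the baseline.
    centered = proc_rewards - loo_proc[:, None]                     # [K, T]

    # (3) Discounted return-to-go over tokens for each response.
    returns = torch.zeros_like(centered)
    running = torch.zeros(K)
    for t in reversed(range(T)):
        running = centered[:, t] + gamma * running
        returns[:, t] = running

    # Combine both returns; the outcome advantage is broadcast over tokens.
    return returns + adv_outcome[:, None]
```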
# 🔧Usage

We apply tailored prompts for coding and math tasks:

**Coding**

```
{question} + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end."
```

**Math**

```
{question} + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
```

<details>
<summary>Click to view inference code.</summary>

```python
import os
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Quiet NCCL P2P warnings and keep tokenizer parallelism on.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

def generate(question_list, model_path):
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.90,
    )
    sampling_params = SamplingParams(max_tokens=8192, temperature=0.0, n=1)
    outputs = llm.generate(question_list, sampling_params, use_tqdm=True)
    completions = [[output.text for output in output_item.outputs] for output_item in outputs]
    return completions

def make_conv_hf(question, tokenizer):
    # For math problems:
    content = question + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
    # For code problems:
    # content = question + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end."
    msg = [
        {"role": "user", "content": content}
    ]
    return tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)

def run():
    model_path = "PRIME-RL/Eurus-2-7B-PRIME"
    all_problems = [
        "which number is larger? 9.11 or 9.9?"
    ]
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    completions = generate([make_conv_hf(p, tokenizer) for p in all_problems], model_path)
    print(completions)
    # [['[ASSESS]\n\n# The problem asks us to compare two decimal numbers, 9.11 and 9.9...\n\nNext action: [OUTPUT]\n\nThe final answer is $\\boxed{9.9}$.\n\n']]

if __name__ == "__main__":
    run()
```

</details>
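For concreteness, here is how the math template above expands for a single question (an illustrative example; `question` can be any task statement):

```python
question = "which number is larger? 9.11 or 9.9?"
prompt = question + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
print(prompt)
# which number is larger? 9.11 or 9.9?
#
# Present the answer in LaTex format: \boxed{Your answer}
```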
# 📃Evaluation

Through PRIME, we achieve substantial improvements on key reasoning benchmarks over the SFT version of the model: a **16.7%** average improvement, and over **20%** on the AMC and AIME competitions. Our final model, Eurus-2-7B-PRIME, based on Qwen2.5-Math-7B-Base, surpasses its instruct version on five key reasoning benchmarks.

The final results are presented below:

| Benchmark | **Eurus-2-7B-PRIME** | **Eurus-2-7B-SFT** | **Qwen2.5-Math-7B-Instruct** | **Llama-3.1-70B-Instruct** | **GPT-4o** |
| ------------- | -------------------- | ------------------ | ---------------------------- | -------------------------- | ---------- |
| AIME 2024     | **26.7 (+23.3)**     | 3.3                | 13.3                         | 16.7                       | 9.3        |
| MATH-500      | 79.2 (+14.1)         | 65.1               | **79.8**                     | 64.6                       | 76.4       |
| AMC           | **57.8 (+27.7)**     | 30.1               | 50.6                         | 30.1                       | 45.8       |
| Minerva Math  | **38.6 (+5.9)**      | 32.7               | 34.6                         | 35.3                       | 36.8       |
| OlympiadBench | 42.1 (+12.3)         | 29.8               | 40.7                         | 31.9                       | **43.3**   |
| Avg.          | **48.9 (+16.7)**     | 32.2               | 43.8                         | 35.7                       | 43.3       |

We achieved this with only 1/10 of the data and model resources compared with Qwen-Math:

|            | **Eurus-2-7B-PRIME**          | **Qwen2.5-Math-7B-Instruct**    |
| ---------- | ----------------------------- | ------------------------------- |
| Base model | Qwen2.5-Math-7B               | Qwen2.5-Math-7B                 |
| SFT data   | **230K (open-source)**        | 2.5M (open-source and in-house) |
| RM data    | **0**                         | 618K (in-house)                 |
| RM         | **Eurus-2-7B-SFT**            | Qwen2.5-Math-RM (72B)           |
| RL data    | **150K queries × 4 samples**  | 66K queries × 32 samples        |

# 🎈Citation

If you find PRIME or ImplicitPRM helpful, please cite us.

```bibtex
@article{cui2025process,
  title={Process reinforcement through implicit rewards},
  author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}
```

```bibtex
@article{yuan2024implicitprm,
  title={Free Process Rewards without Process Labels},
  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
  journal={arXiv preprint arXiv:2412.01981},
  year={2024}
}
```
# 🌻Acknowledgement

We implement our reinforcement learning algorithm by extending [veRL](https://github.com/volcengine/verl). We use [vLLM](https://github.com/vllm-project/vllm) for inference and develop our evaluation scripts based on [Eurus](https://github.com/OpenBMB/Eurus), [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math), and [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench). Our data sources mainly include [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), [APPS](https://huggingface.co/datasets/codeparrot/apps), [CodeContests](https://huggingface.co/datasets/deepmind/code_contests), [TACO](https://huggingface.co/datasets/BAAI/TACO), and [Codeforces](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions). Thanks for their great contributions!

# 📈Star History

[![Star History Chart](https://oss.gittoolsai.com/images/PRIME-RL_PRIME_readme_302bf1a80a3d.png)](https://star-history.com/#PRIME-RL/PRIME&Date)
---

# PRIME Quick Start Guide

PRIME (Process Reinforcement through Implicit Rewards) is an open-source online reinforcement learning (RL) framework designed to improve the reasoning ability of large language models via an implicit process reward model (Implicit PRM). It learns dense reward signals and optimizes the policy without any extra process-label data.

## Environment setup

### System requirements

- **OS**: Linux (Ubuntu 20.04+ recommended)
- **GPU**: CUDA-capable NVIDIA GPU (≥ 24 GB VRAM recommended; multi-GPU setups need NCCL configured)
- **Python**: 3.9 or later

### Prerequisites

The core training code of PRIME has been merged into the **veRL** (Volcengine RL) main branch; inference relies on **vLLM** and **Transformers**.

Main Python dependencies:

- `torch` (PyTorch)
- `transformers`
- `vllm` (for efficient inference)
- `tqdm`
- `veRL` (for training)

> **Tip**: Developers in mainland China may want to use the Tsinghua or Alibaba PyPI mirrors to speed up pip installs.

## Installation

### 1. Clone the project and install dependencies

First clone the PRIME repository to get the data-processing and evaluation scripts:

```bash
git clone https://github.com/PRIME-RL/PRIME.git
cd PRIME
```

Install the base Python dependencies (preferably in a virtual environment; see the sketch below):

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers vllm tqdm accelerate
# To use a mainland-China mirror instead:
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch torchvision torchaudio transformers vllm tqdm accelerate
```
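A minimal virtual-environment setup for the installs above (a sketch assuming Python's built-in `venv`; the environment name `prime-env` is arbitrary):

```bash
# Create and activate an isolated environment before installing dependencies.
python3 -m venv prime-env
source prime-env/bin/activate
python -m pip install --upgrade pip
```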
### 2. Install the training framework (veRL)

Since PRIME's training logic has been merged into veRL, install veRL and locate the PRIME recipe directory:

```bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install -e .
# Verify the installation: check that the recipe/prime directory exists
ls recipe/prime
```

> **Note**: The standalone training and evaluation code for the implicit PRM lives in the [ImplicitPRM repository](https://github.com/PRIME-RL/ImplicitPRM); clone it separately if you need it.

## Basic usage

The following example shows how to run math-problem inference with the released **Eurus-2-7B-PRIME** model. The script uses `vLLM` for efficient generation.

### Inference example

Create a file `run_inference.py` with the following content:

```python
import os
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Environment variables to smooth multi-GPU communication and tokenizer parallelism
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

def generate(question_list, model_path):
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.90,
    )
    sampling_params = SamplingParams(max_tokens=8192, temperature=0.0, n=1)
    outputs = llm.generate(question_list, sampling_params, use_tqdm=True)
    completions = [[output.text for output in output_item.outputs] for output_item in outputs]
    return completions

def make_conv_hf(question, tokenizer):
    # Math prompt template
    content = question + "\n\nPresent the answer in LaTex format: \\boxed{Your answer}"
    # For coding tasks, use this template instead:
    # content = question + "\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end."
    msg = [
        {"role": "user", "content": content}
    ]
    return tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)

def run():
    # Official model on Hugging Face; if access is slow, download it to a
    # local path first and point model_path there (see the sketch below).
    model_path = "PRIME-RL/Eurus-2-7B-PRIME"
    all_problems = [
        "which number is larger? 9.11 or 9.9?"
    ]
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    inputs = [make_conv_hf(problem, tokenizer) for problem in all_problems]
    completions = generate(inputs, model_path)
    print(completions)

if __name__ == "__main__":
    run()
```
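As noted in the comment above, one option for slow connections is to pre-download the checkpoint and pass the local path (a sketch assuming the `huggingface_hub` package is installed; the target directory is arbitrary):

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint once; reuse the local copy afterwards.
local_path = snapshot_download(
    repo_id="PRIME-RL/Eurus-2-7B-PRIME",
    local_dir="./Eurus-2-7B-PRIME",
)
print(local_path)  # pass this path as model_path in run_inference.py
```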
### Run the script

Execute the script from a terminal:

```bash
python run_inference.py
```

**Expected output**: the model prints text containing a step-by-step chain of thought and a final boxed answer, for example:

```text
[['[ASSESS]\n\n# The problem asks us to compare two decimal numbers...\nNext action: [OUTPUT]\n\nThe final answer is $\\boxed{9.9}$.\n\n']]
```

### Getting the models

- **Hugging Face**: [PRIME-RL Collection](https://huggingface.co/PRIME-RL)
- **Mirror for mainland China**: search for the `PRIME-RL` or `Eurus` model series on [ModelScope](https://modelscope.cn/), download locally, and set `model_path` to the local path.

---

# Use case

A fintech team is building an AI assistant that automatically solves complex quantitative math problems to support analysts deriving high-frequency trading strategies.

### Without PRIME

- **Black-box reasoning**: the model generates answers purely by imitating historical data; on novel math variants it often produces wrong conclusions with no traceable logical break point.
- **Sparse, delayed reward signals**: traditional RL gives feedback only on the final answer's correctness, so the model cannot tell which intermediate derivation step caused a failure, making training very inefficient.
- **Generalization bottleneck**: as problem difficulty grows, pure data distillation stops working and performance drops sharply on unseen complex reasoning chains.
- **Costly trial and error**: improving accuracy requires massive compute for blind exploration, with little payoff in the absence of fine-grained process guidance.

### With PRIME

- **Implicit process-reward guidance**: the Implicit PRM automatically provides dense signals at every reasoning step, letting the model sense in real time when its logic drifts.
- **Explainable error correction**: the model learns to monitor its own derivation rather than only the result, actively identifying and fixing logical gaps in intermediate steps.
- **A leap from imitation to exploration**: PRIME helps the model find correct multi-step solutions even on olympiad-level problems it has never seen.
- **Markedly better training efficiency**: thanks to the scalable online RL architecture, the team reaches new accuracy levels on complex reasoning with less data and compute.

By turning coarse outcome feedback into fine-grained process guidance, PRIME gives large models the expert-like ability to "correct themselves while thinking."

---

# Repository facts

- **Owner**: [PRIME-RL](https://github.com/PRIME-RL), "Researching scalable (RL) methods on language models."
- **Stars / forks**: 1,841 / 109 · **License**: Apache-2.0 · **Last commit**: 2026-04-08 · **Difficulty score**: 4
- **Languages**: Python 98.4%, ANTLR 0.9%, Shell 0.7%
- **Categories**: dev frameworks, language models · **Topics**: llm, reasoning, rl
- **OS**: Linux · **GPU**: an NVIDIA GPU is required. The example code uses vLLM with `tensor_parallel_size` and `gpu_memory_utilization=0.90`, implying a multi-GPU or large-VRAM environment for running 7B+ models and RL training; exact models and CUDA versions are unspecified. **RAM**: unspecified.
- **Dependencies**: torch, transformers, vllm, tqdm. The tool is mainly integrated into the veRL (volcengine/verl) framework; the inference example depends on vLLM; the Implicit PRM training code lives in a separate repository (PRIME-RL/ImplicitPRM). The NCCL and tokenizer-parallelism environment variables indicate the project targets multi-GPU Linux environments.

# FAQ

### How do I train on multiple nodes or GPUs, and what should I adjust on out-of-memory (OOM) errors?

1. Make sure `mini_batch_size` and `micro_batch_size` are divisible by `world_size` (the total GPU count).
2. On OOM, try lowering vLLM's memory-utilization parameter `actor_rollout_ref.rollout.gpu_memory_utilization` (e.g., to 0.7; see the launch sketch below).
3. Reduce `data.train_batch_size` and `actor_rollout_ref.actor.ppo_mini_batch_size`.
4. For large models (e.g., 32B), try increasing `actor_rollout_ref.rollout.tensor_model_parallel_size` to use tensor parallelism.
5. For multi-node training, start a local Ray cluster first, then launch training from the head node.

Source: https://github.com/PRIME-RL/PRIME/issues/31
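The tuning knobs above are hydra-style `key=value` overrides on the veRL launch command. A sketch of how they might be passed (the entrypoint and the numeric values are illustrative assumptions; consult `recipe/prime` in veRL for the real launch script):

```bash
# Illustrative veRL-style launch with the overrides discussed above.
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=128 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2
```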
### How do I fix "ValueError: too many values to unpack (expected 4)" when running the example scripts?

This is usually a `flash_attn` version incompatibility. Two fixes:

1. Update `flash_attn` to the latest version; or
2. Edit line 57 of `training/verl/workers/actor/dp_actor.py`, changing

   `input_ids_rmpad, indices, cu_seqlens, max_seqlen_in_batch = unpad_input()`

   to

   `input_ids_rmpad, indices, *_ = unpad_input()`

   which resolves the unpacking-count mismatch.

Source: https://github.com/PRIME-RL/PRIME/issues/48

### Validation is very slow. How can I speed up training or understand where the time goes?

Slow validation is usually caused by accuracy filtering, which discards a large share of prompts (about 2/3). This slows each step but reduces the total training time per epoch.

Suggestions:
1. For small models (e.g., 0.5B), increase the `mini_batch_size` and `micro_batch_size` for the actor and RM updates.
2. If strict filtering is unnecessary, set `data.filter_accuracy=False`.
3. Note that the original Eurus-PRIME run only performed about 320 update steps because of the filtering; full training takes longer.

Source: https://github.com/PRIME-RL/PRIME/issues/20

### How do I configure multi-node distributed training correctly? Setting the node count directly raises errors.

You cannot add nodes just by setting `trainer.nnodes`. The correct procedure is:
1. First start a local Ray cluster across all target nodes.
2. Then launch the training job from the head node.

If changing the parameter directly raises errors (e.g., socket connection failures), see the Ray documentation on launching on-premises clusters: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html

Source: https://github.com/PRIME-RL/PRIME/issues/15

### The evaluation environment fails to install, or results (e.g., AIME accuracy) are inconsistent and hard to reproduce. What can I do?

Fluctuating results usually come from the randomness of LLM generation. Remedies:
1. Use the latest version of `vllm`.
2. Follow vLLM's reproducibility guide (https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/reproducibility.py) to fix random seeds and configuration.
3. Check the sampling hyperparameters; temperature=0.3 and top_p=0.95 are suggested (defer to the project's latest evaluation code).
4. Make sure dependency versions match the authors' `requirements.txt`.

Source: https://github.com/PRIME-RL/PRIME/issues/63

### Training a 7B model on 8x A800 80GB GPUs hits a Ray OOM error right after the first step. What should I try?

Even with seemingly ample memory, Ray tasks can crash due to its memory-management strategy. Try the following:
1. Limit task concurrency: when starting Ray or configuring tasks, set `num_cpus` to a smaller value to cap the number of simultaneously running tasks and avoid transient memory spikes.
2. Check and lower `gpu_memory_utilization` to leave more headroom for the system and other processes.
3. Check whether the rollout batch size is too large and reduce `data.train_batch_size` accordingly.

Source: https://github.com/PRIME-RL/PRIME/issues/59
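For the multi-node questions above, a minimal Ray cluster bootstrap looks like this (a sketch assuming the default port and a shared environment on all nodes; the head-node IP is a placeholder):

```bash
# On the head node: start Ray and note the printed address.
ray start --head --port=6379

# On each worker node: join the cluster (replace the placeholder IP).
ray start --address="10.0.0.1:6379"

# Back on the head node: confirm all nodes registered, then launch training.
ray status
```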