[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-RLHFlow--RLHF-Reward-Modeling":3,"tool-RLHFlow--RLHF-Reward-Modeling":61},[4,18,28,36,45,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,2,"2026-04-19T23:22:26",[16,14,13,15,27],"插件",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 
绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":24,"last_commit_at":42,"category_tags":43,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161147,"2026-04-19T23:31:47",[14,13,44],"语言模型",{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":24,"last_commit_at":51,"category_tags":52,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 
都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":24,"last_commit_at":59,"category_tags":60,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[27,13,15,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":73,"owner_website":76,"owner_url":78,"languages":79,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":10,"env_os":88,"env_gpu":89,"env_ram":88,"env_deps":90,"category_tags":97,"github_topics":98,"view_count":24,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":103,"updated_at":104,"faqs":105,"releases":146},9920,"RLHFlow\u002FRLHF-Reward-Modeling","RLHF-Reward-Modeling","Recipes to train reward model for RLHF.","RLHF-Reward-Modeling 是一套专为训练大语言模型奖励模型（Reward Model）设计的开源工具集，旨在优化基于人类反馈的强化学习（RLHF）流程。它核心解决了如何准确量化模型输出质量、避免“奖励黑客”以及消除生成长度偏差等关键难题，为对齐人类偏好提供可靠信号。\n\n这套工具非常适合 AI 研究人员和开发者使用，尤其是那些希望复现前沿成果或构建自定义对齐流程的团队。其独特亮点在于提供了多样化的建模方案：不仅包含经典的 Bradley-Terry 模型，还创新性地引入了生成式配对偏好模型（Pairwise Preference Model），利用模型的下一个词预测能力直接判断优劣；更推出了多目标混合专家模型 ArmoRM，曾在 RewardBench 榜单上斩获 8B 
参数组别第一名。此外，项目还涵盖了过程监督与结果监督奖励模型（PRM\u002FORM）的训练代码，甚至最新集成了基于决策树的可解释性奖励模型，帮助开发者深入理解模型的偏好逻辑。无论是学术研究还是工程落地，RLHF-Reward-Modeling 都提供了详尽的代码、数据与超参数配置，让高质量的奖励模型训练变得可复","RLHF-Reward-Modeling 是一套专为训练大语言模型奖励模型（Reward Model）设计的开源工具集，旨在优化基于人类反馈的强化学习（RLHF）流程。它核心解决了如何准确量化模型输出质量、避免“奖励黑客”以及消除生成长度偏差等关键难题，为对齐人类偏好提供可靠信号。\n\n这套工具非常适合 AI 研究人员和开发者使用，尤其是那些希望复现前沿成果或构建自定义对齐流程的团队。其独特亮点在于提供了多样化的建模方案：不仅包含经典的 Bradley-Terry 模型，还创新性地引入了生成式配对偏好模型（Pairwise Preference Model），利用模型的下一个词预测能力直接判断优劣；更推出了多目标混合专家模型 ArmoRM，曾在 RewardBench 榜单上斩获 8B 参数组别第一名。此外，项目还涵盖了过程监督与结果监督奖励模型（PRM\u002FORM）的训练代码，甚至最新集成了基于决策树的可解释性奖励模型，帮助开发者深入理解模型的偏好逻辑。无论是学术研究还是工程落地，RLHF-Reward-Modeling 都提供了详尽的代码、数据与超参数配置，让高质量的奖励模型训练变得可复现且高效。","# RLHF-Reward-Modeling\n\n## Structure \n\nThe initial release of this project focuses on the Bradley-Terry reward modeling and pairwise preference model. Since then, we have included more advanced techniques to construct a preference model. The structure of this project is \n\n- [`bradley-terry-rm`](.\u002Fbradley-terry-rm\u002F) to train the classic Bradley-Terry reward model;\n- [`pair-pm`](.\u002Fpair-pm\u002F) to train the pairwise preference model, which takes a prompt and **two responses** as the input and directly predicts the probability of the first response being preferred. We formulate the problem as a chat between the user and the model to leverage the next-token prediction ability of the model, which is referred to as generative RM in the subsequent literature.\n\t- [`SSRM`](.\u002Fpair-pm\u002FSRRM\u002F): the code of the paper [Semi-Supervised Reward Modeling via Iterative Self-Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06903)\n \t- [`RRM`](.\u002Fpair-pm\u002FRRM\u002F): to leverage causal inference to augment the preference dataset and mitigate the reward hacking. 
See https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13156v1 \t\n- [`armo-rm`](.\u002Farmo-rm\u002F) to train the ArmoRM, which starts with a multi-objective reward model whose reward vector is aggregated by a mixture-of-experts approach in a context-dependent way. See our technical report [[ArmoRM] Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12845) for details.\n- [`odin-rm`](.\u002Fodin\u002F) to disentangle reward modeling from length bias. See https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.07319\n- [`math-rm`](.\u002Fmath-rm\u002F): the code to train process-supervised reward models (PRM) and outcome-supervised reward models (ORM) using next-token prediction. We open-source the data, code, hyperparameters, and models for a robust recipe that is easy to reproduce.\n- [`decision_tree`](.\u002Fdecision_tree\u002F): the code to use and train decision-tree reward models. See [Interpreting Language Model Preferences Through the Lens of Decision Trees](https:\u002F\u002Frlhflow.github.io\u002Fposts\u002F2025-01-22-decision-tree-reward-model\u002F) for technical details.\n\n## News\n🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥\n\n🚀 **[Jan 2025]** Decision-tree reward model training code is released under the `decision_tree\u002F` folder! 
[Decision-Tree-Reward-Gemma-2-27B](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FDecision-Tree-Reward-Gemma-2-27B) achieves the new state-of-the-art score (95.4%) on RewardBench!\n\n🚀 **[Nov 2024]** PRM and ORM training codes are released under the `math-rm\u002F` folder!\n\n🚀 **[Sep 2024]** ArmoRM training code is released under the `armo-rm\u002F` folder!\n\n🚀 **[Sep 2024]** Code for [Semi-Supervised Reward Modeling via Iterative Self-Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06903) is released under the `pair-pm\u002F` folder\n\n🚀 **[Jun 2024] Our [ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1) is the Rank #1 8B model on RewardBench!** \n\n🚀 **[May 2024] The top-3 open-source 8B reward models on RewardBench ([ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1), [Pair Pref. Model](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B), [BT RM](https:\u002F\u002Fhuggingface.co\u002FsfairXC\u002FFsfairX-LLaMA3-RM-v0.1)) are all trained with this repo!**\n\n🚀 **[May 2024] The [pairwise preference model](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B) training code is available (`pair-pm\u002F`)!**\n\n🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥\n\n+ **Tech Report**\n  + [RLHF Workflow: From Reward Modeling to Online RLHF](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07863)\n  + [[ArmoRM] Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12845)\n  + [Semi-Supervised Reward Modeling via Iterative Self-Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06903)\n+ **Models**:\n  + Absolute-Rating Multi-Objective Reward Model (ArmoRM): [ArmoRM-Llama3-8B-v0.1](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1)\n  + Pairwise Preference Reward Model: 
[pair-preference-model-LLaMA3-8B](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B) \n  + Bradley-Terry Reward Model: [FsfairX-LLaMA3-RM-v0.1](https:\u002F\u002Fhuggingface.co\u002FsfairXC\u002FFsfairX-LLaMA3-RM-v0.1)\n\n+ **Architectures**\n  + Bradley-Terry (BT) Reward Model and Pairwise Preference Model\n    \u003Cimg width=\"625\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_db7ca5951626.png\">\n  + Absolute-Rating Multi-Objective Reward Model (ArmoRM)\n    \u003Cimg width=\"625\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_a5c71a7be661.png\">\n\n+ **[RewardBench LeaderBoard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench)**\n\n  ![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_ac1ab31a3ecf.png)\n\n   | Model  | Base Model                                                             | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |\n  |:--------------------------------------------------------------------------------|:-----------------------------------------------------------------------|:-----:|:-----|:----------|:-------|:----------|:-----------------------|:------------------------|\n    | [ArmoRM-Llama3-8B-v0.1](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1) (Ours)                                                           | Llama-3 8B | ArmoRM + MoE | **89.0** | 96.9     | **76.8**  | **92.2** | **97.3**  | 74.3                    |\n    | Cohere May 2024                                                                 | Unknown | Unknown  | 88.2     | 96.4     | 71.3      | **92.7** | **97.7**  | **78.2**                |\n    | [pair-preference-model](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B) (Ours)| Llama-3 8B | 
[SliC-HF](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10425) | 85.7 | 98.3 | 65.8 | 89.7 | 94.7 | 74.6 |\n    | GPT-4 Turbo (0125 version)                                                      | GPT-4 Turbo | LLM-as-a-Judge | 84.3     | 95.3     | 74.3      | 87.2     | 86.9      | 70.9                    |\n    | [FsfairX-LLaMA3-RM-v0.1](https:\u002F\u002Fhuggingface.co\u002FsfairXC\u002FFsfairX-LLaMA3-RM-v0.1) (Ours) | Llama-3 8B | Bradley-Terry | 83.6     | **99.4** | 65.1      | 87.8     | 86.4      | 74.9                    |\n    | [Starling-RM-34B](https:\u002F\u002Fhuggingface.co\u002FNexusflow\u002FStarling-RM-34B)             | Yi-34B | Bradley-Terry | 81.4     | 96.9     | 57.2      | 88.2     | 88.5      | 71.4                    |\n\n\n+ **Evaluation Results** (from [RLHF Workflow](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07863))\n  \n  \u003Cimg width=\"625\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_7a51bae9201f.png\">\n\nTL;DR: this is a repo for training the reward\u002Fpreference model for [DRL-based RLHF (PPO)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.02155.pdf), [Iterative SFT (Rejection sampling fine-tuning)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06767v4.pdf), and [iterative DPO](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11456.pdf).\n\n- 4 x A40 48G: we can train Gemma-7B-it with max_length 4096 using DeepSpeed ZeRO-3 + gradient checkpointing;\n- 4 x A100 80G: we can train Gemma-7B-it with max_length 4096 using gradient checkpointing;\n- The resulting reward models achieve **SOTA performance** among open-source RMs on the [RewardBench](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench) leaderboard.\n- Check out our [blog post](https:\u002F\u002Fefficient-unicorn-451.notion.site\u002FReward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0)!\n\n\n## Installation instructions\n\nIt is recommended to create separate environments for the 
Bradley-Terry reward model and pairwise preference model. The installation instructions are provided in the corresponding folders.\n\n\n## Dataset Preparation\nThe dataset should be preprocessed into the standard format, where each sample consists of two conversations, 'chosen' and 'rejected', that share the same prompt. Here is an example of the rejected sample in a comparison pair. \n\n```python\n[\n{ \"content\": \"Please identify the top 5 rarest animals in the world.\", \"role\": \"user\" },\n{ \"content\": \"Do you mean animals that are really rare, or rare relative to the size of the human population?\", \"role\": \"assistant\" },\n{ \"content\": \"The ones that are really rare.\", \"role\": \"user\" },\n{ \"content\": \"Alright, here’s what I found:\", \"role\": \"assistant\" }, \n]\n```\n\nWe preprocess many open-source preference datasets into the standard format and upload them to the Hugging Face Hub. You can find them [HERE](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FRLHFlow\u002Fstandard-format-preference-dataset-662eec0252e194d5d40c252a). We have also found the following mixtures of preference datasets useful.\n\n- [hendrydong\u002Fpreference_700K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhendrydong\u002Fpreference_700K)\n- [RLHFlow\u002FUltraFeedback-preference-standard](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRLHFlow\u002FUltraFeedback-preference-standard) \nwhere the details can be found in the dataset card. 
\n\n## Evaluation Results\n\nYou can evaluate the resulting reward model on the [RewardBench dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench) with the following command.\n\n```shell\nCUDA_VISIBLE_DEVICES=1 python .\u002Fuseful_code\u002Feval_reward_bench_bt.py --reward_name_or_path .\u002Fmodels\u002Fgemma_2b_mixture2_last_checkpoint --record_dir .\u002Fbench_mark_eval.txt\n```\n\n\n\n## To Do\n\n- [x]  Bradley-Terry Reward Model\n- [x]  Preference model\n- [x]  Multi-Objective Reward Model\n- [ ]  LLM-as-a-judge\n\nOur models and code have contributed to many academic research projects, e.g.,\n\n1. Xu Zhangchen, et al. \"Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing.\"\n2. Chen, Lichang, et al. \"OPTune: Efficient Online Preference Tuning.\"\n3. Xie, Tengyang, et al. \"Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF.\" arXiv preprint arXiv:2405.21046 (2024).\n4. Zhong, Han, et al. \"Dpo meets ppo: Reinforced token optimization for rlhf.\" arXiv preprint arXiv:2404.18922 (2024).\n5. Zheng, Chujie, et al. \"Weak-to-strong extrapolation expedites alignment.\" arXiv preprint arXiv:2404.16792 (2024).\n6. Ye, Chenlu, et al. \"A theoretical analysis of Nash learning from human feedback under general kl-regularized preference.\" arXiv preprint arXiv:2402.07314 (2024).\n7. Chen, Ruijun, et al. \"Self-Evolution Fine-Tuning for Policy Optimization\"\n8. Li Bolian, et al., \"Cascade Reward Sampling for Efficient Decoding-Time Alignment\"\n9. Zhang, Yuheng, et al. \"Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning\"\n10. Lin Tzu-Han, et al., \"DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging\"\n11. Yang Rui, et al., \"Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs\"\n12. 
Junsoo Park, et al., \"OffsetBias: Leveraging Debiased Data for Tuning Evaluators\"\n13. Meng Yu, et al., \"SimPO: Simple Preference Optimization with a Reference-Free Reward\"\n14. Song Yifan, et al., \"The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism\"\n15. Wenxuan Zhou et al., \"WPO: Enhancing RLHF with Weighted Preference Optimization\"\n16. Han Xia et al., \"Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data\"\n17. Wang Haoyu et al., \"Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation\"\n18. He Yifei et al., \"Semi-Supervised Reward Modeling via Iterative Self-Training\"\n19. Tao leitian et al., \"Your Weak LLM is Secretly a Strong Teacher for Alignment\"\n20. Guijin Son et al., \"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do\"\n21. Nicolai Dorka et al., \"Quantile Regression for Distributional Reward Models in RLHF\"\n22. Zhaolin Gao et al., \"Rebel: Reinforcement learning via regressing relative rewards\"\n\n\n## Contributors\n\nThanks to all of our contributors to date (Made with [contrib.rocks](https:\u002F\u002Fcontrib.rocks)).\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_b4730695c23c.png\" \u002F>\n\u003C\u002Fa>\n\n\n## Citation\n\nIf you find the content of this repo useful in your work, please consider citing:\n\n```bibtex\n@article{dong2024rlhf,\n  title={RLHF Workflow: From Reward Modeling to Online RLHF},\n  author={Dong, Hanze and Xiong, Wei and Pang, Bo and Wang, Haoxiang and Zhao, Han and Zhou, Yingbo and Jiang, Nan and Sahoo, Doyen and Xiong, Caiming and Zhang, Tong},\n  journal={arXiv preprint arXiv:2405.07863},\n  year={2024}\n}\n\n@inproceedings{ArmoRM,\n      title={Interpretable Preferences via 
Multi-Objective Reward Modeling and Mixture-of-Experts}, \n      author={Haoxiang Wang and Wei Xiong and Tengyang Xie and Han Zhao and Tong Zhang},\n      booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},\n      year={2024}\n}\n\n@article{xiong2024iterative,\n      title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint}, \n      author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},\n      year={2024},\n      journal={ICML}\n}\n```\n","# RLHF-奖励建模\n\n## 结构\n\n该项目的初始版本专注于布拉德利-特里奖励建模和成对偏好模型。此后，我们引入了更多先进的技术来构建偏好模型。项目的结构如下：\n\n- [`bradley-terry-rm`](.\u002Fbradley-terry-rm\u002F) 用于训练经典的布拉德利-特里奖励模型；\n- [`pair-pm`](.\u002Fpair-pm\u002F) 用于训练成对偏好模型，该模型以一个提示和**两个回答**作为输入，直接预测第一个回答更受偏好的概率。我们将这一问题形式化为用户与模型之间的对话，以利用模型的下一个 token 预测能力，这种做法在后续文献中被称为生成式奖励模型。\n\t- [`SSRM`](.\u002Fpair-pm\u002FSRRM\u002F)：论文《通过迭代自训练的半监督奖励建模》（[arXiv:2409.06903](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06903)）的代码实现；\n \t- [`RRM`](.\u002Fpair-pm\u002FRRM\u002F)：利用因果推断扩充偏好数据集，并缓解奖励欺骗问题。详情参见 [arXiv:2409.13156v1](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.13156v1)。\n- [`armo-rm`](.\u002Farmo-rm\u002F) 用于训练 ArmoRM，该模型以多目标奖励模型为基础，通过上下文相关的专家混合方法对奖励向量进行聚合。具体细节请参阅我们的技术报告《[ArmoRM] 基于多目标奖励建模与专家混合的可解释偏好》（[arXiv:2406.12845](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12845)）。\n- [`odin-rm`](.\u002Fodin\u002F) 用于将奖励建模从长度偏差中解耦。详情参见 [arXiv:2402.07319](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.07319)。\n- [`math-rm`](.\u002Fmath-rm\u002F)：用于使用下一个 token 预测来训练过程监督奖励（PRM）和结果监督奖励（ORM）的代码。我们开源了数据、代码、超参数和模型，提供一套易于复现且稳健的方案。\n- [`decision_tree`](.\u002Fdecision_tree\u002F)：用于使用和训练决策树奖励模型的代码。技术细节请参阅文章《通过决策树视角解读语言模型偏好》（[rlhflow.github.io\u002Fposts\u002F2025-01-22-decision-tree-reward-model\u002F](https:\u002F\u002Frlhflow.github.io\u002Fposts\u002F2025-01-22-decision-tree-reward-model\u002F)）。\n\n## 新闻\n🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥\n\n🚀 **[2025年1月]** 
决策树奖励模型训练代码已在`decision_tree\u002F`文件夹下发布！[Decision-Tree-Reward-Gemma-2-27B](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FDecision-Tree-Reward-Gemma-2-27B) 在RewardBench上取得了新的最先进分数（95.4%）！\n\n🚀 **[2024年11月]** PRM和ORM的训练代码已在`math-rm\u002F`文件夹下发布！\n\n🚀 **[2024年9月]** ArmoRM的训练代码已在`armo-rm\u002F`文件夹下发布！\n\n🚀 **[2024年9月]** 关于[通过迭代自训练进行半监督奖励建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06903)的代码已在`pair-pm\u002F`文件夹下发布。\n\n🚀 **[2024年6月] 我们的[ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1) 是RewardBench上排名第一的8B模型！**\n\n🚀 **[2024年5月] RewardBench上排名前三的开源8B奖励模型（[ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1), [Pair Pref. Model](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B), [BT RM](https:\u002F\u002Fhuggingface.co\u002FsfairXC\u002FFsfairX-LLaMA3-RM-v0.1)）均使用本仓库训练而成！**\n\n🚀 **[2024年5月] [成对偏好模型](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B) 的训练代码现已开放（`pair-pm\u002F`）！**\n\n🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥\n\n+ **技术报告**\n  + [RLHF工作流：从奖励建模到在线RLHF](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07863)\n  + [[ArmoRM] 通过多目标奖励建模和专家混合实现可解释的偏好](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12845)\n  + [通过迭代自训练进行半监督奖励建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06903)\n+ **模型**：\n  + 绝对评分多目标奖励模型（ArmoRM）：[ArmoRM-Llama3-8B-v0.1](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1)\n  + 成对偏好奖励模型：[pair-preference-model-LLaMA3-8B](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B)\n  + 布拉德利-特里奖励模型：[FsfairX-LLaMA3-RM-v0.1](https:\u002F\u002Fhuggingface.co\u002FsfairXC\u002FFsfairX-LLaMA3-RM-v0.1)\n\n+ **架构**\n  + 布拉德利-特里（BT）奖励模型和成对偏好模型\n    \u003Cimg width=\"625\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_db7ca5951626.png\">\n  + 绝对评分多目标奖励模型（ArmoRM）\n    \u003Cimg width=\"625\" alt=\"image\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_a5c71a7be661.png\">\n\n+ **[RewardBench排行榜](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench)**\n\n  ![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_ac1ab31a3ecf.png)\n\n   | 模型  | 基础模型                                                             | 方法 | 分数 | 对话 | 困难题对话 | 安全性 | 理性思考 | 先验集合（0.5权重） |\n  |:--------------------------------------------------------------------------------|:-----------------------------------------------------------------------|:-----:|:-----|:----------|:-------|:----------|:-----------------------|:------------------------|\n    | [ArmoRM-Llama3-8B-v0.1](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1) （我们）                                                           | Llama-3 8B | ArmoRM + MoE | **89.0** | 96.9     | **76.8**  | **92.2** | **97.3**  | 74.3                    |\n    | Cohere 2024年5月                                                                 | 未知 | 未知  | 88.2     | 96.4     | 71.3      | **92.7** | **97.7**  | **78.2**                |\n    | [pair-preference-model](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002Fpair-preference-model-LLaMA3-8B) （我们）| Llama-3 8B | [SliC-HF](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10425) | 85.7 | 98.3 | 65.8 | 89.7 | 94.7 | 74.6 |\n    | GPT-4 Turbo（0125版本）                                                      | GPT-4 Turbo | LLM作为裁判 | 84.3     | 95.3     | 74.3      | 87.2     | 86.9      | 70.9                    |\n    | [FsfairX-LLaMA3-RM-v0.1](https:\u002F\u002Fhuggingface.co\u002FsfairXC\u002FFsfairX-LLaMA3-RM-v0.1) （我们） | Llama-3 8B | 布拉德利-特里 | 83.6     | **99.4** | 65.1      | 87.8     | 86.4      | 74.9                    |\n    | [Starling-RM-34B](https:\u002F\u002Fhuggingface.co\u002FNexusflow\u002FStarling-RM-34B)             | Yi-34B | 布拉德利-特里 | 81.4     | 96.9     | 57.2 
     | 88.2     | 88.5      | 71.4                    |\n\n\n+ **评估结果**（摘自[RLHF工作流](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07863)）\n  \n  \u003Cimg width=\"625\" alt=\"image\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_7a51bae9201f.png\">\n\n简而言之：这是一个用于训练基于DRL的RLHF（PPO）[1]、迭代SFT（拒绝采样微调）[2]以及迭代DPO[3]的奖励\u002F偏好模型的仓库。\n\n- 4块A40 48G显卡：我们可以使用Deepspeed Zero-3 + 梯度检查点，以max_length 4096训练Gemma-7B-it；\n- 4块A100 80G显卡：我们可以使用梯度检查点训练Gemma-7B-it，max_length同样为4096；\n- 训练得到的奖励模型在[RewardBench](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench)排行榜上达到了开源RMs的**最先进水平**。\n- 欢迎查看我们的[博客文章](https:\u002F\u002Fefficient-unicorn-451.notion.site\u002FReward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0)！\n\n\n## 安装说明\n\n建议为布拉德利-特里奖励模型和成对偏好模型分别创建独立的环境。安装说明已在相应文件夹中提供。\n\n\n## 数据集准备\n数据集应按标准格式进行预处理，其中每个样本包含两个对话“选择”和“拒绝”，且它们共享相同的提示。以下是对比对中的拒绝样本示例。\n\n```python\n[\n{ \"content\": \"请列出世界上最稀有的五种动物。\", \"role\": \"user\" },\n{ \"content\": \"您是指真正稀有的动物，还是相对于人类人口数量而言稀有的动物呢？\", \"role\": \"assistant\" },\n{ \"content\": \"是真正稀有的那些。\", \"role\": \"user\" },\n{ \"content\": \"好的，这是我找到的：\", \"role\": \"assistant\" }, \n]\n```\n\n我们将许多开源偏好数据集预处理为标准格式，并上传至Hugging Face Hub。您可以在[这里](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FRLHFlow\u002Fstandard-format-preference-dataset-662eec0252e194d5d40c252a)找到这些数据集。此外，我们还搜索并发现以下一些混合偏好数据集非常有用。\n\n- [hendrydong\u002Fpreference_700K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhendrydong\u002Fpreference_700K)\n- [RLHFlow\u002FUltraFeedback-preference-standard](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRLHFlow\u002FUltraFeedback-preference-standard)，详细信息可在数据集卡片中查阅。\n\n## 评估结果\n\n您可以使用 [benchmark](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench) 提供的数据集，通过以下命令对生成的奖励模型进行评估。\n\n```shell\nCUDA_VISIBLE_DEVICES=1 python .\u002Fuseful_code\u002Feval_reward_bench_bt.py --reward_name_or_path 
.\u002Fmodels\u002Fgemma_2b_mixture2_last_checkpoint --record_dir .\u002Fbench_mark_eval.txt\n```\n\n\n\n## 待办事项\n\n- [x] 布拉德利-特里奖励模型\n- [x] 偏好模型\n- [x] 多目标奖励模型\n- [ ] LLM作为评判者\n\n我们的模型和代码已应用于多项学术研究项目，例如：\n\n1. Xu Zhangchen 等人：“Magpie：从零开始，仅通过提示对齐的LLM合成对齐数据。”\n2. Chen, Lichang 等人：“OPTune：高效的在线偏好调优。”\n3. Xie, Tengyang 等人：“探索性偏好优化：利用隐式Q*-近似实现样本高效的RLHF。” arXiv预印本 arXiv:2405.21046 (2024)。\n4. Zhong, Han 等人：“DPO遇上PPO：用于RLHF的强化标记优化。” arXiv预印本 arXiv:2404.18922 (2024)。\n5. Zheng, Chujie 等人：“弱到强的外推加速了对齐过程。” arXiv预印本 arXiv:2404.16792 (2024)。\n6. Ye, Chenlu 等人：“在一般KL正则化偏好下，基于人类反馈的纳什学习的理论分析。” arXiv预印本 arXiv:2402.07314 (2024)。\n7. Chen, Ruijun 等人：“用于策略优化的自进化微调”\n8. Li Bolian 等人：“用于高效解码时对齐的级联奖励采样”\n9. Zhang, Yuheng 等人：“迭代纳什策略优化：通过无悔学习将LLM与一般偏好对齐”\n10. Lin Tzu-Han 等人：“DogeRM：通过模型融合为奖励模型注入领域知识”\n11. Yang Rui 等人：“隐藏状态正则化使LLM能够学习通用奖励模型”\n12. Junsoo Park 等人：“OffsetBias：利用去偏数据调优评估者”\n13. Meng Yu 等人：“SimPO：无需参考奖励的简单偏好优化”\n14. Song Yifan 等人：“善、恶与贪婪：LLM的评估不应忽视非确定性”\n15. Wenxuan Zhou 等人：“WPO：通过加权偏好优化增强RLHF”\n16. Han Xia 等人：“逆Q*：无需偏好数据即可对大型语言模型进行对齐的标记级强化学习”\n17. Wang Haoyu 等人：“通过生成不安全解码路径探测大型语言模型的安全响应边界”\n18. He Yifei 等人：“通过迭代自训练进行半监督奖励建模”\n19. Tao leitian 等人：“你的弱LLM其实是对齐的强大教师”\n20. Guijin Son 等人：“LLM作为评判者与奖励模型：它们能做什么，不能做什么”\n21. Nicolai Dorka 等人：“在RLHF中使用分位数回归构建分布型奖励模型”\n22. 
Zhaolin Gao 等人：“Rebel：通过回归相对奖励进行强化学习”\n\n\n## 贡献者\n\n感谢迄今为止的所有贡献者（由 [contrib.rocks](https:\u002F\u002Fcontrib.rocks) 制作）。\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_readme_b4730695c23c.png\" \u002F>\n\u003C\u002Fa>\n\n\n## 引用\n\n如果您在工作中发现本仓库的内容有所帮助，请考虑引用：\n\n```bibtex\n@article{dong2024rlhf,\n  title={RLHF工作流：从奖励建模到在线RLHF},\n  author={Dong, Hanze and Xiong, Wei and Pang, Bo and Wang, Haoxiang and Zhao, Han and Zhou, Yingbo and Jiang, Nan and Sahoo, Doyen and Xiong, Caiming and Zhang, Tong},\n  journal={arXiv预印本 arXiv:2405.07863},\n  year={2024}\n}\n\n@inproceedings{ArmoRM,\n      title={通过多目标奖励建模和专家混合实现可解释的偏好}, \n      author={Haoxiang Wang and Wei Xiong and Tengyang Xie and Han Zhao and Tong Zhang},\n      booktitle={2024年自然语言处理经验方法会议},\n      year={2024}\n}\n\n@article{xiong2024iterative,\n      title={基于人类反馈的迭代偏好学习：在KL约束下弥合RLHF的理论与实践}, \n      author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},\n      year={2024},\n      journal={ICML}\n}\n```","# RLHF-Reward-Modeling 快速上手指南\n\n本指南帮助开发者快速部署并使用 **RLHF-Reward-Modeling** 项目，用于训练基于 Bradley-Terry、成对偏好（Pairwise Preference）、多目标（ArmoRM）等架构的奖励模型。该仓库产出的模型在 RewardBench 榜单上表现优异，适用于 PPO、DPO 及迭代式 SFT 等对齐流程。\n\n## 环境准备\n\n### 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **GPU**: \n    *   训练 Gemma-7B (max_length 4096): 推荐 4x A100 80G 或 4x A40 48G。\n    *   较小模型或推理：单卡 24G+ 显存即可尝试。\n*   **软件依赖**:\n    *   Python 3.8+\n    *   CUDA 11.8+\n    *   PyTorch 2.0+\n    *   DeepSpeed (用于分布式训练)\n\n### 前置依赖\n确保已安装 `git` 和 `conda` (或 `venv`)。由于本项目包含多种不同架构的模型（如 BT, Pair-PM, ArmoRM），**强烈建议为不同的子模块创建独立的虚拟环境**，以避免依赖冲突。\n\n## 安装步骤\n\n本项目采用模块化结构，不同模型的训练代码位于不同子目录中。请根据你要训练的模型类型选择对应的安装方式。\n\n### 1. 
克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling.git\ncd RLHF-Reward-Modeling\n```\n\n### 2. 安装特定模块依赖\n由于 README 指出各子文件夹包含具体的安装说明，以下是通用安装逻辑（以 Bradley-Terry 和 Pairwise 为例）：\n\n#### 方案 A: 训练经典 Bradley-Terry 奖励模型\n```bash\n# 创建独立环境\nconda create -n bt-rm python=3.10 -y\nconda activate bt-rm\n\n# 进入对应目录并安装依赖 (具体 requirements 请参考子目录)\ncd bradley-terry-rm\npip install -r requirements.txt\n# 若无具体要求文件，通常需安装基础库：\n# pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n# pip install transformers datasets accelerate deepspeed\n```\n\n#### 方案 B: 训练成对偏好模型 (Pairwise Preference Model \u002F Generative RM)\n```bash\n# 创建独立环境\nconda create -n pair-pm python=3.10 -y\nconda activate pair-pm\n\n# 进入对应目录\ncd ..\u002Fpair-pm\npip install -r requirements.txt\n```\n\n#### 方案 C: 训练 ArmoRM (多目标奖励模型)\n```bash\nconda create -n armo-rm python=3.10 -y\nconda activate armo-rm\ncd ..\u002Farmo-rm\npip install -r requirements.txt\n```\n\n> **提示**: 如果下载依赖较慢，可使用国内镜像源加速：\n> `pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n## 基本使用\n\n### 1. 
Prepare the data\nTraining requires a preference dataset in the standard format. Each sample contains a prompt and two responses (`chosen` and `rejected`).\n\n**Data format example (JSON):**\n```json\n[\n  {\n    \"prompt\": [\n      {\"content\": \"Please identify the top 5 rarest animals in the world.\", \"role\": \"user\"}\n    ],\n    \"chosen\": [\n      {\"content\": \"Here are the top 5...\", \"role\": \"assistant\"}\n    ],\n    \"rejected\": [\n      {\"content\": \"Do you mean animals that are really rare...?\", \"role\": \"assistant\"},\n      {\"content\": \"Alright, here's what I found:\", \"role\": \"assistant\"}\n    ]\n  }\n]\n```\n\n**Recommended datasets**:\nThe project authors have preprocessed several open-source datasets and uploaded them to Hugging Face, where they can be loaded directly:\n*   [RLHFlow standard-format preference dataset collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FRLHFlow\u002Fstandard-format-preference-dataset-662eec0252e194d5d40c252a)\n*   `hendrydong\u002Fpreference_700K`\n*   `RLHFlow\u002FUltraFeedback-preference-standard`\n\n### 2. Start training\nEnter the relevant subdirectory and launch training with the scripts it provides. `deepspeed` is typically used for multi-GPU training.\n\n**Example command (Bradley-Terry mode):**\n```bash\n# Assumes train.py and ds_config_zero3.json exist in the current subdirectory\ndeepspeed --num_gpus=4 train.py \\\n    --model_name_or_path meta-llama\u002FMeta-Llama-3-8B \\\n    --dataset_path RLHFlow\u002FUltraFeedback-preference-standard \\\n    --output_dir .\u002Fmodels\u002Fbt-llama3-8b \\\n    --per_device_train_batch_size 4 \\\n    --gradient_accumulation_steps 4 \\\n    --learning_rate 1e-5 \\\n    --num_train_epochs 1 \\\n    --deepspeed ds_config_zero3.json\n```\n\n*(Note: the authoritative parameters are those in each subdirectory's `README.md` or the script's help output.)*\n\n### 3. Evaluate the model\nAfter training, the model can be evaluated on the `RewardBench` dataset. The project provides an evaluation script.\n\n**Example evaluation command:**\n```bash\nCUDA_VISIBLE_DEVICES=1 python .\u002Fuseful_code\u002Feval_reward_bench_bt.py \\\n    --reward_name_or_path .\u002Fmodels\u002Fgemma_2b_mixture2_last_checkpoint \\\n    --record_dir .\u002Fbench_mark_eval.txt\n```\n\n### 4. 
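Use a pretrained model directly\nIf you do not need to retrain, the project's released models can be loaded directly from Hugging Face:\n\n```python\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\n\n# Load ArmoRM (listed as rank #1 in the 8B class when this guide was written)\nmodel_name = \"RLHFlow\u002FArmoRM-Llama3-8B-v0.1\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)\n\n# For the inference logic, see the model's technical report or model card\n```","A medical AI startup is building a patient-facing consultation assistant and needs reinforcement learning from human feedback (RLHF) to make the model's answers both medically rigorous and compassionate.\n\n### Without RLHF-Reward-Modeling\n- **A single, heavily biased reward signal**: the team can only use a basic Bradley-Terry model, so the policy drifts toward long, padded answers that game the score, and concise, accurate answers go unrecognized.\n- **No multi-dimensional evaluation**: objectives such as medical accuracy, empathetic tone, and safety are hard to balance at once; tuning one metric tends to sacrifice another.\n- **Poor use of labeled data**: physician annotations are expensive, and without a semi-supervised self-training mechanism, large volumes of unlabeled conversations sit idle and iteration is slow.\n- **Opaque, hard-to-explain decisions**: when the model makes a wrong preference judgment, developers cannot trace which factor caused it, so debugging is guesswork.\n\n### With RLHF-Reward-Modeling\n- **Length bias mitigated**: the `odin-rm` module decouples response length from the reward score, so short, correct diagnostic advice is scored on its merits.\n- **Dynamic multi-objective aggregation**: with the multi-objective mixture-of-experts architecture in `armo-rm`, the reward model weighs medical facts against communication style according to context, favoring answers with higher overall quality.\n- **More value from data**: using the semi-supervised self-training technique (SSRM) in `pair-pm`, the team draws on large volumes of unlabeled logs to improve the reward model's generalization and reduce reliance on manual annotation.\n- **Transparent, traceable decisions**: with the `decision_tree` reward model, the team can see whether a preference judgment rested on a concrete rule such as “includes a contraindication warning” or “warm tone”, which lowers debugging cost considerably.\n\nBy covering debiasing, multi-objective optimization, and interpretability in one toolkit, RLHF-Reward-Modeling cut the assistant's alignment training cycle in half in this scenario and markedly improved safety and user satisfaction at launch.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRLHFlow_RLHF-Reward-Modeling_ac1ab31a.png","RLHFlow","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FRLHFlow_915790d8.jpg","Code for the Workflow of Reinforcement Learning from Human Feedback (RLHF)",null,"rlhflow.ai@gmail.com","https:\u002F\u002Fgithub.com\u002FRLHFlow",[80],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,1529,109,"2026-04-18T18:33:42","Apache-2.0","Not specified","An NVIDIA GPU is required. Example configurations: 4x A40 (48GB) can train a 7B model (max_length 4096, DeepSpeed ZeRO-3 + gradient checkpointing); 4x A100 (80GB) can train a 7B model (max_length 4096, gradient checkpointing).",{"notes":91,"python":88,"dependencies":92},"Separate environments are recommended for the Bradley-Terry reward model and the pairwise preference model. Installation instructions are in each subproject folder. Training uses 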
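DeepSpeed ZeRO-3 and gradient checkpointing to reduce GPU memory use. Datasets must be preprocessed into the standard format (prompt, chosen response, and rejected response).",[93,94,95,96],"deepspeed","transformers","torch","accelerate",[44,14],[99,100,101,102],"llm","rlhf","reward-models","llama3","2026-03-27T02:49:30.150509","2026-04-20T10:24:15.268808",[106,111,116,121,126,131,136,141],{"id":107,"question_zh":108,"answer_zh":109,"source_url":110},44528,"Why does the ArmoRM-Llama3-8B-v0.1 tokenizer differ from the Meta-Llama-3-8B-Instruct tokenizer? Does this affect usage?","The difference arises mainly because the earlier tokenizer version did not support certain features (such as \"ignore_merges\"), and because ArmoRM adds a dedicated padding token (pad_token: '[PAD]') to avoid the problems that padding with the eos token causes in multi-turn conversations. The difference normally does not affect the encode-decode round trip (the decoded text is identical); it may have a minor effect on a very small number of specific tokens, and the model works normally in typical use.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F32",{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},44529,"When training a multi-objective reward model on the HelpSteer dataset alone, is a dynamic gating network needed? How should the weights be set?","If you train on HelpSteer alone, the prompts are independent and identically distributed (i.i.d.), so a dynamic gating network is usually unnecessary. You can follow the HelpSteer2 approach and use fixed gating weights. Dynamic gating, or the MoE aggregation technique, pays off mainly when the preference datasets are diverse.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F27",{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},44530,"Has the ArmoRM training code been open-sourced? Where can I find it?","Yes, the training code has been released. The maintainers updated the code in the repository and also updated the HuggingFace page and the repository README with usage examples. You can view the code directly in the GitHub repository.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F28",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},44531,"How can I fine-tune the ARMO model on a custom dataset that contains only pairwise preference data (no multi-objective reward scores)?","The training code has been released and supports fine-tuning on custom datasets. For data that contains only pairwise preferences, adapt the logic of the released training code. The maintainers state that the code has been pushed; see the latest implementation in the repository for the data processing and training pipeline.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F23",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},44532,"hendrydong\u002Fpreference_700K 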
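dataset: how was it built, and is there a processing script?","The dataset was processed early on, when the project was still a course project, and the procedure was not properly documented, so no processing script has been published. Users are advised to merge the standard public datasets themselves to build similar training data.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F19",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},44533,"How do I run and evaluate the ARMO model on Reward Bench? Is there code for it?","The maintainers submitted the evaluation code directly to the Reward Bench repository. See the related Reward Bench PR (https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F135) for the evaluation commands and implementation. The HuggingFace model card and the main repository README were also updated with a usage example based on multi-objective reward aggregation.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F15",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},44534,"In UltraFeedback preprocessing, how are the 1-10 ratings linearly mapped to the 0-1 range?","The linear transform is x' = (x - 0.5) \u002F 10. It maps a raw score of 10 to 0.95 and a raw score of 1 to 0.05. To recover the original score, apply the inverse, x = 10x' + 0.5.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F57",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},44535,"In the pair-pm code example, why is `avg_prob_chosen` exactly 0.5? What does it mean?","`avg_prob_chosen` is 0.5 in the example because the code is written mainly for the evaluation setting, where responses[0] is known to be the chosen response. In general, `avg_prob_chosen` > 0.5 means the preference model considers responses[0] better, and \u003C 0.5 means it considers responses[1] better. The variable name was chosen to fit the evaluation logic.","https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FRLHF-Reward-Modeling\u002Fissues\u002F8",[]]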