[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-OpenRLHF--OpenRLHF":3,"tool-OpenRLHF--OpenRLHF":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160015,2,"2026-04-18T11:30:52",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":64,"owner_name":64,"owner_avatar_url":72,"owner_bio":73,"owner_company":74,"owner_location":74,"owner_email":75,"owner_twitter":74,"owner_website":74,"owner_url":76,"languages":77,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":10,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":107,"github_topics":108,"view_count":32,"oss_zip_url":74,"oss_zip_packed_at":74,"status":17,"created_at":117,"updated_at":118,"faqs":119,"releases":149},9208,"OpenRLHF\u002FOpenRLHF","OpenRLHF","An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ &  VLM & TIS & vLLM & Ray & Async  RL)","OpenRLHF 是一款高性能、易扩展的开源强化学习框架，专为大语言模型（LLM）及视觉 - 语言模型（VLM）的“人类反馈强化学习”（RLHF）训练而设计。它旨在解决传统 RL 训练框架在分布式环境下效率低、部署复杂以及难以支持多轮智能体交互等痛点，让研究人员和开发者能更轻松地复现前沿算法并投入生产环境。\n\n该工具特别适合 AI 研究人员、算法工程师以及希望探索大模型对齐技术的开发者使用。其核心亮点在于首创性地结合了 Ray 分布式调度与 vLLM 高速推理引擎，构建了统一的“智能体（Agent）”执行范式。这不仅大幅提升了训练吞吐量和资源利用率，还原生支持 PPO、REINFORCE++、GRPO 等多种先进算法。此外，OpenRLHF 具备强大的灵活性，能够处理从单轮奖励优化到复杂的多轮环境交互任务，甚至支持包含图像输入的多模态模型端到端训练。无论是进行基础理论研究还是构建大规模生产级应用，OpenRLHF 都提供了一个轻量且功能完备的技术底座。","\u003Cdiv align=\"center\">\n    \u003Cimg alt=\"OpenRLHF logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenRLHF_OpenRLHF_readme_aa8533b7c00c.png\" style=\"height: 140px;\" \u002F>\n\u003C\u002Fdiv>\n\u003Cdiv align=\"center\">\n\u003Cp align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fgraphs\u002Fcontributors\">\n        \u003Cimg alt=\"GitHub Contributors\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FOpenRLHF\u002FOpenRLHF\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fissues\">\n        \u003Cimg alt=\"Issues\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FOpenRLHF\u002FOpenRLHF?color=0088ff\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fdiscussions\">\n        \u003Cimg alt=\"Issues\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdiscussions\u002FOpenRLHF\u002FOpenRLHF?color=0088ff\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpulls\">\n        \u003Cimg alt=\"GitHub pull requests\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-pr\u002FOpenRLHF\u002FOpenRLHF?color=0088ff\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fstargazers\">\n        \u003Cimg alt=\"GitHub stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRLHF\u002FOpenRLHF?color=ccf\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002FOpenRLHF\u002FOpenRLHF\">\u003Cimg src=\"https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg\" alt=\"Ask DeepWiki\">\u003C\u002Fa>\n      \u003Cbr>\n      \u003Cem>Open-source \u002F Comprehensive \u002F Lightweight \u002F Easy-to-use\u003C\u002Fem>\n    \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Chr>\n\n\u003Cspan>[ English | \u003Ca href=\"README_zh.md\">中文\u003C\u002Fa> | \u003Ca href=\"README_ja.md\">日本語\u003C\u002Fa> ]\u003C\u002Fspan>\n\nOpenRLHF is **the first** high-performance, production-ready open-source RLHF framework that combines **Ray + vLLM distributed architecture** with a **unified agent-based design paradigm** for scalable and extensible reinforcement learning from human feedback.\n\n📚 **Learn More**: [Documentation](https:\u002F\u002Fopenrlhf.readthedocs.io\u002F) | [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk\u002Fedit?usp=sharing) | [Technical Report](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F393414548_OpenRLHF_An_Easy-to-use_Scalable_and_High-performance_RLHF_Framework) | [Video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1dv2jBxEQG\u002F)\n\n## 📖 Table of Contents\n\n- [🗞️ News](#news)\n- [🏗️ Architecture Foundation](#architecture-foundation-ray--vllm-distribution) - Ray + vLLM + DeepSpeed distributed infrastructure\n- [🎯 Design Paradigm](#design-paradigm-agent-based-execution) - Unified agent-based execution pipeline\n- [🚀 RL Algorithms](#state-of-the-art-rl-algorithms) - PPO, REINFORCE++, GRPO, RLOO\n- [📋 Features Overview](#comprehensive-features) - Complete RLHF pipeline capabilities\n- [🎬 Quick Start](#quick-start) - Installation and typical workflow\n- [🎓 Training Guide](#supervised-fine-tuning) - SFT, Reward Model, RL Training\n- [🎯 Single-Turn Agent](#single-turn-agent-reinforced-fine-tuning-with-custom-rewards) - Custom reward functions\n- [🤖 Multi-Turn Agent](#multi-turn-agent-complex-environment-interactions) - Complex environments\n- [🔧 Advanced Topics](#advanced-topics) - LoRA, performance tuning\n\n---\n\n\u003Ca id=\"news\">\u003C\u002Fa>\n## News\n\n\u003Cdetails>\n\u003Csummary>Show News\u003C\u002Fsummary>\n\n- [2026\u002F4] OpenRLHF 0.10 adds **Multi-Turn VLM RL** — multi-step interactions with images in both prompts and environment feedback (e.g. screenshots). Example: [vlm_multiturn_agent.py](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py)\n- [2026\u002F4] OpenRLHF 0.10 adds **VLM (Vision-Language Model) RLHF support** — train VLMs like Qwen3.5 with image inputs end-to-end. Training script: [train_vlm_math_hybrid_engine.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_vlm_math_hybrid_engine.sh)\n- [2026\u002F2] [ProRL V2](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fscaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2\u002F) uses REINFORCE++-baseline to train a state-of-the-art 1.5B reasoning model with prolonged RL training. 
Training script: [train_prorlv2_math_hybrid_engine.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_prorlv2_math_hybrid_engine.sh)\n- [2025\u002F10] [ScaleRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13786) validates the effectiveness of REINFORCE++-baseline in large-scale training scenarios. Releases [REINFORCE++ slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1stieP_3PM1z4Hq1YWR3GywFkxcHEAlstXMaS23KlGN4)\n- [2025\u002F6] [Magistral](https:\u002F\u002Fmistral.ai\u002Fstatic\u002Fresearch\u002Fmagistral.pdf) uses the method quite similar to REINFORCE++-baseline to train the reasoning models.\n- [2025\u002F5] [MARTI](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI) has been released as a fork of OpenRLHF. It is designed to train LLM-based multi-agent systems using RL, by integrating centralized multi-agent interactions with distributed policy training.\n- [2025\u002F5] OpenRLHF 0.8.0 supports async RLHF training via `--async_train` and async agent RLHF via `--agent_func_path`. See [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh) for a runnable example.\n- [2025\u002F4] Post the blog [Accelerating RLHF with vLLM, Best Practice from OpenRLHF](https:\u002F\u002Fblog.vllm.ai\u002F2025\u002F04\u002F23\u002Fopenrlhf-vllm.html)\n- [2025\u002F4] Clean OpenRLHF: Refactored the source code based on Single Controller and Unified Packing Samples\n- [2025\u002F3] The CMU [Advanced Natural Language Processing Spring 2025](https:\u002F\u002Fcmu-l3.github.io\u002Fanlp-spring2025\u002F) course uses OpenRLHF as the RLHF framework teaching case.\n- [2025\u002F2] [Logic-RL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14768) and [PRIME](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) demonstrate that REINFORCE++ is more stable in training compared to GRPO and faster than PPO.\n- [2025\u002F2] [LMM-R1](https:\u002F\u002Fgithub.com\u002FTideDra\u002Flmm-r1) is a fork of OpenRLHF, aimed at providing high-performance RL infrastructure for reproduction of DeepSeek-R1 on multimodal tasks.\n- [2025\u002F2] MIT & Microsoft proposed the [On the Emergence of Thinking in LLMs I: Searching for the Right Intuition](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.06773) using OpenRLHF\n- [2025\u002F1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason)\n- [2024\u002F12] We \"proposed\" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models).\n- [2024\u002F12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https:\u002F\u002Fhijkzzz.notion.site\u002Funraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05).\n- [2023\u002F8] OpenRLHF was open-sourced.\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Ca id=\"architecture-foundation-ray--vllm-distribution\">\u003C\u002Fa>\n## 🏗️ Architecture Foundation: Ray + vLLM Distribution\n\nOpenRLHF is **the first RLHF framework** built on Ray + vLLM distributed architecture, orchestrating multiple components across GPUs efficiently:\n\n\u003Cdiv align=\"center\">\n  \u003Cimg alt=\"OpenRLHF Architecture (Ray + vLLM)\" src=\".\u002Fdocs\u002Fopenrlhf_architecture.svg\" style=\"max-width: 100%; height: auto;\" 
\u002F>\n\u003C\u002Fdiv>\n\n### Core Infrastructure Components\n\n**Ray - Distributed Scheduler and Controller**  \nOpenRLHF leverages [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray) for efficient distributed scheduling. It separates the Actor, Reward, Reference, and Critic models across different GPUs, enabling scalable training for models up to **70B+ parameters**.\n\n**Hybrid Engine Scheduling**: All models and vLLM engines can share GPU resources—minimizing idle time and maximizing GPU utilization. This allows running full RLHF pipelines on limited hardware.\n\n**vLLM - High-Performance Inference Engine**  \nRLHF training spends **80% of the time on sample generation**. Powered by [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) with Auto Tensor Parallelism (AutoTP) and Pipeline Parallelism (PP), OpenRLHF delivers high-throughput, memory-efficient generation.\n\n**DeepSpeed - Memory-Efficient Training**  \nBuilt on [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed) ZeRO-3, [deepcompile](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed\u002Fblob\u002Fmaster\u002Fblogs\u002Fdeepcompile\u002FREADME.md), [AutoTP](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed\u002Fblob\u002Fmaster\u002Fblogs\u002Fhuggingface-tp\u002FREADME.md), and RingAttention. Enables large model training without heavyweight frameworks while working directly with HuggingFace models.\n\n**Transformers - Model Interface**  \nNative integration with HuggingFace Transformers for seamless model loading, state management, and fine-tuning of pretrained models.\n\n**NCCL \u002F CUDA IPC - High-Speed Communication**  \nEfficient inter-GPU communication for distributed training and inference.\n\n---\n\n\u003Ca id=\"design-paradigm-agent-based-execution\">\u003C\u002Fa>\n## 🎯 Design Paradigm: Agent-Based Execution\n\n**On top of the Ray distributed architecture**, OpenRLHF is **the first RLHF framework** to implement a **unified agent-based paradigm**. 
Every training run—whether standard PPO or complex multi-turn reasoning—follows a consistent agent execution pipeline.\n\n### Why Agent-Based?\n\nOpenRLHF **unifies generation and training through token-in-token-out agent execution**, ensuring perfect consistency, easy single\u002Fmulti-turn extension, and zero text-level mismatches.\n\n### Agent Architecture\n\n```\n                 ┌─────────────────────────────┐\n                 │    AgentExecutorBase        │\n                 │  (Token-in-Token-out Core)  │\n                 └─────────────────────────────┘\n                              │\n                 ┌────────────┴────────────┐\n                 ↓                         ↓\n         SingleTurnExecutor        MultiTurnExecutor\n                 │                         │\n      ┌──────────┴──────────┐   ┌─────────┴──────────┐\n      ↓                     ↓   ↓                    ↓\n  Standard RLHF      Custom Reward   Multi-Step    External Env\n  (One-shot gen)     Function      Reasoning     (OpenAI Agent Server)\n      ↓                     ↓           ↓                ↓\n      └─────────────────────┴───────────┴────────────────┘\n                              │\n                    Consistent Token Trajectories\n                              │\n                    ┌─────────┴─────────┐\n                    │  RL Algorithms    │\n                    │  (Decoupled)      │\n                    │                   │\n                    │  PPO, REINFORCE++ │\n                    │  GRPO, RLOO, etc. │\n                    └───────────────────┘\n```\n\n### Core Design Principles\n\n\u003Cdetails>\n\u003Csummary>Show core design principles\u003C\u002Fsummary>\n\n| Principle | Description | Benefit |\n|-----------|-------------|---------|\n| **Token-in-Token-out** | All sampling produces token-level trajectories | Zero text-level mismatch |\n| **Unified Interface** | Same `AgentExecutorBase` API for all modes | Switch modes with one flag |\n| **Algorithm-Agnostic** | RL algorithms (PPO, REINFORCE++, etc.) are decoupled from agent executors | Any algorithm works with any mode |\n| **Extensible** | Plug in custom rewards\u002Fenvironments easily | Rapid experimentation |\n| **Production-Ready** | Sync\u002FAsync\u002FHybrid Engine support | From research to deployment |\n\n\u003C\u002Fdetails>\n\n### Two Execution Modes (Orthogonal to RL Algorithms)\n\nThe agent execution mode is **independent** of the RL algorithm you choose. You can use **any algorithm** (PPO, REINFORCE++, GRPO, etc.) with **any execution mode**:\n\n| Mode | Use Cases | Interface | Complexity |\n|------|-----------|-----------|------------|\n| **Single-Turn** | Standard RLHF, custom reward functions | Optional `reward_func()` | ⭐ Default (99% use cases) |\n| **Multi-Turn** | Multi-step reasoning, interactive environments | `reset()` + `step()` | ⭐⭐ Advanced |\n\n---\n\n\u003Ca id=\"state-of-the-art-rl-algorithms\">\u003C\u002Fa>\n## 🚀 State-of-the-Art RL Algorithms\n\nOpenRLHF implements **PPO, REINFORCE++, REINFORCE++-baseline, GRPO, RLOO** with advanced optimization tricks inspired by practical guides and community best practices. \n\n**Key Design**: RL algorithms are **decoupled from agent execution modes**. 
All algorithms work seamlessly with both single-turn and multi-turn agent executors, running through the unified token-in-token-out pipeline for consistent behavior.\n\n\u003Cdetails>\n\u003Csummary>Show algorithm comparison table\u003C\u002Fsummary>\n\n| Algorithm | `--advantage_estimator` | Key Feature | Best Use Case |\n|-----------|------------------------|-------------|---------------|\n| **PPO** | (default) | Full critic network | Stable training, proven results |\n| **REINFORCE++** | `reinforce` | PPO tricks without critic | Efficient training, less memory |\n| **REINFORCE++-baseline** | `reinforce_baseline` | Mean reward baseline | Reasoning tasks (RLVR), robust to reward scales |\n| **RLOO** | `rloo` | Per-token KL + PPO-clip | Multi-sample training |\n| **GRPO** | `group_norm` | Group normalization | Batch-based training |\n| **Dr. GRPO** | `dr_grpo` | Simplified GRPO | Removes local `\u002Fstd` norm |\n\n\u003C\u002Fdetails>\n\nReferences: [Zhihu article](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F622134699) | [Notion best practices](https:\u002F\u002Fhijkzzz.notion.site\u002Frlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361)\n\n---\n\n\u003Ca id=\"comprehensive-features\">\u003C\u002Fa>\n## 📋 Comprehensive Features\n\nOpenRLHF provides a complete RLHF pipeline with agent-based flexibility:\n\n### 🎯 Agent-Based RL Training (Core Innovation)\n\n\u003Cdetails>\n\u003Csummary>Show agent-based RL training details\u003C\u002Fsummary>\n\n**Single-Turn Mode** (Default - 99% of use cases)\n- One-shot generation per prompt\n- Works with all RL algorithms: [PPO](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_hybrid_engine.sh), [REINFORCE++\u002Fbaseline\u002FGRPO\u002FRLOO](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_hybrid_engine.sh)\n- [Custom reward functions](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh) (`--remote_rm_url`)\n- [Hybrid Engine](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_hybrid_engine.sh) for maximum GPU utilization\n\n**Multi-Turn Mode** (Advanced - Interactive tasks)\n- Multi-step interactions with environment feedback\n- Works with all RL algorithms\n- [Custom agent functions](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh) (`--agent_func_path`)\n- OpenAI-compatible server: see `examples\u002Fpython\u002Fagent_func_openai_server_executor.py` for an agent executor that wraps vLLM as a local OpenAI Agent Server\n- Async pipeline (`--async_train`) for higher throughput: [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh)\n\n\u003C\u002Fdetails>\n\n### 🎓 Supervised Training & Preference Learning\n\n\u003Cdetails>\n\u003Csummary>Show supervised training & preference learning table\u003C\u002Fsummary>\n\n| Method | Script | Description |\n|--------|--------|-------------|\n| **SFT** | [train_sft.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_sft.sh) | Supervised fine-tuning with packing |\n| **DPO\u002FIPO\u002FcDPO** | [train_dpo_llama.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_dpo_llama.sh) | Direct preference optimization |\n| **Reward Model** | [train_rm.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_rm.sh) | Train reward models |\n\n\u003C\u002Fdetails>\n\n### ⚡ Advanced Capabilities\n\n\u003Cdetails>\n\u003Csummary>Show advanced capabilities\u003C\u002Fsummary>\n\n**Efficiency Optimizations**\n- Sample packing (`--packing_samples`) for all training modes\n- vLLM acceleration (`--vllm_num_engines`) for fast 
generation\n- DAPO [dynamic filtering](.\u002Fexamples\u002Fscripts\u002Ftrain_dapo_ray_hybrid_engine.sh) (`--dynamic_filtering`)\n  - 🎲 Dynamic Sampling: for each prompt, generate multiple responses and **filter** them by your reward \u002F agent **0–1 `scores`** signal\n    - Enable: `--dynamic_filtering`\n    - Score range: `--dynamic_filtering_reward_range 0.0 1.0`\n    - Requires: `--n_samples_per_prompt > 1` and either `--remote_rm_url` or `--agent_func_path`\n    - Example: `.\u002Fexamples\u002Fscripts\u002Ftrain_dapo_ray_hybrid_engine.sh`\n\n**Scalability**\n- DeepSpeed AutoTP for tensor parallelism (see `--ds_tensor_parallel_size` in training scripts)\n- [RingAttention](.\u002Fexamples\u002Ftest_scripts\u002Ftrain_dpo_ring_llama.sh) for long context (`--ring_attn_size`)\n- Multi-node training with [SLURM](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_slurm.sh)\n\n**Model Support**\n- [VLM (Vision-Language Models)](.\u002Fexamples\u002Fscripts\u002Ftrain_vlm_math_hybrid_engine.sh) — single-turn and [multi-turn with image feedback](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py) (`--image_key`, `--max_images_per_prompt`)\n- [LoRA\u002FQLoRA](.\u002Fexamples\u002Fscripts\u002Ftrain_sft_mixtral_lora.sh) (`--lora_rank`, `--load_in_4bit`)\n- [Mixture of Experts (MoE)](.\u002Fexamples\u002Ftest_scripts\u002Ftrain_sft_moe.sh) (`--aux_loss_coef`)\n- FlashAttention (`--attn_implementation`)\n- HuggingFace chat templates (`--apply_chat_template`)\n\n**Reward Shaping**\n- DAPO-style overlong penalty for length control (`--overlong_buffer_len`, `--overlong_penalty_factor`) — soft-penalize responses that exceed `max_new_tokens - overlong_buffer_len`\n- ProRL-style truncation penalty (`--stop_properly_penalty_coef`) — for samples with `finish_reason='length'`: `coef ∈ [0, 1]` multiplicatively scales the reward; `coef \u003C 0` sets the reward to that fixed value (e.g. `-0.5`)\n\n**Production Features**\n- Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`) logging\n- Checkpoint recovery (`--load_checkpoint`, `--save_steps`)\n- Best-checkpoint saving on eval metrics (`--best_metric_key`)\n- Evaluation datasets (`--eval_dataset`, `--eval_temperature`, `--eval_n_samples_per_prompt`) — supported in async training\n- Multi-process data loading (`--dataloader_num_workers`, available for PPO\u002FSFT\u002FRM\u002FDPO)\n- PPO observability: actor\u002Fcritic grad-norm and per-phase timing (`timing\u002Fmake_experience`, `timing\u002Fppo_train`, `timing\u002Fbroadcast`, `timing\u002Fgeneration`, `timing\u002Fstep_total`)\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Ca id=\"quick-start\">\u003C\u002Fa>\n## 🎬 Quick Start\n\n### Installation\n\n**Recommended**: Use Docker for hassle-free setup\n\n```bash\n# 1. Launch Docker container\ndocker run --runtime=nvidia -it --rm --shm-size=\"10g\" --cap-add=SYS_ADMIN \\\n  -v $PWD:\u002Fopenrlhf nvcr.io\u002Fnvidia\u002Fpytorch:25.11-py3 bash\n\n# 2. Clean conflicting packages\nsudo pip uninstall xgboost transformer_engine flash_attn pynvml -y\n\n# 3. Install OpenRLHF (choose one)\npip install openrlhf                    # Basic\npip install openrlhf[vllm]              # + vLLM 0.19.0 (recommended)\npip install openrlhf[vllm_latest]       # + Latest vLLM\npip install openrlhf[vllm,ring,liger]   # + All optimizations\n```\n\n**Alternative: Install from source**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF.git\ncd OpenRLHF\npip install -e .\n```\n\n> [!TIP]\n> We recommend **vLLM 0.19.0+** for best performance. 
See [Dockerfiles](.\u002Fdockerfile\u002F) and [Nvidia-Docker Install Script](.\u002Fexamples\u002Fscripts\u002Fnvidia_docker_install.sh).\n\n### Prepare Datasets\n\nOpenRLHF provides flexible data processing methods:\n\n**Key Parameters**:\n- `--input_key`: Specify JSON key name for input data\n- `--apply_chat_template`: Use HuggingFace tokenizer's [chat template](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fchat_templating)\n- `--input_template`: Custom template string (alternative to chat template)\n- `--prompt_data_probs` \u002F `--dataset_probs`: Mix multiple datasets (e.g., `0.1,0.4,0.5`)\n- `--eval_dataset`: Specify evaluation dataset path\n\n**Chat Template Example**:\n\n```python\ndataset = [{\"input_key\": [\n  {\"role\": \"user\", \"content\": \"Hello, how are you?\"},\n  {\"role\": \"assistant\", \"content\": \"I'm doing great. How can I help you today?\"},\n  {\"role\": \"user\", \"content\": \"I'd like to show off how chat templating works!\"},\n]}]\n\ntokenizer.apply_chat_template(dataset[0][\"input_key\"], tokenize=False)\n# Output: \"\u003Cs>[INST] Hello, how are you? [\u002FINST]I'm doing great...\u003C\u002Fs> [INST] I'd like to show off... [\u002FINST]\"\n```\n\n> [!NOTE]\n> JSON key options vary by dataset type. See [Reward Dataset](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Freward_dataset.py#L10), [SFT Dataset](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Fsft_dataset.py#L9), and [Prompt Dataset](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Fprompts_dataset.py#L6)\n\n\u003Ca id=\"supervised-fine-tuning\">\u003C\u002Fa>\n### Supervised Fine-tuning\n\nOpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain  {name or path}`, `--reward_pretrain  {name or path}` and `--critic_pretrain  {name or path}`. 
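Since the checkpoints are standard HuggingFace models, a saved SFT checkpoint can also be loaded back with `transformers` for a quick sanity check. A minimal sketch, assuming the `--save_path` from the SFT example below (the prompt string is only an illustration):\n\n```python\n# Hedged sketch: load an OpenRLHF SFT checkpoint as a regular HF causal LM.\n# .\u002Fcheckpoint\u002Fllama3-8b-sft matches --save_path in the SFT command below.\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nckpt = \".\u002Fcheckpoint\u002Fllama3-8b-sft\"\ntokenizer = AutoTokenizer.from_pretrained(ckpt)\nmodel = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)\n\nprompt = \"User: What is RLHF? Assistant: \"\ninputs = tokenizer(prompt, return_tensors=\"pt\")\nwith torch.no_grad():\n    outputs = model.generate(**inputs, max_new_tokens=64)\nprint(tokenizer.decode(outputs[0], skip_special_tokens=True))\n```\n\n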
We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https:\u002F\u002Fhuggingface.co\u002FOpenRLHF).\n\nThen you can use the startup scripts we provide in the [examples\u002Fscripts](.\u002Fexamples\u002Fscripts\u002F) directory, or start the training using the following commands.\n\n\u003Cdetails>\n\u003Csummary>SFT command\u003C\u002Fsummary>\n\n```bash\ndeepspeed --module openrlhf.cli.train_sft \\\n   --max_len 4096 \\\n   --dataset Open-Orca\u002FOpenOrca \\\n   --input_key question \\\n   --output_key response \\\n   --input_template $'User: {}\\nAssistant: ' \\\n   --train_batch_size 256 \\\n   --micro_train_batch_size 2 \\\n   --max_samples 500000 \\\n   --pretrain meta-llama\u002FMeta-Llama-3-8B \\\n   --save_path .\u002Fcheckpoint\u002Fllama3-8b-sft \\\n   --save_steps -1 \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --zero_stage 2 \\\n   --max_epochs 1 \\\n   --packing_samples \\\n   --param_dtype bf16 \\\n   --learning_rate 5e-6 \\\n   --gradient_checkpointing \\\n   --use_wandb {wandb_token}\n\n# Additional options:\n# --apply_chat_template                # Use HF tokenizer chat template\n# --ring_attn_size 2                   # Enable RingAttention (install ring_flash_attn first)\n# --multiturn                          # Multi-turn fine-tuning loss\n# --pretrain_mode                      # Continued pre-training mode\n```\n\n\u003C\u002Fdetails>\n\n\n### Reward Model Training\n\n\u003Cdetails>\n\u003Csummary>Reward model training command\u003C\u002Fsummary>\n\n```bash\ndeepspeed --module openrlhf.cli.train_rm \\\n   --save_path .\u002Fcheckpoint\u002Fllama3-8b-rm \\\n   --save_steps -1 \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --train_batch_size 256 \\\n   --micro_train_batch_size 1 \\\n   --pretrain OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n   --param_dtype bf16 \\\n   --max_epochs 1 \\\n   --max_len 8192 \\\n   --zero_stage 3 \\\n   --learning_rate 9e-6 \\\n   --dataset OpenRLHF\u002Fpreference_dataset_mixture2_and_safe_pku \\\n   --apply_chat_template \\\n   --chosen_key chosen \\\n   --rejected_key rejected \\\n   --packing_samples \\\n   --gradient_checkpointing \\\n   --use_wandb {wandb_token}\n\n```\n\n\u003C\u002Fdetails>\n\nIt is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`:\n\n```python\nreward_model = AutoModelForSequenceClassification.from_pretrained(\n              reward_model_path,\n              num_labels=1,\n              torch_dtype=torch.bfloat16,\n              attn_implementation=\"flash_attention_2\",\n              use_cache=False,\n          )\ninputs = xxxx (Left Padding Input Tokens)\nreward = reward_model.model(*inputs).last_hidden_state\nreward = reward_model.score(reward)[:, -1]\n```\n\n### RL Training: PPO\u002FREINFORCE++ with Ray and vLLM\n\nAll RL training in OpenRLHF runs through the **agent execution pipeline**. 
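Before launching RL, it can help to sanity-check the reward model trained above: the loading snippet in the previous section is partial pseudocode (`inputs = xxxx`). A runnable sketch under the same assumptions (`--value_prefix_head score`); the tokenizer setup and left padding below are our own additions, not taken from this README:\n\n```python\n# Hedged sketch: score a prompt+response pair with a reward model trained with\n# --value_prefix_head score. The .model \u002F .score attribute layout follows the\n# partial snippet above; tokenizer handling and left padding are assumptions.\nimport torch\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nreward_model_path = \"OpenRLHF\u002FLlama-3-8b-rm-700k\"  # example checkpoint used in this README\ntokenizer = AutoTokenizer.from_pretrained(reward_model_path)\ntokenizer.padding_side = \"left\"  # keep the final real token in the last position\nif tokenizer.pad_token is None:\n    tokenizer.pad_token = tokenizer.eos_token\n\nreward_model = AutoModelForSequenceClassification.from_pretrained(\n    reward_model_path,\n    num_labels=1,\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",  # requires flash-attn; drop if unavailable\n    use_cache=False,\n)\n\ntexts = [\"User: Hello! Assistant: Hi, how can I help you today?\"]\ninputs = tokenizer(texts, return_tensors=\"pt\", padding=True)\nwith torch.no_grad():\n    hidden = reward_model.model(**inputs).last_hidden_state  # backbone hidden states\n    reward = reward_model.score(hidden)[:, -1]  # scalar reward at the last token\nprint(reward)\n```\n\n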
The following example shows single-turn agent execution (default mode) with Hybrid Engine for optimal performance:\n\n```bash\n# launch the master node of ray in container\nray start --head --node-ip-address 0.0.0.0 --num-gpus 8\n\n# if you want to launch ray on more nodes, use\nray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8\n\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n   --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n   -- python3 -m openrlhf.cli.train_ppo_ray \\\n   --ref_num_nodes 1 \\\n   --ref_num_gpus_per_node 8 \\\n   --reward_num_nodes 1 \\\n   --reward_num_gpus_per_node 8 \\\n   --critic_num_nodes 1 \\\n   --critic_num_gpus_per_node 8 \\\n   --actor_num_nodes 1 \\\n   --actor_num_gpus_per_node 8 \\\n   --vllm_num_engines 4 \\\n   --vllm_tensor_parallel_size 2 \\\n   --colocate_all_models \\\n   --vllm_gpu_memory_utilization 0.5 \\\n   --pretrain OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n   --reward_pretrain OpenRLHF\u002FLlama-3-8b-rm-700k \\\n   --save_path \u002Fopenrlhf\u002Fexamples\u002Ftest_scripts\u002Ffinal\u002Fllama3-8b-rlhf \\\n   --ckpt_path \u002Fopenrlhf\u002Fexamples\u002Ftest_scripts\u002Fckpt\u002Fllama3-8b-rlhf \\\n   --save_hf_ckpt \\\n   --train_batch_size 128 \\\n   --rollout_batch_size 1024 \\\n   --use_dynamic_batch \\\n   --n_samples_per_prompt 1 \\\n   --max_epochs 1 \\\n   --prompt_max_len 1024 \\\n   --max_samples 100000 \\\n   --generate_max_len 1024 \\\n   --zero_stage 3 \\\n   --param_dtype bf16 \\\n   --actor_learning_rate 5e-7 \\\n   --critic_learning_rate 9e-6 \\\n   --init_kl_coef 0.01 \\\n   --prompt_data OpenRLHF\u002Fprompt-collection-v0.1 \\\n   --input_key context_messages \\\n   --apply_chat_template \\\n   --normalize_reward \\\n   --gradient_checkpointing \\\n   --packing_samples \\\n   --vllm_sync_backend nccl \\\n   --enforce_eager \\\n   --vllm_enable_sleep \\\n   --deepspeed_enable_sleep \\\n   --use_wandb {wandb_token}\n\n# Algorithm Variants (all use single-turn agent execution):\n# --advantage_estimator reinforce        # REINFORCE++\n# --advantage_estimator rloo             # RLOO\n# --advantage_estimator reinforce_baseline  # REINFORCE++-baseline (best for RLVR)\n# --advantage_estimator group_norm       # GRPO\n# --advantage_estimator dr_grpo          # Dr. 
GRPO\n\n# Advanced Options:\n# --init_kl_coef 0                                    # No reference model\n# --remote_rm_url http:\u002F\u002Fhost:5000\u002Fget_reward         # HTTP reward model\n# --n_samples_per_prompt 4                            # Multiple samples per prompt\n# --vllm_generate_batch_size 2048                     # Oversample at generation (> rollout_batch_size); requires --async_train\n# --enable_vllm_is_correction                         # vLLM importance sampling correction for off-policy rollouts\n# --vllm_is_correction_type tis                       # Correction type: tis (token clamp) | icepop (token filter) | seq-mask-tis (seq-level geom mean)\n# --vllm_is_truncated_threshold 0.5 5.0               # IS truncation interval: [low, high]\n# --best_metric_key eval_default_pass1                # Save best checkpoint by eval metric (empty = auto-detect first pass1, 'none' = disable)\n# --policy_loss_type gspo                             # Use GSPO policy loss variant (vs default 'ppo')\n```\n\n> [!TIP]\n> **For reasoning tasks (RLVR)**: Use `--advantage_estimator reinforce_baseline` for REINFORCE++-baseline—it's robust to different reward scales.\n\n> [!NOTE]\n> **Ray Environment Setup**: Let Ray auto-deploy with `--runtime-env-json='{\"setup_commands\": [\"pip install openrlhf[vllm]\"]}'`\n\n> [!NOTE]\n> **Troubleshooting GPU index errors**: Set `export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` if you encounter DeepSpeed GPU device setup issues.\n\n📚 **More Examples**: See [examples\u002Fscripts](.\u002Fexamples\u002Fscripts\u002F) and [Documentation](https:\u002F\u002Fopenrlhf.readthedocs.io\u002Fen\u002Flatest\u002Fusage.html)\n\n---\n\n\u003Ca id=\"single-turn-agent-reinforced-fine-tuning-with-custom-rewards\">\u003C\u002Fa>\n## 🎯 Single-Turn Agent: Reinforced Fine-tuning with Custom Rewards\n\nThe **single-turn agent execution** (default mode) supports custom reward functions—perfect for reinforced fine-tuning without a trained reward model. 
Instead of using a pre-trained reward model, you provide a Python function that computes rewards on-the-fly.\n\n**Ideal for**:\n- Rule-based rewards (length, format, code execution, math verification)\n- External API rewards (judge models, compilers, test suites)\n- Hybrid rewards (combining multiple signals)\n\n### Example: Custom Reward Function\n\n```python\n# reward_func.py\nimport torch\n\ndef reward_func(queries, prompts, labels):\n    \"\"\"\n    Compute custom rewards for generated responses.\n    \n    Args:\n        queries: List[str] - Full text (prompt + response)\n        prompts: List[str] - Original prompts only\n        labels: List[str] - Ground truth labels (from --label_key)\n    \n    Returns:\n        dict with:\n            - rewards: Tensor for advantage calculation\n            - scores: Tensor for dynamic filtering (0-1 range)\n            - extra_logs: Dict for wandb logging\n    \"\"\"\n    batch_size = len(queries)\n    \n    # Example: Random rewards (replace with your logic)\n    # Real examples: code execution, math verification, format checking\n    reward = torch.randint(0, 2, (batch_size,)).float()\n\n    return {\n        \"rewards\": reward,           # Used in RL advantage calculation\n        \"scores\": reward,            # Used for dynamic filtering (--dynamic_filtering)\n        \"extra_logs\": {              # Logged to wandb\n            \"custom_metric\": reward.mean().item(),\n        },\n    }\n```\n\n### Usage\n\n```bash\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n  --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n  -- python3 -m openrlhf.cli.train_ppo_ray \\\n  --pretrain meta-llama\u002FMeta-Llama-3-8B \\\n  --use_dynamic_batch \\\n  --remote_rm_url \u002Fpath\u002Fto\u002Freward_func.py \\\n  --label_key answer \\\n  --prompt_data your_prompt_dataset \\\n  ... 
# other training args\n```\n\n**Key Parameter**: `--label_key answer` passes the \"answer\" field from your dataset to `reward_func` as `labels`.\n\n> [!TIP]\n> **Use Cases**: Code generation (execute tests), Math (verify solutions), Formatting (check structure), Multi-objective (combine multiple signals)\n\n📖 **Full Example**: [examples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh)\n\n---\n\n\u003Ca id=\"multi-turn-agent-complex-environment-interactions\">\u003C\u002Fa>\n## 🤖 Multi-Turn Agent: Complex Environment Interactions\n\nFor tasks requiring **multi-step interactions** (reasoning chains, coding with feedback, game playing), OpenRLHF provides the **Multi-Turn Agent Execution** mode.\n\n### Building Custom Multi-Turn Agents\n\nImplement `AgentInstanceBase` with `reset\u002Fstep` methods:\n\n```python\n# agent_func.py\nimport random\nfrom typing import Any, Dict\n\nimport torch\nfrom openrlhf.utils.agent import AgentInstanceBase, MultiTurnAgentExecutor\n\n\n# A simple n-step random environment\nclass AgentInstance(AgentInstanceBase):\n    async def __init__(self, *args, **kwargs):\n        self.step_idx = 0\n        self.max_steps = random.randint(1, 3)  # 1-3 steps\n\n    async def reset(self, states: dict, **kwargs):\n        return {\"observation\": states[\"observation\"]}  # Return original text observation\n\n    async def step(self, states: dict, **kwargs) -> Dict[str, Any]:\n        print(f\"step_idx: {self.step_idx}, max_steps: {self.max_steps}\")\n\n        observation_text = states[\"observation_text\"]\n        action_text = states[\"action_text\"]\n        label = states[\"label\"]\n\n        # Check if episode is done\n        done = self.step_idx >= self.max_steps\n        reward = torch.randint(0, 2, (1,)).float() if done else torch.tensor(0)\n\n        # Generate environment feedback based on whether episode is done\n        environment_feedback = (\n            \"\\n\\nHuman: [CORRECT]\\n\u003C\u002Fs>\"\n            if done\n            else \"\\n\\nHuman: [INCORRECT]\\nPlease analyze the issues and try again.\\n\u003C\u002Fs>\\n\\nAssistant: \"\n        )\n\n        self.step_idx += 1\n\n        return {\n            \"rewards\": reward,  # Rewards for advantage calculation\n            \"scores\": reward,  # Scores for dynamic filtering (0-1 reward)\n            \"environment_feedback\": environment_feedback,  # Environment feedback text\n            \"done\": done,  # Boolean indicating if the episode is complete\n            \"sampling_params\": states.get(\"sampling_params\", None),  # Parameters for vLLM sampling in next step\n            \"extra_logs\": {\"dummy_scores\": reward},  # Additional logging information\n        }\n\n\nclass AgentExecutor(MultiTurnAgentExecutor):\n    def __init__(self):\n        super().__init__(AgentInstance)\n```\n\nThen launch with:\n\n```bash\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n  --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n  -- python3 -m openrlhf.cli.train_ppo_ray \\\n  ...\n  --use_dynamic_batch \\\n  --agent_func_path \u002Fpath\u002Fto\u002Fagent_func.py \\\n  --async_train  # Optional: enable async pipeline\n```\n\n### Configuration Options\n\n**Async Pipeline** (for higher throughput):\n- Enable: `--async_train`\n- Buffer size: `--async_queue_size 1` (larger = more off-policy, default 1)\n- Partial rollout: `--partial_rollout` — uses vLLM pause\u002Fresume for weight sync instead of locking, allowing 
generation to overlap with training. In-flight samples may contain tokens from both old and new weights.\n\n**Training Modes**:\n- **Synchronous**: Default, better stability\n- **Asynchronous**: Higher throughput, may affect convergence\n- **Hybrid Engine**: Best GPU utilization with `--colocate_all_models` (remove `--async_train`)\n\n> [!NOTE]\n> For fully custom token-level execution, inherit `AgentExecutorBase` and implement `execute()`. This design enforces the **token-in-token-out principle** to keep sampling and training consistent.\n\n> [!WARNING] \n> Asynchronous training may affect training stability. Use it only when throughput is critical and convergence is validated.\n\n📚 **Examples**:\n- Single-turn: [train_ppo_ray_hybrid_engine.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_hybrid_engine.sh)\n- Custom reward: [train_ppo_with_reward_fn.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh)\n- Multi-turn: [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh)\n- Multi-turn VLM (image feedback): [vlm_multiturn_agent.py](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py)\n\n### OpenAI-Compatible Agent Server\n\nFor multi-turn agents that need an OpenAI-compatible chat API (e.g., integrating external tool-use frameworks), [`agent_func_openai_server_executor.py`](.\u002Fexamples\u002Fpython\u002Fagent_func_openai_server_executor.py) wraps vLLM as a local `\u002Fv1\u002Fchat\u002Fcompletions` server while collecting token-level traces for RL training.\n\n- Exposes standard OpenAI endpoints (`\u002Fv1\u002Fchat\u002Fcompletions`, `\u002Fv1\u002Fmodels`, `\u002Ftokenize`)\n- Automatically collects token IDs and logprobs per session for RL training\n- Delta-tokenization reuses prefix tokens across multi-turn calls\n- Override `run_agent()` to plug in your own multi-turn workflow\n\n```bash\npython3 -m openrlhf.cli.train_ppo_ray \\\n  --agent_func_path examples\u002Fpython\u002Fagent_func_openai_server_executor.py \\\n  ... # other training args\n```\n\n---\n\n\u003Ca id=\"advanced-topics\">\u003C\u002Fa>\n## 🔧 Advanced Topics\n\n### LoRA: Merging Adapters\n\nWhen using LoRA\u002FQLoRA, OpenRLHF saves only the adapter weights. To deploy or continue training, merge the adapter with the base model:\n\n```bash\npython -m openrlhf.cli.lora_combiner \\\n    --model_path meta-llama\u002FMeta-Llama-3-8B \\\n    --lora_path .\u002Fcheckpoint\u002Fllama3-8b-rm \\\n    --output_path .\u002Fcheckpoint\u002Fllama-3-8b-rm-combined \\\n    --is_rm \\\n    --param_dtype bf16\n```\n\n### Performance Tuning Guide\n\nOptimize OpenRLHF for your hardware and workload with these recommendations:\n\n#### 🎯 Execution Modes: Throughput vs. Stability\n\nPick the execution mode based on your priority — OpenRLHF gives you a clear tradeoff knob:\n\n| Mode | Flags | Characteristics | When to Use |\n|------|-------|-----------------|-------------|\n| **Hybrid Engine (colocated)** | `--colocate_all_models`\u003Cbr>`--vllm_enable_sleep`\u003Cbr>`--deepspeed_enable_sleep` | **Most stable** — strictly on-policy, every rollout uses the latest weights. Serial generate→train cycle. | Research, sensitive RL algorithms, reproducibility, recipe validation |\n| **Async Training** | `--async_train`\u003Cbr>`--async_queue_size N` | **Highest throughput** — generation and training run in parallel. Tune off-policyness via `--async_queue_size` (larger = more off-policy). 
| Production throughput when convergence is already validated |\n| **Async + Partial Rollout** | `--async_train`\u003Cbr>`--partial_rollout` | **Maximum overlap** — vLLM pause\u002Fresume instead of locking, in-flight samples may mix old\u002Fnew weights. Most aggressive off-policy. | Pushing async throughput further; pair with `--enable_vllm_is_correction` |\n\n#### ⚡ Other Speed Optimizations\n\n| Optimization | Flag | When to Use |\n|--------------|------|-------------|\n| **Sample Packing** | `--packing_samples` | Always (especially training) |\n| **Dynamic Batch** | `--use_dynamic_batch` | Variable sequence lengths |\n| **DeepCompile** | `--deepcompile` | PyTorch 2.0+ |\n| **Overlap Comm** | `--overlap_comm` | Sufficient GPU memory |\n| **Prefix Caching** | vLLM config | `n_samples_per_prompt` > 1 |\n| **Oversampling** | `--vllm_generate_batch_size > --rollout_batch_size` | Async mode, to amortize generation cost \u002F feed dynamic filtering |\n\n#### 💾 Memory Management\n\n**When you have enough memory**:\n- ✅ Disable `--adam_offload`\n- ✅ Enable `--overlap_comm`\n- ✅ Use `--colocate_critic_reward` and `--colocate_actor_ref`\n\n**When hitting OOM**:\n- ❌ Disable all `--colocate_*` options\n- ✅ Reduce batch sizes\n- ✅ Enable gradient checkpointing\n\n#### 🎮 Batch Size Tuning\n\n1. **Generation Phase**: Maximize `--micro_rollout_batch_size`, minimize vLLM TP size\n2. **Training Phase**: Maximize `--micro_train_batch_size`, enable `--packing_samples`\n3. **vLLM**: Always use `--vllm_sync_backend nccl`\n\n> [!TIP]\n> **Quick Start Template**: For 8x A100 (80GB), try Hybrid Engine + `--vllm_gpu_memory_utilization 0.5` + `--colocate_all_models`\n\n📖 **More Details**: [Performance Tuning Documentation](https:\u002F\u002Fopenrlhf.readthedocs.io\u002Fen\u002Flatest\u002Fperformance.html)\n\n\n## Companies and Organizations using OpenRLHF\n\n- Google\n- ByteDance\n- Tencent\n- Alibaba\n- Baidu\n- China Telecom\n- Vivo\n- Allen AI\n- NexusFlow\n- Jülich Supercomputing Centre (JSC)\n- Berkeley Starling Team\n- M-A-P\n- ...\n\n## Join Us\n\n**How to Join?**\n\n1. Email us at janhu9527@gmail.com or join [GitHub Organization](https:\u002F\u002Fgithub.com\u002FOpenRLHF). Please include the following details:\n   - Your name\n   - Your GitHub username\n   - Your areas of interest\n   - Your skills and experience related to NLP and\u002For AI\n1. You can also join us through the official GitHub [OpenRLHF ↗](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you.\n\n**What can you do?**\n\n1. Join the team and participate in the development of the OpenRLHF project.\n1. Contribute to the project by submitting pull requests.\n1. Help improve documentation, fix bugs, or create new features.\n1. Share the project and help us grow the community.\n\n## Sponsor Us\n\nYour sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https:\u002F\u002Fopencollective.com\u002FOpenRLHF).\n\n## Starchart\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenRLHF_OpenRLHF_readme_2c4107c3cbfa.png)](https:\u002F\u002Fstar-history.com\u002F#OpenRLHF\u002FOpenRLHF&Date)\n\n## Contributors\n\nA big thank you to all our contributors! 
If you want to contribute, feel free to make a pull request or create an issue.\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenRLHF_OpenRLHF_readme_ecf91c3dee96.png\" \u002F>\n\u003C\u002Fa>\n\n## References & Acknowledgements\n\nWe would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:\n\n- [Hugging Face Transformers ↗](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n- [OpenAI GPT ↗](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-3)\n- [LLaMA ↗](https:\u002F\u002Fllama.meta.com\u002F)\n- [DeepSpeed ↗](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n- [Ray ↗](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray)\n\nOur project would also like to thank [ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossalChat) and [DeepSpeedChat](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeedExamples\u002Ftree\u002Fmaster\u002Fapplications\u002FDeepSpeed-Chat). In the early stages of the project, we referred to their code design. \nOur project would like to thank [Netmind.AI](https:\u002F\u002Fwww.netmind.ai\u002F) for the GPU support of developing ring attention.\n\n(2024\u002F7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.\n\n## Citation\nOpenRLHF\n\n```\n@article{hu2024openrlhf,\n  title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},\n  author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},\n  journal={arXiv preprint arXiv:2405.11143},\n  year={2024}\n}\n```\nREINFORCE++-baseline\n```\n@article{hu2025reinforce++,\n  title={Reinforce++: A simple and efficient approach for aligning large language models},\n  author={Hu, Jian},\n  journal={arXiv preprint arXiv:2501.03262},\n  year={2025}\n}\n```\n\n______________________________________________________________________\n\n*OpenRLHF © 2025 OpenRLHF. 
All Rights Reserved.*\n",
(`--packing_samples`)\n- vLLM 加速 (`--vllm_num_engines`) 用于快速生成\n- DAPO [动态筛选](.\u002Fexamples\u002Fscripts\u002Ftrain_dapo_ray_hybrid_engine.sh) (`--dynamic_filtering`)\n  - 🎲 动态采样：对每个提示生成多个响应，并根据你的奖励或智能体提供的 `0–1` 分数信号进行筛选\n    - 启用：`--dynamic_filtering`\n    - 分数范围：`--dynamic_filtering_reward_range 0.0 1.0`\n    - 要求：`--n_samples_per_prompt > 1` 且需同时具备 `--remote_rm_url` 或 `--agent_func_path`\n    - 示例：`.\u002Fexamples\u002Fscripts\u002Ftrain_dapo_ray_hybrid_engine.sh`\n\n**可扩展性**\n- DeepSpeed AutoTP 实现张量并行（参见训练脚本中的 `--ds_tensor_parallel_size`）\n- [RingAttention](.\u002Fexamples\u002Ftest_scripts\u002Ftrain_dpo_ring_llama.sh) 用于长上下文处理 (`--ring_attn_size`)\n- 使用 [SLURM](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_slurm.sh) 进行多节点训练\n\n**模型支持**\n- [VLM（视觉-语言模型）](.\u002Fexamples\u002Fscripts\u002Ftrain_vlm_math_hybrid_engine.sh) — 单回合及 [带图像反馈的多回合](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py) 模式 (`--image_key`, `--max_images_per_prompt`)\n- [LoRA\u002FQLoRA](.\u002Fexamples\u002Fscripts\u002Ftrain_sft_mixtral_lora.sh) (`--lora_rank`, `--load_in_4bit`)\n- [专家混合模型 (MoE)](.\u002Fexamples\u002Ftest_scripts\u002Ftrain_sft_moe.sh) (`--aux_loss_coef`)\n- FlashAttention (`--attn_implementation`)\n- HuggingFace 对话模板 (`--apply_chat_template`)\n\n**奖励塑造**\n- DAPO 风格的过长惩罚用于控制长度 (`--overlong_buffer_len`, `--overlong_penalty_factor`) — 对超过 `max_new_tokens - overlong_buffer_len` 的响应进行软性惩罚\n- ProRL 风格的截断惩罚 (`--stop_properly_penalty_coef`) — 对于 `finish_reason='length'` 的样本：`coef ∈ [0, 1]` 会按比例缩放奖励；`coef \u003C 0` 则将奖励固定为该值（如 `-0.5`）\n\n**生产级特性**\n- Wandb (`--use_wandb`) 和 TensorBoard (`--use_tensorboard`) 日志记录\n- 检查点恢复 (`--load_checkpoint`, `--save_steps`)\n- 根据评估指标保存最佳检查点 (`--best_metric_key`)\n- 评估数据集 (`--eval_dataset`, `--eval_temperature`, `--eval_n_samples_per_prompt`) — 支持异步训练\n- 多进程数据加载 (`--dataloader_num_workers`, 适用于 PPO\u002FSFT\u002FRM\u002FDPO`)\n- PPO 可观测性：演员\u002F评论家梯度范数及各阶段耗时记录（timing\u002Fmake_experience、timing\u002Fppo_train、timing\u002Fbroadcast、timing\u002Fgeneration、timing\u002Fstep_total）\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Ca id=\"quick-start\">\u003C\u002Fa>\n## 🎬 快速入门\n\n### 安装\n\n**推荐**：使用 Docker 以实现无忧安装\n\n```bash\n# 1. 启动 Docker 容器\ndocker run --runtime=nvidia -it --rm --shm-size=\"10g\" --cap-add=SYS_ADMIN \\\n  -v $PWD:\u002Fopenrlhf nvcr.io\u002Fnvidia\u002Fpytorch:25.11-py3 bash\n\n# 2. 清除冲突包\nsudo pip uninstall xgboost transformer_engine flash_attn pynvml -y\n\n# 3. 
安装 OpenRLHF（任选其一）\npip install openrlhf                    # 基础版\npip install openrlhf[vllm]              # + vLLM 0.19.0（推荐）\npip install openrlhf[vllm_latest]       # + 最新版本 vLLM\npip install openrlhf[vllm,ring,liger]   # + 所有优化功能\n```\n\n**替代方案：从源码安装**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF.git\ncd OpenRLHF\npip install -e .\n```\n\n> [!TIP]\n> 我们建议使用 **vLLM 0.19.0+** 以获得最佳性能。请参阅 [Dockerfile](.\u002Fdockerfile\u002F) 和 [Nvidia-Docker 安装脚本](.\u002Fexamples\u002Fscripts\u002Fnvidia_docker_install.sh)。\n\n### 准备数据集\n\nOpenRLHF 提供灵活的数据处理方法：\n\n**关键参数**：\n- `--input_key`：指定输入数据的 JSON 键名\n- `--apply_chat_template`：使用 HuggingFace 分词器的 [聊天模板](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fchat_templating)\n- `--input_template`：自定义模板字符串（作为聊天模板的替代方案）\n- `--prompt_data_probs` \u002F `--dataset_probs`：混合多个数据集（例如 `0.1,0.4,0.5`）\n- `--eval_dataset`：指定评估数据集路径\n\n**聊天模板示例**：\n\n```python\ndataset = [{\"input_key\": [\n  {\"role\": \"user\", \"content\": \"你好，最近怎么样？\"},\n  {\"role\": \"assistant\", \"content\": \"我很好，今天有什么可以帮您的吗？\"},\n  {\"role\": \"user\", \"content\": \"我想展示一下聊天模板的功能！\"},\n]}]\n\ntokenizer.apply_chat_template(dataset[0][\"input_key\"], tokenize=False)\n# 输出：\"\u003Cs>[INST] 你好，最近怎么样？ [\u002FINST]我很好...\u003C\u002Fs> [INST] 我想展示一下... [\u002FINST]\"\n```\n\n> [!NOTE]\n> JSON 键选项因数据集类型而异。请参阅 [奖励数据集](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Freward_dataset.py#L10)、[SFT 数据集](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Fsft_dataset.py#L9) 和 [提示数据集](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Fprompts_dataset.py#L6)\n\n\u003Ca id=\"supervised-fine-tuning\">\u003C\u002Fa>\n\n### 有监督微调\n\nOpenRLHF 的模型检查点与 HuggingFace 模型完全兼容。你可以使用 `--pretrain  {name or path}`、`--reward_pretrain  {name or path}` 和 `--critic_pretrain  {name or path}` 来指定模型名称或路径。我们在 [HuggingFace OpenRLHF](https:\u002F\u002Fhuggingface.co\u002FOpenRLHF) 上提供了一些预训练的检查点和数据集。\n\n然后，你可以使用我们在 [examples\u002Fscripts](.\u002Fexamples\u002Fscripts\u002F) 目录中提供的启动脚本，或者使用以下命令开始训练。\n\n\u003Cdetails>\n\u003Csummary>SFT 命令\u003C\u002Fsummary>\n\n```bash\ndeepspeed --module openrlhf.cli.train_sft \\\n   --max_len 4096 \\\n   --dataset Open-Orca\u002FOpenOrca \\\n   --input_key question \\\n   --output_key response \\\n   --input_template $'User: {}\\nAssistant: ' \\\n   --train_batch_size 256 \\\n   --micro_train_batch_size 2 \\\n   --max_samples 500000 \\\n   --pretrain meta-llama\u002FMeta-Llama-3-8B \\\n   --save_path .\u002Fcheckpoint\u002Fllama3-8b-sft \\\n   --save_steps -1 \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --zero_stage 2 \\\n   --max_epochs 1 \\\n   --packing_samples \\\n   --param_dtype bf16 \\\n   --learning_rate 5e-6 \\\n   --gradient_checkpointing \\\n   --use_wandb {wandb_token}\n\n# 其他选项：\n# --apply_chat_template                # 使用 HF 分词器聊天模板\n# --ring_attn_size 2                   # 启用 RingAttention（需先安装 ring_flash_attn）\n# --multiturn                          # 多轮微调损失\n# --pretrain_mode                      # 继续预训练模式\n```\n\n\u003C\u002Fdetails>\n\n\n### 奖励模型训练\n\n\u003Cdetails>\n\u003Csummary>奖励模型训练命令\u003C\u002Fsummary>\n\n```bash\ndeepspeed --module openrlhf.cli.train_rm \\\n   --save_path .\u002Fcheckpoint\u002Fllama3-8b-rm \\\n   --save_steps -1 \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --train_batch_size 256 \\\n   
--micro_train_batch_size 1 \\\n   --pretrain OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n   --param_dtype bf16 \\\n   --max_epochs 1 \\\n   --max_len 8192 \\\n   --zero_stage 3 \\\n   --learning_rate 9e-6 \\\n   --dataset OpenRLHF\u002Fpreference_dataset_mixture2_and_safe_pku \\\n   --apply_chat_template \\\n   --chosen_key chosen \\\n   --rejected_key rejected \\\n   --packing_samples \\\n   --gradient_checkpointing \\\n   --use_wandb {wandb_token}\n```\n\n\u003C\u002Fdetails>\n\n建议将奖励模型的 `--value_head_prefix` 选项设置为 `score`，这样我们就可以使用 `AutoModelForSequenceClassification` 加载模型：\n\n```python\nreward_model = AutoModelForSequenceClassification.from_pretrained(\n              reward_model_path,\n              num_labels=1,\n              torch_dtype=torch.bfloat16,\n              attn_implementation=\"flash_attention_2\",\n              use_cache=False,\n          )\ninputs = xxxx (左填充输入标记)\nreward = reward_model.model(*inputs).last_hidden_state\nreward = reward_model.score(reward)[:, -1]\n```\n\n### RL 训练：使用 Ray 和 vLLM 的 PPO\u002FREINFORCE++ \n\nOpenRLHF 中的所有 RL 训练都通过 **智能体执行流水线** 运行。下面的例子展示了使用混合引擎以获得最佳性能的单回合智能体执行（默认模式）：\n\n```bash\n# 在容器中启动 Ray 主节点\nray start --head --node-ip-address 0.0.0.0 --num-gpus 8\n\n# 如果你想在更多节点上启动 Ray，可以使用\nray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8\n\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n   --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n   -- python3 -m openrlhf.cli.train_ppo_ray \\\n   --ref_num_nodes 1 \\\n   --ref_num_gpus_per_node 8 \\\n   --reward_num_nodes 1 \\\n   --reward_num_gpus_per_node 8 \\\n   --critic_num_nodes 1 \\\n   --critic_num_gpus_per_node 8 \\\n   --actor_num_nodes 1 \\\n   --actor_num_gpus_per_node 8 \\\n   --vllm_num_engines 4 \\\n   --vllm_tensor_parallel_size 2 \\\n   --colocate_all_models \\\n   --vllm_gpu_memory_utilization 0.5 \\\n   --pretrain OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n   --reward_pretrain OpenRLHF\u002FLlama-3-8b-rm-700k \\\n   --save_path \u002Fopenrlhf\u002Fexamples\u002Ftest_scripts\u002Ffinal\u002Fllama3-8b-rlhf \\\n   --ckpt_path \u002Fopenrlhf\u002Fexamples\u002Ftest_scripts\u002Fckpt\u002Fllama3-8b-rlhf \\\n   --save_hf_ckpt \\\n   --train_batch_size 128 \\\n   --rollout_batch_size 1024 \\\n   --use_dynamic_batch \\\n   --n_samples_per_prompt 1 \\\n   --max_epochs 1 \\\n   --prompt_max_len 1024 \\\n   --max_samples 100000 \\\n   --generate_max_len 1024 \\\n   --zero_stage 3 \\\n   --param_dtype bf16 \\\n   --actor_learning_rate 5e-7 \\\n   --critic_learning_rate 9e-6 \\\n   --init_kl_coef 0.01 \\\n   --prompt_data OpenRLHF\u002Fprompt-collection-v0.1 \\\n   --input_key context_messages \\\n   --apply_chat_template \\\n   --normalize_reward \\\n   --gradient_checkpointing \\\n   --packing_samples \\\n   --vllm_sync_backend nccl \\\n   --enforce_eager \\\n   --vllm_enable_sleep \\\n   --deepspeed_enable_sleep \\\n   --use_wandb {wandb_token}\n\n# 算法变体（均采用单回合智能体执行）：\n# --advantage_estimator reinforce        # REINFORCE++\n# --advantage_estimator rloo             # RLOO\n# --advantage_estimator reinforce_baseline  # REINFORCE++-baseline（最适合 RLVR）\n# --advantage_estimator group_norm       # GRPO\n# --advantage_estimator dr_grpo          # Dr. 
GRPO\n\n# 高级选项：\n# --init_kl_coef 0                                    # 无参考模型\n# --remote_rm_url http:\u002F\u002Fhost:5000\u002Fget_reward         # HTTP 奖励模型\n# --n_samples_per_prompt 4                            # 每个提示生成多个样本\n# --vllm_generate_batch_size 2048                     # 生成时超采样（大于 rollout_batch_size）；需要 --async_train\n# --enable_vllm_is_correction                         # vLLM 对于离策略回放的重要性采样修正\n# --vllm_is_correction_type tis                       # 修正类型：tis（token clamp）| icepop（token filter）| seq-mask-tis（序列级几何平均）\n# --vllm_is_truncated_threshold 0.5 5.0               # IS 截断区间：[低, 高]\n# --best_metric_key eval_default_pass1                # 根据评估指标保存最佳检查点（空值表示自动检测第一轮 pass1，'none' 表示禁用）\n# --policy_loss_type gspo                             # 使用 GSPO 策略损失变体（ vs 默认的 'ppo'）\n```\n\n> [!TIP]\n> **对于推理任务（RLVR）**：使用 `--advantage_estimator reinforce_baseline` 作为 REINFORCE++-baseline——它对不同的奖励尺度具有鲁棒性。\n\n> [!NOTE]\n> **Ray 环境设置**：让 Ray 自动部署，使用 `--runtime-env-json='{\"setup_commands\": [\"pip install openrlhf[vllm]\"]}'`\n\n> [!NOTE]\n> **解决 GPU 索引错误**：如果遇到 DeepSpeed GPU 设备设置问题，可以设置 `export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1`。\n\n📚 **更多示例**：请参阅 [examples\u002Fscripts](.\u002Fexamples\u002Fscripts\u002F) 和 [文档](https:\u002F\u002Fopenrlhf.readthedocs.io\u002Fen\u002Flatest\u002Fusage.html)\n\n---\n\n\u003Ca id=\"single-turn-agent-reinforced-fine-tuning-with-custom-rewards\">\u003C\u002Fa>\n\n## 🎯 单轮智能体：使用自定义奖励进行强化微调\n\n**单轮智能体执行**（默认模式）支持自定义奖励函数，非常适合在没有训练好的奖励模型的情况下进行强化微调。您无需使用预训练的奖励模型，而是提供一个 Python 函数来实时计算奖励。\n\n**适用场景**：\n- 基于规则的奖励（长度、格式、代码执行、数学验证）\n- 外部 API 奖励（裁判模型、编译器、测试套件）\n- 混合奖励（结合多种信号）\n\n### 示例：自定义奖励函数\n\n```python\n# reward_func.py\nimport torch\n\ndef reward_func(queries, prompts, labels):\n    \"\"\"\n    为生成的响应计算自定义奖励。\n    \n    参数：\n        queries: List[str] - 完整文本（提示 + 响应）\n        prompts: List[str] - 仅原始提示\n        labels: List[str] - 真实标签（来自 --label_key）\n    \n    返回：\n        包含以下键的字典：\n            - rewards: 用于优势计算的张量\n            - scores: 用于动态筛选的张量（0-1 范围）\n            - extra_logs: 用于 wandb 日志记录的字典\n    \"\"\"\n    batch_size = len(queries)\n    \n    # 示例：随机奖励（请替换为您自己的逻辑）\n    # 实际应用：代码执行、数学验证、格式检查\n    reward = torch.randint(0, 2, (batch_size,)).float()\n\n    return {\n        \"rewards\": reward,           # 用于强化学习中的优势计算\n        \"scores\": reward,            # 用于动态筛选 (--dynamic_filtering)\n        \"extra_logs\": {              # 记录到 wandb\n            \"custom_metric\": reward.mean().item(),\n        },\n    }\n```\n\n### 使用方法\n\n```bash\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n  --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n  -- python3 -m openrlhf.cli.train_ppo_ray \\\n  --pretrain meta-llama\u002FMeta-Llama-3-8B \\\n  --use_dynamic_batch \\\n  --remote_rm_url \u002Fpath\u002Fto\u002Freward_func.py \\\n  --label_key answer \\\n  --prompt_data your_prompt_dataset \\\n  ... 
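\n  # （补充示例，仅作示意）上文 reward_func 返回的 `scores` 可与“高级功能”中介绍的 DAPO 动态筛选配合使用；\n  # 以下标志组合为假设性搭配，具体取值请按任务调整：\n  #   --n_samples_per_prompt 4 --dynamic_filtering --dynamic_filtering_reward_range 0.0 1.0\n  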
# 其他训练参数\n```\n\n**关键参数**：`--label_key answer` 将数据集中的“answer”字段作为 `labels` 传递给 `reward_func`。\n\n> [!TIP]\n> **应用场景**：代码生成（执行测试）、数学（验证解题结果）、格式化（检查结构）、多目标优化（结合多种信号）\n\n📖 **完整示例**：[examples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh)\n\n---\n\n\u003Ca id=\"multi-turn-agent-complex-environment-interactions\">\u003C\u002Fa>\n## 🤖 多轮智能体：复杂环境交互\n\n对于需要**多步交互**的任务（推理链、带反馈的编程、游戏对战），OpenRLHF 提供了**多轮智能体执行**模式。\n\n### 构建自定义多轮智能体\n\n实现 `AgentInstanceBase` 类，并定义 `reset\u002Fstep` 方法：\n\n```python\n# agent_func.py\nimport random\nfrom typing import Any, Dict\n\nimport torch\nfrom openrlhf.utils.agent import AgentInstanceBase, MultiTurnAgentExecutor\n\n\n# 一个简单的 n 步随机环境\nclass AgentInstance(AgentInstanceBase):\n    async def __init__(self, *args, **kwargs):\n        self.step_idx = 0\n        self.max_steps = random.randint(1, 3)  # 1-3 步\n\n    async def reset(self, states: dict, **kwargs):\n        return {\"observation\": states[\"observation\"]}  # 返回原始文本观测值\n\n    async def step(self, states: dict, **kwargs) -> Dict[str, Any]:\n        print(f\"step_idx: {self.step_idx}, max_steps: {self.max_steps}\")\n\n        observation_text = states[\"observation_text\"]\n        action_text = states[\"action_text\"]\n        label = states[\"label\"]\n\n        # 检查是否结束回合\n        done = self.step_idx >= self.max_steps\n        reward = torch.randint(0, 2, (1,)).float() if done else torch.tensor(0)\n\n        # 根据是否结束回合生成环境反馈\n        environment_feedback = (\n            \"\\n\\nHuman: [CORRECT]\\n\u003C\u002Fs>\"\n            if done\n            else \"\\n\\nHuman: [INCORRECT]\\n请分析问题并重试。\\n\u003C\u002Fs>\\n\\nAssistant: \"\n        )\n\n        self.step_idx += 1\n\n        return {\n            \"rewards\": reward,  # 用于优势计算的奖励\n            \"scores\": reward,  # 用于动态筛选的分数（0-1 范围）\n            \"environment_feedback\": environment_feedback,  # 拼接到下一轮上下文的环境反馈文本\n            \"done\": done,  # 回合是否结束\n            \"extra_logs\": {\"dummy_scores\": reward},  # 可选：记录到 wandb 的额外指标\n        }\n```\n\n训练时通过 `--agent_func_path agent_func.py` 加载该文件即可（可参考 [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh)）。\n\n### OpenAI 兼容代理服务器\n\n对于需要 OpenAI 兼容聊天 API 的多轮对话代理（例如集成外部工具使用框架），[`agent_func_openai_server_executor.py`](.\u002Fexamples\u002Fpython\u002Fagent_func_openai_server_executor.py) 将 vLLM 包装为本地的 `\u002Fv1\u002Fchat\u002Fcompletions` 服务器，同时收集用于强化学习训练的 token 级别追踪信息。\n\n- 暴露标准的 OpenAI 端点（`\u002Fv1\u002Fchat\u002Fcompletions`、`\u002Fv1\u002Fmodels`、`\u002Ftokenize`）\n- 自动为每个会话收集 token ID 和 logprobs，用于强化学习训练\n- Delta-tokenization 在多轮调用中复用前缀 token\n- 可以通过重写 `run_agent()` 来接入您自己的多轮工作流\n\n```bash\npython3 -m openrlhf.cli.train_ppo_ray \\\n  --agent_func_path examples\u002Fpython\u002Fagent_func_openai_server_executor.py \\\n  ... # 其他训练参数\n```\n\n---\n\n\u003Ca id=\"advanced-topics\">\u003C\u002Fa>\n## 🔧 高级主题\n\n### LoRA：合并适配器\n\n在使用 LoRA\u002FQLoRA 时，OpenRLHF 只保存适配器权重。要部署或继续训练，需将适配器与基础模型合并：\n\n```bash\npython -m openrlhf.cli.lora_combiner \\\n    --model_path meta-llama\u002FMeta-Llama-3-8B \\\n    --lora_path .\u002Fcheckpoint\u002Fllama3-8b-rm \\\n    --output_path .\u002Fcheckpoint\u002Fllama-3-8b-rm-combined \\\n    --is_rm \\\n    --param_dtype bf16\n```\n\n### 性能调优指南\n\n根据您的硬件和工作负载，使用以下建议优化 OpenRLHF：\n\n#### 🎯 执行模式：吞吐量 vs. 
稳定性\n\n根据您的优先级选择执行模式——OpenRLHF 提供了一个清晰的权衡选项：\n\n| 模式 | 标志 | 特性 | 何时使用 |\n|------|-------|-----------------|-------------|\n| **混合引擎（共置）** | `--colocate_all_models`\u003Cbr>`--vllm_enable_sleep`\u003Cbr>`--deepspeed_enable_sleep` | **最稳定** — 严格策略内，每次 rollout 都使用最新权重。串行生成→训练循环。 | 研究、敏感的强化学习算法、可重复性、配方验证 |\n| **异步训练** | `--async_train`\u003Cbr>`--async_queue_size N` | **最高吞吐量** — 生成和训练并行运行。通过 `--async_queue_size` 调整离策略程度（越大越离策略）。 | 生产环境下的高吞吐量，当收敛已验证时 |\n| **异步 + 部分 rollout** | `--async_train`\u003Cbr>`--partial_rollout` | **最大重叠** — vLLM 使用暂停\u002F恢复而非锁定，进行中的样本可能混合新旧权重。最激进的离策略方式。 | 进一步提升异步吞吐量；可与 `--enable_vllm_is_correction` 搭配使用 |\n\n#### ⚡ 其他速度优化\n\n| 优化 | 标志 | 何时使用 |\n|--------------|------|-------------|\n| **样本打包** | `--packing_samples` | 始终使用（尤其是训练时） |\n| **动态批次** | `--use_dynamic_batch` | 序列长度可变时 |\n| **DeepCompile** | `--deepcompile` | PyTorch 2.0+ 时 |\n| **通信重叠** | `--overlap_comm` | GPU 内存充足时 |\n| **前缀缓存** | vLLM 配置 | `n_samples_per_prompt` > 1 时 |\n| **过采样** | `--vllm_generate_batch_size > --rollout_batch_size` | 异步模式下，用于摊销生成成本 \u002F 支持动态过滤 |\n\n#### 💾 内存管理\n\n**内存充足时**：\n- ✅ 禁用 `--adam_offload`\n- ✅ 启用 `--overlap_comm`\n- ✅ 使用 `--colocate_critic_reward` 和 `--colocate_actor_ref`\n\n**遇到 OOM 时**：\n- ❌ 禁用所有 `--colocate_*` 选项\n- ✅ 减少批次大小\n- ✅ 启用梯度检查点\n\n#### 🎮 批次大小调整\n\n1. **生成阶段**：最大化 `--micro_rollout_batch_size`，最小化 vLLM TP 大小\n2. **训练阶段**：最大化 `--micro_train_batch_size`，启用 `--packing_samples`\n3. **vLLM**：始终使用 `--vllm_sync_backend nccl`\n\n> [!TIP]\n> **快速入门模板**：对于 8x A100（80GB），可以尝试混合引擎 + `--vllm_gpu_memory_utilization 0.5` + `--colocate_all_models`\n\n📖 **更多详情**：[性能调优文档](https:\u002F\u002Fopenrlhf.readthedocs.io\u002Fen\u002Flatest\u002Fperformance.html)\n\n\n## 使用 OpenRLHF 的公司和组织\n\n- Google\n- 字节跳动\n- 腾讯\n- 阿里巴巴\n- 百度\n- 中国电信\n- 维沃\n- Allen AI\n- NexusFlow\n- Jülich 超级计算中心（JSC）\n- 伯克利星雀团队\n- M-A-P\n- ...\n\n## 加入我们\n\n**如何加入？**\n\n1. 发送邮件至 janhu9527@gmail.com 或加入 [GitHub 组织](https:\u002F\u002Fgithub.com\u002FOpenRLHF)。请提供以下信息：\n   - 您的姓名\n   - 您的 GitHub 用户名\n   - 您感兴趣的领域\n   - 您在 NLP 和\u002F或 AI 方面的技能和经验\n1. 您也可以通过官方 GitHub [OpenRLHF ↗](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) 项目页面加入我们。只需创建一个关于您贡献兴趣的问题，我们会尽快回复您。\n\n**您可以做什么？**\n\n1. 加入团队，参与 OpenRLHF 项目的开发。\n1. 通过提交拉取请求为项目做出贡献。\n1. 帮助改进文档、修复 bug 或开发新功能。\n1. 
分享该项目，帮助我们扩大社区。\n\n## 赞助我们\n\n您的赞助可以帮助我们维护和改进 OpenRLHF。如果您觉得这个项目很有用，请考虑赞助我们。您可以在 [Open Collective ↗](https:\u002F\u002Fopencollective.com\u002FOpenRLHF) 上进行赞助。\n\n## 星级图\n\n[![星级历史图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenRLHF_OpenRLHF_readme_2c4107c3cbfa.png)](https:\u002F\u002Fstar-history.com\u002F#OpenRLHF\u002FOpenRLHF&Date)\n\n## 贡献者\n\n非常感谢所有贡献者！如果您想参与贡献，欢迎随时提交拉取请求或创建问题。\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenRLHF_OpenRLHF_readme_ecf91c3dee96.png\" \u002F>\n\u003C\u002Fa>\n\n## 参考文献与致谢\n\n我们衷心感谢以下项目和组织对 AI 和 NLP 领域的贡献：\n\n- [Hugging Face Transformers ↗](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n- [OpenAI GPT ↗](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-3)\n- [LLaMA ↗](https:\u002F\u002Fllama.meta.com\u002F)\n- [DeepSpeed ↗](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n- [Ray ↗](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray)\n\n我们的项目还要感谢 [ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossalChat) 和 [DeepSpeedChat](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeedExamples\u002Ftree\u002Fmaster\u002Fapplications\u002FDeepSpeed-Chat)。在项目早期，我们参考了他们的代码设计。此外，我们还要感谢 [Netmind.AI](https:\u002F\u002Fwww.netmind.ai\u002F) 为环形注意力机制的开发提供的 GPU 支持。\n\n（2024年7月）我们的 GitHub 组织已从 OpenLLMAI 更名为 OpenRLHF。\n\n## 引用\nOpenRLHF\n\n```\n@article{hu2024openrlhf,\n  title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},\n  author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},\n  journal={arXiv preprint arXiv:2405.11143},\n  year={2024}\n}\n```\nREINFORCE++-baseline\n```\n@article{hu2025reinforce++,\n  title={REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models},\n  author={Hu, Jian},\n  journal={arXiv preprint arXiv:2501.03262},\n  year={2025}\n}\n```\n\n______________________________________________________________________\n\n*OpenRLHF © 2025 OpenRLHF。保留所有权利。*","# OpenRLHF 快速上手指南\n\nOpenRLHF 是首个结合 **Ray + vLLM 分布式架构**与**统一 Agent 执行范式**的高性能开源 RLHF 框架。它支持 PPO、REINFORCE++、GRPO 等先进算法，能够高效训练高达 70B+ 参数的大模型。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04)\n- **GPU**: NVIDIA GPU (显存建议 24GB+，多卡环境需支持 NVLink 或高速互联)\n- **CUDA**: 12.1 或更高版本\n- **Python**: 3.9 - 3.11\n\n### 前置依赖\n确保已安装以下基础组件：\n- [NVIDIA Driver](https:\u002F\u002Fwww.nvidia.com\u002Fdrivers)\n- [CUDA Toolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit)\n- [NCCL](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fnccl\u002Finstall-guide\u002Findex.html) (用于多卡通信)\n\n> **国内加速建议**：\n> 推荐使用国内镜像源加速 Python 包下载：\n> ```bash\n> export PIP_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 2. 安装步骤\n\n### 方式一：通过 pip 安装（推荐）\n直接从 PyPI 安装最新稳定版：\n\n```bash\npip install openrlhf\n```\n\n若需安装包含最新特性的开发版：\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF.git\n```\n\n### 方式二：源码安装\n克隆仓库并安装依赖：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF.git\ncd OpenRLHF\npip install -e .\n```\n\n### 验证安装\n在 Python 中确认包可以正常导入：\n\n```bash\npython -c \"import openrlhf\"\n```\n\n## 3. 
基本使用\n\nOpenRLHF 的核心工作流基于 **Agent 执行模式**，默认采用 **Single-Turn（单轮）** 模式进行标准 RLHF 训练。以下是一个基于 Ray 和 vLLM 的最小化训练示例。\n\n### 步骤 1: 准备数据\n确保你有一个包含 `prompt` 字段的 JSONL 格式数据集，例如 `prompts.jsonl`：\n```json\n{\"prompt\": \"What is the capital of France?\"}\n{\"prompt\": \"Explain quantum entanglement.\"}\n```\n\n### 步骤 2: 启动训练\n使用内置脚本启动基于 REINFORCE++ 算法的训练任务。该命令会自动调度 Ray 集群，利用 vLLM 进行采样，并使用 DeepSpeed 进行模型更新。\n\n```bash\nbash examples\u002Fscripts\u002Ftrain_reinforce_baseline_ray.sh \\\n    --pretrain OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n    --reward_pretrain OpenRLHF\u002FLlama-3-8b-rm-mixture \\\n    --save_path .\u002Fcheckpoint\u002Fllama3-8b-rlhf \\\n    --micro_train_batch_size 8 \\\n    --train_batch_size 128 \\\n    --micro_rollout_batch_size 16 \\\n    --rollout_batch_size 1024 \\\n    --max_samples 100000 \\\n    --max_epochs 1 \\\n    --prompt_max_len 1024 \\\n    --generate_max_len 1024 \\\n    --zero_stage 3 \\\n    --bf16 \\\n    --actor_learning_rate 5e-7 \\\n    --critic_learning_rate 9e-6 \\\n    --init_kl_coef 0.01 \\\n    --prompt_data prompts.jsonl \\\n    --input_key prompt \\\n    --apply_chat_template \\\n    --normalize_reward \\\n    --adam_offload \\\n    --flash_attn \\\n    --gradient_checkpointing \\\n    --use_vllm \\\n    --vllm_num_engines 2 \\\n    --vllm_tensor_parallel_size 1 \\\n    --advantage_estimator reinforce_baseline\n```\n\n### 关键参数说明\n- `--pretrain`: 初始策略模型（SFT 模型）。\n- `--reward_pretrain`: 奖励模型（Reward Model）。\n- `--use_vllm`: 启用 vLLM 加速采样过程（显著提升吞吐量）。\n- `--vllm_num_engines`: 启动的 vLLM 引擎数量，根据显卡数量调整。\n- `--advantage_estimator`: 选择优势估计算法，如 `ppo`, `reinforce_baseline`, `grpo` 等。\n- `--prompt_data`: 输入提示词数据路径。\n\n### 自定义奖励函数（可选）\n若不使用预训练的奖励模型，可通过 Python 脚本定义自定义奖励逻辑。创建一个继承自 `AgentExecutorBase` 的类，实现 `reward_func` 方法，并在启动脚本中通过 `--agent_func_path` 指定路径。\n\n```python\n# custom_reward.py\nfrom openrlhf.trainer.ray import AgentExecutorBase\n\nclass CustomRewardAgent(AgentExecutorBase):\n    def reward_func(self, prompts, outputs, **kwargs):\n        # 在此处编写自定义奖励逻辑\n        scores = [1.0 if \"Paris\" in out else 0.0 for out in outputs]\n        return scores\n```\n\n运行命令：\n```bash\n... 
--agent_func_path custom_reward.py ...\n```\n\n---\n*更多高级用法（如多轮对话 Agent、VLM 训练、LoRA 微调）请参考官方文档或 `examples\u002F` 目录下的完整脚本。*","某自动驾驶初创团队正致力于训练一个能理解路况截图并执行多步决策的视觉语言模型（VLM）代理，以优化车辆在复杂路口的导航策略。\n\n### 没有 OpenRLHF 时\n- **架构搭建困难**：团队需手动整合 Ray 分布式框架与 vLLM 推理引擎，耗费数周编写胶水代码，且难以处理图像输入与多轮交互的数据流。\n- **算法迭代缓慢**：尝试复现最新的 REINFORCE++ 或 PPO 算法时，常因显存溢出或通信瓶颈导致训练中断，调试周期长达数天。\n- **多模态支持缺失**：现有开源方案大多仅支持文本，无法直接利用车辆摄像头回传的实时截图作为环境反馈，被迫简化为纯文本模拟，严重偏离真实场景。\n- **资源利用率低**：由于缺乏异步强化学习机制，GPU 在等待环境交互响应时大量空闲，导致昂贵的算力资源浪费，训练成本居高不下。\n\n### 使用 OpenRLHF 后\n- **一键部署分布式架构**：直接调用 OpenRLHF 内置的\"Ray + vLLM\"混合引擎，仅需修改少量配置即可启动支持图像输入的多轮 VLM 训练流程。\n- **高效算法落地**：内置优化的 REINFORCE++ 和 PPO 算法开箱即用，稳定支撑长序列多步推理，将模型收敛时间从数周缩短至数天。\n- **原生多模态闭环**：利用其新增的多轮 VLM RL 功能，直接将路口截图作为 Prompt，模型输出驾驶指令并获得环境奖励，实现了端到端的真实场景对齐。\n- **极致性能提升**：借助异步 RL 机制，推理与训练并行执行，GPU 利用率提升至 90% 以上，在同等硬件条件下训练吞吐量翻倍。\n\nOpenRLHF 通过统一的智能体范式和高性能分布式架构，让复杂的多模态强化学习训练从“造轮子”的噩梦变成了可快速落地的工程实践。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenRLHF_OpenRLHF_762a9543.png","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOpenRLHF_67e72cee.png","Open-sourced Reinforcment Learning from Human Feedback",null,"janhu9527@gmail.com","https:\u002F\u002Fgithub.com\u002FOpenRLHF",[78,82,86],{"name":79,"color":80,"percentage":81},"Python","#3572A5",99.7,{"name":83,"color":84,"percentage":85},"Dockerfile","#384d54",0.2,{"name":87,"color":88,"percentage":89},"Shell","#89e051",0.1,9371,920,"2026-04-18T08:07:16","Apache-2.0","Linux","必需 NVIDIA GPU。支持多卡分布式训练（通过 Ray + DeepSpeed ZeRO-3 + vLLM），可训练 70B+ 参数模型。需支持 CUDA 和 NCCL 通信，具体显存需求取决于模型大小及是否使用混合引擎共享资源。","未说明（建议根据模型规模配置充足系统内存以支持 Ray 分布式调度）",{"notes":98,"python":99,"dependencies":100},"该工具基于 Ray + vLLM + DeepSpeed 分布式架构，专为高性能生产环境设计。核心特性包括：1. 利用 Ray 进行分布式调度，将 Actor、Reward、Reference 和 Critic 模型分离到不同 GPU；2. 采用混合引擎调度，允许模型与 vLLM 推理引擎共享 GPU 资源以最大化利用率；3. 依赖 NCCL\u002FCUDA IPC 进行高速 GPU 间通信；4. 原生支持 HuggingFace 模型格式；5. 
支持单轮和多轮 Agent 执行模式，且与具体 RL 算法（如 PPO, REINFORCE++, GRPO 等）解耦。","未说明",[101,102,103,104,105,106],"Ray","vLLM","DeepSpeed","Transformers (HuggingFace)","NCCL","CUDA",[35,14],[109,110,111,112,113,114,115,116],"transformers","vllm","large-language-models","raylib","reinforcement-learning-from-human-feedback","reinforcement-learning","proximal-policy-optimization","visual-language-models","2026-03-27T02:49:30.150509","2026-04-19T03:03:09.530979",[120,125,130,135,140,145],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},41350,"在使用 vLLM 和 Zero Stage 2 进行训练时程序卡住（hangs）怎么办？","这是一个已知问题，通常与特定的 vLLM 版本或配置有关。维护者指出该问题与 PR #278 相关，且即使在 vllm==0.4.1 版本中也可能出现。建议检查是否已应用相关的修复补丁，或者尝试调整 vLLM 的版本。如果问题持续，请参考相关的 Merge Request 获取最新的代码修复。","https:\u002F\u002Fgithub.com\u002FOpenLLMAI\u002FOpenRLHF\u002Fissues\u002F211",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},41351,"在多节点训练中，当 vLLM 引擎和 Actor 模型位于不同节点时出现广播错误（broadcast error）如何解决？","目前多节点训练中 vLLM 和 Actor 的放置是随机的，若两者位于不同节点会导致广播错误。虽然官方正在讨论改进方案，但目前的临时解决方法可能需要确保它们在同一节点，或者等待官方更新以支持跨节点的正确通信机制。此问题已被标记为环境相关问题，建议关注后续版本更新。","https:\u002F\u002Fgithub.com\u002FOpenLLMAI\u002FOpenRLHF\u002Fissues\u002F265",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},41352,"开启 Ring Attention 进行 DPO 训练时，Loss 震荡严重或与未开启时差异较大怎么办？","Loss 震荡通常是因为梯度累积（gradient accumulation）计算逻辑在 Ring Attention 模式下不匹配导致的。具体原因是 `num_gpu \u002F ring_size * micro_bs > train_bs`，即设置的 train batch size 太小。建议修改代码中计算全局 batch size 的逻辑：将 `ring_size` 的乘法移到最前面，避免整除结果为 0。例如，在单机 8 卡且 ring_size=8 时，global batch size 应能正确设置为 1。此外，训练初期 Loss 特别大可能是正常现象，几个 step 后通常会恢复正常。","https:\u002F\u002Fgithub.com\u002FOpenLLMAI\u002FOpenRLHF\u002Fissues\u002F564",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},41353,"如何确认 OpenRLHF 对特定模型大小（如 34B）和硬件配置（如 4 张或 8 张 A100）的支持情况？","关于支持矩阵的具体含义（如支持 PPO 还是 DPO），建议参考项目文档中的表格说明。对于 34B 模型的训练，社区用户反馈在 8 张 A100 上进行 Llama 34B 的 DPO 训练是可行的。如果遇到版本兼容性问题（如 transformers 版本导致的位置参数缺失错误），尝试卸载当前版本并重新安装特定版本（如从 4.38.2 降级到 4.37.2）可能解决问题。同时，注意 `use_cache=True` 与 gradient checkpointing 不兼容，需设置为 False。","https:\u002F\u002Fgithub.com\u002FOpenLLMAI\u002FOpenRLHF\u002Fissues\u002F193",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},41354,"加载奖励模型时报错 \"ValueError: weight is on the meta device\" 是什么原因？","该错误通常发生在加载量化模型（如 4-bit Qwen）作为奖励模型时，权重位于 meta 设备而无法直接加载到 GPU。这往往与模型加载配置或 DeepSpeed\u002FAccelerate 的状态有关。确保在加载模型前正确初始化设备映射，或者检查是否需要在加载参数中指定 `device_map`。如果是使用自定义脚本加载，请确保没有错误地启用了 meta 设备加载选项。","https:\u002F\u002Fgithub.com\u002FOpenLLMAI\u002FOpenRLHF\u002Fissues\u002F209",{"id":146,"question_zh":147,"answer_zh":148,"source_url":129},41355,"不同的奖励模型（如 UltraRM）具有不同的 value_head 变量名，如何适配？","由于不同的奖励模型可能使用不同的变量名（不仅仅是默认的 \"value_head\"），目前的解决方案是修改 `model.py` 文件，将 value_head 的变量名作为参数传入。可以添加一个命令行参数来控制该变量名。代码示例如下：\n```python\ndef _get_reward_model(base_pretrained_model, base_llm_model, value_head_name: str = \"value_head\"):\n    class LLMForSequenceRegression(base_pretrained_model):\n        supports_gradient_checkpointing = True\n        def __init__(self, config: AutoConfig):\n            super().__init__(config)\n            setattr(self, self.base_model_prefix, base_llm_model(config))\n            setattr(self, value_head_name, nn.Linear(config.hidden_size, 1, bias=False))\n```\n建议提交 PR 将此功能合并到主分支，以便更好地支持多种奖励模型。",[150,155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245],{"id":151,"version":152,"summary_zh":153,"released_at":154},333309,"v0.10.1.post2","**此补丁包含针对 VLM 训练的重要修复。**\n\n## 有哪些变化？\n\n- 由 @xiaoxigua999 在 
https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Ff97f8f19a20cb45e26519243d9c61474a3c18922 中修复了多轮 VLM 溢出回退时去除孤立图像填充标记的问题。\n- 由 @xiaoxigua999 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fe261d311b9f092090d4d4e632f5c709d41a2e38c 中修复了 vLLM 生成过程中多模态图像\u002F视频占位符标记重复扩展的问题。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.10.1...v0.10.1.post2","2026-04-15T03:01:25",{"id":156,"version":157,"summary_zh":158,"released_at":159},333310,"v0.10.1","## 变更内容\n* 功能：支持 VLM 的多轮强化学习，由 @xiaoxigua999 在 [`98dc14f`](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F98dc14f) 中实现\n* 修复：@xiaoxigua999 在 [`c1ba971`](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fc1ba971) 中修复了一些小 bug\n* 修复：单轮奖励路径将分数=0.0 视为缺失值，由 @xiaoxigua999 在 [#1219](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1219) 中完成（[`33f9d72`](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F33f9d72)）\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.10.0...v0.10.1","2026-04-14T00:44:40",{"id":161,"version":162,"summary_zh":163,"released_at":164},333311,"v0.10.0","## 变更内容\n* [pre-commit.ci] @pre-commit-ci[bot] 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1215 中提出的 pre-commit 建议\n* 功能：添加 VLM（视觉-语言模型）RLHF 支持（Qwen3.5），由 @hijkzzz 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1217 中实现\n* 功能：添加过采样支持：在异步模式下，vLLM 生成批次大小大于回放批次大小，由 @hijkzzz 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Ffb8b2f58fa795cf0970bf46b5e19ede8a06ba3d7 中实现\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.10...v0.10.0","2026-04-12T01:19:04",{"id":166,"version":167,"summary_zh":168,"released_at":169},333312,"v0.9.10","## 变更内容\n* 修复：在 Ray 运行时中尊重用户设置的 NCCL_DEBUG 环境变量，由 @Lidang-Jiang 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1212 中完成\n* 修复：当检查点目录中没有有效检查点时，实现优雅回退，由 @konghw-git 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1208 中完成\n* 新特性：添加 --dataloader_num_workers 选项，用于多进程数据加载，由 @konghw-git 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1207 中完成\n* 更新：升级 vLLM（0.19.0）、Transformers（5.5.0）和 DeepSpeed（0.18.9），由 @xiaoxigua999 完成\n\n## 新贡献者\n* @Lidang-Jiang 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1212 中完成了首次贡献\n* @konghw-git 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1208 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.9...v0.9.10","2026-04-04T00:58:24",{"id":171,"version":172,"summary_zh":173,"released_at":174},333313,"v0.9.9","## 变更内容\n本次发布主要聚焦于内部重构和 Ray 通信性能的提升，同时在强化学习流水线、实验\u002F体验管理以及检查点管理方面也进行了显著更新。\n\n- 由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 重构强化学习流水线\n- 由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 重构 make exp 功能\n- 在 [#1206](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1206) 中支持最佳检查点，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 实现\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.8...v0.9.9","2026-03-30T02:36:13",{"id":176,"version":177,"summary_zh":178,"released_at":179},333314,"v0.9.8","## 变更内容\n\n- @xiaoxigua999 在 
[9c5d260](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F9c5d260) 中重构了 `Experience` 类\n- @xiaoxigua999 在 [d0e088b](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fd0e088b) 中重构了 `max_len`、`max_new_tokens` 以及评估指标\n- @xiaoxigua999 在 [1eef768](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fe7ae02e57ff081adbec9ce2c5871f315a1eef768) 中支持负长度惩罚\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.7...v0.9.8","2026-03-25T13:14:01",{"id":181,"version":182,"summary_zh":183,"released_at":184},333315,"v0.9.7","## 变更内容\n\n### 功能\n- 支持在异步训练中进行评估，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 在 [266fbc5](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F266fbc5) 中实现。\n- 支持在异步模式下进行部分回放，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 在 [9bbfdc9](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F9bbfdc9) 中实现。\n- 支持使用 OAI 服务器进行多轮智能体强化学习，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 在 [ea4ac84](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fea4ac84) 中实现。\n\n### 改进\n- 在评估中添加长度日志，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 在 [6639fd3](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F6639fd3) 中实现。\n- 升级 vLLM 版本，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 在 [ac28c5d](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fac28c5d) 中实现。\n\n### 修复\n- 修复与 [#1200](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fissues\u002F1200) 相关的一些 bug，并改进数学工具函数，由 [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) 在 [7c54c83](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F7c54c83) 中实现。","2026-03-24T00:58:15",{"id":186,"version":187,"summary_zh":188,"released_at":189},333316,"v0.9.6","## 有哪些变化？\n\n* [`升级 vLLM 和 DeepSpeed`](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fb1ef6d287ee0cd49c3bcbe0b9364f55be270f5df) @xiaoxigua999\n* [`移除 KTO\u002FPRM\u002FKD、batch_inference 和 interactive_chat`](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F6a981c8c97310190cc4f0ed606d97072c576b072) @xiaoxigua999\n* [`添加梯度范数日志记录和 PPO 阶段时间分解`](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F4908e6ee3caba79fce8aa53e7f5aa78a2fc81873) @yxs\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.5...v0.9.6","2026-03-19T05:11:46",{"id":191,"version":192,"summary_zh":193,"released_at":194},333317,"v0.9.5","## 变更内容\n\n- 修复了 vLLM 0.16+ 版本中异步模式下的权重更新问题 (@xiaoxigua999)\n- 解决了与 vLLM 0.17 和 Transformers v5 的兼容性问题 (@xiaoxigua999)\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.4...v0.9.5","2026-03-08T13:55:44",{"id":196,"version":197,"summary_zh":198,"released_at":199},333318,"v0.9.4","## 变更内容\n* 【BUG 修复】在 AutoTP 初始化过程中出现的 GPU 内存不足问题，通过 @jiosephlee 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1189 中重新基于分片参数创建优化器得以解决。\n* 支持 transformers v5：更新已弃用的 API 和依赖项，由 @jiosephlee 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1186 中完成。\n* 修复 fit 方法签名不一致的问题，由 @harryharrygo 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1191 中完成。\n* 将 vLLM 升级至 0.16.0，DeepSpeed 升级至 0.18.6，@xiaoxigua999 负责。\n\n## 新贡献者\n* @jiosephlee 在 
https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1189 中完成了首次贡献。\n* @harryharrygo 在 https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1191 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.3...v0.9.4","2026-03-03T11:49:22",{"id":201,"version":202,"summary_zh":203,"released_at":204},333319,"v0.9.3","## What’s changed?\r\n- [Bumped vLLM to v0.15.1 and DeepSpeed to v0.18.5](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F2bc3ef0adc64c3c3c2342e06ca6e92badcbbd216) (@xiaoxigua999)\r\n\r\n- [Added support for Sequence-level TIS](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F8b78539c0e159e2271e7a98f87f59dbeabb3f0c9) (@xiaoxigua999)\r\n\r\n- [Updated math utilities](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F160b08e43b14198731a3ddfef8abd44d23161494) (@xiaoxigua999)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.2...v0.9.3","2026-02-05T12:32:57",{"id":206,"version":207,"summary_zh":208,"released_at":209},333320,"v0.9.2","## What's Changed\r\n* [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1167\r\n* Fix: AgentInstance bug by @Freder-chen in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1169\r\n* refactor: replace --bf16 with --data_type (bf16\u002Ffp16\u002Ffp32) by @LYMDLUT in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1170\r\n* feat: Add multi-stage wake_up support for vLLM engine by @LYMDLUT in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1171\r\n* feat: add save_hf_ckpt and disable_ds_ckpt options to training scripts by @LYMDLUT in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1173\r\n* fix: change if to elif for eval_task handling in batch inference by @LYMDLUT in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1174\r\n* Add PromptDataset collate_fn wiring by @Freder-chen in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1177\r\n* add log logprobs_diff by @richardodliu in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1156\r\n* docs: add Bilibili video link to README files by @LYMDLUT in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1180\r\n* fix: pass ring_attn_group to reward model forward in rm_trainer by @yurekami in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1182\r\n* feat: [support prorlv2 recipes and math verify](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fd3f700b7d236df4f11a99b3c53073dbeccae76f4) @xiaoxigua999 \r\n\r\n## New Contributors\r\n* @yurekami made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1182\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.1...v0.9.2","2026-02-03T04:09:51",{"id":211,"version":212,"summary_zh":213,"released_at":214},333321,"v0.9.1.post1","## What's Changed\r\n* [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1167\r\n* Fix: AgentInstance bug by @Freder-chen in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1169\r\n\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.1...v0.9.1.post1","2026-01-06T04:58:45",{"id":216,"version":217,"summary_zh":218,"released_at":219},333322,"v0.9.1","**This version has been completely refactored into an agent-based architecture.**\r\n```\r\n                 ┌─────────────────────────────┐\r\n                 │    AgentExecutorBase        │\r\n                 │  (Token-in-Token-out Core)  │\r\n                 └─────────────────────────────┘\r\n                              │\r\n                 ┌────────────┴────────────┐\r\n                 ↓                         ↓\r\n         SingleTurnExecutor        MultiTurnExecutor\r\n                 │                         │\r\n      ┌──────────┴──────────┐   ┌─────────┴──────────┐\r\n      ↓                     ↓   ↓                    ↓\r\n  Standard RLHF      Custom Reward   Multi-Step    External Env\r\n  (One-shot gen)     Function      Reasoning     (NeMo Gym)\r\n      ↓                     ↓           ↓                ↓\r\n      └─────────────────────┴───────────┴────────────────┘\r\n                              │\r\n                    Consistent Token Trajectories\r\n                              │\r\n                    ┌─────────┴─────────┐\r\n                    │  RL Algorithms    │\r\n                    │  (Decoupled)      │\r\n                    │                   │\r\n                    │  PPO, REINFORCE++ │\r\n                    │  GRPO, RLOO, etc. │\r\n                    └───────────────────┘\r\n```\r\n\r\n## What's Changed\r\n* Update model configuration to use attn_implementation by @LYMDLUT in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1159\r\n* bump vLLM to 0.13.0 by @hijkzzz in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1161\r\n* feat: restructure and clean up streaming async sampling (#1130) by @Freder-chen and @MooMoo-Yang\r\n in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1152\r\n* update README.md by @hijkzzz\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.9.0...v0.9.1","2026-01-05T02:25:56",{"id":221,"version":222,"summary_zh":223,"released_at":224},333323,"v0.9.0","## What's Changed\r\n* [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1137\r\n* Fix bug in async training with remote reward model by @Freder-chen in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1140\r\n* [fix] fix typos by @Imbernoulli in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1144\r\n* feat: Integrate NeMo Gym for Advanced Agent-Based RLHF Training by @RayenTian in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1147\r\n* Bump DeepSpeed (0.18.1) & vLLM (0.11.0) to the latest version @xiaoxigua999 \r\n\r\n## New Contributors\r\n* @Imbernoulli made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1144\r\n* @RayenTian made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1147\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.8.11...v0.9.0","2025-10-31T23:59:36",{"id":226,"version":227,"summary_zh":228,"released_at":229},333324,"v0.8.11","## What's Changed\r\n- Fix PPO progress display when resuming from a checkpoint by @zhaoxu98 in 
https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1124\r\n- Add `--tokenizer_chat_template` argument to the DPO trainer by @armsp in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1129\r\n- Add [GEM: A Gym for Generalist LLMs demo](https:\u002F\u002Fgithub.com\u002Faxon-rl\u002Fgem\u002Ftree\u002Fmain\u002Fexamples\u002Ftrain_openrlhf) by @xiaoxigua999 in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F4e9a12f9f902db880d4a599e18b36ab37c7742d4\r\n- Bump vLLM to 0.10.2 by @xiaoxigua999 in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fb678e303a3d432b271b85527365ab4cd2467f9b7\r\n- Bump DeepSpeed to 0.17.6 by @xiaoxigua999 in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fac3689ece2666cb524836568b88bfc6016dabc6f\r\n- Bump Transformers to 4.56.1 by @xiaoxigua999 in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F564e4672dee0f1599c2dfe434f135a8c9570318f\r\n\r\n## New Contributors\r\n* @zhaoxu98 made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1124\r\n* @armsp made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1129\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.8.10...v0.8.11","2025-09-22T14:19:00",{"id":231,"version":232,"summary_zh":233,"released_at":234},333325,"v0.8.10","## What's Changed\r\n\r\n### 🐛 Bug Fixes\r\n- Fix remote reward model init bug in `PPOTrainerAsync` by [@makwingchi](https:\u002F\u002Fgithub.com\u002Fmakwingchi) ([PR #1113](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1113))  \r\n- Fix `balance_experiences` in `ppo_trainer_async.py` by [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) ([commit be13dfe](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fbe13dfeac16d47ffab68793f4337ce222995494b))  \r\n- Fix reward list `as_tensor` issue ([#1118](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fissues\u002F1118)) by [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) ([commit fb034ee](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Ffb034ee31db7330cc49ec7a34519ad456c3cc9f2))  \r\n- Fix DeepSpeed sleep with zero1\u002Fzero2 by [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) ([commit 746b945](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F746b9451c4486716496d9fb173db4b392d9ccf3e))  \r\n\r\n### ⬆️ Dependencies\r\n- Bump `vllm` to **0.10.1.1** by [@xiaoxigua999](https:\u002F\u002Fgithub.com\u002Fxiaoxigua999) ([commit f8ec7ca](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Ff8ec7ca6b5424f6d637d84a461477748e2cb2e03))  \r\n\r\n\r\n\r\n## New Contributors\r\n* @makwingchi made their first contribution in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1113\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.8.9...v0.8.10","2025-08-27T01:01:08",{"id":236,"version":237,"summary_zh":238,"released_at":239},333326,"v0.8.9.post1","## What's changed\r\n\r\n- [Fixed SFT\u002FDPO\u002FRay PPO Trainers for MoE Models (e.g., Qwen3 MoE)](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F200ee3570dcc9664d52a648739d7aba71e02816e)\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.8.9...v0.8.9.post1","2025-08-08T01:21:21",{"id":241,"version":242,"summary_zh":243,"released_at":244},333327,"v0.8.9","## What's Changed\r\n\r\n* Add TensorBoard logging to PRM training by @xjli360 in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1096  \r\n* Support [vLLM off-policy importance sampling correction](https:\u002F\u002Ffengyao.notion.site\u002Foff-policy-rl) by @xiaoxigua999 and @MooMoo-Yang in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1098  \r\n    * Requires vLLM version > 0.10  : `pip install -U vllm --pre --extra-index-url https:\u002F\u002Fwheels.vllm.ai\u002Fnightly`\r\n    * \u003Cimg width=\"1052\" height=\"150\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F65bdd91b-8777-42f7-b468-9e2a6d5238a8\" \u002F>\r\n\r\n* Fix weight broadcasting issue in Async RL with PyTorch 2.7.1 and vLLM 0.10 by @xiaoxigua999 in https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpull\u002F1100  \r\n* [Fix sequence-level loss calculation for GSPO](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F1b23ecf0b17f0842138f9a6a3216fa21baff5355) by @xiaoxigua999  \r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.8.8...v0.8.9\r\n","2025-08-06T09:45:26",{"id":246,"version":247,"summary_zh":248,"released_at":249},333328,"v0.8.8","## What's Changed\r\n\r\n* Upgraded **FlashAttention** to `v2.8.2` ([commit](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F9b117fd59af692e7cad5670708331a28db79a804)) – @xiaoxigua999  \r\n* Upgraded **Ray** to `v2.48` ([commit](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fd383283fe3a34308c2608885ee8275e2ce190976)) – @xiaoxigua999  \r\n* Upgraded **vLLM** to `v0.10.0` and **Transformers** to `v4.54.1` ([commit](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002Fb1e937bffa4a77319e62f427e9379200cb1a9fd7)) – @xiaoxigua999  \r\n* Upgraded **DeepSpeed** to `v0.17.3` ([commit](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F918f0737786495699391351d6c7bd332dd75d567)) – @xiaoxigua999  \r\n* Added logging support for **PPO_KL** ([commit](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F30a02e9c8329535e23d89514fb3ed583e47808b6)) – @xiaoxigua999  \r\n* Integrated **GSPO** support from the Qwen team ([commit](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcommit\u002F207a011972128675a701f7d847d915f9f72ab646)) – @xiaoxigua999  \r\n\r\n**Full Changelog**: [v0.8.7...v0.8.8](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fcompare\u002Fv0.8.7...v0.8.8)","2025-07-30T06:20:06"]