[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-alibaba--ROLL":3,"tool-alibaba--ROLL":64},[4,17,27,35,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,43,44,45,15,46,26,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,46],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},2181,"OpenHands","OpenHands\u002FOpenHands","OpenHands 是一个专注于 AI 
驱动开发的开源平台，旨在让智能体（Agent）像人类开发者一样理解、编写和调试代码。它解决了传统编程中重复性劳动多、环境配置复杂以及人机协作效率低等痛点，通过自动化流程显著提升开发速度。\n\n无论是希望提升编码效率的软件工程师、探索智能体技术的研究人员，还是需要快速原型验证的技术团队，都能从中受益。OpenHands 提供了灵活多样的使用方式：既可以通过命令行（CLI）或本地图形界面在个人电脑上轻松上手，体验类似 Devin 的流畅交互；也能利用其强大的 Python SDK 自定义智能体逻辑，甚至在云端大规模部署上千个智能体并行工作。\n\n其核心技术亮点在于模块化的软件智能体 SDK，这不仅构成了平台的引擎，还支持高度可组合的开发模式。此外，OpenHands 在 SWE-bench 基准测试中取得了 77.6% 的优异成绩，证明了其解决真实世界软件工程问题的能力。平台还具备完善的企业级功能，支持与 Slack、Jira 等工具集成，并提供细粒度的权限管理，适合从个人开发者到大型企业的各类用户场景。",70612,"2026-04-05T11:12:22",[26,15,13,45],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":110,"forks":111,"last_commit_at":112,"license":113,"difficulty_score":114,"env_os":115,"env_gpu":116,"env_ram":117,"env_deps":118,"category_tags":128,"github_topics":129,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":133,"updated_at":134,"faqs":135,"releases":181},2022,"alibaba\u002FROLL","ROLL","An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models","ROLL 是一个专为大语言模型（LLM）设计的强化学习优化库，帮助开发者更高效地利用大规模 GPU 资源提升模型在人类偏好对齐、复杂推理和多轮智能交互等任务中的表现。它解决了传统 RL 训练中资源调度复杂、训练效率低、部署门槛高的问题，让研究人员能更专注算法创新，而非底层工程细节。ROLL 采用基于 Ray 的多角色分布式架构，灵活分配计算资源，支持异构任务调度，并深度集成 Megatron-Core、SGLang 和 vLLM 等主流框架，显著加速训练与推理。特别支持 Qwen3.5 密集与 MoE 模型、FSDP2、LoRA 微调、GPU 计算重叠等前沿技术，降低大模型 RL 的实践成本。适合从事大语言模型强化学习研究的科研人员、AI 工程师和算法团队使用，尤其适合需要在多卡集群上训练和调优 LLM 的用户。开源且文档完善，欢迎社区共同探索。","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Falibaba_ROLL_readme_e1ad4b49f96f.jpeg\" width=\"40%\" alt=\"ROLL Logo\">\n\n# ROLL: Reinforcement Learning Optimization for Large-Scale Learning\n\n\u003Ch4>🚀 An Efficient and User-Friendly 
Scaling Library for Reinforcement Learning with Large Language Models 🚀\u003C\u002Fh4>\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue.svg\" alt=\"License\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Falibaba\u002FROLL\" alt=\"GitHub issues\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fstargazers\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002FROLL?style=social\" alt=\"Repo stars\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=arXiv&message=Paper&color=red\">\u003C\u002Fa>\n  \u003C!-- 组织主页：点击跳转到 https:\u002F\u002Fgithub.com\u002Falibaba -->\n  \u003Ca href=\".\u002Fassets\u002Froll_wechat.png\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-green?logo=wechat\" alt=\"WeChat QR\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002Falibaba\u002FROLL\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg\" alt=\"Ask DeepWiki\">\n  \u003C\u002Fa>\n  \u003Ca href=\".\u002Fassets\u002Ffuture_lab.png\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002FFutureLab2025?style=social\" alt=\"X QR\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\nROLL is an efficient and user-friendly RL library designed for Large Language Models (LLMs) utilizing Large Scale GPU resources. 
It significantly enhances LLM performance in key areas such as human preference alignment, complex reasoning, and multi-turn agentic interaction scenarios.\n\nLeveraging a multi-role distributed architecture with Ray for flexible resource allocation and heterogeneous task scheduling, ROLL integrates cutting-edge technologies like Megatron-Core, SGLang and vLLM to accelerate model training and inference.\n\n\n\n---\n\n## 📢 News\n\n| 📣   Updates                                                                                                                                                                                                                                                                                                                                                        |\n|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| **[03\u002F06\u002F2026]** 🎉 We support Qwen3.5 [Dense](examples\u002Fqwen3.5-35BA3-rlvr_megatron\u002Frlvr_megatron_80GB.yaml) and [MoE](examples\u002Fqwen3.5-35BA3-rlvr_megatron\u002Frlvr_megatron_80GB.yaml) series models and [on-policy distill](docs_roll\u002Fi18n\u002Fzh-Hans\u002Fdocusaurus-plugin-content-docs\u002Fcurrent\u002FUser Guides\u002FPipeline\u002Fon_policy_distill_pipeline_start.md). Welcome to use! |\n| **[02\u002F03\u002F2026]** 🎉 We released FSDP2 Strategy, Megatron with LoRA, GPU partial overlapping, Qwen3-Omni supports and other features. For more details, please refer to the release notes. Welcome to use!                                                                                                                                                           
|\n| **[01\u002F01\u002F2026]** 🎉 Our [Let It Flow: Agentic Crafting on Rock and Roll](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24873) report released! Introducing ALE ecosystem and ROME, an open-source agentic model with novel IPA algorithm. |\n| **[11\u002F08\u002F2025]** 🎉 Our [ROCK: Reinforcement Open Construction Kit](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROCK) released. Explore the new capabilities! |\n| **[10\u002F23\u002F2025]** 🎉 Our Papers released, see [Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01656) and [Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13554). |\n| **[10\u002F14\u002F2025]** 🎉 Our Paper released, see [Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11345). |\n| **[09\u002F28\u002F2025]** 🎉 Ascend NPU support — see [usage guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FHardware%20Support\u002Fascend_usage). 
|\n| **[09\u002F25\u002F2025]** 🎉 Our Paper released, see [RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21009)                                                                                                                                                                                                    |\n| **[09\u002F24\u002F2025]** 🎉 Support [Wan2_2 Reward FL pipeline](examples\u002Fwan2.2-14B-reward_fl_ds\u002Freward_fl_config.yaml). Explore the new capabilities!                                                                                                                                                                                                                      |\n| **[09\u002F23\u002F2025]** 🎉 ROLL aligns with GEM environment definition, providing agentic Tool Use training capabilities, [ToolUse docs](docs_roll\u002Fdocs\u002FEnglish\u002FUserGuide\u002Fagentic\u002FTool_Use.md).                                                                                                                                                                            |\n| **[09\u002F16\u002F2025]** 🎉 Qwen3-Next model training is supported, refer to [configuration](examples\u002Fqwen3-next-80BA3B-rlvr_megatron\u002Frlvr_config.yaml).                                                                                                                                                                                                                    |\n| **[09\u002F04\u002F2025]** 🎉 ROLL supports vLLM dynamic FP8 rollout and remove_padding for acceleration.                                                                                                                                                                                                                                                                     
|\n| **[08\u002F28\u002F2025]** 🎉 ROLL supports SFT pipeline, refer to [configuration](examples\u002Fqwen2.5-7B-sft_megatron\u002Fsft_config.yaml). |\n| **[08\u002F13\u002F2025]** 🎉 ROLL supports AMD GPUs with an out-of-the-box Docker image, a Dockerfile, and dedicated YAML configs under the `examples\u002F` directory. Please refer to [Installation](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FInstallation\u002F). |\n| **[08\u002F11\u002F2025]** 🎉 Our Paper released, see [Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08221). |\n| **[08\u002F10\u002F2025]** 🎉 Agentic RL supports [stepwise learning](examples\u002Fqwen2.5-0.5B-agentic\u002Fagent_val_frozen_lake_gigpo.yaml), like [GiGPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10978); Distill supports [VLM](examples\u002Fqwen2.5-vl-7B-distill\u002Fdistill_vl_megatron.yaml). Explore the new capabilities! |\n| **[08\u002F06\u002F2025]** 🎉 ROLL PPT is now available, [Slides](assets\u002FROLL%20高效且用户友好的大模型RL训练框架.pdf). 
|\n| **[07\u002F31\u002F2025]** 🎉 Refactor agentic rl design. Support agentic rl [async training](examples\u002Fqwen2.5-0.5B-agentic\u002Fagent_val_frozen_lake_async.yaml). Explore the new capabilities!                                                                                                                                                                                  |\n| **[07\u002F31\u002F2025]** 🎉 Support [DistillPipeline](examples\u002Fqwen2.5-7B-distill_megatron\u002Frun_distill_pipeline.sh)\u002F[DpoPipeline](examples\u002Fdpo_examples\u002Frun_dpo_pipeline.sh). Support [lora](examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_lora_zero3.yaml). Support [GSPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18071)                                                                      |\n| **[06\u002F25\u002F2025]** 🎉 Support thread env for env scaling and support [qwen2.5 VL agentic pipeline](examples\u002Fqwen2.5-vl-3B-agentic\u002Fagentic_val_sokoban.yaml).                                                                                                                                                                                                          |\n| **[06\u002F13\u002F2025]** 🎉 Support [Qwen2.5 VL rlvr pipeline](examples\u002Fqwen2.5-vl-7B-rlvr\u002Frlvr_megatron.yaml) and upgrade mcore to 0.12 version.                                                                                                                                                                                                                           |\n| **[06\u002F09\u002F2025]** 🎉 ROLL tech report is now available! Access the report [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122).                                                                                                                                                                                                                                                  
|\n| **[06\u002F08\u002F2025]** 🎉 Supports Qwen3([8B](examples\u002Fqwen3-8B-rlvr_megatron\u002Frlvr_config.yaml)\u002F14B\u002F32B), Qwen3-MoE([30A3](examples\u002Fqwen3-30BA3B-rlvr_megatron\u002Frlvr_config.yaml)\u002F[235A22](examples\u002Fqwen3-235BA22B-rlvr_megatron\u002Frlvr_config.yaml)), Qwen2.5([7B](examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_config.yaml)\u002F14B\u002F32B\u002F72B) LLM models. |\n| **[05\u002F30\u002F2025]** 🎉 Training [RLVR](examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_config.yaml) and [Agentic RL](examples\u002Fqwen2.5-0.5B-agentic\u002Fagent_val_frozen_lake.yaml) with ROLL is now available! Explore the new capabilities. |\n---\n\n\n## 🚀 Get Started\n\n[Documents](https:\u002F\u002Falibaba.github.io\u002FROLL\u002F)\n\n### Quick Start\n[Installation](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FInstallation\u002F)  \n[Config System Explanation](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fconfig_system)  \n[Debugging Guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FDebugging%20Guide\u002Fdebug_guide)  \n[Trackers and Metrics](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FTracker%20&%20Metrics\u002Ftrackers_and_metrics)  \n[Checkpoint Saving and Resuming Guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fcheckpoint_and_resume)  \n[Converting MCoreAdapter Models to Hugging Face Format](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fmegatron_convert_2_hf)  \n[Quick Start: Single-Node Deployment Guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FQuick%20Start\u002Fsingle_node_quick_start)  \n[Quick Start: Multi-Node Deployment Guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FQuick%20Start\u002Fmulti_nodes_quick_start)  \n[Quick Start: Using Alibaba Cloud Function Compute DevPod for Rapid Development](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FQuick%20Start\u002Faliyun_serverless_devpod_quick_start)  \n[Frequently Asked Questions](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FFAQ\u002Fqa_issues)\n\n### User Guide\n\n#### Pipeline Step by Step\n[RLVR Pipeline](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Frlvr_pipeline_start)  \n[Agentic Pipeline](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Fagentic_pipeline_start)  \n[Agentic Comprehensive Guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Fagent_pipeline_start)  \n[Distill Pipeline](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Fdistill_pipeline_start)\n\n#### Algorithms\n[Reinforce++](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FReinforce_Plus_Plus)  \n[TOPR](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FTOPR)  \n[GiGPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAgentic\u002Fagentic_GiGPO)  \n[PPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FPPO)  \n[Lite PPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FLitePPO)  \n[GRPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FGRPO)  
\n[GSPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FGSPO)  \n[RAFT++](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FRAFT_Plus_Plus)  \n[StarPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAgentic\u002Fagentic_StarPO)  \n[RewardFL](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FReward_FL)\n\n#### Backend\n[DeepSpeed](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fdeepspeed)  \n[Megatron](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fmegatron)  \n[vLLM](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fvllm)  \n[SGLang](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fsglang)\n\n#### Advanced Features\n[Asynchronous Parallel Rollout](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fasync_parallel_rollout)  \n[Asynchronous Training Feature](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fasync_training)\n\n#### Performance Optimization & Resource Management\n[Resource Config](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fdevice_mapping)  \n[GPU Time-Division Multiplexing Control](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Foffload_reload_control)\n\n#### ROLL x Ascend\n[Ascend Usage Guide](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FHardware%20Support\u002Fascend_usage)\n\n---\n\n## ✨ Key Features\n*   **Multi-task RL Training (RLVR):** Covers mathematics, coding, general reasoning, open-ended Q&A, instruction following, etc.\n    *   Flexible `domain_batch_size` distribution control.\n    *   **Sample-level asynchronous parallel rollout**, asynchronous reward calculation, and dynamic sampling.\n    *   Asynchronous training under implementation.\n*   **Agentic RL:** Multi-turn interaction capabilities for games, multi-turn dialogues, tool use, etc.\n    *   Environment-level **asynchronous parallel rollout**.\n    *   Supports **asynchronous training**.\n    *   Multi-turn interaction rollout supports **local debugging**, improving multi-turn interaction business development efficiency.\n    *   Supports **TrajectoryWise (StarPO)** and **StepWise (GiGPO)** training paradigms.\n*   **Algorithm-Friendly:** Provides flexible and rich RL strategy configurations by default.\n    *   Over 20 reinforcement learning strategy options, such as reward normalization, reward clipping, various advantage estimation methods, etc.\n    *   Out-of-the-box support for reinforcement learning algorithms, such as **PPO, GRPO, Reinforce++, TOPR, RAFT++, GSPO**, etc.\n*   **Rich Training and Inference Engine:** Ray-based multi-role distributed architecture; Strategy abstraction unifies various backends, enabling easy operation from single machines to thousands-of-GPU clusters.\n    *   Inference\u002FGeneration supports vLLM, SGLang.\n    *   Training supports DeepSpeed (ZeRO), Megatron-LM 5D parallelism (mcore-adapter, dp\u002Ftp\u002Fpp\u002Fcp\u002Fep), FSDP under implementation.\n    *   Extreme offload\u002Freload capabilities.\n    *   Supports [LoRA](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Flora) training.\n    *   Supports FP8 rollout (FP8 inference for LLM as judge, FP8 rollout with BF16 training under development).\n*   **AutoDeviceMapping:** Supports custom device mapping for different roles, flexibly managing colocated and disaggregated deployments.\n*   **Observability:** Integrated with SwanLab \u002F WandB \u002F 
TensorBoard, tracking of performance for each domain and reward type.\n*   **Rich Post-training Technical Support:**\n    *   Agentic RL LLM & VLM\n    *   RLVR LLM & VLM\n    *   Distill Pipeline LLM & VLM\n    *   DPO Pipeline\n    *   SFT Pipeline under development\n\n---\n\n## 🏆 Notable work based on ROLL\n- [ComplementaryRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.17621): Complementary RL is a learning framework that enables agents to effectively learn from experience through the seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop.\n- [RLix](https:\u002F\u002Fgithub.com\u002Frlops\u002Frlix): RLix is an RL job manager that lets more RL jobs run concurrently with less waiting by sharing GPU capacity across jobs, while preserving each pipeline’s training behavior and improving GPU utilization.\n- [TurningPoint-GRPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06422): A GRPO framework for Flow Matching models in text-to-image generation that alleviates step-wise reward sparsity by modeling step-level incremental rewards and explicitly captures long-term effects via turning points detection, providing dense learning signals for each denoising action.\n- [STAgent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24957): An agentic LLM specialized for spatio-temporal understanding and complex tasks like constrained POI discovery and itinerary planning, featuring hierarchical data curation with 1:10,000 filter ratio and cascaded training (seed SFT + difficulty-aware SFT + RL), achieving strong performance on TravelBench while preserving general capabilities.\n- [IPRO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.14255): A novel video diffusion framework using reinforcement learning to enhance identity preservation in human-centric I2V generation, optimizing diffusion models with face identity scorer and KL-divergence regularization.\n- [TaoSR-SHE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07972): Stepwise 
Hybrid Examination Reinforcement Learning Framework for Taobao Search Relevance, with SRPO (hybrid reward model + offline verifier), diversified data filtering, and multi-stage curriculum learning.\n- [EARL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05943): Efficient Agentic RL Systems for LLMs, introducing a dynamic parallelism selector and a layout-aware data dispatcher to boost throughput, reduce memory and data movement bottlenecks, enabling stable large-scale agentic RL without hard context-length limits.\n- [LiveThinking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07685): Real-time reasoning for AI-powered livestreaming by distilling a 670B teacher LLM to a 30B MoE (3B active) via Rejection Sampling Fine-Tuning, then compressing reasoning with GRPO; delivers sub-second latency and ~30x compute reduction, with gains in response correctness (3.3%), helpfulness (21.8%), and GMV in Taobao Live.\n- [TaoSR-AGRL](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2510.08048): Adaptive Guided Reinforcement Learning for LLM-based e-commerce relevance, introducing Rule-aware Reward Shaping and Adaptive Guided Replay to improve long-horizon reasoning, rule adherence, and training stability in Taobao Search; deployed in main search handling hundreds of millions of users.\n- [RecGPT](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2507.22879): a next-generation, LLM-driven framework that places user intent at the core of recommender systems, fostering a more sustainable and mutually beneficial ecosystem.\n- [TaoSR1](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12365): A novel LLM framework directly deploying Chain-of-Thought (CoT) reasoning for e-commerce query-product relevance prediction, overcoming deployment challenges for superior performance.\n- [AIGB-Pearl](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2509.15927): a novel auto-bidding method that integrates generative planning and policy optimization, utilizing an LLM-enhanced trajectory evaluator to 
iteratively refine bidding strategies for state-of-the-art advertising performance.\n-----\n\n## 🙏 Citation and Acknowledgement\n\nROLL is inspired by the design of OpenRLHF, VeRL, Nemo-Aligner, and RAGEN.\nThe project is developed by Alibaba TAOBAO & TMALL Group and Alibaba Group. The code is distributed under the Apache License (Version 2.0). This product contains various third-party components under other open-source licenses. See the `NOTICE` file for more information.\n\nThe following repositories have been used in ROLL, either in their close-to-original form or as an inspiration:\n\n  * [NVIDIA\u002FMegatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM)\n  * [microsoft\u002FDeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n  * [sgl-project\u002Fsglang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang)\n  * [vllm-project\u002Fvllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n  * [modelscope\u002FDiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio)\n\nIf you use ROLL in your research or project, please consider citing us:\n\n```bibtex\n@article{wang2025reinforcement,\n  title={Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library},\n  author={Wang, Weixun and Xiong, Shaopan and Chen, Gengru and Gao, Wei and Guo, Sheng and He, Yancheng and Huang, Ju and Liu, Jiaheng and Li, Zhendong and Li, Xiaoyang and others},\n  journal={arXiv preprint arXiv:2506.06122},\n  year={2025}\n}\n```\n\n\n\n-----\n\n## 🤝 About [ROCK & ROLL Team]\nROLL is a project jointly developed by Taotian Future Living Lab and Alibaba AI Engine Team, with a strong emphasis on pioneering the future of Reinforcement Learning (RL). Our mission is to explore and shape innovative forms of future living powered by advanced RL technologies. 
If you are passionate about the future of RL and want to be part of its evolution, we warmly welcome you to join us!👇\n\n\u003Ca href=\".\u002Fassets\u002Froll_wechat.png\" target=\"_blank\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-green?logo=wechat\" alt=\"WeChat QR\">\n\u003C\u002Fa>\n\u003Ca href=\".\u002Fassets\u002Ffuture_lab.png\" target=\"_blank\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002FFutureLab2025?style=social\" alt=\"X QR\">\n\u003C\u002Fa>\n\n-----\nWe are HIRING!\n- Post-Training Infra R&D Engineer [JD link](https:\u002F\u002Ftalent-holding.alibaba.com\u002Foff-campus\u002Fposition-detail?lang=zh&positionId=7000016304)\n- LLM Training Expert:\n  - (Experienced hire) [JD link](https:\u002F\u002Ftalent.taotian.com\u002Foff-campus\u002Fposition-detail?lang=zh&positionId=7000024203)\n  - (Campus hire) [JD link](https:\u002F\u002Ftalent.taotian.com\u002Fcampus\u002Fposition-detail?positionId=199900140053)\n- Infra Research Intern [JD link](https:\u002F\u002Ftalent-holding.alibaba.com\u002Fcampus\u002Fposition-detail?lang=zh&positionId=59900004115)\n\n-----\n\n\u003Cdiv align=\"center\">\nWe welcome contributions from the community! 
🤝\n\u003C\u002Fdiv>\n","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Falibaba_ROLL_readme_e1ad4b49f96f.jpeg\" width=\"40%\" alt=\"ROLL Logo\">\n\n# ROLL：面向大规模学习的强化学习优化框架\n\n\u003Ch4>🚀 一款高效且易用的强化学习规模化库，助力大型语言模型 🚀\u003C\u002Fh4>\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue.svg\" alt=\"License\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Falibaba\u002FROLL\" alt=\"GitHub issues\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fstargazers\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002FROLL?style=social\" alt=\"Repo stars\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06122\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=arXiv&message=Paper&color=red\">\u003C\u002Fa>\n  \u003C!-- 组织主页：点击跳转到 https:\u002F\u002Fgithub.com\u002Falibaba -->\n  \u003Ca href=\".\u002Fassets\u002Froll_wechat.png\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-green?logo=wechat\" alt=\"WeChat QR\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002Falibaba\u002FROLL\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg\" alt=\"Ask DeepWiki\">\n  \u003C\u002Fa>\n  \u003Ca href=\".\u002Fassets\u002Ffuture_lab.png\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002FFutureLab2025?style=social\" alt=\"X QR\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\nROLL 是一款专为大型语言模型（LLMs）设计的高效、易用的强化学习库，充分利用大规模 GPU 资源。它显著提升了 
LLM 在关键领域的性能，包括人类偏好对齐、复杂推理以及多轮智能体交互场景。\n\nROLL 基于 Ray 的多角色分布式架构，实现灵活的资源分配与异构任务调度，并融合 Megatron-Core、SGLang 和 vLLM 等前沿技术，加速模型训练与推理过程。\n\n\n\n---\n\n## 📢 最新动态\n\n| 📣   更新内容                                                                                                                                                                                                                                                                                                                                                        |\n|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| **[2026年3月6日]** 🎉 我们支持Qwen3.5 [密集型](examples\u002Fqwen3.5-35BA3-rlvr_megatron\u002Frlvr_megatron_80GB.yaml)和[MoE型](examples\u002Fqwen3.5-35BA3-rlvr_megatron\u002Frlvr_megatron_80GB.yaml)系列模型，以及[在线策略蒸馏](docs_roll\u002Fi18n\u002Fzh-Hans\u002Fdocusaurus-plugin-content-docs\u002Fcurrent\u002FUser Guides\u002FPipeline\u002Fon_policy_distill_pipeline_start.md)。欢迎使用！ |\n| **[2026年2月3日]** 🎉 我们发布了FSDP2策略、带LoRA的Megatron、GPU部分重叠、Qwen3-Omni支持等新功能。更多详情，请参阅发布说明。欢迎使用！                                                                                                                                                           |\n| **[2026年1月1日]** 🎉 我们的[让其流动：摇滚乐中的智能体创作](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24873)报告发布！介绍ALE生态系统和ROME，一种采用新颖IPA算法的开源智能体模型。                                                                                                                                              |\n| **[2025年11月8日]** 🎉 我们的[ROCK：强化开放构建套件](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROCK)发布，探索全新能力！                                                                                                                             
|\n| **[2025年10月23日]** 🎉 我们的论文发布，详见[非对称近端策略优化：迷你批评家助力大模型推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01656)和[注意力照亮大模型推理：预计划与锚定节奏实现精细策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13554)。 |\n| **[2025年10月14日]** 🎉 我们的论文发布，详见[第二部分：ROLL Flash——利用异步加速RLVR与智能体训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11345)。 |\n| **[2025年9月28日]** 🎉 支持昇腾NPU——请参阅[使用指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FHardware%20Support\u002Fascend_usage)。 |\n| **[2025年9月25日]** 🎉 我们的论文发布，详见[RollPacker：缓解长尾 Rollout 以实现快速同步的强化学习后训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21009)。 |\n| **[2025年9月24日]** 🎉 支持[Wan2_2奖励FL流水线](examples\u002Fwan2.2-14B-reward_fl_ds\u002Freward_fl_config.yaml)。探索全新能力！ |\n| **[2025年9月23日]** 🎉 ROLL与GEM环境定义对齐，提供智能体工具使用训练能力，[工具使用文档](docs_roll\u002Fdocs\u002FEnglish\u002FUserGuide\u002Fagentic\u002FTool_Use.md)。
|\n| **[2025年9月16日]** 🎉 支持Qwen3-Next模型训练，请参考[配置文件](examples\u002Fqwen3-next-80BA3B-rlvr_megatron\u002Frlvr_config.yaml)。 |\n| **[2025年9月4日]** 🎉 ROLL支持vLLM动态FP8 Rollout及remove_padding以加速。 |\n| **[2025年8月28日]** 🎉 ROLL支持SFT流水线，请参考[配置文件](examples\u002Fqwen2.5-7B-sft_megatron\u002Fsft_config.yaml)。 |\n| **[2025年8月13日]** 🎉 ROLL支持AMD GPU，提供开箱即用的Docker镜像与Dockerfile，并在`examples\u002F`目录下提供特定yaml文件。请参阅[安装指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FInstallation\u002F)。 |\n| **[2025年8月11日]** 🎉 我们的论文发布，详见[第一部分：技巧还是陷阱？深入探究用于大模型推理的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08221)。 |\n| **[2025年8月10日]** 🎉 智能体强化学习支持[逐步学习](examples\u002Fqwen2.5-0.5B-agentic\u002Fagent_val_frozen_lake_gigpo.yaml)，如[GiGPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10978)；蒸馏支持[VLM](examples\u002Fqwen2.5-vl-7B-distill\u002Fdistill_vl_megatron.yaml)。探索全新能力！ |\n| 
**[2025年8月6日]** 🎉 ROLL PPT现已发布，[幻灯片](assets\u002FROLL%20高效且用户友好的大模型RL训练框架.pdf)。 |\n| **[2025年7月31日]** 🎉 重构智能体强化学习设计。支持智能体强化学习[异步训练](examples\u002Fqwen2.5-0.5B-agentic\u002Fagent_val_frozen_lake_async.yaml)。探索全新能力！ |\n| **[2025年7月31日]** 🎉 支持[DistillPipeline](examples\u002Fqwen2.5-7B-distill_megatron\u002Frun_distill_pipeline.sh)\u002F[DpoPipeline](examples\u002Fdpo_examples\u002Frun_dpo_pipeline.sh)。支持[LoRA](examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_lora_zero3.yaml)。支持[GSPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18071)。 |\n| **[2025年6月25日]** 🎉 支持线程环境用于环境扩展，并支持[qwen2.5 VL智能体流水线](examples\u002Fqwen2.5-vl-3B-agentic\u002Fagentic_val_sokoban.yaml)。 |\n| **[2025年6月13日]** 🎉 支持[Qwen2.5 VL rlvr流水线](examples\u002Fqwen2.5-vl-7B-rlvr\u002Frlvr_megatron.yaml)并升级mcore至0.12版本。 
|\n| **[2025年6月8日]** 🎉 支持Qwen3（[8B](examples\u002Fqwen3-8B-rlvr_megatron\u002Frlvr_config.yaml)\u002F14B\u002F32B）、Qwen3-MoE（[30A3](examples\u002Fqwen3-30BA3B-rlvr_megatron\u002Frlvr_config.yaml)\u002F[235A22](examples\u002Fqwen3-235BA22B-rlvr_megatron\u002Frlvr_config.yaml)）、Qwen2.5（[7B](examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_config.yaml)\u002F14B\u002F32B\u002F72B）大模型。 |\n| **[2025年5月30日]** 🎉 使用ROLL训练[RLVR](examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_config.yaml)和[智能体强化学习](examples\u002Fqwen2.5-0.5B-agentic\u002Fagent_val_frozen_lake.yaml)现已可用！探索全新能力。 |\n---\n\n## 🚀 快速入门\n\n[文档](https:\u002F\u002Falibaba.github.io\u002FROLL\u002F)\n\n### 快速开始\n[安装](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FInstallation\u002F)  \n[配置系统说明](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fconfig_system)  \n[调试指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FDebugging%20Guide\u002Fdebug_guide)  \n[追踪器与指标](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FTracker%20&%20Metrics\u002Ftrackers_and_metrics)  \n[检查点保存与恢复指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fcheckpoint_and_resume)  \n[将MCoreAdapter模型转换为Hugging Face格式](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fmegatron_convert_2_hf)  \n[快速开始：单节点部署指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FQuick%20Start\u002Fsingle_node_quick_start)  \n[快速开始：多节点部署指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FQuick%20Start\u002Fmulti_nodes_quick_start)  
\n[快速开始：使用阿里云函数计算DevPod进行快速开发](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FQuick%20Start\u002Faliyun_serverless_devpod_quick_start)\n[常见问题](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FGetting%20Started\u002FFAQ\u002Fqa_issues)\n\n### 用户指南\n\n#### 管道逐步详解\n[RLVR管道](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Frlvr_pipeline_start)  \n[代理式管道](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Fagentic_pipeline_start)  \n[代理式综合指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Fagent_pipeline_start)  \n[蒸馏管道](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FPipeline\u002Fdistill_pipeline_start)\n\n#### 算法\n[Reinforce++](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FReinforce_Plus_Plus)  \n[TOPR](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FTOPR)  \n[GiGPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAgentic\u002Fagentic_GiGPO)  \n[PPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FPPO)  \n[Lite PPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FLitePPO)  \n[GRPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FGRPO)  \n[GSPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FGSPO)  \n[RAFT++](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FRAFT_Plus_Plus)  \n[StarPO](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAgentic\u002Fagentic_StarPO)   
\n[RewardFL](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAlgorithms\u002FReward_FL)\n\n#### 后端\n[DeepSpeed](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fdeepspeed)  \n[Megatron](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fmegatron)  \n[vLLM](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fvllm)  \n[SGLang](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fsglang)\n\n#### 高级功能\n[异步并行Rollout](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fasync_parallel_rollout)  \n[异步训练功能](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Fasync_training)  \n\n#### 性能优化与资源管理\n[资源配置](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fdevice_mapping)  \n[GPU时间分割复用控制](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FAdvanced%20Features\u002Foffload_reload_control)  \n\n#### ROLL x Ascend\n[Ascend使用指南](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FHardware%20Support\u002Fascend_usage)\n\n---\n\n## ✨ 核心特性\n*   **多任务强化学习训练（RLVR）：** 涵盖数学、编码、通用推理、开放式问答、指令跟随等。\n    *   灵活的`domain_batch_size`分布控制。\n    *   **样本级异步并行Rollout**，异步奖励计算与动态采样。\n    *   异步训练实现中。\n*   **代理式强化学习：** 游戏、多轮对话、工具使用等多轮交互能力。\n    *   环境级**异步并行Rollout**。\n    *   支持**异步训练**。\n    *   多轮交互Rollout支持**本地调试**，提升多轮交互业务开发效率。\n    *   支持**轨迹式（StarPO）**和**步骤式（GiGPO）**训练范式。\n*   **算法友好：** 默认提供灵活丰富的强化学习策略配置。\n    *   超过20种丰富的强化学习策略选项，如奖励归一化、奖励裁剪、多种优势估计方法等。\n    *   开箱即用支持强化学习算法，如**PPO、GRPO、Reinforce++、TOPR、RAFT++、GSPO**等。\n*   **丰富的训练与推理引擎：** 基于Ray的多角色分布式架构；策略抽象统一各类后端，实现从单机到数千GPU集群的轻松操作。\n    *   推理\u002F生成支持vLLM、SGLang。\n    *   训练支持DeepSpeed（ZeRO）、Megatron-LM 
5D并行（mcore-adapter、dp\u002Ftp\u002Fpp\u002Fcp\u002Fep）、FSDP实施中。\n    *   极致的卸载\u002F重载能力。\n    *   支持[LoRA](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Flora)训练。\n    *   支持FP8 Rollout（LLM作为判别器的FP8推理，FP8 Rollout与BF16训练研发中）。\n*   **AutoDeviceMapping：** 支持不同角色的自定义设备映射，灵活管理共置与分离部署。\n*   **可观测性：** 集成SwanLab \u002F WandB \u002F TensorBoard，跟踪各领域与奖励类型的性能。\n*   **丰富的训练后技术支持：**\n    *   代理式强化学习LLM与VLM\n    *   RLVR LLM与VLM\n    *   蒸馏管道LLM与VLM\n    *   DPO管道\n    *   SFT管道研发中\n\n---\n\n## 🏆 基于ROLL的杰出工作\n- [ComplementaryRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.17621)：互补强化学习是一种学习框架，它通过在强化学习优化循环中无缝协同进化经验提取器与策略执行器，使智能体能够从经验中高效学习。\n- [RLix](https:\u002F\u002Fgithub.com\u002Frlops\u002Frlix)：RLix是一款强化学习作业管理器，通过在不同作业间共享GPU资源，让更多的强化学习任务能够并行运行而减少等待时间，同时保持每个流水线的训练行为并提升GPU利用率。\n- [TurningPoint-GRPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06422)：一种用于文本到图像生成中流匹配模型的GRPO框架，通过建模逐级增量奖励并显式捕捉转折点检测带来的长期效应，缓解了逐级奖励稀疏的问题，为每次去噪动作提供密集的学习信号。\n- [STAgent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24957)：一款专门针对时空理解及复杂任务（如受限POI发现和行程规划）的代理式大语言模型，采用1:10,000的过滤比例进行分层数据筛选，并实施级联训练（种子SFT + 难度感知SFT + 强化学习），在TravelBench上表现强劲，同时保留通用能力。\n- [IPRO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.14255)：一种新颖的视频扩散框架，利用强化学习增强以人为中心的I2V生成中的身份保护，通过人脸身份评分器和KL散度正则化优化扩散模型。\n- [TaoSR-SHE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07972)：淘宝搜索相关性逐步混合检查强化学习框架，包含SRPO（混合奖励模型+离线验证器）、多样化数据过滤以及多阶段课程学习。\n- [EARL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05943)：面向大语言模型的高效代理式强化学习系统，引入动态并行度选择器和布局感知数据调度器，以提升吞吐量、降低内存和数据移动瓶颈，实现稳定的大规模代理式强化学习而无需硬性上下文长度限制。\n- [LiveThinking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07685)：通过拒绝采样微调将670B教师大语言模型蒸馏至30B MoE（3B激活），再用GRPO压缩推理，为AI驱动的直播提供实时推理能力；延迟低至秒级，计算量减少约30倍，且在响应正确率（3.3%）、有用性（21.8%）以及淘宝直播GMV方面均有显著提升。\n- [TaoSR-AGRL](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2510.08048)：基于大语言模型的电商相关性自适应引导强化学习，引入规则感知奖励塑造和自适应引导回放，以提升淘宝搜索中的长程推理、规则遵从性和训练稳定性；已部署于主搜索，服务数亿用户。\n- 
[RecGPT](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2507.22879)：新一代大语言模型驱动的框架，将用户意图置于推荐系统的核心，促进更可持续且互利共赢的生态系统。\n- [TaoSR1](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12365)：一种全新的大语言模型框架，直接部署思维链（CoT）推理用于电商查询-商品相关性预测，克服了部署难题，实现了卓越性能。\n- [AIGB-Pearl](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2509.15927)：一种新颖的自动出价方法，融合生成式规划与策略优化，利用大语言模型增强的轨迹评估器迭代优化出价策略，实现顶尖广告效果。\n-----\n\n## 🙏 致谢与引用\n\nROLL的设计灵感来源于OpenRLHF、VeRL、Nemo-Aligner和RAGEN。\n该项目由阿里巴巴淘宝天猫集团与阿里巴巴集团共同开发。代码采用Apache License（版本2.0）授权发布。本产品包含多种其他开源许可下的第三方组件，请参阅`NOTICE`文件获取更多信息。\n\n以下仓库在ROLL中被使用，或以其原始形式，或作为灵感来源：\n\n  * [NVIDIA\u002FMegatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM)\n  * [microsoft\u002FDeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n  * [sgl-project\u002Fsglang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang)\n  * [vllm-project\u002Fvllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n  * [modelscope\u002FDiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio)\n\n如果您在研究或项目中使用ROLL，请考虑引用我们：\n\n```bibtex\n@article{wang2025reinforcement,\n  title={Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library},\n  author={Wang, Weixun and Xiong, Shaopan and Chen, Gengru and Gao, Wei and Guo, Sheng and He, Yancheng and Huang, Ju and Liu, Jiaheng and Li, Zhendong and Li, Xiaoyang and others},\n  journal={arXiv preprint arXiv:2506.06122},\n  year={2025}\n}\n```\n\n\n\n-----\n\n## 🤝 关于[ROCK & ROLL团队]\nROLL是由淘天未来生活实验室与阿里巴巴AI引擎团队联合开发的项目，专注于探索强化学习（RL）的未来发展方向。我们的使命是通过先进的强化学习技术，探索并塑造未来生活的创新形态。如果你对强化学习的未来充满热情，希望参与这一领域的变革，我们热烈欢迎你的加入！👇\n\n\u003Ca href=\".\u002Fassets\u002Froll_wechat.png\" target=\"_blank\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-green?logo=wechat\" alt=\"WeChat二维码\">\n\u003C\u002Fa>\n\u003Ca href=\".\u002Fassets\u002Ffuture_lab.png\" target=\"_blank\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002FFutureLab2025?style=social\" alt=\"X二维码\">\n\u003C\u002Fa>\n\n-----\n我们正在招聘！\n- 训练基础设施研发工程师 
[职位链接](https:\u002F\u002Ftalent-holding.alibaba.com\u002Foff-campus\u002Fposition-detail?lang=zh&positionId=7000016304)\n- 大模型训练专家：\n  - （社招）[职位链接](https:\u002F\u002Ftalent.taotian.com\u002Foff-campus\u002Fposition-detail?lang=zh&positionId=7000024203)\n  - （校招）[职位链接](https:\u002F\u002Ftalent.taotian.com\u002Fcampus\u002Fposition-detail?positionId=199900140053)\n- 基础设施研究实习生 [职位链接](https:\u002F\u002Ftalent-holding.alibaba.com\u002Fcampus\u002Fposition-detail?lang=zh&positionId=59900004115)\n\n-----\n\n\u003Cdiv align=\"center\">\n我们欢迎社区贡献！🤝\n\u003C\u002Fdiv>","# ROLL 快速上手指南\n\n## 环境准备\n\n- **系统要求**：Linux（推荐 Ubuntu 20.04\u002F22.04），NVIDIA GPU（A100\u002FH100 推荐），CUDA 12.1+\n- **前置依赖**：\n  - Python 3.10+\n  - PyTorch 2.3+（推荐使用 [清华源](https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002F) 安装）\n  - NVIDIA 驱动 ≥ 535\n  - Docker（可选，推荐使用官方镜像加速）\n\n> 推荐使用阿里云容器镜像服务加速：`registry.cn-hangzhou.aliyuncs.com\u002Falibaba\u002Froll`\n\n## 安装步骤\n\n1. **克隆仓库**：\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL.git\n   cd ROLL\n   ```\n\n2. **安装依赖**（推荐使用清华源）：\n   ```bash\n   pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple -r requirements.txt\n   ```\n\n3. **（可选）使用 Docker 快速启动**：\n   ```bash\n   docker pull registry.cn-hangzhou.aliyuncs.com\u002Falibaba\u002Froll:latest\n   docker run --gpus all -it --rm -v $(pwd):\u002Fworkspace registry.cn-hangzhou.aliyuncs.com\u002Falibaba\u002Froll:latest\n   ```\n\n## 基本使用\n\n使用 ROLL 训练 Qwen2.5-7B 的 RLVR 模型最简示例：\n\n1. **准备配置文件**：\n   ```bash\n   cp examples\u002Fqwen2.5-7B-rlvr_megatron\u002Frlvr_config.yaml .\u002Fconfig.yaml\n   ```\n\n2. **运行训练**：\n   ```bash\n   python train.py --config .\u002Fconfig.yaml\n   ```\n\n3. 
**查看训练日志与指标**：\n   - 默认输出至 `logs\u002F` 目录\n   - 支持 TensorBoard \u002F W&B，配置见 [Tracker & Metrics 文档](https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fdocs\u002FUser%20Guides\u002FTracker%20&%20Metrics\u002Ftrackers_and_metrics)\n\n> 更多模型支持（Qwen3、MoE、VL 等）请参考 `examples\u002F` 目录下对应 YAML 配置文件。","某AI创业公司正在开发一款智能客服代理系统，需基于Qwen3.5-MoE模型通过强化学习优化多轮对话策略，以提升用户满意度和任务完成率。团队拥有8张A100 GPU，但缺乏高效调度与训练框架，进展缓慢。\n\n### 没有 ROLL 时\n- 模型训练每次只能用1-2张卡，其余GPU闲置，资源利用率不足30%\n- 多轮对话的奖励建模依赖手动编写规则，难以捕捉用户真实偏好，准确率低于65%\n- 训练与推理分离，每次迭代需重启服务，从调整策略到验证效果耗时超过8小时\n- 使用传统PyTorch框架时，显存溢出频繁，工程师每天花2小时调试内存问题\n- 无法并行处理不同用户场景的对话样本，模型泛化能力差，上线后客服错误率高达22%\n\n### 使用 ROLL 后\n- 借助Ray分布式架构，8张A100全量协同训练，显存利用率提升至92%，训练速度提升5倍\n- 内置人类偏好对齐模块，自动学习真实对话中的满意信号，用户满意度评分从4.1提升至4.7（5分制）\n- 集成vLLM与SGLang，训练与推理无缝衔接，策略调整后30分钟内即可完成A\u002FB测试验证\n- 支持Megatron+LoRA混合训练，显存占用降低60%，再无内存溢出崩溃，工程师专注模型优化而非调试\n- 支持多角色异构任务调度，可同时训练“退款咨询”“产品推荐”等12类对话场景，上线后错误率降至7.3%\n\nROLL 让团队在两周内完成原本需三个月的强化学习优化，直接推动客服系统自动化率提升40%。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Falibaba_ROLL_e1ad4b49.jpg","alibaba","Alibaba","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Falibaba_f65f7221.png","Alibaba Open Source",null,"https:\u002F\u002Fopensource.alibaba.com\u002F","https:\u002F\u002Fgithub.com\u002Falibaba",[83,87,91,95,99,103,106],{"name":84,"color":85,"percentage":86},"Python","#3572A5",96.3,{"name":88,"color":89,"percentage":90},"JavaScript","#f1e05a",1.6,{"name":92,"color":93,"percentage":94},"HTML","#e34c26",1,{"name":96,"color":97,"percentage":98},"MDX","#fcb32c",0.5,{"name":100,"color":101,"percentage":102},"CSS","#663399",0.3,{"name":104,"color":105,"percentage":102},"Shell","#89e051",{"name":107,"color":108,"percentage":109},"Makefile","#427819",0,3047,263,"2026-04-05T06:54:30","Apache-2.0",5,"Linux","需要 NVIDIA GPU，显存 8GB+，CUDA 11.7+","未说明",{"notes":119,"python":117,"dependencies":120},"建议使用 conda 管理环境，首次运行需下载模型文件（如 Qwen3 系列可达数十GB），支持 NVIDIA GPU 和 Ascend NPU，推荐使用 Docker 
镜像简化部署，部分功能需配置多卡分布式环境。",[121,122,123,124,125,126,127],"torch","transformers","accelerate","ray","megatron-core","sglang","vllm",[15],[130,131,132],"agentic","rlhf","rlvr","2026-03-27T02:49:30.150509","2026-04-06T06:44:21.885740",[136,141,145,150,154,159,164,169,173,177],{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},9174,"修改 YAML 配置文件后，部分参数（如 num_nodes）不生效，如何排查？","num_nodes 参数不会直接生效，实际设备数量由 device_mapping 决定。例如，若 device_mapping 设置为 [0,1,2,3] 且 num_gpus_per_node=4，则系统自动推算为 1 个节点。应确保 device_mapping 的 GPU 索引范围与实际可用卡数匹配（如 0~3），并避免手动设置 num_nodes，而是通过 device_mapping 控制资源分配。","https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\u002F243",{"id":142,"question_zh":143,"answer_zh":144,"source_url":140},9175,"如何正确配置 device_mapping 以适配多卡环境？","device_mapping 是决定实际使用 GPU 的唯一依据。例如，若机器有 8 张卡，可配置：actor_train.device_mapping=list(range(0,4))，actor_infer.device_mapping=list(range(0,2))，llm_judge.device_mapping=list(range(2,4))。total_device 由 device_mapping 自动推算，无需设置 num_nodes。详细规则请参考官方文档：https:\u002F\u002Falibaba.github.io\u002FROLL\u002Fzh-Hans\u002Fdocs\u002FUser%20Guides\u002FConfiguration\u002Fdevice_mapping",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},9176,"使用 SGLang 时出现 CUDA 非法内存访问错误，如何修复？","该错误通常由 vLLM 的 flash-attn 版本不兼容导致。解决方案是：(1) 降级 vLLM 并设置 export VLLM_FLASH_ATTN_VERSION=2；(2) 或切换至 SGLang 并安装指定版本：pip install uvloop==0.21.0，以解决 Python 3.12 与 SGLang 的依赖冲突。同时检查是否使用了非官方 Docker 镜像。","https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\u002F253",{"id":151,"question_zh":152,"answer_zh":153,"source_url":140},9177,"如何确认配置文件是否被正确加载？","在启动脚本中添加 print(OmegaConf.to_yaml(cfg, resolve=True)) 和 print(ppo_config) 可输出最终解析的配置内容。若修改后参数未生效，需检查是否被其他配置文件覆盖（如默认配置或命令行参数），并确保 config_path 和 config_name 指向正确的 YAML 文件路径。",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},9170,"RLVR训练中模型回复长度不断减少直至为0，如何解决？","该问题可通过以下三个步骤解决：(1) 应用 PR #304 的修复补丁；(2) 在代码中添加判断：if self.worker_config.use_dynamic_batching_in_train and 
self.pipeline_config.loss_agg_mode in [\"seq-mean-token-sum\", \"seq-mean-token-mean\"]:；(3) 将 actor_train 的 infer_batch_size 设置为 1（除非使用动态批处理）。同时建议使用 vLLM 0.11.0 和 SGLang 0.5.2 版本组合。","https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\u002F303",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},9171,"如何开启或关闭 FP8 rollout 功能？","FP8 rollout 的启用与关闭依赖于底层推理引擎的配置。根据 Issue #207 的讨论，需确保使用兼容的镜像版本，并检查 vLLM 或 SGLang 的环境变量设置（如 VLLM_FLASH_ATTN_VERSION），FP8 功能通常由框架自动检测硬件支持情况，无独立开关配置。","https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\u002F197",{"id":165,"question_zh":166,"answer_zh":167,"source_url":168},9172,"单机多卡训练 7B 模型时，推荐的超参数如何设置？","建议参考最新重构的配置文件（见 Issue #112），关键参数包括：learning_rate: 1.0e-6, per_device_train_batch_size: 2, gradient_accumulation_steps: 32, warmup_steps: 20, num_train_epochs: 50。同时确保使用与官方一致的模板（如 qwen2_5）和模型路径，避免与 TRL\u002FVERL 的默认配置混淆。","https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL\u002Fissues\u002F79",{"id":170,"question_zh":171,"answer_zh":172,"source_url":149},9173,"使用 vLLM 进行异步生成时，模型显存被释放导致训练卡死，如何解决？","有两种解决方案：(1) 降级 vLLM 并设置环境变量：export VLLM_FLASH_ATTN_VERSION=2；(2) 切换至 SGLang，但需安装兼容的 uvloop：pip install uvloop==0.21.0，以解决 Python 3.12 与 SGLang 的兼容性问题。同时避免使用 torch 2.8 的官方 Docker 镜像，建议使用自定义构建镜像。",{"id":174,"question_zh":175,"answer_zh":176,"source_url":163},9178,"训练中出现 Ray 集群初始化失败或通信异常，如何处理？","该问题通常由多节点通信配置错误或版本不匹配导致。建议：(1) 确保所有节点使用相同的 Docker 镜像；(2) 检查网络是否允许节点间端口通信；(3) 避免混合使用不同版本的 vLLM\u002FSGLang；(4) 参考 Issue #197 中提到的镜像版本（如 vLLM 0.11.0 + SGLang 0.5.2）进行部署。",{"id":178,"question_zh":179,"answer_zh":180,"source_url":158},9179,"如何避免 RLVR 训练中因 KL 散度为 0 导致模型崩溃？","在配置中明确设置 use_kl_loss: false 和 init_kl_coef: 0 可能导致奖励信号不稳定。建议启用 KL 正则化：设置 use_kl_loss: true 并设置 init_kl_coef: 0.05~0.1，同时确保 enable_reference: true，以稳定策略更新过程，防止回复长度坍缩。",[182,187,192],{"id":183,"version":184,"summary_zh":185,"released_at":186},106597,"v0.2.1","Hello everyone! Thank you for your interest in ROLL.  
\r\nROLL has recently received a large set of new features. Below is a summary of the latest updates. We will continue iterating on ROLL—welcome to join the ROLL community.\r\n#366 \r\n\r\n## 🚀 Highlights\r\n\r\n- Rollout has been refactored to be scheduled by a router, with support for **sglang-router**.\r\n- Added training support for **[On-Policy Distillation](docs_roll\u002Fi18n\u002Fzh-Hans\u002Fdocusaurus-plugin-content-docs\u002Fcurrent\u002FUser Guides\u002FPipeline\u002Fon_policy_distill_pipeline_start.md)**.\r\n- Added support for the **Qwen3.5** model family: **[Dense](examples\u002Fqwen3.5-35BA3-rlvr_megatron\u002Frlvr_megatron_80GB.yaml)** \u002F **[MoE](examples\u002Fqwen3.5-35BA3-rlvr_megatron\u002Frlvr_megatron_80GB.yaml)**.\r\n\r\n## 🚀 Major New Features\r\n\r\n- **Rollout**\r\n  - Router scheduling refactor\r\n    - Refactored the **sglang strategy** to support both **engine** and **server** modes.\r\n    - Refactored schedulers (**rlvr DynamicScheduler \u002F agentic RolloutScheduler**) so that scheduling is now uniformly provided by the **Router**.\r\n    - Migrated the original **LoadBalancer** and **RequestScheduler** to **PromptAffinityRouter** and **EnvAffinityRouter**.\r\n    - Added support for **sglang-router**.\r\n- **Pipeline recipes**\r\n  - Added **On-Policy Distillation** training support.\r\n- **Models**\r\n  - Added support for **Qwen3.5 Dense\u002FMoE** series models.\r\n- **Docker**\r\n  - Updated images\u002Fenvironments: **torch 2.10**, **vLLM 0.16.0 nightly**, **vLLM 0.15.1**, **mcore 0.16.0**.\r\n- **Bug fixes**\r\n  - Set `VLLM_USE_FLASHINFER_SAMPLER=0` by default for vLLM on Torch 2.8.0 to mitigate overly repetitive responses.\r\n  - Fixed occasional port conflicts between **sglang** and **vLLM**.\r\n  - Fixed **sglang** multi-node failures when `infer_dp > 1`.\r\n  - Fixed reward worker metrics exposure.\r\n  - Fixed a `get_node_ip` cache issue in model download that could cause deadlock\u002Ftimeouts.\r\n  - Fixed FSDP2 
DCE save when CPU offloading is enabled.\r\n  - Fixed casting during FSDP2 model initialization.","2026-03-09T10:09:14",{"id":188,"version":189,"summary_zh":190,"released_at":191},106598,"v0.2.0","Hello everyone! Thank you for your attention to ROLL.\r\nROLL has recently updated with a large number of new features. Below is a summary of recent updates, and we will continue to iterate and update ROLL. Welcome to join the ROLL community.\r\n\r\n🚀 Highlights:\r\n\r\n+ New model support: Qwen3-VL, Qwen3-MoE-VL, Qwen3-Omni, GLM-4.7\r\n+ Agentic training and Rollout GPU partial overlap, switching idle training GPUs to Rollout\r\n+ DynamicSamplingScheduler coroutine refactoring\r\n+ New: FSDP2 Strategy\r\n+ Training supports Sequence packing and Dynamic batching\r\n\r\n🚀 Major New Features:\r\n\r\n+ Rollout\r\n  - DynamicSamplingScheduler coroutine refactoring\r\n  - Custom rollout pre\u002Fpost process, supporting dynamic sampling params, multi-stage generation, ThinkingBudget control\r\n  - Sglang: Strategy refactoring, supporting server mode, native onload\u002Foffload, inflight FP8 quant rollout, cross-machine multi-node deployment\r\n  - vLLM: DP\u002FEP support, supports vllm==0.12.0\r\n  - Provides AgentNative Rollout paradigm, AgentNativeStepEnvManager + SokobanNativeEnv, fully managed context by env\r\n  - Async Rollout Hang Detect: Added asynchronous Rollout hang detection to quickly locate problematic envs\r\n  - Supports rollout dump & mock, improving forward\u002Ftrain phase precision alignment efficiency\r\n  - Agentic pipeline supports train-val\u002Frollout overlap\r\n+ Training\r\n  - FSDP2\r\n  - Megatron support LoRA, LoRA RL blogs: [https:\u002F\u002Fmacaron.im\u002Fmindlab\u002Fresearch\u002Fbuilding-trillion-parameter-reasoning-rl-with-10-gpus](https:\u002F\u002Fmacaron.im\u002Fmindlab\u002Fresearch\u002Fbuilding-trillion-parameter-reasoning-rl-with-10-gpus)\r\n  - Save model parameters in HF format online during Megatron training\r\n  - Support FP8 
training for Megatron Strategy\r\n  - Sequence packing, fine-tuned loss_func interface definition\r\n  - Dynamic batching\r\n  - Add DeepSpeed SFT support\r\n+ Model Update implementation optimization: Eliminate inter-machine redundancy, weight conversion and nccl broadcast overlap, optimize host to device, adjust multiple pp serial synchronization to lock mode for simultaneous synchronization\r\n+ Asynchronous Feature\r\n  - Training and Rollout GPU partial overlap, switching idle training GPUs to Rollout, report: [https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24873](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24873)\r\n  - Agentic off policy loss with IS correction\r\n+ Pipeline recipe\r\n  - VLM image tool use: DeepEyes, tool invocation and reward calculation overlap\r\n+ Models: New model support for Qwen3-VL, Qwen3-MoE-VL, Qwen3-Omni-Thinker, GLM-4.7","2026-02-04T09:04:58",{"id":193,"version":194,"summary_zh":195,"released_at":196},106599,"v0.1.3","🚀亮点: \r\n\r\n+ (feat): support Qwen3VL, mcore_adapter and examples.\r\n+ (feat): Add optimization for computing ref_logprobs and old_logprobs.\r\n+ (feat): support vllm beam_search.\r\n+ (feat): Add support for Qwen-3-next on AMD GPUs.\r\n+ (feat): support sglang==0.5.4、vllm==0.11.1、torch2.8.0.\r\n\r\n🚀主要新特性：\r\n\r\n+ Agentic\r\n    - (fix): fix agentic val get_batch state in redundancy env.\r\n    - (feat): agentic-spec actor worker.\r\n    - (feat): add infer_log_probs in agentic.\r\n    - (feat): refactor agentic norm like LitePPO.\r\n    - (feat): add agentic profile metrics.\r\n+ 模型与后端\r\n    - (feat): support vllm beam_search.\r\n    - (feat): Add support for Qwen-3-next on AMD GPUs.\r\n    - (feat): support offload nccl to save gpu memory. Thanks for slime.\r\n    - (feat): support sglang 054.\r\n    - (feat): sglang support dp-attention.\r\n    - (feat): add enable_reference option. 
#250 \r\n    - (feat): add enable_old_logprobs, opt old log probs by cache.\r\n    - (feat): support Qwen3VL, mcore_adapter and examples yaml. #190 \r\n    - (feat): add sequence packing for sft pipeline and distill pipeline, optimize memory usage during top-k logits computation.\r\n+ bug fix, refactor\r\n    - (fix): update math rule reward worker with thinking. #281 \r\n    - (feat): set RAY_CGRAPH_get_timeout=600.\r\n    - (fix): fix train infer ratio\u002Fdiff mean & add train infer ratio\u002Fdiff token\u002Fseq mask & add rollout importance sampling. #242 #273 \r\n    - (fix): ensure compatibility with transformers version check for causal mask update.\r\n    - (fix): fix vllm 0110 import for torch280.\r\n    - (fix): fix tokenizer mismatch between policy and reward model in llm judge reward worker. #91 \r\n    - (fix): fix bugs in data fetching for face embeddings for wan_module.\r\n    - (fix): vllm _generate_standard missing prompt_token_ids input args in vllm >0.11.0. #189 \r\n    - (fix): vllm add missing argument is_lora in function update_parameter. #233 \r\n    - (fix): fix bugs with metrics recording in the DPO pipeline.\r\n    - (fix): update image loading logic for byte data in rlvr_vlm_pipeline.py\r\n    - (fix): add alive check. #253 ","2025-12-08T08:37:57"]
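The FAQ entries above explain that ROLL's actual device count is inferred from `device_mapping` together with `num_gpus_per_node`, rather than from a manually set `num_nodes` (e.g. `device_mapping=[0,1,2,3]` with 4 GPUs per node resolves to 1 node). A minimal Python sketch of that inference rule, assuming this illustrative helper (`infer_num_nodes` is hypothetical, not ROLL's real API):

```python
# Illustrative model of the device_mapping rule described in the FAQ,
# NOT ROLL's actual implementation.

def infer_num_nodes(device_mapping, num_gpus_per_node):
    """Infer node count from the highest global GPU index in device_mapping."""
    return max(device_mapping) // num_gpus_per_node + 1

# Per-role mappings from the FAQ's 8-GPU example; roles may share GPUs.
actor_train = list(range(0, 4))   # actor_train.device_mapping
actor_infer = list(range(0, 2))   # actor_infer.device_mapping
llm_judge   = list(range(2, 4))   # llm_judge.device_mapping

# total_device is the union of all role mappings, derived automatically.
total_device = sorted(set(actor_train) | set(actor_infer) | set(llm_judge))
print(infer_num_nodes(total_device, num_gpus_per_node=4))  # -> 1
```

Under this rule, setting `num_nodes` by hand has no effect: widening `device_mapping` to `list(range(8))` on 4-GPU nodes is what makes the run span 2 nodes.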